[ovs-dev] [PATCH v12 1/1] Avoid dp_hash recirculation for balance-tcp bond selection mode

Eelco Chaudron echaudro at redhat.com
Tue Mar 31 11:00:52 UTC 2020


Hi Vishal,

In addition to the memory leak found by Matteo I have some more comments 
below.

Cheers,

Eelco

On 11 Mar 2020, at> On 11 Mar 2020, at 15:15, Vishal Deep Ajmera wrote:
>
> Problem:
> --------
> In OVS-DPDK, flows with output over a bond interface of type 
> “balance-tcp”
> (using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer 
> into
> "HASH" and "RECIRC" datapath actions. After recirculation, the packet 
> is
> forwarded to the bond member port based on 8-bits of the datapath hash
> value computed through dp_hash. This causes performance degradation in 
> the
> following ways:
>
> 1. The recirculation of the packet implies another lookup of the 
> packet’s
> flow key in the exact match cache (EMC) and potentially Megaflow 
> classifier
> (DPCLS). This is the biggest cost factor.
>
> 2. The recirculated packets have a new “RSS” hash and compete with 
> the
> original packets for the scarce number of EMC slots. This implies more
> EMC misses and potentially EMC thrashing causing costly DPCLS lookups.
>
> 3. The 256 extra megaflow entries per bond for dp_hash bond selection 
> put
> additional load on the revalidation threads.
>
> Owing to this performance degradation, deployments stick to 
> “balance-slb”
> bond mode even though it does not do active-active load balancing for
> VXLAN- and GRE-tunnelled traffic because all tunnel packet have the 
> same
> source MAC address.
>
> Proposed optimization:
> ----------------------
> This proposal introduces a new load-balancing output action instead of
> recirculation.
>
> Maintain one table per-bond (could just be an array of uint16's) and
> program it the same way internal flows are created today for each 
> possible
> hash value(256 entries) from ofproto layer. Use this table to 
> load-balance
> flows as part of output action processing.
>
> Currently xlate_normal() -> output_normal() -> 
> bond_update_post_recirc_rules()
> -> bond_may_recirc() and compose_output_action__() generate
> “dp_hash(hash_l4(0))” and “recirc(<RecircID>)” actions. In 
> this case the
> RecircID identifies the bond. For the recirculated packets the ofproto 
> layer
> installs megaflow entries that match on RecircID and masked dp_hash 
> and send
> them to the corresponding output port.
>
> Instead, we will now generate action as
>     "lb_output(bond,<bond id>)"
>
> This combines hash computation (only if needed, else re-use RSS hash) 
> and
> inline load-balancing over the bond. This action is used *only* for 
> balance-tcp
> bonds in OVS-DPDK datapath (the OVS kernel datapath remains 
> unchanged).
>
> Example:
> --------
> Current scheme:
> ---------------
> With 1 IP-UDP flow:
>
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), 
> packets:2828969, bytes:181054016, used:0.000s, 
> actions:hash(hash_l4(0)),recirc(0x1)
>
> recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:2828937, bytes:181051968, used:0.000s, actions:2
>
> With 8 IP-UDP flows (with random UDP src port): (all hitting same 
> DPCL):
>
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), 
> packets:2674009, bytes:171136576, used:0.000s, 
> actions:hash(hash_l4(0)),recirc(0x1)
>
> recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:377395, bytes:24153280, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:333486, bytes:21343104, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:348461, bytes:22301504, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:633353, bytes:40534592, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:319901, bytes:20473664, used:0.001s, actions:2
> recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:334985, bytes:21439040, used:0.001s, actions:1
> recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), 
> packets:326404, bytes:20889856, used:0.001s, actions:1
>
> New scheme:
> -----------
> We can do with a single flow entry (for any number of new flows):
>
> in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), 
> packets:2674009, bytes:171136576, used:0.000s, 
> actions:lb_output(bond,1)
>
> A new CLI has been added to dump datapath bond cache as given below.
>
> “sudo ovs-appctl dpif-netdev/dp-bond-show [dp]”

EC> You changed the command to bond-show

> root at ubuntu-190:performance_scripts # sudo ovs-appctl 
> dpif-netdev/dp-bond-show
> Bond cache:
>         bond-id 1 :
>                 bucket 0 - slave 2
>                 bucket 1 - slave 1
>                 bucket 2 - slave 2
>                 bucket 3 - slave 1
>
> Performance improvement:
> ------------------------
> With a prototype of the proposed idea, the following perf improvement 
> is seen
> with Phy-VM-Phy UDP traffic, single flow. With multiple flows, the 
> improvement
> is even more enhanced (due to reduced number of flows).
>
> 1 VM:
> *****
> +--------------------------------------+
> |                 mpps                 |
> +--------------------------------------+
> | Flows  master  with-opt.   %delta    |
> +--------------------------------------+
> | 1      4.34    5.59        28.86
> | 10     4.09    5.43        32.61
> | 400    3.44    5.35        55.25
> | 1k     3.31    5.25        58.41
> | 10k    2.46    4.43        79.78
> | 100k   2.32    4.27        83.59
> | 500k   2.29    4.23        84.57
> +--------------------------------------+
> mpps: million packets per second.
> packet size 64 bytes.
>
> Signed-off-by: Manohar Krishnappa Chidambaraswamy <manukc at gmail.com>
> Co-authored-by: Manohar Krishnappa Chidambaraswamy <manukc at gmail.com>
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera at ericsson.com>
>
> CC: Jan Scheurich <jan.scheurich at ericsson.com>
> CC: Venkatesan Pradeep <venkatesan.pradeep at ericsson.com>
> CC: Ben Pfaff <blp at ovn.org>
> CC: Ilya Maximets <i.maximets at ovn.org>
> CC: Ian Stokes <ian.stokes at intel.com>
> CC: David Marchand <david.marchand at redhat.com>
> CC: Matteo Croce <mcroce at redhat.com>
> CC: Eelco Chaudron <echaudro at redhat.com>
>
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera at ericsson.com>
> ---
>  datapath/linux/compat/include/linux/openvswitch.h |   1 +
>  lib/dpif-netdev.c                                 | 399 
> +++++++++++++++++++---
>  lib/dpif-netlink.c                                |   3 +
>  lib/dpif-provider.h                               |  11 +
>  lib/dpif.c                                        |  49 +++
>  lib/dpif.h                                        |  13 +
>  lib/odp-execute.c                                 |   2 +
>  lib/odp-util.c                                    |   4 +
>  ofproto/bond.c                                    |  68 +++-
>  ofproto/bond.h                                    |   5 +
>  ofproto/ofproto-dpif-ipfix.c                      |   1 +
>  ofproto/ofproto-dpif-sflow.c                      |   3 +-
>  ofproto/ofproto-dpif-xlate.c                      |  32 +-
>  ofproto/ofproto-dpif.c                            |  30 ++
>  ofproto/ofproto-dpif.h                            |   8 +-
>  ofproto/ofproto-provider.h                        |   3 +
>  ofproto/ofproto.c                                 |  10 +
>  ofproto/ofproto.h                                 |   1 +
>  tests/lacp.at                                     |   9 +
>  vswitchd/bridge.c                                 |  13 +
>  vswitchd/vswitch.xml                              |  23 ++
>  21 files changed, 632 insertions(+), 56 deletions(-)
>
> diff --git a/datapath/linux/compat/include/linux/openvswitch.h 
> b/datapath/linux/compat/include/linux/openvswitch.h
> index 2f0c655..8073d39 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -1021,6 +1021,7 @@ enum ovs_action_attr {
>  	OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
>  	OVS_ACTION_ATTR_TUNNEL_POP,    /* u32 port number. */
>  	OVS_ACTION_ATTR_DROP,          /* u32 xlate_error. */
> +	OVS_ACTION_ATTR_LB_OUTPUT,     /* u32 bond-id. */
>  #endif
>  	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
>  				       * from userspace. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index d393aab..68f3a95 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -309,6 +309,7 @@ struct pmd_auto_lb {
>   *
>   *    dp_netdev_mutex (global)
>   *    port_mutex
> + *    bond_mutex
>   *    non_pmd_mutex
>   */
>  struct dp_netdev {
> @@ -376,6 +377,12 @@ struct dp_netdev {
>
>      struct conntrack *conntrack;
>      struct pmd_auto_lb pmd_alb;
> +
> +    /* Bonds.
> +     *
> +     */
> +    struct ovs_mutex bond_mutex; /* Protects updates of 'tx_bonds'. 
> */
> +    struct cmap tx_bonds; /* Contains 'struct tx_bond'. */
>  };
>
>  static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> @@ -607,6 +614,20 @@ struct tx_port {
>      struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
>  };
>
> +/* Contained by struct tx_bond 'slave_buckets'. */
> +struct slave_entry {
> +    odp_port_t slave_id;
> +    atomic_ullong n_packets;
> +    atomic_ullong n_bytes;
> +};
> +
> +/* Contained by struct dp_netdev_pmd_thread's 'tx_bonds'. */
> +struct tx_bond {
> +    struct cmap_node node;
> +    uint32_t bond_id;
> +    struct slave_entry slave_buckets[BOND_BUCKETS];
> +};
> +
>  /* A set of properties for the current processing loop that is not 
> directly
>   * associated with the pmd thread itself, but with the packets being
>   * processed or the short-term system configuration (for example, 
> time).
> @@ -739,6 +760,11 @@ struct dp_netdev_pmd_thread {
>       * read by the pmd thread. */
>      struct hmap tx_ports OVS_GUARDED;
>
> +    struct ovs_mutex bond_mutex;    /* Protects updates of 
> 'tx_bonds'. */
> +    /* Map of 'tx_bond's used for transmission.  Written by the main 
> thread
> +     * and read by the pmd thread. */
> +    struct cmap tx_bonds;
> +
>      /* These are thread-local copies of 'tx_ports'.  One contains 
> only tunnel
>       * ports (that support push_tunnel/pop_tunnel), the other 
> contains ports
>       * with at least one txq (that support send).  A port can be in 
> both.
> @@ -830,6 +856,10 @@ static void dp_netdev_del_rxq_from_pmd(struct 
> dp_netdev_pmd_thread *pmd,
>  static int
>  dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
>                                     bool force);
> +static void dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread 
> *pmd,
> +                                         struct tx_bond *bond);
> +static void dp_netdev_del_bond_tx_from_pmd(struct 
> dp_netdev_pmd_thread *pmd,
> +                                           uint32_t bond_id);
>
>  static void reconfigure_datapath(struct dp_netdev *dp)
>      OVS_REQUIRES(dp->port_mutex);
> @@ -1396,6 +1426,49 @@ pmd_perf_show_cmd(struct unixctl_conn *conn, 
> int argc,
>      par.command_type = PMD_INFO_PERF_SHOW;
>      dpif_netdev_pmd_info(conn, argc, argv, &par);
>  }
> +
> +static void
> +dpif_netdev_bond_show(struct unixctl_conn *conn, int argc,
> +                      const char *argv[], void *aux OVS_UNUSED)
> +{
> +    struct ds reply = DS_EMPTY_INITIALIZER;
> +
> +    ovs_mutex_lock(&dp_netdev_mutex);
> +
> +    struct dp_netdev *dp = NULL;

EC> Blank line after definition, and maybe move it before mutex lock

> +    if (argc == 2) {
> +        dp = shash_find_data(&dp_netdevs, argv[1]);
> +    } else if (shash_count(&dp_netdevs) == 1) {
> +        /* There's only one datapath. */
> +        dp = shash_first(&dp_netdevs)->data;
> +    }
> +    if (!dp) {
> +        ovs_mutex_unlock(&dp_netdev_mutex);
> +        unixctl_command_reply_error(conn,
> +                                    "please specify an existing 
> datapath");
> +        return;
> +    }
> +
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    if (cmap_count(&dp->tx_bonds) > 0) {
> +        struct tx_bond *dp_bond_entry;

EC> Blank line after definition

> +        ds_put_cstr(&reply, "\nBonds:\n");
> +        CMAP_FOR_EACH (dp_bond_entry, node, &dp->tx_bonds) {
> +            ds_put_format(&reply, "\tbond-id %"PRIu32":\n",
> +                          dp_bond_entry->bond_id);
> +            for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +                ds_put_format(&reply, "\t\tbucket %d - slave 
> %"PRIu32"\n",
> +                  bucket,
> +                  
> odp_to_u32(dp_bond_entry->slave_buckets[bucket].slave_id));

EC> Fix indentation, please check other places also

> +            }
> +        }
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +    ovs_mutex_unlock(&dp_netdev_mutex);
> +    unixctl_command_reply(conn, ds_cstr(&reply));
> +    ds_destroy(&reply);
> +}
> +
>  

>  static int
>  dpif_netdev_init(void)
> @@ -1427,6 +1500,9 @@ dpif_netdev_init(void)
>                               "[-us usec] [-q qlen]",
>                               0, 10, pmd_perf_log_set_cmd,
>                               NULL);
> +    unixctl_command_register("dpif-netdev/bond-show", "[dp]",
> +                             0, 1, dpif_netdev_bond_show,
> +                             NULL);
>      return 0;
>  }
>
> @@ -1551,6 +1627,9 @@ create_dp_netdev(const char *name, const struct 
> dpif_class *class,
>      ovs_mutex_init_recursive(&dp->port_mutex);
>      hmap_init(&dp->ports);
>      dp->port_seq = seq_create();
> +    ovs_mutex_init(&dp->bond_mutex);
> +    cmap_init(&dp->tx_bonds);
> +
>      fat_rwlock_init(&dp->upcall_rwlock);
>
>      dp->reconfigure_seq = seq_create();
> @@ -1657,6 +1736,12 @@ dp_delete_meter(struct dp_netdev *dp, uint32_t 
> meter_id)
>      }
>  }
>
> +static uint32_t
> +hash_bond_id(uint32_t bond_id)
> +{
> +    return hash_int(bond_id, 0);
> +}
> +
>  /* Requires dp_netdev_mutex so that we can't get a new reference to 
> 'dp'
>   * through the 'dp_netdevs' shash while freeing 'dp'. */
>  static void
> @@ -1664,6 +1749,7 @@ dp_netdev_free(struct dp_netdev *dp)
>      OVS_REQUIRES(dp_netdev_mutex)
>  {
>      struct dp_netdev_port *port, *next;
> +    struct tx_bond *bond;
>
>      shash_find_and_delete(&dp_netdevs, dp->name);
>
> @@ -1673,6 +1759,13 @@ dp_netdev_free(struct dp_netdev *dp)
>      }
>      ovs_mutex_unlock(&dp->port_mutex);
>
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
> +        cmap_remove(&dp->tx_bonds, &bond->node, 
> hash_bond_id(bond->bond_id));
> +        ovsrcu_postpone(free, bond);
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
>      dp_netdev_destroy_all_pmds(dp, true);
>      cmap_destroy(&dp->poll_threads);
>
> @@ -1691,6 +1784,9 @@ dp_netdev_free(struct dp_netdev *dp)
>      hmap_destroy(&dp->ports);
>      ovs_mutex_destroy(&dp->port_mutex);
>
> +    cmap_destroy(&dp->tx_bonds);
> +    ovs_mutex_destroy(&dp->bond_mutex);
> +
>      /* Upcalls must be disabled at this point */
>      dp_netdev_destroy_upcall_lock(dp);
>
> @@ -4421,6 +4517,20 @@ tx_port_lookup(const struct hmap *hmap, 
> odp_port_t port_no)
>      return NULL;
>  }
>
> +static struct tx_bond *
> +tx_bond_lookup(const struct cmap *tx_bonds, uint32_t bond_id)
> +{
> +    struct tx_bond *tx;
> +    uint32_t hash = hash_bond_id(bond_id);
> +
> +    CMAP_FOR_EACH_WITH_HASH (tx, node, hash, tx_bonds) {
> +        if (tx->bond_id == bond_id) {
> +            return tx;
> +        }
> +    }
> +    return NULL;
> +}
> +
>  static int
>  port_reconfigure(struct dp_netdev_port *port)
>  {
> @@ -4919,6 +5029,7 @@ reconfigure_datapath(struct dp_netdev *dp)
>      struct hmapx busy_threads = HMAPX_INITIALIZER(&busy_threads);
>      struct dp_netdev_pmd_thread *pmd;
>      struct dp_netdev_port *port;
> +    struct tx_bond *bond;

EC> I would move this below inside the if statement

>      int wanted_txqs;
>
>      dp->last_reconfigure_seq = seq_read(dp->reconfigure_seq);
> @@ -5060,17 +5171,24 @@ reconfigure_datapath(struct dp_netdev *dp)
>          }
>      }
>
> -    /* Add every port to the tx cache of every pmd thread, if it's 
> not
> -     * there already and if this pmd has at least one rxq to poll. */
> +    /* Add every port to the tx cache and bond to the cmap of every 
> pmd thread,
> +     * if it's not there already and if this pmd has at least one rxq 
> to poll.
> +     */
> +    ovs_mutex_lock(&dp->bond_mutex);
>      CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
>          ovs_mutex_lock(&pmd->port_mutex);
>          if (hmap_count(&pmd->poll_list) || pmd->core_id == 
> NON_PMD_CORE_ID) {
>              HMAP_FOR_EACH (port, node, &dp->ports) {
>                  dp_netdev_add_port_tx_to_pmd(pmd, port);
>              }
> +
> +            CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
> +                dp_netdev_add_bond_tx_to_pmd(pmd, bond);
> +            }

EC> We only add bond here, but do we not also need to remove them if no 
longer needed?

>          }
>          ovs_mutex_unlock(&pmd->port_mutex);
>      }
> +    ovs_mutex_unlock(&dp->bond_mutex);
>
>      /* Reload affected pmd threads. */
>      reload_affected_pmds(dp);
> @@ -6110,6 +6228,7 @@ dp_netdev_configure_pmd(struct 
> dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      atomic_init(&pmd->reload, false);
>      ovs_mutex_init(&pmd->flow_mutex);
>      ovs_mutex_init(&pmd->port_mutex);
> +    ovs_mutex_init(&pmd->bond_mutex);
>      cmap_init(&pmd->flow_table);
>      cmap_init(&pmd->classifiers);
>      pmd->ctx.last_rxq = NULL;
> @@ -6120,6 +6239,7 @@ dp_netdev_configure_pmd(struct 
> dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      hmap_init(&pmd->tx_ports);
>      hmap_init(&pmd->tnl_port_cache);
>      hmap_init(&pmd->send_port_cache);
> +    cmap_init(&pmd->tx_bonds);
>      /* init the 'flow_cache' since there is no
>       * actual thread created for NON_PMD_CORE_ID. */
>      if (core_id == NON_PMD_CORE_ID) {
> @@ -6135,11 +6255,17 @@ static void
>  dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
>  {
>      struct dpcls *cls;
> +    struct tx_bond *tx;
>
>      dp_netdev_pmd_flow_flush(pmd);
>      hmap_destroy(&pmd->send_port_cache);
>      hmap_destroy(&pmd->tnl_port_cache);
>      hmap_destroy(&pmd->tx_ports);
> +    CMAP_FOR_EACH (tx, node, &pmd->tx_bonds) {
> +        cmap_remove(&pmd->tx_bonds, &tx->node, 
> hash_bond_id(tx->bond_id));
> +        ovsrcu_postpone(free, tx);
> +    }
> +    cmap_destroy(&pmd->tx_bonds);
>      hmap_destroy(&pmd->poll_list);
>      /* All flows (including their dpcls_rules) have been deleted 
> already */
>      CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
> @@ -6151,6 +6277,7 @@ dp_netdev_destroy_pmd(struct 
> dp_netdev_pmd_thread *pmd)
>      ovs_mutex_destroy(&pmd->flow_mutex);
>      seq_destroy(pmd->reload_seq);
>      ovs_mutex_destroy(&pmd->port_mutex);
> +    ovs_mutex_destroy(&pmd->bond_mutex);
>      free(pmd);
>  }
>
> @@ -6304,6 +6431,53 @@ dp_netdev_del_port_tx_from_pmd(struct 
> dp_netdev_pmd_thread *pmd,
>      free(tx);
>      pmd->need_reload = true;
>  }
> +
> +/* Add bond to the tx bond cmap of 'pmd'. */
> +static void
> +dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
> +                             struct tx_bond *bond)
> +{
> +    struct tx_bond *tx;
> +
> +    ovs_mutex_lock(&pmd->bond_mutex);
> +    tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id);
> +    if (tx) {
> +        struct tx_bond *new_tx = xmemdup(bond, sizeof *bond);
> +
> +        /* Copy the stats for each bucket. */
> +        for (int i = 0; i < BOND_BUCKETS; i++) {
> +            uint64_t n_packets, n_bytes;
> +
> +            atomic_read_relaxed(&tx->slave_buckets[i].n_packets, 
> &n_packets);
> +            atomic_read_relaxed(&tx->slave_buckets[i].n_bytes, 
> &n_bytes);
> +            atomic_init(&new_tx->slave_buckets[i].n_packets, 
> n_packets);
> +            atomic_init(&new_tx->slave_buckets[i].n_bytes, n_bytes);
> +        }
> +        cmap_replace(&pmd->tx_bonds, &tx->node, &new_tx->node,
> +                     hash_bond_id(bond->bond_id));
> +        ovsrcu_postpone(free, tx);
> +    } else {
> +        tx = xmemdup(bond, sizeof *bond);
> +        cmap_insert(&pmd->tx_bonds, &tx->node, 
> hash_bond_id(bond->bond_id));
> +    }
> +    ovs_mutex_unlock(&pmd->bond_mutex);
> +}
> +
> +/* Delete bond from the tx bond cmap of 'pmd'. */
> +static void
> +dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
> +                               uint32_t bond_id)
> +{
> +    struct tx_bond *tx;
> +
> +    ovs_mutex_lock(&pmd->bond_mutex);
> +    tx = tx_bond_lookup(&pmd->tx_bonds, bond_id);
> +    if (tx) {
> +        cmap_remove(&pmd->tx_bonds, &tx->node, 
> hash_bond_id(tx->bond_id));
> +        ovsrcu_postpone(free, tx);
> +    }
> +    ovs_mutex_unlock(&pmd->bond_mutex);
> +}
>  

>  static char *
>  dpif_netdev_get_datapath_version(void)
> @@ -7129,6 +7303,88 @@ dp_execute_userspace_action(struct 
> dp_netdev_pmd_thread *pmd,
>      }
>  }
>
> +static bool
> +dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
> +                         struct dp_packet_batch *packets_,
> +                         bool should_steal, odp_port_t port_no)
> +{
> +    struct tx_port *p = pmd_send_port_cache_lookup(pmd, port_no);
> +    if (!OVS_LIKELY(p)) {
> +        return false;
> +    }
> +
> +    struct dp_packet_batch out;
> +    if (!should_steal) {
> +        dp_packet_batch_clone(&out, packets_);
> +        dp_packet_batch_reset_cutlen(packets_);
> +        packets_ = &out;
> +    }
> +    dp_packet_batch_apply_cutlen(packets_);
> +#ifdef DPDK_NETDEV
> +    if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> +                     && packets_->packets[0]->source
> +                     != p->output_pkts.packets[0]->source)) {
> +        /* netdev-dpdk assumes that all packets in a single
> +         * output batch has the same source. Flush here to
> +         * avoid memory access issues. */
> +        dp_netdev_pmd_flush_output_on_port(pmd, p);
> +    }
> +#endif
> +    if (dp_packet_batch_size(&p->output_pkts)
> +        + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
> +        /* Flush here to avoid overflow. */
> +        dp_netdev_pmd_flush_output_on_port(pmd, p);
> +    }
> +    if (dp_packet_batch_is_empty(&p->output_pkts)) {
> +        pmd->n_output_batches++;
> +    }
> +
> +    struct dp_packet *packet;
> +    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> +        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> +            pmd->ctx.last_rxq;
> +        dp_packet_batch_add(&p->output_pkts, packet);
> +    }
> +    return true;
> +}
> +
> +static bool
> +dp_execute_lb_output_action(struct dp_netdev_pmd_thread *pmd,
> +                            struct dp_packet_batch *packets_,
> +                            bool should_steal, uint32_t bond)
> +{
> +    struct dp_packet *packet;
> +    uint32_t i;
> +    const uint32_t cnt = dp_packet_batch_size(packets_);
> +    struct tx_bond *p_bond = tx_bond_lookup(&pmd->tx_bonds, bond);
> +
> +    if (!p_bond) {
> +        return false;
> +    }
> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
> +        /*
> +         * Lookup the bond-hash table using hash to get the slave.
> +         */
> +        uint32_t hash = dp_packet_get_rss_hash(packet);
> +        struct slave_entry *s_entry = &p_bond->slave_buckets[hash & 
> BOND_MASK];
> +        odp_port_t bond_member = s_entry->slave_id;
> +        uint32_t size = dp_packet_size(packet);
> +
> +        struct dp_packet_batch output_pkt;
> +        dp_packet_batch_init_packet(&output_pkt, packet);
> +        if (OVS_LIKELY(dp_execute_output_action(pmd, &output_pkt, 
> should_steal,
> +                                                bond_member))) {
> +            /* Update slave stats. */
> +            non_atomic_ullong_add(&s_entry->n_packets, 1);
> +            non_atomic_ullong_add(&s_entry->n_bytes, size);
> +        } else {
> +            dp_packet_batch_refill(packets_, packet, i);
> +        }
> +    }
> +
> +    return dp_packet_batch_is_empty(packets_);
> +}
> +
>  static void
>  dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>                const struct nlattr *a, bool should_steal)
> @@ -7144,43 +7400,18 @@ dp_execute_cb(void *aux_, struct 
> dp_packet_batch *packets_,
>
>      switch ((enum ovs_action_attr)type) {
>      case OVS_ACTION_ATTR_OUTPUT:
> -        p = pmd_send_port_cache_lookup(pmd, nl_attr_get_odp_port(a));
> -        if (OVS_LIKELY(p)) {
> -            struct dp_packet *packet;
> -            struct dp_packet_batch out;
> -
> -            if (!should_steal) {
> -                dp_packet_batch_clone(&out, packets_);
> -                dp_packet_batch_reset_cutlen(packets_);
> -                packets_ = &out;
> -            }
> -            dp_packet_batch_apply_cutlen(packets_);
> -
> -#ifdef DPDK_NETDEV
> -            if 
> (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
> -                             && packets_->packets[0]->source
> -                                != 
> p->output_pkts.packets[0]->source)) {
> -                /* XXX: netdev-dpdk assumes that all packets in a 
> single
> -                 *      output batch has the same source. Flush here 
> to
> -                 *      avoid memory access issues. */
> -                dp_netdev_pmd_flush_output_on_port(pmd, p);
> -            }
> -#endif
> -            if (dp_packet_batch_size(&p->output_pkts)
> -                + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) 
> {
> -                /* Flush here to avoid overflow. */
> -                dp_netdev_pmd_flush_output_on_port(pmd, p);
> -            }
> -
> -            if (dp_packet_batch_is_empty(&p->output_pkts)) {
> -                pmd->n_output_batches++;
> -            }
> +        if (dp_execute_output_action(pmd, packets_, should_steal,
> +                                     nl_attr_get_odp_port(a))) {
> +            return;
> +        } else {
> +            COVERAGE_ADD(datapath_drop_invalid_port,
> +                         dp_packet_batch_size(packets_));
> +        }
> +        break;
>
> -            DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> -                
> p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> -                                                             
> pmd->ctx.last_rxq;
> -                dp_packet_batch_add(&p->output_pkts, packet);
> -            }
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
> +        if (dp_execute_lb_output_action(pmd, packets_, should_steal,
> +                                        nl_attr_get_u32(a))) {
>              return;
>          } else {
>              COVERAGE_ADD(datapath_drop_invalid_port,
> @@ -7738,6 +7969,95 @@ dpif_netdev_ipf_dump_done(struct dpif *dpif 
> OVS_UNUSED, void *ipf_dump_ctx)
>
>  }
>
> +static int
> +dpif_netdev_bond_add(struct dpif *dpif, uint32_t bond_id,
> +                     odp_port_t *slave_map)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +    struct dp_netdev_pmd_thread *pmd;
> +
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    /* Prepare new bond mapping. */
> +    struct tx_bond *new_tx = xzalloc(sizeof *new_tx);
> +
> +    new_tx->bond_id = bond_id;
> +    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +        new_tx->slave_buckets[bucket].slave_id = slave_map[bucket];
> +    }

EC> I would move the lock here...

> +    /* Check if bond already existed. */
> +    struct tx_bond *old_tx = tx_bond_lookup(&dp->tx_bonds, bond_id);
> +    if (old_tx) {
> +        cmap_replace(&dp->tx_bonds, &old_tx->node, &new_tx->node,
> +                     hash_bond_id(bond_id));
> +        ovsrcu_postpone(free, old_tx);
> +    } else {
> +        cmap_insert(&dp->tx_bonds, &new_tx->node,
> +                    hash_bond_id(bond_id));
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
> +    /* Update all PMDs with new bond mapping. */
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        dp_netdev_add_bond_tx_to_pmd(pmd, new_tx);
> +    }
> +    return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +
> +    ovs_mutex_lock(&dp->bond_mutex);
> +    /* Check if bond existed. */
> +    struct tx_bond *tx = tx_bond_lookup(&dp->tx_bonds, bond_id);
> +    if (tx) {
> +        cmap_remove(&dp->tx_bonds, &tx->node, hash_bond_id(bond_id));
> +        ovsrcu_postpone(free, tx);
> +    } else {
> +        /* Bond is not present. */
> +        ovs_mutex_unlock(&dp->bond_mutex);
> +        return 0;
> +    }
> +    ovs_mutex_unlock(&dp->bond_mutex);
> +
> +    /* Remove the bond map in all pmds. */
> +    struct dp_netdev_pmd_thread *pmd;

EC> Newline?

> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        dp_netdev_del_bond_tx_from_pmd(pmd, bond_id);
> +    }
> +    return 0;
> +}
> +
> +static int
> +dpif_netdev_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                           uint64_t *n_bytes)
> +{
> +    struct dp_netdev *dp = get_dp_netdev(dpif);
> +    struct dp_netdev_pmd_thread *pmd;
> +
> +    /* Search the bond in all PMDs. */
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        struct tx_bond *pmd_bond_entry
> +            = tx_bond_lookup(&pmd->tx_bonds, bond_id);

EC> Do we need the mutex here?

> +        if (!pmd_bond_entry) {
> +            continue;
> +        }
> +
> +        /* Read bond stats. */
> +        for (int i = 0; i < BOND_BUCKETS; i++) {
> +            uint64_t pmd_n_bytes;
> +            atomic_read_relaxed(
> +                 &pmd_bond_entry->slave_buckets[i].n_bytes,
> +                 &pmd_n_bytes);
> +            n_bytes[i] += pmd_n_bytes;
> +        }
> +    }
> +    return 0;
> +}
> +
>  const struct dpif_class dpif_netdev_class = {
>      "netdev",
>      true,                       /* cleanup_required */
> @@ -7811,6 +8131,9 @@ const struct dpif_class dpif_netdev_class = {
>      dpif_netdev_meter_set,
>      dpif_netdev_meter_get,
>      dpif_netdev_meter_del,
> +    dpif_netdev_bond_add,
> +    dpif_netdev_bond_del,
> +    dpif_netdev_bond_stats_get,
>  };
>
>  static void
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index 5b5c96d..116f8fc 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -3993,6 +3993,9 @@ const struct dpif_class dpif_netlink_class = {
>      dpif_netlink_meter_set,
>      dpif_netlink_meter_get,
>      dpif_netlink_meter_del,
> +    NULL,                       /* bond_add */
> +    NULL,                       /* bond_del */
> +    NULL,                       /* bond_stats_get */
>  };
>
>  static int
> diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
> index b77317b..781c7ca 100644
> --- a/lib/dpif-provider.h
> +++ b/lib/dpif-provider.h
> @@ -616,6 +616,17 @@ struct dpif_class {
>       * zero. */
>      int (*meter_del)(struct dpif *, ofproto_meter_id meter_id,
>                       struct ofputil_meter_stats *, uint16_t n_bands);
> +
> +    /* Adds a bond with 'bond_id' and the slave-map to 'dpif'. */
> +    int (*bond_add)(struct dpif *dpif, uint32_t bond_id,
> +                    odp_port_t *slave_map);
> +
> +    /* Removes bond identified by 'bond_id' from 'dpif'. */
> +    int (*bond_del)(struct dpif *dpif, uint32_t bond_id);
> +
> +    /* Reads bond stats from 'dpif'. */
> +    int (*bond_stats_get)(struct dpif *dpif, uint32_t bond_id,
> +                          uint64_t *n_bytes);
>  };
>
>  extern const struct dpif_class dpif_netlink_class;
> diff --git a/lib/dpif.c b/lib/dpif.c
> index 9d9c716..c8dc8ed 100644
> --- a/lib/dpif.c
> +++ b/lib/dpif.c
> @@ -1170,6 +1170,7 @@ dpif_execute_helper_cb(void *aux_, struct 
> dp_packet_batch *packets_,
>
>      case OVS_ACTION_ATTR_CT:
>      case OVS_ACTION_ATTR_OUTPUT:
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>      case OVS_ACTION_ATTR_TUNNEL_POP:
>      case OVS_ACTION_ATTR_USERSPACE:
> @@ -1220,6 +1221,7 @@ dpif_execute_helper_cb(void *aux_, struct 
> dp_packet_batch *packets_,
>          struct dp_packet *clone = NULL;
>          uint32_t cutlen = dp_packet_get_cutlen(packet);
>          if (cutlen && (type == OVS_ACTION_ATTR_OUTPUT
> +                        || type == OVS_ACTION_ATTR_LB_OUTPUT
>                          || type == OVS_ACTION_ATTR_TUNNEL_PUSH
>                          || type == OVS_ACTION_ATTR_TUNNEL_POP
>                          || type == OVS_ACTION_ATTR_USERSPACE)) {
> @@ -1879,6 +1881,16 @@ dpif_supports_explicit_drop_action(const struct 
> dpif *dpif)
>      return dpif_is_netdev(dpif);
>  }
>
> +bool
> +dpif_supports_lb_output_action(const struct dpif *dpif)
> +{
> +    /*
> +     * Balance-tcp optimization is currently supported in netdev
> +     * datapath only.
> +     */
> +    return dpif_is_netdev(dpif);
> +}
> +
>  /* Meters */
>  void
>  dpif_meter_get_features(const struct dpif *dpif,
> @@ -1976,3 +1988,40 @@ dpif_meter_del(struct dpif *dpif, 
> ofproto_meter_id meter_id,
>      }
>      return error;
>  }
> +
> +int
> +dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t 
> *slave_map)
> +{
> +    int error = -ENOTSUP;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
> +        error = dpif->dpif_class->bond_add(dpif, bond_id, slave_map);
> +    }
> +
> +    return error;
> +}
> +
> +int
> +dpif_bond_del(struct dpif *dpif, uint32_t bond_id)
> +{
> +    int error = -ENOTSUP;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
> +        error = dpif->dpif_class->bond_del(dpif, bond_id);
> +    }
> +
> +    return error;
> +}
> +
> +int
> +dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                    uint64_t *n_bytes)
> +{
> +    int error = -ENOTSUP;
> +
> +    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_stats_get) 
> {
> +        error = dpif->dpif_class->bond_stats_get(dpif, bond_id, 
> n_bytes);
> +    }
> +
> +    return error;
> +}
> diff --git a/lib/dpif.h b/lib/dpif.h
> index 4df8f7c..2f63b2c 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -891,6 +891,19 @@ int dpif_meter_get(const struct dpif *, 
> ofproto_meter_id meter_id,
>                     struct ofputil_meter_stats *, uint16_t n_bands);
>  int dpif_meter_del(struct dpif *, ofproto_meter_id meter_id,
>                     struct ofputil_meter_stats *, uint16_t n_bands);
> +
> +/* Bonding. */
> +
> +/* Bit-mask for hashing a flow down to a bucket. */
> +#define BOND_MASK 0xff
> +#define BOND_BUCKETS (BOND_MASK + 1)
> +
> +int dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t 
> *slave_map);
> +int dpif_bond_del(struct dpif *dpif, uint32_t bond_id);
> +int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
> +                        uint64_t *n_bytes);
> +bool dpif_supports_lb_output_action(const struct dpif *);

EC> Add the variable name in the definition, as you do it for all other 
definitions, i.e.
     dpif_supports_lb_output_action(const struct dpif *dpif);
> +
>  

>  /* Miscellaneous. */
>
> diff --git a/lib/odp-execute.c b/lib/odp-execute.c
> index 42d3335..6018e37 100644
> --- a/lib/odp-execute.c
> +++ b/lib/odp-execute.c
> @@ -793,6 +793,7 @@ requires_datapath_assistance(const struct nlattr 
> *a)
>      switch (type) {
>          /* These only make sense in the context of a datapath. */
>      case OVS_ACTION_ATTR_OUTPUT:
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>      case OVS_ACTION_ATTR_TUNNEL_POP:
>      case OVS_ACTION_ATTR_USERSPACE:
> @@ -1068,6 +1069,7 @@ odp_execute_actions(void *dp, struct 
> dp_packet_batch *batch, bool steal,
>              return;
>          }
>          case OVS_ACTION_ATTR_OUTPUT:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>          case OVS_ACTION_ATTR_TUNNEL_PUSH:
>          case OVS_ACTION_ATTR_TUNNEL_POP:
>          case OVS_ACTION_ATTR_USERSPACE:
> diff --git a/lib/odp-util.c b/lib/odp-util.c
> index 746d1e9..4ef4cdc 100644
> --- a/lib/odp-util.c
> +++ b/lib/odp-util.c
> @@ -119,6 +119,7 @@ odp_action_len(uint16_t type)
>
>      switch ((enum ovs_action_attr) type) {
>      case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t);
> +    case OVS_ACTION_ATTR_LB_OUTPUT: return sizeof(uint32_t);
>      case OVS_ACTION_ATTR_TRUNC: return sizeof(struct 
> ovs_action_trunc);
>      case OVS_ACTION_ATTR_TUNNEL_PUSH: return ATTR_LEN_VARIABLE;
>      case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t);
> @@ -1122,6 +1123,9 @@ format_odp_action(struct ds *ds, const struct 
> nlattr *a,
>      case OVS_ACTION_ATTR_OUTPUT:
>          odp_portno_name_format(portno_names, nl_attr_get_odp_port(a), 
> ds);
>          break;
> +    case OVS_ACTION_ATTR_LB_OUTPUT:
> +        ds_put_format(ds, "lb_output(bond,%"PRIu32")", 
> nl_attr_get_u32(a));
> +        break;
>      case OVS_ACTION_ATTR_TRUNC: {
>          const struct ovs_action_trunc *trunc =
>                         nl_attr_get_unspec(a, sizeof *trunc);
> diff --git a/ofproto/bond.c b/ofproto/bond.c
> index 405202f..3503a78 100644
> --- a/ofproto/bond.c
> +++ b/ofproto/bond.c
> @@ -54,10 +54,6 @@ static struct ovs_rwlock rwlock = 
> OVS_RWLOCK_INITIALIZER;
>  static struct hmap all_bonds__ = HMAP_INITIALIZER(&all_bonds__);
>  static struct hmap *const all_bonds OVS_GUARDED_BY(rwlock) = 
> &all_bonds__;
>
> -/* Bit-mask for hashing a flow down to a bucket. */
> -#define BOND_MASK 0xff
> -#define BOND_BUCKETS (BOND_MASK + 1)
> -
>  /* Priority for internal rules created to handle recirculation */
>  #define RECIRC_RULE_PRIORITY 20
>
> @@ -126,6 +122,8 @@ struct bond {
>      enum lacp_status lacp_status; /* Status of LACP negotiations. */
>      bool bond_revalidate;       /* True if flows need revalidation. 
> */
>      uint32_t basis;             /* Basis for flow hash function. */
> +    bool use_bond_buckets;      /* Use bond buckets to avoid 
> recirculation.
> +                                   Applicable only for Balance TCP 
> mode. */
>
>      /* SLB specific bonding info. */
>      struct bond_entry *hash;     /* An array of BOND_BUCKETS 
> elements. */
> @@ -188,6 +186,9 @@ static struct bond_slave 
> *choose_output_slave(const struct bond *,
>  static void update_recirc_rules__(struct bond *bond);
>  static bool bond_is_falling_back_to_ab(const struct bond *);
>

EC> Do we need the additional new line here?

> +static void bond_add_slave_buckets(const struct bond *bond);
> +static void bond_del_slave_buckets(const struct bond *bond);
> +
>  /* Attempts to parse 's' as the name of a bond balancing mode.  If 
> successful,
>   * stores the mode in '*balance' and returns true.  Otherwise returns 
> false
>   * without modifying '*balance'. */
> @@ -282,6 +283,8 @@ bond_unref(struct bond *bond)
>
>      /* Free bond resources. Remove existing post recirc rules. */
>      if (bond->recirc_id) {
> +        /* Delete bond buckets from datapath if installed. */
> +        bond_del_slave_buckets(bond);
>          recirc_free_id(bond->recirc_id);
>          bond->recirc_id = 0;
>      }
> @@ -355,6 +358,7 @@ update_recirc_rules__(struct bond *bond)
>                              &bond->hash[i].pr_rule);
>              }
>          }
> +        bond_add_slave_buckets(bond);
>      }
>
>      HMAP_FOR_EACH_SAFE(pr_op, next_op, hmap_node, &bond->pr_rule_ops) 
> {
> @@ -464,9 +468,14 @@ bond_reconfigure(struct bond *bond, const struct 
> bond_settings *s)
>              bond->recirc_id = recirc_alloc_id(bond->ofproto);
>          }
>      } else if (bond->recirc_id) {
> +        /* Delete bond buckets from datapath if installed. */
> +        bond_del_slave_buckets(bond);
>          recirc_free_id(bond->recirc_id);
>          bond->recirc_id = 0;
>      }
> +    if (bond->use_bond_buckets != s->use_bond_buckets) {
> +        bond->use_bond_buckets = s->use_bond_buckets;
> +    }
>
>      if (bond->balance == BM_AB || !bond->hash || revalidate) {
>          bond_entry_reset(bond);
> @@ -944,6 +953,14 @@ bond_recirculation_account(struct bond *bond)
>      OVS_REQ_WRLOCK(rwlock)
>  {
>      int i;
> +    uint64_t n_bytes[BOND_BUCKETS] = {0};
> +    bool use_bond_buckets = bond_is_cache_mode_enabled(bond);
> +
> +    if (use_bond_buckets) {
> +        /* Retrieve bond stats from datapath. */
> +        dpif_bond_stats_get(bond->ofproto->backer->dpif,
> +                            bond->recirc_id, n_bytes);
> +    }
>
>      for (i=0; i<=BOND_MASK; i++) {
>          struct bond_entry *entry = &bond->hash[i];
> @@ -953,8 +970,12 @@ bond_recirculation_account(struct bond *bond)
>              struct pkt_stats stats;
>              long long int used OVS_UNUSED;
>
> -            rule->ofproto->ofproto_class->rule_get_stats(
> -                rule, &stats, &used);
> +            if (!use_bond_buckets) {
> +                rule->ofproto->ofproto_class->rule_get_stats(
> +                    rule, &stats, &used);
> +            } else {
> +                stats.n_bytes = n_bytes[i];
> +            }
>              bond_entry_account(entry, stats.n_bytes);
>          }
>      }
> @@ -1365,6 +1386,8 @@ bond_print_details(struct ds *ds, const struct 
> bond *bond)
>                    may_recirc ? "yes" : "no", may_recirc ? recirc_id: 
> -1);
>
>      ds_put_format(ds, "bond-hash-basis: %"PRIu32"\n", bond->basis);
> +    ds_put_format(ds, "lb-output-action: %s, bond-id: %u\n",
> +                  bond->use_bond_buckets ? "enabled" : "disabled", 
> recirc_id);
>
>      ds_put_format(ds, "updelay: %d ms\n", bond->updelay);
>      ds_put_format(ds, "downdelay: %d ms\n", bond->downdelay);
> @@ -1942,3 +1965,36 @@ bond_get_changed_active_slave(const char *name, 
> struct eth_addr *mac,
>
>      return false;
>  }
> +
> +bool
> +bond_is_cache_mode_enabled(const struct bond *bond)
> +{
> +    return (bond->balance == BM_TCP) && bond->use_bond_buckets;
> +}
> +
> +static void
> +bond_add_slave_buckets(const struct bond *bond)
> +{
> +    ofp_port_t slave_map[BOND_MASK];
> +
> +    if (bond_is_cache_mode_enabled(bond)) {
> +        for (int i = 0; i < BOND_BUCKETS; i++) {
> +            struct bond_slave *slave = bond->hash[i].slave;
> +
> +            if (slave) {
> +                slave_map[i] = slave->ofp_port;
> +            } else {
> +                slave_map[i] = OFPP_NONE;
> +            }
> +        }
> +        ofproto_dpif_bundle_add(bond->ofproto, bond->recirc_id, 
> slave_map);
> +    }
> +}
> +
> +static void
> +bond_del_slave_buckets(const struct bond *bond)
> +{
> +    if (bond_is_cache_mode_enabled(bond)) {
> +        ofproto_dpif_bundle_del(bond->ofproto, bond->recirc_id);
> +    }
> +}
> diff --git a/ofproto/bond.h b/ofproto/bond.h
> index e7c3d9b..376a280 100644
> --- a/ofproto/bond.h
> +++ b/ofproto/bond.h
> @@ -58,6 +58,8 @@ struct bond_settings {
>                                  /* The MAC address of the interface
>                                     that was active during the last
>                                     ovs run. */
> +    bool use_bond_buckets;      /* Use bond buckets. Only applicable 
> for
> +                                   bond mode BALANCE TCP. */
>  };
>
>  /* Program startup. */
> @@ -122,4 +124,7 @@ void bond_rebalance(struct bond *);
>  */
>  void bond_update_post_recirc_rules(struct bond *, uint32_t 
> *recirc_id,
>                                     uint32_t *hash_basis);
> +
> +bool bond_is_cache_mode_enabled(const struct bond *);

EC> Other functions have variable name, so please add here too, i.e. 
(const struct bond *bond)

> +
>  #endif /* bond.h */
> diff --git a/ofproto/ofproto-dpif-ipfix.c 
> b/ofproto/ofproto-dpif-ipfix.c
> index b413768..796eb6f 100644
> --- a/ofproto/ofproto-dpif-ipfix.c
> +++ b/ofproto/ofproto-dpif-ipfix.c
> @@ -2979,6 +2979,7 @@ dpif_ipfix_read_actions(const struct flow *flow,
>          enum ovs_action_attr type = nl_attr_type(a);
>          switch (type) {
>          case OVS_ACTION_ATTR_OUTPUT:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>              ipfix_actions->output_action = true;
>              break;
>          case OVS_ACTION_ATTR_SAMPLE:
> diff --git a/ofproto/ofproto-dpif-sflow.c 
> b/ofproto/ofproto-dpif-sflow.c
> index f9ea47a..f616fb2 100644
> --- a/ofproto/ofproto-dpif-sflow.c
> +++ b/ofproto/ofproto-dpif-sflow.c
> @@ -1175,8 +1175,9 @@ dpif_sflow_read_actions(const struct flow *flow,
>          case OVS_ACTION_ATTR_RECIRC:
>          case OVS_ACTION_ATTR_HASH:
>          case OVS_ACTION_ATTR_CT:
> -    case OVS_ACTION_ATTR_CT_CLEAR:
> +        case OVS_ACTION_ATTR_CT_CLEAR:
>          case OVS_ACTION_ATTR_METER:
> +        case OVS_ACTION_ATTR_LB_OUTPUT:
>              break;
>
>          case OVS_ACTION_ATTR_SET_MASKED:
> diff --git a/ofproto/ofproto-dpif-xlate.c 
> b/ofproto/ofproto-dpif-xlate.c
> index adf57a5..970bdbc 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -4190,7 +4190,6 @@ compose_output_action__(struct xlate_ctx *ctx, 
> ofp_port_t ofp_port,
>          xlate_commit_actions(ctx);
>
>          if (xr) {
> -            /* Recirculate the packet. */
>              struct ovs_action_hash *act_hash;
>
>              /* Hash action. */
> @@ -4199,15 +4198,27 @@ compose_output_action__(struct xlate_ctx *ctx, 
> ofp_port_t ofp_port,
>                  /* Algorithm supported by all datapaths. */
>                  hash_alg = OVS_HASH_ALG_L4;
>              }
> -            act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
> -                                                OVS_ACTION_ATTR_HASH,
> -                                                sizeof *act_hash);
> -            act_hash->hash_alg = hash_alg;
> -            act_hash->hash_basis = xr->hash_basis;
>
> -            /* Recirc action. */
> -            nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
> -                           xr->recirc_id);
> +            if (bond_is_cache_mode_enabled(xport->xbundle->bond)) {
> +                /*
> +                 * If bond mode is balance-tcp and optimize balance 
> tcp
> +                 * is enabled then use the hash directly for slave 
> selection
> +                 * and avoid recirculation.
> +                 *
> +                 * Currently support for netdev datapath only.
> +                 */
> +                nl_msg_put_u32(ctx->odp_actions, 
> OVS_ACTION_ATTR_LB_OUTPUT,
> +                               xr->recirc_id);
> +            } else {
> +                act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
> +                                                    
> OVS_ACTION_ATTR_HASH,
> +                                                    sizeof 
> *act_hash);
> +                act_hash->hash_alg = hash_alg;
> +                act_hash->hash_basis = xr->hash_basis;
> +                /* Recirculate the packet. */
> +                nl_msg_put_u32(ctx->odp_actions, 
> OVS_ACTION_ATTR_RECIRC,
> +                               xr->recirc_id);
> +            }
>          } else if (is_native_tunnel) {
>              /* Output to native tunnel port. */
>              native_tunnel_output(ctx, xport, flow, odp_port, 
> truncate);
> @@ -7268,7 +7279,8 @@ count_output_actions(const struct ofpbuf 
> *odp_actions)
>      int n = 0;
>
>      NL_ATTR_FOR_EACH_UNSAFE (a, left, odp_actions->data, 
> odp_actions->size) {
> -        if (a->nla_type == OVS_ACTION_ATTR_OUTPUT) {
> +        if ((a->nla_type == OVS_ACTION_ATTR_OUTPUT) ||
> +            (a->nla_type == OVS_ACTION_ATTR_LB_OUTPUT)) {
>              n++;
>          }
>      }
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index d21874b..c4d219b 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -1582,6 +1582,8 @@ check_support(struct dpif_backer *backer)
>      backer->rt_support.ct_timeout = check_ct_timeout_policy(backer);
>      backer->rt_support.explicit_drop_action =
>          dpif_supports_explicit_drop_action(backer->dpif);
> +    backer->rt_support.lb_output_action=
> +        dpif_supports_lb_output_action(backer->dpif);
>
>      /* Flow fields. */
>      backer->rt_support.odp.ct_state = check_ct_state(backer);
> @@ -3441,6 +3443,25 @@ bundle_remove(struct ofport *port_)
>      }
>  }
>
> +int
> +ofproto_dpif_bundle_add(struct ofproto_dpif *ofproto,
> +                        uint32_t bond_id, const ofp_port_t 
> *slave_map)
> +{
> +    odp_port_t odp_map[BOND_BUCKETS];
> +
> +    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
> +        /* Convert ofp_port to odp_port. */
> +        odp_map[bucket] = ofp_port_to_odp_port(ofproto, 
> slave_map[bucket]);
> +    }
> +    return dpif_bond_add(ofproto->backer->dpif, bond_id, odp_map);
> +}
> +
> +int
> +ofproto_dpif_bundle_del(struct ofproto_dpif *ofproto, uint32_t 
> bond_id)
> +{
> +    return dpif_bond_del(ofproto->backer->dpif, bond_id);
> +}
> +
>  static void
>  send_pdu_cb(void *port_, const void *pdu, size_t pdu_size)
>  {
> @@ -5571,6 +5592,7 @@ get_datapath_cap(const char *datapath_type, 
> struct smap *cap)
>      smap_add(cap, "ct_timeout", s.ct_timeout ? "true" : "false");
>      smap_add(cap, "explicit_drop_action",
>               s.explicit_drop_action ? "true" :"false");
> +    smap_add(cap, "lb_output_action", s.lb_output_action ? "true" : 
> "false");
>  }
>
>  /* Gets timeout policy name in 'backer' based on 'zone', 'dl_type' 
> and
> @@ -6609,6 +6631,13 @@ meter_del(struct ofproto *ofproto_, 
> ofproto_meter_id meter_id)
>      ovsrcu_postpone(free_meter_id, arg);
>  }
>
> +static bool
> +is_lb_output_action_supported(const struct ofproto *ofproto_)
> +{
> +    struct ofproto_dpif *ofproto = ofproto_dpif_cast(ofproto_);
> +    return ofproto->backer->rt_support.lb_output_action;
> +}
> +
>  const struct ofproto_class ofproto_dpif_class = {
>      init,
>      enumerate_types,
> @@ -6716,4 +6745,5 @@ const struct ofproto_class ofproto_dpif_class = 
> {
>      ct_flush,                   /* ct_flush */
>      ct_set_zone_timeout_policy,
>      ct_del_zone_timeout_policy,
> +    is_lb_output_action_supported,
>  };
> diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h
> index aee61d6..9e2293e 100644
> --- a/ofproto/ofproto-dpif.h
> +++ b/ofproto/ofproto-dpif.h
> @@ -202,7 +202,10 @@ struct group_dpif *group_dpif_lookup(struct 
> ofproto_dpif *,
>      DPIF_SUPPORT_FIELD(bool, ct_timeout, "Conntrack timeout policy")  
>       \
>                                                                              \
>      /* True if the datapath supports explicit drop action. */         
>       \
> -    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop 
> action")
> +    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop 
> action")  \
> +                                                                      
>       \
> +    /* True if the datapath supports balance_tcp optimization */      
>       \
> +    DPIF_SUPPORT_FIELD(bool, lb_output_action, "Optimized Balance TCP 
> mode")
>
>
>  /* Stores the various features which the corresponding backer 
> supports. */
> @@ -382,6 +385,9 @@ int ofproto_dpif_add_internal_flow(struct 
> ofproto_dpif *,
>                                     struct rule **rulep);
>  int ofproto_dpif_delete_internal_flow(struct ofproto_dpif *, struct 
> match *,
>                                        int priority);
> +int ofproto_dpif_bundle_add(struct ofproto_dpif *, uint32_t bond_id,
> +                            const ofp_port_t *slave_map);
> +int ofproto_dpif_bundle_del(struct ofproto_dpif *, uint32_t bond_id);
>
>  bool ovs_native_tunneling_is_on(struct ofproto_dpif *);
>
> diff --git a/ofproto/ofproto-provider.h b/ofproto/ofproto-provider.h
> index afecb24..571da24 100644
> --- a/ofproto/ofproto-provider.h
> +++ b/ofproto/ofproto-provider.h
> @@ -1905,6 +1905,9 @@ struct ofproto_class {
>      /* Deletes the timeout policy associated with 'zone' in datapath 
> type
>       * 'dp_type'. */
>      void (*ct_del_zone_timeout_policy)(const char *dp_type, uint16_t 
> zone);
> +
> +    /* Return 'true' if datapath supports 'lb-output-action'. */
> +    bool (*is_lb_output_action_supported)(const struct ofproto *);
>  };
>
>  extern const struct ofproto_class ofproto_dpif_class;
> diff --git a/ofproto/ofproto.c b/ofproto/ofproto.c
> index 0fbd6c3..0d1cb49 100644
> --- a/ofproto/ofproto.c
> +++ b/ofproto/ofproto.c
> @@ -1497,6 +1497,16 @@ ofproto_is_mirror_output_bundle(const struct 
> ofproto *ofproto, void *aux)
>              ? 
> ofproto->ofproto_class->is_mirror_output_bundle(ofproto, aux)
>              : false);
>  }
> +
> +/* Returns true if datapath supports lb-output-action. */
> +bool
> +ofproto_is_lb_output_action_supported(const struct ofproto *ofproto)
> +{
> +    return (ofproto->ofproto_class->is_lb_output_action_supported
> +            ? 
> ofproto->ofproto_class->is_lb_output_action_supported(ofproto)
> +            : false);
> +}
> +
>  

>  /* Configuration of OpenFlow tables. */
>
> diff --git a/ofproto/ofproto.h b/ofproto/ofproto.h
> index 2dd2531..dda2733 100644
> --- a/ofproto/ofproto.h
> +++ b/ofproto/ofproto.h
> @@ -505,6 +505,7 @@ unsigned int ofproto_aa_vlan_get_queue_size(struct 
> ofproto *ofproto);
>
>  int ofproto_set_flood_vlans(struct ofproto *, unsigned long 
> *flood_vlans);
>  bool ofproto_is_mirror_output_bundle(const struct ofproto *, void 
> *aux);
> +bool ofproto_is_lb_output_action_supported(const struct ofproto 
> *ofproto);
>
>  /* Configuration of OpenFlow tables. */
>  struct ofproto_table_settings {
> diff --git a/tests/lacp.at b/tests/lacp.at
> index 7b460d7..d7d98f7 100644
> --- a/tests/lacp.at
> +++ b/tests/lacp.at
> @@ -121,6 +121,7 @@ AT_CHECK([ovs-appctl bond/show], [0], [dnl
>  bond_mode: active-backup
>  bond may use recirculation: no, Recirc-ID : -1
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 0
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -286,6 +287,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 1
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -301,6 +303,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 2
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -423,6 +426,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 1
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -440,6 +444,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 2
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -555,6 +560,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 1
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -572,6 +578,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 2
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -692,6 +699,7 @@ slave: p3: current attached
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 1
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> @@ -709,6 +717,7 @@ slave p1: enabled
>  bond_mode: balance-tcp
>  bond may use recirculation: yes, <del>
>  bond-hash-basis: 0
> +lb-output-action: disabled, bond-id: 2
>  updelay: 0 ms
>  downdelay: 0 ms
>  lacp_status: negotiated
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index fe73c38..ffc3921 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -4600,6 +4600,19 @@ port_configure_bond(struct port *port, struct 
> bond_settings *s)
>          /* OVSDB did not store the last active interface */
>          s->active_slave_mac = eth_addr_zero;
>      }
> +
> +    /* use_bond_buckets is disabled by default. */
> +    s->use_bond_buckets = (s->balance == BM_TCP)
> +                          && smap_get_bool(&port->cfg->other_config,
> +                                         "lb-output-action", false);
> +

EC> Nit, The use_bond_buckets is still confusing specially the relation 
with "lb-output-action"

> +    /* Verify if datapath supports lb-output-action. */
> +    if (s->use_bond_buckets
> +        && 
> !ofproto_is_lb_output_action_supported(port->bridge->ofproto)) {
> +        VLOG_WARN("port %s: Unsupported lb-output-action in datapath, 
> "
> +                  "defaulting to false.", port->name);
> +        s->use_bond_buckets = false;
> +    }
>  }
>
>  /* Returns true if 'port' is synthetic, that is, if we constructed it 
> locally
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 4a74ed3..3da8d0c 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -1994,6 +1994,16 @@
>          <code>active-backup</code>.
>        </column>
>
> +      <column name="other_config" key="lb-output-action"
> +              type='{"type": "boolean"}'>
> +        Enable/disable usage of optimized lb-output-action for
> +        balancing flows among output slaves in load balanced bonds in
> +        <code>balance-tcp</code>. When enabled, it uses optimized 
> path for
> +        balance-tcp mode by using rss hash and avoids recirculation.
> +        It affects only new flows, i.e, existing flows remain 
> unchanged.
> +        This knob does not affect other balancing modes.
> +      </column>
> +
>        <group title="Link Failure Detection">
>          <p>
>            An important part of link bonding is detecting that links 
> are down so
> @@ -5788,6 +5798,19 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 
> type=patch options:peer=p1 \
>          higher performance for MPLS and active-active load balancing
>          bonding modes.
>        </column>
> +      <column name="capabilities" key="lb_output_action"
> +              type='{"type": "boolean"}'>
> +        If this is true, then the datapath supports optimized 
> balance-tcp
> +        bond mode. This capability replaces existing 
> <code>hash</code> and
> +        <code>recirc</code> actions with new action 
> <code>lb_output</code>
> +        and avoids recirculation of packet in datapath. It is 
> supported
> +        only for balance-tcp bond mode in netdev datapath. The new 
> action
> +        gives higer performance by using bond buckets instead of post
> +        recirculation flows for selection of slave port from bond. By 
> default
> +        this new action is disabled, however it can be enabled by 
> setting
> +        <ref column="other-config" key="lb-output-action"/> in
> +        <ref table="Port"/> table.
> +      </column>
>        <group title="Connection-Tracking Capabilities">
>          <p>
>            These capabilities are granular because Open vSwitch and 
> its
> -- 
> 2.7.4
>



More information about the dev mailing list