[ovs-dev] [PATCH v13 1/1] Avoid dp_hash recirculation for balance-tcp bond selection mode

Vishal Deep Ajmera vishal.deep.ajmera@ericsson.com
Fri May 22 08:50:05 UTC 2020


Problem:
--------
In OVS-DPDK, flows with output over a bond interface of type “balance-tcp”
(using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer into
"HASH" and "RECIRC" datapath actions. After recirculation, the packet is
forwarded to the bond member port based on 8 bits of the datapath hash
value computed through dp_hash. This causes performance degradation in the
following ways:

1. The recirculation of the packet implies another lookup of the packet’s
flow key in the exact match cache (EMC) and potentially in the megaflow
classifier (DPCLS). This is the biggest cost factor.

2. The recirculated packets have a new “RSS” hash and compete with the
original packets for the scarce EMC slots. This implies more EMC misses
and potentially EMC thrashing, causing costly DPCLS lookups.

3. The 256 extra megaflow entries per bond for dp_hash bond selection put
additional load on the revalidation threads.

Owing to this performance degradation, deployments stick to “balance-slb”
bond mode even though it does not do active-active load balancing for
VXLAN- and GRE-tunnelled traffic, because all tunnel packets have the same
source MAC address.

Proposed optimization:
----------------------
This proposal introduces a new load-balancing output action instead of
recirculation.

Maintain one table per bond (it could just be an array of uint16 values, one
entry for each of the 256 possible hash values) and program it from the
ofproto layer in the same way the internal recirculation flows are created
today. Use this table to load-balance flows as part of output action
processing; a rough sketch of such a table is shown below.
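The sketch below is only an illustration of the idea, not the patch code; the
structure and field names (lb_bond, lb_bucket, member_port) are made up here,
while the actual implementation uses struct tx_bond and struct slave_entry in
lib/dpif-netdev.c (see the diff):

    /* Sketch only: hypothetical names, the patch itself uses struct tx_bond
     * and struct slave_entry in lib/dpif-netdev.c. */
    #include <stdint.h>

    #define LB_BUCKETS 256        /* One bucket per possible 8-bit hash value. */

    struct lb_bucket {
        uint32_t member_port;     /* Datapath port number of the bond member. */
        uint64_t n_bytes;         /* Bytes sent via this bucket, for rebalancing. */
    };

    struct lb_bond {
        uint32_t bond_id;         /* Identifies the bond (the recirc id is re-used). */
        struct lb_bucket buckets[LB_BUCKETS];
    };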

Currently xlate_normal() -> output_normal() -> bond_update_post_recirc_rules()
-> bond_may_recirc() and compose_output_action__() generate
“hash(hash_l4(0))” and “recirc(<RecircID>)” actions, where the RecircID
identifies the bond. For the recirculated packets the ofproto layer installs
megaflow entries that match on the RecircID and the masked dp_hash and send
them to the corresponding output port.

Instead, we now generate the action
    "lb_output(<bond id>)"

This combines the hash computation (only if needed; otherwise the RSS hash is
re-used) with inline load-balancing over the bond. The action is used *only*
for balance-tcp bonds in the OVS-DPDK (netdev) datapath; the OVS kernel
datapath remains unchanged.
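The per-packet work then reduces to a masked table lookup. A minimal sketch of
the member selection, re-using the hypothetical structures from the sketch
above (the real logic lives in dp_execute_lb_output_action() in the diff
below):

    /* Sketch only: choose the bond member for one packet from its RSS hash. */
    static uint32_t
    lb_select_member(const struct lb_bond *bond, uint32_t rss_hash)
    {
        /* The 8 least-significant bits of the hash pick one of the 256
         * buckets programmed by the ofproto layer. */
        return bond->buckets[rss_hash & (LB_BUCKETS - 1)].member_port;
    }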

Example:
--------
Current scheme:
---------------
With 1 IP-UDP flow:

flow-dump from pmd on cpu core: 2
recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2828969, bytes:181054016, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)

recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:2828937, bytes:181051968, used:0.000s, actions:2

With 8 IP-UDP flows (with random UDP src ports, all hitting the same DPCLS):

flow-dump from pmd on cpu core: 2
recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)

recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:377395, bytes:24153280, used:0.000s, actions:2
recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:333486, bytes:21343104, used:0.000s, actions:1
recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:348461, bytes:22301504, used:0.000s, actions:1
recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:633353, bytes:40534592, used:0.000s, actions:2
recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:319901, bytes:20473664, used:0.001s, actions:2
recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:334985, bytes:21439040, used:0.001s, actions:1
recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:326404, bytes:20889856, used:0.001s, actions:1

New scheme:
-----------
A single flow entry suffices (for any number of flows):

in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:lb_output(1)

A new CLI command has been added to dump the datapath bond cache, as shown below.

“sudo ovs-appctl dpif-netdev/bond-show [dp]”

root@ubuntu-190:performance_scripts # sudo ovs-appctl dpif-netdev/bond-show
Bond cache:
        bond-id 1 :
                bucket 0 - slave 2
                bucket 1 - slave 1
                bucket 2 - slave 2
                bucket 3 - slave 1

Performance improvement:
------------------------
With a prototype of the proposed idea, the following performance improvement
is seen with Phy-VM-Phy UDP traffic. With multiple flows the improvement is
even larger, due to the reduced number of datapath flows.

1 VM:
*****
+--------------------------------------+
|                 mpps                 |
+--------------------------------------+
| Flows  master  with-opt.   %delta    |
+--------------------------------------+
| 1      4.34    5.59        28.86     |
| 10     4.09    5.43        32.61     |
| 400    3.44    5.35        55.25     |
| 1k     3.31    5.25        58.41     |
| 10k    2.46    4.43        79.78     |
| 100k   2.32    4.27        83.59     |
| 500k   2.29    4.23        84.57     |
+--------------------------------------+
mpps: million packets per second.
Packet size: 64 bytes.

Signed-off-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
Co-authored-by: Manohar Krishnappa Chidambaraswamy <manukc@gmail.com>
Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera@ericsson.com>

CC: Matteo Croce <mcroce@redhat.com>
CC: Eelco Chaudron <echaudro@redhat.com>
CC: Ilya Maximets <i.maximets@redhat.com>
CC: Jan Scheurich <jan.scheurich@ericsson.com>
CC: Venkatesan Pradeep <venkatesan.pradeep@ericsson.com>
CC: Ben Pfaff <blp@ovn.org>
CC: Ian Stokes <ian.stokes@intel.com>
CC: David Marchand <david.marchand@redhat.com>
---
 datapath/linux/compat/include/linux/openvswitch.h |   1 +
 lib/dpif-netdev.c                                 | 413 +++++++++++++++++++---
 lib/dpif-netlink.c                                |   3 +
 lib/dpif-provider.h                               |  11 +
 lib/dpif.c                                        |  49 +++
 lib/dpif.h                                        |  13 +
 lib/odp-execute.c                                 |   2 +
 lib/odp-util.c                                    |  14 +
 ofproto/bond.c                                    |  97 ++++-
 ofproto/bond.h                                    |   5 +
 ofproto/ofproto-dpif-ipfix.c                      |   1 +
 ofproto/ofproto-dpif-sflow.c                      |   3 +-
 ofproto/ofproto-dpif-xlate.c                      |  32 +-
 ofproto/ofproto-dpif.c                            |  32 ++
 ofproto/ofproto-dpif.h                            |  10 +-
 ofproto/ofproto-provider.h                        |   3 +
 ofproto/ofproto.c                                 |  10 +
 ofproto/ofproto.h                                 |   1 +
 tests/lacp.at                                     |   9 +
 tests/odp.at                                      |   1 +
 vswitchd/bridge.c                                 |  13 +
 vswitchd/vswitch.xml                              |  23 ++
 22 files changed, 675 insertions(+), 71 deletions(-)

diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
index f7c3b2e..cc41bbe 100644
--- a/datapath/linux/compat/include/linux/openvswitch.h
+++ b/datapath/linux/compat/include/linux/openvswitch.h
@@ -1023,6 +1023,7 @@ enum ovs_action_attr {
 	OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
 	OVS_ACTION_ATTR_TUNNEL_POP,    /* u32 port number. */
 	OVS_ACTION_ATTR_DROP,          /* u32 xlate_error. */
+	OVS_ACTION_ATTR_LB_OUTPUT,     /* u32 bond-id. */
 #endif
 	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
 				       * from userspace. */
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 51c8885..c55d452 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -111,6 +111,7 @@ COVERAGE_DEFINE(datapath_drop_tunnel_push_error);
 COVERAGE_DEFINE(datapath_drop_tunnel_pop_error);
 COVERAGE_DEFINE(datapath_drop_recirc_error);
 COVERAGE_DEFINE(datapath_drop_invalid_port);
+COVERAGE_DEFINE(datapath_drop_invalid_bond);
 COVERAGE_DEFINE(datapath_drop_invalid_tnl_port);
 COVERAGE_DEFINE(datapath_drop_rx_invalid_packet);
 
@@ -310,6 +311,7 @@ struct pmd_auto_lb {
  *
  *    dp_netdev_mutex (global)
  *    port_mutex
+ *    bond_mutex
  *    non_pmd_mutex
  */
 struct dp_netdev {
@@ -377,6 +379,10 @@ struct dp_netdev {
 
     struct conntrack *conntrack;
     struct pmd_auto_lb pmd_alb;
+
+    /* Bonds. */
+    struct ovs_mutex bond_mutex; /* Protects updates of 'tx_bonds'. */
+    struct cmap tx_bonds; /* Contains 'struct tx_bond'. */
 };
 
 static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
@@ -608,6 +614,20 @@ struct tx_port {
     struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
 };
 
+/* Contained by struct tx_bond 'slave_buckets'. */
+struct slave_entry {
+    odp_port_t slave_id;
+    atomic_ullong n_packets;
+    atomic_ullong n_bytes;
+};
+
+/* Contained by struct dp_netdev_pmd_thread's 'tx_bonds'. */
+struct tx_bond {
+    struct cmap_node node;
+    uint32_t bond_id;
+    struct slave_entry slave_buckets[BOND_BUCKETS];
+};
+
 /* A set of properties for the current processing loop that is not directly
  * associated with the pmd thread itself, but with the packets being
  * processed or the short-term system configuration (for example, time).
@@ -740,6 +760,11 @@ struct dp_netdev_pmd_thread {
      * read by the pmd thread. */
     struct hmap tx_ports OVS_GUARDED;
 
+    struct ovs_mutex bond_mutex;    /* Protects updates of 'tx_bonds'. */
+    /* Map of 'tx_bond's used for transmission.  Written by the main thread
+     * and read by the pmd thread. */
+    struct cmap tx_bonds;
+
     /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
      * ports (that support push_tunnel/pop_tunnel), the other contains ports
      * with at least one txq (that support send).  A port can be in both.
@@ -831,6 +856,10 @@ static void dp_netdev_del_rxq_from_pmd(struct dp_netdev_pmd_thread *pmd,
 static int
 dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
                                    bool force);
+static void dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
+                                         struct tx_bond *bond);
+static void dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
+                                           uint32_t bond_id);
 
 static void reconfigure_datapath(struct dp_netdev *dp)
     OVS_REQUIRES(dp->port_mutex);
@@ -1397,6 +1426,49 @@ pmd_perf_show_cmd(struct unixctl_conn *conn, int argc,
     par.command_type = PMD_INFO_PERF_SHOW;
     dpif_netdev_pmd_info(conn, argc, argv, &par);
 }
+
+static void
+dpif_netdev_bond_show(struct unixctl_conn *conn, int argc,
+                      const char *argv[], void *aux OVS_UNUSED)
+{
+    struct ds reply = DS_EMPTY_INITIALIZER;
+    struct dp_netdev *dp = NULL;
+
+    ovs_mutex_lock(&dp_netdev_mutex);
+    if (argc == 2) {
+        dp = shash_find_data(&dp_netdevs, argv[1]);
+    } else if (shash_count(&dp_netdevs) == 1) {
+        /* There's only one datapath. */
+        dp = shash_first(&dp_netdevs)->data;
+    }
+    if (!dp) {
+        ovs_mutex_unlock(&dp_netdev_mutex);
+        unixctl_command_reply_error(conn,
+                                    "please specify an existing datapath");
+        return;
+    }
+
+    if (cmap_count(&dp->tx_bonds) > 0) {
+        struct tx_bond *dp_bond_entry;
+        uint32_t slave_id;
+
+        ds_put_cstr(&reply, "\nBonds:\n");
+        CMAP_FOR_EACH (dp_bond_entry, node, &dp->tx_bonds) {
+            ds_put_format(&reply, "\tbond-id %"PRIu32":\n",
+                          dp_bond_entry->bond_id);
+            for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
+                slave_id =
+                    odp_to_u32(dp_bond_entry->slave_buckets[bucket].slave_id);
+                ds_put_format(&reply, "\t\tbucket %d - slave %"PRIu32"\n",
+                             bucket, slave_id);
+            }
+        }
+    }
+    ovs_mutex_unlock(&dp_netdev_mutex);
+    unixctl_command_reply(conn, ds_cstr(&reply));
+    ds_destroy(&reply);
+}
+
 

 static int
 dpif_netdev_init(void)
@@ -1428,6 +1500,9 @@ dpif_netdev_init(void)
                              "[-us usec] [-q qlen]",
                              0, 10, pmd_perf_log_set_cmd,
                              NULL);
+    unixctl_command_register("dpif-netdev/bond-show", "[dp]",
+                             0, 1, dpif_netdev_bond_show,
+                             NULL);
     return 0;
 }
 
@@ -1552,6 +1627,9 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
     ovs_mutex_init_recursive(&dp->port_mutex);
     hmap_init(&dp->ports);
     dp->port_seq = seq_create();
+    ovs_mutex_init(&dp->bond_mutex);
+    cmap_init(&dp->tx_bonds);
+
     fat_rwlock_init(&dp->upcall_rwlock);
 
     dp->reconfigure_seq = seq_create();
@@ -1658,6 +1736,12 @@ dp_delete_meter(struct dp_netdev *dp, uint32_t meter_id)
     }
 }
 
+static uint32_t
+hash_bond_id(uint32_t bond_id)
+{
+    return hash_int(bond_id, 0);
+}
+
 /* Requires dp_netdev_mutex so that we can't get a new reference to 'dp'
  * through the 'dp_netdevs' shash while freeing 'dp'. */
 static void
@@ -1665,6 +1749,7 @@ dp_netdev_free(struct dp_netdev *dp)
     OVS_REQUIRES(dp_netdev_mutex)
 {
     struct dp_netdev_port *port, *next;
+    struct tx_bond *bond;
 
     shash_find_and_delete(&dp_netdevs, dp->name);
 
@@ -1674,6 +1759,13 @@ dp_netdev_free(struct dp_netdev *dp)
     }
     ovs_mutex_unlock(&dp->port_mutex);
 
+    ovs_mutex_lock(&dp->bond_mutex);
+    CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
+        cmap_remove(&dp->tx_bonds, &bond->node, hash_bond_id(bond->bond_id));
+        ovsrcu_postpone(free, bond);
+    }
+    ovs_mutex_unlock(&dp->bond_mutex);
+
     dp_netdev_destroy_all_pmds(dp, true);
     cmap_destroy(&dp->poll_threads);
 
@@ -1692,6 +1784,9 @@ dp_netdev_free(struct dp_netdev *dp)
     hmap_destroy(&dp->ports);
     ovs_mutex_destroy(&dp->port_mutex);
 
+    cmap_destroy(&dp->tx_bonds);
+    ovs_mutex_destroy(&dp->bond_mutex);
+
     /* Upcalls must be disabled at this point */
     dp_netdev_destroy_upcall_lock(dp);
 
@@ -4423,6 +4518,20 @@ tx_port_lookup(const struct hmap *hmap, odp_port_t port_no)
     return NULL;
 }
 
+static struct tx_bond *
+tx_bond_lookup(const struct cmap *tx_bonds, uint32_t bond_id)
+{
+    struct tx_bond *tx;
+    uint32_t hash = hash_bond_id(bond_id);
+
+    CMAP_FOR_EACH_WITH_HASH (tx, node, hash, tx_bonds) {
+        if (tx->bond_id == bond_id) {
+            return tx;
+        }
+    }
+    return NULL;
+}
+
 static int
 port_reconfigure(struct dp_netdev_port *port)
 {
@@ -4959,9 +5068,11 @@ reconfigure_datapath(struct dp_netdev *dp)
 
     /* Remove from the pmd threads all the ports that have been deleted or
      * need reconfiguration. */
+    ovs_mutex_lock(&dp->bond_mutex);
     CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
         pmd_remove_stale_ports(dp, pmd);
     }
+    ovs_mutex_unlock(&dp->bond_mutex);
 
     /* Reload affected pmd threads.  We must wait for the pmd threads before
      * reconfiguring the ports, because a port cannot be reconfigured while
@@ -5070,14 +5181,22 @@ reconfigure_datapath(struct dp_netdev *dp)
         }
     }
 
-    /* Add every port to the tx cache of every pmd thread, if it's not
-     * there already and if this pmd has at least one rxq to poll. */
+    /* Add every port and bond to the tx port and bond caches of
+     * every pmd thread, if it's not there already and if this pmd
+     * has at least one rxq to poll.
+     */
     CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
         ovs_mutex_lock(&pmd->port_mutex);
         if (hmap_count(&pmd->poll_list) || pmd->core_id == NON_PMD_CORE_ID) {
+            struct tx_bond *bond;
+
             HMAP_FOR_EACH (port, node, &dp->ports) {
                 dp_netdev_add_port_tx_to_pmd(pmd, port);
             }
+
+            CMAP_FOR_EACH (bond, node, &dp->tx_bonds) {
+                dp_netdev_add_bond_tx_to_pmd(pmd, bond);
+            }
         }
         ovs_mutex_unlock(&pmd->port_mutex);
     }
@@ -6125,6 +6244,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     atomic_init(&pmd->reload, false);
     ovs_mutex_init(&pmd->flow_mutex);
     ovs_mutex_init(&pmd->port_mutex);
+    ovs_mutex_init(&pmd->bond_mutex);
     cmap_init(&pmd->flow_table);
     cmap_init(&pmd->classifiers);
     pmd->ctx.last_rxq = NULL;
@@ -6135,6 +6255,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     hmap_init(&pmd->tx_ports);
     hmap_init(&pmd->tnl_port_cache);
     hmap_init(&pmd->send_port_cache);
+    cmap_init(&pmd->tx_bonds);
     /* init the 'flow_cache' since there is no
      * actual thread created for NON_PMD_CORE_ID. */
     if (core_id == NON_PMD_CORE_ID) {
@@ -6155,6 +6276,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
     hmap_destroy(&pmd->send_port_cache);
     hmap_destroy(&pmd->tnl_port_cache);
     hmap_destroy(&pmd->tx_ports);
+    cmap_destroy(&pmd->tx_bonds);
     hmap_destroy(&pmd->poll_list);
     /* All flows (including their dpcls_rules) have been deleted already */
     CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
@@ -6166,6 +6288,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
     ovs_mutex_destroy(&pmd->flow_mutex);
     seq_destroy(pmd->reload_seq);
     ovs_mutex_destroy(&pmd->port_mutex);
+    ovs_mutex_destroy(&pmd->bond_mutex);
     free(pmd);
 }
 
@@ -6235,6 +6358,7 @@ dp_netdev_pmd_clear_ports(struct dp_netdev_pmd_thread *pmd)
 {
     struct rxq_poll *poll;
     struct tx_port *port;
+    struct tx_bond *tx;
 
     ovs_mutex_lock(&pmd->port_mutex);
     HMAP_FOR_EACH_POP (poll, node, &pmd->poll_list) {
@@ -6244,6 +6368,13 @@ dp_netdev_pmd_clear_ports(struct dp_netdev_pmd_thread *pmd)
         free(port);
     }
     ovs_mutex_unlock(&pmd->port_mutex);
+
+    ovs_mutex_lock(&pmd->bond_mutex);
+    CMAP_FOR_EACH (tx, node, &pmd->tx_bonds) {
+        cmap_remove(&pmd->tx_bonds, &tx->node, hash_bond_id(tx->bond_id));
+        ovsrcu_postpone(free, tx);
+    }
+    ovs_mutex_unlock(&pmd->bond_mutex);
 }
 
 /* Adds rx queue to poll_list of PMD thread, if it's not there already. */
@@ -6319,6 +6450,53 @@ dp_netdev_del_port_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
     free(tx);
     pmd->need_reload = true;
 }
+
+/* Add bond to the tx bond cmap of 'pmd'. */
+static void
+dp_netdev_add_bond_tx_to_pmd(struct dp_netdev_pmd_thread *pmd,
+                             struct tx_bond *bond)
+{
+    struct tx_bond *tx;
+
+    ovs_mutex_lock(&pmd->bond_mutex);
+    tx = tx_bond_lookup(&pmd->tx_bonds, bond->bond_id);
+    if (tx) {
+        struct tx_bond *new_tx = xmemdup(bond, sizeof *bond);
+
+        /* Copy the stats for each bucket. */
+        for (int i = 0; i < BOND_BUCKETS; i++) {
+            uint64_t n_packets, n_bytes;
+
+            atomic_read_relaxed(&tx->slave_buckets[i].n_packets, &n_packets);
+            atomic_read_relaxed(&tx->slave_buckets[i].n_bytes, &n_bytes);
+            atomic_init(&new_tx->slave_buckets[i].n_packets, n_packets);
+            atomic_init(&new_tx->slave_buckets[i].n_bytes, n_bytes);
+        }
+        cmap_replace(&pmd->tx_bonds, &tx->node, &new_tx->node,
+                     hash_bond_id(bond->bond_id));
+        ovsrcu_postpone(free, tx);
+    } else {
+        tx = xmemdup(bond, sizeof *bond);
+        cmap_insert(&pmd->tx_bonds, &tx->node, hash_bond_id(bond->bond_id));
+    }
+    ovs_mutex_unlock(&pmd->bond_mutex);
+}
+
+/* Delete bond from the tx bond cmap of 'pmd'. */
+static void
+dp_netdev_del_bond_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
+                               uint32_t bond_id)
+{
+    struct tx_bond *tx;
+
+    ovs_mutex_lock(&pmd->bond_mutex);
+    tx = tx_bond_lookup(&pmd->tx_bonds, bond_id);
+    if (tx) {
+        cmap_remove(&pmd->tx_bonds, &tx->node, hash_bond_id(tx->bond_id));
+        ovsrcu_postpone(free, tx);
+    }
+    ovs_mutex_unlock(&pmd->bond_mutex);
+}
 

 static char *
 dpif_netdev_get_datapath_version(void)
@@ -7144,6 +7322,96 @@ dp_execute_userspace_action(struct dp_netdev_pmd_thread *pmd,
     }
 }
 
+static bool
+dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
+                         struct dp_packet_batch *packets_,
+                         bool should_steal, odp_port_t port_no)
+{
+    struct tx_port *p = pmd_send_port_cache_lookup(pmd, port_no);
+    struct dp_packet_batch out;
+
+    if (!OVS_LIKELY(p)) {
+        COVERAGE_ADD(datapath_drop_invalid_port,
+                     dp_packet_batch_size(packets_));
+        dp_packet_delete_batch(packets_, should_steal);
+        return false;
+    }
+    if (!should_steal) {
+        dp_packet_batch_clone(&out, packets_);
+        dp_packet_batch_reset_cutlen(packets_);
+        packets_ = &out;
+    }
+    dp_packet_batch_apply_cutlen(packets_);
+#ifdef DPDK_NETDEV
+    if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
+             && packets_->packets[0]->source
+             != p->output_pkts.packets[0]->source)) {
+        /* XXX: netdev-dpdk assumes that all packets in a single
+         *      output batch has the same source. Flush here to
+         *      avoid memory access issues. */
+        dp_netdev_pmd_flush_output_on_port(pmd, p);
+    }
+#endif
+    if (dp_packet_batch_size(&p->output_pkts)
+        + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
+        /* Flush here to avoid overflow. */
+        dp_netdev_pmd_flush_output_on_port(pmd, p);
+    }
+    if (dp_packet_batch_is_empty(&p->output_pkts)) {
+        pmd->n_output_batches++;
+    }
+
+    struct dp_packet *packet;
+    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
+        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
+            pmd->ctx.last_rxq;
+        dp_packet_batch_add(&p->output_pkts, packet);
+    }
+    return true;
+}
+
+static void
+dp_execute_lb_output_action(struct dp_netdev_pmd_thread *pmd,
+                            struct dp_packet_batch *packets_,
+                            bool should_steal, uint32_t bond)
+{
+    struct dp_packet *packet;
+    struct tx_bond *p_bond = tx_bond_lookup(&pmd->tx_bonds, bond);
+    struct dp_packet_batch out;
+
+    if (!p_bond) {
+        COVERAGE_ADD(datapath_drop_invalid_bond,
+                     dp_packet_batch_size(packets_));
+        dp_packet_delete_batch(packets_, should_steal);
+        return;
+    }
+    if (!should_steal) {
+        dp_packet_batch_clone(&out, packets_);
+        dp_packet_batch_reset_cutlen(packets_);
+        packets_ = &out;
+    }
+    dp_packet_batch_apply_cutlen(packets_);
+
+    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
+        /*
+         * Look up the bond hash table using the hash to get the slave.
+         */
+        uint32_t hash = dp_packet_get_rss_hash(packet);
+        struct slave_entry *s_entry = &p_bond->slave_buckets[hash & BOND_MASK];
+        odp_port_t bond_member = s_entry->slave_id;
+        uint32_t size = dp_packet_size(packet);
+        struct dp_packet_batch output_pkt;
+
+        dp_packet_batch_init_packet(&output_pkt, packet);
+        if (OVS_LIKELY(dp_execute_output_action(pmd, &output_pkt, true,
+                                                bond_member))) {
+            /* Update slave stats. */
+            non_atomic_ullong_add(&s_entry->n_packets, 1);
+            non_atomic_ullong_add(&s_entry->n_bytes, size);
+        }
+    }
+}
+
 static void
 dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
               const struct nlattr *a, bool should_steal)
@@ -7159,49 +7427,14 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
 
     switch ((enum ovs_action_attr)type) {
     case OVS_ACTION_ATTR_OUTPUT:
-        p = pmd_send_port_cache_lookup(pmd, nl_attr_get_odp_port(a));
-        if (OVS_LIKELY(p)) {
-            struct dp_packet *packet;
-            struct dp_packet_batch out;
-
-            if (!should_steal) {
-                dp_packet_batch_clone(&out, packets_);
-                dp_packet_batch_reset_cutlen(packets_);
-                packets_ = &out;
-            }
-            dp_packet_batch_apply_cutlen(packets_);
-
-#ifdef DPDK_NETDEV
-            if (OVS_UNLIKELY(!dp_packet_batch_is_empty(&p->output_pkts)
-                             && packets_->packets[0]->source
-                                != p->output_pkts.packets[0]->source)) {
-                /* XXX: netdev-dpdk assumes that all packets in a single
-                 *      output batch has the same source. Flush here to
-                 *      avoid memory access issues. */
-                dp_netdev_pmd_flush_output_on_port(pmd, p);
-            }
-#endif
-            if (dp_packet_batch_size(&p->output_pkts)
-                + dp_packet_batch_size(packets_) > NETDEV_MAX_BURST) {
-                /* Flush here to avoid overflow. */
-                dp_netdev_pmd_flush_output_on_port(pmd, p);
-            }
-
-            if (dp_packet_batch_is_empty(&p->output_pkts)) {
-                pmd->n_output_batches++;
-            }
+        dp_execute_output_action(pmd, packets_, should_steal,
+                                 nl_attr_get_odp_port(a));
+        return;
 
-            DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
-                p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
-                                                             pmd->ctx.last_rxq;
-                dp_packet_batch_add(&p->output_pkts, packet);
-            }
-            return;
-        } else {
-            COVERAGE_ADD(datapath_drop_invalid_port,
-                         dp_packet_batch_size(packets_));
-        }
-        break;
+    case OVS_ACTION_ATTR_LB_OUTPUT:
+        dp_execute_lb_output_action(pmd, packets_, should_steal,
+                                    nl_attr_get_u32(a));
+        return;
 
     case OVS_ACTION_ATTR_TUNNEL_PUSH:
         if (should_steal) {
@@ -7813,6 +8046,95 @@ dpif_netdev_ipf_dump_done(struct dpif *dpif OVS_UNUSED, void *ipf_dump_ctx)
 
 }
 
+static int
+dpif_netdev_bond_add(struct dpif *dpif, uint32_t bond_id,
+                     odp_port_t *slave_map)
+{
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+    struct dp_netdev_pmd_thread *pmd;
+    struct tx_bond *new_tx = xzalloc(sizeof *new_tx);
+
+    /* Prepare new bond mapping. */
+    new_tx->bond_id = bond_id;
+    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
+        new_tx->slave_buckets[bucket].slave_id = slave_map[bucket];
+    }
+
+    ovs_mutex_lock(&dp->bond_mutex);
+    /* Check if bond already existed. */
+    struct tx_bond *old_tx = tx_bond_lookup(&dp->tx_bonds, bond_id);
+    if (old_tx) {
+        cmap_replace(&dp->tx_bonds, &old_tx->node, &new_tx->node,
+                     hash_bond_id(bond_id));
+        ovsrcu_postpone(free, old_tx);
+    } else {
+        cmap_insert(&dp->tx_bonds, &new_tx->node,
+                    hash_bond_id(bond_id));
+    }
+    ovs_mutex_unlock(&dp->bond_mutex);
+
+    /* Update all PMDs with new bond mapping. */
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        dp_netdev_add_bond_tx_to_pmd(pmd, new_tx);
+    }
+    return 0;
+}
+
+static int
+dpif_netdev_bond_del(struct dpif *dpif, uint32_t bond_id)
+{
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+    struct tx_bond *tx;
+    struct dp_netdev_pmd_thread *pmd;
+
+    ovs_mutex_lock(&dp->bond_mutex);
+    /* Check if bond existed. */
+    tx = tx_bond_lookup(&dp->tx_bonds, bond_id);
+    if (tx) {
+        cmap_remove(&dp->tx_bonds, &tx->node, hash_bond_id(bond_id));
+        ovsrcu_postpone(free, tx);
+    } else {
+        /* Bond is not present. */
+        ovs_mutex_unlock(&dp->bond_mutex);
+        return 0;
+    }
+    ovs_mutex_unlock(&dp->bond_mutex);
+
+    /* Remove the bond map in all pmds. */
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        dp_netdev_del_bond_tx_from_pmd(pmd, bond_id);
+    }
+    return 0;
+}
+
+static int
+dpif_netdev_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
+                           uint64_t *n_bytes)
+{
+    struct dp_netdev *dp = get_dp_netdev(dpif);
+    struct dp_netdev_pmd_thread *pmd;
+
+    /* Search the bond in all PMDs. */
+    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
+        struct tx_bond *pmd_bond_entry
+            = tx_bond_lookup(&pmd->tx_bonds, bond_id);
+
+        if (!pmd_bond_entry) {
+            continue;
+        }
+
+        /* Read bond stats. */
+        for (int i = 0; i < BOND_BUCKETS; i++) {
+            uint64_t pmd_n_bytes;
+
+            atomic_read_relaxed(&pmd_bond_entry->slave_buckets[i].n_bytes,
+                                &pmd_n_bytes);
+            n_bytes[i] += pmd_n_bytes;
+        }
+    }
+    return 0;
+}
+
 const struct dpif_class dpif_netdev_class = {
     "netdev",
     true,                       /* cleanup_required */
@@ -7886,6 +8208,9 @@ const struct dpif_class dpif_netdev_class = {
     dpif_netdev_meter_set,
     dpif_netdev_meter_get,
     dpif_netdev_meter_del,
+    dpif_netdev_bond_add,
+    dpif_netdev_bond_del,
+    dpif_netdev_bond_stats_get,
 };
 
 static void
diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index dc64210..7818f91 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -4003,6 +4003,9 @@ const struct dpif_class dpif_netlink_class = {
     dpif_netlink_meter_set,
     dpif_netlink_meter_get,
     dpif_netlink_meter_del,
+    NULL,                       /* bond_add */
+    NULL,                       /* bond_del */
+    NULL,                       /* bond_stats_get */
 };
 
 static int
diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
index b77317b..781c7ca 100644
--- a/lib/dpif-provider.h
+++ b/lib/dpif-provider.h
@@ -616,6 +616,17 @@ struct dpif_class {
      * zero. */
     int (*meter_del)(struct dpif *, ofproto_meter_id meter_id,
                      struct ofputil_meter_stats *, uint16_t n_bands);
+
+    /* Adds a bond with 'bond_id' and the slave-map to 'dpif'. */
+    int (*bond_add)(struct dpif *dpif, uint32_t bond_id,
+                    odp_port_t *slave_map);
+
+    /* Removes bond identified by 'bond_id' from 'dpif'. */
+    int (*bond_del)(struct dpif *dpif, uint32_t bond_id);
+
+    /* Reads bond stats from 'dpif'. */
+    int (*bond_stats_get)(struct dpif *dpif, uint32_t bond_id,
+                          uint64_t *n_bytes);
 };
 
 extern const struct dpif_class dpif_netlink_class;
diff --git a/lib/dpif.c b/lib/dpif.c
index 9d9c716..c8dc8ed 100644
--- a/lib/dpif.c
+++ b/lib/dpif.c
@@ -1170,6 +1170,7 @@ dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_,
 
     case OVS_ACTION_ATTR_CT:
     case OVS_ACTION_ATTR_OUTPUT:
+    case OVS_ACTION_ATTR_LB_OUTPUT:
     case OVS_ACTION_ATTR_TUNNEL_PUSH:
     case OVS_ACTION_ATTR_TUNNEL_POP:
     case OVS_ACTION_ATTR_USERSPACE:
@@ -1220,6 +1221,7 @@ dpif_execute_helper_cb(void *aux_, struct dp_packet_batch *packets_,
         struct dp_packet *clone = NULL;
         uint32_t cutlen = dp_packet_get_cutlen(packet);
         if (cutlen && (type == OVS_ACTION_ATTR_OUTPUT
+                        || type == OVS_ACTION_ATTR_LB_OUTPUT
                         || type == OVS_ACTION_ATTR_TUNNEL_PUSH
                         || type == OVS_ACTION_ATTR_TUNNEL_POP
                         || type == OVS_ACTION_ATTR_USERSPACE)) {
@@ -1879,6 +1881,16 @@ dpif_supports_explicit_drop_action(const struct dpif *dpif)
     return dpif_is_netdev(dpif);
 }
 
+bool
+dpif_supports_lb_output_action(const struct dpif *dpif)
+{
+    /*
+     * Balance-tcp optimization is currently supported in netdev
+     * datapath only.
+     */
+    return dpif_is_netdev(dpif);
+}
+
 /* Meters */
 void
 dpif_meter_get_features(const struct dpif *dpif,
@@ -1976,3 +1988,40 @@ dpif_meter_del(struct dpif *dpif, ofproto_meter_id meter_id,
     }
     return error;
 }
+
+int
+dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t *slave_map)
+{
+    int error = -ENOTSUP;
+
+    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_add) {
+        error = dpif->dpif_class->bond_add(dpif, bond_id, slave_map);
+    }
+
+    return error;
+}
+
+int
+dpif_bond_del(struct dpif *dpif, uint32_t bond_id)
+{
+    int error = -ENOTSUP;
+
+    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_del) {
+        error = dpif->dpif_class->bond_del(dpif, bond_id);
+    }
+
+    return error;
+}
+
+int
+dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
+                    uint64_t *n_bytes)
+{
+    int error = -ENOTSUP;
+
+    if (dpif && dpif->dpif_class && dpif->dpif_class->bond_stats_get) {
+        error = dpif->dpif_class->bond_stats_get(dpif, bond_id, n_bytes);
+    }
+
+    return error;
+}
diff --git a/lib/dpif.h b/lib/dpif.h
index 4df8f7c..1cb5a71 100644
--- a/lib/dpif.h
+++ b/lib/dpif.h
@@ -891,6 +891,19 @@ int dpif_meter_get(const struct dpif *, ofproto_meter_id meter_id,
                    struct ofputil_meter_stats *, uint16_t n_bands);
 int dpif_meter_del(struct dpif *, ofproto_meter_id meter_id,
                    struct ofputil_meter_stats *, uint16_t n_bands);
+
+/* Bonding. */
+
+/* Bit-mask for hashing a flow down to a bucket. */
+#define BOND_MASK 0xff
+#define BOND_BUCKETS (BOND_MASK + 1)
+
+int dpif_bond_add(struct dpif *dpif, uint32_t bond_id, odp_port_t *slave_map);
+int dpif_bond_del(struct dpif *dpif, uint32_t bond_id);
+int dpif_bond_stats_get(struct dpif *dpif, uint32_t bond_id,
+                        uint64_t *n_bytes);
+bool dpif_supports_lb_output_action(const struct dpif *dpif);
+
 

 /* Miscellaneous. */
 
diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index 42d3335..6018e37 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -793,6 +793,7 @@ requires_datapath_assistance(const struct nlattr *a)
     switch (type) {
         /* These only make sense in the context of a datapath. */
     case OVS_ACTION_ATTR_OUTPUT:
+    case OVS_ACTION_ATTR_LB_OUTPUT:
     case OVS_ACTION_ATTR_TUNNEL_PUSH:
     case OVS_ACTION_ATTR_TUNNEL_POP:
     case OVS_ACTION_ATTR_USERSPACE:
@@ -1068,6 +1069,7 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal,
             return;
         }
         case OVS_ACTION_ATTR_OUTPUT:
+        case OVS_ACTION_ATTR_LB_OUTPUT:
         case OVS_ACTION_ATTR_TUNNEL_PUSH:
         case OVS_ACTION_ATTR_TUNNEL_POP:
         case OVS_ACTION_ATTR_USERSPACE:
diff --git a/lib/odp-util.c b/lib/odp-util.c
index b66d266..92e9253 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -119,6 +119,7 @@ odp_action_len(uint16_t type)
 
     switch ((enum ovs_action_attr) type) {
     case OVS_ACTION_ATTR_OUTPUT: return sizeof(uint32_t);
+    case OVS_ACTION_ATTR_LB_OUTPUT: return sizeof(uint32_t);
     case OVS_ACTION_ATTR_TRUNC: return sizeof(struct ovs_action_trunc);
     case OVS_ACTION_ATTR_TUNNEL_PUSH: return ATTR_LEN_VARIABLE;
     case OVS_ACTION_ATTR_TUNNEL_POP: return sizeof(uint32_t);
@@ -1132,6 +1133,9 @@ format_odp_action(struct ds *ds, const struct nlattr *a,
     case OVS_ACTION_ATTR_OUTPUT:
         odp_portno_name_format(portno_names, nl_attr_get_odp_port(a), ds);
         break;
+    case OVS_ACTION_ATTR_LB_OUTPUT:
+        ds_put_format(ds, "lb_output(%"PRIu32")", nl_attr_get_u32(a));
+        break;
     case OVS_ACTION_ATTR_TRUNC: {
         const struct ovs_action_trunc *trunc =
                        nl_attr_get_unspec(a, sizeof *trunc);
@@ -2306,6 +2310,16 @@ parse_odp_action__(struct parse_odp_context *context, const char *s,
     }
 
     {
+        uint32_t bond_id;
+        int n;
+
+        if (ovs_scan(s, "lb_output(%"PRIu32")%n", &bond_id, &n)) {
+            nl_msg_put_u32(actions, OVS_ACTION_ATTR_LB_OUTPUT, bond_id);
+            return n;
+        }
+    }
+
+    {
         uint32_t max_len;
         int n;
 
diff --git a/ofproto/bond.c b/ofproto/bond.c
index 405202f..74e8d2e 100644
--- a/ofproto/bond.c
+++ b/ofproto/bond.c
@@ -54,10 +54,6 @@ static struct ovs_rwlock rwlock = OVS_RWLOCK_INITIALIZER;
 static struct hmap all_bonds__ = HMAP_INITIALIZER(&all_bonds__);
 static struct hmap *const all_bonds OVS_GUARDED_BY(rwlock) = &all_bonds__;
 
-/* Bit-mask for hashing a flow down to a bucket. */
-#define BOND_MASK 0xff
-#define BOND_BUCKETS (BOND_MASK + 1)
-
 /* Priority for internal rules created to handle recirculation */
 #define RECIRC_RULE_PRIORITY 20
 
@@ -126,6 +122,8 @@ struct bond {
     enum lacp_status lacp_status; /* Status of LACP negotiations. */
     bool bond_revalidate;       /* True if flows need revalidation. */
     uint32_t basis;             /* Basis for flow hash function. */
+    bool use_lb_output_action;  /* Use lb-output-action to avoid recirculation.
+                                   Applicable only for Balance TCP mode. */
 
     /* SLB specific bonding info. */
     struct bond_entry *hash;     /* An array of BOND_BUCKETS elements. */
@@ -187,6 +185,8 @@ static struct bond_slave *choose_output_slave(const struct bond *,
     OVS_REQ_RDLOCK(rwlock);
 static void update_recirc_rules__(struct bond *bond);
 static bool bond_is_falling_back_to_ab(const struct bond *);
+static void bond_add_lb_output_buckets(const struct bond *bond);
+static void bond_del_lb_output_buckets(const struct bond *bond);
 
 /* Attempts to parse 's' as the name of a bond balancing mode.  If successful,
  * stores the mode in '*balance' and returns true.  Otherwise returns false
@@ -282,6 +282,10 @@ bond_unref(struct bond *bond)
 
     /* Free bond resources. Remove existing post recirc rules. */
     if (bond->recirc_id) {
+        if (bond_use_lb_output_action(bond)) {
+            /* Delete bond buckets from datapath if installed. */
+            bond_del_lb_output_buckets(bond);
+        }
         recirc_free_id(bond->recirc_id);
         bond->recirc_id = 0;
     }
@@ -343,16 +347,25 @@ update_recirc_rules__(struct bond *bond)
     }
 
     if (bond->hash && bond->recirc_id) {
-        for (i = 0; i < BOND_BUCKETS; i++) {
-            struct bond_slave *slave = bond->hash[i].slave;
+        if (bond_use_lb_output_action(bond)) {
+            bond_add_lb_output_buckets(bond);
+            /* No need to install post recirculation rules as we are using
+             * lb-output-action with bond buckets.
+             */
+            ofpbuf_uninit(&ofpacts);
+            return;
+        } else {
+            for (i = 0; i < BOND_BUCKETS; i++) {
+                struct bond_slave *slave = bond->hash[i].slave;
 
-            if (slave) {
-                match_init_catchall(&match);
-                match_set_recirc_id(&match, bond->recirc_id);
-                match_set_dp_hash_masked(&match, i, BOND_MASK);
+                if (slave) {
+                    match_init_catchall(&match);
+                    match_set_recirc_id(&match, bond->recirc_id);
+                    match_set_dp_hash_masked(&match, i, BOND_MASK);
 
-                add_pr_rule(bond, &match, slave->ofp_port,
-                            &bond->hash[i].pr_rule);
+                    add_pr_rule(bond, &match, slave->ofp_port,
+                                &bond->hash[i].pr_rule);
+                }
             }
         }
     }
@@ -464,9 +477,17 @@ bond_reconfigure(struct bond *bond, const struct bond_settings *s)
             bond->recirc_id = recirc_alloc_id(bond->ofproto);
         }
     } else if (bond->recirc_id) {
+        if (bond_use_lb_output_action(bond)) {
+            /* Delete bond buckets from datapath if installed. */
+            bond_del_lb_output_buckets(bond);
+        }
         recirc_free_id(bond->recirc_id);
         bond->recirc_id = 0;
     }
+    if (bond->use_lb_output_action != s->use_lb_output_action) {
+        bond->use_lb_output_action = s->use_lb_output_action;
+        revalidate = true;
+    }
 
     if (bond->balance == BM_AB || !bond->hash || revalidate) {
         bond_entry_reset(bond);
@@ -944,19 +965,31 @@ bond_recirculation_account(struct bond *bond)
     OVS_REQ_WRLOCK(rwlock)
 {
     int i;
+    uint64_t n_bytes[BOND_BUCKETS] = {0};
+    bool use_lb_output_action = bond_use_lb_output_action(bond);
+
+    if (use_lb_output_action) {
+        /* Retrieve bond stats from datapath. */
+        dpif_bond_stats_get(bond->ofproto->backer->dpif,
+                            bond->recirc_id, n_bytes);
+    }
 
     for (i=0; i<=BOND_MASK; i++) {
         struct bond_entry *entry = &bond->hash[i];
         struct rule *rule = entry->pr_rule;
+        struct pkt_stats stats;
 
-        if (rule) {
-            struct pkt_stats stats;
+        if (use_lb_output_action) {
+            stats.n_bytes = n_bytes[i];
+        } else if (rule) {
             long long int used OVS_UNUSED;
 
             rule->ofproto->ofproto_class->rule_get_stats(
                 rule, &stats, &used);
-            bond_entry_account(entry, stats.n_bytes);
+        } else {
+            continue;
         }
+        bond_entry_account(entry, stats.n_bytes);
     }
 }
 
@@ -1365,6 +1398,9 @@ bond_print_details(struct ds *ds, const struct bond *bond)
                   may_recirc ? "yes" : "no", may_recirc ? recirc_id: -1);
 
     ds_put_format(ds, "bond-hash-basis: %"PRIu32"\n", bond->basis);
+    ds_put_format(ds, "lb-output-action: %s, bond-id: %u\n",
+                  bond->use_lb_output_action ? "enabled" : "disabled",
+                  recirc_id);
 
     ds_put_format(ds, "updelay: %d ms\n", bond->updelay);
     ds_put_format(ds, "downdelay: %d ms\n", bond->downdelay);
@@ -1942,3 +1978,34 @@ bond_get_changed_active_slave(const char *name, struct eth_addr *mac,
 
     return false;
 }
+
+bool
+bond_use_lb_output_action(const struct bond *bond)
+{
+    return bond_may_recirc(bond) && bond->use_lb_output_action;
+}
+
+static void
+bond_add_lb_output_buckets(const struct bond *bond)
+{
+    ofp_port_t slave_map[BOND_BUCKETS];
+
+    for (int i = 0; i < BOND_BUCKETS; i++) {
+        struct bond_slave *slave = bond->hash[i].slave;
+
+        if (slave) {
+            slave_map[i] = slave->ofp_port;
+        } else {
+            slave_map[i] = OFPP_NONE;
+        }
+    }
+    ofproto_dpif_add_lb_output_buckets(bond->ofproto, bond->recirc_id,
+                                           slave_map);
+}
+
+static void
+bond_del_lb_output_buckets(const struct bond *bond)
+{
+    ofproto_dpif_delete_lb_output_buckets(bond->ofproto,
+                                          bond->recirc_id);
+}
diff --git a/ofproto/bond.h b/ofproto/bond.h
index e7c3d9b..5176b08 100644
--- a/ofproto/bond.h
+++ b/ofproto/bond.h
@@ -58,6 +58,8 @@ struct bond_settings {
                                 /* The MAC address of the interface
                                    that was active during the last
                                    ovs run. */
+    bool use_lb_output_action;  /* Use lb-output-action. Only applicable for
+                                   bond mode BALANCE TCP. */
 };
 
 /* Program startup. */
@@ -122,4 +124,7 @@ void bond_rebalance(struct bond *);
 */
 void bond_update_post_recirc_rules(struct bond *, uint32_t *recirc_id,
                                    uint32_t *hash_basis);
+
+bool bond_use_lb_output_action(const struct bond *bond);
+
 #endif /* bond.h */
diff --git a/ofproto/ofproto-dpif-ipfix.c b/ofproto/ofproto-dpif-ipfix.c
index b413768..796eb6f 100644
--- a/ofproto/ofproto-dpif-ipfix.c
+++ b/ofproto/ofproto-dpif-ipfix.c
@@ -2979,6 +2979,7 @@ dpif_ipfix_read_actions(const struct flow *flow,
         enum ovs_action_attr type = nl_attr_type(a);
         switch (type) {
         case OVS_ACTION_ATTR_OUTPUT:
+        case OVS_ACTION_ATTR_LB_OUTPUT:
             ipfix_actions->output_action = true;
             break;
         case OVS_ACTION_ATTR_SAMPLE:
diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c
index f9ea47a..f616fb2 100644
--- a/ofproto/ofproto-dpif-sflow.c
+++ b/ofproto/ofproto-dpif-sflow.c
@@ -1175,8 +1175,9 @@ dpif_sflow_read_actions(const struct flow *flow,
         case OVS_ACTION_ATTR_RECIRC:
         case OVS_ACTION_ATTR_HASH:
         case OVS_ACTION_ATTR_CT:
-    case OVS_ACTION_ATTR_CT_CLEAR:
+        case OVS_ACTION_ATTR_CT_CLEAR:
         case OVS_ACTION_ATTR_METER:
+        case OVS_ACTION_ATTR_LB_OUTPUT:
             break;
 
         case OVS_ACTION_ATTR_SET_MASKED:
diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
index 80fba84..57ac574 100644
--- a/ofproto/ofproto-dpif-xlate.c
+++ b/ofproto/ofproto-dpif-xlate.c
@@ -4208,7 +4208,6 @@ compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
         xlate_commit_actions(ctx);
 
         if (xr) {
-            /* Recirculate the packet. */
             struct ovs_action_hash *act_hash;
 
             /* Hash action. */
@@ -4217,15 +4216,27 @@ compose_output_action__(struct xlate_ctx *ctx, ofp_port_t ofp_port,
                 /* Algorithm supported by all datapaths. */
                 hash_alg = OVS_HASH_ALG_L4;
             }
-            act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
-                                                OVS_ACTION_ATTR_HASH,
-                                                sizeof *act_hash);
-            act_hash->hash_alg = hash_alg;
-            act_hash->hash_basis = xr->hash_basis;
 
-            /* Recirc action. */
-            nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
-                           xr->recirc_id);
+            if (bond_use_lb_output_action(xport->xbundle->bond)) {
+                /*
+                 * If bond mode is balance-tcp and lb-output-action is
+                 * enabled, use the hash directly for slave selection and
+                 * avoid recirculation.
+                 *
+                 * Currently supported for the netdev datapath only.
+                 */
+                nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_LB_OUTPUT,
+                               xr->recirc_id);
+            } else {
+                act_hash = nl_msg_put_unspec_uninit(ctx->odp_actions,
+                                                    OVS_ACTION_ATTR_HASH,
+                                                    sizeof *act_hash);
+                act_hash->hash_alg = hash_alg;
+                act_hash->hash_basis = xr->hash_basis;
+                /* Recirculate the packet. */
+                nl_msg_put_u32(ctx->odp_actions, OVS_ACTION_ATTR_RECIRC,
+                               xr->recirc_id);
+            }
         } else if (is_native_tunnel) {
             /* Output to native tunnel port. */
             native_tunnel_output(ctx, xport, flow, odp_port, truncate);
@@ -7308,7 +7319,8 @@ count_output_actions(const struct ofpbuf *odp_actions)
     int n = 0;
 
     NL_ATTR_FOR_EACH_UNSAFE (a, left, odp_actions->data, odp_actions->size) {
-        if (a->nla_type == OVS_ACTION_ATTR_OUTPUT) {
+        if ((a->nla_type == OVS_ACTION_ATTR_OUTPUT) ||
+            (a->nla_type == OVS_ACTION_ATTR_LB_OUTPUT)) {
             n++;
         }
     }
diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
index 7e10375..5380c9c 100644
--- a/ofproto/ofproto-dpif.c
+++ b/ofproto/ofproto-dpif.c
@@ -1582,6 +1582,8 @@ check_support(struct dpif_backer *backer)
     backer->rt_support.ct_timeout = check_ct_timeout_policy(backer);
     backer->rt_support.explicit_drop_action =
         dpif_supports_explicit_drop_action(backer->dpif);
+    backer->rt_support.lb_output_action =
+        dpif_supports_lb_output_action(backer->dpif);
 
     /* Flow fields. */
     backer->rt_support.odp.ct_state = check_ct_state(backer);
@@ -3441,6 +3443,27 @@ bundle_remove(struct ofport *port_)
     }
 }
 
+int
+ofproto_dpif_add_lb_output_buckets(struct ofproto_dpif *ofproto,
+                                   uint32_t bond_id,
+                                   const ofp_port_t *slave_map)
+{
+    odp_port_t odp_map[BOND_BUCKETS];
+
+    for (int bucket = 0; bucket < BOND_BUCKETS; bucket++) {
+        /* Convert ofp_port to odp_port. */
+        odp_map[bucket] = ofp_port_to_odp_port(ofproto, slave_map[bucket]);
+    }
+    return dpif_bond_add(ofproto->backer->dpif, bond_id, odp_map);
+}
+
+int
+ofproto_dpif_delete_lb_output_buckets(struct ofproto_dpif *ofproto,
+                                      uint32_t bond_id)
+{
+    return dpif_bond_del(ofproto->backer->dpif, bond_id);
+}
+
 static void
 send_pdu_cb(void *port_, const void *pdu, size_t pdu_size)
 {
@@ -5572,6 +5595,7 @@ get_datapath_cap(const char *datapath_type, struct smap *cap)
     smap_add(cap, "ct_timeout", s.ct_timeout ? "true" : "false");
     smap_add(cap, "explicit_drop_action",
              s.explicit_drop_action ? "true" :"false");
+    smap_add(cap, "lb_output_action", s.lb_output_action ? "true" : "false");
 }
 
 /* Gets timeout policy name in 'backer' based on 'zone', 'dl_type' and
@@ -6610,6 +6634,13 @@ meter_del(struct ofproto *ofproto_, ofproto_meter_id meter_id)
     ovsrcu_postpone(free_meter_id, arg);
 }
 
+static bool
+is_lb_output_action_supported(const struct ofproto *ofproto_)
+{
+    struct ofproto_dpif *ofproto = ofproto_dpif_cast(ofproto_);
+    return ofproto->backer->rt_support.lb_output_action;
+}
+
 const struct ofproto_class ofproto_dpif_class = {
     init,
     enumerate_types,
@@ -6717,4 +6748,5 @@ const struct ofproto_class ofproto_dpif_class = {
     ct_flush,                   /* ct_flush */
     ct_set_zone_timeout_policy,
     ct_del_zone_timeout_policy,
+    is_lb_output_action_supported,
 };
diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h
index aee61d6..9fc6f88 100644
--- a/ofproto/ofproto-dpif.h
+++ b/ofproto/ofproto-dpif.h
@@ -202,7 +202,10 @@ struct group_dpif *group_dpif_lookup(struct ofproto_dpif *,
     DPIF_SUPPORT_FIELD(bool, ct_timeout, "Conntrack timeout policy")        \
                                                                             \
     /* True if the datapath supports explicit drop action. */               \
-    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action")
+    DPIF_SUPPORT_FIELD(bool, explicit_drop_action, "Explicit Drop action")  \
+                                                                            \
+    /* True if the datapath supports balance_tcp optimization */            \
+    DPIF_SUPPORT_FIELD(bool, lb_output_action, "Optimized Balance TCP mode")
 
 
 /* Stores the various features which the corresponding backer supports. */
@@ -382,6 +385,11 @@ int ofproto_dpif_add_internal_flow(struct ofproto_dpif *,
                                    struct rule **rulep);
 int ofproto_dpif_delete_internal_flow(struct ofproto_dpif *, struct match *,
                                       int priority);
+int ofproto_dpif_add_lb_output_buckets(struct ofproto_dpif *dpif,
+                                       uint32_t bond_id,
+                                       const ofp_port_t *slave_map);
+int ofproto_dpif_delete_lb_output_buckets(struct ofproto_dpif *dpif,
+                                          uint32_t bond_id);
 
 bool ovs_native_tunneling_is_on(struct ofproto_dpif *);
 
diff --git a/ofproto/ofproto-provider.h b/ofproto/ofproto-provider.h
index afecb24..571da24 100644
--- a/ofproto/ofproto-provider.h
+++ b/ofproto/ofproto-provider.h
@@ -1905,6 +1905,9 @@ struct ofproto_class {
     /* Deletes the timeout policy associated with 'zone' in datapath type
      * 'dp_type'. */
     void (*ct_del_zone_timeout_policy)(const char *dp_type, uint16_t zone);
+
+    /* Return 'true' if datapath supports 'lb-output-action'. */
+    bool (*is_lb_output_action_supported)(const struct ofproto *);
 };
 
 extern const struct ofproto_class ofproto_dpif_class;
diff --git a/ofproto/ofproto.c b/ofproto/ofproto.c
index 59f06aa..b34522a 100644
--- a/ofproto/ofproto.c
+++ b/ofproto/ofproto.c
@@ -1497,6 +1497,16 @@ ofproto_is_mirror_output_bundle(const struct ofproto *ofproto, void *aux)
             ? ofproto->ofproto_class->is_mirror_output_bundle(ofproto, aux)
             : false);
 }
+
+/* Returns true if datapath supports lb-output-action. */
+bool
+ofproto_is_lb_output_action_supported(const struct ofproto *ofproto)
+{
+    return (ofproto->ofproto_class->is_lb_output_action_supported
+            ? ofproto->ofproto_class->is_lb_output_action_supported(ofproto)
+            : false);
+}
+
 

 /* Configuration of OpenFlow tables. */
 
diff --git a/ofproto/ofproto.h b/ofproto/ofproto.h
index 2dd2531..dda2733 100644
--- a/ofproto/ofproto.h
+++ b/ofproto/ofproto.h
@@ -505,6 +505,7 @@ unsigned int ofproto_aa_vlan_get_queue_size(struct ofproto *ofproto);
 
 int ofproto_set_flood_vlans(struct ofproto *, unsigned long *flood_vlans);
 bool ofproto_is_mirror_output_bundle(const struct ofproto *, void *aux);
+bool ofproto_is_lb_output_action_supported(const struct ofproto *ofproto);
 
 /* Configuration of OpenFlow tables. */
 struct ofproto_table_settings {
diff --git a/tests/lacp.at b/tests/lacp.at
index 7b460d7..d7d98f7 100644
--- a/tests/lacp.at
+++ b/tests/lacp.at
@@ -121,6 +121,7 @@ AT_CHECK([ovs-appctl bond/show], [0], [dnl
 bond_mode: active-backup
 bond may use recirculation: no, Recirc-ID : -1
 bond-hash-basis: 0
+lb-output-action: disabled, bond-id: 0
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -286,6 +287,7 @@ slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled, bond-id: 1
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -301,6 +303,7 @@ slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled, bond-id: 2
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -423,6 +426,7 @@ slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled, bond-id: 1
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -440,6 +444,7 @@ slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled, bond-id: 2
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -555,6 +560,7 @@ slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled, bond-id: 1
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -572,6 +578,7 @@ slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled, bond-id: 2
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -692,6 +699,7 @@ slave: p3: current attached
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled, bond-id: 1
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
@@ -709,6 +717,7 @@ slave p1: enabled
 bond_mode: balance-tcp
 bond may use recirculation: yes, <del>
 bond-hash-basis: 0
+lb-output-action: disabled, bond-id: 2
 updelay: 0 ms
 downdelay: 0 ms
 lacp_status: negotiated
diff --git a/tests/odp.at b/tests/odp.at
index 3ab9ad6..1ebdf05 100644
--- a/tests/odp.at
+++ b/tests/odp.at
@@ -383,6 +383,7 @@ check_pkt_len(size=200,gt(4),le(5))
 check_pkt_len(size=200,gt(drop),le(5))
 check_pkt_len(size=200,gt(ct(nat)),le(drop))
 check_pkt_len(size=200,gt(set(eth(src=00:01:02:03:04:05,dst=10:11:12:13:14:15))),le(set(eth(src=00:01:02:03:04:06,dst=10:11:12:13:14:16))))
+lb_output(1)
 ])
 AT_CHECK_UNQUOTED([ovstest test-odp parse-actions < actions.txt], [0],
   [`cat actions.txt`
diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
index fe73c38..de14fd4 100644
--- a/vswitchd/bridge.c
+++ b/vswitchd/bridge.c
@@ -4600,6 +4600,19 @@ port_configure_bond(struct port *port, struct bond_settings *s)
         /* OVSDB did not store the last active interface */
         s->active_slave_mac = eth_addr_zero;
     }
+
+    /* lb-output-action is disabled by default. */
+    s->use_lb_output_action = (s->balance == BM_TCP)
+                          && smap_get_bool(&port->cfg->other_config,
+                                         "lb-output-action", false);
+
+    /* Verify if datapath supports lb-output-action. */
+    if (s->use_lb_output_action
+        && !ofproto_is_lb_output_action_supported(port->bridge->ofproto)) {
+        VLOG_WARN("port %s: Unsupported lb-output-action in datapath, "
+                  "defaulting to false.", port->name);
+        s->use_lb_output_action = false;
+    }
 }
 
 /* Returns true if 'port' is synthetic, that is, if we constructed it locally
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 6d33437..5a5a474 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -1994,6 +1994,16 @@
         <code>active-backup</code>.
       </column>
 
+      <column name="other_config" key="lb-output-action"
+              type='{"type": "boolean"}'>
+        Enables/disables usage of the optimized lb-output-action for
+        balancing flows among output slaves in load-balanced bonds in
+        <code>balance-tcp</code> mode. When enabled, an optimized path is
+        used for balance-tcp mode that re-uses the RSS hash and avoids
+        recirculation. It affects only new flows, i.e., existing flows
+        remain unchanged. This knob does not affect other balancing modes.
+      </column>
+
       <group title="Link Failure Detection">
         <p>
           An important part of link bonding is detecting that links are down so
@@ -5812,6 +5822,19 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
         higher performance for MPLS and active-active load balancing
         bonding modes.
       </column>
+      <column name="capabilities" key="lb_output_action"
+              type='{"type": "boolean"}'>
+        If this is true, then the datapath supports the optimized balance-tcp
+        bond mode. This capability replaces the existing <code>hash</code> and
+        <code>recirc</code> actions with the new <code>lb_output</code> action
+        and avoids recirculation of packets in the datapath. It is supported
+        only for balance-tcp bond mode in the netdev datapath. The new action
+        gives higher performance by using bond buckets instead of post
+        recirculation flows for selection of the slave port from the bond. By
+        default this new action is disabled; it can be enabled by setting
+        <ref column="other_config" key="lb-output-action"/> in the
+        <ref table="Port"/> table.
+      </column>
       <group title="Connection-Tracking Capabilities">
         <p>
           These capabilities are granular because Open vSwitch and its
-- 
2.7.4


