[ovs-dev] [PATCH v7 1/1] Avoid dp_hash recirculation for balance-tcp bond selection mode

Ilya Maximets ilya.maximets at gmail.com
Tue Sep 24 13:54:24 UTC 2019


On 17.09.2019 11:50, Vishal Deep Ajmera wrote:
> Problem:
> --------
> In OVS-DPDK, flows with output over a bond interface of type “balance-tcp”
> (using a hash on TCP/UDP 5-tuple) get translated by the ofproto layer into
> "HASH" and "RECIRC" datapath actions. After recirculation, the packet is
> forwarded to the bond member port based on 8 bits of the datapath hash
> value computed through dp_hash. This causes performance degradation in the
> following ways:
> 
> 1. The recirculation of the packet implies another lookup of the packet’s
> flow key in the exact match cache (EMC) and potentially Megaflow classifier
> (DPCLS). This is the biggest cost factor.
> 
> 2. The recirculated packets have a new “RSS” hash and compete with the
> original packets for the scarce EMC slots. This implies more EMC misses
> and potentially EMC thrashing, causing costly DPCLS lookups.
> 
> 3. The 256 extra megaflow entries per bond for dp_hash bond selection put
> additional load on the revalidation threads.
> 
> Owing to this performance degradation, deployments stick to “balance-slb”
> bond mode even though it does not do active-active load balancing for
> VXLAN- and GRE-tunnelled traffic because all tunnel packets have the same
> source MAC address.
> 
> Proposed optimization:
> ----------------------
> This proposal introduces a new load-balancing output action instead of
> recirculation.
> 
> Maintain one table per bond (an array of uint16_t values is sufficient) and
> program it from the ofproto layer in the same way internal flows are created
> today, with one entry for each possible hash value (256 entries). Use this
> table to load-balance flows as part of output action processing.
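As a rough illustration, the per-bond bucket table described above could look like the sketch below (names such as `bond_table` and `bond_select_slave` are illustrative, not the patch's actual identifiers):

```c
#include <assert.h>
#include <stdint.h>

/* One table per bond: 256 hash buckets, each mapping to a bond member
 * (slave) port.  The ofproto layer programs the buckets the same way it
 * installs the per-hash internal flows today. */
#define BOND_BUCKETS 256
#define BOND_MASK    0xff

struct bond_table {
    uint32_t bond_id;
    uint16_t bucket_to_slave[BOND_BUCKETS];
};

/* Select the output slave directly from the packet hash, with no
 * recirculation and no second EMC/DPCLS lookup. */
static inline uint16_t
bond_select_slave(const struct bond_table *bond, uint32_t hash)
{
    return bond->bucket_to_slave[hash & BOND_MASK];
}
```

Only the low 8 bits of the hash are consulted, which matches the `dp_hash(.../0xff)` masks visible in the flow dumps later in this mail.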
> 
> Currently xlate_normal() -> output_normal() -> bond_update_post_recirc_rules()
> -> bond_may_recirc() and compose_output_action__() generate
> “dp_hash(hash_l4(0))” and “recirc(<RecircID>)” actions. In this case the
> RecircID identifies the bond. For the recirculated packets the ofproto layer
> installs megaflow entries that match on the RecircID and the masked dp_hash
> and sends them to the corresponding output port.
> 
> Instead, we will now generate actions as
>      "hash(l4(0)),lb_output(bond,<bond id>)"
> 
> This combines hash computation (only when needed; otherwise the RSS hash is
> re-used) with inline load-balancing over the bond. This action is used *only*
> for balance-tcp bonds in the OVS-DPDK datapath (the OVS kernel datapath
> remains unchanged).
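The "compute the hash only if needed" part can be sketched roughly as follows (hypothetical `struct pkt` and field names; OVS actually uses `struct dp_packet` and its own 5-tuple hash functions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packet: if the NIC already delivered a valid RSS hash,
 * reuse it; otherwise compute a 5-tuple hash once and cache it. */
struct pkt {
    bool     hash_valid;   /* set when the NIC provided an RSS hash */
    uint32_t hash;         /* RSS or software-computed hash */
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Placeholder software hash (FNV-1a style mix), standing in for the
 * real 5-tuple hash. */
static uint32_t
sw_hash_5tuple(const struct pkt *p)
{
    uint32_t h = 2166136261u;
    h = (h ^ p->src_ip) * 16777619u;
    h = (h ^ p->dst_ip) * 16777619u;
    h = (h ^ p->src_port) * 16777619u;
    h = (h ^ p->dst_port) * 16777619u;
    h = (h ^ p->proto) * 16777619u;
    return h;
}

static uint32_t
pkt_get_hash(struct pkt *p)
{
    if (!p->hash_valid) {
        p->hash = sw_hash_5tuple(p);   /* compute once... */
        p->hash_valid = true;          /* ...and cache for later actions */
    }
    return p->hash;
}
```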
> 
> Example:
> --------
> Current scheme:
> ---------------
> With 1 IP-UDP flow:
> 
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2828969, bytes:181054016, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)
> 
> recirc_id(0x1),dp_hash(0x113683bd/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:2828937, bytes:181051968, used:0.000s, actions:2
> 
> With 8 IP-UDP flows (with random UDP src port), all hitting the same DPCLS:
> 
> flow-dump from pmd on cpu core: 2
> recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:hash(hash_l4(0)),recirc(0x1)
> 
> recirc_id(0x1),dp_hash(0xf8e02b7e/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:377395, bytes:24153280, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb236c260/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:333486, bytes:21343104, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0x7d89eb18/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:348461, bytes:22301504, used:0.000s, actions:1
> recirc_id(0x1),dp_hash(0xa78d75df/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:633353, bytes:40534592, used:0.000s, actions:2
> recirc_id(0x1),dp_hash(0xb58d846f/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:319901, bytes:20473664, used:0.001s, actions:2
> recirc_id(0x1),dp_hash(0x24534406/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:334985, bytes:21439040, used:0.001s, actions:1
> recirc_id(0x1),dp_hash(0x3cf32550/0xff),in_port(7),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:326404, bytes:20889856, used:0.001s, actions:1
> 
> New scheme:
> -----------
> We can make do with a single flow entry (for any number of new flows):
> 
> in_port(7),packet_type(ns=0,id=0),eth(src=02:00:00:02:14:01,dst=0c:c4:7a:58:f0:2b),eth_type(0x0800),ipv4(frag=no), packets:2674009, bytes:171136576, used:0.000s, actions:hash(l4(0)),lb_output(bond,1)
> 
> A new CLI command has been added to dump the datapath bond cache, as shown below.
> 
> “sudo ovs-appctl dpif-netdev/dp-bond-show [dp]”
> 
> root at ubuntu-190:performance_scripts # sudo ovs-appctl dpif-netdev/dp-bond-show
> Bond cache:
>          bond-id 1 :
>                  bucket 0 - slave 2
>                  bucket 1 - slave 1
>                  bucket 2 - slave 2
>                  bucket 3 - slave 1
> 
> Performance improvement:
> ------------------------
> With a prototype of the proposed idea, the following performance improvement
> is seen with Phy-VM-Phy UDP traffic and a single flow. With multiple flows,
> the improvement is even greater (due to the reduced number of datapath flows).
> 
> 1 VM:
> *****
> +--------------------------------------+
> |                 mpps                 |
> +--------------------------------------+
> | Flows  master  with-opt.   %delta    |
> +--------------------------------------+
> | 1      4.53    5.89        29.96     |
> | 10     4.16    5.89        41.51     |
> | 400    3.55    5.55        56.22     |
> | 1k     3.44    5.45        58.30     |
> | 10k    2.50    4.63        85.34     |
> | 100k   2.29    4.27        86.15     |
> | 500k   2.25    4.27        89.23     |
> +--------------------------------------+
> mpps: million packets per second.
> 
> Signed-off-by: Manohar Krishnappa Chidambaraswamy <manukc at gmail.com>
> Co-authored-by: Manohar Krishnappa Chidambaraswamy <manukc at gmail.com>
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera at ericsson.com>
> 
> CC: Jan Scheurich <jan.scheurich at ericsson.com>
> CC: Venkatesan Pradeep <venkatesan.pradeep at ericsson.com>
> CC: Nitin Katiyar <nitin.katiyar at ericsson.com>
> ---
>   datapath/linux/compat/include/linux/openvswitch.h |   2 +
>   lib/dpif-netdev.c                                 | 515 ++++++++++++++++++++--
>   lib/dpif-netlink.c                                |   3 +
>   lib/dpif-provider.h                               |   8 +
>   lib/dpif.c                                        |  48 ++
>   lib/dpif.h                                        |   7 +
>   lib/odp-execute.c                                 |   2 +
>   lib/odp-util.c                                    |   4 +
>   ofproto/bond.c                                    |  52 ++-
>   ofproto/bond.h                                    |   9 +
>   ofproto/ofproto-dpif-ipfix.c                      |   1 +
>   ofproto/ofproto-dpif-sflow.c                      |   1 +
>   ofproto/ofproto-dpif-xlate.c                      |  39 +-
>   ofproto/ofproto-dpif.c                            |  32 ++
>   ofproto/ofproto-dpif.h                            |  12 +-
>   tests/lacp.at                                     |   9 +
>   vswitchd/bridge.c                                 |   6 +
>   vswitchd/vswitch.xml                              |  10 +
>   18 files changed, 700 insertions(+), 60 deletions(-)
> 
> diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
> index 65a003a..20467f9 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -734,6 +734,7 @@ enum ovs_hash_alg {
>   	OVS_HASH_ALG_L4,
>   #ifndef __KERNEL__
>   	OVS_HASH_ALG_SYM_L4,
> +	OVS_HASH_ALG_L4_RSS,
>   #endif
>   	__OVS_HASH_MAX
>   };
> @@ -989,6 +990,7 @@ enum ovs_action_attr {
>   #ifndef __KERNEL__
>   	OVS_ACTION_ATTR_TUNNEL_PUSH,   /* struct ovs_action_push_tnl*/
>   	OVS_ACTION_ATTR_TUNNEL_POP,    /* u32 port number. */
> +	OVS_ACTION_ATTR_LB_OUTPUT,     /* u32 bond-id */
>   #endif
>   	__OVS_ACTION_ATTR_MAX,	      /* Nothing past this will be accepted
>   				       * from userspace. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index a88a78f..05a0ca3 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -79,6 +79,7 @@
>   #include "unixctl.h"
>   #include "util.h"
>   #include "uuid.h"
> +#include "ofproto/bond.h"
>   
>   VLOG_DEFINE_THIS_MODULE(dpif_netdev);
>   
> @@ -366,6 +367,11 @@ struct dp_netdev {
>   
>       struct conntrack *conntrack;
>       struct pmd_auto_lb pmd_alb;
> +    /* Bonds.
> +     *
> +     * Any lookup into 'bonds' requires taking 'bond_mutex'. */
> +    struct ovs_mutex bond_mutex;
> +    struct hmap bonds;
>   };
>   
>   static void meter_lock(const struct dp_netdev *dp, uint32_t meter_id)
> @@ -596,6 +602,20 @@ struct tx_port {
>       struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
>   };
>   
> +/* Contained by struct tx_bond 'slave_buckets' */
> +struct slave_entry {
> +    uint32_t slave_id;
> +    atomic_ullong n_packets;
> +    atomic_ullong n_bytes;
> +};
> +
> +/* Contained by struct dp_netdev_pmd_thread's 'bond_cache' or 'tx_bonds'. */
> +struct tx_bond {
> +    struct hmap_node node;
> +    uint32_t bond_id;
> +    struct slave_entry slave_buckets[BOND_BUCKETS];
> +};
> +
>   /* A set of properties for the current processing loop that is not directly
>    * associated with the pmd thread itself, but with the packets being
>    * processed or the short-term system configuration (for example, time).
> @@ -708,6 +728,11 @@ struct dp_netdev_pmd_thread {
>       atomic_bool reload_tx_qid;      /* Do we need to reload static_tx_qid? */
>       atomic_bool exit;               /* For terminating the pmd thread. */
>   
> +    atomic_bool reload_bond_cache;  /* Do we need to reload the tx bond
> +                                     * cache? Decoupled from the 'reload'
> +                                     * flag; otherwise a full pmd reload
> +                                     * would become frequent and costly
> +                                     * every time bond rebalancing is done. */

Hi,

Thanks for the new revision.

Back in my review [1] of v5 I asked about the performance difference
between a full PMD reload and reloading only the bond cache.
Did you have a chance to measure it?

I'm concerned about this because having more than one reload point
makes the code more complex and less readable.

One more thought: we could introduce a version sequence number for the
rx/tx configuration, so that a PMD skips the reload if nothing has changed.

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2019-August/362001.html
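That sequence-number idea could be sketched roughly like this (hypothetical names, not existing OVS code): the main thread bumps a counter on every rx/tx config change, and the PMD compares it against its last-seen value before deciding to reload.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct pmd_cfg {
    atomic_uint cfg_seq;       /* bumped by the main thread on config change */
    unsigned    last_seen_seq; /* PMD-local copy of the last applied config */
};

/* Main thread: signal that the rx/tx configuration changed. */
static void
cfg_changed(struct pmd_cfg *c)
{
    atomic_fetch_add(&c->cfg_seq, 1);
}

/* PMD thread: reload only when the sequence number moved. */
static bool
pmd_should_reload(struct pmd_cfg *c)
{
    unsigned seq = atomic_load(&c->cfg_seq);
    if (seq == c->last_seen_seq) {
        return false;          /* config unchanged: skip the reload */
    }
    c->last_seen_seq = seq;
    return true;
}
```

With such a check, a single reload path could stay cheap even if it is triggered often, since unchanged configurations would short-circuit.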

Best regards, Ilya Maximets.

