[ovs-dev] group dp_hash method works incorrectly when using snat

Jan Scheurich jan.scheurich at ericsson.com
Mon Sep 30 09:49:37 UTC 2019


Hi,

You have pointed out an interesting issue in the netdev datapath implementation (I am not sure to what extent the same applies to the kernel datapath).

Conceptually, the dp_hash of a packet should be based on the current packet's flow. It should not change if the headers remain unchanged.

For performance reasons, the actual implementation of the dp_hash action in odp_execute.c bases the dp_hash value on the packet's current RSS hash, if it has one, and only computes it from the actual packet content otherwise.

However, the RSS hash of the packet is updated at every recirculation in order to improve the EMC lookup success rate. So even if the RSS hash was initially a suitable basis for dp_hash (that is itself uncertain, as the implementation of the RSS hash depends on the NIC hardware and might not satisfy the algorithm specified as part of the dp_hash action), its volatility across recirculations destroys the required stability of the dp_hash.

What we could do is something like the following (not even compiler-tested):

diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index 563ad1da8..1937bb1e6 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -820,16 +820,22 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal,
                 uint32_t hash;

                 DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
-                    /* RSS hash can be used here instead of 5tuple for
-                     * performance reasons. */
-                    if (dp_packet_rss_valid(packet)) {
-                        hash = dp_packet_get_rss_hash(packet);
-                        hash = hash_int(hash, hash_act->hash_basis);
-                    } else {
-                        flow_extract(packet, &flow);
-                        hash = flow_hash_5tuple(&flow, hash_act->hash_basis);
+                    if (packet->md.dp_hash == 0) {
+                        if (packet->md.recirc_id == 0 &&
+                            dp_packet_rss_valid(packet)) {
+                            /* RSS hash is used here instead of 5tuple for
+                             * performance reasons. */
+                            hash = dp_packet_get_rss_hash(packet);
+                            hash = hash_int(hash, hash_act->hash_basis);
+                        } else {
+                            flow_extract(packet, &flow);
+                            hash = flow_hash_5tuple(&flow, hash_act->hash_basis);
+                        }
+                        if (unlikely(hash == 0)) {
+                            hash = 1;
+                        }
+                        packet->md.dp_hash = hash;
                     }
-                    packet->md.dp_hash = hash;
                 }
                 break;
             }
@@ -842,6 +848,9 @@ odp_execute_actions(void *dp, struct dp_packet_batch *batch, bool steal,
                     hash = flow_hash_symmetric_l3l4(&flow,
                                                     hash_act->hash_basis,
                                                     false);
+                    if (unlikely(hash == 0)) {
+                        hash = 1;
+                    }
                     packet->md.dp_hash = hash;
                 }
                 break;
diff --git a/lib/packets.c b/lib/packets.c
index ab0b1a36d..a03a3ab61 100644
--- a/lib/packets.c
+++ b/lib/packets.c
@@ -391,6 +391,8 @@ push_mpls(struct dp_packet *packet, ovs_be16 ethtype, ovs_be32 lse)
     header = dp_packet_resize_l2_5(packet, MPLS_HLEN);
     memmove(header, header + MPLS_HLEN, len);
     memcpy(header + len, &lse, sizeof lse);
+    /* Invalidate dp_hash */
+    packet->md.dp_hash = 0;
 }

 /* If 'packet' is an MPLS packet, removes its outermost MPLS label stack entry.
@@ -411,6 +413,8 @@ pop_mpls(struct dp_packet *packet, ovs_be16 ethtype)
         /* Shift the l2 header forward. */
         memmove((char*)dp_packet_data(packet) + MPLS_HLEN, dp_packet_data(packet), len);
         dp_packet_resize_l2_5(packet, -MPLS_HLEN);
+        /* Invalidate dp_hash */
+        packet->md.dp_hash = 0;
     }
 }

@@ -444,6 +448,8 @@ push_nsh(struct dp_packet *packet, const struct nsh_hdr *nsh_hdr_src)
     packet->packet_type = htonl(PT_NSH);
     dp_packet_reset_offsets(packet);
     packet->l3_ofs = 0;
+    /* Invalidate dp_hash */
+    packet->md.dp_hash = 0;
 }

 bool
@@ -474,6 +480,8 @@ pop_nsh(struct dp_packet *packet)

         length = nsh_hdr_len(nsh);
         dp_packet_reset_packet(packet, length);
+        /* Invalidate dp_hash */
+        packet->md.dp_hash = 0;
         packet->packet_type = htonl(next_pt);
         /* Packet must be recirculated for further processing. */
     }
diff --git a/lib/packets.h b/lib/packets.h
index a4bee3819..8691fa0c2 100644
--- a/lib/packets.h
+++ b/lib/packets.h
@@ -98,8 +98,7 @@ PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
     uint32_t recirc_id;         /* Recirculation id carried with the
                                    recirculating packets. 0 for packets
                                    received from the wire. */
-    uint32_t dp_hash;           /* hash value computed by the recirculation
-                                   action. */
+    uint32_t dp_hash;           /* hash value computed by dp_hash action. */
     uint32_t skb_priority;      /* Packet priority for QoS. */
     uint32_t pkt_mark;          /* Packet mark. */
     uint8_t  ct_state;          /* Connection state. */

The list of places where we need to invalidate the dp_hash is probably not exhaustive.

BR, Jan



> -----Original Message-----
> From: ovs-dev-bounces at openvswitch.org <ovs-dev-
> bounces at openvswitch.org> On Behalf Of ychen
> Sent: Monday, 30 September, 2019 06:09
> To: dev at openvswitch.org
> Subject: [ovs-dev] group dp_hash method works incorrectly when using snat
> 
> Hi,
>    We found that when the same TCP session uses snat with a dp_hash group
> as the output action, the SYN packet and the other packets behave
> differently: the SYN packet outputs to one group bucket, and the other
> packets output to another group bucket.
> 
> 
>    Here is the ovs flows:
> 
> table=0,in_port=DOWN_PORT,tun_id=vni,ip,actions=ct(nat,zone=ZID,table=1
> )
> 
> table=1,ip,ct_state=+new,ct(commit,nat,src=SNAT_PUB_IP,zone=ZID,table=2
> )
>    table=1,ip,ct_state=-new,actions=goto_table(table=2)
>    table=2,ip,actions=group:1
> 
> group=1,type=select,selection_method=dp_hash,bucket=actions=output:UP_
> PORT1,bucket=actions=output:UP_PORT2
> 
> 
>   Here is the datapath flow:
>   tunnel(tun_id=0x1435,src=10.185.2.87,dst=10.185.2.93,flags(-
> df+csum+key)),recirc_id(0),in_port(7),packet_type(ns=0,id=0),eth_type(0x080
> 0),ipv4(src=192.168.100.16/255.255.255.240,frag=no), packets:5, bytes:455,
> used:2.978s, flags:FP.,
> actions:meter(248),meter(249),ct(zone=1298,nat),recirc(0x176)
> flow-dump from pmd on cpu core: 6
> tunnel(tun_id=0x1435,src=10.185.2.87,dst=10.185.2.93,flags(-
> df+csum+key)),ct_state(+new-
> inv),ct_zone(0x512),recirc_id(0x176),in_port(7),packet_type(ns=0,id=0),eth_t
> ype(0x0800),ipv4(frag=no), packets:0, bytes:0, used:never,
> actions:meter(250),ct(commit,zone=1298,nat(src=172.16.1.152:1024-
> 65535)),recirc(0x177)
> tunnel(tun_id=0x1435,src=10.185.2.87,dst=10.185.2.93,flags(-
> df+csum+key)),ct_state(-new-
> inv),ct_zone(0x512),recirc_id(0x176),in_port(7),packet_type(ns=0,id=0),eth(sr
> c=02:00:00:00:00:00,dst=00:00:00:00:00:00),eth_type(0x0800),ipv4(ttl=64,fra
> g=no), packets:4, bytes:389, used:3.002s, flags:FP.,
> actions:set(eth(src=fa:25:fa:c2:52:71,dst=xx:xx:xx:xx:xx:xx)),set(ipv4(ttl=63)),
> hash(hash_l4(0)),recirc(0x178)
> flow-dump from pmd on cpu core: 6
> tunnel(tun_id=0x1435,src=10.185.2.87,dst=10.185.2.93,flags(-
> df+csum+key)),recirc_id(0x178),dp_hash(0x8a6c9809/0xf),in_port(7),packet_t
> ype(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:4, bytes:389,
> used:3.025s, flags:FP., actions:2
> tunnel(tun_id=0x1435,src=10.185.2.87,dst=10.185.2.93,flags(-
> df+csum+key)),recirc_id(0x178),dp_hash(0xbab97b2e/0xf),in_port(7),packet_
> type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no), packets:0, bytes:0,
> used:never, actions:3
> flow-dump from pmd on cpu core: 6
> tunnel(tun_id=0x1435,src=10.185.2.87,dst=10.185.2.93,flags(-
> df+csum+key)),recirc_id(0x177),in_port(7),packet_type(ns=0,id=0),eth(src=02:
> 00:00:00:00:00,dst=00:00:00:00:00:00),eth_type(0x0800),ipv4(ttl=64,frag=no)
> , packets:0, bytes:0, used:never,
> actions:set(eth(src=fa:25:fa:c2:52:71,dst=xx:xx:xx:xx:xx:xx)),set(ipv4(ttl=63)),
> hash(hash_l4(0)),recirc(0x178)
> 
> 
> From the above datapath flows, we can draw the following conclusions:
>  1. The first SYN packet matches ct_state=+new and recirculates 3 times.
>  2. Other packets match ct_state=-new and recirculate only 2 times.
>  3. Packets matching +new and packets matching -new get different dp_hash
> values, and hence may output to different ports.
>    (TCP packets of the same session going out different ports may increase
> the reordering risk.)
> 
> 
> We looked into the OVS code and found the following:
>  dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
>                                 const struct miniflow *mf) {
>     uint32_t hash, recirc_depth;
> 
> 
>     if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
>         hash = dp_packet_get_rss_hash(packet);
>     } else {
>         hash = miniflow_hash_5tuple(mf, 0);
>         dp_packet_set_rss_hash(packet, hash);
>     }
> 
> 
>     /* The RSS hash must account for the recirculation depth to avoid
>      * collisions in the exact match cache */
>     recirc_depth = *recirc_depth_get_unsafe();
>     if (OVS_UNLIKELY(recirc_depth)) {
>         hash = hash_finish(hash, recirc_depth); /* <== this changes the RSS
>                 hash, and this function is called before EMC lookup */
>         dp_packet_set_rss_hash(packet, hash);
>     }
>     return hash;
> }
> 
> 
> So is there any method to fix this problem?
> We tried changing the ovs flow to:
>  table=1,ip,ct_state=-new,actions=ct(commit,table=2)
> and the problem disappears, but then packets matching ct_state=-new also
> need to recirculate 3 times, which may decrease performance.
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

