[ovs-dev] Bug: select group with dp_hash causing recursive recirculation

Zang MingJie zealot0630 at gmail.com
Tue Sep 25 08:45:08 UTC 2018


Hi, we found a serious problem where one pmd is stop working, I want to
share the problem and find solution here.

vswitchd log:

  2018-09-13T23:36:44.377Z|40269235|dpif_netdev(pmd45)|WARN|Packet dropped.
Max recirculation depth exceeded.
  2018-09-13T23:36:44.387Z|40269236|dpif_netdev(pmd45)|WARN|Packet dropped.
Max recirculation depth exceeded.
  2018-09-13T23:36:44.391Z|40269237|dpif_netdev(pmd45)|WARN|Packet dropped.
Max recirculation depth exceeded.

problematic datapath flows:


ct_state(+new-est),recirc_id(0x143c893),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no),tcp(dst=443),
packets:84573093, bytes:6308807903, used:0.009s,
flags:SFPRU.ECN[200][400][800],
actions:meter(306),hash(hash_l4(0)),recirc(0x237b09d)


recirc_id(0x237b09d),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(frag=no),
packets:279713339, bytes:20890205186, used:0.007s,
flags:SFPRU.ECN[200][400][800], actions:hash(hash_l4(0)),recirc(0x237b09d)

corresponding openflow:

  cookie=0x5b5ab65e000f0101, duration=4848269.642s, table=40,
n_packets=974343805, n_bytes=72484367083,
priority=10,tcp,metadata=0xf010000000000/0xffffff0000000000,tp_dst=443
actions=group:983297


group_id=983297,type=select,selection_method=dp_hash,bucket=bucket_id:3057033848,weight:100,actions=ct(commit,table=70,zone=15,exec(nat(dst=10.177.251.203:443))),...``lots
of buckets``...



Following explains how select group with dp_hash works.

To implement select group with dp_hash, two datapath flows are needed:

 1. calculate dp_hash, recirculate to second one
 2. select group bucket by dp_hash

When encounter a datapath miss, openflow doesn't know which one is missing,
so it depends on dp_hash value of the packet:

 if dp_hash == 0 generate first dp flow.
if dp_hash != 0 generate second dp flow.


Back to the problem.

Notice that second datapath flow is a dead loop, it recirculate to itself.
The cause of the problem is here ofproto/ofproto-dpif-xlate.c#L4429[1]:

    /* dp_hash value 0 is special since it means that the dp_hash has not
been
     * computed, as all computed dp_hash values are non-zero.  Therefore
     * compare to zero can be used to decide if the dp_hash value is valid
     * without masking the dp_hash field. */
    if (!dp_hash) {

The comment saying that `dp_hash` shouldn't be zero, but under DPDK, it can
be zero, at lib/odp-execute.c#L747[2]

   /* RSS hash can be used here instead of 5tuple for
    * performance reasons. */
   if (dp_packet_rss_valid(packet)) {
       hash = dp_packet_get_rss_hash(packet);
       hash = hash_int(hash, hash_act->hash_basis);
   } else {
       flow_extract(packet, &flow);
       hash = flow_hash_5tuple(&flow, hash_act->hash_basis);
   }
   packet->md.dp_hash = hash;

I don't know how small chance that `hash_int` returns 0, we have tested
that if the final hash is 0, will definitely trigger the same bug. And due
to the chance is extremely low, I'm also investigation that if there are
other situation that will pass 0 hash to ofp.



IMO, it is silly to depends on dp_hash value, maybe we need a new mechanism
which can pass data between ofp and odp freely. And a quick solution could
be just change the 0 hash to 1.


[1]
https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-xlate.c#L4429
[2] https://github.com/openvswitch/ovs/blob/master/lib/odp-execute.c#L747


More information about the dev mailing list