[ovs-dev] [PATCH] dpif-netdev: Enter quiescent state after each offloading operation.

Ilya Maximets i.maximets at ovn.org
Mon Mar 16 11:41:01 UTC 2020

On 3/4/20 4:51 PM, Eli Britstein wrote:
> On 3/3/2020 4:44 PM, Eli Britstein wrote:
>> On 2/28/2020 11:03 AM, Ilya Maximets wrote:
>>> On 2/27/20 10:00 AM, Eli Britstein wrote:
>>>> On 2/26/2020 2:31 PM, Ilya Maximets wrote:
>>>>> On 2/26/20 1:05 PM, Eli Britstein wrote:
>>>>>> On 2/24/2020 1:06 PM, Ilya Maximets wrote:
>>>>>>> On 2/23/20 3:32 PM, Eli Britstein wrote:
>>>>>>>> On 2/21/2020 4:54 PM, Ilya Maximets wrote:
>>>>>>>>> If the offloading queue is big and filled continuously, offloading
>>>>>>>>> thread may have no chance to quiesce blocking rcu callbacks and
>>>>>>>>> other threads waiting for synchronization.
>>>>>>>>> Fix that by entering momentary quiescent state after each operation
>>>>>>>>> since we're not holding any rcu-protected memory here.
>>>>>>>>> Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread")
>>>>>>>>> Reported-by: Eli Britstein <elibr at mellanox.com>
>>>>>>>>> Reported-at: https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.openvswitch.org%2Fpipermail%2Fovs-discuss%2F2020-February%2F049768.html&data=02%7C01%7Celibr%40mellanox.com%7Ceb2c70875cb14a4d4b4108d7bc2d2676%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C637184774411852418&sdata=8YgXyAJ248Mqhiq4tbUcAb357Dqdu1ue%2BBXp%2F6pX6tY%3D&reserved=0
>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets at ovn.org>
>>>>>>>>> ---
>>>>>>>>>      lib/dpif-netdev.c | 1 +
>>>>>>>>>      1 file changed, 1 insertion(+)
>>>>>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>>>>>>> index d393aab5e..a798db45d 100644
>>>>>>>>> --- a/lib/dpif-netdev.c
>>>>>>>>> +++ b/lib/dpif-netdev.c
>>>>>>>>> @@ -2512,6 +2512,7 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
>>>>>>>>>              VLOG_DBG("%s to %s netdev flow\n",
>>>>>>>>>                       ret == 0 ? "succeed" : "failed", op);
>>>>>>>>>              dp_netdev_free_flow_offload(offload);
>>>>>>>>> +        ovsrcu_quiesce();
> Acked-by: Eli Britstein <elibr at mellanox.com>

Applied to master and backported down to branch-2.10.

>>>>>>>> This seems to solve the issue of responsiveness, but I have encountered a crash using it, while there is a lot of flow deletions.
>>>>>>>> #0  0x0000000001e462dc in get_unaligned_u32 (p=0x7f57f7813500) at lib/unaligned.h:86
>>>>>>>> #1  0x0000000001e46388 in hash_bytes (p_=0x7f57f7813500, n=16, basis=0) at lib/hash.c:38
>>>>>>>> #2  0x0000000001f7f836 in ufid_to_rte_flow_data_find (ufid=0x7f57f7813500) at lib/netdev-offload-dpdk.c:134
>>>>>>>> #3  0x0000000001f88693 in netdev_offload_dpdk_flow_del (netdev=0x42260da0, ufid=0x7f57f7813500, stats=0x0) at lib/netdev-offload-dpdk.c:3361
>>>>>>>> #4  0x0000000001e6a00d in netdev_flow_del (netdev=0x42260da0, ufid=0x7f57f7813500, stats=0x0) at lib/netdev-offload.c:296
>>>>>>> This might mean 2 things:
>>>>>>> 1. Incorrect usage of RCU protected data inside netdev-offload-dpdk.
>>>>>>> 2. We're not allowed to quiesce if we have any item in offloading queue,
>>>>>>>       i.e. bad RCU usage scheme in dpif-netdev.
>>>>>>> Both cases are bad and I'm not sure if we can easily fix case #2.
>>>>>>> I'll take a look and try to find a root cause.
>>>>>> It was my bad. Sorry. This is another issue, that exists in my branch only (CT offloads), so it's not related. I was able to fix it.
>>>>> OK.  Good to know.
>>>>>> However, I wanted to reproduce with master branch, but then I cannot create a lot of flows as I get: "upcall: datapath flow limit reached".
>>>>>> I tried setting large number using "ovs-appctl upcall/set-flow-limit 400000", but it didn't help.
>>>>> Number of flows in dpif-netdev is artificially limited by MAX_FLOWS (65536)
>>>>> value.  There was a recent discussion about this limit and I think we could
>>>>> just remove it.  I have a small patch for this, will send soon.
>>>>> For now, you can just raise the hardcoded value of MAX_FLOWS in dpif-netdev.c.
>>>> The issue is not in dpif-netdev.c. It's in ofproto/ofproto-dpif-upcall.c. If I'm not wrong, when it happens, there is not attempt to install the flow, so anyway MAX_FLOWS in dpif-netdev.c is not the blocker.
>>> There are couple of things that controls the maximum number of datapath flows.
>>> First is the other_config:flow-limit that defaults to 200000. Second is the
>>> datapath internal MAX_FLOWS define.  And also, "ovs-appctl upcall/set-flow-limit"
>>> should be able to modify this value.  Sounds strange.
>>>> BTW, my setup is the PF and VF rep in a bridge. I run testpmd with macswap on the VF, and send 10k different src MAC packets (OpenFlow configuration is NORMAL). So, I expect total 20k flows, also less than 64k of MAX_FLOWS.
>>> Are you sure that packets are not corrupted?  Could you dump datapath flows
>>> and verify that these flows are what you've expected?
>> This is for example for 4 different MACs (e4:10:00:00:00:01-e4:10:00:00:00:04). As expected:
>> $ ovs-appctl dpctl/dump-flows -m
>> flow-dump from pmd on cpu core: 8
>> ufid:65aae19f-06a9-4887-8d2d-60b04187f7cd, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(vm1),packet_type(ns=0,id=0),eth(src=e4:11:00:00:00:01,dst=e4:10:00:00:00:01),eth_type(0x0800),ipv4(src=,dst=,proto=17/0,tos=0/0,ttl=64/0,frag=no),udp(src=1025/0,dst=1025/0), packets:11316766, bytes:1244844260, used:0.013s, offloaded:yes, dp:dpdk, actions:pf, dp-extra-info:miniflow_bits(5,1)
>> ufid:a0e4b9bc-8973-41d1-b2cf-29f2a14935d8, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(pf),packet_type(ns=0,id=0),eth(src=e4:10:00:00:00:03,dst=e4:11:00:00:00:01),eth_type(0x0800),ipv4(src=,dst=,proto=17/0,tos=0/0,ttl=64/0,frag=no),udp(src=1025/0,dst=1025/0), packets:109476803, bytes:12042448330, used:0.013s, offloaded:yes, dp:dpdk, actions:vm1, dp-extra-info:miniflow_bits(5,1)
>> ufid:bbb7fe49-422b-434c-bd70-32d6a90c44c4, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(vm1),packet_type(ns=0,id=0),eth(src=e4:11:00:00:00:01,dst=e4:10:00:00:00:03),eth_type(0x0800),ipv4(src=,dst=,proto=17/0,tos=0/0,ttl=64/0,frag=no),udp(src=1025/0,dst=1025/0), packets:11147160, bytes:1226187600, used:0.013s, offloaded:yes, dp:dpdk, actions:pf, dp-extra-info:miniflow_bits(5,1)
>> ufid:bebe7de1-b2d6-48d6-87aa-3fab025a793d, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(pf),packet_type(ns=0,id=0),eth(src=e4:10:00:00:00:02,dst=e4:11:00:00:00:01),eth_type(0x0800),ipv4(src=,dst=,proto=17/0,tos=0/0,ttl=64/0,frag=no),udp(src=1025/0,dst=1025/0), packets:110643486, bytes:12170783460, used:0.013s, offloaded:yes, dp:dpdk, actions:vm1, dp-extra-info:miniflow_bits(5,1)
>> ufid:24e09efe-f02c-41d5-a7c1-3d0aea28cbe0, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(vm1),packet_type(ns=0,id=0),eth(src=e4:11:00:00:00:01,dst=e4:10:00:00:00:02),eth_type(0x0800),ipv4(src=,dst=,proto=17/0,tos=0/0,ttl=64/0,frag=no),udp(src=1025/0,dst=1025/0), packets:11304574, bytes:1243503140, used:0.013s, offloaded:yes, dp:dpdk, actions:pf, dp-extra-info:miniflow_bits(5,1)
>> ufid:4fba2205-37d8-42a1-8899-57bdbbc1c539, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(pf),packet_type(ns=0,id=0),eth(src=e4:10:00:00:00:01,dst=e4:11:00:00:00:01),eth_type(0x0800),ipv4(src=,dst=,proto=17/0,tos=0/0,ttl=64/0,frag=no),udp(src=1025/0,dst=1025/0), packets:110644871, bytes:12170935810, used:0.013s, offloaded:yes, dp:dpdk, actions:vm1, dp-extra-info:miniflow_bits(5,1)
>> ufid:99848c40-a8cd-4732-b520-30a27010964b, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(pf),packet_type(ns=0,id=0),eth(src=e4:10:00:00:00:04,dst=e4:11:00:00:00:01),eth_type(0x0800),ipv4(src=,dst=,proto=17/0,tos=0/0,ttl=64/0,frag=no),udp(src=1025/0,dst=1025/0), packets:110644198, bytes:12170861780, used:0.013s, offloaded:yes, dp:dpdk, actions:vm1, dp-extra-info:miniflow_bits(5,1)
>> ufid:293e5fa1-cce6-41d2-9b3c-db01105f3cc2, skb_priority(0/0),skb_mark(0/0),ct_state(0/0),ct_zone(0/0),ct_mark(0/0),ct_label(0/0),recirc_id(0),dp_hash(0/0),in_port(vm1),packet_type(ns=0,id=0),eth(src=e4:11:00:00:00:01,dst=e4:10:00:00:00:04),eth_type(0x0800),ipv4(src=,dst=,proto=17/0,tos=0/0,ttl=64/0,frag=no),udp(src=1025/0,dst=1025/0), packets:11327727, bytes:1246049970, used:0.013s, offloaded:yes, dp:dpdk, actions:pf, dp-extra-info:miniflow_bits(5,1)
> This issue is resolved by enlarging the MAC learning number:
> ovs-vsctl set bridge br-int other_config:mac-table-size=200000

Hmm. OK.

>>>>>>>> In other trials, I see the ovs-appctl stuck again, but then I cannot attach gdb (or use pstack). It hangs. In this scenario, I see this in dmesg:
>>>>>>>> [  404.450694] ovs-vswitchd[8574]: segfault at 7f8bd4e87280 ip 0000000001e463c3 sp 00007f8cfaf33028 error 4 in ovs-vswitchd[400000+200c000]
>>>>>>>> Yanqin Wei <Yanqin.Wei at arm.com> suggested to add ovsrcu_try_quiesce call (in the same place). It seems more stable (haven't see such crash as above), but it has the same stuck symptom as above.
>>>>>>> ovsrcu_try_quiesce() is the same as ovsrcu_quiesce() except that is fails
>>>>>>> if mutex can not be taken right now.  There is no real difference.  Your
>>>>>>> crashes are less frequent just because thread enters quiescent state less
>>>>>>> frequently.
>>>>>> If so, I think "ovsrcu_try_quiesce" is more suitable here. Don't you think?
>>>>> I don't think that taking of one more mutex could make harm in
>>>>> offloading thread.  If we'll want to care about minor sleeping
>>>>> spikes, we should start from replacing the dp_flow_offload.mutex
>>>>> and sleeping on the conditional variable.
>>>> Agree it's not critical. However, if it's not mandatory, I think it's better to avoid it.
>>> Since it's not critical, I think it's better to not block RCU callbacks
>>> and other threads that waits for RCU synchronization.
>> I am not clear. I thought ovsrcu_try_quiesce is more lightweight than ovsrcu_quiesce. Now I see in the code there is indeed a lock there (seq_try_lock()).
>> So, what's the logic behind your previous comment:
>> Your
>> crashes are less frequent just because thread enters quiescent state less
>> frequently.
>> Is it because of that lock in ovsrcu_try_quiesce the thread enters quiescent less frequently? So actually ovsrcu_try_quiesce is heavier than ovsrcu_quiesce.

They're very similar except that 'try' fails if lock already held,
but regular function goes to sleep waiting for kernel to wake it
up when mutex is free to be locked.  There should be no much difference
beside that.  So, ovsrcu_try_quiesce doesn't enter quiescent state
is seq_lock is busy, i.e. thread enters quiescent state less frequently.

>>>> In addition I think of making it multi-threaded (thread per port) to boost insertion/deletion rates.
>>> This will not give any performance unless you've found a good way how to
>>> avoid taking of global dp->port_mutex.  Also, "per port" seems too much.
>> Of course. Need to find a way not to have a global lock. Regarding "per-port", maybe not a thread per-port, but per-port request queue. The number of threads can maybe configurable, each handling several queues.

More information about the dev mailing list