[ovs-dev] [PATCH] dpif-netdev: Enter quiescent state after each offloading operation.

Mon Feb 24 11:06:30 UTC 2020

On 2/23/20 3:32 PM, Eli Britstein wrote:
> 
> On 2/21/2020 4:54 PM, Ilya Maximets wrote:
>> If the offloading queue is big and filled continuously, offloading
>> thread may have no chance to quiesce blocking rcu callbacks and
>> other threads waiting for synchronization.
>>
>> Fix that by entering momentary quiescent state after each operation
>> since we're not holding any rcu-protected memory here.
>>
>> Fixes: 02bb2824e51d ("dpif-netdev: do hw flow offload in a thread")
>> Reported-by: Eli Britstein <elibr at mellanox.com>
>> Reported-at: https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.openvswitch.org%2Fpipermail%2Fovs-discuss%2F2020-February%2F049768.html&data=02%7C01%7Celibr%40mellanox.com%7Cd9f2868bb21d43721b0408d7b6ddedd5%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C637178936610286432&sdata=I1qteh%2FJ6QHMn593P1uJOvWqbrumsfthYdgZ1Me96Fo%3D&reserved=0
>> Signed-off-by: Ilya Maximets <i.maximets at ovn.org>
>> ---
>>   lib/dpif-netdev.c | 1 +
>>   1 file changed, 1 insertion(+)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index d393aab5e..a798db45d 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -2512,6 +2512,7 @@ dp_netdev_flow_offload_main(void *data OVS_UNUSED)
>>           VLOG_DBG("%s to %s netdev flow\n",
>>                    ret == 0 ? "succeed" : "failed", op);
>>           dp_netdev_free_flow_offload(offload);
>> +        ovsrcu_quiesce();
> 
> This seems to solve the issue of responsiveness, but I have encountered a crash using it, while there is a lot of flow deletions.
> 
> #0  0x0000000001e462dc in get_unaligned_u32 (p=0x7f57f7813500) at lib/unaligned.h:86
> #1  0x0000000001e46388 in hash_bytes (p_=0x7f57f7813500, n=16, basis=0) at lib/hash.c:38
> #2  0x0000000001f7f836 in ufid_to_rte_flow_data_find (ufid=0x7f57f7813500) at lib/netdev-offload-dpdk.c:134
> #3  0x0000000001f88693 in netdev_offload_dpdk_flow_del (netdev=0x42260da0, ufid=0x7f57f7813500, stats=0x0) at lib/netdev-offload-dpdk.c:3361
> #4  0x0000000001e6a00d in netdev_flow_del (netdev=0x42260da0, ufid=0x7f57f7813500, stats=0x0) at lib/netdev-offload.c:296

This might mean 2 things:
1. Incorrect usage of RCU protected data inside netdev-offload-dpdk.
2. We're not allowed to quiesce if we have any item in offloading queue,
   i.e. bad RCU usage scheme in dpif-netdev.

Both cases are bad and I'm not sure if we can easily fix case #2.
I'll take a look and try to find a root cause.

> 
> In other trials, I see the ovs-appctl stuck again, but then I cannot attach gdb (or use pstack). It hangs. In this scenario, I see this in dmesg:
> 
> [  404.450694] ovs-vswitchd[8574]: segfault at 7f8bd4e87280 ip 0000000001e463c3 sp 00007f8cfaf33028 error 4 in ovs-vswitchd[400000+200c000]
> 
> Yanqin Wei <Yanqin.Wei at arm.com> suggested to add ovsrcu_try_quiesce call (in the same place). It seems more stable (haven't see such crash as above), but it has the same stuck symptom as above.

ovsrcu_try_quiesce() is the same as ovsrcu_quiesce() except that is fails
if mutex can not be taken right now.  There is no real difference.  Your
crashes are less frequent just because thread enters quiescent state less
frequently.

Best regards, Ilya Maximets.