[ovs-discuss] dpdk watchdog stuck?
Patrik Andersson R
patrik.r.andersson at ericsson.com
Fri Feb 5 12:27:23 UTC 2016
Hi,
I applied the patch suggested:
http://openvswitch.org/pipermail/dev/2016-January/065073.html
While I can appreciate the issue it addresses, it did not help with our
problem.
What happens in our case is that the dpdk_watchdog thread is blocked
from entering its quiescent state by the dpdk_mutex. That mutex is held
in the destroy_device() call made from a vhost_thread, effectively
giving us a deadlock.
The ovsrcu_synchronize() call then waits indefinitely for the blocked
thread to quiesce.
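To make the interaction concrete, here is a rough sketch of the two call
paths as we understand them (names taken from netdev-dpdk.c and the code
quoted later in this thread; the bodies are heavily abbreviated and are
not the actual code):

/* Abbreviated sketch only, not the real code.                             */

/* vhost_thread: destroy_device() is invoked when a guest disconnects.     */
static void
destroy_device(volatile struct virtio_net *dev)
{
    ovs_mutex_lock(&dpdk_mutex);   /* (1) take the global DPDK mutex       */
    /* ... mark the device as removed, ovsrcu_set(..., NULL) ...           */
    ovsrcu_synchronize();          /* (2) wait for every registered thread */
                                   /*     to quiesce, dpdk_watchdog        */
                                   /*     included                         */
    ovs_mutex_unlock(&dpdk_mutex);
}

/* dpdk_watchdog: only quiesces inside xsleep(), which it never reaches    */
/* because it first blocks on the mutex taken in (1).                      */
static void *
dpdk_watchdog(void *dummy OVS_UNUSED)
{
    for (;;) {
        ovs_mutex_lock(&dpdk_mutex);          /* (3) blocks behind (1)     */
        /* ... check_link_status() on each port ...                        */
        ovs_mutex_unlock(&dpdk_mutex);
        xsleep(DPDK_PORT_WATCHDOG_INTERVAL);  /* quiescent period here     */
    }
    return NULL;
}

In short, (2) cannot complete until (3) has quiesced, and (3) cannot reach
its quiescent period until the mutex taken in (1) is released.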
One solution that we came up with, which resolved the deadlock, is
shown below.
I'm wondering if we missed any important aspect of the RCU solution,
since we identified that the dpdk_mutex should not be held across the
ovsrcu_synchronize() call.
@@ -1759,21 +1759,23 @@ destroy_device(volatile struct virtio_net *dev)
             dev->flags &= ~VIRTIO_DEV_RUNNING;
             ovsrcu_set(&vhost_dev->virtio_dev, NULL);
+        }
+    }
+
+    ovs_mutex_unlock(&dpdk_mutex);
             /*
              * Wait for other threads to quiesce before
              * setting the virtio_dev to NULL.
              */
             ovsrcu_synchronize();
             /*
              * As call to ovsrcu_synchronize() will end the quiescent state,
              * put thread back into quiescent state before returning.
              */
             ovsrcu_quiesce_start();
-        }
-    }
-
-    ovs_mutex_unlock(&dpdk_mutex);
     VLOG_INFO("vHost Device '%s' (%ld) has been removed",
               dev->ifname, dev->device_fh);
 }
Any thoughts on this will be appreciated.
Regards,
Patrik
On 01/27/2016 08:26 AM, Patrik Andersson R wrote:
> Hi,
>
> thank you for the link to the patch. Will try that out when I get a
> chance.
>
> I don't yet have a backtrace for the instance where the tracing
> indicates the "main" thread; it does not happen that often.
>
> For the "watchdog3" issue though, we seem to be waiting to acquire a
> mutex:
>
> #0 (LWP 8377) "dpdk_watchdog3" in __lll_lock_wait ()
> #1 in _L_lock_909
> #2 in __GI___pthread_mutex_lock (mutex=0x955680)
> ...
>
> The thread that currently holds the mutex is
>
> Thread (LWP 8378) "vhost_thread2" in poll ()
>
>
> Mutex data: p *(pthread_mutex_t*)0x955680
>
> __lock = 2,
> __count = 0,
> __owner = 8378,
> __nusers = 1,
> __kind = 2,
> __spins = 0,
> __elision = 0,
> ...
>
>
> Any ideas on this will be appreciated.
>
> Regards,
>
> Patrik
>
> On 01/27/2016 06:18 AM, Daniele Di Proietto wrote:
>> Hi,
>>
>> Ben, it turns out I was wrong; this appears to be a genuine bug in
>> dpif-netdev.
>>
>> I sent a fix that I believe might be related to the bug observed
>> here:
>>
>> http://openvswitch.org/pipermail/dev/2016-January/065073.html
>>
>> Otherwise it would be interesting to get a backtrace of the main thread
>> from gdb to investigate further.
>>
>> Thanks,
>>
>> Daniele
>>
>> On 26/01/2016 00:23, "Iezzi, Federico" <federico.iezzi at hpe.com> wrote:
>>
>>> Hi there,
>>>
>>> I have the same issue with OVS 2.4 (latest commit on the 2.4 branch)
>>> and DPDK 2.0.0 in a Debian 8 environment.
>>> After a while it just gets stuck.
>>>
>>> Regards,
>>> Federico
>>>
>>> -----Original Message-----
>>> From: discuss [mailto:discuss-bounces at openvswitch.org] On Behalf Of Ben
>>> Pfaff
>>> Sent: Tuesday, January 26, 2016 7:13 AM
>>> To: Daniele di Proietto <diproiettod at vmware.com>
>>> Cc: discuss at openvswitch.org
>>> Subject: Re: [ovs-discuss] dpdk watchdog stuck?
>>>
>>> Daniele, I think that you said in our meeting today that there was some
>>> sort of bug that falsely blames a thread. Can you explain further?
>>>
>>> On Mon, Jan 25, 2016 at 09:29:52PM +0100, Patrik Andersson R wrote:
>>>> Right, that is likely for sure. Will look there first.
>>>>
>>>> What do you think of the case where the thread is "main"? I've got
>>>> examples of this one as well. I have not been able to figure out so
>>>> far what would cause this.
>>>>
>>>> ...
>>>> ovs-vswitchd.log.1.1.1.1:2016-01-23T01:47:19.026Z|00016|ovs_rcu(urcu2)|WARN|blocked 32768000 ms waiting for main to quiesce
>>>> ovs-vswitchd.log.1.1.1.1:2016-01-23T10:53:27.026Z|00017|ovs_rcu(urcu2)|WARN|blocked 65536000 ms waiting for main to quiesce
>>>> ovs-vswitchd.log.1.1.1.1:2016-01-24T05:05:43.026Z|00018|ovs_rcu(urcu2)|WARN|blocked 131072000 ms waiting for main to quiesce
>>>> ovs-vswitchd.log.1.1.1.1:2016-01-24T18:24:40.826Z|00001|ovs_rcu(urcu1)|WARN|blocked 1092 ms waiting for main to quiesce
>>>> ovs-vswitchd.log.1.1.1.1:2016-01-24T18:24:41.805Z|00002|ovs_rcu(urcu1)|WARN|blocked 2072 ms waiting for main to quiesce
>>>> ...
>>>>
>>>> Could it be in connection with a deletion of a netdev port?
>>>>
>>>> Regards,
>>>>
>>>> Patrik
>>>>
>>>>
>>>> On 01/25/2016 07:50 PM, Ben Pfaff wrote:
>>>>> On Mon, Jan 25, 2016 at 03:09:09PM +0100, Patrik Andersson R wrote:
>>>>>> during robustness testing, where VMs are booted and deleted using
>>>>>> nova boot/delete in rather rapid succession, VMs get stuck in the
>>>>>> spawning state after a few test cycles. Presumably this is because
>>>>>> OVS no longer responds to port additions and deletions, or rather
>>>>>> because responses to these requests become painfully slow. Other
>>>>>> requests towards the vswitchd fail to complete in any reasonable
>>>>>> time frame as well; ovs-appctl vlog/set is one example.
>>>>>>
>>>>>> The only conclusion I can draw at the moment is that some thread
>>>>>> (I've observed main and dpdk_watchdog3) is blocking the
>>>>>> ovsrcu_synchronize() operation for "infinite" time and there is no
>>>>>> fall-back to get out of this.
>>>>>> To recover, the minimum operation seems to be a service restart of
>>>>>> the openvswitch-switch service but that seems to cause other issues
>>>>>> longer term.
>>>>>> In the vswitch log when this happens the following can be observed:
>>>>>>
>>>>>> 2016-01-24T20:36:14.601Z|02742|ovs_rcu(vhost_thread2)|WARN|blocked 1000 ms waiting for dpdk_watchdog3 to quiesce
>>>>> This looks like a bug somewhere in the DPDK code. The watchdog code
>>>>> is really simple:
>>>>>
>>>>> static void *
>>>>> dpdk_watchdog(void *dummy OVS_UNUSED)
>>>>> {
>>>>>     struct netdev_dpdk *dev;
>>>>>
>>>>>     pthread_detach(pthread_self());
>>>>>
>>>>>     for (;;) {
>>>>>         ovs_mutex_lock(&dpdk_mutex);
>>>>>         LIST_FOR_EACH (dev, list_node, &dpdk_list) {
>>>>>             ovs_mutex_lock(&dev->mutex);
>>>>>             check_link_status(dev);
>>>>>             ovs_mutex_unlock(&dev->mutex);
>>>>>         }
>>>>>         ovs_mutex_unlock(&dpdk_mutex);
>>>>>         xsleep(DPDK_PORT_WATCHDOG_INTERVAL);
>>>>>     }
>>>>>
>>>>>     return NULL;
>>>>> }
>>>>>
>>>>> Although it looks at first glance like it doesn't quiesce, xsleep()
>>>>> does that internally, so I guess check_link_status() must be hanging.
>