[ovs-discuss] dpdk watchdog stuck?

Daniele Di Proietto diproiettod at vmware.com
Wed Jan 27 05:18:34 UTC 2016


Hi,

Ben, it turns out I was wrong: this appears to be a genuine bug in
dpif-netdev.

I sent a fix that I believe might be related to the bug observed
here:

http://openvswitch.org/pipermail/dev/2016-January/065073.html

Otherwise it would be interesting to get a backtrace of the main thread
from gdb to investigate further.

Thanks,

Daniele

On 26/01/2016 00:23, "Iezzi, Federico" <federico.iezzi at hpe.com> wrote:

>Hi there,
>
>I have the same issue with OVS 2.4 (latest commit in the branch 2.4) and
>DPDK 2.0.0 in a Debian 8 environment.
>After a while it just gets stuck.
>
>Regards,
>Federico
>
>-----Original Message-----
>From: discuss [mailto:discuss-bounces at openvswitch.org] On Behalf Of Ben
>Pfaff
>Sent: Tuesday, January 26, 2016 7:13 AM
>To: Daniele di Proietto <diproiettod at vmware.com>
>Cc: discuss at openvswitch.org
>Subject: Re: [ovs-discuss] dpdk watchdog stuck?
>
>Daniele, I think that you said in our meeting today that there was some
>sort of bug that falsely blames a thread.  Can you explain further?
>
>On Mon, Jan 25, 2016 at 09:29:52PM +0100, Patrik Andersson R wrote:
>> Right, that is quite likely. Will look there first.
>> 
>> What do you think of the case where the thread is "main"? I've got
>> examples of this one as well, but so far I have not been able to
>> figure out what would cause it.
>> 
>> ...
>> ovs-vswitchd.log.1.1.1.1:2016-01-23T01:47:19.026Z|00016|ovs_rcu(urcu2)|WARN|blocked 32768000 ms waiting for main to quiesce
>> ovs-vswitchd.log.1.1.1.1:2016-01-23T10:53:27.026Z|00017|ovs_rcu(urcu2)|WARN|blocked 65536000 ms waiting for main to quiesce
>> ovs-vswitchd.log.1.1.1.1:2016-01-24T05:05:43.026Z|00018|ovs_rcu(urcu2)|WARN|blocked 131072000 ms waiting for main to quiesce
>> ovs-vswitchd.log.1.1.1.1:2016-01-24T18:24:40.826Z|00001|ovs_rcu(urcu1)|WARN|blocked 1092 ms waiting for main to quiesce
>> ovs-vswitchd.log.1.1.1.1:2016-01-24T18:24:41.805Z|00002|ovs_rcu(urcu1)|WARN|blocked 2072 ms waiting for main to quiesce
>> ...
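
The durations in the log excerpt above double from one warning to the
next (32768000, 65536000, 131072000 ms), which is consistent with a
warning threshold that starts at 1000 ms and doubles after every
message. The sketch below only models that arithmetic under this
assumption; warnings_logged() is a hypothetical helper for
illustration, not an OVS function.

```c
#include <assert.h>

/* Hypothetical helper for illustration only; not an OVS function.
 * Assumes the first warning is emitted once the waiter has been blocked
 * for 1000 ms and that the threshold doubles after every warning. */
static unsigned int
warnings_logged(unsigned long long elapsed_ms)
{
    unsigned long long threshold_ms = 1000;
    unsigned int count = 0;

    while (elapsed_ms >= threshold_ms) {
        count++;
        threshold_ms *= 2;
        assert(threshold_ms > 0);   /* Guard against wrap-around to zero. */
    }
    return count;
}
```

Under that assumption, a waiter blocked for 131072000 ms (roughly 36
hours) would have logged 18 warnings. That agrees with the 00018 field
on the last urcu2 line, if that field is read as a per-thread message
counter.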
>> 
>> Could it be in connection with a deletion of a netdev port?
>> 
>> Regards,
>> 
>> Patrik
>> 
>> 
>> On 01/25/2016 07:50 PM, Ben Pfaff wrote:
>> >On Mon, Jan 25, 2016 at 03:09:09PM +0100, Patrik Andersson R wrote:
>> >>During robustness testing, where VMs are booted and deleted using
>> >>nova boot/delete in rapid succession, VMs get stuck in the
>> >>spawning state after a few test cycles. Presumably this is because
>> >>OVS no longer responds to port additions and deletions, or
>> >>rather because responses to these requests become painfully slow.
>> >>Other requests to vswitchd also fail to complete in any reasonable
>> >>time frame; ovs-appctl vlog/set is one example.
>> >>
>> >>The only conclusion I can draw at the moment is that some thread
>> >>(I've observed main and dpdk_watchdog3) is blocking the
>> >>ovsrcu_synchronize() operation indefinitely and there is no
>> >>fall-back to get out of this. To recover, the minimum operation
>> >>seems to be a restart of the openvswitch-switch service, but that
>> >>seems to cause other issues longer term.
>> >>
>> >>In the vswitch log when this happens the following can be observed:
>> >>
>> >>2016-01-24T20:36:14.601Z|02742|ovs_rcu(vhost_thread2)|WARN|blocked 1000 ms waiting for dpdk_watchdog3 to quiesce
>> >This looks like a bug somewhere in the DPDK code.  The watchdog code
>> >is really simple:
>> >
>> >     static void *
>> >     dpdk_watchdog(void *dummy OVS_UNUSED)
>> >     {
>> >         struct netdev_dpdk *dev;
>> >
>> >         pthread_detach(pthread_self());
>> >
>> >         for (;;) {
>> >             ovs_mutex_lock(&dpdk_mutex);
>> >             LIST_FOR_EACH (dev, list_node, &dpdk_list) {
>> >                 ovs_mutex_lock(&dev->mutex);
>> >                 check_link_status(dev);
>> >                 ovs_mutex_unlock(&dev->mutex);
>> >             }
>> >             ovs_mutex_unlock(&dpdk_mutex);
>> >             xsleep(DPDK_PORT_WATCHDOG_INTERVAL);
>> >         }
>> >
>> >         return NULL;
>> >     }
>> >
>> >Although it looks at first glance like it doesn't quiesce, xsleep()
>> >does that internally, so I guess check_link_status() must be hanging.
>> 
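
The point that xsleep() quiesces internally can be illustrated with a
small model of quiescent-state tracking. This is a simplified sketch,
not the OVS implementation: struct model_thread and the model_*
functions are hypothetical names, and a plain flag stands in for
whatever bookkeeping the real code does while a thread sleeps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified single-process model; these are not the OVS data structures. */
struct model_thread {
    const char *name;
    uint64_t seqno;     /* Global sequence number last observed by the thread. */
    bool sleeping;      /* True while parked in a sleep that counts as quiescent. */
};

static uint64_t model_global_seqno = 1;

/* The thread announces that it holds no RCU-protected pointers. */
static void
model_quiesce(struct model_thread *t)
{
    t->seqno = ++model_global_seqno;
}

/* A sleep that quiesces for its whole duration, as xsleep() is said to do. */
static void
model_sleep_begin(struct model_thread *t)
{
    t->sleeping = true;
}

/* Waking up ends the quiescent period and records the current sequence. */
static void
model_sleep_end(struct model_thread *t)
{
    t->sleeping = false;
    model_quiesce(t);
}

/* Returns the name of the first thread that still blocks a grace period
 * started at 'target', or NULL if the grace period is over. */
static const char *
model_first_blocker(const struct model_thread *threads, size_t n,
                    uint64_t target)
{
    assert(threads != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!threads[i].sleeping && threads[i].seqno <= target) {
            return threads[i].name;
        }
    }
    return NULL;
}
```

In this model, a watchdog thread that hangs before it reaches
model_sleep_begin() keeps being returned by model_first_blocker(),
which corresponds to the "waiting for dpdk_watchdog3 to quiesce"
warning quoted above.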