[ovs-discuss] dpdk watchdog stuck?

Traynor, Kevin kevin.traynor at intel.com
Fri Feb 5 17:09:26 UTC 2016


> -----Original Message-----
> From: discuss [mailto:discuss-bounces at openvswitch.org] On Behalf Of Patrik
> Andersson R
> Sent: Friday, February 5, 2016 12:27 PM
> To: Daniele Di Proietto; Iezzi, Federico; Ben Pfaff
> Cc: discuss at openvswitch.org
> Subject: Re: [ovs-discuss] dpdk watchdog stuck?
> 
> Hi,
> 
> I applied the patch suggested:
> 
> http://openvswitch.org/pipermail/dev/2016-January/065073.html
> 
> While I can appreciate the issue it addresses, it did not help with our
> problem.
> 
> What happens in our case is the dpdk_watchdog thread is blocked
> from going into quiescent state by the dpdk_mutex. This mutex is held
> in the destroy_device call from a vhost_thread, effectively giving us a
> kind of deadlock.
> 
> The ovsrcu_synchronize() will then wait indefinitely for the blocked
> thread to quiesce.
> 
> One solution that we came up with, which resolved the
> deadlock, is shown below.
> 
> I'm wondering if we missed any important aspect of the RCU solution,
> since we identified that the dpdk_mutex should not be held across the
> ovsrcu_synchronize() call.
> 
> 
>   @@ -1759,21 +1759,23 @@ destroy_device(volatile struct virtio_net *dev)
>                           dev->flags &= ~VIRTIO_DEV_RUNNING;
>                           ovsrcu_set(&vhost_dev->virtio_dev, NULL);
> +                  }
> +             }
> +
> +             ovs_mutex_unlock(&dpdk_mutex);
> 
>                  /*
>                   * Wait for other threads to quiesce before
>                   * setting the virtio_dev to NULL.
>                   */
>                  ovsrcu_synchronize();
>                  /*
>                   * As call to ovsrcu_synchronize() will end the quiescent state,
>                   * put thread back into quiescent state before returning.
>                   */
>                  ovsrcu_quiesce_start();
> -           }
> -      }
> -
> -      ovs_mutex_unlock(&dpdk_mutex);
> 
>         VLOG_INFO("vHost Device '%s' (%ld) has been removed",
>                           dev->ifname, dev->device_fh);
>    }
> 
> 
> Any thoughts on this will be appreciated.

Hi Patrik, that seems reasonable to me. ovsrcu_synchronize()/ovsrcu_quiesce_start()
would not need to be called if we don't set a virtio_dev to NULL, and as an aside,
we don't need to keep looking through the list once we've found the device.
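
To make those two points concrete, here is a minimal standalone sketch in C.
It is not the OVS code: toy_dev, toy_list_mutex, toy_synchronize() and
toy_destroy_device() are hypothetical stand-ins, and toy_synchronize() only
marks the place where a blocking call such as ovsrcu_synchronize() would go.
It also assumes device ids are unique, so the search can stop at the first
match.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are OVS or DPDK definitions. */
struct toy_dev {
    int id;                 /* plays the role of the device handle */
    void *virtio_dev;       /* non-NULL while the device is attached */
    struct toy_dev *next;
};

static pthread_mutex_t toy_list_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct toy_dev *toy_list;
static int toy_sync_calls;  /* counts how often the slow wait ran */

/* Placeholder for a blocking grace-period wait.  It must be entered
 * without toy_list_mutex held; the trylock checks exactly that. */
static void
toy_synchronize(void)
{
    int err = pthread_mutex_trylock(&toy_list_mutex);

    assert(err == 0);
    pthread_mutex_unlock(&toy_list_mutex);
    toy_sync_calls++;
}

/* Detach the device with the given id.  Returns true if it was found. */
static bool
toy_destroy_device(int id)
{
    struct toy_dev *dev;
    bool found = false;

    pthread_mutex_lock(&toy_list_mutex);
    for (dev = toy_list; dev != NULL; dev = dev->next) {
        if (dev->id == id) {
            dev->virtio_dev = NULL;
            found = true;
            break;          /* assumed unique ids: stop searching */
        }
    }
    pthread_mutex_unlock(&toy_list_mutex);

    if (found) {
        /* Only wait when a pointer was actually cleared, and only
         * after the list mutex has been released. */
        toy_synchronize();
    }
    return found;
}
```

Calling toy_destroy_device() with an id that is not on the list returns false
and never reaches the wait, while a match clears the pointer, drops the mutex,
and only then waits.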

I've posted a modified version of your code snippet; let me know what you think.
http://openvswitch.org/pipermail/dev/2016-February/065740.html

Kevin.

> 
> Regards,
> 
> Patrik
> 
> 
> On 01/27/2016 08:26 AM, Patrik Andersson R wrote:
> > Hi,
> >
> > thank you for the link to the patch. Will try that out when I get a
> > chance.
> >
> > I don't yet have a back-trace for the instance when the tracing
> > indicates the "main" thread; it does not happen that often.
> >
> > For the "watchdog3" issue though, we seem to be waiting to acquire a
> > mutex:
> >
> > #0  (LWP 8377) "dpdk_watchdog3" in __lll_lock_wait ()
> > #1  in _L_lock_909
> > #2  in __GI___pthread_mutex_lock (mutex=0x955680)
> > ...
> >
> > The thread that currently holds the mutex is
> >
> >  Thread  (LWP 8378) "vhost_thread2"  in poll ()
> >
> >
> > Mutex data:  p *(pthread_mutex_t*)0x955680
> >
> >               __lock = 2,
> >               __count = 0,
> >               __owner = 8378,
> >               __nusers = 1,
> >               __kind = 2,
> >               __spins = 0,
> >               __elision = 0,
> >               ...
> >
> >
> > Any ideas on this will be appreciated.
> >
> > Regards,
> >
> > Patrik
> >
> > On 01/27/2016 06:18 AM, Daniele Di Proietto wrote:
> >> Hi,
> >>
> >> Ben, turns out I was wrong, this appears to be a genuine bug in
> >> dpif-netdev.
> >>
> >> I sent a fix that I believe might be related to the bug observed
> >> here:
> >>
> >> http://openvswitch.org/pipermail/dev/2016-January/065073.html
> >>
> >> Otherwise it would be interesting to get a backtrace of the main thread
> >> from gdb to investigate further.
> >>
> >> Thanks,
> >>
> >> Daniele
> >>
> >> On 26/01/2016 00:23, "Iezzi, Federico" <federico.iezzi at hpe.com> wrote:
> >>
> >>> Hi there,
> >>>
> >>> I have the same issue with OVS 2.4 (latest commit on branch 2.4) and
> >>> DPDK 2.0.0 in a Debian 8 environment.
> >>> After a while it just gets stuck.
> >>>
> >>> Regards,
> >>> Federico
> >>>
> >>> -----Original Message-----
> >>> From: discuss [mailto:discuss-bounces at openvswitch.org] On Behalf Of Ben
> >>> Pfaff
> >>> Sent: Tuesday, January 26, 2016 7:13 AM
> >>> To: Daniele di Proietto <diproiettod at vmware.com>
> >>> Cc: discuss at openvswitch.org
> >>> Subject: Re: [ovs-discuss] dpdk watchdog stuck?
> >>>
> >>> Daniele, I think that you said in our meeting today that there was some
> >>> sort of bug that falsely blames a thread.  Can you explain further?
> >>>
> >>> On Mon, Jan 25, 2016 at 09:29:52PM +0100, Patrik Andersson R wrote:
> >>>> Right, that is likely for sure. Will look there first.
> >>>>
> >>>> What do you think of the case where the thread is "main"? I've got
> >>>> examples of this one as well. I have not been able to figure out so far
> >>>> what would cause this.
> >>>>
> >>>> ...
> >>>> ovs-vswitchd.log.1.1.1.1:2016-01-23T01:47:19.026Z|00016|ovs_rcu(urcu2)|WARN|blocked 32768000 ms waiting for main to quiesce
> >>>> ovs-vswitchd.log.1.1.1.1:2016-01-23T10:53:27.026Z|00017|ovs_rcu(urcu2)|WARN|blocked 65536000 ms waiting for main to quiesce
> >>>> ovs-vswitchd.log.1.1.1.1:2016-01-24T05:05:43.026Z|00018|ovs_rcu(urcu2)|WARN|blocked 131072000 ms waiting for main to quiesce
> >>>> ovs-vswitchd.log.1.1.1.1:2016-01-24T18:24:40.826Z|00001|ovs_rcu(urcu1)|WARN|blocked 1092 ms waiting for main to quiesce
> >>>> ovs-vswitchd.log.1.1.1.1:2016-01-24T18:24:41.805Z|00002|ovs_rcu(urcu1)|WARN|blocked 2072 ms waiting for main to quiesce
> >>>> ...
> >>>>
> >>>> Could it be in connection with a deletion of a netdev port?
> >>>>
> >>>> Regards,
> >>>>
> >>>> Patrik
> >>>>
> >>>>
> >>>> On 01/25/2016 07:50 PM, Ben Pfaff wrote:
> >>>>> On Mon, Jan 25, 2016 at 03:09:09PM +0100, Patrik Andersson R wrote:
> >>>>>> during robustness testing, where VMs are booted and deleted using
> >>>>>> nova boot/delete in rather rapid succession, VMs get stuck in
> >>>>>> spawning state after a few test cycles. Presumably this is due to
> >>>>>> the OVS not responding to port additions and deletions anymore, or
> >>>>>> rather that responses to these requests become painfully slow. Other
> >>>>>> requests towards the vswitchd fail to complete in any reasonable
> >>>>>> time frame as well, ovs-appctl vlog/set is one example.
> >>>>>>
> >>>>>> The only conclusion I can draw at the moment is that some thread
> >>>>>> (I've observed main and dpdk_watchdog3) is blocking the
> >>>>>> ovsrcu_synchronize() operation for "infinite" time and there is no
> >>>>>> fall-back to get out of this. To
> >>>>>> recover, the minimum operation seems to be a service restart of the
> >>>>>> openvswitch-switch service but that seems to cause other issues
> >>>>>> longer term.
> >>>>>>
> >>>>>> In the vswitch log when this happens the following can be observed:
> >>>>>>
> >>>>>> 2016-01-24T20:36:14.601Z|02742|ovs_rcu(vhost_thread2)|WARN|blocked 1000 ms waiting for dpdk_watchdog3 to quiesce
> >>>>> This looks like a bug somewhere in the DPDK code.  The watchdog code
> >>>>> is really simple:
> >>>>>
> >>>>>      static void *
> >>>>>      dpdk_watchdog(void *dummy OVS_UNUSED)
> >>>>>      {
> >>>>>          struct netdev_dpdk *dev;
> >>>>>
> >>>>>          pthread_detach(pthread_self());
> >>>>>
> >>>>>          for (;;) {
> >>>>>              ovs_mutex_lock(&dpdk_mutex);
> >>>>>              LIST_FOR_EACH (dev, list_node, &dpdk_list) {
> >>>>>                  ovs_mutex_lock(&dev->mutex);
> >>>>>                  check_link_status(dev);
> >>>>>                  ovs_mutex_unlock(&dev->mutex);
> >>>>>              }
> >>>>>              ovs_mutex_unlock(&dpdk_mutex);
> >>>>>              xsleep(DPDK_PORT_WATCHDOG_INTERVAL);
> >>>>>          }
> >>>>>
> >>>>>          return NULL;
> >>>>>      }
> >>>>>
> >>>>> Although it looks at first glance like it doesn't quiesce, xsleep()
> >>>>> does that internally, so I guess check_link_status() must be hanging.
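
A toy model may help show why a thread that is stuck on a mutex ends up named
in the "waiting for ... to quiesce" warning. This is a single-threaded sketch
with hypothetical names (toy_thread, toy_quiesce(), toy_first_blocker()), not
the ovs-rcu implementation. Each thread acknowledges the current epoch when it
quiesces, which per the note above is what xsleep() takes care of for the
watchdog, and a grace period can only end once every registered thread has
acknowledged it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the ovs-rcu implementation. */
#define TOY_MAX_THREADS 8

struct toy_thread {
    const char *name;
    unsigned long seen;     /* last epoch this thread acknowledged */
};

static unsigned long toy_epoch = 1;
static struct toy_thread toy_threads[TOY_MAX_THREADS];
static size_t toy_n_threads;

/* Register a thread; it starts out having acknowledged the current epoch. */
static struct toy_thread *
toy_register(const char *name)
{
    struct toy_thread *t;

    assert(toy_n_threads < TOY_MAX_THREADS);
    t = &toy_threads[toy_n_threads++];
    t->name = name;
    t->seen = toy_epoch;
    return t;
}

/* Called by a thread at a point where it holds no protected pointers,
 * for example just before it goes to sleep. */
static void
toy_quiesce(struct toy_thread *t)
{
    t->seen = toy_epoch;
}

/* Begin a grace period and return the epoch every thread must reach. */
static unsigned long
toy_start_grace_period(void)
{
    return ++toy_epoch;
}

/* Name of the first thread that has not acknowledged 'target' yet,
 * or NULL once the grace period may end. */
static const char *
toy_first_blocker(unsigned long target)
{
    size_t i;

    for (i = 0; i < toy_n_threads; i++) {
        if (toy_threads[i].seen < target) {
            return toy_threads[i].name;
        }
    }
    return NULL;
}
```

In this model a thread that blocks on a mutex before it reaches its sleep
never calls toy_quiesce(), so toy_first_blocker() keeps returning its name no
matter how long the waiter polls.
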
> >>> _______________________________________________
> >>> discuss mailing list
> >>> discuss at openvswitch.org
> >>> http://openvswitch.org/mailman/listinfo/discuss

