[ovs-discuss] dpdk watchdog stuck? (was: ovsrcu_synchronize() blocking while indefinitely waiting for thread to quiesce)
Ben Pfaff
blp at ovn.org
Mon Jan 25 18:50:38 UTC 2016
On Mon, Jan 25, 2016 at 03:09:09PM +0100, Patrik Andersson R wrote:
> during robustness testing, where VMs are booted and deleted using nova
> boot/delete in rather rapid succession, VMs get stuck in the spawning
> state after a few test cycles. Presumably this is because OVS no longer
> responds to port additions and deletions, or rather that responses to
> these requests become painfully slow. Other requests towards vswitchd
> fail to complete in any reasonable time frame as well; ovs-appctl
> vlog/set is one example.
>
> The only conclusion I can draw at the moment is that some thread (I've
> observed main and dpdk_watchdog3) is blocking the ovsrcu_synchronize()
> operation for "infinite" time and there is no fall-back to get out of
> this. To recover, the minimum operation seems to be a service restart
> of the openvswitch-switch service, but that seems to cause other issues
> longer term.
>
> In the vswitch log when this happens the following can be observed:
>
> 2016-01-24T20:36:14.601Z|02742|ovs_rcu(vhost_thread2)|WARN|blocked 1000 ms
> waiting for dpdk_watchdog3 to quiesce
This looks like a bug somewhere in the DPDK code. The watchdog code is
really simple:
static void *
dpdk_watchdog(void *dummy OVS_UNUSED)
{
    struct netdev_dpdk *dev;

    pthread_detach(pthread_self());

    for (;;) {
        ovs_mutex_lock(&dpdk_mutex);
        LIST_FOR_EACH (dev, list_node, &dpdk_list) {
            ovs_mutex_lock(&dev->mutex);
            check_link_status(dev);
            ovs_mutex_unlock(&dev->mutex);
        }
        ovs_mutex_unlock(&dpdk_mutex);
        xsleep(DPDK_PORT_WATCHDOG_INTERVAL);
    }

    return NULL;
}
Although it looks at first glance like it doesn't quiesce, xsleep() does
that internally, so I guess check_link_status() must be hanging.