[ovs-discuss] ovs-vswitchd get stucked on ovs_rcu revalidator quiesce

Zdenek Janda zdenek.janda at cloudevelops.com
Wed Nov 27 17:02:34 UTC 2019


Hi,
we have new lead at
https://github.com/openvswitch/ovs-issues/issues/175, which is pointing
to bug in glibc: https://sourceware.org/bugzilla/show_bug.cgi?id=23844
I will see on testing patch in our environment.
Best regards
Zdenek

On 27. 11. 19 17:32, Zdenek Janda wrote:
> Hi,
> issue is also on Github at
> https://github.com/openvswitch/ovs-issues/issues/175, added my comments.
> Now will need help from somebody familiar with code - Openvswitch is
> definitively not stable on Ubuntu Bionic and we can reproduce bug
> easily, so any patching can be tested but need to figure out where to
> begin - so far no clue why the thread is getting stucked and why it
> unstucks after gcore dump.
> Best regards
> Zdenek Janda
> 
> On 27. 11. 19 12:07, Zdenek Janda wrote:
>> Hi,
>> it took some time to catch this, now we have detection check which
>> recognizes stucked ovs and automaticaly executes gdb debug. For now, i
>> have included "info threads" and "thread apply all bt full", attached is
>> two files with data i got from two occurrences stucked instances of
>> ovs-vswitchd.
>> Let me know, how to proceed next - the issue manifestates itself several
>> times a day across large number of openstack compute nodes we have, so
>> we have good chance to catch and test fix.
>> Best regards
>> Zdenek Janda
>>
>>
>> On 23. 10. 19 18:49, Ben Pfaff wrote:
>>> It looks like every thread is blocking in poll() except for
>>> revalidator525.  Can you give me a full backtrace for that thread?  It's
>>> unusual to see a futex in an OVS backtrace and it's not obvious to me
>>> where it's coming from.
>>>
>>> On Wed, Oct 23, 2019 at 09:56:54AM +0200, Zdenek Janda wrote:
>>>> Hello,
>>>> debugged on next occurence. Here is info on stucked revalidator525:
>>>>
>>>> Id   Target Id         Frame
>>>> * 1    Thread 0x7fde892b4e00 (LWP 1499) "ovs-vswitchd"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x55a40e6d75e0, nfds=2,
>>>> timeout=76638) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   2    Thread 0x7fde855a1700 (LWP 2131) "urcu4" 0x00007fde879f9bf9 in
>>>> __GI___poll (fds=0x7fde40004d10, nfds=2, timeout=75817) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   3    Thread 0x7fddd1ffb700 (LWP 4596) "handler482" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fddb86440e0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   4    Thread 0x7fddd27fc700 (LWP 4597) "handler483" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fddd84e11e0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   5    Thread 0x7fddd2ffd700 (LWP 4598) "handler484" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fddec3062d0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   6    Thread 0x7fddd37fe700 (LWP 4599) "handler485" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde341c3f80, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   7    Thread 0x7fde86da4700 (LWP 4600) "handler486" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde500e8530, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   8    Thread 0x7fde865a3700 (LWP 4601) "handler487" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fddf40c00c0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   9    Thread 0x7fde85da2700 (LWP 4602) "handler488" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fddfc06e480, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   10   Thread 0x7fde84da0700 (LWP 4603) "handler489" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde78049250, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   11   Thread 0x7fde73fff700 (LWP 4604) "handler490" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fddc8016700, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   12   Thread 0x7fde737fe700 (LWP 4605) "handler491" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fdde800cbd0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   13   Thread 0x7fde72ffd700 (LWP 4606) "handler492" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde60012fb0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   14   Thread 0x7fde727fc700 (LWP 4607) "handler493" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde24017bb0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   15   Thread 0x7fde71ffb700 (LWP 4608) "handler494" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fddd40060b0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   16   Thread 0x7fde717fa700 (LWP 4609) "handler495" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde54004ef0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   17   Thread 0x7fde70ff9700 (LWP 4610) "handler496" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde38008630, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   18   Thread 0x7fde5bfff700 (LWP 4611) "handler497" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fdda400c650, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   19   Thread 0x7fde5b7fe700 (LWP 4612) "handler498" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde14015730, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   20   Thread 0x7fde5affd700 (LWP 4614) "handler499" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde28016fc0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   21   Thread 0x7fde5a7fc700 (LWP 4615) "handler500" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fddf0007570, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   22   Thread 0x7fde59ffb700 (LWP 4616) "handler501" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde04012990, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   23   Thread 0x7fde597fa700 (LWP 4617) "handler502" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fddc0019910, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   24   Thread 0x7fde58ff9700 (LWP 4618) "handler503" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde3c005810, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   25   Thread 0x7fde4bfff700 (LWP 4619) "handler504" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde08005770, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   26   Thread 0x7fde4b7fe700 (LWP 4620) "handler505" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde2c0043e0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   27   Thread 0x7fde4affd700 (LWP 4621) "handler506" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde7c01c4f0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   28   Thread 0x7fde4a7fc700 (LWP 4622) "handler507" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fddcc014890, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   29   Thread 0x7fde49ffb700 (LWP 4623) "handler508" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fdda8004c00, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   30   Thread 0x7fde497fa700 (LWP 4624) "handler509" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde0c006450, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   31   Thread 0x7fde48ff9700 (LWP 4625) "handler510" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde64004c40, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   32   Thread 0x7fde13fff700 (LWP 4626) "handler511" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde6c000b20, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   33   Thread 0x7fde137fe700 (LWP 4627) "handler512" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fddc400b240, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   34   Thread 0x7fde12ffd700 (LWP 4628) "handler513" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde180056c0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   35   Thread 0x7fde127fc700 (LWP 4629) "handler514" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fdde0005ea0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   36   Thread 0x7fde11ffb700 (LWP 4630) "handler515" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde00007af0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   37   Thread 0x7fde117fa700 (LWP 4631) "handler516" 0x00007fde879f9bf9
>>>> in __GI___poll (fds=0x7fde6800b2d0, nfds=3, timeout=2147483647) at
>>>> ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   38   Thread 0x7fde10ff9700 (LWP 4632) "revalidator517"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fde5c0a0960, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   39   Thread 0x7fddfbfff700 (LWP 4633) "revalidator518"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fde4c00bbc0, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   40   Thread 0x7fddfb7fe700 (LWP 4634) "revalidator519"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fddb0036600, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   41   Thread 0x7fddfaffd700 (LWP 4635) "revalidator520"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fdddc06fef0, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   42   Thread 0x7fddfa7fc700 (LWP 4636) "revalidator521"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fde800016a0, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   43   Thread 0x7fddf9ffb700 (LWP 4637) "revalidator522"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fddac07a390, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   44   Thread 0x7fddf97fa700 (LWP 4638) "revalidator523"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fde40043230, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   45   Thread 0x7fddf8ff9700 (LWP 4639) "revalidator524"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fdde4060d20, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   46   Thread 0x7fddd3fff700 (LWP 4640) "revalidator525"
>>>> 0x00007fde8853237b in futex_abstimed_wait (private=<optimized out>,
>>>> abstime=0x0, expected=10, futex_word=0x55a40e4afa78)
>>>>     at ../sysdeps/unix/sysv/linux/futex-internal.h:172
>>>>   47   Thread 0x7fddd17fa700 (LWP 4641) "revalidator526"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fde2004bd10, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   48   Thread 0x7fddd0ff9700 (LWP 4642) "revalidator527"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fddbc07be70, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   49   Thread 0x7fddb7fff700 (LWP 4643) "revalidator528"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fde1c04bd20, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>   50   Thread 0x7fddb77fe700 (LWP 4644) "revalidator529"
>>>> 0x00007fde879f9bf9 in __GI___poll (fds=0x7fde30027a50, nfds=2,
>>>> timeout=2147483647) at ../sysdeps/unix/sysv/linux/poll.c:29
>>>>
>>>>
>>>>
>>>> So here is the thread:
>>>>  46   Thread 0x7fddd3fff700 (LWP 4640) "revalidator525"
>>>> 0x00007fde8853237b in futex_abstimed_wait (private=<optimized out>,
>>>> abstime=0x0, expected=10, futex_word=0x55a40e4afa78)
>>>>     at ../sysdeps/unix/sysv/linux/futex-internal.h:172
>>>>
>>>> Any ideas what to do next ?
>>>>
>>>> Thanks
>>>> Zdenek
>>>>
>>>> On 14. 10. 19 20:15, Ben Pfaff wrote:
>>>>> On Mon, Oct 14, 2019 at 10:11:35AM +0200, Zdenek Janda wrote:
>>>>>> Hello,
>>>>>> we have really strange issue. Ovs-vswitchd  get stucked on ovs_rcu
>>>>>> revalidator quiesce, reporting following to the log:
>>>>>>
>>>>>> 2019-10-14T05:11:18.049Z|00416|ovs_rcu|WARN|blocked 1000 ms waiting for
>>>>>> revalidator518 to quiesce
>>>>>> 2019-10-14T05:11:19.049Z|00417|ovs_rcu|WARN|blocked 2000 ms waiting for
>>>>>> revalidator518 to quiesce
>>>>>> 2019-10-14T05:11:21.049Z|00418|ovs_rcu|WARN|blocked 4000 ms waiting for
>>>>> ...
>>>>>> We have tried to debug this issue, and noticed while running strace or
>>>>>> producing core dump will magically unstuck ovs-vswitchd and it starts
>>>>>> working immediatelly:
>>>>>>
>>>>>> gcore `pidof ovs-vswitchd`
>>>>>
>>>>> The key question is, what is revalidator518 doing when it is failed to
>>>>> quiesce?  If you attach with GDB and obtain a backtrace from the stuck
>>>>> thread (and possibly from the other threads), it would probably provide
>>>>> a clue.
>>>>>
>>>>> Do you have any idea about how busy OVS is when this happens?  The
>>>>> revalidator threads should not take multiple seconds to do their work,
>>>>> but if it happens anyway, it would probably happen when OVS was very
>>>>> busy.
>>>>>
>>>>
>>
> 



More information about the discuss mailing list