[ovs-discuss] an inexplicable deadlock on ovs mac-learning-table

Huanle Han hanxueluo at gmail.com
Thu Oct 19 18:40:32 UTC 2017


Hi, all

I hit a deadlock on the OVS MAC learning table several days ago. I can't
figure out why, and I have almost run out of guesses.

Some background:
1. My OVS code is based on ovs-2.5.0 with a small additional change. The
modification does not touch threading or MAC learning.
2. The same code and the same binary have run on hundreds of Ubuntu 12.04
machines for more than a year. This is the first time the deadlock has occurred.
3. No OpenFlow connection or ovsdb change was happening at the moment
ovs-vswitchd deadlocked.
4. The code is built with the gcc arguments "-g -O0" (no optimization) and libc 2.15.

My investigation:
1. According to the output of gdb "thread apply all bt" (see attachment),
almost all threads are blocked on "ml->rwlock", waiting for the write lock.
   Example:
    Thread 14
    #0  0x00007f55c188953d in pthread_rwlock_wrlock
    #1  0x0000000000513966 in ovs_rwlock_wrlock_at
    #2  0x0000000000451e68 in update_learning_table
    #3  0x0000000000452f11 in xlate_normal
    #4  0x00000000004578cd in xlate_output_action
    #5  0x0000000000458aae in do_xlate_actions
    #6  0x000000000045b32d in xlate_actions
    #7  0x0000000000447f02 in upcall_xlate
    #8  0x0000000000448611 in process_upcall
    #9  0x0000000000447439 in recv_upcalls
    #10 0x000000000044704b in udpif_upcall_handler
    #11 0x00000000005146b7 in ovsthread_wrapper
    #12 0x00007f55c1885e9a in start_thread
    #13 0x00007f55c10af36d in clone
    #14 0x0000000000000000 in ??
    Thread 13
    #0  0x00007f55c188953d in pthread_rwlock_wrlock
    #1  0x0000000000513966 in ovs_rwlock_wrlock_at
    #2  0x0000000000451e68 in update_learning_table
    #3  0x000000000045ba1c in xlate_cache_normal
    #4  0x000000000045bc1f in xlate_push_stats
    #5  0x0000000000449dff in revalidate_ukey
    #6  0x000000000044ad27 in revalidate
    #7  0x00000000004476d0 in udpif_revalidator
    #8  0x00000000005146b7 in ovsthread_wrapper
    #9  0x00007f55c1885e9a in start_thread
    #10 0x00007f55c10af36d in clone
    #11 0x0000000000000000 in ??

2. According to the rwlock contents, the lock is held by one reader
while 38 writers are waiting.
   2.1 However, no reader thread can be found among all the threads.
   2.2 Did some reader thread forget to unlock? I went through all the code
that touches ml->rwlock and found that every "lock" is paired with an "unlock".
    (gdb) p *l_
    $1 = {lock = {__data = {__lock = 0
    __nr_readers = 1            // 1 reader is reading?
    __readers_wakeup = 32609
    __writer_wakeup = 36688
    __nr_readers_queued = 0
    __nr_writers_queued = 38    // 38 writers are waiting.
    __writer = 0
    __shared = 0
    __pad1 = 0
    __pad2 = 0
    __flags = 0}
    __size = "\000\000\000\000\001\000\000\000a\177\000\000P\217\000\000\000\000\000\000&" '\000' <repeats 34 times>
    __align = 4294967296}
    where = 0x5d5f00 "<unlocked>"}
3. Was the rwlock memory corrupted by an invalid write from somewhere else? I
checked the surrounding memory and it all makes sense, so an invalid write is
very unlikely.
4. No thread exit was called (as we all know, OVS doesn't do this). The
numbers of revalidator and upcall handler threads match the global variables
n_revalidators and n_handlers.
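
As a sanity check on the dump in item 2, the raw __size bytes can be decoded by
hand to confirm that they agree with the named fields. The sketch below is only
an illustration. It assumes a little-endian x86-64 machine and that each field
is 4 bytes wide, in the order gdb printed them. The helper le32_at and the
OFF_* offsets are hypothetical names, not part of OVS or glibc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First 24 bytes of the __size dump from gdb, written out in hex.
 * 'a' = 0x61, '\177' = 0x7f, 'P' = 0x50, '\217' = 0x8f, '&' = 0x26. */
static const unsigned char rwlock_bytes[24] = {
    0x00, 0x00, 0x00, 0x00,   /* __lock */
    0x01, 0x00, 0x00, 0x00,   /* __nr_readers */
    0x61, 0x7f, 0x00, 0x00,   /* __readers_wakeup */
    0x50, 0x8f, 0x00, 0x00,   /* __writer_wakeup */
    0x00, 0x00, 0x00, 0x00,   /* __nr_readers_queued */
    0x26, 0x00, 0x00, 0x00,   /* __nr_writers_queued */
};

/* Assumed byte offsets of the fields (4 bytes each, in gdb's print order). */
enum {
    OFF_LOCK = 0,
    OFF_NR_READERS = 4,
    OFF_READERS_WAKEUP = 8,
    OFF_WRITER_WAKEUP = 12,
    OFF_NR_READERS_QUEUED = 16,
    OFF_NR_WRITERS_QUEUED = 20,
};

/* Hypothetical helper: read a little-endian 32-bit value at offset 'off'. */
static uint32_t
le32_at(const unsigned char *buf, size_t len, size_t off)
{
    assert(off + 4 <= len);
    return (uint32_t) buf[off]
           | (uint32_t) buf[off + 1] << 8
           | (uint32_t) buf[off + 2] << 16
           | (uint32_t) buf[off + 3] << 24;
}
```

Decoding the offsets this way gives __nr_readers = 1, __readers_wakeup = 32609,
__writer_wakeup = 36688 and __nr_writers_queued = 38, the same values gdb
printed. The first 8 bytes read as one 64-bit value are 1 << 32 = 4294967296,
which matches __align. So the dump is at least self-consistent.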


Has anyone else run into this situation, or does anyone have any ideas?

Best regards,
Huanle Han
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mac-learning-rwlock-deadlock.log
Type: application/octet-stream
Size: 74350 bytes
Desc: not available
URL: <http://mail.openvswitch.org/pipermail/ovs-discuss/attachments/20171020/a688f04c/attachment-0001.obj>