[ovs-dev] thundering herd wakeup of handler threads

David Ahern dsahern at gmail.com
Tue Dec 10 22:20:26 UTC 2019


On 12/10/19 3:09 PM, Jason Baron wrote:
> Hi David,
> 
> The idea is that we try and queue new work to 'idle' threads in an
> attempt to distribute a workload. Thus, once we find an 'idle' thread we
> stop waking up other threads. While we are searching the wakeup list for
> idle threads, we do queue an epoll event to the non-idle threads; this
> doesn't mean they are woken up. It just means that when they go to
> epoll_wait() to harvest events from the kernel, if the event is still
> available it will be reported. If the condition for the event is no
> longer true (because another thread consumed it), then the event
> won't be visible. So it's a way of load balancing a workload while
> also reducing the number of wakeups. It's 'exclusive' in the sense
> that it will stop after it finds the first idle thread.
> 
> We certainly can employ other wakeup strategies - there was interest
> (and patches) for a strict 'round robin' but that has not been merged
> upstream.
> 
> I would like to better understand the current use case. It sounds like
> each thread has an epoll file descriptor, and each epoll file
> descriptor is attached to the same netlink socket. But when that
> netlink socket gets a packet it causes all the threads to wake up? Are
> you sure there is just 1 netlink socket that all epoll file
> descriptors are attached to?
> 

Thanks for the response.

This is the code in question:

https://github.com/openvswitch/ovs/blob/branch-2.11/lib/dpif-netlink.c#L492

Yes, prior to finding the above code reference I had traced it to a
single socket with all handler threads (71 threads on this 96 cpu box)
on the wait queue.

The ovs kernel module is punting a packet to userspace. It generates a
netlink message and invokes netlink_unicast. This is the stack trace:

        ffffffffad09cc02 ttwu_do_wakeup+0x92 ([kernel.kallsyms])
        ffffffffad09d945 try_to_wake_up+0x1d5 ([kernel.kallsyms])
        ffffffffad257275 pollwake+0x75 ([kernel.kallsyms])
        ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
        ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
        ffffffffad289ecc ep_poll_wakeup_proc+0x1c ([kernel.kallsyms])
        ffffffffad28a4bc ep_call_nested.constprop.18+0xbc ([kernel.kallsyms])
        ffffffffad28b0f2 ep_poll_callback+0x172 ([kernel.kallsyms])
        ffffffffad0b58a4 __wake_up_common+0x74 ([kernel.kallsyms])
        ffffffffad0b59cc __wake_up_common_lock+0x7c ([kernel.kallsyms])
        ffffffffad794af9 sock_def_readable+0x39 ([kernel.kallsyms])
        ffffffffad7e846e __netlink_sendskb+0x3e ([kernel.kallsyms])
        ffffffffad7eb11a netlink_unicast+0x20a ([kernel.kallsyms])
        ffffffffc07abd44 queue_userspace_packet+0x2d4 ([kernel.kallsyms])
        ffffffffc07ac330 ovs_dp_upcall+0x50 ([kernel.kallsyms])


A probe on sock_def_readable shows it is a single socket whose wait
queue is processed. Eventually ttwu_do_wakeup is invoked 71 times (once
for each thread). In some cases I see the awakened threads immediately
running on the target_cpu while the queue walk continues to wake up the
remaining threads. Only the first thread is going to handle the packet,
so the rest of the wakeups are just noise.

On this system in just a 1-second interval I see this sequence play out
400+ times.

