[ovs-dev] thundering herd wakeup of handler threads

Matteo Croce mcroce at redhat.com
Tue Dec 10 21:20:11 UTC 2019


On Tue, Dec 10, 2019 at 10:00 PM David Ahern <dsahern at gmail.com> wrote:
>
> [ adding Jason as author of the patch that added the epoll exclusive flag ]
>
> On 12/10/19 12:37 PM, Matteo Croce wrote:
> > On Tue, Dec 10, 2019 at 8:13 PM David Ahern <dsahern at gmail.com> wrote:
> >>
> >> Hi Matteo:
> >>
> >> On a hypervisor running a 4.14.91 kernel and OVS 2.11, I am seeing a
> >> thundering herd wakeup problem. Every packet punted to userspace wakes
> >> up every one of the handler threads. On a box with 96 cpus, there are 71
> >> handler threads, which means 71 process wakeups for every packet punted.
> >>
> >> This is really easy to see: just watch the sched:sched_wakeup tracepoints.
> >> With a few extra probes:
> >>
> >> perf probe sock_def_readable sk=%di
> >> perf probe ep_poll_callback wait=%di mode=%si sync=%dx key=%cx
> >> perf probe __wake_up_common wq_head=%di mode=%si nr_exclusive=%dx
> >> wake_flags=%cx key=%r8
> >>
> >> you can see there is a single netlink socket and its wait queue contains
> >> an entry for every handler thread.
> >>
> >> This does not happen with the 2.7.3 version. Going through the commits,
> >> it appears that the change in behavior comes from this commit:
> >>
> >> commit 69c51582ff786a68fc325c1c50624715482bc460
> >> Author: Matteo Croce <mcroce at redhat.com>
> >> Date:   Tue Sep 25 10:51:05 2018 +0200
> >>
> >>     dpif-netlink: don't allocate per thread netlink sockets
> >>
> >>
> >> Is this a known problem?
> >>
> >> David
> >>
> >
> > Hi David,
> >
> > Before my patch, vswitchd created NxM sockets, where N is the number of
> > ports and M the number of active cores, because every handler thread
> > opens a netlink socket for each port.
> >
> > With my patch, a pool of N sockets is created, one per port, and all
> > the threads poll the same list with the EPOLLEXCLUSIVE flag.
> > As the name suggests, EPOLLEXCLUSIVE lets the kernel wake up only one
> > of the waiting threads.
> >
> > I'm not aware of this problem, but it goes against the intended
> > behaviour of EPOLLEXCLUSIVE.
> > The flag has existed since Linux 4.5; can you check that it's being
> > passed correctly to epoll_ctl()?
> >
>
> This is the commit that added the EPOLLEXCLUSIVE flag:
>
> commit df0108c5da561c66c333bb46bfe3c1fc65905898
> Author: Jason Baron <jbaron at akamai.com>
> Date:   Wed Jan 20 14:59:24 2016 -0800
>
>     epoll: add EPOLLEXCLUSIVE flag
>
>
> The commit message acknowledges that multiple threads can still be awakened:
>
> "The implementation walks the list of exclusive waiters, and queues an
> event to each epfd, until it finds the first waiter that has threads
> blocked on it via epoll_wait().  The idea is to search for threads which
> are idle and ready to process the wakeup events.  Thus, we queue an
> event to at least 1 epfd, but may still potentially queue an event to
> all epfds that are attached to the shared fd source."
>
> To me that means all idle handler threads are going to be awakened on
> each upcall message even though only 1 is needed to handle the message.
>
> Jason: What was the rationale behind an exclusive flag that can still
> wake up more than one waiter? In the case of OVS and vswitchd I am seeing
> all N handler threads awakened on every single event, which is a horrible
> scaling property.
>

Actually, I didn't look at that commit message, but I read the
epoll_ctl manpage, which says:

"When a wakeup event occurs and multiple epoll file descriptors
are attached to the same target file using EPOLLEXCLUSIVE, one
or more of the epoll file descriptors will receive an event
with epoll_wait(2).  The default in this scenario (when
EPOLLEXCLUSIVE is not set) is for all epoll file descriptors
to receive an event.  EPOLLEXCLUSIVE is thus useful for
avoiding thundering herd problems in certain scenarios."

I'd expect "one or more" to probably be greater than 1, but still much
lower than all.
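
For reference, the registration is conceptually just an epoll_ctl()
call with the flag set. Below is a minimal, hypothetical sketch (the
names and fd handling are made up; it is not the actual dpif-netlink
code) of how one socket fd gets added and how a handler thread waits:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/epoll.h>

    /* Add one fd to an epoll instance with EPOLLEXCLUSIVE (Linux >= 4.5)
     * so that a wakeup on the fd is meant to reach only one waiter. */
    static void add_exclusive(int epfd, int fd)
    {
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLEXCLUSIVE,
            .data.fd = fd,
        };

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
            perror("epoll_ctl(EPOLL_CTL_ADD)");
            exit(EXIT_FAILURE);
        }
    }

    /* Each handler thread blocks in a loop like this one. */
    static void handler_loop(int epfd)
    {
        struct epoll_event events[64];

        for (;;) {
            int n = epoll_wait(epfd, events, 64, -1);
            if (n < 0) {
                perror("epoll_wait");
                break;
            }
            for (int i = 0; i < n; i++) {
                /* read and process the upcall from events[i].data.fd */
            }
        }
    }

If every handler thread waits on its own epoll fd, that would also
match what you saw: one wait queue entry on the socket per thread.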

Before this patch (which unfortunately is needed to avoid -EMFILE
errors with many ports), how many sockets were woken up when an ARP
was received?
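
To put a rough number on the -EMFILE issue (the port count below is
made up): with the 71 handler threads on your box and, say, 300 ports,
the old per-thread scheme needs 71 x 300 = 21,300 netlink sockets,
while the shared pool needs only 300.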

Regards,

-- 
Matteo Croce
per aspera ad upstream


