[ovs-dev] [PATCH] dpif-netlink: don't allocate per thread netlink sockets

Ben Pfaff blp at ovn.org
Sun Sep 23 23:15:49 UTC 2018


On Sun, Sep 23, 2018 at 10:18:50PM +0000, Matteo Croce wrote:
> On Fri, Sep 21, 2018 at 8:41 PM Ben Pfaff <blp at ovn.org> wrote:
> >
> > On Wed, Sep 19, 2018 at 02:47:03PM +0200, Matteo Croce wrote:
> > > When using the kernel datapath, OVS allocates a pool of sockets to handle
> > > netlink events. The number of sockets is: ports * n-handler-threads, where
> > > n-handler-threads is user configurable and defaults to 3/4*number of cores.
> > >
> > > This because vswitchd starts n-handler-threads threads, each one with a
> > > netlink socket for every port of the switch. Every thread then, starts
> > > listening on events on its set of sockets with epoll().
> > >
> > > On setup with lot of CPUs and ports, the number of sockets easily hits
> > > the process file descriptor limit, and ovs-vswitchd will exit with -EMFILE.
> > >
> > > Change the number of allocated sockets to just one per port by moving
> > > the socket array from a per handler structure to a per datapath one,
> > > and let all the handlers share the same sockets by using EPOLLEXCLUSIVE
> > > epoll flag which avoids duplicate events, on systems that support it.
> >
> > Thanks a lot for working on this.  OVS clearly uses too many fds in some
> > situations and we need to fix it.
> >
> > I noticed that the check for EPOLLEXCLUSIVE is only a build-time check,
> > using #ifdef.  This will only be effective if the build system is where
> > OVS runs and only if the build system uses its own kernel headers for
> > the build.  Usually, in OVS, we instead try to always enable a feature
> > at build time, e.g. through something like:
> >
> >         #ifndef EPOLLEXCLUSIVE
> >         #define EPOLLEXCLUSIVE (whatever)
> >         #endif
> >
> > and then at runtime test whether the feature is actually available and
> > try to use it.
> >
> 
> Hi Ben,
> 
> Thanks for the review.
> 
> Got it, will do it in a v2
> 
> BTW, one of my tests was to run the binary compiled against the latest
> kernel header on:
> - RHEL 7.2, which has a 3.10 kernel without EPOLLEXCLUSIVE backported
> (it was backported in 7.3)
> - Devuan Jessie, which has a 3.16 kernel.
> I ran strace on vswitchd and I clearly saw this call to epoll_ctl()
> with an unknown flag:
> 
> epoll_ctl(22, EPOLL_CTL_ADD, 23, {EPOLLIN|0x10000000, {u32=0, u64=0}}) = 0
> 
> which means that the call doesn't fail and the extra flag is ignored.

OK, in that case then we don't even need to test for support at runtime,
we can just make sure it's always defined properly.

> > Is EPOLLEXCLUSIVE safe, that is, can we miss events or end up sleeping
> > before processing them?  If it is safe, then what invariants does OVS
> > need to maintain to assure them?
> >
> 
> What do you mean by safe? I've skimmed the kernel code, the flag will
> just let the kernel use add_wait_queue_exclusive() instead of
> add_wait_queue(), so the raised or eventually missed events should be
> the same, just notified to only one handler.

OK, sounds good.

> > With this commit, the "hash" argument to dpif_port_get_pid() is never
> > used.  The overall design of the function can be simplified.  That could
> > be folded into this patch or go in a later patch.
> >
> 
> Yes I noticed it, but didn't want to mess with generic code, so I
> simply used OVS_UNUSED to suppress compiler warnings.
> Such refactor can be done as a second step.

OK.

> > Does this patch maintain the same level of thread safety as before?
> >
> 
> I thinked a lot about it. Reading from a socket is thread safe, I
> highly doubt that a netlink datagram can be split and received half
> and half by two handlers.
> Also, the code which adds or removes sockets is protected with
> OVS_REQ_WRLOCK(dpif->upcall_lock) so everything should be safe as
> before.
> 
> I ran this test: on a 56 core machine, I created a bridge with 2200
> dummy interfaces and 22 veth, then started 22 mausezahn on every veth
> peer outside the bridge.
> All packets were forced to the slowpath with action=controller, and
> the whole test lasted 48 hours, no errors found so far.
> 
> As always, four eyes are better than two, if you find something
> suspicious, I'll try to sort it out.

That sounds pretty good.  Thank you.  I'll look forward to v2.

Would you mind describing the testing in the commit message?

Thanks,

Ben.


More information about the dev mailing list