[ovs-dev] [PATCH v8] dpif-netlink: distribute polling to discreet handlers

Flavio Leitner fbl at sysclose.org
Fri Sep 18 17:53:54 UTC 2020


On Fri, Sep 18, 2020 at 05:31:30PM +0100, Mark Gray wrote:
> On 17/09/2020 20:49, Aaron Conole wrote:
> > Ilya Maximets <i.maximets at ovn.org> writes:
> > 
> >> On 9/9/20 7:08 PM, Mark Gray wrote:
> >>> From: Aaron Conole <aconole at redhat.com>
> >>>
> >>> Currently, the channel handlers are polled globally.  On some
> >>> systems, this causes a thundering herd issue where multiple
> >>> handler threads become active, only to do no work and immediately
> >>> sleep.
> >>>
> >>> The approach here is to push the netlink socket channels to discreet
> >>> handler threads to process, rather than polling on every thread.
> >>> This will eliminate the need to wake multiple threads.
> >>>
> >>> To check:
> >>>
> >>>   ip netns add left
> >>>   ip netns add right
> >>>   ip link add center-left type veth peer name left0 netns left
> >>>   ip link add center-right type veth peer name right0 netns right
> >>>   ip link set center-left up
> >>>   ip link set center-right up
> >>>   ip -n left ip link set left0 up
> >>>   ip -n left ip addr add 172.31.110.10/24 dev left0
> >>>   ip -n right ip link set right0 up
> >>>   ip -n right ip addr add 172.31.110.11/24 dev right0
> >>>
> >>>   ovs-vsctl add-br br0
> >>>   ovs-vsctl add-port br0 center-right
> >>>   ovs-vsctl add-port br0 center-left
> >>>
> >>>   # in one terminal
> >>>   perf record -e sched:sched_wakeup,irq:softirq_entry -ag
> >>>
> >>>   # in a separate terminal
> >>>   ip netns exec left arping -I left0 -c 1 172.31.110.11
> >>>
> >>>   # in the perf terminal after exiting
> >>>   perf script
> >>>
> >>> Look for the number of 'handler' threads which were made active.
> >>>
> >>> Suggested-by: Ben Pfaff <blp at ovn.org>
> >>> Co-authored-by: Mark Gray <mark.d.gray at redhat.com>
> >>> Reported-by: David Ahern <dsahern at gmail.com>
> >>> Reported-at: https://mail.openvswitch.org/pipermail/ovs-dev/2019-December/365857.html
> >>> Cc: Matteo Croce <technoboy85 at gmail.com>
> >>> Cc: Flavio Leitner <fbl at sysclose.org>
> >>> Fixes: 69c51582f ("dpif-netlink: don't allocate per thread netlink sockets")
> >>> Signed-off-by: Aaron Conole <aconole at redhat.com>
> >>> Signed-off-by: Mark Gray <mark.d.gray at redhat.com>
> >>> ---
> >>>
> >>>
> >>> v2: Oops - forgot to commit my whitespace cleanups.
> >>> v3: one port latency results as per Matteo's comments
> >>>
> >>>     Stock:
> >>>       min/avg/max/mdev = 21.5/36.5/96.5/1.0 us
> >>>     With Patch:
> >>>       min/avg/max/mdev = 5.3/9.7/98.4/0.5 us 
> >>> v4: Oops - first email did not send
> >>> v5:   min/avg/max/mdev = 9.3/10.4/33.6/2.2 us
> >>> v6: Split out the AUTHORS patch and added a cover letter as
> >>>     per Ilya's suggestion.
> >>>     Fixed 0-day issues.
> >>> v7: Merged patches as per Flavio's suggestion. This is
> >>>     no longer a series. Fixed some other small issues.
> >>>
> >>>  lib/dpif-netlink.c | 330 +++++++++++++++++++++++++--------------------
> >>>  lib/id-pool.c      |  10 ++
> >>>  lib/id-pool.h      |   1 +
> >>>  3 files changed, 197 insertions(+), 144 deletions(-)
> >>>
> >>> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> >>> index 7da4fb54d..a74838506 100644
> >>> --- a/lib/dpif-netlink.c
> >>> +++ b/lib/dpif-netlink.c
> >>> @@ -37,6 +37,7 @@
> >>>  #include "dpif-provider.h"
> >>>  #include "fat-rwlock.h"
> >>>  #include "flow.h"
> >>> +#include "id-pool.h"
> >>>  #include "netdev-linux.h"
> >>>  #include "netdev-offload.h"
> >>>  #include "netdev-provider.h"
> >>> @@ -160,6 +161,9 @@ static void dpif_netlink_flow_to_dpif_flow(struct dpif_flow *,
> >>>  
> >>>  /* One of the dpif channels between the kernel and userspace. */
> >>>  struct dpif_channel {
> >>> +    struct hmap_node hmap_node;
> >>
> >> This needs a comment and all other new structure fileds too.
> >> Ex.:
> >>     struct hmap_node hmap_node; /* In struct dpif_handler's channels hmap. */
> >>
> >>> +    struct dpif_handler *handler;
> 
> Ok, updated
> 
> >>
> >> Why do we need this back-pointer?  Maybe it's better to just add 'handler'
> >> argument to dpif_channel_del() ?
> 
> I have kept this as it is required due to the rework that Flavio has
> suggested in another thread.
> 
> >>
> >>> +    uint32_t port_idx;
> >>
> >> I see a lot of 'port_idx' here and there, but it's not really an index anymore.
> >> You removed all the arrays that used these variables as index.
> >> Can we just use 'port_no' directly here?  I know that there will be conversions
> >> from odp to u32, but this should be more clear.  If conversions are the issue,
> >> we could store u32 version of 'port_no' here.
> 
> I think this is a good idea. 'port_idx' is littered around this file in
> various places and it caused me an issue it a previous revision because
> the naming convention was not consistent. I will just remove all
> references to it.
> 
> >>
> >> And, please, add a comment.
> >>
> >>>      struct nl_sock *sock;       /* Netlink socket. */
> >>>      long long int last_poll;    /* Last time this channel was polled. */
> >>>  };
> >>> @@ -183,6 +187,8 @@ struct dpif_handler {
> >>>      int n_events;                 /* Num events returned by epoll_wait(). */
> >>>      int event_offset;             /* Offset into 'epoll_events'. */
> >>>  
> >>> +    struct hmap channels;         /* map of channels for each port. */
> >>
> >> Comment style: Capital letter.
> 
> Ok
> 
> >>
> >>> +
> >>>  #ifdef _WIN32
> >>>      /* Pool of sockets. */
> >>>      struct dpif_windows_vport_sock *vport_sock_pool;
> >>> @@ -201,9 +207,8 @@ struct dpif_netlink {
> >>>      struct fat_rwlock upcall_lock;
> >>>      struct dpif_handler *handlers;
> >>>      uint32_t n_handlers;           /* Num of upcall handlers. */
> >>> -    struct dpif_channel *channels; /* Array of channels for each port. */
> >>> -    int uc_array_size;             /* Size of 'handler->channels' and */
> >>> -                                   /* 'handler->epoll_events'. */
> >>> +
> >>> +    struct id_pool *port_ids;      /* Set of added port ids */
> >>
> >> What is the reason of having this pool?  Datapath port numbers are unique
> >> except for ODPP_NONE, and all the functions you implemented actually
> >> iterates over all channels anyway to find one with requested port_idx.
> >> So, the only purpose of it is to avoid searching on incorect port number,
> >> but this should be a rare case.  Can we just remove it?
> >>
> >> OTOH, if it's necessary to have this id pool, I'd suggest to use hmapx
> >> instead as it already has all required infrastructure and doesn't actually
> >> care about range of values.
> 
> I have replaced this with a hmap of all the channels as suggested in
> another thread by Flavio.

I think Ilya is referring to id_pool, not the hmap used for the
channels.


> >>>      /* Change notification. */
> >>>      struct nl_sock *port_notifier; /* vport multicast group subscriber. */
> >>> @@ -287,6 +292,44 @@ close_nl_sock(struct nl_sock *sock)
> >>>  #endif
> >>>  }
> >>>  
> >>> +static size_t
> >>> +dpif_handler_channels_count(const struct dpif_handler *handler)
> >>> +{
> >>> +    return hmap_count(&handler->channels);
> >>> +}
> >>> +
> >>> +static struct dpif_channel *
> >>> +dpif_handler_channel_find(struct dpif_handler *handler, uint32_t idx)
> >>> +{
> >>> +    struct dpif_channel *channel;
> >>> +
> >>> +    HMAP_FOR_EACH_WITH_HASH (channel, hmap_node, hash_int(idx, 0),
> >>> +                             &handler->channels) {
> >>> +        if (channel->port_idx == idx) {
> >>> +            return channel;
> >>> +        }
> >>> +    }
> >>> +
> >>> +    return NULL;
> >>> +}
> >>> +
> >>> +static void
> >>> +dpif_channel_del(struct dpif_netlink *dpif,
> >>> +                 struct dpif_channel *channel)
> >>
> >> As suggested above, it might make sense to just pass 'handler' here and
> >> avoid having back-pointer inside the channel structure.
> 
> I have left this like this as I have reworked based on Flavio's suggestion.

Before the back-pointer we had multiple iterations over the hmap.
The back-pointer helped to avoid that, but now it seems the code
was simplified enough that the back-pointer might not be needed
at all.

fbl


More information about the dev mailing list