[ovs-dev] [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

Pravin Shelar pshelar at ovn.org
Tue Jul 31 23:12:03 UTC 2018


On Tue, Jul 31, 2018 at 12:43 PM, Matteo Croce <mcroce at redhat.com> wrote:
> On Mon, Jul 16, 2018 at 4:54 PM Matteo Croce <mcroce at redhat.com> wrote:
>>
>> On Tue, Jul 10, 2018 at 6:31 PM Pravin Shelar <pshelar at ovn.org> wrote:
>> >
>> > On Wed, Jul 4, 2018 at 7:23 AM, Matteo Croce <mcroce at redhat.com> wrote:
>> > > From: Stefano Brivio <sbrivio at redhat.com>
>> > >
>> > > Open vSwitch sends to userspace all received packets that have
>> > > no associated flow (thus doing an "upcall"). Then the userspace
>> > > program creates a new flow and determines the actions to apply
>> > > based on its configuration.
>> > >
>> > > When a single port generates a high rate of upcalls, it can
>> > > prevent other ports from dispatching their own upcalls. vswitchd
>> > > overcomes this problem by creating many netlink sockets for each
>> > > port, but it quickly exceeds any reasonable maximum number of
>> > > open files when dealing with huge numbers of ports.
>> > >
>> > > This patch queues all the upcalls into a list, ordering them in
>> > > a per-port round-robin fashion, and schedules a deferred work
>> > > item to deliver them to userspace.
>> > >
>> > > The algorithm to queue upcalls in a round-robin fashion,
>> > > provided by Stefano, is based on these two rules (see the
>> > > sketch after this list):
>> > >  - upcalls for a given port must be inserted after all the other
>> > >    occurrences of upcalls for the same port already in the queue,
>> > >    in order to avoid out-of-order upcalls for a given port
>> > >  - insertion happens once the highest upcall count for any other
>> > >    port is greater than the count for the port we're queuing to;
>> > >    if this condition never becomes true, the upcall is queued at
>> > >    the tail. This results in a per-port round-robin order.
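>> > >
>> > > A minimal sketch of those two rules over a kernel list, for
>> > > illustration only: the identifiers (upcall_entry, port_no,
>> > > upcall_queue_rr) are not the names used in the patch, and a real
>> > > implementation would keep per-port counters instead of
>> > > re-scanning the queue:
>> > >
>> > >     struct upcall_entry {
>> > >             struct list_head node;
>> > >             u32 port_no;
>> > >             struct sk_buff *skb;
>> > >     };
>> > >
>> > >     /* Upcalls queued for @port from the head of @queue up to and
>> > >      * including @last (whole queue if @last is NULL). */
>> > >     static u32 queued_for_port(struct list_head *queue, u32 port,
>> > >                                struct upcall_entry *last)
>> > >     {
>> > >             struct upcall_entry *e;
>> > >             u32 n = 0;
>> > >
>> > >             list_for_each_entry(e, queue, node) {
>> > >                     if (e->port_no == port)
>> > >                             n++;
>> > >                     if (e == last)
>> > >                             break;
>> > >             }
>> > >             return n;
>> > >     }
>> > >
>> > >     static void upcall_queue_rr(struct list_head *queue,
>> > >                                 struct upcall_entry *ue)
>> > >     {
>> > >             u32 own_total = queued_for_port(queue, ue->port_no, NULL);
>> > >             u32 own_seen = 0, max_other = 0;
>> > >             struct upcall_entry *e;
>> > >
>> > >             list_for_each_entry(e, queue, node) {
>> > >                     if (e->port_no == ue->port_no) {
>> > >                             own_seen++;
>> > >                     } else {
>> > >                             u32 c = queued_for_port(queue, e->port_no, e);
>> > >
>> > >                             if (c > max_other)
>> > >                                     max_other = c;
>> > >                     }
>> > >                     /* Rule 1: insert only after every upcall already
>> > >                      * queued for our own port has been passed.
>> > >                      * Rule 2: insert as soon as some other port has
>> > >                      * more upcalls queued up to this point than we
>> > >                      * have in total. */
>> > >                     if (own_seen == own_total && max_other > own_total) {
>> > >                             list_add(&ue->node, &e->node);
>> > >                             return;
>> > >                     }
>> > >             }
>> > >             /* Rule 2 never triggered: queue at the tail. */
>> > >             list_add_tail(&ue->node, queue);
>> > >     }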
>> > >
>> > > In order to implement a fair round-robin behaviour, a variable
>> > > queueing delay is introduced. This delay is zero if the upcall
>> > > rate is below a given threshold, and grows linearly with the
>> > > queue utilisation (i.e. the upcall rate) otherwise.
>> > >
>> > > This ensures fairness among ports under load and with few
>> > > netlink sockets.
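>> > >
>> > > As a rough illustration of the variable delay only (the threshold
>> > > and the maximum delay below are made-up values, not the ones used
>> > > in the patch):
>> > >
>> > >     static unsigned long upcall_delay(u32 queue_len, u32 queue_max)
>> > >     {
>> > >             /* Hypothetical constants: no delay below 10% queue
>> > >              * utilisation, at most 10 ms when the queue is full. */
>> > >             u32 threshold = queue_max / 10;
>> > >             unsigned long max_delay = msecs_to_jiffies(10);
>> > >
>> > >             if (queue_len <= threshold)
>> > >                     return 0;
>> > >             if (queue_len > queue_max)
>> > >                     queue_len = queue_max;
>> > >
>> > >             /* Grow linearly from 0 to max_delay with utilisation. */
>> > >             return max_delay * (queue_len - threshold) /
>> > >                    (queue_max - threshold);
>> > >     }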
>> > >
>> > Thanks for the patch.
>> > This patch adds the following overhead to upcall handling:
>> > 1. kmalloc.
>> > 2. global spin-lock.
>> > 3. context switch to a single worker thread.
>> > I think this could become a bottleneck on most multi-core systems.
>> > You have mentioned issues with the existing fairness mechanism; can
>> > you elaborate on those? I think we could improve it before
>> > implementing heavyweight fairness in upcall handling.
>>
>> Hi Pravin,
>>
>> vswitchd allocates N * P netlink sockets, where N is the number of
>> online CPU cores, and P the number of ports.
>> With some setups, this number can grow quite fast and exceed the
>> system's maximum file descriptor limit.
>> I've seen a 48 core server failing with -EMFILE when trying to create
>> more than 65535 netlink sockets needed for handling 1800+ ports.
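>> (With N = 48 and P = 1800, that is 48 * 1800 = 86,400 sockets.)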
>>
>> I made a previous attempt to reduce the sockets to one per CPU, but
>> this was discussed and rejected on ovs-dev because it would remove
>> fairness among ports[1].

Rather than reducing the number of threads down to 1, we could find a
better number of FDs per port.
How about this simple solution (sketched below):
1. Allocate (N * P) FDs as long as the total stays under the FD limit.
2. If the FD limit (-EMFILE) is hit, halve N and repeat step 1.
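
A rough userspace sketch of that back-off, just to show what I mean
(not vswitchd code; the socket type and names are for illustration):

    #include <errno.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>

    /* Open n_cpus sockets per port, halving the per-port count whenever
     * the FD limit is hit.  Returns the per-port count that fit, with
     * the open fds stored in *fds, or -1 on failure. */
    static int alloc_upcall_sockets(int n_cpus, int n_ports, int **fds)
    {
            for (int n = n_cpus; n >= 1; n /= 2) {  /* step 2: halve N */
                    int total = n * n_ports;
                    int *v = calloc(total, sizeof(*v));
                    int i, err;

                    if (!v)
                            return -1;

                    for (i = 0; i < total; i++) {
                            v[i] = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
                            if (v[i] < 0)
                                    break;
                    }
                    if (i == total) {               /* step 1 succeeded */
                            *fds = v;
                            return n;
                    }

                    /* Clean up the partial allocation before retrying. */
                    err = errno;
                    while (i-- > 0)
                            close(v[i]);
                    free(v);

                    if (err != EMFILE)              /* back off only on -EMFILE */
                            return -1;
            }
            return -1;
    }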


Thanks,
Pravin.

>> I think that the current approach of opening a huge number of sockets
>> doesn't really work (it certainly doesn't scale), and we still need some
>> queueing logic (either in kernel or user space) if we really want to
>> be sure that low-traffic ports get their upcall quota when other
>> ports are doing way more traffic.
>>
>> If you are concerned about the kmalloc or the spinlock, we can address
>> them with a kmem_cache (see the rough example below) or with two
>> copies of the list and RCU. I'll be happy to discuss the implementation
>> details, as long as we all agree that the current implementation
>> doesn't scale well and has a fairness issue.
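>>
>> The kmem_cache part, just as a rough example (names are illustrative,
>> and upcall_entry stands for whatever structure ends up holding a
>> queued upcall):
>>
>>     static struct kmem_cache *upcall_cache;
>>
>>     static int upcall_cache_init(void)
>>     {
>>             upcall_cache = kmem_cache_create("ovs_upcall",
>>                                              sizeof(struct upcall_entry),
>>                                              0, SLAB_HWCACHE_ALIGN, NULL);
>>             return upcall_cache ? 0 : -ENOMEM;
>>     }
>>
>>     /* Replaces the per-upcall kmalloc()/kfree() pair. */
>>     static struct upcall_entry *upcall_alloc(gfp_t gfp)
>>     {
>>             return kmem_cache_alloc(upcall_cache, gfp);
>>     }
>>
>>     static void upcall_free(struct upcall_entry *e)
>>     {
>>             kmem_cache_free(upcall_cache, e);
>>     }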
>>

>> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-February/344279.html
>>
>> --
>> Matteo Croce
>> per aspera ad upstream
>
> Hi all,
>
> any idea on how to solve the file descriptor limit being hit by the
> netlink sockets?
> I see this issue happen very often, and raising the FD limit to 400k
> doesn't seem like the right way to solve it.
> Any other suggestions on how to improve the patch, or on how to solve
> the problem in a different way?
>
> Regards,
>
>
>
> --
> Matteo Croce
> per aspera ad upstream

