[ovs-dev] [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

Pravin Shelar pshelar at ovn.org
Wed Aug 15 07:19:39 UTC 2018


Hi Stefano

On Tue, Aug 7, 2018 at 6:31 AM, Stefano Brivio <sbrivio at redhat.com> wrote:
> Hi Pravin,
>
> On Tue, 31 Jul 2018 16:12:03 -0700
> Pravin Shelar <pshelar at ovn.org> wrote:
>
>> Rather than reducing the number of threads down to 1, we could find a
>> better number of FDs per port.
>> How about this simple solution:
>> 1. Allocate (N * P) FDs, as long as this stays under the FD limit.
>> 2. If the FD limit (-EMFILE) is hit, halve N and repeat step 1.
>
> I still see a few quantitative issues with this approach, other than
> Ben's observation about design (which, by the way, looks entirely
> reasonable to me).
>
> We're talking about a disproportionate number of sockets in any case.
> We can have up to 2^16 vports here, with 5k vports being rather common,
> and for any reasonable value of N that somehow manages to perturb the
> distribution of upcalls per thread, we are talking about something well
> in excess of 100k sockets. I think this doesn't really scale.
>
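(For reference, a rough sketch of the N-FDs-per-port backoff I suggested
above; this is not actual dpif-netlink code, and create_port_socket() and
close_port_sockets() are hypothetical stand-ins for the real Netlink
channel setup and teardown:)

    /*
     * Sketch of the quoted proposal: try to open N FDs per port and
     * halve N whenever the per-process FD limit is hit.
     */
    #include <errno.h>

    int create_port_socket(int port);   /* hypothetical: returns an FD or -1 */
    void close_port_sockets(void);      /* hypothetical: rolls back one attempt */

    static int allocate_port_sockets(int n, int n_ports)
    {
        while (n > 0) {
            for (int port = 0; port < n_ports; port++) {
                for (int i = 0; i < n; i++) {
                    if (create_port_socket(port) < 0 && errno == EMFILE) {
                        /* Step 2: FD limit hit, roll back this attempt,
                         * halve N and retry step 1. */
                        close_port_sockets();
                        goto halve;
                    }
                }
            }
            return n;       /* success: N FDs allocated per port */
    halve:
            n /= 2;
        }
        return -1;          /* not even one FD per port fits */
    }
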
My argument is not about the proposed fairness algorithm; it is about the
cost of that fairness, and I do not see that cost addressed in any of the
follow-ups. You seem to be worried about the memory cost and the fairness
aspects; I am worried about the CPU cost of the solution.
I think the proposed solution solves the fairness issue, but it also
creates a bottleneck in upcall processing. OVS is already known for slow
upcall processing, and this patch adds even more cost to the upcall
handling. The latency of first-packet handling also goes up with this
approach.

I revisited the original patch; here is what I see in terms of added
cost to the existing upcall processing (a rough sketch of this enqueue
path follows the list):
1. One "kzalloc(sizeof(*upcall), GFP_ATOMIC)". This involves allocating
and initializing memory.
2. Copying the flow key, which is more than 1 KB (upcall->key = *key).
3. Acquiring the dp->upcalls.lock spin_lock_bh, which disables bottom-half
processing on the CPU while waiting for this global lock.
4. Iterating the list of queued upcalls; one objective is to avoid
out-of-order packets, but I do not see the point of ordering packets from
different streams.
5. Signalling the upcall thread only after an ovs_dp_upcall_delay() delay,
which adds further to the latency.
6. The upcall is then handed over to a different thread (context switch),
likely on a different CPU.
7. The upcall object is freed on the remote CPU.
8. The single lock essentially means that OVS kernel datapath upcall
processing is single-threaded, no matter how many cores the system has.
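
To make that cost concrete, here is a rough reconstruction of the enqueue
path described in points 1-5. This is not the actual patch: the kzalloc(),
the key copy, dp->upcalls.lock and ovs_dp_upcall_delay() are the items
listed above, while the structure layouts and the delayed-work call are
guesses for illustration only.

    #include <linux/slab.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/skbuff.h>
    #include <linux/types.h>
    #include <linux/workqueue.h>
    #include "flow.h"                   /* struct sw_flow_key */

    unsigned long ovs_dp_upcall_delay(void);    /* from the RFC patch,
                                                 * assumed to return jiffies */

    /* Hypothetical stand-in for the dp->upcalls member mentioned above. */
    struct dp_upcall_queue {
        spinlock_t lock;                /* one lock per datapath (point 8) */
        struct list_head list;
        struct delayed_work work;       /* dequeue worker (point 6) */
    };

    struct dp_upcall {
        struct list_head node;
        struct sw_flow_key key;         /* > 1 KB copied per packet (point 2) */
        struct sk_buff *skb;
        u32 portid;
    };

    static int queue_upcall(struct dp_upcall_queue *q, struct sk_buff *skb,
                            const struct sw_flow_key *key, u32 portid)
    {
        struct dp_upcall *upcall;

        upcall = kzalloc(sizeof(*upcall), GFP_ATOMIC);   /* point 1 */
        if (!upcall)
            return -ENOMEM;

        upcall->key = *key;                              /* point 2 */
        upcall->skb = skb;
        upcall->portid = portid;

        spin_lock_bh(&q->lock);         /* point 3: BH disabled while
                                         * contending for the global lock */
        /* The patch also walks the queued list here to preserve per-port
         * ordering (point 4); omitted in this sketch. */
        list_add_tail(&upcall->node, &q->list);
        spin_unlock_bh(&q->lock);

        /* Point 5: the dequeue worker is only kicked after a delay. */
        mod_delayed_work(system_wq, &q->work, ovs_dp_upcall_delay());
        return 0;
    }
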

I would be interested in how we are going to address these issues.

In the example you gave about the netlink FD issue on a server with 48
cores, how does this solution work when there are 5K ports, each
triggering upcalls? Can you benchmark your patch? Do you have
performance numbers for TCP_CRR with and without this patch? Please
also publish latency numbers for this patch, and turn off megaflows to
exercise the upcall handling.

I understand that fairness has a cost, but we need to find the right
balance between performance and fairness. The current fairness scheme is
a lockless algorithm without much computational overhead; did you try to
improve the current algorithm so that it uses a smaller number of ports?
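
For comparison, here is a simplified illustration of what I mean by
lockless. This is not the actual dpif-netlink code, and read_upcall() is
a hypothetical stand-in for the Netlink receive and flow setup work: each
handler thread owns one socket per vport and polls only its own set, so
no lock is shared between handlers on the upcall path.

    #include <stddef.h>
    #include <sys/epoll.h>

    #define MAX_EVENTS 64

    void read_upcall(int fd);   /* hypothetical: recv upcall + install flow */

    struct handler {
        int epoll_fd;           /* private epoll set: one Netlink FD per vport,
                                 * added with EPOLL_CTL_ADD at port creation */
    };

    static void *handler_main(void *handler_)
    {
        struct handler *h = handler_;
        struct epoll_event events[MAX_EVENTS];

        for (;;) {
            int n = epoll_wait(h->epoll_fd, events, MAX_EVENTS, -1);

            for (int i = 0; i < n; i++) {
                /* No shared queue and no lock: the kernel already steered
                 * the upcall to this handler's own socket. */
                read_upcall(events[i].data.fd);
            }
        }
        return NULL;
    }
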


> With the current value for N (3/4 * the number of threads), this can even
> get close to /proc/sys/fs/file-max on some systems, and at that point
> raising the number of allowed file descriptors for ovs-vswitchd isn't a
> solution anymore.
>
> I would instead try to address the concerns that you had about the
> original patch adding fairness in the kernel, rather than trying to
> make the issue appear less severe in ovs-vswitchd.
>
> --
> Stefano

