[ovs-dev] [PATCH v2] dpif-netlink: don't allocate per thread netlink sockets
Flavio Leitner
fbl at sysclose.org
Tue Sep 25 18:24:39 UTC 2018
On Tue, Sep 25, 2018 at 10:51:05AM +0200, Matteo Croce wrote:
> When using the kernel datapath, OVS allocates a pool of sockets to handle
> netlink events. The number of sockets is: ports * n-handler-threads, where
> n-handler-threads is user configurable and defaults to 3/4*number of cores.
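> For the 56-core, 2223-port setup used in the tests below, that works out
> to roughly 2223 * (3/4 * 56) = 2223 * 42, i.e. about 93k sockets, which is
> in the same ballpark as the ~91k GENERIC netlink sockets counted with lsof
> in the results.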
>
> This is because vswitchd starts n-handler-threads threads, each with its
> own netlink socket for every port of the switch. Each thread then starts
> listening for events on its set of sockets with epoll().
>
> On setups with lots of CPUs and ports, the number of sockets easily hits
> the process file descriptor limit, and ovs-vswitchd exits with -EMFILE.
>
> Change the number of allocated sockets to just one per port by moving
> the socket array from a per-handler structure to a per-datapath one,
> and let all the handlers share the same sockets by using the EPOLLEXCLUSIVE
> epoll flag, which avoids duplicate events, on systems that support it.
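> A minimal sketch of the sharing pattern, with illustrative names rather
> than the actual dpif-netlink code: each handler thread keeps its own epoll
> instance, every per-port netlink socket is registered in all of them with
> EPOLLEXCLUSIVE, and the kernel then wakes only one of the waiting threads
> per event:
>
>     /* Illustrative only, not the actual OVS code.  Register one shared
>      * per-port socket in a handler thread's epoll instance so that an
>      * upcall wakes a single handler instead of all of them. */
>     #include <sys/epoll.h>
>
>     static int
>     register_port_socket(int epoll_fd, int sock_fd)
>     {
>         struct epoll_event event;
>
>         event.events = EPOLLIN | EPOLLEXCLUSIVE;  /* Requires Linux >= 4.5. */
>         event.data.fd = sock_fd;
>         return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock_fd, &event);
>     }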
>
> The patch was tested on a 56-core machine running Linux 4.18 and the latest
> Open vSwitch. A bridge was created with 2000+ ports, some of them being
> veth interfaces with the peer outside the bridge. The latency of the upcall
> is measured by setting a single 'action=controller,local' OpenFlow rule to
> force all packets to the slow path and then to the local port.
> A tool[1] injects packets into the veth peers outside the bridge and measures
> the delay until the packet is captured on the local port. The rx timestamp
> is taken from the socket ancillary data in the SO_TIMESTAMPNS attribute, to
> keep the scheduler delay out of the measured time.
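> For reference, reading that timestamp boils down to enabling SO_TIMESTAMPNS
> and walking the control messages returned by recvmsg(); a minimal sketch of
> the mechanism (not the tool's actual code) looks like this:
>
>     /* Sketch of reading the kernel RX timestamp via SO_TIMESTAMPNS
>      * ancillary data; illustrative, not the measurement tool's code. */
>     #include <stdio.h>
>     #include <string.h>
>     #include <sys/socket.h>
>     #include <sys/uio.h>
>     #include <time.h>
>
>     static void
>     recv_with_rx_timestamp(int sock_fd)
>     {
>         int on = 1;
>         char data[2048];
>         char ctrl[CMSG_SPACE(sizeof(struct timespec))];
>         struct iovec iov = { .iov_base = data, .iov_len = sizeof data };
>         struct msghdr msg = {
>             .msg_iov = &iov, .msg_iovlen = 1,
>             .msg_control = ctrl, .msg_controllen = sizeof ctrl,
>         };
>         struct cmsghdr *cmsg;
>
>         /* Ask the kernel to attach a nanosecond RX timestamp to packets. */
>         setsockopt(sock_fd, SOL_SOCKET, SO_TIMESTAMPNS, &on, sizeof on);
>
>         if (recvmsg(sock_fd, &msg, 0) < 0) {
>             return;
>         }
>         for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
>             if (cmsg->cmsg_level == SOL_SOCKET
>                 && cmsg->cmsg_type == SCM_TIMESTAMPNS) {
>                 struct timespec ts;
>
>                 memcpy(&ts, CMSG_DATA(cmsg), sizeof ts);
>                 printf("rx at %lld.%09ld\n", (long long) ts.tv_sec, ts.tv_nsec);
>             }
>         }
>     }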
>
> The first test measures the average latency for an upcall generated from
> a single port: 100k packets, one per millisecond, are sent to a single
> port and the latency of each is recorded.
>
> The second test checks latency fairness among ports, i.e. whether the
> latency is equal between ports or some ports have lower priority.
> The previous test is repeated for every port, and the mean of the per-port
> average latencies and the standard deviation between those averages are
> reported.
>
> The third test measures responsiveness under load. Heavy traffic is
> sent through all ports, and latency and packet loss are measured on a
> single port.
>
> The fourth test is about fairness. Heavy traffic is injected into all
> ports but one, and latency and packet loss are measured on the single
> idle port.
>
> This is the test setup:
>
> # nproc
> 56
> # ovs-vsctl show |grep -c Port
> 2223
> # ovs-ofctl dump-flows ovs_upc_br
> cookie=0x0, duration=4.827s, table=0, n_packets=0, n_bytes=0, actions=CONTROLLER:65535,LOCAL
> # uname -a
> Linux fc28 4.18.7-200.fc28.x86_64 #1 SMP Mon Sep 10 15:44:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> And these are the results of the tests:
>
>                                        Stock OVS            Patched
> netlink sockets
> in use by vswitchd
> lsof -p $(pidof ovs-vswitchd) \
>   |grep -c GENERIC                         91187               2227
>
> Test 1
> one port latency
> min/avg/max/mdev (us)            2.7/6.6/238.7/1.8  1.6/6.8/160.6/1.7
>
> Test 2
> all ports
> avg latency/mdev (us)                    6.51/0.97          6.86/0.17
>
> Test 3
> single port latency
> under load
> avg/mdev (us)                              7.5/5.9            3.8/4.8
> packet loss                                   95 %               62 %
>
> Test 4
> idle port latency
> under load
> min/avg/max/mdev (us)            0.8/1.5/210.5/0.9  1.0/2.1/344.5/1.2
> packet loss                                   94 %                4 %
>
> CPU and RAM usage do not seem to be affected; the resource usage of an idle
> vswitchd with 2000+ ports is unchanged:
>
> # ps u $(pidof ovs-vswitchd)
> USER       PID %CPU %MEM     VSZ    RSS TTY   STAT START  TIME COMMAND
> openvsw+  5430 54.3  0.3 4263964 510968 pts/1 RLl+ 16:20  0:50 ovs-vswitchd
>
> Additionally, to check that vswitchd is thread safe with this patch, the
> following test was run for about 48 hours: on a 56-core machine, a
> bridge with the kernel datapath is filled with 2200 dummy interfaces and 22
> veth interfaces, then 22 traffic generators are run in parallel, piping
> traffic into the veth peers outside the bridge.
> To generate as many upcalls as possible, all packets were forced to the
> slow path with an OpenFlow rule like 'action=controller,local' and the
> packet size was set to 64 bytes. Also, to avoid overflowing the FDB early
> and slowing down the upcall processing, the generated MAC addresses were
> restricted to a small interval. vswitchd ran without problems for 48+ hours,
> with all the handler threads at almost 99% CPU usage, as expected.
>
> [1] https://github.com/teknoraver/network-tools/blob/master/weed.c
>
> Signed-off-by: Matteo Croce <mcroce at redhat.com>
> ---
Acked-by: Flavio Leitner <fbl at sysclose.org>