[ovs-dev] [PATCH v2] dpif-netlink: don't allocate per thread netlink sockets
Flavio Leitner
fbl at sysclose.org
Tue Sep 25 18:24:39 UTC 2018
On Tue, Sep 25, 2018 at 10:51:05AM +0200, Matteo Croce wrote:
> When using the kernel datapath, OVS allocates a pool of sockets to handle
> netlink events. The number of sockets is: ports * n-handler-threads, where
> n-handler-threads is user configurable and defaults to 3/4*number of cores.
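> For the 56-core, 2223-port setup used in the tests below, that works out
> to roughly 2223 * (3/4 * 56) = 2223 * 42, i.e. about 93k sockets, which is
> in the same ballpark as the ~91k GENERIC netlink sockets counted with lsof
> in the results.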
>
> This is because vswitchd starts n-handler-threads threads, each with its
> own netlink socket for every port of the switch. Each thread then starts
> listening for events on its set of sockets with epoll().
>
> On setups with lots of CPUs and ports, the number of sockets easily hits
> the process file descriptor limit, and ovs-vswitchd exits with -EMFILE.
>
> Change the number of allocated sockets to just one per port by moving
> the socket array from a per-handler structure to a per-datapath one,
> and let all the handlers share the same sockets by using the EPOLLEXCLUSIVE
> epoll flag, which avoids duplicate events, on systems that support it.
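> A minimal sketch of the sharing pattern, with illustrative names rather
> than the actual dpif-netlink code: each handler thread keeps its own epoll
> instance, every per-port netlink socket is registered in all of them with
> EPOLLEXCLUSIVE, and the kernel then wakes only one of the waiting threads
> per event:
>
>     /* Illustrative only, not the actual OVS code.  Register one shared
>      * per-port socket in a handler thread's epoll instance so that an
>      * upcall wakes a single handler instead of all of them. */
>     #include <sys/epoll.h>
>
>     static int
>     register_port_socket(int epoll_fd, int sock_fd)
>     {
>         struct epoll_event event;
>
>         event.events = EPOLLIN | EPOLLEXCLUSIVE;  /* Requires Linux >= 4.5. */
>         event.data.fd = sock_fd;
>         return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, sock_fd, &event);
>     }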
>
> The patch was tested on a 56-core machine running Linux 4.18 and the latest
> Open vSwitch. A bridge was created with 2000+ ports, some of them being
> veth interfaces with the peer outside the bridge. The latency of the upcall
> is measured by setting a single 'action=controller,local' OpenFlow rule to
> force all packets to the slow path and then to the local port.
> A tool[1] injects packets into the veth peers outside the bridge and measures
> the delay until the packet is captured on the local port. The rx timestamp
> is taken from the socket ancillary data in the SO_TIMESTAMPNS attribute, to
> keep the scheduler delay out of the measured time.
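> For reference, reading that timestamp boils down to enabling SO_TIMESTAMPNS
> and walking the control messages returned by recvmsg(); a minimal sketch of
> the mechanism (not the tool's actual code) looks like this:
>
>     /* Sketch of reading the kernel RX timestamp via SO_TIMESTAMPNS
>      * ancillary data; illustrative, not the measurement tool's code. */
>     #include <stdio.h>
>     #include <string.h>
>     #include <sys/socket.h>
>     #include <sys/uio.h>
>     #include <time.h>
>
>     static void
>     recv_with_rx_timestamp(int sock_fd)
>     {
>         int on = 1;
>         char data[2048];
>         char ctrl[CMSG_SPACE(sizeof(struct timespec))];
>         struct iovec iov = { .iov_base = data, .iov_len = sizeof data };
>         struct msghdr msg = {
>             .msg_iov = &iov, .msg_iovlen = 1,
>             .msg_control = ctrl, .msg_controllen = sizeof ctrl,
>         };
>         struct cmsghdr *cmsg;
>
>         /* Ask the kernel to attach a nanosecond RX timestamp to packets. */
>         setsockopt(sock_fd, SOL_SOCKET, SO_TIMESTAMPNS, &on, sizeof on);
>
>         if (recvmsg(sock_fd, &msg, 0) < 0) {
>             return;
>         }
>         for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
>             if (cmsg->cmsg_level == SOL_SOCKET
>                 && cmsg->cmsg_type == SCM_TIMESTAMPNS) {
>                 struct timespec ts;
>
>                 memcpy(&ts, CMSG_DATA(cmsg), sizeof ts);
>                 printf("rx at %lld.%09ld\n", (long long) ts.tv_sec, ts.tv_nsec);
>             }
>         }
>     }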
>
> The first test measures the average latency for an upcall generated from
> a single port: 100k packets, one per millisecond, are sent to a single
> port and the latency of each is recorded.
>
> The second test checks latency fairness among ports, i.e. whether the
> latency is equal between ports or some ports have lower priority.
> The previous test is repeated for every port, and the mean of the per-port
> average latencies and the standard deviation between those averages are
> reported.
>
> The third test measures responsiveness under load. Heavy traffic is
> sent through all ports, and latency and packet loss are measured on a
> single port.
>
> The fourth test is about fairness. Heavy traffic is injected into all
> ports but one, and latency and packet loss are measured on the single
> idle port.
>
> This is the test setup:
>
> # nproc
> 56
> # ovs-vsctl show |grep -c Port
> 2223
> # ovs-ofctl dump-flows ovs_upc_br
> cookie=0x0, duration=4.827s, table=0, n_packets=0, n_bytes=0, actions=CONTROLLER:65535,LOCAL
> # uname -a
> Linux fc28 4.18.7-200.fc28.x86_64 #1 SMP Mon Sep 10 15:44:45 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> And these are the results of the tests:
>
>                                        Stock OVS            Patched
> netlink sockets
> in use by vswitchd
> lsof -p $(pidof ovs-vswitchd) \
>   |grep -c GENERIC                         91187               2227
>
> Test 1
> one port latency
> min/avg/max/mdev (us)            2.7/6.6/238.7/1.8  1.6/6.8/160.6/1.7
>
> Test 2
> all ports
> avg latency/mdev (us)                    6.51/0.97          6.86/0.17
>
> Test 3
> single port latency
> under load
> avg/mdev (us)                              7.5/5.9            3.8/4.8
> packet loss                                   95 %               62 %
>
> Test 4
> idle port latency
> under load
> min/avg/max/mdev (us)            0.8/1.5/210.5/0.9  1.0/2.1/344.5/1.2
> packet loss                                   94 %                4 %
>
> CPU and RAM usage do not seem to be affected; the resource usage of an idle
> vswitchd with 2000+ ports is unchanged:
>
> # ps u $(pidof ovs-vswitchd)
> USER       PID %CPU %MEM     VSZ    RSS TTY   STAT START  TIME COMMAND
> openvsw+  5430 54.3  0.3 4263964 510968 pts/1 RLl+ 16:20  0:50 ovs-vswitchd
>
> Additionally, to check that vswitchd is thread safe with this patch, the
> following test was run for about 48 hours: on a 56-core machine, a
> bridge with the kernel datapath is filled with 2200 dummy interfaces and 22
> veth interfaces, then 22 traffic generators are run in parallel, piping
> traffic into the veth peers outside the bridge.
> To generate as many upcalls as possible, all packets were forced to the
> slow path with an OpenFlow rule like 'action=controller,local' and the
> packet size was set to 64 bytes. Also, to avoid overflowing the FDB early
> and slowing down the upcall processing, the generated MAC addresses were
> restricted to a small interval. vswitchd ran without problems for 48+ hours,
> with all the handler threads at almost 99% CPU usage, as expected.
>
> [1] https://github.com/teknoraver/network-tools/blob/master/weed.c
>
> Signed-off-by: Matteo Croce <mcroce at redhat.com>
> ---
Acked-by: Flavio Leitner <fbl at sysclose.org>