[ovs-dev] BUG: Application write syscall throttled over Open vSwitch + VxLAN

Jean Tourrilhes jean.tourrilhes at hpe.com
Thu Mar 5 01:13:54 UTC 2020

	Hi all,

	I'm facing a puzzling bug that I've been trying to narrow down
with modest success, and I would like some help debugging it because
I'm stuck...

	I've set up two hosts with Open vSwitch 2.12 and a VxLAN
tunnel over a 1Gb/s link between the two OVS instances, with trivial
static forwarding rules local<->vx1. OS is Debian 10 with kernel 4.19.67.
	I have an application on each host sending and receiving UDP
packets using regular socket programming (MTU 1422 or lower). The
application can send directly on the Ethernet link or through the
VxLAN tunnel, depending on which IP address is used.
	When using OVS+VxLAN, the write() syscall that sends packets
on the socket is sometimes throttled to ~300Mb/s. Even though the
socket is non-blocking, I can see the write syscall being throttled:
it takes around 35us instead of the usual 8us.
	When it's not throttled, I can easily send over 1Gb/s, and
receive around 930Mb/s using OVS+VxLAN. I have never seen such
throttling when using the direct Ethernet links.
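	For anyone trying to reproduce this, here is a minimal sketch
of the measurement, not the actual application. The function names,
the default payload size (1422 minus 28 bytes of IP and UDP headers)
and the destination are assumptions made for illustration. It times
each sendto() on a non-blocking UDP socket and converts a per-call
duration into the rate it implies: 1422-byte packets at 35us per call
give about 325Mb/s, and at 8us about 1.4Gb/s, which is consistent
with the figures above.

```python
import socket
import time


def time_udp_sends(dest, payload_len=1394, count=1000):
    """Return the duration of each sendto() call, in microseconds.

    dest is a (host, port) tuple. All names and defaults here are
    illustrative assumptions, not taken from the real application.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)  # same mode as described: non-blocking
    payload = b"\x00" * payload_len
    durations_us = []
    try:
        for _ in range(count):
            start = time.perf_counter_ns()
            try:
                sock.sendto(payload, dest)
            except OSError:
                # Includes BlockingIOError (full send buffer); the
                # time spent in the failed call is still recorded.
                pass
            durations_us.append((time.perf_counter_ns() - start) / 1000.0)
    finally:
        sock.close()
    return durations_us


def implied_rate_mbps(packet_bytes, call_us):
    """Rate in Mb/s if every call sends packet_bytes and takes call_us."""
    return packet_bytes * 8 / call_us  # bits per microsecond == Mb/s
```

	Python adds its own overhead to every call, so the absolute
numbers will be higher than 8us. The useful comparison is between the
durations obtained with the direct Ethernet address and those obtained
with the tunnel address.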
	I've been trying to figure out what impacts this bug, but it's
been frustrating as it tends to come and go randomly. Reproducibility
is not great, unfortunately. What I found:
	o Most frequent when using tg3 driver, a bit less frequent
using igb driver, rarely seen using i40e driver (10Gb/s).
	o BFD seems to be a factor. I have never seen throttling so
far without BFD enabled on the VxLAN. Probability of throttling seems
to depend on BFD frequency and number of VxLAN tunnels.

	I did some quick kernel monitoring using perf-tools. The main
difference I can see between the normal perf log and the throttled
perf log is the addition of "ovs_dp_upcall" called from
"ovs_dp_process_packet". It looks like every one of my packets
misses the kernel flow cache and is sent up to userspace for
processing. This could explain the throttling pretty well...
	I don't see anything special about my UDP traffic and the flow
rules I'm using that would cause such cache misses. And I don't know
how to debug such an issue further.
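	One way to check the cache-miss theory independently of perf
is to compare the datapath lookup counters before and after a
throttled run. The sketch below is only a starting point. It assumes
the "lookups: hit:N missed:N lost:N" line as printed by
"ovs-dpctl show", and the function names are hypothetical. It parses
two snapshots of that line and returns the fraction of lookups in
between that missed the kernel flow table.

```python
import re

# Assumed format of the counter line in "ovs-dpctl show" output.
_LOOKUPS = re.compile(r"lookups:\s*hit:(\d+)\s+missed:(\d+)\s+lost:(\d+)")


def parse_lookups(text):
    """Extract the hit/missed/lost counters from datapath stats text."""
    match = _LOOKUPS.search(text)
    if match is None:
        raise ValueError("no 'lookups:' line found")
    hit, missed, lost = (int(value) for value in match.groups())
    return {"hit": hit, "missed": missed, "lost": lost}


def miss_ratio(before, after):
    """Fraction of lookups between two snapshots that missed."""
    hits = after["hit"] - before["hit"]
    misses = after["missed"] - before["missed"]
    total = hits + misses
    return misses / total if total else 0.0
```

	A ratio close to 1.0 while the application is sending would
mean that nearly every packet goes through an upcall, and a growing
"lost" counter would mean that some upcalls are being dropped.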

	This is it. Any help would be appreciated.
	Thanks in advance !

