[ovs-discuss] OVS under load causes TCP connection stalls on the lo interface
aschultz at tpip.net
Fri Sep 19 11:40:44 UTC 2014
I have been observing strange TCP connection aborts on an OVS host lately. The TCP
connections are all localhost-only, so no external components can be blamed.
tcpdump shows that TCP ACKs were missing, and there might have been some data
corruption as well (hard to tell without a proper decoder).
The OVS instance is not configured to touch lo.
After lots of debugging I have been able to find a correlation with an OVS
instance on that host. To reproduce the issue I run netperf on lo like this:
# netperf -l 600 -D 1,second -H localhost
This reports a steady throughput of 48527.77 10^6bits/s on lo. Then I push load
through OVS. My OF controller creates one flow rule per TCP connection going
through the switch. With about 100 new connections per second, this loads
the 8 cores to about 50% each. At some random point (mostly within the first
10 seconds of the test), CPU load drops to zero and netperf stalls.
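The post does not say how the connection load was generated, so the following is only a rough sketch of one way to approximate "about 100 new connections per second" for anyone trying to reproduce this. The function name open_connections, its parameters, and the target address used in the note below are all hypothetical; the listener would need to sit behind the switch so that each new connection triggers a flow setup.

```python
import socket
import time


def open_connections(host, port, rate_per_s=100, duration_s=10.0, timeout_s=1.0):
    """Open short-lived TCP connections at roughly rate_per_s.

    Returns a tuple (succeeded, failed). This is an illustrative load
    generator, not the tool used in the original report.
    """
    interval = 1.0 / rate_per_s
    ok = 0
    failed = 0
    start = time.monotonic()
    next_at = start
    while time.monotonic() - start < duration_s:
        try:
            # Every connect() gets a fresh ephemeral source port, so each
            # connection is a new 5-tuple from the switch's point of view.
            with socket.create_connection((host, port), timeout=timeout_s):
                pass
            ok += 1
        except OSError:
            failed += 1
        # Pace against a fixed schedule so slow connects do not lower the rate
        # more than necessary.
        next_at += interval
        delay = next_at - time.monotonic()
        if delay > 0:
            time.sleep(delay)
    return ok, failed
```

For example, open_connections("192.0.2.10", 5001) would run for ten seconds against a placeholder address (192.0.2.10 is a documentation-only address and must be replaced with a real listener). Running it alongside the netperf command above should make it possible to see whether the stall on lo coincides with the flow setup load.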
The kernel begins to emit messages like this:
grep : 1433 callbacks suppressed
With systemtap, I have traced this message to ip_finish_output2 in
net/ipv4/ip_output.c. The skb's at that point have a destination IP
openvswitch-1.11 on Linux 3.8.13
openvswitch-2.3.0 on Linux 3.14.19
openvswitch-git (2654cc338bfb413a6295078e3a7a8e1d4f67cbcc) on Linux 3.14.19
It seems that under this type of load, openvswitch kills traffic through lo.
Any ideas on what to try next?