[ovs-dev] [BUG] SLB bonding & bond-rebalance-interval not working as expected

Markus Schuster ml at markus.schuster.name
Wed Jan 30 14:44:21 UTC 2013


Hi,

it looks like I'm in an SLB bonding bug-hunting mood at the moment :) I think I
found another bug, or at least strange behaviour, with balance-slb bonding. Two
VMs suffered from short-term connectivity issues (a few minutes each) every now
and then, so I dug further into the issue. The first thing I noticed was the
MAC of the VM jumping between ports on the uplink switches like mad. That
reminded me of my other bug report [1], but this time I saw no broadcast or
multicast frames. Instead, normal unicast frames were sent out on both member
ports of the bond for a few minutes every now and then.

Long story short: XCP 1.6 configures SLB bonds to rebalance their traffic
every 30 minutes, and it looks like Open vSwitch sometimes fails to migrate
certain flows from one interface to the other. As a result, some traffic is
sent via the "old" interface and some via the "new" interface.

To debug the issue further I set up tcpdump on both bond member interfaces,
monitored the log file for notifications of traffic shifting, and ran
ovs-dpctl dump-flows xapi1 as soon as I noticed the problem.

First, the log showed the following:
--- cut ---
Jan 29 18:17:36 hostname ovs-vswitchd: 30635|bond|INFO|bond bond0: shift 
13622kB of load (with hash 101) from eth1 to eth0 (now carrying 25209kB and 
45293kB load, respectively)
Jan 29 18:17:36 hostname ovs-vswitchd: 30636|bond|INFO|bond bond0: shift 
2036kB of load (with hash 112) from eth0 to eth1 (now carrying 43257kB and 
27245kB load, respectively)
Jan 29 18:17:36 hostname ovs-vswitchd: 30637|bond|INFO|bond bond0: shift 
29434kB of load (with hash 189) from eth0 to eth1 (now carrying 13823kB and 
56680kB load, respectively)
Jan 29 18:17:36 hostname ovs-vswitchd: 30638|bond|INFO|bond bond0: shift 
1413kB of load (with hash 3) from eth1 to eth0 (now carrying 55267kB and 
15236kB load, respectively)
Jan 29 18:17:36 hostname ovs-vswitchd: 30639|bond|INFO|bond bond0: shift 
1030kB of load (with hash 62) from eth1 to eth0 (now carrying 54236kB and 
16266kB load, respectively)
Jan 29 18:17:36 hostname ovs-vswitchd: 30640|bond|INFO|bond bond0: shift 
1831kB of load (with hash 111) from eth1 to eth0 (now carrying 52405kB and 
18098kB load, respectively)
Jan 29 18:17:36 hostname ovs-vswitchd: 30641|bond|INFO|bond bond0: shift 
20447kB of load (with hash 203) from eth1 to eth0 (now carrying 31957kB and 
38546kB load, respectively)
--- cut ---
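
For anyone who wants to pull the relevant shift out of a longer log, here is a
minimal Python sketch. The regular expression and the function name
parse_shifts are illustrative assumptions based only on the message format
shown above; they are not part of Open vSwitch.

```python
import re

# Pattern for the bond "shift" messages as they appear in the excerpt above.
SHIFT_RE = re.compile(
    r"bond (?P<bond>\S+): shift (?P<kb>\d+)kB of load "
    r"\(with hash (?P<hash>\d+)\) from (?P<src>\S+) to (?P<dst>\S+)"
)

def parse_shifts(log_text):
    """Return one dict per shift message found in log_text."""
    # Mail clients wrap long log lines, so collapse all whitespace runs first.
    flat = " ".join(log_text.split())
    return [
        {
            "bond": m["bond"],
            "kb": int(m["kb"]),
            "hash": int(m["hash"]),
            "src": m["src"],
            "dst": m["dst"],
        }
        for m in SHIFT_RE.finditer(flat)
    ]

# One wrapped message copied from the log excerpt above.
sample = """Jan 29 18:17:36 hostname ovs-vswitchd: 30637|bond|INFO|bond bond0: shift
29434kB of load (with hash 189) from eth0 to eth1 (now carrying 13823kB and
56680kB load, respectively)"""

shifts = parse_shifts(sample)
for s in shifts:
    if s["hash"] == 189:
        print(s["bond"], s["src"], "->", s["dst"], s["kb"], "kB")
```

Filtering on the hash value makes it easy to see when, and in which direction,
a given hash bucket was moved.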

The traffic of the VM I was monitoring in this case should fall into hash 189 
(calculated with ovs-appctl bond/hash). 

tcpdump showed the following:
- First packet on eth1 at 18:17:37
- Still traffic on eth0
- Last packet sent via eth0 at 18:19:16

So in this case there's a window of roughly 100 seconds (18:17:37 to 18:19:16)
in which both bonded interfaces are used to send out traffic for the same VM,
which causes great trouble on the uplink switches, as we all know.
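
The length of that window follows directly from the two tcpdump timestamps. A
small Python sketch of the arithmetic (the helper name overlap_seconds is made
up for illustration, and both timestamps are assumed to be on the same day):

```python
from datetime import datetime

def overlap_seconds(first_on_new, last_on_old, fmt="%H:%M:%S"):
    """Seconds between the first packet on the new interface and the
    last packet on the old one; 0 if the old interface went quiet first."""
    t_new = datetime.strptime(first_on_new, fmt)
    t_old = datetime.strptime(last_on_old, fmt)
    return max(0.0, (t_old - t_new).total_seconds())

# First packet on eth1 at 18:17:37, last packet on eth0 at 18:19:16.
window = overlap_seconds("18:17:37", "18:19:16")
print(window)  # 99.0
```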

Now to ovs-dpctl dump-flows xapi1: I used grep to filter for the MAC of the VM 
in question - please find my results attached to this e-mail. As that's taken 
from production servers I'm forced to replace IP addresses but I left the MAC 
and VLAN information as is. 

Background information: Two VMs form an HA cluster: 192.168.0.21 and
192.168.0.22 (they use UDP/691 for cluster communication and TCP/7781,
TCP/7782 and TCP/7783 for DRBD communication). Cluster communication
happens a few times every second.

Flows from the same source MAC with actions:push_vlan(vid=200,pcp=0),1 and
actions:push_vlan(vid=200,pcp=0),2 installed at the same time show the issue.
It looks like Open vSwitch was unable to move the cluster communication on
UDP/691 over to the other interface.
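
A minimal Python sketch of how such a split can be spotted mechanically in
ovs-dpctl dump-flows output. It assumes that datapath ports 1 and 2 are the two
bond members and that the output port is the last item of the action list, as
in the attached dump. The regular expression and the function name
uplinks_by_source are illustrative, not an Open vSwitch API.

```python
import re
from collections import defaultdict

# Extract in_port, source MAC, packet count and action list from a flow line.
FLOW_RE = re.compile(
    r"in_port\((?P<in_port>\d+)\),eth\(src=(?P<src>[0-9a-f:]+),.*"
    r"packets:(?P<packets>\d+),.*actions:(?P<actions>\S+)"
)

def uplinks_by_source(dump_text, uplink_ports=("1", "2")):
    """Map each source MAC to the set of uplink ports its active flows use."""
    ports = defaultdict(set)
    for line in dump_text.splitlines():
        m = FLOW_RE.search(line)
        if not m or int(m["packets"]) == 0:
            continue  # skip non-flow lines and flows that never saw a packet
        # Assumption: the output port is the last comma-separated action item.
        out = m["actions"].rsplit(",", 1)[-1]
        if out in uplink_ports:
            ports[m["src"]].add(out)
    return ports

# Two flow lines copied from the attached dump.
sample = (
    "in_port(157),eth(src=56:cf:67:4f:46:89,dst=1a:16:08:6c:f7:6c),"
    "eth_type(0x0800),ipv4(src=192.168.0.22,dst=192.168.0.21,proto=17,"
    "tos=0x10,ttl=64,frag=no),udp(src=42347,dst=691), packets:2259, "
    "bytes:518945, used:0.040s, actions:push_vlan(vid=200,pcp=0),2\n"
    "in_port(157),eth(src=56:cf:67:4f:46:89,dst=1a:16:08:6c:f7:6c),"
    "eth_type(0x0800),ipv4(src=192.168.0.22,dst=192.168.0.21,proto=6,"
    "tos=0,ttl=64,frag=no),tcp(src=57549,dst=7782), packets:2, "
    "bytes:152, used:2.540s, actions:push_vlan(vid=200,pcp=0),1\n"
)

result = uplinks_by_source(sample)
for mac, used in result.items():
    if len(used) > 1:
        print(mac, "is sent out on ports", sorted(used))
```

Any source MAC that ends up with more than one uplink port in the result is
being transmitted on both bond members at once.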

I hope that's enough information for you. 

Regards,
Markus


[1] Message-ID: <kdmmcb$nhc$1 at ger.gmane.org>; Subject: [BUG] broad-/multicast 
& SLB bonding -> FAIL
-------------- next part --------------
in_port(157),eth(src=56:cf:67:4f:46:89,dst=00:00:00:ff:00:02),eth_type(0x0806),arp(sip=192.168.0.22,tip=192.168.0.242,op=2,sha=56:cf:67:4f:46:89,tha=00:00:00:ff:00:02), packets:0, bytes:0, used:never, actions:push_vlan(vid=200,pcp=0),1
in_port(157),eth(src=56:cf:67:4f:46:89,dst=1a:16:08:6c:f7:6c),eth_type(0x0800),ipv4(src=192.168.0.22,dst=192.168.0.21,proto=17,tos=0x10,ttl=64,frag=no),udp(src=42347,dst=691), packets:2259, bytes:518945, used:0.040s, actions:push_vlan(vid=200,pcp=0),2
in_port(157),eth(src=56:cf:67:4f:46:89,dst=1a:16:08:6c:f7:6c),eth_type(0x0800),ipv4(src=192.168.0.22,dst=192.168.0.21,proto=6,tos=0,ttl=64,frag=no),tcp(src=56923,dst=7781), packets:0, bytes:0, used:never, actions:push_vlan(vid=200,pcp=0),1
in_port(157),eth(src=56:cf:67:4f:46:89,dst=1a:16:08:6c:f7:6c),eth_type(0x0800),ipv4(src=192.168.0.22,dst=192.168.0.21,proto=6,tos=0,ttl=64,frag=no),tcp(src=57549,dst=7782), packets:2, bytes:152, used:2.540s, actions:push_vlan(vid=200,pcp=0),1
in_port(157),eth(src=56:cf:67:4f:46:89,dst=1a:16:08:6c:f7:6c),eth_type(0x0800),ipv4(src=192.168.0.22,dst=192.168.0.21,proto=6,tos=0,ttl=64,frag=no),tcp(src=7783,dst=34851), packets:4, bytes:280, used:1.471s, actions:push_vlan(vid=200,pcp=0),1

in_port(1),eth(src=1a:16:08:6c:f7:6c,dst=56:cf:67:4f:46:89),eth_type(0x8100),vlan(vid=200,pcp=0),encap(eth_type(0x0800),ipv4(src=192.168.0.21,dst=192.168.0.22,proto=17,tos=0x10,ttl=64,frag=no),udp(src=34746,dst=691)), packets:1, bytes:226, used:1.620s, actions:pop_vlan,157
in_port(1),eth(src=1a:16:08:6c:f7:6c,dst=56:cf:67:4f:46:89),eth_type(0x8100),vlan(vid=200,pcp=0),encap(eth_type(0x0800),ipv4(src=192.168.0.21,dst=192.168.0.22,proto=6,tos=0,ttl=64,frag=no),tcp(src=34851,dst=7783)), packets:3, bytes:218, used:1.501s, actions:pop_vlan,157
in_port(1),eth(src=1a:16:08:6c:f7:6c,dst=56:cf:67:4f:46:89),eth_type(0x8100),vlan(vid=200,pcp=0),encap(eth_type(0x0800),ipv4(src=192.168.0.21,dst=192.168.0.22,proto=6,tos=0,ttl=64,frag=no),tcp(src=7781,dst=56923)), packets:1, bytes:66, used:1.240s, actions:pop_vlan,157
in_port(1),eth(src=1a:16:08:6c:f7:6c,dst=56:cf:67:4f:46:89),eth_type(0x8100),vlan(vid=200,pcp=0),encap(eth_type(0x0800),ipv4(src=192.168.0.21,dst=192.168.0.22,proto=6,tos=0,ttl=64,frag=no),tcp(src=7782,dst=57549)), packets:0, bytes:0, used:never, actions:pop_vlan,157

in_port(2),eth(src=1a:16:08:6c:f7:6c,dst=56:cf:67:4f:46:89),eth_type(0x8100),vlan(vid=200,pcp=0),encap(eth_type(0x0800),ipv4(src=192.168.0.21,dst=192.168.0.22,proto=17,tos=0x10,ttl=64,frag=no),udp(src=34746,dst=691)), packets:479, bytes:109082, used:0.620s, actions:pop_vlan,157
in_port(2),eth(src=1a:16:08:6c:f7:6c,dst=56:cf:67:4f:46:89),eth_type(0x8100),vlan(vid=200,pcp=0),encap(eth_type(0x0800),ipv4(src=192.168.0.21,dst=192.168.0.22,proto=6,tos=0,ttl=64,frag=no),tcp(src=7782,dst=57549)), packets:1, bytes:66, used:2.503s, actions:pop_vlan,157

