[ovs-discuss] bond_updelay being ignored?

Ben Pfaff blp at ovn.org
Wed Oct 10 20:35:45 UTC 2018


On Tue, Oct 09, 2018 at 08:33:32AM -0600, Daniel Leaberry wrote:
> 
> > On Oct 8, 2018, at 5:36 PM, Ethan J. Jackson <ejj at eecs.berkeley.edu> wrote:
> > 
> > No memory unfortunately.
> > 
> > Ethan
> > 
> > Ethan J. Jackson
> > ejj.sh
> > 
> > 
> > On Mon, Oct 08, 2018 at 1:45 PM, Ben Pfaff <blp at ovn.org> wrote:
> > On Tue, Oct 02, 2018 at 10:28:52AM -0600, Daniel Leaberry via discuss wrote:
> > 
> > I have CentOS 7 with openvswitch 2.9.0. The server has 4 ports in an LACP bond (called allbond) connected to a pair of Arista switches in an MLAG. Here's the config:
> > 
> > ovs-vsctl list port allbond 
> > _uuid : 9f224f2d-8bb1-4cfd-84e2-d60c6d973a7a 
> > bond_active_slave : "90:e2:ba:d6:1c:44" 
> > bond_downdelay : 0 
> > bond_fake_iface : false 
> > bond_mode : balance-tcp 
> > bond_updelay : 40000 
> > cvlans : [] 
> > external_ids : {} 
> > fake_bridge : false 
> > interfaces : [61b9a345-2f3d-4127-b9cd-eaca8a749574, 89ce3480-d62d-4291-9a84-bdf711016793, 941c9393-1021-490c-84ac-311250ba0343, dc49ffd3-c259-43b6-8072-2ce12c52d1b1] 
> > lacp : active 
> > mac : [] 
> > name : allbond 
> > other_config : {} 
> > protected : false 
> > qos : [] 
> > rstp_statistics : {} 
> > rstp_status : {} 
> > statistics : {} 
> > status : {} 
> > tag : [] 
> > trunks : [] 
> > vlan_mode : []
> > 
> > ---- allbond ---- 
> > bond_mode: balance-tcp 
> > bond may use recirculation: yes, Recirc-ID : 3 
> > bond-hash-basis: 0 
> > updelay: 40000 ms 
> > downdelay: 0 ms 
> > next rebalance: 3229 ms 
> > lacp_status: negotiated 
> > lacp_fallback_ab: false 
> > active slave mac: 90:e2:ba:d6:1c:44(eth5)
> > 
> > slave eth3: enabled 
> > may_enable: true 
> > hash 50: 1 kB load 
> > hash 162: 1 kB load 
> > hash 170: 1 kB load
> > 
> > slave eth4: enabled 
> > may_enable: true 
> > hash 123: 4 kB load 
> > hash 221: 12 kB load
> > 
> > slave eth5: enabled 
> > active slave 
> > may_enable: true 
> > hash 94: 1 kB load 
> > hash 177: 1 kB load 
> > hash 245: 1 kB load
> > 
> > slave eth6: enabled 
> > may_enable: true 
> > hash 97: 46 kB load
> > 
> > As you can see, updelay is set to 40 seconds. I go to the switch and shut down the port for eth6. It's immediately pulled from the bond. I then clear the switch counters and wait a few minutes. I would expect that, once the port is set to "no shutdown", 40 seconds would go by before openvswitch brings it back into the bond. But that doesn't happen.
> > 
> > 2018-10-02T15:31:32.885Z|00349|bond|INFO|interface eth6: link state down
> > 2018-10-02T15:31:32.885Z|00350|bond|INFO|interface eth6: disabled
> > 2018-10-02T15:35:45.861Z|00352|bond|INFO|interface eth6: link state up
> > 2018-10-02T15:35:45.861Z|00353|bond|INFO|interface eth6: enabled
> > 2018-10-02T15:35:51.286Z|00354|bond|INFO|bond allbond: shift 93kB of load (with hash 97) from eth3 to eth6 (now carrying 6kB and 93kB load, respectively)
> > 
> > Immediately after the link is re-established, the port (eth6) is enabled again and traffic, as shown in the switch counters, begins to flow again. It feels like I'm doing something wrong, but I've googled for hours and can't find anything that explains why bond_updelay is being ignored.
> > 
> > I spent some time looking through the history here. Ethan (CCed) added LACP support to OVS in January 2011. From that point forward, OVS has always ignored updelay and downdelay for a bond when LACP is enabled. I don't know why, exactly. Maybe Ethan remembers.
> > 
> > It would be easy to enable updelay and downdelay for LACP bonds:
> > 
> > diff --git a/ofproto/bond.c b/ofproto/bond.c 
> > index f87cdba7908f..8a90ba2686af 100644 
> > --- a/ofproto/bond.c 
> > +++ b/ofproto/bond.c 
> > @@ -1717,8 +1717,7 @@ bond_link_status_update(struct bond_slave *slave)
> >              VLOG_INFO_RL(&rl, "interface %s: will not be %s",
> >                           slave->name, up ? "disabled" : "enabled");
> >          } else {
> > -            int delay = (bond->lacp_status != LACP_DISABLED ? 0
> > -                         : up ? bond->updelay : bond->downdelay);
> > +            int delay = up ? bond->updelay : bond->downdelay;
> >              slave->delay_expires = time_msec() + delay;
> >              if (delay) {
> >                  VLOG_INFO_RL(&rl, "interface %s: will be %s if it stays %s "
> > 
> > 
> 
> I *greatly* appreciate you looking into this, Ben. It's rare in open source that I find an actual bug, so generally I just figure I'm doing something wrong. The documentation is pretty clear about calling out the bond_updelay and bond_downdelay parameters, so at the very least those should be clarified/removed.
> 
> What next steps should I take? Is there a bug report I should file? This is fairly critical to me because we run a ton of these 4-port bonds to 2 Arista switches (they're redundant). When we upgrade the switch firmware, the switch comes back online and the ports all light up at the same time, but it takes a few seconds for spanning tree to sort everything out. During those seconds we have packet loss because OVS thinks the ports are totally back in action when they aren't.

Since we don't have a known reason not to honor these settings for LACP
bonds, I propose that we just change OVS behavior.

I sent a formal patch:
        https://patchwork.ozlabs.org/patch/982091/
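
For anyone reading along, here is a minimal standalone sketch of how the
delay is chosen before and after the change quoted above.  It is only an
illustration, not the OVS code: struct bond_cfg, delay_before() and
delay_after() are hypothetical names, and the enum is a stand-in for the
LACP status that bond.c tests against LACP_DISABLED.

```c
#include <stdbool.h>

/* Stand-in for the LACP status checked in bond.c (illustrative only). */
enum lacp_status { LACP_NEGOTIATED, LACP_CONFIGURED, LACP_DISABLED };

/* Hypothetical reduced view of a bond: only the fields the diff touches. */
struct bond_cfg {
    enum lacp_status lacp_status;
    int updelay;                /* bond_updelay, in ms. */
    int downdelay;              /* bond_downdelay, in ms. */
};

/* Delay selection as in the removed lines: whenever LACP is in use, the
 * configured delays are ignored and 0 is used instead. */
int
delay_before(const struct bond_cfg *bond, bool up)
{
    return (bond->lacp_status != LACP_DISABLED ? 0
            : up ? bond->updelay : bond->downdelay);
}

/* Delay selection as in the added line: the configured delays are honored
 * regardless of LACP status. */
int
delay_after(const struct bond_cfg *bond, bool up)
{
    return up ? bond->updelay : bond->downdelay;
}
```

With the configuration from the original report (LACP negotiated,
updelay 40000, downdelay 0), delay_before() yields 0 ms for a link coming
up, which matches the immediate "enabled" in the log, while delay_after()
yields 40000 ms.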

