[ovs-discuss] Possible bug with OVS LACP + VPC

Ben Pfaff blp at ovn.org
Wed Jan 18 18:49:08 UTC 2017


Shu, thanks for all the debugging!

If this is a correct interpretation of the standard, and the peer switch
is misbehaving, and it's common behavior across a product line, then
possibly we'll need to make OVS cope with it.

Chad, do you have thoughts on Shu's discoveries?

Thanks,

Ben.

On Wed, Jan 18, 2017 at 07:05:10AM -0800, Shu Shen wrote:
> Hi Chad,
> 
> I now have a theory on what's happening in your case.
> 
> I realized that the first LACPDU packet the peer switch sent for
> re-negotiation contains all zeros in Actor Informaiton TLV, line
> 81-84 from your gist:
> 
>   20:38:17.109650 00:00:00:00:00:00 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110
>       Actor Information TLV (0x01), length 20
>             System 00:00:00:00:00:00, System Priority 0, Key 0, Port 0,Port Priority 0
>             State Flags [Timeout]
> 
> I'd assume the switch is doing the same on eth1 when the link is back
> up.
> 
> When OVS receives the packet with all-zero Actor Information TLV on
> eth1, it compares the TLV with the partener information it has on record
> for eth1, and find they are different. Therefore, it considers the peer
> has changed and will trigger lacp_update_attach() by setting
> lacp->update to true. See
>     https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L355
> 
> In lacp_update_attach(), the function tries to determine a "lead" slave,
> which would be the slave with the highest priority. Because the eth1
> slave has all-zero partner actor TLV, it will always be selected as the
> "lead" slave. See line 627-637 in
>     https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L627
> 
> Next in lacp_update_attach(), all other slaves that have different
> sys_id and key from the lead slave will be deattached. See line 642-651:
>     https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L642
> This is because all links attached to the same LACP aggregator would
> need be talking to the same aggregator on peer system as identified by
> sys_id and key.
> 
> Once a slave is deattached, in your case the eth0 slave, it will no
> longer have (Synchronization, Collecting, Distributing) flags set and
> thus the "rogue" packet you are seeing on eth0, which is indeed to
> indicate that the slave is out-of-sync and a re-negotiation is required.
> 
> My opinion is that OVS appears to be behaving correctly to restart LACP
> negotiation on both links upon receiving the all-zero actor TLV from
> peer switch. The issue is likely on the peer switch, it probably should
> have sent a proper actor TLV about itself with at least unchanged system
> id, key, after the link comes back on eth1.
> 
> /Shu
> 
> 
> On Tue, Jan 17, 2017 at 09:30:39PM -0800, Shu Shen wrote:
> > On Tue, Jan 17, 2017 at 04:54:51PM -0600, Chad Norgan wrote:
> > > Given that the partner port_id on the rogue packet matches the slave
> > > it's sent out. I lean towards #1, that the LACP implementation is
> > > somehow mixing up the status for the slave's pdu, rather than leaking
> > > eth1's pdu out the eth0 interface.
> > > 
> > > -Chad
> > 
> > Hi Chad,
> > 
> > A few observations and questions as below:
> > 
> > 1) I wrote an additional testcase for the slave down and back up case,
> > which appears to be working fine. I put additional debug messages (not
> > in the commit referred below thought) to trace the lacpdu being sent by
> > all slaves and did not see any rogue package. Of course, the testcase
> > uses two ovs switches and patch ports, so it may well be far away from
> > reproducing the problem you are having.  You may find the test case
> > here:
> > 
> >     https://github.com/shushen/ovs/commit/72aa0afc6b61d5135ea9253b8aaf31a57c7c4734
> > 
> > And travis-ci builds with the above test case included are passing:
> >     https://travis-ci.org/shushen/ovs/builds/192922935
> > 
> > 2) Could you please elaborate a bit more about how you "manually down
> > the eth1 interface" and "bring eth1 back up"? Did you unplug a physical
> > link or did you use any ovs/Linux CLI to do so? This may help me refine
> > the test case to reproduce what you are doing.
> > 
> > 3) I find it interesting in the packet trace from the gist you posted,
> > where the source mac address from the peer switch is all zeros, see
> > 
> >     https://gist.github.com/beardymcbeards/7bd9feca87c0574e996a397d90d5ff98#file-2_tcpdump-L81
> > 
> > If I read correctly, in Section 6.2.11.1 of 802.1AX-2014, it says:
> > 
> >     Protocol entities sourcing frames from within the Link Aggregation
> >     sublayer (e.g., LACP and the Marker protocol) use the MAC address of
> >     the MAC within an underlying Aggregation Port as the SA in frames
> >     transmitted through that Aggregation Port.
> > 
> > I'm not sure why the peer switch is using the all-zero MAC address but
> > it probably shouldn't. I don't know how ovs datapath handles such
> > packets. If when eth1 is coming back up and the source MAC address is
> > also all zeros, could this affect how the LACPDU from eth1 being
> > handled? I welcome comments from you and the list.
> > 
> > I'd appreciate if you could provide a bit more information on 2) or any
> > other thoughts. My intention is to investigate a bit more on this
> > problem.
> > 
> > /Shu
> > 
> > > _______________________________________________
> > > discuss mailing list
> > > discuss at openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> _______________________________________________
> discuss mailing list
> discuss at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


More information about the discuss mailing list