[ovs-discuss] Possible bug with OVS LACP + VPC

Shu Shen shu.shen at gmail.com
Thu Jan 19 02:24:07 UTC 2017


I will look at what could be done without breaking the protocol and will
report back in a couple of days.

On Jan 18, 2017 2:56 PM, "Chad Norgan" <chad.norgan at gmail.com> wrote:

> Shu,
>
> Thanks so much for the deeper understanding. I think from here I'm
> going to get a mirror port setup on the switch so I can confirm that
> the switch is indeed sending that pdu on the restored link. I can then
> take that to the switch vendor to address.
>
> I love the idea of patching OVS to handle this case, but I will also
> work the switch vendor side. I would be willing to test a build if you
> guys have an idea on how you could make OVS cope with it. I'm curious
> if it couldn't just be discarded.
>
> Thanks,
> Chad
>
> On Wed, Jan 18, 2017 at 12:49 PM, Ben Pfaff <blp at ovn.org> wrote:
> > Shu, thanks for all the debugging!
> >
> > If this is a correct interpretation of the standard, and the peer switch
> > is misbehaving, and it's common behavior across a product line, then
> > possibly we'll need to make OVS cope with it.
> >
> > Chad, do you have thoughts on Shu's discoveries?
> >
> > Thanks,
> >
> > Ben.
> >
> > On Wed, Jan 18, 2017 at 07:05:10AM -0800, Shu Shen wrote:
> >> Hi Chad,
> >>
> >> I now have a theory on what's happening in your case.
> >>
> >> I realized that the first LACPDU packet the peer switch sent for
> >> re-negotiation contains all zeros in Actor Informaiton TLV, line
> >> 81-84 from your gist:
> >>
> >>   20:38:17.109650 00:00:00:00:00:00 > 01:80:c2:00:00:02, ethertype Slow
> Protocols (0x8809), length 124: LACPv1, length 110
> >>       Actor Information TLV (0x01), length 20
> >>             System 00:00:00:00:00:00, System Priority 0, Key 0, Port
> 0,Port Priority 0
> >>             State Flags [Timeout]
> >>
> >> I'd assume the switch is doing the same on eth1 when the link is back
> >> up.
> >>
> >> When OVS receives the packet with all-zero Actor Information TLV on
> >> eth1, it compares the TLV with the partener information it has on record
> >> for eth1, and find they are different. Therefore, it considers the peer
> >> has changed and will trigger lacp_update_attach() by setting
> >> lacp->update to true. See
> >>     https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L355
> >>
> >> In lacp_update_attach(), the function tries to determine a "lead" slave,
> >> which would be the slave with the highest priority. Because the eth1
> >> slave has all-zero partner actor TLV, it will always be selected as the
> >> "lead" slave. See line 627-637 in
> >>     https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L627
> >>
> >> Next in lacp_update_attach(), all other slaves that have different
> >> sys_id and key from the lead slave will be deattached. See line 642-651:
> >>     https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L642
> >> This is because all links attached to the same LACP aggregator would
> >> need be talking to the same aggregator on peer system as identified by
> >> sys_id and key.
> >>
> >> Once a slave is deattached, in your case the eth0 slave, it will no
> >> longer have (Synchronization, Collecting, Distributing) flags set and
> >> thus the "rogue" packet you are seeing on eth0, which is indeed to
> >> indicate that the slave is out-of-sync and a re-negotiation is required.
> >>
> >> My opinion is that OVS appears to be behaving correctly to restart LACP
> >> negotiation on both links upon receiving the all-zero actor TLV from
> >> peer switch. The issue is likely on the peer switch, it probably should
> >> have sent a proper actor TLV about itself with at least unchanged system
> >> id, key, after the link comes back on eth1.
> >>
> >> /Shu
> >>
> >>
> >> On Tue, Jan 17, 2017 at 09:30:39PM -0800, Shu Shen wrote:
> >> > On Tue, Jan 17, 2017 at 04:54:51PM -0600, Chad Norgan wrote:
> >> > > Given that the partner port_id on the rogue packet matches the slave
> >> > > it's sent out. I lean towards #1, that the LACP implementation is
> >> > > somehow mixing up the status for the slave's pdu, rather than
> leaking
> >> > > eth1's pdu out the eth0 interface.
> >> > >
> >> > > -Chad
> >> >
> >> > Hi Chad,
> >> >
> >> > A few observations and questions as below:
> >> >
> >> > 1) I wrote an additional testcase for the slave down and back up case,
> >> > which appears to be working fine. I put additional debug messages (not
> >> > in the commit referred below thought) to trace the lacpdu being sent
> by
> >> > all slaves and did not see any rogue package. Of course, the testcase
> >> > uses two ovs switches and patch ports, so it may well be far away from
> >> > reproducing the problem you are having.  You may find the test case
> >> > here:
> >> >
> >> >     https://github.com/shushen/ovs/commit/
> 72aa0afc6b61d5135ea9253b8aaf31a57c7c4734
> >> >
> >> > And travis-ci builds with the above test case included are passing:
> >> >     https://travis-ci.org/shushen/ovs/builds/192922935
> >> >
> >> > 2) Could you please elaborate a bit more about how you "manually down
> >> > the eth1 interface" and "bring eth1 back up"? Did you unplug a
> physical
> >> > link or did you use any ovs/Linux CLI to do so? This may help me
> refine
> >> > the test case to reproduce what you are doing.
> >> >
> >> > 3) I find it interesting in the packet trace from the gist you posted,
> >> > where the source mac address from the peer switch is all zeros, see
> >> >
> >> >     https://gist.github.com/beardymcbeards/
> 7bd9feca87c0574e996a397d90d5ff98#file-2_tcpdump-L81
> >> >
> >> > If I read correctly, in Section 6.2.11.1 of 802.1AX-2014, it says:
> >> >
> >> >     Protocol entities sourcing frames from within the Link Aggregation
> >> >     sublayer (e.g., LACP and the Marker protocol) use the MAC address
> of
> >> >     the MAC within an underlying Aggregation Port as the SA in frames
> >> >     transmitted through that Aggregation Port.
> >> >
> >> > I'm not sure why the peer switch is using the all-zero MAC address but
> >> > it probably shouldn't. I don't know how ovs datapath handles such
> >> > packets. If when eth1 is coming back up and the source MAC address is
> >> > also all zeros, could this affect how the LACPDU from eth1 being
> >> > handled? I welcome comments from you and the list.
> >> >
> >> > I'd appreciate if you could provide a bit more information on 2) or
> any
> >> > other thoughts. My intention is to investigate a bit more on this
> >> > problem.
> >> >
> >> > /Shu
> >> >
> >> > > _______________________________________________
> >> > > discuss mailing list
> >> > > discuss at openvswitch.org
> >> > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> >> _______________________________________________
> >> discuss mailing list
> >> discuss at openvswitch.org
> >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openvswitch.org/pipermail/ovs-discuss/attachments/20170118/11834f62/attachment-0001.html>


More information about the discuss mailing list