[ovs-discuss] Possible bug with OVS LACP + VPC

Wed Jan 18 15:05:10 UTC 2017

Hi Chad,

I now have a theory on what's happening in your case.

I realized that the first LACPDU the peer switch sent for
re-negotiation contains all zeros in its Actor Information TLV; see
lines 81-84 of your gist:

  20:38:17.109650 00:00:00:00:00:00 > 01:80:c2:00:00:02, ethertype Slow Protocols (0x8809), length 124: LACPv1, length 110
      Actor Information TLV (0x01), length 20
            System 00:00:00:00:00:00, System Priority 0, Key 0, Port 0,Port Priority 0
            State Flags [Timeout]

I'd assume the switch is doing the same on eth1 when the link is back
up.
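
To make the all-zero TLV above concrete, here is a small Python sketch
that decodes an Actor Information TLV. The field layout is my reading
of the LACPDU format in 802.1AX; the function and variable names are
made up for illustration and are not taken from OVS.

```python
import struct

# Actor Information TLV, 20 bytes (assumed layout, from my reading of
# 802.1AX): type, length, system priority, system ID, key,
# port priority, port, state, 3 reserved bytes.
ACTOR_TLV = struct.Struct("!BBH6sHHHB3x")

# State bits, least significant bit first.
STATE_FLAGS = ["Activity", "Timeout", "Aggregation", "Synchronization",
               "Collecting", "Distributing", "Defaulted", "Expired"]


def parse_actor_tlv(data):
    """Decode the first 20 bytes of `data` as an Actor Information TLV."""
    (tlv_type, length, sys_pri, sys_id, key,
     port_pri, port, state) = ACTOR_TLV.unpack(data[:ACTOR_TLV.size])
    if tlv_type != 0x01 or length != 20:
        raise ValueError("not an Actor Information TLV")
    return {
        "sys_priority": sys_pri,
        "sys_id": ":".join("%02x" % b for b in sys_id),
        "key": key,
        "port_priority": port_pri,
        "port": port,
        "flags": [name for bit, name in enumerate(STATE_FLAGS)
                  if state & (1 << bit)],
    }


# The TLV from the trace: every field zero, only the Timeout bit set.
tlv = bytes([0x01, 20]) + bytes(14) + bytes([0x02]) + bytes(3)
info = parse_actor_tlv(tlv)
```

With that input, `info` carries a zero system ID, key and port, and
only the Timeout flag, matching the tcpdump output.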

When OVS receives the packet with the all-zero Actor Information TLV on
eth1, it compares the TLV with the partner information it has on record
for eth1 and finds that they differ. It therefore considers the peer to
have changed and triggers lacp_update_attach() by setting
lacp->update to true. See
    https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L355
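
As a rough illustration of that check (a Python sketch, not the actual
lib/lacp.c code; the class names, field names and sample values are
hypothetical):

```python
from collections import namedtuple

# Hypothetical stand-in for the partner record kept per slave.
PartnerInfo = namedtuple(
    "PartnerInfo", "sys_priority sys_id key port_priority port state")


class Slave:
    def __init__(self, name, partner):
        self.name = name
        self.partner = partner


class Lacp:
    def __init__(self):
        self.update = False  # set when attachments must be re-evaluated


def process_pdu(lacp, slave, pdu_actor):
    """Store the peer's actor info; flag an update if it changed."""
    if pdu_actor != slave.partner:
        slave.partner = pdu_actor
        lacp.update = True


# Made-up values for the partner previously recorded on eth1.
known = PartnerInfo(32768, "00:11:22:33:44:55", 100, 32768, 2, 0x3D)
# All-zero actor info with only the Timeout bit, as in the trace.
all_zero = PartnerInfo(0, "00:00:00:00:00:00", 0, 0, 0, 0x02)

lacp = Lacp()
eth1 = Slave("eth1", known)
process_pdu(lacp, eth1, all_zero)
```

After the call, `lacp.update` is set and eth1's stored partner is the
all-zero record.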

In lacp_update_attach(), the function tries to determine a "lead" slave,
which is the slave with the highest priority. In LACP a lower numeric
value means a higher priority, so because the eth1 slave now has
all-zero partner information, it will always be selected as the
"lead" slave. See lines 627-637 in
    https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L627
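
A simplified Python sketch of why the all-zero slave wins. This is not
the OVS implementation (which, as far as I can tell, also takes the
local system's priority into account); the field names and the eth0
values are made up for illustration.

```python
def priority_key(partner):
    # Lower values mean higher priority in LACP, so the natural tuple
    # ordering puts the highest-priority entry first.
    return (partner["sys_priority"], partner["sys_id"],
            partner["port_priority"], partner["port"])


def pick_lead(slaves):
    """Return the name of the slave whose partner sorts lowest."""
    return min(slaves, key=lambda name: priority_key(slaves[name]))


slaves = {
    # Hypothetical values for a healthy link.
    "eth0": {"sys_priority": 32768, "sys_id": bytes.fromhex("001122334455"),
             "key": 100, "port_priority": 32768, "port": 1},
    # All-zero partner info, as received on eth1.
    "eth1": {"sys_priority": 0, "sys_id": bytes(6),
             "key": 0, "port_priority": 0, "port": 0},
}

lead = pick_lead(slaves)
```

Nothing can sort below an all-zero tuple, so `lead` ends up being eth1
regardless of what the other slaves have recorded.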

Next in lacp_update_attach(), all other slaves whose sys_id or key
differs from the lead slave's are detached. See lines 642-651:
    https://github.com/openvswitch/ovs/blob/master/lib/lacp.c#L642
This is because all links attached to the same LACP aggregator need to
be talking to the same aggregator on the peer system, as identified by
sys_id and key.
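
The effect can be sketched in Python as follows (an illustration under
my own assumptions, not the lib/lacp.c code; names and the eth0 values
are hypothetical):

```python
def update_attach(slaves, lead):
    """Map each slave name to whether it stays attached.

    A slave stays attached only if its partner sys_id and key match
    those of the lead slave.
    """
    lead_partner = slaves[lead]
    return {
        name: (partner["sys_id"] == lead_partner["sys_id"]
               and partner["key"] == lead_partner["key"])
        for name, partner in slaves.items()
    }


slaves = {
    # Hypothetical partner info for the link that never went down.
    "eth0": {"sys_id": "00:11:22:33:44:55", "key": 100},
    # All-zero partner info received on eth1.
    "eth1": {"sys_id": "00:00:00:00:00:00", "key": 0},
}

attached = update_attach(slaves, lead="eth1")
```

With eth1 as the lead, eth0 no longer matches and is the one that gets
detached, even though its own partner information never changed.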

Once a slave is detached, in your case the eth0 slave, it no longer
has the (Synchronization, Collecting, Distributing) flags set. That
explains the "rogue" packet you are seeing on eth0: it is indeed meant
to indicate that the slave is out of sync and that re-negotiation is
required.
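
In terms of the actor state byte, the difference looks roughly like
this. The bit values follow the standard LACP state encoding; the
function is a hypothetical simplification and not how OVS builds the
state.

```python
# LACP actor state bits.
ACTIVITY = 0x01
TIMEOUT = 0x02
AGGREGATION = 0x04
SYNCHRONIZATION = 0x08
COLLECTING = 0x10
DISTRIBUTING = 0x20


def actor_state(attached, active=True, fast=True):
    """Build the state byte a slave would advertise (simplified)."""
    state = AGGREGATION
    if active:
        state |= ACTIVITY
    if fast:
        state |= TIMEOUT
    if attached:
        # Only an attached slave claims to be in sync and forwarding.
        state |= SYNCHRONIZATION | COLLECTING | DISTRIBUTING
    return state


in_sync = actor_state(attached=True)
detached = actor_state(attached=False)
```

The detached value has none of the Synchronization, Collecting or
Distributing bits, which is what the peer would read as out of sync.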

My opinion is that OVS is behaving correctly in restarting LACP
negotiation on both links upon receiving the all-zero actor TLV from
the peer switch. The issue is likely on the peer switch: after the link
comes back up on eth1, it probably should have sent a proper actor TLV
describing itself, with at least the system id and key unchanged.

/Shu

On Tue, Jan 17, 2017 at 09:30:39PM -0800, Shu Shen wrote:
> On Tue, Jan 17, 2017 at 04:54:51PM -0600, Chad Norgan wrote:
> > Given that the partner port_id on the rogue packet matches the slave
> > it's sent out. I lean towards #1, that the LACP implementation is
> > somehow mixing up the status for the slave's pdu, rather than leaking
> > eth1's pdu out the eth0 interface.
> > 
> > -Chad
> 
> Hi Chad,
> 
> A few observations and questions as below:
> 
> 1) I wrote an additional test case for the slave down and back up case,
> which appears to be working fine. I put in additional debug messages
> (not in the commit referred to below, though) to trace the LACPDUs
> being sent by all slaves and did not see any rogue packet. Of course,
> the test case uses two ovs switches and patch ports, so it may well be
> far from reproducing the problem you are having.  You may find the
> test case here:
> 
>     https://github.com/shushen/ovs/commit/72aa0afc6b61d5135ea9253b8aaf31a57c7c4734
> 
> And travis-ci builds with the above test case included are passing:
>     https://travis-ci.org/shushen/ovs/builds/192922935
> 
> 2) Could you please elaborate a bit more about how you "manually down
> the eth1 interface" and "bring eth1 back up"? Did you unplug a physical
> link or did you use any ovs/Linux CLI to do so? This may help me refine
> the test case to reproduce what you are doing.
> 
> 3) I find it interesting in the packet trace from the gist you posted,
> where the source mac address from the peer switch is all zeros, see
> 
>     https://gist.github.com/beardymcbeards/7bd9feca87c0574e996a397d90d5ff98#file-2_tcpdump-L81
> 
> If I read correctly, in Section 6.2.11.1 of 802.1AX-2014, it says:
> 
>     Protocol entities sourcing frames from within the Link Aggregation
>     sublayer (e.g., LACP and the Marker protocol) use the MAC address of
>     the MAC within an underlying Aggregation Port as the SA in frames
>     transmitted through that Aggregation Port.
> 
> I'm not sure why the peer switch is using the all-zero MAC address, but
> it probably shouldn't. I don't know how the ovs datapath handles such
> packets. If the source MAC address is also all zeros when eth1 comes
> back up, could this affect how the LACPDU from eth1 is handled? I
> welcome comments from you and the list.
> 
> I'd appreciate it if you could provide a bit more information on 2) or
> any other thoughts. My intention is to investigate this problem a bit
> more.
> 
> /Shu