[ovs-discuss] Slowness on neighbour learning

Joe Stringer joe at ovn.org
Thu Apr 6 19:15:27 UTC 2017


On 6 April 2017 at 11:57, Mika Väisänen <mika.vaisanen at gmail.com> wrote:
> Hello,
>
> Is it normal OVS behaviour that neighbour update (MAC moving from OVS switch port to another) can cause 100-500 ms break to traffic? Is there any way to configure it to be faster?
>
> In my case a Linux host is connected with two bonded Ethernets to server running OVS 2.5.2 switch. When I change the active bonding slave from the Linux host, it causes 100-500 ms break to traffic between the Linux host and other hosts on the network. In case I run the same test with HW switch, there is no noticeable traffic break at all.
>
> While investigating this, I found some strangeness in the way the MAC is learned by OVS. In the following example I have moved the active slave from interface eno3 to eno4. The first learning looks correct (handler51), but then revalidator56 updates the MAC so that it is found on the old port again:
>
> 2017-04-06T09:51:25.179Z|00339|ofproto_dpif_xlate(handler51)|DBG|bridge swu0: learned that 02:11:61:61:70:25 is on port eno4 in VLAN 4
> 2017-04-06T09:51:25.179Z|00344|ofproto_dpif_xlate(handler51)|DBG|bridge swu0: learned that 02:11:61:61:70:25 is on port eno4 in VLAN 5
> 2017-04-06T09:51:25.179Z|00349|ofproto_dpif_xlate(handler51)|DBG|bridge swu0: learned that 02:11:61:61:70:25 is on port eno4 in VLAN 64
> 2017-04-06T09:51:25.247Z|00065|ofproto_dpif_xlate(revalidator56)|DBG|bridge swu0: learned that 02:11:61:61:70:25 is on port eno3 in VLAN 4
> 2017-04-06T09:51:25.247Z|00066|ofproto_dpif_xlate(revalidator56)|DBG|bridge swu0: learned that 02:11:61:61:70:25 is on port eno3 in VLAN 5
> 2017-04-06T09:51:25.249Z|00067|ofproto_dpif_xlate(revalidator56)|DBG|bridge swu0: learned that 02:11:61:61:70:25 is on port eno4 in VLAN 4
>
> Why is revalidator refreshing old neighbour information? Could it be causing the slowness or is it totally irrelevant?

I wonder if there's a race that happens here.

Let's say that at T0, revalidator runs, forwarding is all correct, and
the datapath flows are all fine.
At T1, a packet arrives for eno3 VLAN 4, and is forwarded correctly.
There's now one packet attributed to the datapath flow which
revalidator will need to translate and attribute stats for.
At T2, the active slave is shifted from eno3 to eno4. Traffic starts
to flow over eno4, so you see the handler threads setting up new flows
to handle this traffic (correctly).
At T3, the revalidator thread wakes up, and starts dumping all of the
datapath flows. When it finds the flow that was hit in T1, it will
translate this flow, attribute stats, and execute side effects such as
learning the MAC. If it had learnt the MAC at the exact moment that the
packet arrived, then it would have correctly learned that the MAC
was on eno3. However, it's not aware that the traffic has since
shifted to eno4, so it attributes and learns on eno3. The revalidator
thread continues to dump the datapath flows and finds the one that
handles the traffic now on eno4, and translates that one which also
has traffic. This makes the learning happen again on eno4.

Thereafter, I'm guessing that you don't send the traffic on eno3 so
there will be no packets to attribute, no MAC should be learnt, and
eventually the revalidator will time out the flow.
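
To make the suspected ordering concrete, here is a toy model of the
T1-T3 timeline in Python. It is only a sketch of the reasoning above,
not OVS code: the dict-based MAC table, the flow records and the
function names are all made up for illustration, and it assumes the
revalidator happens to dump the eno3 flow before the eno4 flow.

```python
# Toy model of the suspected race. Not OVS code; every name is hypothetical.

def revalidate(mac_table, datapath_flows):
    """Dump flows in order and re-learn the MAC for each flow that has
    packets not yet attributed. Returns the ports learned, in order."""
    learned = []
    for flow in datapath_flows:
        if flow["new_packets"] > 0:
            # Side effect of translation: MAC learning on the flow's in_port,
            # regardless of when the packets actually arrived.
            mac_table[(flow["mac"], flow["vlan"])] = flow["in_port"]
            learned.append(flow["in_port"])
            flow["new_packets"] = 0  # stats are now attributed
    return learned


def simulate():
    mac, vlan = "02:11:61:61:70:25", 4
    mac_table = {}
    flows = []

    # T1: a packet arrives on eno3 and hits a datapath flow.
    mac_table[(mac, vlan)] = "eno3"
    flows.append({"mac": mac, "vlan": vlan, "in_port": "eno3", "new_packets": 1})

    # T2: the active slave moves; a handler learns eno4 and installs a new flow.
    mac_table[(mac, vlan)] = "eno4"
    flows.append({"mac": mac, "vlan": vlan, "in_port": "eno4", "new_packets": 1})

    # T3: the revalidator dumps both flows, oldest first.
    first_round = revalidate(mac_table, flows)
    # Later round: no new packets on eno3, so nothing is re-learned.
    second_round = revalidate(mac_table, flows)
    return first_round, second_round, mac_table[(mac, vlan)]


if __name__ == "__main__":
    print(simulate())
```

In this model the first revalidator round learns eno3 and then eno4,
which matches the order of the revalidator56 lines in your log, and
the second round learns nothing. The table ends up on eno4, but it
briefly points at eno3 in between.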

It may be possible to mitigate this if there were to be some sort
of 'learning ratelimit' where a MAC that shifts to a new interface
cannot be relearnt for X seconds. It could try to be smart and track
the previous interface, then if the MAC shifts to a new interface we
don't perform learning for the previous interface, or it could be
something a bit more general as just 'don't learn a particular MAC
more than once a second'.
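
A rough Python sketch of the first variant (a MAC that has just moved
cannot move again for X seconds) might look like the following. This
does not exist in OVS as far as this thread goes; the class name, the
hold_seconds parameter and the explicit 'now' argument are
assumptions made purely to illustrate the idea.

```python
# Hypothetical 'learning ratelimit' sketch. Not an OVS API.

class RateLimitedMacTable:
    def __init__(self, hold_seconds=1.0):
        self.hold_seconds = hold_seconds
        self._entries = {}  # (mac, vlan) -> (port, time of last move)

    def learn(self, mac, vlan, port, now):
        """Try to learn mac/vlan on port at time 'now' (seconds).
        Returns True if the table maps the MAC to 'port' afterwards."""
        key = (mac, vlan)
        entry = self._entries.get(key)
        if entry is None:
            self._entries[key] = (port, now)  # first sighting
            return True
        current_port, moved_at = entry
        if port == current_port:
            return True  # plain refresh, no move
        if now - moved_at < self.hold_seconds:
            return False  # moved too recently; ignore the stale learn
        self._entries[key] = (port, now)
        return True

    def lookup(self, mac, vlan):
        entry = self._entries.get((mac, vlan))
        return entry[0] if entry else None


if __name__ == "__main__":
    table = RateLimitedMacTable(hold_seconds=1.0)
    mac = "02:11:61:61:70:25"
    table.learn(mac, 4, "eno3", now=0.0)     # normal learning
    table.learn(mac, 4, "eno4", now=10.0)    # handler: slave moved
    table.learn(mac, 4, "eno3", now=10.068)  # revalidator: stale, rejected
    print(table.lookup(mac, 4))
```

The obvious cost is that a genuine move back to the old port within
the hold time would also be ignored, so X would need to be chosen
with that in mind.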

