[ovs-discuss] Duplicate ARP Problem

Wed Aug 14 06:44:35 UTC 2019

Hi all,

We are currently running Proxmox, backed by Open vSwitch. Until recently we had not noticed any issues with this setup. We have since upgraded our data centre switching (Juniper QFX), which enables an ARP Suppression feature by default. This now appears to be suppressing some ARP traffic, and we are intermittently losing access to our Proxmox hosts.

The setup on one of them is as follows (all three of our hosts are experiencing the same issue):

# Main interface
allow-vmbr0 bond0
iface bond0 inet manual
        ovs_bonds eno3 eno4
        ovs_type OVSBond
        ovs_bridge vmbr0
        ovs_options lacp=active bond_mode=balance-tcp

auto lo
iface lo inet loopback

# Interface to secondary network
allow-vmbr1 eno1
iface eno1 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr1

# Mirror to port capture server
allow-vmbr0 eno2
iface eno2 inet manual
        ovs_type OVSPort
        ovs_bridge vmbr0

iface eno3 inet manual

iface eno4 inet manual

# Management interface
allow-vmbr0 vport0
iface vport0 inet static
        address  10.21.0.15
        netmask  255.255.255.0
        gateway  10.21.0.210
        ovs_type OVSIntPort
        ovs_bridge vmbr0

# Secondary network
allow-vmbr1 vport1
iface vport1 inet static
        address  172.22.1.15
        netmask  255.255.255.0
        ovs_type OVSIntPort
        ovs_bridge vmbr1
        ovs_options tag=100

auto vmbr0
iface vmbr0 inet manual
        ovs_type OVSBridge
        ovs_ports bond0 vport0 eno2

auto vmbr1
iface vmbr1 inet manual
        ovs_type OVSBridge
        ovs_ports eno1 vport1

We appear to be hitting some strange behaviour where two interfaces on the hosts respond to ARP, with different MACs, and interestingly only if the source address of the ARP packet is 0.0.0.0.

$ ip a | grep -EiA2 "vmbr0|vport0"
vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 18:66:da:51:b3:eb brd ff:ff:ff:ff:ff:ff

vport0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether aa:83:29:09:fa:bc brd ff:ff:ff:ff:ff:ff
    inet 10.21.0.15/24 brd 10.21.0.255 scope global vport0

In the packet captures, we see ARP replies from two source MAC addresses: 18:66:da:51:b3:eb (vmbr0) and aa:83:29:09:fa:bc (vport0). Before the ARP suppression feature was enabled, both replies would be seen by whatever sent the request, so this never caused an issue. Now ARP tables in our network are getting updated with just the 18:66:da:51:b3:eb address, which blackholes traffic. Later we see the ARP entries in our network updated again to the aa:83:29:09:fa:bc MAC address (probably due to genuine ARP requests), at which point the hosts are reachable again.
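For anyone wanting to check their own captures for the same symptom, here is a minimal Python sketch of the check described above. It is not the tooling we used; the function names are made up for illustration, and it assumes untagged Ethernet frames (no 802.1Q header) already extracted from a capture as raw bytes.

```python
from collections import defaultdict


def parse_arp_reply(frame: bytes):
    """Return (sender_ip, sender_mac) for an untagged ARP reply, else None."""
    # Bytes 12-13 hold the EtherType; 0x0806 is ARP.
    if len(frame) < 42 or frame[12:14] != b"\x08\x06":
        return None
    # Bytes 20-21 hold the ARP opcode; 2 is a reply.
    if int.from_bytes(frame[20:22], "big") != 2:
        return None
    mac = ":".join(f"{b:02x}" for b in frame[22:28])  # sender hardware address
    ip = ".".join(str(b) for b in frame[28:32])       # sender protocol address
    return ip, mac


def duplicate_responders(frames):
    """Map each IP that was answered by more than one MAC to those MACs."""
    seen = defaultdict(set)
    for frame in frames:
        parsed = parse_arp_reply(frame)
        if parsed:
            ip, mac = parsed
            seen[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}
```

Fed the replies from our capture, this should flag 10.21.0.15 as being answered by both MACs.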

We have disabled ARP Suppression for now, but the option to disable this feature will be removed in the next JunOS major version, so we need to work out what is causing both interfaces to generate the replies.

We can recreate the issue using arping, by turning ARP suppression back on and sending ARP packets to the IP with a source IP of 0.0.0.0. Using 0.0.0.0 as the source IP appears to be valid usage of ARP, and is how duplicate address detection probes are sent. Unfortunately this very detection is causing duplicate ARP responses, usefully enough!
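To make the probe concrete, here is a rough Python sketch of the kind of frame involved: a broadcast ARP request whose sender IP is 0.0.0.0. This is an illustration only, not what arping does internally. The helper names are hypothetical, and the send function assumes Linux (AF_PACKET raw sockets) and root privileges.

```python
import socket
import struct


def mac_to_bytes(mac: str) -> bytes:
    return bytes(int(part, 16) for part in mac.split(":"))


def ip_to_bytes(ip: str) -> bytes:
    return bytes(int(part) for part in ip.split("."))


def build_arp_probe(src_mac: str, target_ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP request with sender IP 0.0.0.0."""
    sha = mac_to_bytes(src_mac)
    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    eth = b"\xff" * 6 + sha + struct.pack("!H", 0x0806)
    # ARP header: Ethernet, IPv4, hlen=6, plen=4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += sha + ip_to_bytes("0.0.0.0")          # sender MAC, sender IP (all zeros)
    arp += b"\x00" * 6 + ip_to_bytes(target_ip)  # target MAC unknown, target IP
    return eth + arp


def send_frame(ifname: str, frame: bytes) -> None:
    """Send a raw frame on ifname (Linux only, needs root)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        sock.send(frame)
```

For example, build_arp_probe("88:a2:5e:e6:47:a0", "10.21.0.15") produces a request with the same fields as the flow used in the trace further down.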

$ sudo ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.7.0
DB Schema 7.14.0

I'm more than happy to provide more diagnostics and information. I have tried tracing with ofproto/trace, but it doesn't appear to give much insight into why it's happening, or whether OVS even believes it is happening.

$ ovs-appctl ofproto/trace vmbr0 in_port=1,arp,dl_src=88:a2:5e:e6:47:a0,dl_dst=ff:ff:ff:ff:ff:ff,arp_tpa=10.21.0.15,arp_spa=0.0.0.0,arp_op=1,arp_sha=88:a2:5e:e6:47:a0
Flow: arp,in_port=1,vlan_tci=0x0000,dl_src=88:a2:5e:e6:47:a0,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=0.0.0.0,arp_tpa=10.21.0.15,arp_op=1,arp_sha=88:a2:5e:e6:47:a0,arp_tha=00:00:00:00:00:00

bridge("vmbr0")
---------------
 0. priority 0
    NORMAL
     -> no learned MAC for destination, flooding

Final flow: unchanged
Megaflow: recirc_id=0,arp,in_port=1,vlan_tci=0x0000/0x1fff,dl_src=88:a2:5e:e6:47:a0,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=0.0.0.0,arp_tpa=10.21.0.15,arp_op=1
Datapath actions: 1,5,30,34,40,56,65,74,79
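In case it helps with reading the trace, here is a small Python sketch that pulls the list of output ports from the "Datapath actions" line. The function name is hypothetical. As far as I understand, these are datapath port numbers rather than OpenFlow port numbers, so they would need to be mapped to interface names separately (for example with ovs-dpctl show).

```python
import re


def datapath_output_ports(trace_output: str):
    """Return the numeric output ports listed on the 'Datapath actions' line."""
    match = re.search(r"^Datapath actions:\s*(.+)$", trace_output, re.MULTILINE)
    if not match:
        return []
    ports = []
    for token in match.group(1).split(","):
        token = token.strip()
        # Skip anything that is not a plain port number (e.g. other actions).
        if token.isdigit():
            ports.append(int(token))
    return ports
```

On the trace above this gives nine ports, which is consistent with the request simply being flooded by the NORMAL action.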

Thanks in advance

Stuart Howlette