[ovs-discuss] Duplicate ARP Problem
stuart.howlette at cloudcall.com
Wed Aug 14 06:44:35 UTC 2019
We are currently running Proxmox, backed by Open vSwitch. Until recently we had not noticed any issues with this setup. We have since upgraded our data centre switching (Juniper QFX), which enables an ARP Suppression feature by default. This now appears to be suppressing some ARP traffic, and we are intermittently losing access to our Proxmox hosts.
The setup on one of them is as follows (all three of our hosts are experiencing the same issue):
# Main interface
iface bond0 inet manual
ovs_bonds eno3 eno4
ovs_options lacp=active bond_mode=balance-tcp
iface lo inet loopback
# Interface to secondary network
iface eno1 inet manual
# Mirror to port capture server
iface eno2 inet manual
iface eno3 inet manual
iface eno4 inet manual
# Management interface
iface vport0 inet static
# Secondary network
iface vport1 inet static
iface vmbr0 inet manual
ovs_ports bond0 vport0 eno2
iface vmbr1 inet manual
ovs_ports eno1 vport1
We appear to be hitting some strange behaviour where two interfaces on
the hosts respond to ARP, with different MACs, and interestingly only if
the source address of the ARP packet is 0.0.0.0.
$ ip a | grep -EiA2 "vmbr0|vport0"
vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 18:66:da:51:b3:eb brd ff:ff:ff:ff:ff:ff
vport0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether aa:83:29:09:fa:bc brd ff:ff:ff:ff:ff:ff
inet 10.21.0.15/24 brd 10.21.0.255 scope global vport0
In the packet captures, we see ARP replies with a source MAC address of 18:66:da:51:b3:eb and aa:83:29:09:fa:bc. As noted, before this ARP suppression feature was enabled, both ARP replies would be seen by anything requesting it, and therefore never caused an issue. Now ARP tables in our network are getting updated with just the 18:66:da:51:b3:eb address, which blackholes traffic. We will then later see the ARP entries in our network updated again to the aa:83:29:09:fa:bc MAC address (probably due to genuine ARP requests), at which point the hosts are reachable again.
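To make the symptom concrete, here is a minimal sketch of the check we are effectively doing by eye in the captures: group ARP replies by the IP they claim and flag any IP answered by more than one MAC. The function name and the (ip, mac) input format are assumptions for illustration only; extracting the pairs from a capture is not shown.

```python
from collections import defaultdict


def find_duplicate_responders(replies):
    """Return {ip: [macs]} for every IP claimed by more than one MAC.

    `replies` is an iterable of (sender_ip, sender_mac) pairs taken from
    ARP replies in a capture (hypothetical input format).
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in replies:
        # Normalise case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {
        ip: sorted(macs)
        for ip, macs in macs_by_ip.items()
        if len(macs) > 1
    }


if __name__ == "__main__":
    # The two MACs seen answering for 10.21.0.15 in our captures.
    seen = [
        ("10.21.0.15", "18:66:da:51:b3:eb"),
        ("10.21.0.15", "aa:83:29:09:fa:bc"),
    ]
    print(find_duplicate_responders(seen))
```

With ARP suppression enabled, only one of those replies reaches a given requester, so which MAC ends up in its ARP table depends on which reply gets through.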
We have disabled ARP Suppression for now, but the option to disable this feature will be removed in the next JunOS major version, so we need to work out what is causing both interfaces to generate the replies.
We can recreate the issue using arping, by turning ARP suppression back on and sending ARP requests for the IP with a source IP of 0.0.0.0. Using 0.0.0.0 as a source IP appears to be valid usage of ARP, and is used for duplicate address detection. Unfortunately this very detection is causing duplicate ARP responses, usefully enough!
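For anyone wanting to reproduce this without arping, the following is a rough sketch of the frame involved: a broadcast ARP request whose sender IP is 0.0.0.0. The helper names are my own, the sender MAC in the example is simply the one from our test client, and `send_frame` assumes Linux (AF_PACKET), root privileges, and an interface name that you supply. It is not what arping does internally, just an equivalent packet.

```python
import socket
import struct


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_arp_probe(sender_mac: str, target_ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP request with sender IP 0.0.0.0."""
    src = mac_to_bytes(sender_mac)
    # Ethernet header: broadcast destination, our source, EtherType ARP.
    eth = b"\xff" * 6 + src + struct.pack("!H", 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                            # hardware type: Ethernet
        0x0800,                       # protocol type: IPv4
        6,                            # hardware address length
        4,                            # protocol address length
        1,                            # opcode: request
        src,                          # sender hardware address
        socket.inet_aton("0.0.0.0"),  # sender IP left unspecified
        b"\x00" * 6,                  # target hardware address (unknown)
        socket.inet_aton(target_ip),  # target IP being queried
    )
    return eth + arp


def send_frame(ifname: str, frame: bytes) -> None:
    """Send a raw frame on `ifname` (Linux only, needs root). Not called here."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((ifname, 0))
        s.send(frame)


if __name__ == "__main__":
    frame = build_arp_probe("88:a2:5e:e6:47:a0", "10.21.0.15")
    print(frame.hex())
```

Sending this while capturing on the requesting machine should show whether one or two replies come back, and from which MACs.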
$ sudo ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.7.0
DB Schema 7.14.0
I'm more than happy to provide more diagnostics and information. I have tried some of the protocol tracing, but it doesn't appear to give much insight into why it's happening, or even show that OVS believes it is happening:
$ ovs-appctl ofproto/trace vmbr0 in_port=1,arp,dl_src=88:a2:5e:e6:47:a0,dl_dst=ff:ff:ff:ff:ff:ff,arp_tpa=10.21.0.15,arp_spa=0.0.0.0,arp_op=1,arp_sha=88:a2:5e:e6:47:a0
0. priority 0
-> no learned MAC for destination, flooding
Final flow: unchanged
Datapath actions: 1,5,30,34,40,56,65,74,79
Thanks in advance