[ovs-discuss] [OVN] ovn chassis fails after long period of time

Daniel Alvarez dalvarez at redhat.com
Thu Jun 11 06:45:23 UTC 2020


Hi John,

> On 10 Jun 2020, at 23:22, John Bartelme <bartelme at gmail.com> wrote:
> 
> 
> Hello,
> 
> I’m trying to run down an issue with a couple of my servers, but I’m having a really hard time pinpointing the root cause. I have around 250 servers up and running, and after about a year one of the servers is no longer able to communicate over OVN. About two months later another server fell into this same state.
>
> For a given OVN switch, any two VMs connected to that switch can talk to each other unless one of the endpoints resides on one of these failed servers. If both VMs are on the same server, they have no problem communicating through the OVS bridge.
>
> Turning up various debug settings, I can’t determine why these servers are having issues. ovn-trace shows that it should work. I see their chassis in the southbound database. Doing tcpdump on the different servers, I can see a Geneve-encapsulated ARP going out of the server and coming back in. It never seems to get to the VM interface, though. tcpdump on the VM interface only shows the ARP going out and never coming back. Turning up openvswitch debug, I see debug statements saying the flow is sent, but I never see flow received like I do on working boxes.
>
> What other tools/debug can I bring to bear to try and figure out what is wrong? It feels like perhaps something isn’t getting cleaned up somewhere. Again, I have many servers working with the same configuration as these two servers, and these two servers used to work without issue. I’ve tried completely re-installing the OS and reconfiguring the bad servers, and the problem still persists.
>
> I have a lot of users using this setup, but I may try and upgrade to a newer version of ovs (2.12) vs. the 2.7-2 that I’m on now if I can get some system downtime. I’m also currently using RHEL 7.8 as the OS.
> 
What version of OVN are you using? The one shipped with RHEL? Can you share the exact version of it? If it is ovn2.11 I remember some issues with conjunctive flows but I don’t think this could be the case as you say that VMs within that one server can talk to each other.

Also, you mention that communication between VMs on different servers doesn’t work if one of them lives on that server, yet you see ARP traffic going out the tunnel. This is not expected if the two VMs belong to OVN, as ovn-controller will reply to the ARP request. Did I understand the scenario right?

This is a blind guess on my end, but if you reinstalled everything and it still doesn’t work, could it be a wrong MAC_Binding entry in the SB database? I don’t know your topology, so I’m guessing here.
You could delete all MAC_Binding entries for that particular logical switch and see if it makes a difference.
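
A minimal sketch of that cleanup, assuming Python 3.7+ on a host where ovn-sbctl can reach the SB database. The helper names and the port name `lrp-example` are hypothetical. MAC_Binding rows carry a `logical_port` column, so the sketch filters on that rather than on the switch itself; adjust the filter to your topology. It defaults to a dry run that only prints the destroy commands.

```python
import subprocess


def find_mac_binding_cmd(logical_port):
    """Build the command that prints UUIDs of MAC_Binding rows for one port."""
    return ["ovn-sbctl", "--bare", "--columns=_uuid",
            "find", "MAC_Binding", f"logical_port={logical_port}"]


def destroy_mac_binding_cmd(uuid):
    """Build the command that deletes a single MAC_Binding row."""
    return ["ovn-sbctl", "destroy", "MAC_Binding", uuid]


def parse_uuids(output):
    """--bare output has one value per line, with blank lines between rows."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def flush_mac_bindings(logical_port, dry_run=True):
    """Print (dry run) or execute the destroy commands for the matching rows."""
    result = subprocess.run(find_mac_binding_cmd(logical_port),
                            check=True, capture_output=True, text=True)
    cmds = [destroy_mac_binding_cmd(u) for u in parse_uuids(result.stdout)]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return cmds


if __name__ == "__main__":
    # "lrp-example" is a placeholder, not a name from this thread.
    flush_mac_bindings("lrp-example", dry_run=True)
```

Running `ovn-sbctl list MAC_Binding` first and saving the output is a cheap way to keep a record of what was there before anything is destroyed.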

Also it looks like you have inspected local OVS logs but what about local ovn-controller logs in the faulty hypervisor?
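
To narrow those logs down, here is a small sketch that keeps only the warning and error lines. It assumes the usual OVS log line layout of `timestamp|sequence|module|level|message`; the function names are hypothetical, and the log file location depends on your packaging, so it is taken as a command-line argument.

```python
import sys

# Severities worth a first look; INFO and DBG lines are skipped.
NOTABLE_LEVELS = {"EMER", "ERR", "WARN"}


def parse_vlog_line(line):
    """Split one log line into its five fields, or return None if it
    does not match the timestamp|seq|module|level|message layout."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, seq, module, level, message = parts
    return {"timestamp": timestamp, "seq": seq, "module": module,
            "level": level, "message": message}


def notable_lines(lines, levels=NOTABLE_LEVELS):
    """Return the parsed entries whose level is in `levels`."""
    entries = []
    for line in lines:
        entry = parse_vlog_line(line)
        if entry is not None and entry["level"] in levels:
            entries.append(entry)
    return entries


if __name__ == "__main__":
    # Usage: python3 filter_log.py <path to ovn-controller.log>
    with open(sys.argv[1], errors="replace") as log:
        for e in notable_lines(log):
            print(e["timestamp"], e["module"], e["level"], e["message"])
```

Comparing the output from a faulty hypervisor against a working one over the same time window may show which module starts complaining first.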

Daniel 

> Thanks, john