[ovs-discuss] [OVN] ovn chassis fails after long period of time

John Bartelme bartelme at gmail.com
Thu Jun 11 13:13:05 UTC 2020


Hi Daniel,
     Thanks so much for the reply.  I'm using version 2.7-2.  If I can't
figure this out I'll try to upgrade, but that will be a lengthy and
challenging process, so I'd like to avoid it if possible.
     For the VM connection I was doing a ping test between the two VMs
connected to the same OVN switch but residing on different hypervisors.
I'm not sure about the ARP.  All I can say is that the VMs' ARP tables
aren't populated, and when I tcpdump port 6081 on the hypervisors I see the
Geneve-encapsulated ARP request go out of the failed host and come back in;
it just isn't making it to the VM's interface.  As best I can decode the
metadata on the Geneve packet coming back in, it looks like it is going to
the right tunnel ID.
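
The decoding mentioned above can be sketched in Python.  This is a minimal
sketch, not the procedure actually used in this thread.  It assumes the
fixed Geneve header layout from RFC 8926 and the OVN encoding described in
ovn-architecture(7): the logical datapath key in the 24-bit VNI, and the
logical ingress and egress port keys in an option with class 0x0102 and
type 0x80.  The function name decode_geneve and the sample bytes are
hypothetical, and the layout should be checked against the documentation
for the installed OVN version.

```python
import struct

# Option class/type assumed from ovn-architecture(7); verify for your version.
OVN_OPT_CLASS = 0x0102
OVN_OPT_TYPE = 0x80


def decode_geneve(header: bytes) -> dict:
    """Decode OVN metadata from a Geneve header (the bytes after the UDP header)."""
    if len(header) < 8:
        raise ValueError("a Geneve header is at least 8 bytes long")

    # Low 6 bits of byte 0: total option length in 4-byte words.
    opt_len = (header[0] & 0x3F) * 4
    # Bytes 4..6: the 24-bit VNI, which OVN uses for the logical datapath key.
    result = {
        "datapath_key": int.from_bytes(header[4:7], "big"),
        "ingress_port_key": None,
        "egress_port_key": None,
    }

    options = header[8:8 + opt_len]
    offset = 0
    while offset + 4 <= len(options):
        opt_class, opt_type, flags_len = struct.unpack_from("!HBB", options, offset)
        # Low 5 bits: option data length in 4-byte words (option header excluded).
        data_len = (flags_len & 0x1F) * 4
        data = options[offset + 4:offset + 4 + data_len]
        if opt_class == OVN_OPT_CLASS and opt_type == OVN_OPT_TYPE and len(data) >= 4:
            value = int.from_bytes(data[:4], "big")
            # MSB to LSB: 1 reserved bit, 15-bit ingress port, 16-bit egress port.
            result["ingress_port_key"] = (value >> 16) & 0x7FFF
            result["egress_port_key"] = value & 0xFFFF
        offset += 4 + data_len
    return result


# Hypothetical sample: VNI 5, one OVN option carrying ingress 3 and egress 7.
sample = bytes.fromhex("0240655800000500" "0102800100030007")
print(decode_geneve(sample))
```

The decoded keys can then be compared with the tunnel_key columns of the
Datapath_Binding and Port_Binding tables in the southbound database.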
     I have looked at the MAC bindings in the past and it isn't that; the
problem also shows up on every new switch I create.  I went a step
further and dumped the OVS, SB, and NB databases and searched for the known
IPs/MACs in use, and nothing stood out.
     I have the ovn-controller logs at dbg level, and while trying to
communicate no entries related to the ping are printed on either the
working or the non-working hypervisor.

Thanks, john

On Thu, Jun 11, 2020, 2:45 AM Daniel Alvarez <dalvarez at redhat.com> wrote:

> Hi John,
>
> On 10 Jun 2020, at 23:22, John Bartelme <bartelme at gmail.com> wrote:
>
> Hello,
>
> I’m trying to run down an issue with a couple of my servers, but I’m
> having a really hard time pinpointing the root cause.  I have around 250
> servers up and running, and after about a year one of the servers is no
> longer able to communicate over OVN.  About two months later
> another server fell into this same state.  For a given ovn switch any two
> VMs connected to that switch can talk to each other unless one of the
> endpoints resides on one of these failed servers.  If both VMs are on the
> same server they have no problem communicating through the ovs bridge.
> Even with various debug logging turned up, I can’t determine why these servers are
> having issues.  Ovn-trace shows that it should work.  I see their chassis
> in the southbound database.  Doing tcpdump on the different servers I can
> see a geneve encapsulated arp going out of the server and coming back in.
> It never seems to get to the VM interface though. Tcpdump on the VM interface
> only shows the arp going out and never coming back.   Turning up
> openvswitch debug I see debug statements saying the flow is sent but I
> never see flow received like I do on working boxes.  What other tools/debug
> can I bring to bear to try and figure out what is wrong?  It feels like
> perhaps something isn’t getting cleaned up somewhere.  Again I have many
> servers working with the same configuration as these two servers and these
> two servers used to work without issue.  I’ve tried completely
> re-installing the OS and reconfiguring the bad servers and the problem
> still persists.   I have a lot of users using this setup but I may try and
> upgrade to a newer version of OVS (2.12) from the 2.7-2 that I’m on now if I can
> get some system downtime.  I’m also currently using RHEL 7.8 as the OS.
>
> What version of OVN are you using? The one shipped with RHEL? Can you
> share the exact version of it? If it is ovn2.11 I remember some issues with
> conjunctive flows but I don’t think this could be the case as you say that
> VMs within that one server can talk to each other.
>
> Also you mention that comm between VMs on different servers doesn’t work
> if one of them lives on that server but yet you see ARP traffic going out
> the tunnel. This is not expected if the two VMs belong to OVN, as
> ovn-controller will reply to the ARP request. Did I understand the scenario
> right?
>
> This is a total blind guess from my end but if you reinstalled everything
> and it still doesn’t work, could it be some wrong MAC_Binding entry in the
> SB database? I don’t know your topology so I’m totally guessing here.
> You could delete all MAC binding entries for that particular logical
> switch and see if it makes a change.
>
> Also it looks like you have inspected local OVS logs but what about local
> ovn-controller logs in the faulty hypervisor?
>
> Daniel
>
> Thanks, john