[ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Venugopal Iyer venugopali at nvidia.com
Fri May 22 15:39:39 UTC 2020


A couple of comments below:

________________________________________
From: ovn-kubernetes at googlegroups.com <ovn-kubernetes at googlegroups.com> on behalf of Han Zhou <zhouhan at gmail.com>
Sent: Thursday, May 21, 2020 7:43 PM
To: Girish Moodalbail
Cc: Tim Rozet; Venugopal Iyer; Dumitru Ceara; Han Zhou; Dan Winship; ovs-discuss; ovn-kubernetes at googlegroups.com; Michael Cambria
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

External email: Use caution opening links or attachments



On Thu, May 21, 2020 at 7:12 PM Girish Moodalbail <gmoodalbail at gmail.com> wrote:


On Thu, May 21, 2020 at 6:58 PM Tim Rozet <trozet at redhat.com> wrote:
On Thu, May 21, 2020 at 8:45 PM Venugopal Iyer <venugopali at nvidia.com> wrote:
Hi, Han:

________________________________________
From: ovn-kubernetes at googlegroups.com <ovn-kubernetes at googlegroups.com> on behalf of Han Zhou <zhouhan at gmail.com>
Sent: Thursday, May 21, 2020 4:42 PM
To: Tim Rozet
Cc: Venugopal Iyer; Dumitru Ceara; Girish Moodalbail; Han Zhou; Dan Winship; ovs-discuss; ovn-kubernetes at googlegroups.com; Michael Cambria
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table




On Thu, May 21, 2020 at 2:35 PM Tim Rozet <trozet at redhat.com> wrote:
I think that if you directly connect the GR to the DR, you don't need to learn any ARP via packet-in and you can preprogram the static entries. Each GR will have 1 entry for the DR, while the DR will have N entries for N nodes.

Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N ports on the DR and also requires a lot of small subnets, which is not desirable. And since changes are needed anyway in OVN to support that, we moved forward with the current approach of avoiding the static ARP flows to solve the problem instead of directly connecting GRs to DR.

Why is that not desirable? They are all private subnets with /30 (if using IPv4). With IPv6, it's even less of a concern from an addressing perspective.
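For a sense of scale, point-to-point /30s are cheap address-wise. A quick sketch with Python's ipaddress module (the 100.64.0.0/16 block here is an arbitrary example range, not what ovn-kubernetes actually uses):

```python
import ipaddress

# Carve point-to-point /30 subnets (2 usable hosts each) out of one
# private block for the N GR<->DR links. 100.64.0.0/16 is an
# arbitrary example, not taken from any real deployment.
block = ipaddress.ip_network("100.64.0.0/16")
p2p = list(block.subnets(new_prefix=30))

print(len(p2p))                      # 16384 /30 links fit in a /16
gr_ip, dr_ip = list(p2p[0].hosts())  # the two usable addresses
print(gr_ip, dr_ip)                  # 100.64.0.1 100.64.0.2
```

So even a single /16 comfortably covers thousands of GR-DR links; the concern raised below is less about addresses than about the flows those links generate.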

It is not just about subnet management but also about the additional logical flows created by the two different ways of connecting the DR and GRs.

Say we have a fix that efficiently allows one to connect 1000s of GRs using a single logical switch; would you rather use that instead of 1000 patch cables connecting each GR to the DR? It is not only the issue of subnet management for those 1000 point-to-point connections. Those 1000 patch ports are also local to each chassis, so we need to understand, in such a topology, how many additional logical flows get created in the SB and how many OpenFlow flows get created on each of the 1000 chassis for those 1000 patch cables.


The real issue with ARP learning comes from the GR-----External side. You have to learn these, and from my conversation with Girish it seems every GR adds an entry for every ARP request it sees. This means one GR sends an ARP request to the external L2 network, and every GR sees that request and adds an entry. I think the behavior should be:

GRs only add ARP entries when:

  1.  An ARP response is sent to them
  2.  The GR receives a GARP broadcast and already has an entry in its cache for that IP (Girish mentioned this is similar to Linux arp_accept behavior)

For 2), it is expensive to do in OVN because OpenFlow doesn't support a match condition of "field1 == field2", which is required to check whether an incoming ARP request is a GARP, i.e. SPA == TPA. However, it is OK to support something similar to the Linux arp_accept configuration, but slightly different: in OVN we can configure it to allow/disable learning from all ARP requests to IPs not belonging to the router, including GARPs. Would that solve the problem here? (@Venugopal Iyer <venugopali at nvidia.com> brought up the same thing about "arp_accept". I hope this reply addresses that as well.)
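To illustrate why GARP detection needs a field-to-field comparison: a gratuitous ARP is simply an ARP whose sender protocol address (SPA) equals its target protocol address (TPA). A minimal sketch in Python of that check over the raw ARP payload (the helper names and test addresses are made up for illustration; this is not OVN code):

```python
import struct

def is_garp(arp_payload: bytes) -> bool:
    """Return True if an Ethernet/IPv4 ARP payload (no L2 header) is
    gratuitous, i.e. SPA == TPA. This comparison of one packet field
    against another is exactly what a plain OpenFlow match cannot
    express."""
    # Payload layout: htype(2) ptype(2) hlen(1) plen(1) oper(2)
    #                 sha(6) spa(4) tha(6) tpa(4)
    spa = arp_payload[14:18]
    tpa = arp_payload[24:28]
    return spa == tpa

def arp_request(sha: bytes, spa: bytes, tpa: bytes) -> bytes:
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), opcode=1 (request)
    hdr = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    return hdr + sha + spa + b"\x00" * 6 + tpa

mac = bytes.fromhex("0a0000000001")
ip = bytes([192, 0, 2, 1])
other = bytes([192, 0, 2, 2])
print(is_garp(arp_request(mac, ip, ip)))     # True  (SPA == TPA: GARP)
print(is_garp(arp_request(mac, ip, other)))  # False (ordinary request)
```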

I think the issue there is that if you have an external device which is using a VIP and it fails over, it will usually send a GARP to announce the MAC change. In this case, if you ignore GARP, what happens? You won't send another ARP because OVN programs the ARP entry forever and doesn't expire it, right? So you won't learn the new MAC and will keep sending packets to a dead MAC?

I think we will have to support GARP, otherwise VIPs will not work, as Tim mentions. If we do learn from GARP, then as long as the GARP itself is not originated by any of the 1000s of GRs, we should be fine.

Right, I hadn't thought this through. I thought it was just a configurable option, but it seems we will always need to support GARP, so the option becomes useless.
However, there is no easy way to achieve "learn from GARP as long as the GARP itself is not originated by any of the 1000s of GRs", because OVN doesn't have knowledge of the use case. The requirement amounts to: don't learn neighbours from ARP requests if the ARP's source belongs to OVN routers. Firstly, this requirement is hard to understand for users not coming from this particular ovn-k8s setup. Secondly, implementing it would require O(n^2) flows just to bypass the OVN-owned router IPs, which is useless for the original problem. We will have to figure out a clean way.


<vi> I suppose the use of GARP as a request vs. a reply is not very clear; [1], Section 3 offers a concise summary of this. If the application sends
<vi> the GARP as a reply we are covered, but the question is: if the GARP is a request (which is allowed), what should our response be? Tim is right, we can't ignore
<vi> the request (more so since aging is not supported currently); however, "arp_accept" only governs ignoring the request for creating a new cache entry, not updating
<vi> an existing one (see the last para below)

[2]
arp_accept - BOOLEAN
	Define behavior for gratuitous ARP frames who's IP is not
	already present in the ARP table:
	0 - don't create new entries in the ARP table
	1 - create new entries in the ARP table

	Both replies and requests type gratuitous arp will trigger the
	ARP table to be updated, if this setting is on.

	If the ARP table already contains the IP address of the
	gratuitous arp frame, the arp table will be updated regardless
	if this setting is on or off.

<vi> If we look up and get a hit, we should still process the GARP; only if we don't have a hit should we ignore it (instead of
<vi> creating an entry). BTW, do we update today? If I understand the use of reg9[2] / REGBIT_LOOKUP_NEIGHBOR_RESULT (assuming lookup_arp
<vi> returns 1 if the entry exists), I am not sure it does? Maybe I missed it ..
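To make the arp_accept semantics quoted above concrete, here is a minimal toy model in Python (the cache, function name, and addresses are hypothetical, not taken from OVN or the kernel): an existing entry is refreshed regardless of the setting, while a new entry is created only when arp_accept is on.

```python
# Toy model of the kernel's documented arp_accept behavior for a
# gratuitous ARP carrying the binding (ip -> mac). Not real code
# from OVN or Linux; just the decision logic.
def handle_garp(cache: dict, ip: str, mac: str, arp_accept: bool) -> None:
    if ip in cache:
        cache[ip] = mac      # hit: update regardless of arp_accept
    elif arp_accept:
        cache[ip] = mac      # miss: create only when arp_accept = 1
    # miss with arp_accept = 0: the GARP is ignored, no new entry

cache = {"192.0.2.1": "0a:00:00:00:00:01"}
handle_garp(cache, "192.0.2.1", "0a:00:00:00:00:99", arp_accept=False)
handle_garp(cache, "192.0.2.2", "0a:00:00:00:00:02", arp_accept=False)
print(cache)  # existing entry refreshed to the new MAC; 192.0.2.2 not added
```

This matches the failover concern: even with arp_accept off, a VIP that already has a cache entry still gets its MAC updated by the GARP.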

thanks,

-venu

[1] https://www.ietf.org/rfc/rfc5227.txt


For the internal join switch this is easier. I think allowing LRs to broadcast only GARP requests and ARP requests to unknown IPs (all others will be unicast) will solve the problem. But for the external logical switch, I have no idea. Can it be handled from the operator's perspective, by initiating a ping from external to the GR, so that the GR learns the external GW IP-MAC binding before sending broadcasts to all neighbours?

Regards,
~Girish



