[ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Fri May 22 00:45:05 UTC 2020

Hi, Han:

________________________________________
From: ovn-kubernetes at googlegroups.com <ovn-kubernetes at googlegroups.com> on behalf of Han Zhou <zhouhan at gmail.com>
Sent: Thursday, May 21, 2020 4:42 PM
To: Tim Rozet
Cc: Venugopal Iyer; Dumitru Ceara; Girish Moodalbail; Han Zhou; Dan Winship; ovs-discuss; ovn-kubernetes at googlegroups.com; Michael Cambria
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

On Thu, May 21, 2020 at 2:35 PM Tim Rozet <trozet at redhat.com> wrote:
I think that if you directly connect GR to DR you don't need to learn any ARP with packet_in and you can preprogram the static entries. Each GR will have 1 entry for the DR, while the DR will have N entries for N nodes.

Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N ports on the DR and also requires a lot of small subnets, which is not desirable. And since changes are needed anyway in OVN to support that, we moved forward with the current approach of avoiding the static ARP flows to solve the problem instead of directly connecting GRs to DR.

The real issue with ARP learning comes from the GR-----External. You have to learn these, and from my conversation with Girish it seems like every GR is adding an entry on every ARP request it sees. This means 1 GR sends ARP request to external L2 network and every GR sees the ARP request and adds an entry. I think the behavior should be:

GRs only add ARP entries when:

  1.  An ARP Response is sent to it
  2.  The GR receives a GARP broadcast, and already has an entry in its cache for that IP (Girish mentioned this is similar to Linux arp_accept behavior)

For 2), it is expensive to do in OVN because OpenFlow doesn't support a match condition of "field1 == field2", which is required to check whether the incoming ARP request is a GARP, i.e. SPA == TPA. However, it is OK to support something similar to the Linux arp_accept configuration, but slightly different. In OVN we can make it configurable to allow/disable learning from all ARP requests to IPs not belonging to the router, including GARPs. Would that solve the problem here? (@Venugopal Iyer brought up the same thing about "arp_accept". I hope this reply addresses that as well.)
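A rough sketch of the learning policy described above, in Python rather than logical flows. The names `ArpRequest`, `should_learn` and the `learn_from_arp_request` knob are hypothetical and only illustrate the decision; they are not OVN options or APIs. The point is that the decision never needs the SPA == TPA comparison: it only checks whether the target IP belongs to the router.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArpRequest:
    spa: str  # sender protocol (IP) address
    sha: str  # sender hardware (MAC) address
    tpa: str  # target protocol (IP) address


def is_garp(req):
    """A gratuitous ARP request announces its own address: SPA == TPA.

    Shown only for reference; the policy below deliberately avoids it.
    """
    return req.spa == req.tpa


def should_learn(req, router_ips, learn_from_arp_request):
    """Decide whether a router adds (SPA, SHA) to its neighbour cache.

    learn_from_arp_request is a hypothetical per-router knob (assumption).
    """
    if req.tpa in router_ips:
        # The request is addressed to one of the router's own IPs.
        return True
    # Request for some other IP, which includes GARPs: learn only if enabled.
    return learn_from_arp_request
```

With the knob disabled, a gateway router would ignore ARP requests that other gateway routers broadcast onto the shared L2 network, while still learning from requests addressed to itself.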

<vi> I can't think of any side effects to this, so it seems fine to me to do so. I believe Linux behaves that way w.r.t. ARP requests
<vi> anyway (assuming I am reading it right).

https://elixir.bootlin.com/linux/v5.7-rc6/source/net/ipv4/arp.c (L874)

thanks,

-venu

In addition, as Michael Cambria pointed out in our weekly meeting, these ARP cache entries should have expiry timers on them. If they are permanently learned, you will end up with a growing ARP table over time, and end up in the same place. We can probably just program the GR ARP flows with an idle_timeout and have the flow removed. What do you think?
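The expiry idea can be pictured as a neighbour cache with an idle timeout. The sketch below is plain Python and only models the behaviour being asked for; `NeighbourCache` and its methods are hypothetical names, and nothing here reflects how OVN stores MAC bindings or programs flows.

```python
class NeighbourCache:
    """Toy ARP cache whose entries are dropped after being idle too long."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self._entries = {}  # ip -> (mac, time of last use)

    def learn(self, ip, mac, now):
        self._entries[ip] = (mac, now)

    def lookup(self, ip, now):
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, last_used = entry
        if now - last_used > self.idle_timeout:
            # Idle for too long: forget the binding so it must be re-resolved.
            del self._entries[ip]
            return None
        # A successful lookup counts as use and refreshes the idle timer.
        self._entries[ip] = (mac, now)
        return mac

    def __len__(self):
        return len(self._entries)
```

Entries that keep being used stay in the cache, while bindings for hosts that have gone quiet eventually disappear instead of accumulating.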

This has been discussed before. It is also mentioned in the TODO.rst. However, it has not been taken care of because no good solution has been found yet. It can be done, but it will be expensive and the gains are not worth the cost. Accepting ARP requests partially reduces the need for ARP expiration. It is true that it could still be a problem in some scenarios, but so far we haven't heard of any use case that has a hard dependency on this.

Should I file a bugzilla outlining the above so we can have proper tracking?

I think bugzilla is out of the control of OVN community, so please feel free to file or not file ;)

Thanks,
Han

Thanks,

Tim Rozet
Red Hat CTO Networking Team

On Thu, May 21, 2020 at 5:01 PM Han Zhou <zhouhan at gmail.com> wrote:

On Thu, May 21, 2020 at 10:33 AM Venugopal Iyer <venugopali at nvidia.com> wrote:
Han,

just a quick question below..

________________________________________
From: ovn-kubernetes at googlegroups.com <ovn-kubernetes at googlegroups.com> on behalf of Girish Moodalbail <gmoodalbail at gmail.com>
Sent: Tuesday, May 19, 2020 11:09 PM
To: Han Zhou
Cc: Han Zhou; Dan Winship; ovs-discuss; ovn-kubernetes at googlegroups.com
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Hello Han,

Please see in-line:

On Sat, May 16, 2020 at 11:17 PM Han Zhou <zhouhan at gmail.com> wrote:

On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail <gmoodalbail at gmail.com> wrote:
Hello Han,

Can you please explain how the dynamic resolution of the IP-to-MAC will work with this new option set?

Say the packet is being forwarded from router2 towards the distributed router. The nexthop (reg0) is then set to IP1, and we need to find the MAC address M1 to set eth.dst to.

+----------------+        +----------------+
|   l3gateway    |        |   l3gateway    |
|    router2     |        |    router3     |
+-------------+--+        +-+--------------+
            IP2,M2         IP3,M3
              |             |
           +--+-------------+---+
           |    join switch     |
           +---------+----------+
                     |
                  IP1,M1
             +-------+--------+
             |  distributed   |
             |     router     |
             +----------------+

The MAC M1 will obviously not be in the MAC_Binding table. On the hypervisor where the packet originated, router2's port and the distributed router's port are both locally present. So, does this result in a PACKET_IN to the ovn-controller, with the resolution happening there?

Yes there will be a PACKET_IN, and then:
1. ovn-controller will generate the ARP request for IP1, and send PACKET_OUT to OVS.
2. The ARP request will be delivered to the distributed router pipeline only, because of special handling of ARP in OVN for IPs of router ports, although it is a broadcast. (It would have been broadcast to all GRs without that special handling.)
3. The distributed router pipeline should learn the IP-MAC binding of IP2-M2 (through a PACKET_IN to ovn-controller), and at the same time send an ARP reply to router2 in the distributed router pipeline.
4. Router2 pipeline will handle the ARP response and learn the IP-MAC binding of IP1-M1 (through a PACKET_IN to ovn-controller).

Unfortunately, the ARP request (who has IP1) from router2 is broadcast out to all of the chassis through the Geneve tunnels. The other gateway routers learn the source MAC 'M2'. Now, each of the gateway routers has an entry for (IP2, M2) in the MAC_Binding table on its respective rtoj-<blah> router port. So, the MAC_Binding table will now have N x N entries, where N is the number of gateway routers.

Per your explanation above, the ARP request should not have been broadcast, right?
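To make the N x N growth concrete, here is a small simulation in Python. It is a simplified model, not OVN code: `simulate_join_switch` is a hypothetical helper, each gateway router ARPs once for the distributed router's IP, and a binding is recorded as a (learning router, learned IP) pair.

```python
def simulate_join_switch(n, broadcast_to_all):
    """Count bindings after each of n gateway routers ARPs for the DR once.

    broadcast_to_all=False models the request reaching only the DR;
    broadcast_to_all=True models it being flooded to every other GR.
    """
    bindings = set()  # (router that learned, IP that was learned)
    for i in range(n):
        sender = f"GR{i}"
        sender_ip = f"IP_GR{i}"
        # The DR learns the sender's binding from the ARP request.
        bindings.add(("DR", sender_ip))
        # The sender learns the DR's binding from the ARP reply.
        bindings.add((sender, "IP_DR"))
        if broadcast_to_all:
            # Every other GR also sees the request and learns the sender.
            for j in range(n):
                if j != i:
                    bindings.add((f"GR{j}", sender_ip))
    return len(bindings)


if __name__ == "__main__":
    for n in (3, 100, 1000):
        print(n, simulate_join_switch(n, False), simulate_join_switch(n, True))
```

In this model the targeted case grows as 2n, while the flooded case grows as n^2 + n, which is the explosion described above.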

<vi> probably obvious and I am missing it, but..
<vi> I see the lflow to direct ARP requests to the router port, instead of bcast. However,
<vi> we also add flows to bcast self-originated (unsolicited?) ARP requests (we should
<vi> not see this for router IPs, I suppose). But, given we just match on the source
<vi> MAC address of the packet for such packets, does it differ from the ARP
<vi> request generated for a router IP?

Good catch! That seems to be the reason why it is broadcast. I thought the feature only allowed GARP to be broadcast, but it actually allows (G)ARP, including regular ARP requests generated by the LRs. It can be an easy fix to commit 32f5ebb062 ("ovn-northd: Limit ARP/ND broadcast domain whenever possible."), but I am not sure if there are other concerns with doing that. @Dumitru Ceara, please comment on whether we can restrict it to GARP only.

On the other hand, in this use case, if there is any ARP request from the distributed router to any of the GRs, then all the GRs should have learned the IP1-M1 MAC binding, and they won't send ARP requests for IP1 any more, which would avoid the N x N MAC bindings, right? In the real use case, it may depend on which direction of traffic comes first. If it is always from external to k8s workloads first, then yes, it will eventually end up with N x N MAC bindings.

thanks,

-venu

Note that the direction of the ARP request is from Gateway Router to Distributed Router.

Regards,
~Girish

What about the resolution of IP3-to-M3 on gateway router2? Will there be an ARP request packet that is broadcast on the join switch for this case?

I think in the use case of ovn-k8s, as you described before, this should not happen. However, if it does happen, it is similar to the above steps, except that in steps 2) and 3) the ARP request and response will be sent between the chassis through the tunnel. If this happens between all pairs of GRs, then there will again be O(n^2) MAC_Binding entries.

I haven't tested the GR scenario yet, so I can't guarantee it works as expected. Please let me know if you see any problems. I will submit a formal patch with more test cases once it is confirmed in your environment.

Thanks,
Han

Regards,
~Girish

On Sat, May 16, 2020 at 10:25 AM Girish Moodalbail <gmoodalbail at gmail.com> wrote:

On Sat, May 16, 2020 at 12:36 AM Han Zhou <zhouhan at gmail.com> wrote:

On Tue, May 5, 2020 at 11:57 AM Han Zhou <hzhou at ovn.org> wrote:
>
>
>
> On Fri, May 1, 2020 at 2:14 PM Dan Winship <danwinship at redhat.com> wrote:
> >
> > On 5/1/20 12:37 PM, Girish Moodalbail wrote:
> > > If we now look at table=12 (lr_in_arp_resolve) in the ingress pipeline
> > > of Gateway Router-1, then you will see that there will be 2000 logical
> > > flow entries...
> >
> > > In the topology above, the only intended path is North-South between
> > > each gateway router and the logical router. There is no east-west
> > > traffic between the gateway routers
> >
> > > Is there another way to solve the above problem while just keeping the
> > > single join logical switch?
> >
> > Two thoughts:
> >
> > 1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
> > just lets ARP requests pass through normally, and lets ARP replies pass
> > through normally as long as they are correct (ie, it doesn't let
> > spoofing through). This means fewer flows but more traffic. Maybe that's
> > the right tradeoff?
> >
> The 2M entries here are not for the ARP responder, but are more equivalent to the neighbour table (or ARP cache) on each LR. The ARP responder resides in the LS (join logical switch), which is O(n) instead of O(n^2), so it is not a problem here.
>
> However, a similar idea may work here to avoid the O(n^2) scale issue. For the neighbour table, OVN actually has two parts: one is statically built, which is the 2M entries mentioned in this case, and the other is the dynamic ARP resolution, i.e. the mac_binding table, which is dynamically populated by handling ARP messages. To solve the problem here, it is possible to change OVN to support configuring an LR to avoid the static neighbour table and rely only on dynamic ARP resolution. In this case, all the gateway routers can be configured as not using static ARP resolution, and eventually there will be only 2 entries (one for IPv4 and one for IPv6) for each gateway router in the mac_binding table for the north-south traffic to the join router. (Of course, there will still be the same amount of mac_bindings in each router for the external traffic on the other side of the gateway routers.)
>
> This change seems straightforward, but I am not sure if there are any corner cases.

Hi Girish,

I've sent an RFC patch here for the above proposal: https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hzhou@ovn.org/
For this use case, just set options:dynamic_neigh_routes=true for all the Gateway Routers. Could you try it in your scale environment and see if it solves the problem?
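For a sense of the numbers, here is a back-of-the-envelope calculation in Python. It assumes 1000 gateway routers and two static resolve flows per peer port on the join switch, which is how the 2000 per-router and 2M total figures in this thread appear to arise; both helper functions are hypothetical and purely illustrative.

```python
def static_resolve_flows(n_gateway_routers, flows_per_peer=2):
    """Static neighbour entries: every GR holds entries for every peer port.

    Returns (entries per gateway router, entries across all gateway routers).
    flows_per_peer=2 is an assumption inferred from the figures in the thread.
    """
    per_router = n_gateway_routers * flows_per_peer
    return per_router, per_router * n_gateway_routers


def dynamic_join_bindings(n_gateway_routers, address_families=2):
    """Dynamic resolution: each GR only learns the join router's binding.

    Returns (bindings per gateway router, bindings across all gateway routers),
    counting one binding per address family (IPv4 and IPv6).
    """
    return address_families, address_families * n_gateway_routers


if __name__ == "__main__":
    print(static_resolve_flows(1000))
    print(dynamic_join_bindings(1000))
```

The static case grows quadratically with the number of gateway routers, while the dynamic case grows linearly, not counting the bindings learned on the external side.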

Thanks,
Han

>
> > 2. In most places in ovn-kubernetes, our MAC addresses are
> > programmatically related to the corresponding IP addresses, and in
> > places where that's not currently true, we could try to make it true,
> > and then perhaps the thousands of rules could just be replaced by a
> > single rule?
> >
> This may be a good idea, but I am not sure how to implement it in OVN in a generic way, since most OVN users can't make such an assumption.
>
> On the other hand, why wouldn't splitting the join logical switch into 1000 LSes solve the problem? I understand that there will be 1000 more datapaths and 1000 more LRPs, but these are all O(n), which is much more efficient than the O(n^2) explosion. What other scale issues does this create?
>
> In addition, Girish, for the external LS, I am not sure why it can't be shared if all the nodes are connected to a single L2 network. (If they are connected to separate L2 networks, different external LSes should be created, at least according to the current OVN model.)

Thanks Han for the patch. Will give it a try and let you know.

Regards,
~Girish

>
> Thanks,
> Han

--
You received this message because you are subscribed to the Google Groups "ovn-kubernetes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ovn-kubernetes+unsubscribe at googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STTq4WSwvwHbws5e0yozT7OM9RYcpWwaA2v49k83JDmEqA%40mail.gmail.com.
