[ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Venugopal Iyer venugopali at nvidia.com
Tue Jun 9 16:06:20 UTC 2020


Sorry for the delay, Han, a quick question below:

From: ovn-kubernetes at googlegroups.com <ovn-kubernetes at googlegroups.com> On Behalf Of Han Zhou
Sent: Wednesday, June 3, 2020 4:27 PM
To: Girish Moodalbail <gmoodalbail at gmail.com>
Cc: Tim Rozet <trozet at redhat.com>; Dumitru Ceara <dceara at redhat.com>; Daniel Alvarez Sanchez <dalvarez at redhat.com>; Dan Winship <danwinship at redhat.com>; ovn-kubernetes at googlegroups.com; ovs-discuss <ovs-discuss at openvswitch.org>; Michael Cambria <mcambria at redhat.com>; Venugopal Iyer <venugopali at nvidia.com>
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Hi Girish, yes, that's what we concluded in the last OVN meeting; sorry that I forgot to post an update here.

On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <gmoodalbail at gmail.com> wrote:
>
> Hello all,
>
> To kind of proceed with the proposed fixes, with minimal impact, is the following a reasonable approach?
>
> Add an option, namely dynamic_neigh_routes={true|false}, for a gateway router. With this option enabled, the nextHop IP's MAC will be learned through an ARP request on the physical network. The ARP request will be flooded on the L2 broadcast domain (for both join switch and external switch).
>

The RFC patch fulfils this purpose: https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hzhou@ovn.org/
I am working on the formal patch.

> Add an option, namely learn_from_arp_request={true|false}, for a gateway router. The option is interpreted as below:
> "true" - learn the MAC/IP binding and add a new MAC_Binding entry (default behavior)
> "false" - if there is a MAC_Binding for that IP and the MAC is different, then update that MAC/IP binding. The external entity might be trying to advertise the new MAC for that IP. (If we don't do this, then we will never learn External VIP to MAC changes)
>
> (Irrespective of whether learn_from_arp_request is true or false, always do this: if the TPA is on the router, add a new entry. It means the remote wants to communicate with this node, so it makes sense to learn the remote as well.)
>

I am working on this as well, but delayed a little. I hope to have something this week.
[vi> ] Just wanted to check whether this should be learn_From_unsolicit_arp (unsolicited ARP request or reply) instead of learn_from_arp_request? This is just to protect against potential rogue usage of GARP replies flooding the MAC bindings.

Thanks,

-venu
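
For reference, the learn_from_arp_request semantics proposed above can be sketched as follows. This is only an illustration of the decision logic, not OVN code: the GatewayRouter class, its fields, and the return strings are hypothetical names. It also assumes that an existing binding with a different MAC is refreshed in both modes, which the proposal states only for the "false" case.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class GatewayRouter:
    """Hypothetical model of a gateway router; not an OVN data structure."""
    router_ips: Set[str]                          # IPs owned by this router
    learn_from_arp_request: bool = True           # proposed default
    mac_bindings: Dict[str, str] = field(default_factory=dict)  # IP -> MAC

    def handle_arp_request(self, spa: str, sha: str, tpa: str) -> str:
        """Decide what to do with the sender's (spa, sha) binding."""
        known = self.mac_bindings.get(spa)
        if known == sha:
            return "unchanged"
        if known is not None:
            # Binding exists with a different MAC: the sender may be
            # advertising a new MAC for that IP, so refresh it.
            self.mac_bindings[spa] = sha
            return "updated"
        if self.learn_from_arp_request or tpa in self.router_ips:
            # Either learning is enabled, or the request targets this router.
            self.mac_bindings[spa] = sha
            return "added"
        # Option is false and the request is for somebody else: no new entry.
        return "ignored"


gr = GatewayRouter(router_ips={"10.10.10.2"}, learn_from_arp_request=False)
print(gr.handle_arp_request("10.10.10.3", "M3", "10.10.10.1"))      # ignored
print(gr.handle_arp_request("10.10.10.3", "M3", "10.10.10.2"))      # added
print(gr.handle_arp_request("10.10.10.3", "M3-new", "10.10.10.1"))  # updated
```

With the option set to false, a broadcast ARP request for some other IP no longer creates a MAC_Binding entry on every gateway router, which is the explosion discussed in this thread.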

>
> For now, I think it is fine for ARP packets to be broadcast on the tunnel for the `join` switch case. If it becomes a problem, then we can start looking at changing the logical flows.
>
> Thanks everyone for the lively discussion.
>
> Regards,
> ~Girish
>
> On Thu, May 28, 2020 at 7:33 AM Tim Rozet <trozet at redhat.com> wrote:
>>
>>
>>
>> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara <dceara at redhat.com> wrote:
>>>
>>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
>>> > Hi all
>>> >
>>> > Sorry for top posting. I want to thank you all for the discussion and
>>> > give also some feedback from OpenStack perspective which is affected
>>> > by the problem described here.
>>> >
>>> > In OpenStack, it's kind of common to have a shared external network
>>> > (logical switch with a localnet port) across many tenants. Each tenant
>>> > user may create their own router where their instances will be
>>> > connected to access the external network.
>>> >
>>> > In such a scenario, we are hitting the issue described here. In
>>> > particular in our tests we exercise 3K VIFs (with 1 FIP) each spanning
>>> > 300 LS; each LS connected to a LR (ie. 300 LRs) and that router
>>> > connected to the public LS. This is creating a huge problem in terms
>>> > of performance and tons of events due to the MAC_Binding entries
>>> > generated as a consequence of the GARPs sent for the floating IPs.
>>> >
>>>
>>> Just as an addition to this, GARPs wouldn't be the only reason why all
>>> routers would learn the MAC_Binding. Even if we wouldn't be sending
>>> GARPs for the FIPs, when a VM that's behind a FIP would send traffic to
>>> the outside, the router will generate an ARP request for the next hop
>>> using the FIP-IP and FIP-MAC. This will be broadcast to all routers
>>> connected to the public LS and will trigger them to learn the
>>> FIP-IP:FIP-MAC binding.
>>
>>
>> Yeah we shouldn't be learning on regular ARP requests.
>>
>>>
>>>
>>> > Thanks,
>>> > Daniel
>>> >
>>> >
>>> > On Thu, May 28, 2020 at 10:51 AM Dumitru Ceara <dceara at redhat.com> wrote:
>>> >>
>>> >> On 5/28/20 8:34 AM, Han Zhou wrote:
>>> >>>
>>> >>>
>>> >>> On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara <dceara at redhat.com> wrote:
>>> >>>>
>>> >>>> Hi Girish, Han,
>>> >>>>
>>> >>>> On 5/26/20 11:51 PM, Han Zhou wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>> On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail <gmoodalbail at gmail.com> wrote:
>>> >>>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, May 26, 2020 at 12:42 PM Han Zhou <zhouhan at gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>> Hi Girish,
>>> >>>>>>>
>>> >>>>>>> Thanks for the summary. I agree with you that GARP request vs. reply
>>> >>>>>>> is irrelevant to the problem here.
>>> >>>>
>>> >>>> Well, actually I think GARP request vs reply is relevant (at least for
>>> >>>> case 1 below) because if OVN would be generating GARP replies we
>>> >>>> wouldn't need the priority 80 flow to determine if an ARP request packet
>>> >>>> is actually an OVN self originated GARP that needs to be flooded in the
>>> >>>> L2 broadcast domain.
>>> >>>>
>>> >>>> On the other hand, router3 would be learning mac_binding IP2,M2 from the
>>> >>>> GARP reply originated by router2 and vice versa so we'd have to restrict
>>> >>>> flooding of GARP replies to non-patch ports.
>>> >>>>
>>> >>>
>>> >>> Hi Dumitru, the point was that, on the external LS, the GRs will have to
>>> >>> send ARP requests to resolve unknown IPs (at least for the external GW),
>>> >>> and it has to be broadcast, which will cause all the GRs to learn all
>>> >>> MACs of other GRs. This is regardless of the GARP behavior. You are
>>> >>> right that if we only consider the Join switch then the GARP request
>>> >>> vs. reply does make a difference. However, GARP request/reply may be
>>> >>> really needed only on the external LS.
>>> >>>
>>> >>
>>> >> Ok, but do you see an easy way to determine if we need to add the
>>> >> logical flows that flood self originated GARP packets on a given logical
>>> >> switch? Right now we add them on all switches.
>>> >>
>>> >>>>>>> Please see my comment inline below.
>>> >>>>>>>
>>> >>>>>>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail <gmoodalbail at gmail.com> wrote:
>>> >>>>>>>>
>>> >>>>>>>> Hello Dumitru,
>>> >>>>>>>>
>>> >>>>>>>> There are several things that are being discussed on this thread.
>>> >>>>> Let me see if I can tease them out for clarity.
>>> >>>>>>>>
>>> >>>>>>>> 1. All the router IPs are known to OVN (the join switch case)
>>> >>>>>>>> 2. Some IPs are known and some are not known (the external logical
>>> >>>>> switch that connects to physical network case).
>>> >>>>>>>>
>>> >>>>>>>> Let us look at each of the case above:
>>> >>>>>>>>
>>> >>>>>>>> 1. Join Switch Case
>>> >>>>>>>>
>>> >>>>>>>> +----------------+        +----------------+
>>> >>>>>>>> |   l3gateway    |        |   l3gateway    |
>>> >>>>>>>> |    router2     |        |    router3     |
>>> >>>>>>>> +-------------+--+        +-+--------------+
>>> >>>>>>>>             IP2,M2         IP3,M3
>>> >>>>>>>>               |             |
>>> >>>>>>>>            +--+-------------+---+
>>> >>>>>>>>            |    join switch     |
>>> >>>>>>>>            +---------+----------+
>>> >>>>>>>>                      |
>>> >>>>>>>>                   IP1,M1
>>> >>>>>>>>              +-------+--------+
>>> >>>>>>>>              |  distributed   |
>>> >>>>>>>>              |     router     |
>>> >>>>>>>>              +----------------+
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> Say, GR router2 wants to send the packet out to DR and that we
>>> >>>>> don't have static mappings of MAC to IP in lr_in_arp_resolve table on GR
>>> >>>>> router2 (with Han's patch of dynamic_neigh_routes=true for all the
>>> >>>>> Gateway Routers). With this in mind, when an ARP request is sent out by
>>> >>>>> router2's hypervisor the packet should be directly sent to the
>>> >>>>> distributed router alone. Your commit 32f5ebb0622 (ovn-northd: Limit
>>> >>>>> ARP/ND broadcast domain whenever possible) should have allowed only
>>> >>>>> unicast. However, in ls_in_l2_lkup table we have
>>> >>>>>>>>
>>> >>>>>>>>   table=19(ls_in_l2_lkup      ), priority=80   , match=(eth.src == { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
>>> >>>>>>>>   table=19(ls_in_l2_lkup      ), priority=75   , match=(flags[1] == 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2"; output;)
>>> >>>>>>>>
>>> >>>>>>>> As you can see, `priority=80` rule will always be hit and sent out
>>> >>>>> to all the GRs. The `priority=75` rule is never hit. So, we will see ARP
>>> >>>>> packets on the GENEVE tunnel. So, we need to change `priority=80` to
>>> >>>>> match GARP request packets. That way, for the known OVN IPs case we
>>> >>>>> don't do broadcast.
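
To illustrate why the `priority=75` rule is never hit, here is a minimal sketch of highest-priority-wins lookup over simplified stand-ins for the two ls_in_l2_lkup flows quoted above. The Flow type and lookup() helper are hypothetical and only model the ordering; they are not how OVN or OVS evaluate flows internally, and the nd_ns alternative in the priority-80 match is left out for brevity.

```python
from typing import Callable, Dict, List, NamedTuple


class Flow(NamedTuple):
    priority: int
    match: Callable[[Dict[str, object]], bool]
    action: str


def lookup(flows: List[Flow], pkt: Dict[str, object]) -> str:
    """Return the action of the highest-priority flow that matches pkt."""
    for flow in sorted(flows, key=lambda f: f.priority, reverse=True):
        if flow.match(pkt):
            return flow.action
    return "drop"


flows = [
    # priority 80: any ARP request sourced from M2 is flooded
    Flow(80,
         lambda p: p["eth.src"] == "M2" and p["arp.op"] == 1,
         'outport = "_MC_flood"'),
    # priority 75: ARP request for IP1 is sent out of a single port
    Flow(75,
         lambda p: p["flags[1]"] == 0 and p["arp.op"] == 1
         and p["arp.tpa"] == "IP1",
         'outport = "jtor-router2"'),
]

# router2 (M2) asks "who has IP1?": both flows match, priority 80 wins.
arp_request = {"eth.src": "M2", "arp.op": 1, "arp.tpa": "IP1", "flags[1]": 0}
print(lookup(flows, arp_request))  # outport = "_MC_flood"
```

Because the priority-80 match only looks at eth.src and arp.op, every ARP request from router2 satisfies it, including the ones the priority-75 flow was meant to unicast.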
>>> >>>>>>>
>>> >>>>>>> Since the solution to case 2) below (i.e.
>>> >>>>> learn_from_arp_request=false) solves the problem of case 1), too, I
>>> >>>>> think we don't need this change just for case 1). As @Dumitru Ceara
>>> >>>>>  mentioned, there is some cost because it adds extra flows. It would be
>>> >>>>> a significant number of flows if there are a lot of dnat_and_snat IPs.
>>> >>>>> What do you think?
>>> >>>>
>>> >>>> I think the following might be a solution, although with the cost of
>>> >>>> adding as many flows as dnat_and_snat IPs are configured:
>>> >>>>
>>> >>>> - priority 80: explicitly determine if an ARP request is a self
>>> >>>> originated GARP for configured IP addresses and dnat_and_snat IPs (by
>>> >>>> matching on all eth.src and arp.tpa pairs) and if so flood on all
>>> >>>> non-patch ports.
>>> >>>> - priority 75: if arp.tpa is owned by an OVN logical router port,
>>> >>>> "unicast" it only on the patch port towards the router.
>>> >>>> - priority 1: flood any broadcast packet.
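
The three priority levels listed above can be sketched as a single classification function. This is a rough model of the proposal only: the function name, the port names, and the returned strings are hypothetical, and the real change would be expressed as logical flows generated by ovn-northd rather than Python.

```python
from typing import Dict, Set, Tuple


def classify_arp_request(eth_src: str, arp_tpa: str,
                         self_garp_pairs: Set[Tuple[str, str]],
                         router_port_by_ip: Dict[str, str]) -> str:
    """Model the proposed priority 80 / 75 / 1 handling of an ARP request."""
    # priority 80: self-originated GARP, identified by an exact
    # (eth.src, arp.tpa) pair; flood on non-patch ports only.
    if (eth_src, arp_tpa) in self_garp_pairs:
        return "flood:non-patch-ports"
    # priority 75: target IP is owned by an OVN logical router port;
    # send only to the patch port towards that router.
    if arp_tpa in router_port_by_ip:
        return "unicast:" + router_port_by_ip[arp_tpa]
    # priority 1: anything else is flooded as a normal broadcast.
    return "flood:all"


# Join switch from the diagram: one pair per router IP (and per
# dnat_and_snat IP, if any were configured).
pairs = {("M1", "IP1"), ("M2", "IP2"), ("M3", "IP3")}
ports = {"IP1": "port-to-dr", "IP2": "port-to-gr2", "IP3": "port-to-gr3"}

print(classify_arp_request("M2", "IP2", pairs, ports))         # flood:non-patch-ports
print(classify_arp_request("M2", "IP1", pairs, ports))         # unicast:port-to-dr
print(classify_arp_request("M2", "10.10.10.1", pairs, ports))  # flood:all
```

The size of the pair set is the cost mentioned above: one priority-80 flow per configured IP and per dnat_and_snat IP.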
>>> >>>>
>>> >>>> Together with the learn_from_arp_request=false knob this would cover
>>> >>>> both case 1 (join switch) and case 2 (external switch).
>>> >>>>
>>> >>>> Wdyt?
>>> >>>>
>>> >>> Would the "learn_from_arp_request=false knob" cover both cases? If yes,
>>> >>> we don't need to add more flows of priority 80, or more accurately:
>>> >>> whether to update the priority-80 flows is not directly related to the
>>> >>> current problem.
>>> >>>
>>> >>
>>> >> Yes, it would, except for the fact that the ARP requests would still be
>>> >> flooded to all routers (and ignored at the destination). Which is afaiu
>>> >> what Girish was worried about. In order to address that part too I'm
>>> >> afraid we have to update the priority-80 flows.
>>> >>
>>> >> Regards,
>>> >> Dumitru
>>> >>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Han, yes it will work. However, my only concern is that we would send
>>> >>>>> all these ARP requests via tunnel to each of 1000 hypervisors and these
>>> >>>>> hypervisors will just drop them on the floor when they see
>>> >>>>> learn_from_arp_request=false.
>>> >>>>>
>>> >>>>> I think maybe it is not a problem since it happens only once on the Join
>>> >>>>> switch. Once the MAC is learned, it won't broadcast again. It may be
>>> >>>>> more of a problem on the external LS if periodic GARP is required
>>> >>>>> there. However, I'd suggest running some tests to see if it is really a
>>> >>>>> problem before trying to solve it.
>>> >>>>>
>>> >>>>>>
>>> >>>>>> Han, Dumitru,
>>> >>>>>>
>>> >>>>>> Why can't we swap the priorities of the above two flows so that the
>>> >>>>> ARP request for NextHop IP known to OVN will be always sent via
>>> >>> `unicast`?
>>> >>>>>
>>> >>>>> If swapped, even GARP won't get broadcast. Maybe that's not the
>>> >>>>> desired behavior.
>>> >>>>>
>>> >>>>
>>> >>>> This is definitely not desired as we'd be hitting the prio 75 flow that
>>> >>>> would send the self originated GARP request (IPx) packet back towards
>>> >>>> the router port that owns IPx.
>>> >>>>
>>> >>>>>>
>>> >>>>>> Regards,
>>> >>>>>> ~Girish
>>> >>>>>>
>>> >>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> 2. External Logical Switch Case
>>> >>>>>>>>
>>> >>>>>>>>                        10.10.10.0/24
>>> >>>>>>>>    -------------------------+--------------------------
>>> >>>>>>>>                             |
>>> >>>>>>>>                          localnet
>>> >>>>>>>>                       +-----+-----+
>>> >>>>>>>>                       | external  |
>>> >>>>>>>>          +------------+    LS1    +-------------+
>>> >>>>>>>>          |            +-----+-----+             |
>>> >>>>>>>>          |                  |                   |
>>> >>>>>>>>      10.10.10.2         10.10.10.3          10.10.10.4
>>> >>>>>>>>         SNAT               SNAT                SNAT
>>> >>>>>>>>    +-----+-----+      +-----+-----+       +-----------+
>>> >>>>>>>>    | l3gateway |      | l3gateway |       | l3gateway |
>>> >>>>>>>>    |   node1   |      |   node2   |       |   node3   |
>>> >>>>>>>>    +-----------+      +-----------+       +-----------+
>>> >>>>>>>>
>>> >>>>>>>> In this case, we have some of the IPs in OVN and some in the
>>> >>>>> physical network. If we fix (1) above, all the ARP requests for the
>>> >>>>> OVN's router IPs will be unicast. However, all the ARP requests to
>>> >>>>> external IPs, say 10.10.10.1 on the "physical router", will be
>>> >>>>> broadcast. Now, we will see these ARP broadcasts on all the L3 gateway
>>> >>>>> routers. With 'learn_from_arp_request=false' [a], then the MAC_Binding
>>> >>>>> table will not explode for both ARP and GARP requests.
>>> >>>>>>>>
>>> >>>>>>>> So, I don't think GARP requests vs. replies is the issue here.
>>> >>>>>>>> Furthermore, learning from GARP replies is blocked on certain
>>> >>>>>>>> routers. For example,
>>> >>>>>>>> https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html
>>> >>>>>>>> says "By default, updating the ARP cache on GARP replies is disabled on
>>> >>>>>>>> the router." So, our NAT address mappings will not be learnt.
>>> >>>>
>>> >>>> Just as a side note, the above doesn't mean Juniper boxes don't support
>>> >>>> learning from GARP replies, just that they'd need extra configuration. I
>>> >>>> don't necessarily think that's a bad thing if properly documented in OVN
>>> >>>> that we would be generating GARP replies.
>>> >>>>
>>> >>>> Regards,
>>> >>>> Dumitru
>>> >>>>
>>> >>>>>>>>
>>> >>>>>>>> Regards,
>>> >>>>>>>> ~Girish
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> [a] - From Han's mail, the meaning of learn_from_arp_request=false:
>>> >>>>>>>>     if the TPA is on the router, add a new entry (it means the
>>> >>>>>>>>     remote wants to communicate with this node, so it makes sense to
>>> >>>>>>>>     learn the remote as well). Otherwise, ignore it and add no new
>>> >>>>>>>>     entry.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>
>>> >>>>>> --
>>> >>>>>> You received this message because you are subscribed to the Google
>>> >>>>> Groups "ovn-kubernetes" group.
>>> >>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>> >>>>>> an email to ovn-kubernetes+unsubscribe at googlegroups.com.
>>> >>>>>> To view this discussion on the web visit
>>> >>>>>
>>> >>> https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STRnem2PeSahuwhro1t%2BQJxchZNC7viq8n-ngM9KU%2B%2B-Xw%40mail.gmail.com.
>>> >>>>
>>> >>>
>>> >>
>>> >> _______________________________________________
>>> >> discuss mailing list
>>> >> discuss at openvswitch.org
>>> >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>>> >
>>>