[ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Han Zhou zhouhan at gmail.com
Tue Jun 9 17:04:47 UTC 2020


On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer <venugopali at nvidia.com> wrote:

> Sorry for the delay, Han, a quick question below:
>
>
>
> *From:* ovn-kubernetes at googlegroups.com <ovn-kubernetes at googlegroups.com> *On
> Behalf Of *Han Zhou
> *Sent:* Wednesday, June 3, 2020 4:27 PM
> *To:* Girish Moodalbail <gmoodalbail at gmail.com>
> *Cc:* Tim Rozet <trozet at redhat.com>; Dumitru Ceara <dceara at redhat.com>;
> Daniel Alvarez Sanchez <dalvarez at redhat.com>; Dan Winship <
> danwinship at redhat.com>; ovn-kubernetes at googlegroups.com; ovs-discuss <
> ovs-discuss at openvswitch.org>; Michael Cambria <mcambria at redhat.com>;
> Venugopal Iyer <venugopali at nvidia.com>
> *Subject:* Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve
> table
>
>
>
> Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry
> that I forgot to update here.
>
>
> On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <gmoodalbail at gmail.com>
> wrote:
> >
> > Hello all,
> >
> > To kind of proceed with the proposed fixes, with minimal impact, is the
> following a reasonable approach?
> >
> > Add an option, namely dynamic_neigh_routes={true|false}, for a gateway
> router. With this option enabled, the nextHop IP's MAC will be learned
> through an ARP request on the physical network. The ARP request will be
> flooded on the L2 broadcast domain (for both join switch and external
> switch).
>
> >
>
>
>
> The RFC patch fulfils this purpose:
> https://patchwork.ozlabs.org/project/openvswitch/patch/1589614395-99499-1-git-send-email-hzhou@ovn.org/
>
> I am working on the formal patch.
>
>
>
> > Add an option, namely learn_from_arp_request={true|false}, for a gateway
> router. The option is interpreted as below:
> > "true" - learn the MAC/IP binding and add a new MAC_Binding entry
> (default behavior)
> > "false" - if there is a MAC_binding for that IP and the MAC is
> different, then update that MAC/IP binding. The external entity might be
> trying to advertise the new MAC for that IP. (If we don't do this, then we
> will never learn External VIP to MAC changes)
> >
> > (Irrespective of whether learn_from_arp_request is true or false, always
> do this: if the TPA is on the router, add a new entry (it means the remote
> wants to communicate with this node, so it makes sense to learn the remote
> as well))
>
> >
>
>
>
> I am working on this as well, but delayed a little. I hope to have
> something this week.
>
> *[vi> ] Just wanted to check if this should be learn_From_unsolicit_arp
> (unsolicited ARP request or reply) instead of learn_from_arp_request? This
> is just to protect against potential rogue use of GARP replies flooding the
> MAC bindings.*
>
>
>

Hi Venu, as discussed earlier in this thread, it is hard to check whether a
packet is a GARP from the router ingress pipeline in OVN. The proposal here
covers ARP requests only, which seems the best option so far.
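
To make the proposed learn_from_arp_request semantics quoted above concrete,
here is a minimal Python sketch of the per-router decision. It is only a model
of the behavior described in this thread, not OVN code: the GatewayRouter
class, its field names, and on_arp_request() are hypothetical names made up
for illustration, and the dict stands in for the MAC_Binding table.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class GatewayRouter:
    """Hypothetical model of one gateway router (not an OVN data structure)."""
    owned_ips: Set[str]                     # IPs configured on this router
    learn_from_arp_request: bool = True     # proposed option, default "true"
    mac_bindings: Dict[str, str] = field(default_factory=dict)  # IP -> MAC

    def on_arp_request(self, spa: str, sha: str, tpa: str) -> bool:
        """Process an ARP request; return True if the bindings changed."""
        # Always learn when the request targets this router, and learn
        # from every request when the option is "true".
        if tpa in self.owned_ips or self.learn_from_arp_request:
            changed = self.mac_bindings.get(spa) != sha
            self.mac_bindings[spa] = sha
            return changed
        # Option is "false" and the request is for someone else:
        # only refresh an entry that already exists with a different MAC.
        if spa in self.mac_bindings and self.mac_bindings[spa] != sha:
            self.mac_bindings[spa] = sha
            return True
        return False


# Example: node1 (10.10.10.2) with learning from ARP requests disabled.
node1 = GatewayRouter(owned_ips={"10.10.10.2"}, learn_from_arp_request=False)
# node2 asks for the physical router; node1 sees the broadcast but ignores it.
node1.on_arp_request(spa="10.10.10.3", sha="M3", tpa="10.10.10.1")
# The physical router asks for node1 itself; node1 learns the sender.
node1.on_arp_request(spa="10.10.10.1", sha="MA", tpa="10.10.10.2")
```

With the option set to false, a broadcast ARP request between two other
parties no longer adds an entry on every gateway router, which is the growth
being discussed in this thread.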


> *Thanks,*
>
>
>
> *-venu*
>
>
>
> >
> > For now, I think it is fine for ARP packets to be broadcast on the
> tunnel for the `join` switch case. If it becomes a problem, then we can
> start looking at changing the logical flows.
> >
> > Thanks everyone for the lively discussion.
> >
> > Regards,
> > ~Girish
> >
> > On Thu, May 28, 2020 at 7:33 AM Tim Rozet <trozet at redhat.com> wrote:
> >>
> >>
> >>
> >> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara <dceara at redhat.com> wrote:
> >>>
> >>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
> >>> > Hi all
> >>> >
> >>> > Sorry for top posting. I want to thank you all for the discussion and
> >>> > give also some feedback from OpenStack perspective which is affected
> >>> > by the problem described here.
> >>> >
> >>> > In OpenStack, it's kind of common to have a shared external network
> >>> > (logical switch with a localnet port) across many tenants. Each tenant
> >>> > user may create their own router where their instances will be
> >>> > connected to access the external network.
> >>> >
> >>> > In such a scenario, we are hitting the issue described here. In
> >>> > particular in our tests we exercise 3K VIFs (with 1 FIP) each spanning
> >>> > 300 LS; each LS connected to a LR (i.e. 300 LRs) and that router
> >>> > connected to the public LS. This is creating a huge problem in terms
> >>> > of performance and tons of events due to the MAC_Binding entries
> >>> > generated as a consequence of the GARPs sent for the floating IPs.
> >>> >
> >>>
> >>> Just as an addition to this, GARPs wouldn't be the only reason why all
> >>> routers would learn the MAC_Binding. Even if we wouldn't be sending
> >>> GARPs for the FIPs, when a VM that's behind a FIP would send traffic to
> >>> the outside, the router will generate an ARP request for the next hop
> >>> using the FIP-IP and FIP-MAC. This will be broadcasted to all routers
> >>> connected to the public LS and will trigger them to learn the
> >>> FIP-IP:FIP-MAC binding.
> >>
> >>
> >> Yeah we shouldn't be learning on regular ARP requests.
> >>
> >>>
> >>>
> >>> > Thanks,
> >>> > Daniel
> >>> >
> >>> >
> >>> > On Thu, May 28, 2020 at 10:51 AM Dumitru Ceara <dceara at redhat.com> wrote:
> >>> >>
> >>> >> On 5/28/20 8:34 AM, Han Zhou wrote:
> >>> >>>
> >>> >>>
> >>> >>> On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara <dceara at redhat.com> wrote:
> >>> >>>>
> >>> >>>> Hi Girish, Han,
> >>> >>>>
> >>> >>>> On 5/26/20 11:51 PM, Han Zhou wrote:
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail <gmoodalbail at gmail.com> wrote:
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> On Tue, May 26, 2020 at 12:42 PM Han Zhou <zhouhan at gmail.com> wrote:
> >>> >>>>>>>
> >>> >>>>>>> Hi Girish,
> >>> >>>>>>>
> >>> >>>>>>> Thanks for the summary. I agree with you that GARP request vs. reply
> >>> >>>>>>> is irrelevant to the problem here.
> >>> >>>>
> >>> >>>> Well, actually I think GARP request vs reply is relevant (at least for
> >>> >>>> case 1 below) because if OVN would be generating GARP replies we
> >>> >>>> wouldn't need the priority 80 flow to determine if an ARP request packet
> >>> >>>> is actually an OVN self originated GARP that needs to be flooded in the
> >>> >>>> L2 broadcast domain.
> >>> >>>>
> >>> >>>> On the other hand, router3 would be learning mac_binding IP2,M2 from the
> >>> >>>> GARP reply originated by router2 and vice versa so we'd have to restrict
> >>> >>>> flooding of GARP replies to non-patch ports.
> >>> >>>>
> >>> >>>
> >>> >>> Hi Dumitru, the point was that, on the external LS, the GRs will have to
> >>> >>> send ARP requests to resolve unknown IPs (at least for the external GW),
> >>> >>> and these have to be broadcast, which will cause all the GRs to learn all
> >>> >>> MACs of other GRs. This is regardless of the GARP behavior. You are
> >>> >>> right that if we only consider the Join switch then the GARP request
> >>> >>> vs. reply does make a difference. However, GARP request/reply may be
> >>> >>> really needed only on the external LS.
> >>> >>>
> >>> >>
> >>> >> Ok, but do you see an easy way to determine if we need to add the
> >>> >> logical flows that flood self originated GARP packets on a given logical
> >>> >> switch? Right now we add them on all switches.
> >>> >>
> >>> >>>>>>> Please see my comment inline below.
> >>> >>>>>>>
> >>> >>>>>>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail <gmoodalbail at gmail.com> wrote:
> >>> >>>>>>>>
> >>> >>>>>>>> Hello Dumitru,
> >>> >>>>>>>>
> >>> >>>>>>>> There are several things that are being discussed on this thread.
> >>> >>>>>>>> Let me see if I can tease them out for clarity.
> >>> >>>>>>>>
> >>> >>>>>>>> 1. All the router IPs are known to OVN (the join switch case)
> >>> >>>>>>>> 2. Some IPs are known and some are not known (the external logical
> >>> >>>>>>>> switch that connects to physical network case).
> >>> >>>>>>>>
> >>> >>>>>>>> Let us look at each of the case above:
> >>> >>>>>>>>
> >>> >>>>>>>> 1. Join Switch Case
> >>> >>>>>>>>
> >>> >>>>>>>> +----------------+        +----------------+
> >>> >>>>>>>> |   l3gateway    |        |   l3gateway    |
> >>> >>>>>>>> |    router2     |        |    router3     |
> >>> >>>>>>>> +-------------+--+        +-+--------------+
> >>> >>>>>>>>             IP2,M2         IP3,M3
> >>> >>>>>>>>               |             |
> >>> >>>>>>>>            +--+-------------+---+
> >>> >>>>>>>>            |    join switch     |
> >>> >>>>>>>>            +---------+----------+
> >>> >>>>>>>>                      |
> >>> >>>>>>>>                   IP1,M1
> >>> >>>>>>>>              +-------+--------+
> >>> >>>>>>>>              |  distributed   |
> >>> >>>>>>>>              |     router     |
> >>> >>>>>>>>              +----------------+
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> Say, GR router2 wants to send the packet out to DR and that we
> >>> >>>>>>>> don't have static mappings of MAC to IP in lr_in_arp_resolve table on GR
> >>> >>>>>>>> router2 (with Han's patch of dynamic_neigh_routes=true for all the
> >>> >>>>>>>> Gateway Routers). With this in mind, when an ARP request is sent out by
> >>> >>>>>>>> router2's hypervisor the packet should be directly sent to the
> >>> >>>>>>>> distributed router alone. Your commit 32f5ebb0622 (ovn-northd: Limit
> >>> >>>>>>>> ARP/ND broadcast domain whenever possible) should have allowed only
> >>> >>>>>>>> unicast. However, in ls_in_l2_lkup table we have
> >>> >>>>>>>>
> >>> >>>>>>>>   table=19(ls_in_l2_lkup      ), priority=80   , match=(eth.src == { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
> >>> >>>>>>>>   table=19(ls_in_l2_lkup      ), priority=75   , match=(flags[1] == 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2"; output;)
> >>> >>>>>>>>
> >>> >>>>>>>> As you can see, the `priority=80` rule will always be hit and the
> >>> >>>>>>>> packet sent out to all the GRs. The `priority=75` rule is never hit.
> >>> >>>>>>>> So, we will see ARP packets on the GENEVE tunnel. So, we need to change
> >>> >>>>>>>> `priority=80` to match only GARP request packets. That way, for the
> >>> >>>>>>>> known OVN IPs case we don't do broadcast.
> >>> >>>>>>>
> >>> >>>>>>> Since the solution to case 2) below (i.e.
> >>> >>>>>>> learn_from_arp_request=false) solves the problem of case 1), too, I
> >>> >>>>>>> think we don't need this change just for case 1). As @Dumitru Ceara
> >>> >>>>>>> mentioned, there is some cost because it adds extra flows. It would be
> >>> >>>>>>> a significant number of flows if there are a lot of dnat_and_snat IPs.
> >>> >>>>>>> What do you think?
> >>> >>>>
> >>> >>>> I think the following might be a solution, although with the cost of
> >>> >>>> adding as many flows as dnat_and_snat IPs are configured:
> >>> >>>>
> >>> >>>> - priority 80: explicitly determine if an ARP request is a self
> >>> >>>> originated GARP for configured IP addresses and dnat_and_snat IPs (by
> >>> >>>> matching on all eth.src and arp.tpa pairs) and if so flood on all
> >>> >>>> non-patch ports.
> >>> >>>> - priority 75: if arp.tpa is owned by an OVN logical router port,
> >>> >>>> "unicast" it only on the patch port towards the router.
> >>> >>>> - priority 1: flood any broadcast packet.
> >>> >>>>
> >>> >>>> Together with the learn_from_arp_request=false knob this would cover
> >>> >>>> both case 1 (join switch) and case 2 (external switch).
> >>> >>>>
> >>> >>>> Wdyt?
> >>> >>>>
> >>> >>> Would the "learn_from_arp_request=false knob" cover both cases? If yes,
> >>> >>> we don't need to add more flows of priority 80, or more accurately:
> >>> >>> whether to update the priority-80 flows is not directly related to the
> >>> >>> current problem.
> >>> >>>
> >>> >>
> >>> >> Yes, it would, except for the fact that the ARP requests would still be
> >>> >> flooded to all routers (and ignored at the destination). Which is afaiu
> >>> >> what Girish was worried about. In order to address that part too I'm
> >>> >> afraid we have to update the priority-80 flows.
> >>> >>
> >>> >> Regards,
> >>> >> Dumitru
> >>> >>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Han, yes it will work. However, my only concern is that we would send
> >>> >>>>>> all these ARP requests via tunnel to each of 1000 hypervisors and these
> >>> >>>>>> hypervisors will just drop them on the floor when they see
> >>> >>>>>> learn_from_arp_request=false.
> >>> >>>>>
> >>> >>>>> I think maybe it is not a problem since it happens only once on the Join
> >>> >>>>> switch. Once the MAC is learned, it won't broadcast again. It may be
> >>> >>>>> more of a problem on the external LS if periodic GARP is required
> >>> >>>>> there. However, I'd suggest testing to see whether it is really a
> >>> >>>>> problem before trying to solve it.
> >>> >>>>>
> >>> >>>>>>
> >>> >>>>>> Han, Dumitru,
> >>> >>>>>>
> >>> >>>>>> Why can't we swap the priorities of the above two flows so that the
> >>> >>>>>> ARP request for a NextHop IP known to OVN will always be sent via
> >>> >>>>>> `unicast`?
> >>> >>>>>
> >>> >>>>> If swapped, even GARP won't get broadcasted. Maybe that's not the
> >>> >>>>> desired behavior.
> >>> >>>>>
> >>> >>>>
> >>> >>>> This is definitely not desired as we'd be hitting the prio 75 flow that
> >>> >>>> would send the self originated GARP request (IPx) packet back towards
> >>> >>>> the router port that owns IPx.
> >>> >>>>
> >>> >>>>>>
> >>> >>>>>> Regards,
> >>> >>>>>> ~Girish
> >>> >>>>>>
> >>> >>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> 2. External Logical Switch Case
> >>> >>>>>>>>
> >>> >>>>>>>>                        10.10.10.0/24
> >>> >>>>>>>>    -------------------------+--------------------------
> >>> >>>>>>>>                             |
> >>> >>>>>>>>                          localnet
> >>> >>>>>>>>                       +-----+-----+
> >>> >>>>>>>>                       | external  |
> >>> >>>>>>>>          +------------+    LS1    +-------------+
> >>> >>>>>>>>          |            +-----+-----+             |
> >>> >>>>>>>>          |                  |                   |
> >>> >>>>>>>>      10.10.10.2         10.10.10.3          10.10.10.4
> >>> >>>>>>>>         SNAT               SNAT                SNAT
> >>> >>>>>>>>    +-----+-----+      +-----+-----+       +-----------+
> >>> >>>>>>>>    | l3gateway |      | l3gateway |       | l3gateway |
> >>> >>>>>>>>    |   node1   |      |   node2   |       |   node3   |
> >>> >>>>>>>>    +-----------+      +-----------+       +-----------+
> >>> >>>>>>>>
> >>> >>>>>>>> In this case, we have some of the IPs in OVN and some in the
> >>> >>>>>>>> physical network. If we fix (1) above, all the ARP requests for the
> >>> >>>>>>>> OVN's router IPs will be unicast. However, all the ARP requests to
> >>> >>>>>>>> external IPs, say 10.10.10.1 on the "physical router", will be
> >>> >>>>>>>> broadcast. Now, we will see these ARP broadcasts on all the L3 gateway
> >>> >>>>>>>> routers. With 'learn_from_arp_request=false' [a], the MAC_Binding
> >>> >>>>>>>> table will not explode for both ARP and GARP requests.
> >>> >>>>>>>>
> >>> >>>>>>>> So, I don't think GARP requests and replies are the issue here?
> >>> >>>>>>>> Furthermore, learning from GARP replies is blocked on certain
> >>> >>>>>>>> routers. For example:
> >>> >>>>>>>> https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html
> >>> >>>>>>>> says "By default, updating the ARP cache on GARP replies is disabled on
> >>> >>>>>>>> the router.". So, our NAT addresses mapping will not be learnt.
> >>> >>>>
> >>> >>>> Just as a side note, the above doesn't mean Juniper boxes don't support
> >>> >>>> learning from GARP replies, just that they'd need extra configuration. I
> >>> >>>> don't necessarily think that's a bad thing if properly documented in OVN
> >>> >>>> that we would be generating GARP replies.
> >>> >>>>
> >>> >>>> Regards,
> >>> >>>> Dumitru
> >>> >>>>
> >>> >>>>>>>>
> >>> >>>>>>>> Regards,
> >>> >>>>>>>> ~Girish
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> [a] - From Han's mail, the meaning of learn_from_arp_request=false
> >>> >>>>>>>> --> if the TPA is on the router, add a new entry (it means the
> >>> >>>>>>>>>     remote wants to communicate with this node, so it makes sense to
> >>> >>>>>>>>>     learn the remote as well). Otherwise, ignore it and no new
> >>> >>>>>>>>>     entry added.
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>
> >>> >>>>>> --
> >>> >>>>>> You received this message because you are subscribed to the
> Google
> >>> >>>>> Groups "ovn-kubernetes" group.
> >>> >>>>>> To unsubscribe from this group and stop receiving emails from
> it, send
> >>> >>>>> an email to ovn-kubernetes+unsubscribe at googlegroups.com
> >>> >>> <mailto:ovn-kubernetes%2Bunsubscribe at googlegroups.com>
> >>> >>>>> <mailto:ovn-kubernetes%2Bunsubscribe at googlegroups.com
> >>> >>> <mailto:ovn-kubernetes%252Bunsubscribe at googlegroups.com>>.
> >>> >>>>>> To view this discussion on the web visit
> >>> >>>>>
> >>> >>>
> https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STRnem2PeSahuwhro1t%2BQJxchZNC7viq8n-ngM9KU%2B%2B-Xw%40mail.gmail.com
> .
> >>> >>>>
> >>> >>>
> >>> >>
> >>> >> _______________________________________________
> >>> >> discuss mailing list
> >>> >> discuss at openvswitch.org
> >>> >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> >>> >
> >>>