[ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Girish Moodalbail gmoodalbail at gmail.com
Wed Jun 3 22:32:07 UTC 2020


Hello all,

To move forward with the proposed fixes with minimal impact, is the
following a reasonable approach?

   1. Add an option, namely dynamic_neigh_routes={true|false}, for a
   gateway router. With this option enabled, the nextHop IP's MAC will be
   learned through an ARP request on the physical network. The ARP request
   will be flooded on the L2 broadcast domain (for both the join switch and
   the external switch).

   2. Add an option, namely learn_from_arp_request={true|false}, for a
   gateway router. The option is interpreted as follows:
   "true" - learn the MAC/IP binding and add a new MAC_Binding entry
   (default behavior)
   "false" - do not add new entries; only if there is already a MAC_Binding
   for that IP and the MAC is different, update that MAC/IP binding. The
   external entity might be trying to advertise a new MAC for that IP. (If
   we don't do this, we will never learn external VIP-to-MAC changes.)

   (Irrespective of whether learn_from_arp_request is true or false, always
   do this: if the TPA is on the router, add a new entry. It means the remote
   wants to communicate with this node, so it makes sense to learn the remote
   as well.)
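
To make the intended semantics of option 2 concrete, here is a rough Python
sketch of the learning decision. It is only a toy model of the proposal, not
ovn-northd or ovn-controller code; the class name, method name, and return
strings are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class GatewayRouter:
    """Toy model (hypothetical) of one gateway router's learning state."""
    own_ips: Set[str]
    learn_from_arp_request: bool = True  # proposed option; "true" is the default
    mac_bindings: Dict[str, str] = field(default_factory=dict)  # IP -> MAC

    def on_arp_request(self, spa: str, sha: str, tpa: str) -> str:
        """Apply the proposed rules to one received ARP request.

        spa/sha are the sender's IP and MAC, tpa is the target IP.
        Returns "added", "updated", "unchanged", or "ignored".
        """
        known = self.mac_bindings.get(spa)
        if known == sha:
            return "unchanged"
        # A new entry may be added when the option is "true", or always
        # when the request targets an IP owned by this router.
        may_add = self.learn_from_arp_request or tpa in self.own_ips
        if known is None and not may_add:
            return "ignored"
        # Either a permitted new entry, or an existing entry whose MAC
        # changed (updated in both modes).
        self.mac_bindings[spa] = sha
        return "added" if known is None else "updated"


gr = GatewayRouter(own_ips={"10.10.10.2"}, learn_from_arp_request=False)
# ARP request between two other parties: not learned.
print(gr.on_arp_request("10.10.10.3", "M3", "10.10.10.1"))   # ignored
# Request targeting this router: always learned.
print(gr.on_arp_request("10.10.10.1", "Mx", "10.10.10.2"))   # added
# Known IP advertising a different MAC: binding is updated.
print(gr.on_arp_request("10.10.10.1", "My", "10.10.10.4"))   # updated
```

With learn_from_arp_request=true the first call would return "added"
instead, which is the behavior that fills up the MAC_Binding table today.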


For now, I think it is fine for ARP packets to be broadcast on the tunnel
for the `join` switch case. If it becomes a problem, then we can start
looking into changing the logical flows.
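
For reference, the join switch broadcast comes from the two ls_in_l2_lkup
flows quoted further down. A small Python model of highest-priority-wins
matching shows why the priority-75 unicast flow is shadowed. This is a
simplification: it ignores flags[1] and nd_ns, the action names are
placeholders, and the narrowed priority-80 match (arp.spa == arp.tpa, the
usual definition of a GARP request) is just one assumed way to do it.

```python
from typing import Callable, Dict, List, NamedTuple

Packet = Dict[str, object]


class Flow(NamedTuple):
    priority: int
    match: Callable[[Packet], bool]
    action: str


def lookup(flows: List[Flow], pkt: Packet) -> str:
    """Return the action of the highest-priority flow that matches pkt."""
    for flow in sorted(flows, key=lambda f: f.priority, reverse=True):
        if flow.match(pkt):
            return flow.action
    return "drop"


M2, IP1, IP2 = "M2", "IP1", "IP2"  # placeholders from the quoted diagram

# Simplified versions of the two flows as they exist today.
current = [
    Flow(80, lambda p: p["eth_src"] == M2 and p["arp_op"] == 1, "flood"),
    Flow(75, lambda p: p["arp_op"] == 1 and p["arp_tpa"] == IP1, "unicast"),
]

# Hypothetical narrowing: priority 80 matches only GARP requests.
narrowed = [
    Flow(80, lambda p: p["eth_src"] == M2 and p["arp_op"] == 1
         and p["arp_spa"] == p["arp_tpa"], "flood"),
    current[1],
]

# router2 sends an ARP request for the distributed router's IP1.
arp_for_dr = {"eth_src": M2, "arp_op": 1, "arp_spa": IP2, "arp_tpa": IP1}
# router2 sends a GARP request for its own IP2.
garp = {"eth_src": M2, "arp_op": 1, "arp_spa": IP2, "arp_tpa": IP2}

print(lookup(current, arp_for_dr))   # flood
print(lookup(narrowed, arp_for_dr))  # unicast
print(lookup(narrowed, garp))        # flood
```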

Thanks everyone for the lively discussion.

Regards,
~Girish

On Thu, May 28, 2020 at 7:33 AM Tim Rozet <trozet at redhat.com> wrote:

>
>
> On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara <dceara at redhat.com> wrote:
>
>> On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
>> > Hi all
>> >
>> > Sorry for top posting. I want to thank you all for the discussion and
>> > give also some feedback from OpenStack perspective which is affected
>> > by the problem described here.
>> >
>> > In OpenStack, it's kind of common to have a shared external network
>> > (logical switch with a localnet port) across many tenants. Each tenant
>> > user may create their own router where their instances will be
>> > connected to access the external network.
>> >
>> > In such scenario, we are hitting the issue described here. In
>> > particular in our tests we exercise 3K VIFs (with 1 FIP) each spanning
>> > 300 LS; each LS connected to a LR (ie. 300 LRs) and that router
>> > connected to the public LS. This is creating a huge problem in terms
>> > of performance and tons of events due to the MAC_Binding entries
>> > generated as a consequence of the GARPs sent for the floating IPs.
>> >
>>
>> Just as an addition to this, GARPs wouldn't be the only reason why all
>> routers would learn the MAC_Binding. Even if we wouldn't be sending
>> GARPs for the FIPs, when a VM that's behind a FIP would send traffic to
>> the outside, the router will generate an ARP request for the next hop
>> using the FIP-IP and FIP-MAC. This will be broadcasted to all routers
>> connected to the public LS and will trigger them to learn the
>> FIP-IP:FIP-MAC binding.
>>
>
> Yeah we shouldn't be learning on regular ARP requests.
>
>
>>
>> > Thanks,
>> > Daniel
>> >
>> >
>> > On Thu, May 28, 2020 at 10:51 AM Dumitru Ceara <dceara at redhat.com> wrote:
>> >>
>> >> On 5/28/20 8:34 AM, Han Zhou wrote:
>> >>>
>> >>>
>> >>> On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara <dceara at redhat.com> wrote:
>> >>>>
>> >>>> Hi Girish, Han,
>> >>>>
>> >>>> On 5/26/20 11:51 PM, Han Zhou wrote:
>> >>>>>
>> >>>>>
>> >>>>> On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail <gmoodalbail at gmail.com> wrote:
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, May 26, 2020 at 12:42 PM Han Zhou <zhouhan at gmail.com> wrote:
>> >>>>>>>
>> >>>>>>> Hi Girish,
>> >>>>>>>
>> >>>>>>> Thanks for the summary. I agree with you that GARP request vs.
>> >>>>>>> reply is irrelevant to the problem here.
>> >>>>
>> >>>> Well, actually I think GARP request vs reply is relevant (at least for
>> >>>> case 1 below) because if OVN would be generating GARP replies we
>> >>>> wouldn't need the priority 80 flow to determine if an ARP request
>> >>>> packet is actually an OVN self originated GARP that needs to be
>> >>>> flooded in the L2 broadcast domain.
>> >>>>
>> >>>> On the other hand, router3 would be learning mac_binding IP2,M2 from
>> >>>> the GARP reply originated by router2 and vice versa so we'd have to
>> >>>> restrict flooding of GARP replies to non-patch ports.
>> >>>>
>> >>>
>> >>> Hi Dumitru, the point was that, on the external LS, the GRs will have
>> >>> to send ARP requests to resolve unknown IPs (at least for the external
>> >>> GW), and these have to be broadcast, which will cause all the GRs to
>> >>> learn all MACs of other GRs. This is regardless of the GARP behavior.
>> >>> You are right that if we only consider the Join switch then the GARP
>> >>> request vs. reply does make a difference. However, GARP request/reply
>> >>> may be really needed only on the external LS.
>> >>>
>> >>
>> >> Ok, but do you see an easy way to determine if we need to add the
>> >> logical flows that flood self originated GARP packets on a given
>> >> logical switch? Right now we add them on all switches.
>> >>
>> >>>>>>> Please see my comment inline below.
>> >>>>>>>
>> >>>>>>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail <gmoodalbail at gmail.com> wrote:
>> >>>>>>>>
>> >>>>>>>> Hello Dumitru,
>> >>>>>>>>
>> >>>>>>>> There are several things that are being discussed on this thread.
>> >>>>>>>> Let me see if I can tease them out for clarity.
>> >>>>>>>>
>> >>>>>>>> 1. All the router IPs are known to OVN (the join switch case)
>> >>>>>>>> 2. Some IPs are known and some are not known (the external logical
>> >>>>>>>> switch that connects to physical network case).
>> >>>>>>>>
>> >>>>>>>> Let us look at each of the case above:
>> >>>>>>>>
>> >>>>>>>> 1. Join Switch Case
>> >>>>>>>>
>> >>>>>>>> +----------------+        +----------------+
>> >>>>>>>> |   l3gateway    |        |   l3gateway    |
>> >>>>>>>> |    router2     |        |    router3     |
>> >>>>>>>> +-------------+--+        +-+--------------+
>> >>>>>>>>             IP2,M2         IP3,M3
>> >>>>>>>>               |             |
>> >>>>>>>>            +--+-------------+---+
>> >>>>>>>>            |    join switch     |
>> >>>>>>>>            +---------+----------+
>> >>>>>>>>                      |
>> >>>>>>>>                   IP1,M1
>> >>>>>>>>              +-------+--------+
>> >>>>>>>>              |  distributed   |
>> >>>>>>>>              |     router     |
>> >>>>>>>>              +----------------+
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> Say GR router2 wants to send a packet out to the DR, and that we
>> >>>>>>>> don't have static MAC-to-IP mappings in the lr_in_arp_resolve table
>> >>>>>>>> on GR router2 (with Han's patch of dynamic_neigh_routes=true for all
>> >>>>>>>> the Gateway Routers). With this in mind, when an ARP request is sent
>> >>>>>>>> out by router2's hypervisor, the packet should be sent directly to
>> >>>>>>>> the distributed router alone. Your commit 32f5ebb0622 (ovn-northd:
>> >>>>>>>> Limit ARP/ND broadcast domain whenever possible) should have allowed
>> >>>>>>>> only unicast. However, in the ls_in_l2_lkup table we have
>> >>>>>>>>
>> >>>>>>>>   table=19(ls_in_l2_lkup      ), priority=80   ,
>> >>>>>>>>     match=(eth.src == { M2 } && (arp.op == 1 || nd_ns)),
>> >>>>>>>>     action=(outport = "_MC_flood"; output;)
>> >>>>>>>>   table=19(ls_in_l2_lkup      ), priority=75   ,
>> >>>>>>>>     match=(flags[1] == 0 && arp.op == 1 && arp.tpa == { IP1 }),
>> >>>>>>>>     action=(outport = "jtor-router2"; output;)
>> >>>>>>>>
>> >>>>>>>> As you can see, the `priority=80` rule will always be hit and the
>> >>>>>>>> packet sent out to all the GRs. The `priority=75` rule is never hit,
>> >>>>>>>> so we will see ARP packets on the GENEVE tunnel. We therefore need
>> >>>>>>>> to change the `priority=80` rule to match only GARP request packets.
>> >>>>>>>> That way, for the known OVN IPs case we don't broadcast.
>> >>>>>>>
>> >>>>>>> Since the solution to case 2) below (i.e.
>> >>>>>>> learn_from_arp_request=false) solves the problem of case 1), too, I
>> >>>>>>> think we don't need this change just for case 1). As @Dumitru Ceara
>> >>>>>>> mentioned, there is some cost because it adds extra flows. It would
>> >>>>>>> be a significant number of flows if there are a lot of dnat_and_snat
>> >>>>>>> IPs. What do you think?
>> >>>>
>> >>>> I think the following might be a solution, although with the cost of
>> >>>> adding as many flows as dnat_and_snat IPs are configured:
>> >>>>
>> >>>> - priority 80: explicitly determine if an ARP request is a self
>> >>>> originated GARP for configured IP addresses and dnat_and_snat IPs (by
>> >>>> matching on all eth.src and arp.tpa pairs) and if so flood on all
>> >>>> non-patch ports.
>> >>>> - priority 75: if arp.tpa is owned by an OVN logical router port,
>> >>>> "unicast" it only on the patch port towards the router.
>> >>>> - priority 1: flood any broadcast packet.
>> >>>>
>> >>>> Together with the learn_from_arp_request=false knob this would cover
>> >>>> both case 1 (join switch) and case 2 (external switch).
>> >>>>
>> >>>> Wdyt?
>> >>>>
>> >>> Would the "learn_from_arp_request=false knob" cover both cases? If yes,
>> >>> we don't need to add more flows of priority 80, or more accurately:
>> >>> whether to update the priority-80 flows is not directly related to the
>> >>> current problem.
>> >>>
>> >>
>> >> Yes, it would, except for the fact that the ARP requests would still be
>> >> flooded to all routers (and ignored at the destination). Which is afaiu
>> >> what Girish was worried about. In order to address that part too I'm
>> >> afraid we have to update the priority-80 flows.
>> >>
>> >> Regards,
>> >> Dumitru
>> >>
>> >>>>>>
>> >>>>>>
>> >>>>>> Han, yes it will work. However, my only concern is that we would
>> >>>>>> send all these ARP requests via tunnel to each of 1000 hypervisors,
>> >>>>>> and these hypervisors will just drop them on the floor when they see
>> >>>>>> learn_from_arp_request=false.
>> >>>>>
>> >>>>> I think maybe it is not a problem since it happens only once on the
>> >>>>> Join switch. Once the MAC is learned, it won't broadcast again. It may
>> >>>>> be more of a problem on the external LS if periodic GARP is required
>> >>>>> there. However, I'd suggest running some tests to see if it is really
>> >>>>> a problem before trying to solve it.
>> >>>>>
>> >>>>>>
>> >>>>>> Han, Dumitru,
>> >>>>>>
>> >>>>>> Why can't we swap the priorities of the above two flows so that the
>> >>>>>> ARP request for a NextHop IP known to OVN will always be sent via
>> >>>>>> `unicast`?
>> >>>>>
>> >>>>> If swapped, even GARP won't get broadcasted. Maybe that's not the
>> >>>>> desired behavior.
>> >>>>>
>> >>>>
>> >>>> This is definitely not desired as we'd be hitting the prio 75 flow
>> >>>> that would send the self originated GARP request (IPx) packet back
>> >>>> towards the router port that owns IPx.
>> >>>>
>> >>>>>>
>> >>>>>> Regards,
>> >>>>>> ~Girish
>> >>>>>>
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>> 2. External Logical Switch Case
>> >>>>>>>>
>> >>>>>>>>                        10.10.10.0/24
>> >>>>>>>>    -------------------------+--------------------------
>> >>>>>>>>                             |
>> >>>>>>>>                          localnet
>> >>>>>>>>                       +-----+-----+
>> >>>>>>>>                       | external  |
>> >>>>>>>>          +------------+    LS1    +-------------+
>> >>>>>>>>          |            +-----+-----+             |
>> >>>>>>>>          |                  |                   |
>> >>>>>>>>      10.10.10.2         10.10.10.3          10.10.10.4
>> >>>>>>>>         SNAT               SNAT                SNAT
>> >>>>>>>>    +-----+-----+      +-----+-----+       +-----------+
>> >>>>>>>>    | l3gateway |      | l3gateway |       | l3gateway |
>> >>>>>>>>    |   node1   |      |   node2   |       |   node3   |
>> >>>>>>>>    +-----------+      +-----------+       +-----------+
>> >>>>>>>>
>> >>>>>>>> In this case, we have some of the IPs in OVN and some in the
>> >>>>>>>> physical network. If we fix (1) above, all the ARP requests for
>> >>>>>>>> OVN's router IPs will be unicast. However, all the ARP requests to
>> >>>>>>>> external IPs, say 10.10.10.1 on the "physical router", will be
>> >>>>>>>> broadcast. Now, we will see these ARP broadcasts on all the L3
>> >>>>>>>> gateway routers. With 'learn_from_arp_request=false' [a], the
>> >>>>>>>> MAC_Binding table will not explode for either ARP or GARP requests.
>> >>>>>>>>
>> >>>>>>>> So, I don't think GARP requests vs. replies is the issue here.
>> >>>>>>>> Furthermore, learning from GARP replies is blocked on certain
>> >>>>>>>> routers. For example,
>> >>>>>>>> https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html
>> >>>>>>>> says "By default, updating the ARP cache on GARP replies is disabled
>> >>>>>>>> on the router." So, our NAT address mappings will not be learnt.
>> >>>>
>> >>>> Just as a side note, the above doesn't mean Juniper boxes don't support
>> >>>> learning from GARP replies, just that they'd need extra configuration.
>> >>>> I don't necessarily think that's a bad thing if properly documented in
>> >>>> OVN that we would be generating GARP replies.
>> >>>>
>> >>>> Regards,
>> >>>> Dumitru
>> >>>>
>> >>>>>>>>
>> >>>>>>>> Regards,
>> >>>>>>>> ~Girish
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> [a] - From Han's mail, the meaning of learn_from_arp_request=false
>> >>>>>>>> --> if the TPA is on the router, add a new entry (it means the
>> >>>>>>>>     remote wants to communicate with this node, so it makes sense to
>> >>>>>>>>     learn the remote as well). Otherwise, ignore it and no new entry
>> >>>>>>>>     is added.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> You received this message because you are subscribed to the Google
>> >>>>> Groups "ovn-kubernetes" group.
>> >>>>>> To unsubscribe from this group and stop receiving emails from it,
>> send
>> >>>>> an email to ovn-kubernetes+unsubscribe at googlegroups.com
>> >>> <mailto:ovn-kubernetes%2Bunsubscribe at googlegroups.com>
>> >>>>> <mailto:ovn-kubernetes%2Bunsubscribe at googlegroups.com
>> >>> <mailto:ovn-kubernetes%252Bunsubscribe at googlegroups.com>>.
>> >>>>>> To view this discussion on the web visit
>> >>>>>
>> >>>
>> https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STRnem2PeSahuwhro1t%2BQJxchZNC7viq8n-ngM9KU%2B%2B-Xw%40mail.gmail.com
>> .
>> >>>>
>> >>>
>> >>
>> >> _______________________________________________
>> >> discuss mailing list
>> >> discuss at openvswitch.org
>> >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>> >
>>
>