[ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Dumitru Ceara dceara at redhat.com
Thu May 28 11:26:43 UTC 2020


On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
> Hi all
> 
> Sorry for top posting. I want to thank you all for the discussion and
> also give some feedback from the OpenStack perspective, which is
> affected by the problem described here.
> 
> In OpenStack, it's kind of common to have a shared external network
> (logical switch with a localnet port) across many tenants. Each tenant
> user may create their own router where their instances will be
> connected to access the external network.
> 
> In such a scenario, we are hitting the issue described here. In
> particular, our tests exercise 3K VIFs (each with 1 FIP) spread across
> 300 LSs; each LS is connected to its own LR (i.e. 300 LRs) and each
> router is connected to the public LS. This creates a huge problem in
> terms of performance and generates tons of events due to the
> MAC_Binding entries created as a consequence of the GARPs sent for the
> floating IPs.
> 

Just as an addition to this, GARPs wouldn't be the only reason why all
routers would learn the MAC_Binding. Even if we weren't sending GARPs
for the FIPs, when a VM that's behind a FIP sends traffic to the
outside, the router will generate an ARP request for the next hop using
the FIP-IP and FIP-MAC. This request is broadcast to all routers
connected to the public LS and triggers them to learn the
FIP-IP:FIP-MAC binding.
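To put a rough number on the effect described above, here is a minimal toy model, not OVN code: every router on the shared public LS receives every FIP's broadcast ARP request for the next hop and, with unrestricted learning, stores the sender binding. The `Router` class, the `total_mac_bindings` helper, the addresses and the next-hop IP are all hypothetical; the `learn_from_arp_request=False` branch follows the semantics Han proposes later in this thread (learn only when the target IP belongs to the router).

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy model of a logical router attached to the shared public LS."""
    fips: dict[str, str]                 # FIP-IP -> FIP-MAC owned by this router
    learn_from_arp_request: bool = True  # knob name as proposed in the thread
    mac_bindings: dict[str, str] = field(default_factory=dict)

    def on_arp_request(self, spa: str, sha: str, tpa: str) -> None:
        if spa in self.fips:
            return  # our own broadcast flooded back to us
        # With learning restricted, only learn if the request targets us.
        if self.learn_from_arp_request or tpa in self.fips:
            self.mac_bindings[spa] = sha


def total_mac_bindings(num_routers: int, fips_per_router: int, learn: bool) -> int:
    """Every FIP sends one broadcast ARP request for the external next hop."""
    next_hop = "172.24.4.1"  # hypothetical upstream gateway, not owned by OVN
    routers = [
        Router(
            fips={f"10.{r // 256}.{r % 256}.{i + 1}": f"fa:16:3e:00:{r % 256:02x}:{i:02x}"
                  for i in range(fips_per_router)},
            learn_from_arp_request=learn,
        )
        for r in range(num_routers)
    ]
    for sender in routers:
        for fip_ip, fip_mac in sender.fips.items():
            for receiver in routers:  # the broadcast reaches every router on the LS
                receiver.on_arp_request(fip_ip, fip_mac, next_hop)
    return sum(len(r.mac_bindings) for r in routers)


if __name__ == "__main__":
    print(total_mac_bindings(30, 10, learn=True))   # 30 * 29 * 10 = 8700
    print(total_mac_bindings(30, 10, learn=False))  # 0
```

Under this model the table grows as routers × (routers − 1) × FIPs-per-router; for 300 routers with 10 FIPs each (3K FIPs) it predicts 300 × 299 × 10 = 897,000 rows. That figure comes from the toy model only, not from a measurement.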

> Thanks,
> Daniel
> 
> 
> On Thu, May 28, 2020 at 10:51 AM Dumitru Ceara <dceara at redhat.com> wrote:
>>
>> On 5/28/20 8:34 AM, Han Zhou wrote:
>>>
>>>
>>> On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara <dceara at redhat.com> wrote:
>>>>
>>>> Hi Girish, Han,
>>>>
>>>> On 5/26/20 11:51 PM, Han Zhou wrote:
>>>>>
>>>>>
>>>>> On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail <gmoodalbail at gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, May 26, 2020 at 12:42 PM Han Zhou <zhouhan at gmail.com> wrote:
>>>>>>>
>>>>>>> Hi Girish,
>>>>>>>
>>>>>>> Thanks for the summary. I agree with you that GARP request vs. reply
>>>>>>> is irrelevant to the problem here.
>>>>
>>>> Well, actually I think GARP request vs. reply is relevant (at least for
>>>> case 1 below) because if OVN were generating GARP replies, we
>>>> wouldn't need the priority 80 flow to determine whether an ARP request
>>>> packet is actually a self-originated OVN GARP that needs to be flooded
>>>> in the L2 broadcast domain.
>>>>
>>>> On the other hand, router3 would be learning mac_binding IP2,M2 from the
>>>> GARP reply originated by router2 and vice versa, so we'd have to
>>>> restrict flooding of GARP replies to non-patch ports.
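To make the request vs. reply distinction concrete, the sketch below builds both forms of a gratuitous ARP using the standard Ethernet/IPv4 ARP layout. In both forms the sender and target IP are the same address; in this sketch the only difference is the opcode. Zeroing the target MAC in both cases is an assumption of the sketch (implementations differ on what they put there), and the MAC/IP values are hypothetical placeholders.

```python
import socket
import struct

ETH_P_ARP = 0x0806
ARP_REQUEST, ARP_REPLY = 1, 2


def mac_bytes(mac: str) -> bytes:
    return bytes(int(octet, 16) for octet in mac.split(":"))


def garp_frame(mac: str, ip: str, op: int) -> bytes:
    """Broadcast Ethernet frame carrying a gratuitous ARP (SPA == TPA)."""
    eth = mac_bytes("ff:ff:ff:ff:ff:ff") + mac_bytes(mac) + struct.pack("!H", ETH_P_ARP)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                                       # htype: Ethernet
        0x0800,                                  # ptype: IPv4
        6, 4,                                    # hlen, plen
        op,                                      # 1 = request, 2 = reply
        mac_bytes(mac), socket.inet_aton(ip),    # sender MAC / sender IP
        bytes(6), socket.inet_aton(ip),          # target MAC zeroed, target IP == sender IP
    )
    return eth + arp


def arp_op(frame: bytes) -> int:
    # The ARP opcode sits 6 bytes into the ARP header, after the 14-byte Ethernet header.
    return struct.unpack_from("!H", frame, 14 + 6)[0]


def is_gratuitous(frame: bytes) -> bool:
    spa = frame[14 + 14 : 14 + 18]   # sender protocol address
    tpa = frame[14 + 24 : 14 + 28]   # target protocol address
    return spa == tpa


req = garp_frame("0a:58:64:40:00:02", "100.64.0.2", ARP_REQUEST)
rep = garp_frame("0a:58:64:40:00:02", "100.64.0.2", ARP_REPLY)
print(arp_op(req), arp_op(rep), is_gratuitous(req), is_gratuitous(rep))
```

Since a GARP request is an ordinary ARP request (op 1) apart from SPA == TPA, a switch that wants to flood it differently from a normal request has to recognize it by something else, which is what the priority-80 flows do in OVN.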
>>>>
>>>
>>> Hi Dumitru, the point was that, on the external LS, the GRs will have to
>>> send ARP requests to resolve unknown IPs (at least for the external GW),
>>> and these have to be broadcast, which will cause all the GRs to learn the
>>> MACs of all other GRs. This is regardless of the GARP behavior. You are
>>> right that if we only consider the Join switch then GARP request
>>> vs. reply does make a difference. However, GARP requests/replies may
>>> really be needed only on the external LS.
>>>
>>
>> Ok, but do you see an easy way to determine if we need to add the
>> logical flows that flood self originated GARP packets on a given logical
>> switch? Right now we add them on all switches.
>>
>>>>>>> Please see my comment inline below.
>>>>>>>
>>>>>>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail <gmoodalbail at gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hello Dumitru,
>>>>>>>>
>>>>>>>> There are several things being discussed in this thread. Let me
>>>>>>>> see if I can tease them out for clarity.
>>>>>>>>
>>>>>>>> 1. All the router IPs are known to OVN (the join switch case).
>>>>>>>> 2. Some IPs are known and some are not (the external logical
>>>>>>>>    switch that connects to the physical network case).
>>>>>>>>
>>>>>>>> Let us look at each of the case above:
>>>>>>>>
>>>>>>>> 1. Join Switch Case
>>>>>>>>
>>>>>>>> +----------------+        +----------------+
>>>>>>>> |   l3gateway    |        |   l3gateway    |
>>>>>>>> |    router2     |        |    router3     |
>>>>>>>> +-------------+--+        +-+--------------+
>>>>>>>>             IP2,M2         IP3,M3
>>>>>>>>               |             |
>>>>>>>>            +--+-------------+---+
>>>>>>>>            |    join switch     |
>>>>>>>>            +---------+----------+
>>>>>>>>                      |
>>>>>>>>                   IP1,M1
>>>>>>>>              +-------+--------+
>>>>>>>>              |  distributed   |
>>>>>>>>              |     router     |
>>>>>>>>              +----------------+
>>>>>>>>
>>>>>>>>
>>>>>>>> Say GR router2 wants to send a packet out to the DR, and we don't
>>>>>>>> have static MAC-to-IP mappings in the lr_in_arp_resolve table on GR
>>>>>>>> router2 (with Han's patch setting dynamic_neigh_routers=true for all
>>>>>>>> the Gateway Routers). With this in mind, when an ARP request is sent
>>>>>>>> out by router2's hypervisor, the packet should be sent directly to
>>>>>>>> the distributed router alone. Your commit 32f5ebb0622 (ovn-northd:
>>>>>>>> Limit ARP/ND broadcast domain whenever possible) should have allowed
>>>>>>>> only unicast. However, in the ls_in_l2_lkup table we have:
>>>>>>>>
>>>>>>>>   table=19(ls_in_l2_lkup      ), priority=80   , match=(eth.src == { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
>>>>>>>>   table=19(ls_in_l2_lkup      ), priority=75   , match=(flags[1] == 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2"; output;)
>>>>>>>>
>>>>>>>> As you can see, the `priority=80` rule will always be hit and the
>>>>>>>> packet sent out to all the GRs; the `priority=75` rule is never hit.
>>>>>>>> So, we will see ARP packets on the GENEVE tunnel. We need to change
>>>>>>>> the `priority=80` rule to match only GARP request packets. That way,
>>>>>>>> for the known OVN IPs case we don't broadcast.
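The shadowing Girish describes follows from "highest matching priority wins". The toy lookup below encodes the two quoted `ls_in_l2_lkup` flows as Python predicates; it is only a model of the matching order, not ovn-controller logic. `M2`/`IP1` are the thread's placeholders, the `nd_ns` alternative of the priority-80 match is left out, and the `Pkt`/`Flow`/`l2_lookup` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Pkt:
    eth_src: str
    arp_op: int      # 1 = ARP request
    arp_tpa: str
    flag1: int = 0   # stands in for flags[1] in the match


@dataclass(frozen=True)
class Flow:
    priority: int
    match: Callable[[Pkt], bool]
    outport: str


# The two quoted flows, simplified (IPv4 ARP only).
FLOWS = [
    Flow(80, lambda p: p.eth_src == "M2" and p.arp_op == 1, "_MC_flood"),
    Flow(75, lambda p: p.flag1 == 0 and p.arp_op == 1 and p.arp_tpa == "IP1",
         "jtor-router2"),
]


def l2_lookup(flows: list[Flow], pkt: Pkt) -> str:
    """Return the outport of the highest-priority matching flow."""
    for flow in sorted(flows, key=lambda f: f.priority, reverse=True):
        if flow.match(pkt):
            return flow.outport
    return "no match"


# router2 resolving the DR's IP: priority 80 wins, so the request is flooded.
print(l2_lookup(FLOWS, Pkt(eth_src="M2", arp_op=1, arp_tpa="IP1")))
```

Any ARP request sourced from M2 satisfies the priority-80 match regardless of `arp.tpa`, so the priority-75 unicast flow only ever sees requests from other source MACs.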
>>>>>>>
>>>>>>> Since the solution to case 2) below (i.e.
>>>>>>> learn_from_arp_request=false) solves the problem of case 1), too, I
>>>>>>> think we don't need this change just for case 1). As @Dumitru Ceara
>>>>>>> mentioned, there is some cost because it adds extra flows. It would
>>>>>>> be a significant number of flows if there are a lot of dnat_and_snat
>>>>>>> IPs. What do you think?
>>>>
>>>> I think the following might be a solution, although at the cost of
>>>> adding as many flows as there are configured dnat_and_snat IPs:
>>>>
>>>> - priority 80: explicitly determine if an ARP request is a
>>>> self-originated GARP for configured IP addresses and dnat_and_snat IPs
>>>> (by matching on all eth.src and arp.tpa pairs) and, if so, flood it on
>>>> all non-patch ports.
>>>> - priority 75: if arp.tpa is owned by an OVN logical router port,
>>>> "unicast" it only on the patch port towards that router.
>>>> - priority 1: flood any broadcast packet.
>>>>
>>>> Together with the learn_from_arp_request=false knob this would cover
>>>> both case 1 (join switch) and case 2 (external switch).
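The three-tier scheme proposed above can be sketched as a plain decision function. This is an illustration of the proposed ordering, not ovn-northd code: the join-switch contents reuse the diagram's placeholders, the patch-port name `jtor-dr` is made up, and "flood non-patch ports" / "flood all ports" are descriptive labels rather than real multicast group names. The helper `priority80_flow_count` shows where the extra cost comes from: one priority-80 flow per (MAC, IP) pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArpRequest:
    eth_src: str
    arp_tpa: str


# Hypothetical join-switch state using the diagram's placeholders:
# patch port -> (router port MAC, IPs owned by it, incl. dnat_and_snat IPs).
ROUTER_PORTS: dict[str, tuple[str, frozenset[str]]] = {
    "jtor-router2": ("M2", frozenset({"IP2"})),
    "jtor-router3": ("M3", frozenset({"IP3"})),
    "jtor-dr": ("M1", frozenset({"IP1"})),  # port name invented for the sketch
}


def l2_decision(pkt: ArpRequest) -> str:
    # priority 80: (eth.src, arp.tpa) pair of a router -> self-originated GARP
    for mac, ips in ROUTER_PORTS.values():
        if pkt.eth_src == mac and pkt.arp_tpa in ips:
            return "flood non-patch ports"
    # priority 75: target IP owned by a router port -> unicast on its patch port
    for port, (_mac, ips) in ROUTER_PORTS.items():
        if pkt.arp_tpa in ips:
            return port
    # priority 1: any other broadcast is flooded
    return "flood all ports"


def priority80_flow_count() -> int:
    """One priority-80 flow per (MAC, IP) pair, which is the extra cost."""
    return sum(len(ips) for _mac, ips in ROUTER_PORTS.values())


print(l2_decision(ArpRequest("M2", "IP2")))         # router2's own GARP
print(l2_decision(ArpRequest("M2", "IP1")))         # router2 resolving the DR
print(l2_decision(ArpRequest("M2", "10.10.10.1")))  # external next hop
```

With this ordering, router2's GARP stays off the patch ports, its request for the DR is unicast, and requests for unknown (external) IPs are still flooded, which is the part the learn_from_arp_request knob is meant to handle.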
>>>>
>>>> Wdyt?
>>>>
>>> Would the "learn_from_arp_request=false knob" cover both cases? If yes,
>>> we don't need to add more flows of priority 80, or more accurately:
>>> whether to update the priority-80 flows is not directly related to the
>>> current problem.
>>>
>>
>> Yes, it would, except that the ARP requests would still be flooded to
>> all routers (and ignored at the destination), which is, afaiu, what
>> Girish was worried about. In order to address that part too, I'm
>> afraid we have to update the priority-80 flows.
>>
>> Regards,
>> Dumitru
>>
>>>>>>
>>>>>>
>>>>>> Han, yes, it will work. However, my only concern is that we would
>>>>>> send all these ARP requests via tunnel to each of 1000 hypervisors,
>>>>>> and these hypervisors will just drop them on the floor when they see
>>>>>> learn_from_arp_request=false.
>>>>>
>>>>> I think maybe it is not a problem, since it happens only once on the
>>>>> Join switch. Once the MAC is learned, it won't broadcast again. It may
>>>>> be more of a problem on the external LS if periodic GARPs are required
>>>>> there. However, I'd suggest running some tests to see if it is really
>>>>> a problem before trying to solve it.
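Han's point (a one-time broadcast on the join switch versus repeated broadcasts for periodic GARPs) can be illustrated with a small counting model. It assumes each L2 broadcast is replicated once per remote chassis (1000, as in Girish's concern) and that a reply always arrives after the first request; the `GatewayRouter` class and the `"learned-mac"` placeholder are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GatewayRouter:
    """Toy model: every L2 broadcast is replicated once per remote chassis."""
    remote_chassis: int
    mac_bindings: dict[str, str] = field(default_factory=dict)
    tunnel_copies: int = 0

    def send_to(self, next_hop: str) -> None:
        if next_hop in self.mac_bindings:
            return  # already resolved: unicast, no broadcast
        self.tunnel_copies += self.remote_chassis    # broadcast ARP request
        self.mac_bindings[next_hop] = "learned-mac"  # assume a reply arrives

    def send_garp(self) -> None:
        # GARPs are broadcast regardless of what has been learned.
        self.tunnel_copies += self.remote_chassis


gr = GatewayRouter(remote_chassis=1000)
for _ in range(10_000):
    gr.send_to("IP1")    # only the first packet triggers a broadcast
print(gr.tunnel_copies)  # 1000
for _ in range(5):
    gr.send_garp()       # e.g. five periodic GARP rounds
print(gr.tunnel_copies)  # 6000
```

Under these assumptions, the cost on the join switch is bounded by the number of distinct next hops, while periodic GARPs keep adding a full fan-out per round, which is why the external LS is the case worth measuring.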
>>>>>
>>>>>>
>>>>>> Han, Dumitru,
>>>>>>
>>>>>> Why can't we swap the priorities of the above two flows so that the
>>>>>> ARP request for a next-hop IP known to OVN is always sent via
>>>>>> `unicast`?
>>>>>
>>>>> If swapped, even GARPs won't get broadcast. Maybe that's not the
>>>>> desired behavior.
>>>>>
>>>>
>>>> This is definitely not desired, as we'd be hitting the prio 75 flow that
>>>> would send the self-originated GARP request (IPx) packet back towards
>>>> the router port that owns IPx.
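The effect of swapping the two priorities can be checked with a tiny model that evaluates the flood rule and the unicast rule in either order. Both rules are simplified: the flood rule matches any ARP request from a router MAC (the quoted flow only lists `{ M2 }`), and the unicast rule sends the request to the patch port owning `arp.tpa`. All names and the ownership table are hypothetical placeholders based on the join-switch diagram.

```python
# Placeholders from the join switch diagram: router2 owns IP2 behind M2.
OWNER_PORT = {"IP2": "jtor-router2", "IP3": "jtor-router3"}
ROUTER_MACS = {"M2", "M3"}


def lookup(eth_src: str, arp_tpa: str, swapped: bool) -> str:
    def flood_rule():    # priority-80 style: ARP request from a router MAC
        return "_MC_flood" if eth_src in ROUTER_MACS else None

    def unicast_rule():  # priority-75 style: tpa owned by a router port
        return OWNER_PORT.get(arp_tpa)

    rules = [unicast_rule, flood_rule] if swapped else [flood_rule, unicast_rule]
    for rule in rules:
        out = rule()
        if out is not None:
            return out
    return "_MC_flood"   # priority-1 fallback: flood


# router2's own GARP for IP2 (eth.src M2, arp.tpa IP2):
print(lookup("M2", "IP2", swapped=False))  # _MC_flood
print(lookup("M2", "IP2", swapped=True))   # jtor-router2
```

With the swap, router2's GARP for IP2 is delivered only to `jtor-router2`, i.e. back to router2 itself, which is the failure mode described above.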
>>>>
>>>>>>
>>>>>> Regards,
>>>>>> ~Girish
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> 2. External Logical Switch Case
>>>>>>>>
>>>>>>>>                        10.10.10.0/24
>>>>>>>>    -------------------------+--------------------------
>>>>>>>>                             |
>>>>>>>>                          localnet
>>>>>>>>                       +-----+-----+
>>>>>>>>                       | external  |
>>>>>>>>          +------------+    LS1    +-------------+
>>>>>>>>          |            +-----+-----+             |
>>>>>>>>          |                  |                   |
>>>>>>>>      10.10.10.2         10.10.10.3          10.10.10.4
>>>>>>>>         SNAT               SNAT                SNAT
>>>>>>>>    +-----+-----+      +-----+-----+       +-----------+
>>>>>>>>    | l3gateway |      | l3gateway |       | l3gateway |
>>>>>>>>    |   node1   |      |   node2   |       |   node3   |
>>>>>>>>    +-----------+      +-----------+       +-----------+
>>>>>>>>
>>>>>>>> In this case, some of the IPs are in OVN and some are in the
>>>>>>>> physical network. If we fix (1) above, all the ARP requests for
>>>>>>>> OVN's router IPs will be unicast. However, all the ARP requests for
>>>>>>>> external IPs, say 10.10.10.1 on the "physical router", will be
>>>>>>>> broadcast, and we will see these ARP broadcasts on all the L3
>>>>>>>> gateway routers. With 'learn_from_arp_request=false' [a], the
>>>>>>>> MAC_Binding table will not explode for either ARP or GARP requests.
>>>>>>>>
>>>>>>>> So, I don't think GARP requests vs. replies are the issue here.
>>>>>>>> Furthermore, learning from GARP replies is blocked on certain
>>>>>>>> routers. For example:
>>>>>>>> https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html
>>>>>>>> says "By default, updating the ARP cache on GARP replies is disabled
>>>>>>>> on the router." So, our NAT address mappings will not be learnt.
>>>>
>>>> Just as a side note, the above doesn't mean Juniper boxes don't support
>>>> learning from GARP replies, just that they'd need extra configuration. I
>>>> don't necessarily think that's a bad thing, as long as OVN properly
>>>> documents that it generates GARP replies.
>>>>
>>>> Regards,
>>>> Dumitru
>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> ~Girish
>>>>>>>>
>>>>>>>>
>>>>>>>> [a] - From Han's mail, the meaning of learn_from_arp_request=false
>>>>>>>> --> if the TPA is on the router, add a new entry (it means the
>>>>>>>>     remote wants to communicate with this node, so it makes sense to
>>>>>>>>     learn the remote as well). Otherwise, ignore it and add no new
>>>>>>>>     entry.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "ovn-kubernetes" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to ovn-kubernetes+unsubscribe at googlegroups.com
>>> <mailto:ovn-kubernetes%2Bunsubscribe at googlegroups.com>
>>>>> <mailto:ovn-kubernetes%2Bunsubscribe at googlegroups.com
>>> <mailto:ovn-kubernetes%252Bunsubscribe at googlegroups.com>>.
>>>>>> To view this discussion on the web visit
>>>>>
>>> https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STRnem2PeSahuwhro1t%2BQJxchZNC7viq8n-ngM9KU%2B%2B-Xw%40mail.gmail.com.
>>>>
>>>
>>
>> _______________________________________________
>> discuss mailing list
>> discuss at openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> 


