[ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Han Zhou zhouhan at gmail.com
Fri May 8 18:02:54 UTC 2020


On Fri, May 8, 2020 at 6:41 AM Tim Rozet <trozet at redhat.com> wrote:

> Girish, Han,
> From my understanding, the GR (per node) <----> DR link is a local subnet,
> and you don't want the overhead of many switch objects in OVN, but you also
> don't want all the GRs connecting to a single switch, to avoid a large L2
> domain. Isn't the simple solution to allow connecting routers to each other
> without an intermediary switch?
>
>
> Tim Rozet
> Red Hat CTO Networking Team
>
>
Hi Tim,

Thanks for the suggestion. This would be an improvement, but it doesn't
completely solve the problem mentioned by Girish:
- Subnet management for the large number of transit subnets is still needed.
- For the external logical switch, this doesn't help.
It is still O(n) in the number of datapaths, the same as the approach of
splitting the join LS, but it is more efficient, because for each of the
direct connections between the LR and the GRs, the cost of <patch_port - LS -
patch_port> is avoided. I think it is worth trying.
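
For reference, OVN already supports such direct connections via peer router
ports, so Tim's suggestion needs no schema change. A minimal sketch, with
made-up router and port names:

  # A point-to-point link between the distributed router and one node's GR,
  # with no logical switch in between.
  ovn-nbctl lrp-add distributed_router dr-to-gr1 00:00:00:00:01:01 \
      100.64.0.1/30 peer=gr1-to-dr
  ovn-nbctl lrp-add gr_node1 gr1-to-dr 00:00:00:00:01:02 \
      100.64.0.2/30 peer=dr-to-gr1

Each node still consumes one small transit subnet (a /30 here), which is the
subnet-management cost noted above.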

Hi Girish, for the DGP solution, please see my comments below:

>
> On Fri, May 8, 2020 at 3:17 AM Girish Moodalbail <gmoodalbail at gmail.com>
> wrote:
>
>>
>>
>> On Thu, May 7, 2020 at 11:24 PM Han Zhou <zhouhan at gmail.com> wrote:
>>
>>> (Add the MLs back)
>>>
>>> On Thu, May 7, 2020 at 4:01 PM Girish Moodalbail <gmoodalbail at gmail.com>
>>> wrote:
>>>
>>>> Hello Han,
>>>>
>>>> Sorry, I was monitoring the ovn-kubernetes google group and didn't see
>>>> your emails till now.
>>>>
>>>>
>>>>>
>>>>> On the other hand, why wouldn't splitting the join logical switch into
>>>>> 1000 LSes solve the problem? I understand that there will be 1000 more
>>>>> datapaths, and 1000 more LRPs, but these are all O(n), which is much more
>>>>> efficient than the O(n^2) explosion. What are the other scale issues
>>>>> created by this?
>>>>>
>>>>
>>>> Splitting the single join logical switch into 1000 different logical
>>>> switches is how I have resolved the problem for now. However, with this
>>>> design I see the following issues.
>>>> (1) Complexity
>>>>    where one logical switch should have sufficed, we now need to create
>>>> 1000 logical switches just to work around the O(n^2) logical flows
>>>> (2) IPAM management (see the arithmetic note after this list)
>>>>   - before, I had one IP subnet 100.64.0.0/16 for the single logical
>>>> switch and depended on OVN IPAM to allocate IPs out of that subnet
>>>>   - now I need to first do subnet management (break the /16 into /29
>>>> CIDRs) in OVN K8s and then assign one subnet to each of the join logical
>>>> switches
>>>> (3) each of these join logical switches is a distributed switch. The
>>>> flows related to each one of them will be present on each hypervisor.
>>>> This will increase the number of OpenFlow flows. However, from the OVN
>>>> K8s point of view, this logical switch is essentially pinned to a
>>>> hypervisor, and its role is to connect that hypervisor's l3gateway to
>>>> the distributed router.
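>>>>
>>>> (A quick arithmetic check on point (2): breaking 100.64.0.0/16 into
>>>> /29s yields 2^(29-16) = 2^13 = 8192 subnets of 8 addresses each, 6 of
>>>> them usable, which easily covers 1000 join switches that each need only
>>>> two router-facing IPs.)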
>>>>
>>>> We are trying to simplify the OVN logical topology for OVN K8s so that
>>>> the number of logical flows (and therefore the number of OpenFlow flows)
>>>> are reduced and that reduces the pressure on ovn-northd, OVN SB DB, and
>>>> finally ovn-controller processes.
>>>>
>>>> Every node in an OVN K8s cluster adds 4 resources. So, in a 1000-node
>>>> k8s cluster we will have 4000 + 1 (the distributed router). This ends up
>>>> creating around 250K OpenFlow rules on each of the hypervisors. This
>>>> number just supports the initial logical topology; I am not accounting
>>>> for any flows that will be generated for k8s network policies, services,
>>>> and so on.
>>>>
>>>>
>>>>>
>>>>> In addition, Girish, for the external LS, I am not sure why it can't
>>>>> be shared, if all the nodes are connected to a single L2 network. (If
>>>>> they are connected to separate L2 networks, different external LSes
>>>>> should be created, at least according to the current OVN model.)
>>>>>
>>>>
>>>> Yes, the plan was to share the same external LS with all of the L3
>>>> gateway routers since they are all on the same broadcast domain. However,
>>>> we will end up with the same 2M logical flows, since that single external
>>>> LS connects all 1000 L3 gateway routers.
>>>>
>>>> In short, for a 1000-node K8s cluster, if we fix the logical flow
>>>> explosion, then we can reduce the number of logical resources in the OVN
>>>> K8s topology by 1998 (1000 join LSes become 1, and 1000 external LSes
>>>> become 1).
>>>>
>>>>
>>> Ok, so now we are not satisfied with even O(n), and instead we want to
>>> make it O(1) for some of the resources.
>>> I think the major problem is the per-node gateway routers. They don't
>>> seem really necessary in theory. Ideally the topology can be simplified
>>> with the concept of distributed gateway ports (DGPs) on a single logical
>>> router (the join router), and then we can remove all the join LSes and
>>> gateway routers, something like below:
>>>
>>>     +------------------------------------------+
>>>     |        external logical switch           |
>>>     +-+-------------+--------------------+-----+
>>>       |             |                    |
>>> +-----+------+ +----+-------+   +--------+---------+
>>> | dgp1@node1 | | dgp2@node2 |...| dgp1000@node1000 |
>>> +-----+------+ +----+-------+   +--------+---------+
>>>       |             |                    |
>>>     +-+-------------+--------------------+-----+
>>>     |             logical router               |
>>>     +------------------------------------------+
>>>
>>> (dgp = distributed gateway port)
>>>
>>> This way, you only need one router and one external logical switch, and
>>> there won't be the O(n^2) flow explosion problem for ARP resolving,
>>> because you have only 1 LR. The number of logical routers and switches
>>> becomes O(1). The number of router ports is still O(n), but it is also
>>> halved.
>>>
>>> In reality, there are some problems of this solution that need to be
>>> addressed.
>>>
>>> Firstly, it would require some change in OVN, because currently OVN has
>>> a limitation that each LR can have only one gateway router port. However,
>>> there doesn't seem to be anything fundamental that would prevent us from
>>> removing that restriction to support multiple distributed gateway ports
>>> on a single LR. I'd like to hear from more OVN folks in case there is
>>> some reason we shouldn't do this.
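>>>
>>> For concreteness, configuring such per-node DGPs would look roughly like
>>> the sketch below (hypothetical names; today OVN rejects a second gateway
>>> port on the same LR, so this assumes the restriction is lifted):
>>>
>>>   # One router port per node on the shared external subnet, each pinned
>>>   # to its node's chassis.
>>>   ovn-nbctl lrp-add join_router dgp1 00:00:00:00:02:01 10.10.10.2/24
>>>   ovn-nbctl lrp-set-gateway-chassis dgp1 node1 100
>>>   ovn-nbctl lrp-add join_router dgp2 00:00:00:00:02:02 10.10.10.3/24
>>>   ovn-nbctl lrp-set-gateway-chassis dgp2 node2 100
>>>   # ... and so on for each node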
>>>
>>> The other thing I am not so sure about is connecting the logical router
>>> to the external logical switch through multiple ports. This means we will
>>> have multiple logical router ports on the same subnet, which is something
>>> we traditionally don't do. However, I think this may work with OVN static
>>> routes with src-ip policy routing and an output_port specified, so that
>>> the LR knows which port (and chassis) to send the traffic out of,
>>> provided that there is only one nexthop, which is the default external
>>> GW. If multiple nexthops need to be supported, this won't work (and we
>>> would probably have to look at a solution that avoids the static
>>> neighbour table population).
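>>>
>>> As a sketch (again hypothetical names and addresses, building on the DGP
>>> example above), the per-node egress could be pinned with source-based
>>> routes that name the output port:
>>>
>>>   # Traffic sourced from node1's pod subnet leaves via dgp1 toward the
>>>   # single default external gateway.
>>>   ovn-nbctl --policy=src-ip lr-route-add join_router 10.244.1.0/24 \
>>>       10.10.10.1 dgp1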
>>>
>>
>> Hello Han,
>>
>> I did consider distributed gateway ports. However, there are two issues
>> with them:
>>
>> 1. In order to support K8s NodePort services we need to create a
>> north-south LB, and the L3 gateway router is a perfect solution for that.
>> AFAIK, DGP doesn't support it.
>>
>
In fact, DGP does support LB (at least judging from the code:
https://github.com/ovn-org/ovn/blob/master/northd/ovn-northd.c#L9318), but
the ovn-nb manpage may need an update.
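
For example, attaching a load balancer to the router would use the usual
commands (the VIP and backends below are made up for illustration):

  # North-south VIP, processed at the router's gateway chassis.
  ovn-nbctl lb-add lb-nodeport 10.10.10.2:30080 \
      10.244.1.5:8080,10.244.2.7:8080 tcp
  ovn-nbctl lr-lb-add join_router lb-nodeport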


>> 2. Datapath performance would be bad with DGP. We want a packet meant
>> for the host or the Internet to exit from the hypervisor on which the pod
>> exists. The L3 gateway router provides us with this functionality. With
>> DGP, and with OVN supporting only one instance of it, packets unnecessarily
>> get forwarded over a tunnel to the DGP chassis for SNATing and then get
>> forwarded back over a tunnel to the host just to exit locally.
>>
>
This is related to the changes needed for DGP (the first point I mentioned
in the previous email). In the diagram I drew, there would be 1000 DGPs,
each residing on its own chassis, precisely so that north-south traffic can
be forwarded on the local chassis without going through a central node,
just like how it works today in ovn-k8s. However, this may not be a small
change, because today the NAT and LB processing on such LRs (LRs with a
DGP) is all based on the assumption that there is only one DGP. For
example, the NB schema would also need to be changed so that the NAT/LB
rules for a router can specify a DGP to determine the central processing
location for those rules.
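
To illustrate the gap: today an SNAT rule is configured per router, with no
field to say which gateway port should apply it (addresses made up):

  # SNAT pod traffic to an external IP; with multiple DGPs there is
  # currently no way to pick which gateway port/chassis handles this rule.
  ovn-nbctl lr-nat-add join_router snat 10.10.10.2 10.244.1.0/24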

So, to summarize, if we can make multi-DGP work, it would be the best
solution for the ovn-k8s scenario. If we can't (either because of a design
problem, or because it is too big an effort for the gains), maybe
configurably avoiding the static neighbour flows is a good way to go. Both
options require changes in OVN. Without changes in OVN, a further
optimization of your current workaround is what Tim has suggested: replace
the large number of small join LSes (and the LRPs and patch ports on both
sides) with the same number of directly connected LRP pairs.

Thanks,
Han


>
>> Also, I would like to clarify the topology of the external logical
>> switches and the l3gateway routers. The current topology is like this:
>>
>> Topology (A)
>>
>>                        10.10.10.0/24
>>    ------+-----------------+--------------------+------
>>          |                 |                    |
>>       localnet          localnet             localnet
>>    +-----+-----+      +----+------+       +-----+-----+
>>    | external  |      | external  |       | external  |
>>    |    LS1    |      |    LS2    |       |    LS3    |
>>    +-----+-----+      +----+------+       +-----+-----+
>>          |                 |                    |
>>      10.10.10.2        10.10.10.3           10.10.10.4
>>         SNAT              SNAT                 SNAT
>>    +-----+-----+     +-----+-----+        +-----+-----+
>>    | l3gateway |     | l3gateway |        | l3gateway |
>>    |   node1   |     |   node2   |        |   node3   |
>>    +-----------+     +-----------+        +-----------+
>>
>>
>> and I would like to move to the topology below, which is very similar to
>> physical networking where all tenants' VRFs SNAT to a common L2 segment
>> in the DC.
>>
>> Topology (B)
>>                        10.10.10.0/24
>>    -------------------------+--------------------------
>>                             |
>>                          localnet
>>                       +-----+-----+
>>                       | external  |
>>          +------------+    LS1    +-------------+
>>          |            +----+------+             |
>>          |                 |                    |
>>      10.10.10.2        10.10.10.3           10.10.10.4
>>         SNAT              SNAT                 SNAT
>>    +-----+-----+     +-----+-----+        +-----+-----+
>>    | l3gateway |     | l3gateway |        | l3gateway |
>>    |   node1   |     |   node2   |        |   node3   |
>>    +-----------+     +-----------+        +-----------+
>>
>> I cannot do this because of the 2M logical flows that get created once
>> 1000 l3 gateway routers are connected through a single logical switch.
>>
>> Note: Topology (A) might still be relevant in certain DCs that don't
>> stretch L2 across racks and have pure L3 in the core. So, everything
>> upstream from the TOR towards the core switches is L3, and everything
>> downstream from the TOR to the nodes is L2. Topology (B) above is just an
>> optimization for end-users who have a single stretched VLAN.
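>>
>> For reference, the shared external LS1 in topology (B) would be an
>> ordinary switch with a localnet port, roughly as below (names are
>> illustrative; network_name must match the chassis bridge mappings):
>>
>>   ovn-nbctl ls-add ext-ls1
>>   ovn-nbctl lsp-add ext-ls1 ext-ls1-localnet
>>   ovn-nbctl lsp-set-type ext-ls1-localnet localnet
>>   ovn-nbctl lsp-set-addresses ext-ls1-localnet unknown
>>   ovn-nbctl lsp-set-options ext-ls1-localnet network_name=physnet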
>>
>> Regards,
>> ~Girish