[ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Girish Moodalbail gmoodalbail at gmail.com
Fri May 8 07:17:20 UTC 2020


On Thu, May 7, 2020 at 11:24 PM Han Zhou <zhouhan at gmail.com> wrote:

> (Add the MLs back)
>
> On Thu, May 7, 2020 at 4:01 PM Girish Moodalbail <gmoodalbail at gmail.com>
> wrote:
>
>> Hello Han,
>>
>> Sorry, I was monitoring the ovn-kubernetes google group and didn't see
>> your emails till now.
>>
>>
>>>
>>> On the other hand, why wouldn't splitting the join logical switch into
>>> 1000 LSes solve the problem? I understand that there will be 1000 more
>>> datapaths and 1000 more LRPs, but these are all O(n), which is much more
>>> efficient than the O(n^2) explosion. What are the other scale issues
>>> created by this?
>>>
>>
>> Splitting the single join logical switch into 1000 different logical
>> switches is how I have resolved the problem for now. However, this design
>> has the following issues.
>> (1) Complexity
>>    where one logical switch should have sufficed, we now need to create
>> 1000 logical switches just to work around the O(n^2) logical flows
>> (2) IPAM management
>>   - before, I had one IP subnet 100.64.0.0/16 for the single logical
>> switch and depended on OVN IPAM to allocate IPs from that subnet
>>   - now I need to first do subnet management (break the /16 into /29
>> CIDRs) in OVN K8s and then assign each subnet to its own join logical
>> switch (see the sketch after this list)
>> (3) each of these join logical switches is a distributed switch. The
>> flows related to each one of them will be present on every hypervisor,
>> which increases the number of OpenFlow flows. However, from the OVN K8s
>> point of view each such logical switch is essentially pinned to one
>> hypervisor, and its role is to connect that hypervisor's l3gateway to the
>> distributed router.
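>>
>> As a concrete illustration of (1) and (2), the per-node workaround looks
>> roughly like this (the switch names and /29 carve-outs are illustrative,
>> not our actual naming scheme):
>>
>>   # one join switch per node, each with its own /29 out of 100.64.0.0/16
>>   ovn-nbctl ls-add join-node1
>>   ovn-nbctl set Logical_Switch join-node1 other_config:subnet=100.64.0.0/29
>>   ovn-nbctl ls-add join-node2
>>   ovn-nbctl set Logical_Switch join-node2 other_config:subnet=100.64.0.8/29
>>   # ... repeated 1000 times, one switch and one /29 per node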
>>
>> We are trying to simplify the OVN logical topology for OVN K8s so that
>> the number of logical flows (and therefore the number of OpenFlow flows)
>> is reduced, which in turn reduces the pressure on the ovn-northd, OVN SB
>> DB, and ovn-controller processes.
>>
>> Every node in an OVN K8s cluster adds 4 logical resources. So, in a
>> 1000-node k8s cluster we will have 4000 + 1 (the distributed router).
>> This ends up creating around 250K OpenFlow rules on each hypervisor, and
>> that is just to support the initial logical topology. I am not accounting
>> for any flows that will be generated for k8s network policies, services,
>> and so on.
>>
>>
>>>
>>> In addition, Girish, for the external LS, I am not sure why it can't be
>>> shared, if all the nodes are connected to a single L2 network. (If they
>>> are connected to separate L2 networks, different external LSes should be
>>> created, at least according to the current OVN model.)
>>>
>>
>> Yes, the plan was to share the same external LS across all of the L3
>> gateway routers, since they are all on the same broadcast domain.
>> However, we will then end up with the same 2M logical flows, because that
>> single external LS connects all 1000 L3 gateway routers.
>>
>> In short, for a 1000-node K8s cluster, if we fix the logical flow
>> explosion, then we can reduce the number of logical resources in the OVN
>> K8s topology by 1998 (1000 join LSes become 1, and 1000 external LSes
>> become 1).
>>
>>
> Ok, so now we are not satisfied with even O(n); instead we want to make it
> O(1) for some of the resources.
> I think the major problem is the per-node gateway routers. They do not
> seem really necessary in theory. Ideally the topology can be simplified
> with the concept of distributed gateway ports on a single logical router
> (the join router); then we can remove all the join LSes and gateway
> routers, something like below:
>
>     +------------------------------------------+
>     |        external logical switch           |
>     +-+-------------+--------------------+-----+
>       |             |                    |
> +-----+-----+ +-----------+        +-----+-----------+
> | dgp1 at node1| | dgp2 at node2|   ...  |dgp1000 at node1000 |
> +-----+-----+ +-----+-----+        +-----+-----------+
>       |             |                    |
>     +-+-------------+--------------------+-----+
>     |             logical router               |
>     +------------------------------------------+
>
> (dgp = distributed gateway port)
>
> This way, you only need one router and one external logical switch, and
> there won't be the O(n^2) flow explosion problem for ARP resolution,
> because you have only 1 LR. The number of logical routers and switches
> becomes O(1). The number of router ports is still O(n), but it is halved.
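>
> The configuration could then be as simple as the following sketch (port
> names, chassis names, and MAC/IP values are made up, and it assumes OVN
> allows multiple gateway ports on one LR, a caveat discussed below):
>
>   ovn-nbctl lrp-add lr dgp1 00:00:00:00:01:01 10.10.10.2/24
>   ovn-nbctl lrp-set-gateway-chassis dgp1 node1
>   ovn-nbctl lrp-add lr dgp2 00:00:00:00:01:02 10.10.10.3/24
>   ovn-nbctl lrp-set-gateway-chassis dgp2 node2
>   # ... one distributed gateway port per node, all on the same LR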
>
> In reality, there are some problems with this solution that need to be
> addressed.
>
> Firstly, it would require some change in OVN, because currently OVN has a
> limitation that each LR can have only one distributed gateway port.
> However, there doesn't seem to be anything fundamental that would prevent
> us from removing that restriction to support multiple distributed gateway
> ports on a single LR. I'd like to hear from more OVN folks in case there
> is some reason we shouldn't do this.
>
> The other thing I am not so sure about is connecting the logical router to
> the external logical switch through multiple ports. This means we will
> have multiple ports of the logical router on the same subnet, which is
> something we usually don't do traditionally. However, I think this may
> work with an OVN static route using src-ip policy routing and an
> output_port specified, so that the LR knows which port (and chassis) to
> send the traffic out of, provided that there is only one nexthop, which is
> the default external GW. If multiple nexthops need to be supported, this
> won't work (and we would probably have to look at a solution that avoids
> populating the static neighbour table).
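>
> As a sketch of that src-routing idea (hypothetical values: 10.244.1.0/24
> standing in for node1's pod subnet, 10.10.10.1 for the default external
> GW, and dgp1 for node1's gateway port):
>
>   # send traffic sourced from node1's pods out via dgp1 only
>   ovn-nbctl --policy=src-ip lr-route-add lr 10.244.1.0/24 10.10.10.1 dgp1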
>

Hello Han,

I did consider distributed gateway ports. However, there are two issues
with them:

1. In order to support K8s NodePort services we need to create a
North-South LB, and an L3 gateway router is a perfect fit for that. AFAIK,
a DGP doesn't support it (see the LB sketch after this list).
2. Datapath performance would be bad with a DGP. We want a packet meant for
the host or the Internet to exit from the hypervisor on which the pod
exists, and the L3 gateway router gives us exactly that. With a DGP, and
with OVN supporting only one instance of it per LR, packets unnecessarily
get forwarded over a tunnel to the DGP chassis for SNATing and then get
forwarded back over the tunnel to the originating host, just to exit
locally.
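
To make issue 1 concrete, here is a minimal sketch of the kind of
North-South LB we attach to each node's gateway router today (the router
name, VIP, and backend addresses are illustrative):

   # NodePort-style VIP balanced across pod backends, on node1's gateway router
   ovn-nbctl lb-add lb-node1 172.16.0.10:30080 10.244.1.5:8080,10.244.2.7:8080
   ovn-nbctl lr-lb-add l3gateway-node1 lb-node1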

Also, I would like to clarify the topology of the external logical switch
and l3gateway. The current topology is like this:

Topology (A)

                       10.10.10.0/24
   ------+-----------------+--------------------+------
         |                 |                    |
      localnet          localnet             localnet
   +-----+-----+      +----+------+       +-----+-----+
   | external  |      | external  |       | external  |
   |    LS1    |      |    LS2    |       |    LS3    |
   +-----+-----+      +----+------+       +-----+-----+
         |                 |                    |
     10.10.10.2        10.10.10.3           10.10.10.4
        SNAT              SNAT                 SNAT
   +-----+-----+     +-----+-----+        +-----+-----+
   | l3gateway |     | l3gateway |        | l3gateway |
   |   node1   |     |   node2   |        |   node3   |
   +-----------+     +-----------+        +-----------+
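
For reference, the per-node pieces behind Topology (A) look roughly like
this (the router, switch, and subnet names are illustrative):

   # node1's l3gateway SNATs the node's pod subnet to its external IP
   ovn-nbctl lr-nat-add l3gateway-node1 snat 10.10.10.2 10.244.1.0/24
   # node1's external LS attaches to the physical network via a localnet port
   ovn-nbctl lsp-add external-ls1 ln-node1
   ovn-nbctl lsp-set-type ln-node1 localnet
   ovn-nbctl lsp-set-addresses ln-node1 unknown
   ovn-nbctl lsp-set-options ln-node1 network_name=physnet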


I would like to move to a topology like the one below, which is very
similar to physical networking, where all tenants' VRFs SNAT to a common L2
segment in the DC.

Topology (B)
                       10.10.10.0/24
   -------------------------+--------------------------
                            |
                         localnet
                      +-----+-----+
                      | external  |
         +------------+    LS1    +-------------+
         |            +----+------+             |
         |                 |                    |
     10.10.10.2        10.10.10.3           10.10.10.4
        SNAT              SNAT                 SNAT
   +-----+-----+     +-----+-----+        +-----------+
   | l3gateway |     | l3gateway |        | l3gateway |
   |   node1   |     |   node2   |        |   node3   |
   +-----------+     +-----------+        +-----------+

I cannot do this because of the 2M logical flows that get created once we
connect 1000 l3 gateway routers through a single logical switch.

Note: Topology (A) might still be relevant in certain DCs that don't
stretch L2 across racks and run pure L3 in the core: everything upstream
from the ToR toward the core switches is L3, and everything downstream from
the ToR to the nodes is L2. Topology (B) above is just an optimization for
end users who have a single stretched VLAN.

Regards,
~Girish