[ovs-discuss] [OVN] logical flow explosion in lr_in_ip_input table for dnat_and_snat IPs

Girish Moodalbail gmoodalbail at gmail.com
Thu Jun 25 19:34:47 UTC 2020


Hello Dumitru, Han,

We applied this patchset and gave it a spin on our large-scale cluster, and
saw a significant reduction in the number of logical flows in the
lr_in_ip_input table: before the patch there were around 1.6M flows in
lr_in_ip_input; after the patch we see about 26K flows.

In lr_in_ip_input, I see

   - priority 92 flows matching ARP requests for dnat_and_snat IPs on the
   distributed gateway port with is_chassis_resident(), with the
   corresponding ARP reply action
   - priority 91 flows matching ARP requests for dnat_and_snat IPs on the
   distributed gateway port with !is_chassis_resident(), with a drop action
   - priority 90 flows matching ARP requests for dnat_and_snat IPs on any
   router port, with the corresponding ARP reply action

So far so good.
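The three priority tiers can be sketched as a tiny resolution function
(illustrative Python, not OVN code; the function and argument names are made
up for the sketch):

```python
# Sketch of how the lr_in_ip_input priority tiers resolve an ARP request
# targeting a dnat_and_snat IP. Higher-priority flows are checked first.
def resolve_arp_flow(inport_is_gw_port: bool, chassis_resident: bool) -> str:
    """Return the action of the highest-priority matching flow."""
    if inport_is_gw_port and chassis_resident:
        return "arp_reply"   # priority 92: resident gateway port replies
    if inport_is_gw_port and not chassis_resident:
        return "drop"        # priority 91: non-resident gateway port drops
    return "arp_reply"       # priority 90: any other router port replies

print(resolve_arp_flow(True, True))    # -> arp_reply
print(resolve_arp_flow(True, False))   # -> drop
print(resolve_arp_flow(False, False))  # -> arp_reply
```

So on the chassis where the gateway port is resident the router replies,
while the same request arriving on a non-resident chassis is dropped.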

However, while not directly related to this patch per se, but rather to the
behaviour of ARP with dnat_and_snat IPs, we are seeing a significant number
of OpenFlow flows in table 27 on the OVN chassis (around 2.3M OpenFlow
flows). This table is populated from logical flows in table=19
(ls_in_l2_lkup) of the logical switch.

The two logical flows in ls_in_l2_lkup that contribute this huge number
of OpenFlow flows are shown below (for the entire logical flow entries,
please see:
https://gist.github.com/girishmg/57b3005030d421c59b30e6c36cfc9c18)

Priority=75 flow
=============
This flow looks like the following (where 169.254.0.0/29 is the
dnat_and_snat subnet and 192.168.0.1 is the logical switch's gateway IP):

table=19(ls_in_l2_lkup      ), priority=75   , match=(flags[1] == 0 &&
arp.op == 1 && arp.tpa == { 169.254.3.107, 169.254.1.85, 192.168.0.1,
169.254.10.155, 169.254.1.6}), action=(outport = "stor-sdn-test1"; output;)

What this flow says is: send any ARP request from the switch for the default
gateway IP, or for any of those 1-to-1 NAT IPs, out through the port towards
the ovn_cluster_router's ingress pipeline. The question, though, is why any
Pod on the logical switch would send an ARP request for an IP that is not in
its subnet. A packet from a Pod towards a non-subnet IP should trigger an
ARP only for the default gateway IP.
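This point about ordinary host ARP behaviour can be illustrated with a short
sketch (illustrative Python, not OVN code; the Pod subnet 192.168.0.0/24 and
the on-link address 192.168.0.23 are assumed for the example):

```python
import ipaddress

# A host only ARPs directly for destinations inside its own subnet;
# off-link traffic is sent via the default gateway, so the host ARPs
# for the gateway IP instead of the final destination.
def arp_target(src_subnet: str, gateway: str, dst: str) -> str:
    net = ipaddress.ip_network(src_subnet, strict=False)
    if ipaddress.ip_address(dst) in net:
        return dst       # on-link: ARP for the destination itself
    return gateway       # off-link: ARP only for the gateway

# A Pod on 192.168.0.0/24 reaching a dnat_and_snat IP (169.254.3.107,
# from the flow above) should ARP for the gateway, not the NAT IP:
print(arp_target("192.168.0.0/24", "192.168.0.1", "169.254.3.107"))  # -> 192.168.0.1
print(arp_target("192.168.0.0/24", "192.168.0.1", "192.168.0.23"))   # -> 192.168.0.23
```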

Priority=80 Flow
=============
This flow looks like the following:

table=19(ls_in_l2_lkup      ), priority=80   , match=(eth.src == {
0a:58:c0:a8:00:01, 6a:93:f4:55:aa:a7, ae:92:2d:33:24:ea, ba:0a:d3:7d:bc:e8,
b2:2f:40:4d:d9:2b} && (arp.op == 1 || nd_ns)), action=(outport =
"_MC_flood"; output;)

The question again for this flow is why there would be self-originated ARP
requests for the dnat_and_snat IPs from inside the node's logical switch. I
can see how this is a possibility on the switch that has a `localnet` port
on it and to which the distributed router connects through a gateway port.

Regards,
~Girish

On Wed, Jun 24, 2020 at 8:55 AM Dumitru Ceara <dceara at redhat.com> wrote:

> Hi Girish,
>
> I sent a patch series to implement Han's suggestion:
> https://patchwork.ozlabs.org/project/openvswitch/list/?series=185580
> https://mail.openvswitch.org/pipermail/ovs-dev/2020-June/372005.html
>
> It would be great if you could give it a run on your setup too.
>
> Thanks,
> Dumitru
>
> On 6/16/20 5:18 PM, Girish Moodalbail wrote:
> > Thanks Han for the update.
> >
> > Regards,
> > ~Girish
> >
> > On Mon, Jun 15, 2020 at 12:55 PM Han Zhou <zhouhan at gmail.com
> > <mailto:zhouhan at gmail.com>> wrote:
> >
> >     Sorry Girish, I can't promise for now. I will see if I have time in
> >     the next couple of weeks, but welcome anyone to volunteer on this if
> >     it is urgent.
> >
> >     On Mon, Jun 15, 2020 at 10:56 AM Girish Moodalbail
> >     <gmoodalbail at gmail.com <mailto:gmoodalbail at gmail.com>> wrote:
> >
> >         Hello Han,
> >
> >         On Wed, Jun 3, 2020 at 9:39 PM Han Zhou <zhouhan at gmail.com
> >         <mailto:zhouhan at gmail.com>> wrote:
> >
> >
> >
> >             On Wed, Jun 3, 2020 at 7:16 PM Girish Moodalbail
> >             <gmoodalbail at gmail.com <mailto:gmoodalbail at gmail.com>>
> wrote:
> >
> >                 Hello all,
> >
> >                 While working on an extension, see the diagram below, to
> >                 the existing OVN logical topology for the ovn-kubernetes
> >                 project, I am seeing an explosion of the "Reply to ARP
> >                 requests" logical flows in the `lr_in_ip_input` table
> >                 for the distributed router (ovn_cluster_router)
> >                 configured with gateway port (rtol-LS)
> >
> >                                         internet
> >                                ---------+-------------->
> >                                         |
> >                                         |
>
> >                       +----------localnet-port---------+
> >                       |LS                              |
> >                       +-----------------ltor-LS--------+
> >                                            |
> >                                            |
> >                  +---------------------rtol-LS------------+
> >                  |           ovn_cluster_router           |
> >                  |          (Distributed Router)          |
> >                  +-rtos-ls0------rtos-ls1--------rtos-ls2-+
> >                       |              |              |
> >                       |              |              |
> >                 +-----+-+       +----+--+     +-----+-+
> >                 |  LS0  |       |  LS1  |     |  LS2  |
> >                 +-+-----+       +-+-----+     +-+-----+
> >                   |               |             |
> >                   p0              p1            p2
> >                  IA0             IA1           IA2
> >                  EA0             EA1           EA2
> >                 (Node0)          (Node1)       (Node2)
> >
> >                 In the topology above, each of the three logical switch
> >                 port has an internal address of IAx and an external
> >                 address of EAx (dnat_and_snat IP). They are all bound to
> >                 their respective nodes (Nodex). A packet from `p0`
> >                 heading towards the internet will be SNAT'ed to EA0 on
> >                 the local hypervisor and then sent out through the LS's
> >                 localnet-port on that hypervisor. Basically, they are
> >                 configured for distributed NATing.
> >
> >                 I am seeing interesting "Reply to ARP requests" flows
> >                 for arp.tpa set to "EAX". Flows are like this:
> >
> >                 For EA0
> >                 priority=90, match=(inport == "rtos-ls0" && arp.tpa ==
> >                 EA0 && arp.op == 1), action=(/* ARP reply */)
> >                 priority=90, match=(inport == "rtos-ls1" && arp.tpa ==
> >                 EA0 && arp.op == 1), action=(/* ARP reply */)
> >                 priority=90, match=(inport == "rtos-ls2" && arp.tpa ==
> >                 EA0 && arp.op == 1), action=(/* ARP reply */)
> >
> >                 For EA1
> >                 priority=90, match=(inport == "rtos-ls0" && arp.tpa ==
> >                 EA1 && arp.op == 1), action=(/* ARP reply */)
> >                 priority=90, match=(inport == "rtos-ls1" && arp.tpa ==
> >                 EA1 && arp.op == 1), action=(/* ARP reply */)
> >                 priority=90, match=(inport == "rtos-ls2" && arp.tpa ==
> >                 EA1 && arp.op == 1), action=(/* ARP reply */)
> >
> >                 Similarly, for EA2.
> >
> >                 So, we have N * N "Reply to ARP requests" flows for N
> >                 nodes each with 1 dnat_and_snat ip.
> >                 This is causing scale issues.
> >
> >                 If you look at the flows for `EA0`, I am confused as to
> >                 why they are needed:
> >
> >                  1. When would one ever see an ARP request for EA0 from
> >                     any of LS{0,1,2}'s logical switch ports?
> >                  2. If it is needed at all, can't we just remove the
> >                     `inport` match altogether, since the flow is
> >                     configured for every logical router port
> >                     except the distributed gateway port rtol-LS? For
> >                     that port, we could add a higher-priority rule with
> >                     the action set to `next`.
> >                  3. Say we don't need east-west NAT connectivity. Is
> >                     there a way to make these ARPs be learned
> >                     dynamically, as we do for the join and external
> >                     logical switches (the other thread [1])?
> >
> >                 Regards,
> >                 ~Girish
> >
> >                 [1]
> https://mail.openvswitch.org/pipermail/ovs-discuss/2020-May/049994.html
> >
> >
> >             In general, these flows should be per router instead of per
> >             router port, since the nat addresses are not attached to any
> >             router port. For distributed gateway ports, there will need
> >             per-port flows to match
> >             is_chassis_resident(gateway-chassis). I think this can be
> >             handled by:
> >             - priority X + 20 flows for each distributed gateway port
> >             with is_chassis_resident(), reply ARP
> >             - priority X + 10 flows for each distributed gateway port
> >             without is_chassis_resident(), drop
> >             - priority X flows for each router (no need to match
> >             inport), reply ARP
> >
> >             This way, there are N * (2D + 1) flows per router. N =
> >             number of NAT IPs, D = number of distributed gateway ports.
> >             This would optimize the above scenario where there is only 1
> >             distributed gateway port but many regular router ports.
> >             Thoughts?
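Han's N * (2D + 1) estimate above can be sanity-checked with quick
arithmetic (the node, port, and NAT-IP counts below are hypothetical,
chosen only to show the scaling):

```python
# "Reply to ARP requests" logical flow counts, per router.
# Old scheme: one flow per (NAT IP, router port) pair -> N * P.
# Han's scheme: resident + non-resident flows per gateway port,
# plus one per-router flow with no inport match -> N * (2D + 1).
def old_flows(n_nat_ips: int, n_router_ports: int) -> int:
    return n_nat_ips * n_router_ports

def new_flows(n_nat_ips: int, n_gw_ports: int) -> int:
    return n_nat_ips * (2 * n_gw_ports + 1)

# e.g. 1000 nodes, each with 1 dnat_and_snat IP and 1 switch port on
# the router, plus 1 distributed gateway port:
print(old_flows(1000, 1001))  # -> 1001000
print(new_flows(1000, 1))     # -> 3000
```

With one gateway port and many regular router ports, the per-router scheme
grows linearly in the number of NAT IPs instead of quadratically.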
> >
> >
> >         We went ahead and added support for this topology in
> >         ovn-kubernetes project in this commit
> >
> https://github.com/ovn-org/ovn-kubernetes/commit/edb24e6a71142f2e835b67b29c11e1688c645683
>
> >
> >         Han, I was curious to know whether the above fix is on your
> >         radar? Thanks.
> >
> >         The number of OpenFlow flows in each of the hypervisors is
> >         insanely high and is consuming a lot of memory.
> >
> >         Regards,
> >         ~Girish
> >
> >
> >
> >
> >
> >
> >             Thanks,
> >             Han
> >
> >
>
>
