[ovs-dev] Scaling of Logical_Flows and MAC_Binding tables

Anil Vishnoi vishnoianil at gmail.com
Tue Dec 1 07:11:08 UTC 2020


On Mon, Nov 30, 2020 at 9:26 PM Han Zhou <hzhou at ovn.org> wrote:
>
>
>
> On Mon, Nov 30, 2020 at 8:22 PM Anil Vishnoi <vishnoianil at gmail.com> wrote:
> >
> > I am just wondering if letting MAC_Binding table entries expire after a
> > certain timeout would help here, just like we do for OpenFlow flows
> > (idle_timeout and hard_timeout). That could help address the scale
> > problem as well as the stale entry problem. Even if we move the
> > MAC_Binding table to the LS, I think it doesn't guarantee that this table
> > won't bloat over time, because we don't flush any of these MAC
> > entries. I believe the kernel networking ARP cache uses a similar
> > approach to maintain itself.
> >
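For context, the OpenFlow expiry referred to above looks like this; the
bridge, match, and timeout values are purely illustrative:

    # The flow quietly ages out after 300s idle or 3600s total, with no
    # central database involved.
    ovs-ofctl add-flow br-int \
        "idle_timeout=300,hard_timeout=3600,priority=100,ip,nw_dst=203.0.113.10,actions=NORMAL"
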
>
> Hi Anil,
>
> This has been discussed before. It is just hard to implement a timeout mechanism in OVSDB without significant performance penalty.
Even with a time granularity of minutes?
>
> For the scale problem, I think for most use cases the two options I mentioned are enough to solve the problem, although it now needs some more fixing since it is broken, as discussed in the other replies. If they are not sufficient for some special use cases, e.g. a large number of routers needing E-W communication in a full-mesh fashion (I wonder if this scenario is realistic), then some more optimization might be needed, such as sharing the MAC_Binding entries per LS, which reduces the problem from O(n^2) to O(n).
>
> For the stale entries, it is not a problem in most cases, because if an endpoint is gone but the entry remains in MAC_Binding, in the end there is not much difference when someone tries to send a packet to the endpoint: it is just unreachable. What matters is when an entry is updated but the update itself is lost for some reason (e.g. a control plane outage); then packets would always go to the stale MAC instead of the correct one. This can usually be mitigated by periodic GARPs from endpoints.

I am not sure endpoints send periodic GARPs by default (I know OSes/VIMs
send one at bootup) unless you run something like keepalived. Do
container runtimes support periodic GARP? I also wonder whether
processing periodic GARPs from multiple endpoints would be any cheaper
than periodic flushing (but I don't have any data points either way :)).
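
For reference, the kind of periodic GARP keepalived provides can be
approximated from an endpoint with iputils arping; the interface,
address, and interval are made up:

    # Send one gratuitous ARP for 10.0.0.5 every 60 seconds.
    while true; do arping -U -c 1 -I eth0 10.0.0.5; sleep 60; done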
>
> Thanks,
> Han
>
> > On Sun, Nov 29, 2020 at 10:08 PM Numan Siddique <numans at ovn.org> wrote:
> > >
> > > On Mon, Nov 30, 2020 at 7:37 AM Han Zhou <hzhou at ovn.org> wrote:
> > > >
> > > > On Sat, Nov 28, 2020 at 12:31 PM Tony Liu <tonyliu0592 at hotmail.com> wrote:
> > > > >
> > > > > Hi Renat,
> > > > >
> > > > > What's this "logical datapath patches that Ilya Maximets submitted"?
> > > > > Could you share some links?
> > > > >
> > > > > There were a couple of discussions about a similar issue.
> > > > > [1] raised the issue and resulted in a new option,
> > > > > always_learn_from_arp_request, being added [2].
> > > > > [3] resulted in a patch to the OVN ML2 driver [4] to set the option added by [1].
> > > > >
> > > > > It seems that it helps to optimize the Logical_Flow table.
> > > > > I am not sure if it helps with MAC_Binding as well.
> > > > >
> > > > > Is it the same issue we are trying to address here, by either
> > > > > Numan's local cache or the solution proposed by Dumitru?
> > > > >
> > > > > [1] https://mail.openvswitch.org/pipermail/ovs-discuss/2020-May/049994.html
> > > > > [2] https://github.com/ovn-org/ovn/commit/61ccc6b5fc7c49b512e26347cfa12b86f0ec2fd9#diff-05b24a3133733fb7b0f979698083b8128e8f1f18c3c2bd09002ae788d34a32f5
> > > > > [3] http://osdir.com/openstack-discuss/msg16002.html
> > > > > [4] https://review.opendev.org/c/openstack/neutron/+/752678
> > > > >
> > > > >
> > > > > Thanks!
> > > > > Tony
> > > >
> > > > Thanks Tony for pointing to the old discussion [1]. I thought setting the
> > > > option always_learn_from_arp_request to "false" on the logical routers
> > > > should have solved this scale problem in the MAC_Binding table in this scenario.
> > > >
> > > > However, it seems the commit a2b88dc513 ("pinctrl: Directly update
> > > > MAC_Bindings created by self originated GARPs.") has overridden the
> > > > option. (I haven't tested, but maybe @Dumitru Ceara <dceara at redhat.com> can
> > > > confirm.)
> > > >
> > > > Similarly, for the Logical_Flow explosion, it should have been solved by
> > > > setting the option dynamic_neigh_routers to "true".
> > > >
> > > > I think these two options are exactly for the scenario Renat is
> > > > reporting. @Renat, could you try setting these options as suggested above
> > > > using the OVN version before the commit a2b88dc513 to see if it solves your
> > > > problem?
> > > >
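For anyone following along, these two options are set per logical router
via ovn-nbctl; "lr0" below is a placeholder name:

    # Only learn bindings from ARP requests addressed to the router's
    # own IPs, and resolve neighbor routers' MACs on demand instead of
    # pre-populating flows for all of them.
    ovn-nbctl set Logical_Router lr0 options:always_learn_from_arp_request=false
    ovn-nbctl set Logical_Router lr0 options:dynamic_neigh_routers=true
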
> > >
> > > When you test it out with the suggested commit, please delete the
> > > mac_binding entries manually, as neither ovn-northd nor ovn-controller
> > > deletes any entries from the mac_binding table.
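
A blunt way to do that flush before re-testing, assuming direct access to
the southbound DB (this simply destroys every learned row):

    # Remove all learned MAC bindings from the SB DB.
    for uuid in $(ovn-sbctl --bare --columns=_uuid list MAC_Binding); do
        ovn-sbctl destroy MAC_Binding "$uuid"
    done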
> > >
> > > > Regarding the proposals in this thread:
> > > > - Move MAC_Binding to LS (by Dumitru)
> > > >     This sounds good to me, though I am not sure about all the implications
> > > > yet; I wonder why it was associated with the LRP in the first place.
> > > >
> > > > - Remove MAC_Binding from SB (by Numan)
> > > >     I am a little concerned about this. The MAC_Binding in SB is required
> > > > for distributed LR to work for dynamic ARP resolving. Consider a general
> > > > use case: A - LS1 - LR1 - LS2 - B. A is on HV1 and B is on HV2. Now A sends
> > > > a packet to B's IP. Assume B's MAC is unknown to OVN. The packet is routed
> > > > by LR1, and on the LRP facing LS2 an ARP is sent out over the LS2 logical
> > > > network. The above steps happen on HV1. Now the ARP request reaches HV2 and
> > > > is received by B, so B sends an ARP response. With the current
> > > > implementation, HV2's OVS flow would learn the MAC-IP binding from the ARP
> > > > response and update the SB DB, and HV1 will get the SB update and install the
> > > > MAC binding flow as a result of ARP resolving. The next time A sends a
> > > > packet to B, HV1 will directly resolve the ARP from the MAC binding
> > > > flows locally and send the IP packet to HV2. The SB DB MAC_Binding table
> > > > works as a distributed ARP/Neighbor cache. It is a mechanism to sync the
> > > > ARP cache from the place where it is learned to the place where it is
> > > > initiated, and all HVs benefit from this without the need to send ARP
> > > > themselves for the same LRP. In other words, the LRP is distributed, so the
> > > > ARP resolving is done in a distributed fashion. Without this, each HV would
> > > > initiate an ARP request on behalf of the same LRP, which would largely
> > > > increase the ARP traffic unnecessarily - even more than the traditional
> > > > network (where one physical router only needs to do one ARP resolving for
> > > > each neighbor and maintain one copy of ARP cache). And I am not sure if
> > > > there are other side effects when an endpoint sees unexpectedly frequent
> > > > ARP requests from the same LRP - would there be any rate limit that even
> > > > discards repeated ARP requests from the same source? Numan, maybe you have
> > > > already considered these. Would you share your thoughts?
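
As a side note, the binding learned in the walk-through above can be
observed directly in the SB DB; the router port name is hypothetical:

    # The row HV1 consumes to program its local MAC binding flow.
    ovn-sbctl --columns=ip,mac find MAC_Binding logical_port=lr1-ls2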
> > >
> > > Thanks for the comments and highlighting this use case which I missed
> > > completely.
> > >
> > > I was thinking more along the lines of the N-S use case with a
> > > distributed gateway router port, and I completely missed the E-W
> > > scenario with an unknown address. If we don't consider the unknown
> > > address scenario, I think moving away from the MAC_Binding SB DB table
> > > would be beneficial in the long run, for a few reasons:
> > >    1. For better scale.
> > >    2. To address the stale mac_binding entries (which presently the CMS
> > > has to handle).
> > >
> > > For the N-S traffic scenario, the ovn-controller claiming the gateway
> > > router port will take care of generating the ARPs.
> > > For the floating IP DVR scenario, each compute node will have to generate
> > > the ARP request to learn a remote.
> > > I think this should be fine as it is just a one-time thing.
> > >
> > > Regarding the unknown address scenario, right now ovn-controller
> > > floods the packet to all the unknown logical ports
> > > of a switch if OVN doesn't know the MAC. All these unknown logical
> > > ports belong to a multicast group.
> > >
> > > I think we should solve this case. In the case of OpenStack, when port
> > > security is disabled for a neutron port, the logical
> > > port will have an unknown address configured. There are a few related
> > > bugzilla/launchpad bugs [1].
> > >
> > > I think we should fix this behavior in OVN: OVN should do the MAC
> > > learning on the switch for the unknown ports. If we do that,
> > > I think the scenario you mentioned will be addressed.
> > >
> > > Maybe we can extend Dumitru's suggestion and have just one approach
> > > which does the MAC learning on the switch (keeping
> > > the SB MAC_Binding table):
> > >     -  for unknown logical ports
> > >     -  for unknown MACs for the N-S routing.
> > >
> > > Any thoughts?
> > >
> > > FYI - I have a PoC/RFC patch in progress which adds the mac binding
> > > cache support -
> > > https://github.com/numansiddique/ovn/commit/22082d04ca789155ea2edd3c1706bde509ae44da
> > >
> > > [1] - https://review.opendev.org/c/openstack/neutron/+/763567/
> > >       https://bugzilla.redhat.com/show_bug.cgi?id=1888441
> > >       https://bugs.launchpad.net/neutron/+bug/1904412
> > >       https://bugzilla.redhat.com/show_bug.cgi?id=1672625
> > >
> > > Thanks
> > > Numan
> > >
> > > >
> > > > Thanks,
> > > > Han
> > > >
> > > > > > -----Original Message-----
> > > > > > From: dev <ovs-dev-bounces at openvswitch.org> On Behalf Of Numan Siddique
> > > > > > Sent: Thursday, November 26, 2020 11:36 AM
> > > > > > To: Daniel Alvarez Sanchez <dalvarez at redhat.com>
> > > > > > Cc: ovs-dev <ovs-dev at openvswitch.org>
> > > > > > Subject: Re: [ovs-dev] Scaling of Logical_Flows and MAC_Binding tables
> > > > > >
> > > > > > On Thu, Nov 26, 2020 at 4:32 PM Numan Siddique <numans at ovn.org> wrote:
> > > > > > >
> > > > > > > On Thu, Nov 26, 2020 at 4:11 PM Daniel Alvarez Sanchez
> > > > > > > <dalvarez at redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Nov 25, 2020 at 7:59 PM Dumitru Ceara <dceara at redhat.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > On 11/25/20 7:06 PM, Numan Siddique wrote:
> > > > > > > > > > On Wed, Nov 25, 2020 at 10:24 PM Renat Nurgaliyev
> > > > > > > > > > <impleman at gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> On 25.11.20 16:14, Dumitru Ceara wrote:
> > > > > > > > > >>> On 11/25/20 3:30 PM, Renat Nurgaliyev wrote:
> > > > > > > > > >>>> Hello folks,
> > > > > > > > > >>>>
> > > > > > > > > >>> Hi Renat,
> > > > > > > > > >>>
> > > > > > > > > >>>> we run a lab where we try to evaluate the scalability
> > > > > > > > > >>>> potential of OVN with OpenStack as the CMS.
> > > > > > > > > >>>> The current lab setup is the following:
> > > > > > > > > >>>>
> > > > > > > > > >>>> 500 networks
> > > > > > > > > >>>> 500 routers
> > > > > > > > > >>>> 1500 VM ports (3 per network/router)
> > > > > > > > > >>>> 1500 Floating IPs (one per VM port)
> > > > > > > > > >>>>
> > > > > > > > > >>>> There is an external network, which is bridged to br-provider
> > > > > > > > > >>>> on gateway nodes. There are 2000 ports
> > > > > > > > > >>>> connected to this external network (1500 Floating IPs + 500
> > > > > > > > > >>>> SNAT router ports). So the setup is not
> > > > > > > > > >>>> very big we'd say, but after applying this configuration via the
> > > > > > > > > >>>> ML2/OVN plugin, northd kicks in and does its job, and when
> > > > > > > > > >>>> it's done, the Logical_Flow table gets 645877 entries, which is
> > > > > > > > > >>>> way too much. But ok, we move on and start one controller on
> > > > > > > > > >>>> the gateway chassis, and here things get really messy. The
> > > > > > > > > >>>> MAC_Binding table grows from 0 to 999088 entries in one
> > > > > > > > > >>>> moment, and when it's done, the sizes of the biggest SB tables
> > > > > > > > > >>>> look like this:
> > > > > > > > > >>>>
> > > > > > > > > >>>> 999088 MAC_Binding
> > > > > > > > > >>>> 645877 Logical_Flow
> > > > > > > > > >>>> 4726 Port_Binding
> > > > > > > > > >>>> 1117 Multicast_Group
> > > > > > > > > >>>> 1068 Datapath_Binding
> > > > > > > > > >>>> 1046 Port_Group
> > > > > > > > > >>>> 551 IP_Multicast
> > > > > > > > > >>>> 519 DNS
> > > > > > > > > >>>> 517 HA_Chassis_Group
> > > > > > > > > >>>> 517 HA_Chassis
> > > > > > > > > >>>> ...
> > > > > > > > > >>>>
> > > > > > > > > >>>> The MAC_Binding table gets huge: basically it now has an entry
> > > > > > > > > >>>> for every port that is connected to the external network times
> > > > > > > > > >>>> the number of datapaths (2000 ports x ~500 datapaths), which
> > > > > > > > > >>>> makes roughly one million entries. This table by itself
> > > > > > > > > >>>> increases the size of the SB by 200 megabytes. The Logical_Flow
> > > > > > > > > >>>> table also gets very heavy; we have already played a bit with
> > > > > > > > > >>>> the logical datapath patches that Ilya Maximets submitted, and
> > > > > > > > > >>>> it looks much better, but the size of
> > > > > > > > > >>>> the MAC_Binding table still feels inadequate.
> > > > > > > > > >>>>
> > > > > > > > > >>>> We would like to start working at least on MAC_Binding table
> > > > > > > > > >>>> optimisation, but it is a bit difficult to start from
> > > > > > > > > >>>> scratch. Can someone help us with ideas on how this could be
> > > > > > > > > >>>> optimised?
> > > > > > > > > >>>>
> > > > > > > > > >>>> Maybe it would also make sense to group entries in the
> > > > > > > > > >>>> MAC_Binding table in the same way as proposed for logical
> > > > > > > > > >>>> flows in Ilya's patch?
> > > > > > > > > >>>>
> > > > > > > > > >>> Maybe it would work but I'm not really sure how, right now.
> > > > > > > > > >>> However, what if we change the way MAC_Bindings are created?
> > > > > > > > > >>>
> > > > > > > > > >>> Right now a MAC Binding is created for each logical router
> > > > > > > > > >>> port but in your case there are a lot of logical router ports
> > > > > > > > > >>> connected to the single provider logical switch and they all
> > > > > > > > > >>> learn the same ARPs.
> > > > > > > > > >>>
> > > > > > > > > >>> What if we instead store MAC_Bindings per logical switch?
> > > > > > > > > >>> Basically sharing all these MAC_Bindings between all router
> > > > > > > > > >>> ports connected to the same LS.
> > > > > > > > > >>>
> > > > > > > > > >>> Do you see any problem with this approach?
> > > > > > > > > >>>
> > > > > > > > > >>> Thanks,
> > > > > > > > > >>> Dumitru
> > > > > > > > > >>>
> > > > > > > > > >>>
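To put a number on the duplication being discussed here, the per-LRP
copies of a single neighbor's binding can be counted with something like
this (the IP is a placeholder):

    # How many router ports learned the same neighbor today; under the
    # per-LS proposal these would collapse to a single row.
    ovn-sbctl --bare --columns=logical_port find MAC_Binding ip=203.0.113.10 | grep -c .
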
> > > > > > > > > >> I believe that this approach is the way to go; at least nothing
> > > > > > > > > >> comes to my mind that could go wrong here. We will try to make a
> > > > > > > > > >> patch for that. However, if someone who is familiar with the code
> > > > > > > > > >> knows how to do it fast, that would also be very nice.
> > > > > > > > > >
> > > > > > > > > > This approach should work.
> > > > > > > > > >
> > > > > > > > > > I've another idea (I won't call it a solution yet). What if we
> > > > > > > > > > drop the usage of MAC_Binding altogether?
> > > > > > > > >
> > > > > > > > > This would be great!
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > - When ovn-controller learns a MAC binding, it will not create a
> > > > > > > > > > row in the SB MAC_Binding table.
> > > > > > > > > > - Instead it will maintain the learnt MAC binding in its memory.
> > > > > > > > > > - ovn-controller will still program table 66 with the flow
> > > > > > > > > > to set the eth.dst (for the get_arp() action).
> > > > > > > > > >
> > > > > > > > > > This has a couple of advantages:
> > > > > > > > > >   - Right now we never flush the old/stale mac_binding entries.
> > > > > > > > > >   - Suppose the MAC of an external IP has changed, but OVN
> > > > > > > > > > has an entry for that IP with the old MAC in the mac_binding table:
> > > > > > > > > >     we will use the old MAC, causing the packet to be sent out
> > > > > > > > > > to the wrong destination and possibly lost.
> > > > > > > > > >   - So we will get rid of this problem.
> > > > > > > > > >   - We will also save SB DB space.
> > > > > > > > > >
> > > > > > > > > > There are a few disadvantages:
> > > > > > > > > >   - Other ovn-controllers will not add the flows in table 66. I
> > > > > > > > > > guess this should be fine as each ovn-controller can generate
> > > > > > > > > > the ARP request and learn the MAC.
> > > > > > > > > >   - When ovn-controller restarts we lose the learnt MACs and
> > > > > > > > > > would need to learn them again.
> > > > > > > > > >
> > > > > > > > > > Any thoughts on this?
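
For anyone poking at this, the MAC binding flows ovn-controller programs
locally can already be inspected on a chassis (the exact flow format
varies by version):

    # Dump the MAC binding lookup table mentioned above.
    ovs-ofctl dump-flows br-int table=66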
> > > > > > > > >
> > > > > > > >
> > > > > > > > It'd be great to have some sort of local ARP cache but I'm concerned
> > > > > > > > about the performance implications.
> > > > > > > >
> > > > > > > > - How are you going to determine when an entry is stale?
> > > > > > > > If you slow-path the packets to reset the timeout every time a packet
> > > > > > > > with the source MAC is received, it doesn't look good. Maybe you have
> > > > > > > > something else in mind.
> > > > > > >
> > > > > > > Right now we never expire any mac_binding entry. If I understand you
> > > > > > > correctly, your concern is about the scenario where a floating IP is
> > > > > > > updated with a different MAC: how does the local cache get updated?
> > > > > > >
> > > > > > > Right now networking-ovn (in the case of OpenStack) updates the
> > > > > > > mac_binding entry in the south DB for such cases, right?
> > > > > > >
> > > > > >
> > > > > > FYI - I have started working on this approach as a PoC, i.e. using a
> > > > > > local mac_binding cache instead of the SB mac_binding table.
> > > > > >
> > > > > > I will update this thread about the progress.
> > > > > >
> > > > > > Thanks
> > > > > > Numan
> > > > > >
> > > > > > > Thanks
> > > > > > > Numan
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > There's another scenario that we need to take care of that doesn't
> > > > > > > > > seem obvious to address without MAC_Bindings.
> > > > > > > > >
> > > > > > > > > GARPs were being injected into the L2 broadcast domain of an LS for
> > > > > > > > > NAT addresses in case FIPs are reused by the CMS, introduced by:
> > > > > > > > >
> > > > > > > > > https://github.com/ovn-org/ovn/commit/069a32cbf443c937feff44078e8828d7a2702da8
> > > > > > > >
> > > > > > > >
> > > > > > > > Dumitru and I have been discussing the possibility of reverting this
> > > > > > > > patch and relying on CMSs to maintain the MAC_Binding entries
> > > > > > > > associated with the FIPs [0].
> > > > > > > > I'm against reverting this patch in OVN [1] for multiple reasons, the
> > > > > > > > most important one being that if we rely on workarounds on the CMS
> > > > > > > > side, we'll be creating a control plane dependency for something that
> > > > > > > > is purely dataplane (i.e. if the Neutron server is down - outage,
> > > > > > > > upgrades, etc. - traffic is going to be disrupted). On the other hand,
> > > > > > > > one could argue that the same dependency now exists on ovn-controller
> > > > > > > > being up & running, but I believe that this is better than a) relying
> > > > > > > > on workarounds in CMSs, or b) relying on CMS availability.
> > > > > > > >
> > > > > > > > In the short term I think that moving the MAC_Binding entries to the
> > > > > > > > LS instead of the LRP, as was suggested up-thread, would be a good
> > > > > > > > idea, and in the long haul the *local* ARP cache seems to be the
> > > > > > > > right solution. Brainstorming with Dumitru, he suggested inspecting
> > > > > > > > the flows regularly to see if the packet count on flows that check
> > > > > > > > whether src_mac == X has not increased in a while, and then removing
> > > > > > > > the ARP responder flows locally.
> > > > > > > >
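A very rough sketch of that counter-based aging, entirely illustrative
(table 66 stands in for whichever flows carry the src_mac check, and the
interval is arbitrary):

    # Snapshot the flows twice with the time-varying fields stripped;
    # lines common to both snapshots are flows whose n_packets did not
    # change over the interval, i.e. candidates for removal.
    snap() {
        ovs-ofctl dump-flows br-int table=66 \
            | sed -e 's/duration=[^,]*, //' -e 's/idle_age=[^,]*, //'
    }
    snap | sort > /tmp/mb.before
    sleep 300
    snap | sort > /tmp/mb.after
    comm -12 /tmp/mb.before /tmp/mb.after
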
> > > > > > > > [0] https://github.com/openstack/networking-ovn/commit/5181f1106ff839d08152623c25c9a5f6797aa2d7
> > > > > > > >
> > > > > > > > [1] https://github.com/ovn-org/ovn/commit/069a32cbf443c937feff44078e8828d7a2702da8
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Recently, due to the dataplane scaling issue (the 4K resubmit limit
> > > > > > > > > being hit), we don't flood these packets on non-router ports and
> > > > > > > > > instead create the MAC bindings directly from ovn-controller:
> > > > > > > > >
> > > > > > > > > https://github.com/ovn-org/ovn/commit/a2b88dc5136507e727e4bcdc4bf6fde559f519a9
> > > > > > > > >
> > > > > > > > > Without the MAC_Binding table we'd need to find a way to update or
> > > > > > > > > flush stale bindings when an IP is used for a VIF or FIP.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Dumitru
> > > > > > > > >
> >
> >
> >
> > --
> > Thanks
> > Anil



-- 
Thanks
Anil

