[ovs-dev] [PATCH ovn] northd: Don't merge ARP flood flows for all (un)reachable IPs.

Numan Siddique numans at ovn.org
Thu Jul 29 16:53:17 UTC 2021


On Thu, Jul 29, 2021 at 12:31 PM Dumitru Ceara <dceara at redhat.com> wrote:
>
> On 7/29/21 5:42 PM, Mark Gray wrote:
> > On 29/07/2021 14:20, Dumitru Ceara wrote:
> >> On 7/29/21 11:59 AM, Dumitru Ceara wrote:
> >>> Logical_Flows are immutable, that is, when the match or action changes,
> >>> ovn-northd will actually remove the old logical flow and add a new one.
> >>>
> >>> Commit ecc941c70f16 added flows to flood ARPs to routers for unreachable
> >>> addresses.
> >>>
> >>> In the following topology (2 switches connected to a router with a load
> >>> balancer attached to it):
> >>>
> >>> S1 --- (IP1/MASK1) R (IP2/MASK2) --- S2
> >>>
> >>> LB1 (applied on R) with N VIPs.
> >>>
> >>> The following logical flows were generated:
> >>>
> >>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP1 }" then ...
> >>> S2 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP2 }" then ...
> >>>
> >>> While this seemed a good idea initially because it would reduce the
> >>> total number of logical flows, this approach has a few major downsides
> >>> that severely affect scalability:
> >>>
> >>> 1. When logical datapath grouping is enabled (which is how high scale
> >>>    deployments should use OVN) there is no chance that the flows for
> >>>    reachable/unreachable router owned IPs are grouped together for all
> >>>    datapaths that correspond to the logical switches they are generated
> >>>    for.
> >>> 2. Whenever a VIP (VIPn+1) is added to LB1 all these flows will be
> >>>    removed and new ones will be added with the only difference being the
> >>>    VIPn+1 added to the flow match.  As this is a new logical flow record,
> >>>    ovn-controller will have to remove all previously installed openflows
> >>>    (for the deleted logical flow) and add new ones.  However, for most
> >>>    of these openflows, the only change is the cookie value (the first
> >>>    32 bits of the logical flow UUID; see the sketch after this list).
> >>>    At scale this unnecessary openflow
> >>>    churn will induce significant latency.  Moreover, this also creates
> >>>    unnecessary jsonrpc traffic from ovn-northd towards OVN_Southbound
> >>>    because the old flow needs to be removed and a new one (with a large
> >>>    match field) needs to be installed.
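> >>>
> >>> For illustration, a minimal sketch of how that cookie relates to the
> >>> Logical_Flow UUID (assuming the OVS "struct uuid" type; the helper
> >>> name below is made up for this example):
> >>>
> >>>   #include <stdint.h>
> >>>   #include "openvswitch/uuid.h"
> >>>
> >>>   /* The OpenFlow cookie is taken from the first 32 bits of the
> >>>    * Logical_Flow UUID, so re-creating the lflow record changes the
> >>>    * cookie of otherwise identical openflows. */
> >>>   static inline uint64_t
> >>>   lflow_cookie(const struct uuid *lflow_uuid)
> >>>   {
> >>>       return lflow_uuid->parts[0];
> >>>   }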
> >>>
> >>> To avoid this, we now install individual flows, one per IP, for
> >>> managing ARP/ND traffic from logical switches towards router owned
> >>> IPs that are reachable/unreachable on the logical port connecting
> >>> the switch.
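> >>>
> >>> For example, the merged matches above become one flow per address
> >>> (sketch, same notation as above):
> >>>
> >>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == VIP1" then ...
> >>>   ...
> >>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == VIPn" then ...
> >>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == IP1" then ...
> >>>
> >>> Adding VIPn+1 then only adds new logical flows instead of replacing
> >>> the existing, large one.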
> >>>
> >>> Fixes: ecc941c70f16 ("northd: Flood ARPs to routers for "unreachable" addresses.")
> >>> Signed-off-by: Dumitru Ceara <dceara at redhat.com>
> >>> ---
> >>> On a deployment based on an ovn-kubernetes cluster with 20 nodes, 2500
> >>> load balancer VIPs, 5000 pods, we:
> >>> - measure number of logical flows in the SB DB, size of the SB DB
> >>> - then create an additional 250 pods (logical ports), bind them, and add
> >>>   a load balancer VIP (with a single backend) for each of the additional
> >>>   pods, and measure how long it takes for all openflows corresponding to
> >>>   each of the pods to be installed in OVS.
> >>>
> >>> Results:
> >>> - without fix:
> >>>   ------------
> >>>   SB lflows:             67976
> >>>   SB size (uncompacted): 46 MiB
> >>>   SB size (compacted):   40 MiB
> >>>
> >>>   after 250 ports + VIPs added:
> >>>   SB lflows:             71479
> >>>   SB size (uncompacted): 248 MiB
> >>>   SB size (compacted):   42  MiB
> >>>   Max time to install all relevant openflows for a single pod: ~6sec
> >>>
> >>> - with fix:
> >>>   ---------
> >>>   SB lflows:             70399
> >>>   SB size (uncompacted): 46 MiB
> >>>   SB size (compacted):   40 MiB
> >>>
> >>>   after 250 ports + VIPs added:
> >>>   SB lflows:             74149
> >>>   SB size (uncompacted): 165 MiB
> >>>   SB size (compacted):   42 MiB
> >>>   Max time to install all relevant openflows for a single pod: ~100msec
> >>
> >>
> >
> > I think you should put these^ (or a summary of these - including maybe
> > the northd degradation) into the commit message.
> >
>
> Sure, I'll do that in the next revision.
>
> >> CC: Mark Michelson (original author of ecc941c70f16 ("northd: Flood ARPs
> >> to routers for "unreachable" addresses."))
> >>
> >> There's another note I should mention here:
> >>
> >> Since we now call build_lswitch_rport_arp_req_flow_for_unreachable_ip()
> >> for every load balancer VIP (instead of once for a list of VIPs), the
> >> load on ovn-northd increases as it has to create more logical flows that
> >> will eventually be grouped together in the same datapath group.
> >>
> >> When testing at higher scales (120 node setups with topologies similar
> >> to the one above, but with 45K VIPs) I see ovn-northd poll loop intervals
> >> taking ~3s longer on my machine (from ~10s to ~13s).  I do, however,
> >
> > I'm kind of surprised at that degradation because your change doesn't
> > introduce any additional looping. This is purely due to the cost of
> > ovn_lflow_add_with_hint()?
>
> Well, yes, and building of the match.  I commented it out and the hit is
> gone.
>
> >
> >> still think that not merging the flow matches is a good thing because the
> >> merged flows have a very substantial impact on all ovn-controllers in the
> >> cluster, already visible in our tests with smaller, 20-node clusters.
> >>
> >> A point here is that I think it's fine if
> >> build_lswitch_rport_arp_req_flow_for_unreachable_ip() installs flows at
> >> a priority lower than the one used for reachable IPs, e.g., 79.  That
> >> would allow us to avoid the !lrouter_port_ipv*_reachable() condition
> >> when installing these flows.
> >
> > I don't get this point?
>
> In my opinion we don't need to do:
>
> if (port_reachable(op)) {
>     add_reachable_flows(prio=80);
> } else {
>     add_unreachable_flows(prio=90);
> }
>
> We can probably just do:
>
> if (port_reachable(op)) {
>     add_reachable_flows(prio=80);
> }
> add_unreachable_flows(prio=79);
>
> The idea being that a packet is either for a "reachable IP" in which
> case we hit the prio 80 flow, or it's not, in which case we hit the prio
> 79 flow.
>
> The advantage is we don't have the port dependency for
> add_unreachable_flows() anymore, which would allow us to prebuild all
> the possible matches/actions once and just reuse them for all logical
> switches where they need to be added.
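>
> Roughly, as a sketch only (within northd.c; the variable names below are
> made up, reusing the existing ovn_lflow_add()/MC_FLOOD helpers):
>
>     /* Build the ARP match for a VIP once... */
>     char *match = xasprintf("arp.op == 1 && arp.tpa == %s", vip);
>
>     /* ...and reuse it at the lower priority on every logical switch the
>      * load balancer is applied to, without any per-port
>      * lrouter_port_ipv4_reachable() check. */
>     for (size_t i = 0; i < n_switches; i++) {
>         ovn_lflow_add(lflows, switches[i], S_SWITCH_IN_L2_LKUP, 79, match,
>                       "outport = \""MC_FLOOD"\"; output;");
>     }
>     free(match);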
>
> Thanks,
> Dumitru
>
> >>
> >> As a follow-up optimization we can take the same approach as in
> >> 0fd39c6cf0ba "northd: move build_lb_rules in build_lswitch_flows_for_lb"
> >> and precompute the matches once for all load balancer VIPs, and just use
> >> them when installing the flows on the connected logical switch datapaths.
> >>
> >> Thoughts?

Hi Dumitru,

Thanks for addressing this issue.

I looked into the issue and in my opinion we can also solve this
problem using address sets.

This is how I'd see an lflow being generated by ovn-northd:

 -  table=22(ls_in_l2_lkup      ), priority=90   ,
    match=(flags[1] == 0 && arp.op == 1 &&
           arp.tpa == ${<LS_NAME>_V4_UNREACHABLE_IPS}),
    action=(outport = "_MC_flood"; output;)

And ovn-northd would create an address set for each logical switch and
store the unreachable v4 IPs in it (and the same for v6).
It should be straightforward to generate an address set name for each
logical switch.

Presently the function build_lswitch_rport_arp_req_flows() builds an
sset of ips_v4_match_reachable.  This part of the code can be separated
out and used to sync the address sets in the SB DB.

When an unreachable IP is added to or deleted from the sset,
ovn-controller can incrementally handle the change to the address set,
and this should not result in openflow cookie updates.
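
For example, the syncing side in ovn-northd could look roughly like the
sketch below.  The address set name format, the function name and the
unreachable_ips_v4 sset are illustrative only; sbrec_address_set_*() are
the generated SB IDL accessors.  (Updating an already existing address
set is omitted here.)

    static void
    sync_unreachable_ips_address_set(struct ovsdb_idl_txn *ovnsb_txn,
                                     const char *ls_name,
                                     const struct sset *unreachable_ips_v4)
    {
        /* Name format matching the ${<LS_NAME>_V4_UNREACHABLE_IPS}
         * reference in the lflow match above. */
        char *as_name = xasprintf("%s_V4_UNREACHABLE_IPS", ls_name);

        /* Collect the sset contents into an array for the IDL setter. */
        size_t n_ips = sset_count(unreachable_ips_v4);
        const char **ips = xmalloc(n_ips * sizeof *ips);
        size_t i = 0;
        const char *ip;
        SSET_FOR_EACH (ip, unreachable_ips_v4) {
            ips[i++] = ip;
        }

        struct sbrec_address_set *as = sbrec_address_set_insert(ovnsb_txn);
        sbrec_address_set_set_name(as, as_name);
        sbrec_address_set_set_addresses(as, ips, n_ips);

        free(ips);
        free(as_name);
    }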

IMHO we should make use of address sets here.

Thoughts?  Any concerns you see w.r.t. performance with this approach?

Thanks
Numan

> >>
> >> Thanks,
> >> Dumitru
> >>
> >
>

