[ovs-dev] [PATCH ovn] northd: Don't merge ARP flood flows for all (un)reachable IPs.

Mark Michelson mmichels at redhat.com
Thu Jul 29 20:52:49 UTC 2021


On 7/29/21 1:49 PM, Numan Siddique wrote:
> On Thu, Jul 29, 2021 at 1:14 PM Dumitru Ceara <dceara at redhat.com> wrote:
>>
>> On 7/29/21 6:53 PM, Numan Siddique wrote:
>>> On Thu, Jul 29, 2021 at 12:31 PM Dumitru Ceara <dceara at redhat.com> wrote:
>>>>
>>>> On 7/29/21 5:42 PM, Mark Gray wrote:
>>>>> On 29/07/2021 14:20, Dumitru Ceara wrote:
>>>>>> On 7/29/21 11:59 AM, Dumitru Ceara wrote:
>>>>>>> Logical_Flows are immutable, that is, when the match or action changes,
>>>>>>> ovn-northd actually removes the old logical flow and adds a new one.
>>>>>>>
>>>>>>> Commit ecc941c70f16 added flows to flood ARPs to routers for unreachable
>>>>>>> addresses.
>>>>>>>
>>>>>>> In the following topology (2 switches connected to a router with a load
>>>>>>> balancer attached to it):
>>>>>>>
>>>>>>> S1 --- (IP1/MASK1) R (IP2/MASK2) --- S2
>>>>>>>
>>>>>>> LB1 (applied on R) with N VIPs.
>>>>>>>
>>>>>>> The following logical flows were generated:
>>>>>>>
>>>>>>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP1 }" then ...
>>>>>>> S2 (in_l2_lkup): if "arp.op == 1 && arp.tpa == { VIP1, .., VIPn, IP2 }" then ...
>>>>>>>
>>>>>>> While this seemed a good idea initially because it would reduce the
>>>>>>> total number of logical flows, this approach has a few major downsides
>>>>>>> that severely affect scalability:
>>>>>>>
>>>>>>> 1. When logical datapath grouping is enabled (which is how high-scale
>>>>>>>     deployments should use OVN) there is no chance that the flows for
>>>>>>>     reachable/unreachable router-owned IPs are grouped together for all
>>>>>>>     datapaths that correspond to the logical switches they are generated
>>>>>>>     for, because each switch's merged match is different.
>>>>>>> 2. Whenever a VIP (VIPn+1) is added to LB1 all these flows will be
>>>>>>>     removed and new ones will be added with the only difference being the
>>>>>>>     VIPn+1 added to the flow match.  As this is a new logical flow record,
>>>>>>>     ovn-controller will have to remove all previously installed openflows
>>>>>>>     (for the deleted logical flow) and add new ones.  However, for most
>>>>>>>     of these openflows, the only change is the cookie value (the first 32
>>>>>>>     bits of the logical flow UUID).  At scale this unnecessary openflow
>>>>>>>     churn will induce significant latency.  Moreover, this also creates
>>>>>>>     unnecessary jsonrpc traffic from ovn-northd towards OVN_Southbound
>>>>>>>     because the old flow needs to be removed and a new one (with a large
>>>>>>>     match field) needs to be installed.
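>>>>>>>
>>>>>>>     For illustration (cookie and table values below are made up), the
>>>>>>>     churn looks roughly like this in OVS, with identical match/actions:
>>>>>>>
>>>>>>>       removed: cookie=0xaaaa1111, table=N, arp,arp_tpa=VIP1 actions=...
>>>>>>>       added:   cookie=0xbbbb2222, table=N, arp,arp_tpa=VIP1 actions=...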
>>>>>>>
>>>>>>> To avoid this, we now install individual flows, one per IP, for
>>>>>>> managing ARP/ND traffic from logical switches towards router owned
>>>>>>> IPs that are reachable/unreachable on the logical port connecting
>>>>>>> the switch.
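>>>>>>>
>>>>>>> With the fix, the example above becomes one flow per IP:
>>>>>>>
>>>>>>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == VIP1" then ...
>>>>>>> ...
>>>>>>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == VIPn" then ...
>>>>>>> S1 (in_l2_lkup): if "arp.op == 1 && arp.tpa == IP1" then ...
>>>>>>>
>>>>>>> so adding VIPn+1 only adds a new logical flow and leaves the
>>>>>>> existing flows (and their openflows) untouched.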
>>>>>>>
>>>>>>> Fixes: ecc941c70f16 ("northd: Flood ARPs to routers for "unreachable" addresses.")
>>>>>>> Signed-off-by: Dumitru Ceara <dceara at redhat.com>
>>>>>>> ---
>>>>>>> On a deployment based on an ovn-kubernetes cluster with 20 nodes, 2500
>>>>>>> load balancer VIPs, 5000 pods, we:
>>>>>>> - measure number of logical flows in the SB DB, size of the SB DB
>>>>>>> - then create an additional 250 pods (logical ports), bind them, and add
>>>>>>>    a load balancer VIP (with a single backend) for each of the additional
>>>>>>>    pods, and measure how long it takes for all openflows corresponding to
>>>>>>>    each of the pods to be installed in OVS.
>>>>>>>
>>>>>>> Results:
>>>>>>> - without fix:
>>>>>>>    ------------
>>>>>>>    SB lflows:             67976
>>>>>>>    SB size (uncompacted): 46 MiB
>>>>>>>    SB size (compacted):   40 MiB
>>>>>>>
>>>>>>>    after 250 ports + VIPs added:
>>>>>>>    SB lflows:             71479
>>>>>>>    SB size (uncompacted): 248 MiB
>>>>>>>    SB size (compacted):   42 MiB
>>>>>>>    Max time to install all relevant openflows for a single pod: ~6sec
>>>>>>>
>>>>>>> - with fix:
>>>>>>>    ---------
>>>>>>>    SB lflows:             70399
>>>>>>>    SB size (uncompacted): 46 MiB
>>>>>>>    SB size (compacted):   40 MiB
>>>>>>>
>>>>>>>    after 250 ports + VIPs added:
>>>>>>>    SB lflows:             74149
>>>>>>>    SB size (uncompacted): 165 MiB
>>>>>>>    SB size (compacted):   42 MiB
>>>>>>>    Max time to install all relevant openflows for a single pod: ~100msec
>>>>>>
>>>>>>
>>>>>
>>>>> I think you should put these^ (or a summary of these - including maybe
>>>>> the northd degradation) into the commit message.
>>>>>
>>>>
>>>> Sure, I'll do that in the next revision.
>>>>
>>>>>> CC: Mark Michelson (original author of ecc941c70f16 ("northd: Flood ARPs
>>>>>> to routers for "unreachable" addresses."))
>>>>>>
>>>>>> There's another note I should mention here:
>>>>>>
>>>>>> Since we now call build_lswitch_rport_arp_req_flow_for_unreachable_ip()
>>>>>> for every load balancer VIP (instead of once for a list of VIPs), the
>>>>>> load on ovn-northd increases, as it has to create more logical flows
>>>>>> that will eventually be grouped together on the same datapath group.
>>>>>>
>>>>>> When testing at higher scales (120-node setups with topologies similar
>>>>>> to the one above but with 45K VIPs) I see ovn-northd poll loop intervals
>>>>>> taking ~3s longer on my machine (from ~10s to ~13s).  I do, however,

I'm fine with this tradeoff. As someone else stated, this relieves load 
on many ovn-controllers at the cost of slightly more load on 
ovn-northd. As you've noted, there are also potential additional 
optimizations we could make in ovn-northd to offset this hit.

>>>>>
>>>>> I'm kind of surprised at that degradation because your change doesn't
>>>>> introduce any additional looping. This is purely due to the cost of
>>>>> ovn_lflow_add_with_hint()?
>>>>
>>>> Well, yes, and building of the match.  I commented it out and the hit is
>>>> gone.
>>>>
>>>>>
>>>>>> still think that not merging the flow matches is a good thing because it
>>>>>> has a very substantial impact on all ovn-controllers in the cluster,
>>>>>> already visible in our tests with smaller, 20 node clusters.
>>>>>>
>>>>>> A point here is that I think it's fine if
>>>>>> build_lswitch_rport_arp_req_flow_for_unreachable_ip() installs flows at
>>>>>> a priority lower than the one used for reachable IPs, e.g., 79.  That
>>>>>> would allow us to avoid the !lrouter_port_ipv*_reachable() condition
>>>>>> when installing these flows.
>>>>>
>>>>> I don't get this point?
>>>>
>>>> In my opinion we don't need to do:
>>>>
>>>> if (port_reachable(op)) {
>>>>      add_reachable_flows(prio=80);
>>>> } else {
>>>>      add_unreachable_flows(prio=90);
>>>> }
>>>>
>>>> We can probably just do:
>>>>
>>>> if (port_reachable(op)) {
>>>>      add_reachable_flows(prio=80);
>>>> }
>>>> add_unreachable_flows(prio=79);
>>>>
>>>> The idea being that a packet is either for a "reachable IP", in which
>>>> case we hit the prio 80 flow, or it's not, in which case we hit the
>>>> prio 79 flow.
>>>>
>>>> The advantage is that we don't have the port dependency for
>>>> add_unreachable_flows() anymore, which would allow us to prebuild all
>>>> the possible matches/actions once and just reuse them for all logical
>>>> switches where they need to be added.
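>>>>
>>>> Roughly, as a sketch (the iteration macro below is hypothetical, and
>>>> this is not the actual ovn-northd code):
>>>>
>>>> /* Build the per-VIP match and actions once... */
>>>> struct ds match = DS_EMPTY_INITIALIZER;
>>>> struct ds actions = DS_EMPTY_INITIALIZER;
>>>> ds_put_format(&match, "arp.op == 1 && arp.tpa == %s", vip);
>>>> ds_put_cstr(&actions, "outport = \"_MC_flood\"; output;");
>>>>
>>>> /* ... then reuse them on every logical switch connected to the
>>>>  * router (FOR_EACH_CONNECTED_SWITCH is a made-up iterator). */
>>>> FOR_EACH_CONNECTED_SWITCH (sw_od, lr_od) {
>>>>     ovn_lflow_add_with_hint(lflows, sw_od, S_SWITCH_IN_L2_LKUP, 79,
>>>>                             ds_cstr(&match), ds_cstr(&actions),
>>>>                             stage_hint);
>>>> }
>>>> ds_destroy(&match);
>>>> ds_destroy(&actions);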
>>>>
>>>> Thanks,
>>>> Dumitru
>>>>
>>>>>>
>>>>>> As a follow-up optimization we can take the same approach as in
>>>>>> 0fd39c6cf0ba "northd: move build_lb_rules in build_lswitch_flows_for_lb"
>>>>>> and precompute the matches once for all load balancer VIPs, and just use
>>>>>> them when installing the flows on the connected logical switch datapaths.
>>>>>>
>>>>>> Thoughts?
>>>
>>> Hi Dumitru,
>>
>> Hi Numan,
>>
>>>
>>> Thanks for addressing this issue.
>>>
>>> I looked into the issue and in my opinion we can also solve this
>>> problem using address sets.
>>>
>>> This is how I'd see an lflow being generated by ovn-northd:
>>>
>>>   -  table=22(ls_in_l2_lkup      ), priority=90   , match=(flags[1] ==
>>> 0 && arp.op == 1 && arp.tpa == ${<LS_NAME>_V4_UNREACHABLE_IPS})
>>>       action=(outport = "_MC_flood"; output;)
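>>>
>>> with the address set stored in the SB DB, roughly (sketch, exact
>>> contents up for discussion):
>>>
>>>   Address_Set name="<LS_NAME>_V4_UNREACHABLE_IPS"
>>>               addresses=["VIP1", "VIP2", ..., "VIPn"]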
>>
>> As far as I can tell the problem with this approach is that this will
>> not allow the logical flows to take advantage of datapath groups.
> 
> You're right.  I missed out on the datapath group angle.
> 
> Thanks
> Numan
> 
>>
>> For example, for:
>>
>> LS1 ------+
>>            |
>> LS2 ----- LR
>>            |
>> LS3 ------+
>>
>> If LR has a load balancer with VIPs V1, V2, V3, V4, V5.
>>
>> And the unreachable VIPs for each LRP are:
>>
>> LRP1: V1, V2, V3         (AS: LS1_UNREACH_IPS)
>> LRP2: V1, V2, V3, V4     (AS: LS2_UNREACH_IPS)
>> LRP3: V1, V2, V3, V4, V5 (AS: LS3_UNREACH_IPS)
>>
>> Then the lflows would be:
>>
>> LF1: arp.op == 1 && arp.tpa == $LS1_UNREACH_IPS
>> LF2: arp.op == 1 && arp.tpa == $LS2_UNREACH_IPS
>> LF3: arp.op == 1 && arp.tpa == $LS3_UNREACH_IPS
>>
>> Which cannot be grouped in a logical datapath because they don't have
>> the same match.
>>
>> Whereas with the patch I proposed:
>>
>> LF1: arp.op == 1 && arp.tpa == V1, applied on DPG(LS1, LS2, LS3)
>> LF2: arp.op == 1 && arp.tpa == V2, applied on DPG(LS1, LS2, LS3)
>> LF3: arp.op == 1 && arp.tpa == V3, applied on DPG(LS1, LS2, LS3)
>> LF4: arp.op == 1 && arp.tpa == V4, applied on DPG(LS1, LS2, LS3)
>> LF5: arp.op == 1 && arp.tpa == V5, applied on DPG(LS1, LS2, LS3)
>>
>> The problem is that at high scale the address sets will grow large too;
>> I think that's worse than what happens currently with the single logical
>> flow:
>>
>> arp.op == 1 && arp.tpa == {V1, V2, V3, V4, V5}
>>
>>>
>>> And ovn-northd would create an Address_Set for each logical switch and
>>> store the unreachable v4 IPs in it (and do the same for v6).
>>> It should be straightforward to generate an address set name for each
>>> logical switch.
>>>
>>> Presently the function build_lswitch_rport_arp_req_flows() builds an
>>> sset of ips_v4_match_reachable.  This part of the code can be separated
>>> out and used to sync the address sets in the SB DB.
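>>>
>>> Something like this, via the generated SB IDL setters (a sketch; the
>>> surrounding incremental logic is omitted):
>>>
>>> const struct sbrec_address_set *as =
>>>     sbrec_address_set_insert(ctx->ovnsb_txn);
>>> sbrec_address_set_set_name(as, as_name);
>>> sbrec_address_set_set_addresses(as, addrs, n_addrs);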
>>>
>>> When an unreachable IP is added to or deleted from the sset,
>>> ovn-controller can incrementally handle the change to the address set,
>>> and this should not result in openflow cookie updates.
>>>
>>> IMHO we should make use of address sets here.
>>>
>>> Thoughts?  Any concerns you see w.r.t. performance with this approach?
>>
>> If I'm not wrong, ovn-controller would probably behave similarly if we
>> used address sets, but I think the cost on the SB contents is larger.
>>
>> Also, we'd have to somehow reserve those address sets so that users
>> can't use them, which is hard to do in ovn-northd because we don't
>> parse ACL match contents.
>>
>>>
>>> Thanks
>>> Numan
>>
>> Thanks,
>> Dumitru
>>


