[ovs-discuss] [OVN] OVN Load balancing algorithm

Maciej Jozefczyk mjozefcz at redhat.com
Mon May 11 14:19:26 UTC 2020


Hello,

Thanks for working on this.

The change for setting the selection fields for Load_Balancer has been
merged upstream [1].
This change helped a lot: the OVN Octavia provider driver (master) is now
able to specify those fields [2] and, as a result, can be properly tested
with tempest tests [3].

The OVN Octavia bug described here [4], which is solved by the selection
fields, still affects the stable releases, so a backport would help.
The original behaviour is unchanged: if the selection fields in the
Load_Balancer row are not set, the 5-tuple hash is still used as
before.
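
For reference, setting the fields on a load balancer looks roughly like
this (a hedged example based on my reading of the schema change in [1];
please double-check the exact syntax against your OVN version):

  $ ovn-nbctl set load_balancer <lb-uuid> \
        selection_fields="ip_src,ip_dst,tp_src,tp_dst"

With the column populated, the group hashes on exactly the listed fields.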

Can I ask you to cherry-pick that change to the latest stable branch-20.03?


[1]
https://github.com/ovn-org/ovn/commit/5af304e7478adcf5ac50ed41e96a55bebebff3e8
[2] https://review.opendev.org/#/c/726787
[3] https://review.opendev.org/#/c/714004/
[4] https://bugs.launchpad.net/neutron/+bug/1871239

On Mon, May 4, 2020 at 3:40 AM Han Zhou <zhouhan at gmail.com> wrote:

>
>
> On Thu, Apr 30, 2020 at 4:12 AM Tonghao Zhang <xiangxia.m.yue at gmail.com>
> wrote:
> >
> > On Thu, Apr 30, 2020 at 2:58 PM Han Zhou <zhouhan at gmail.com> wrote:
> > >
> > > Thanks Maciej for testing.
> > >
> > > On Tue, Apr 28, 2020 at 5:36 AM Maciej Jozefczyk <mjozefcz at redhat.com>
> wrote:
> > > > Here are my findings.
> > > >
> > > > 1) Test LB and sleep 1 second between calls:
> > > > ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 1
> > > >
> > > > Result: http://paste.openstack.org/show/792818/
> > > > Different backends are selected and different buckets are being hit
> in group_id=3: sometimes bucket1 is hit, sometimes bucket0.
> > > > Output from the group dumps during the test:
> http://paste.openstack.org/show/792820/
> > > >
> > > >
> > > > 2) Test LB and sleep 60 seconds between calls:
> > > > ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 60
> > > >
> > > > Result: http://paste.openstack.org/show/792822/
> > > > Output from group stats: http://paste.openstack.org/show/792823/
> > > > The same bucket (bucket0) is always hit and requests are directed to
> the same backend.
> > > >
> > >
> > > This test result confirms our earlier analysis: the different hash
> values generated by the kernel fastpath and the userspace slowpath were
> the culprit.
> > >
> > > >
> > > >
> > > > On Fri, Apr 24, 2020 at 6:09 PM Ilya Maximets <i.maximets at ovn.org>
> wrote:
> > > >>
> > > >> On 4/24/20 3:19 AM, Han Zhou wrote:
> > > >> > Based on the discussion in the OVN meeting today, I did some
> more testing, and here are my findings.
> > > >> >
> > > >> > - With ICMP (ping) between the same source and destination, the
> same bucket is always selected by dp_hash.
> > > >> > - With "nc" specifying same TCP 5-tuples, the packets can end up
> into different buckets. This is similar to what Numan and Maciej observed.
> > > >> >
> > > >> > However, I was testing with the OVN ECMP feature instead of LB.
> Since the ECMP feature doesn't use conntrack, here are some more findings.
> The bucket selection alternates between only 2 buckets, and the change
> happens when the packet moves between the userspace and kernel datapaths.
> Say the first packet of a flow (megaflow) goes to userspace and hits
> bucket1; if I then send more packets immediately, they all hit bucket2, but
> if I wait until the flow disappears from the megaflow cache and send the
> next packet, it hits bucket1 again. This behavior is consistent.
> > > >> >
> > > >> > So I think the different buckets were selected because of the
> different implementations of dp_hash in the userspace and kernel datapaths
> (thanks Ilya for this hint in today's meeting).
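> > > >> >
> > > >> > To make the failure mode concrete, here is a toy model (my own
> > > >> > illustration, not OVS code): two datapaths hashing the identical
> > > >> > 5-tuple with different functions can map it to different buckets,
> > > >> > so the bucket flips with whichever datapath handles the packet:
> > > >> >
> > > >> >     /* Toy model: same 5-tuple, two hash implementations, two
> > > >> >      * buckets.  The hash functions are stand-ins, not the real
> > > >> >      * kernel/userspace code. */
> > > >> >     #include <stdint.h>
> > > >> >     #include <stdio.h>
> > > >> >
> > > >> >     static uint32_t hash_kernel(uint32_t x)
> > > >> >     {
> > > >> >         return x * 2654435761u;
> > > >> >     }
> > > >> >
> > > >> >     static uint32_t hash_userspace(uint32_t x)
> > > >> >     {
> > > >> >         return (x ^ 0xdeadbeef) * 40503u;
> > > >> >     }
> > > >> >
> > > >> >     int main(void)
> > > >> >     {
> > > >> >         uint32_t tuple = 0x0a000001; /* pretend: the 5-tuple */
> > > >> >         unsigned n_buckets = 2;
> > > >> >
> > > >> >         printf("kernel    -> bucket%u\n", hash_kernel(tuple) % n_buckets);
> > > >> >         printf("userspace -> bucket%u\n", hash_userspace(tuple) % n_buckets);
> > > >> >         return 0;
> > > >> >     }
> > > >> >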
> > > >> > Numan/Maciej, in your tests did you see more than 2 buckets hit
> for the same 5-tuple? If the above theory is right, you should see at most
> 2 buckets hit. For LB, since it uses CT and only the first packet uses the
> group, all packets of the same flow would always be forwarded to the same
> LB backend. I guess if you wait long enough between tests, you should see
> all tests hitting the same backend. It would be great if you could confirm
> this.
> > > >> >
> > > >> > For ECMP, this behavior will cause occasional out-of-order
> packets even for a single flow (for a burst of packets after some idle
> time), because CT is not used (and we can't use it: when peered with
> physical ECMP router groups, we can't ensure the return traffic from the
> physical routers hits the same LR).
> > > >> >
> > > >> > For LB, it causes the unexpected behavior reported in this
> thread.
> > > >> >
> > > >> > For the fix, I think we should figure out how to make sure
> dp_hash always uses the same hash algorithm for both the userspace and
> kernel implementations, if possible.
> > > >> > I am ok with the patch from Numan for the capability of
> configuring the desired hash method instead of always using the default.
> However, using "hash" may be a huge performance sacrifice since the packets
> are always handled in the slowpath, especially for ECMP. Even though LB
> uses CT, for short-lived flows this is still a big performance penalty (for
> long-lived LB flows it may be ok, since the majority of packets stay in the
> fastpath).
> > > >> >
> > > >> > I am not familiar with the dp_hash implementation. I will study
> it some more, but any ideas on how to ensure the consistency of dp_hash are
> highly appreciated!
> > > >>
> > > >> I had the impression that the packet hash was carried from the
> datapath to userspace and back, but it turns out that this is a really
> recent change.  It seems the following changes are required:
> > > >>
> > > >> 1. Linux kernel: bd1903b7c459 ("net: openvswitch: add hash info to
> upcall")
> > > >>    This is available starting from upstream kernel v5.5.
> > > >>
> > > >> 2. OVS: 0442bfb11d6c ("ofproto-dpif-upcall: Echo HASH attribute
> back to datapath.")
> > > >>    This is available on branch-2.13.
> > > >>
> > > >> With the above two patches, the first and subsequent packets
> should have the same dp_hash, calculated by the kernel datapath.
> > > >>
> > >
> > > Thanks Ilya for this information! Do you know why the kernel fix (1)
> was not backported to the OVS repo? Was it missed, or on purpose? cc
> Tonghao Zhang.
> > > I ran a test using ovn-fake-multinode with OVS master, and the problem
> disappeared. However, I am using an old kernel module, and I don't think
> patch (1) is there.
> > Most of my patches were applied upstream in the Linux kernel, and I
> > don't know who handles the backports, but I am happy to backport it after
> > the holiday. Thanks.
>
> Thanks Tonghao! I backported your kernel patch and related fix here,
> please take a look:
> https://patchwork.ozlabs.org/project/openvswitch/patch/1588554154-30608-1-git-send-email-hzhou@ovn.org/
>
> I tested it in the same environment where I could reproduce the issue
> with OVN ECMP. Here is the new behavior:
>
> - With a single connection (e.g. SSH), all packets now hit the same
> bucket, no matter whether the datapath megaflow exists or not. This solves
> the potential out-of-order problem for ECMP.
>
> - With "nc" to test same 5-tuple but different connections, still
> different buckets were selected. I think it is probably because the hash
> used now by dp_hash is generated by the socket, which is random, as
> mentioned by the commit message of the kernel patch. It is not a problem
> for ECMP, but it would not solve the problem initially brought up by this
> thread, for the requirement of the special LB use cases.
>
> To solve the LB problem with dp_hash (as a further improvement), can we
> support calculating the hash in the datapath instead of taking it from the
> skb, while using the same mechanism as this patch to ensure the same hash
> value is used for the upcall handling?
>
> P.S. Another problem for the LB use case is that there can be many
> backends (more than 64). The current dp_hash implementation has a
> limitation: if more than 64 hash values are required, it falls back to the
> original "hash" algorithm, which satisfies the same-5-tuple, same-bucket
> requirement but can have performance problems. The limitation mainly comes
> from supporting different bucket weights using the Webster method. However,
> OVN LB doesn't even use the weights (they are all equal).
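>
> To illustrate what that table-based selection looks like (my own sketch of
> the concept, not the actual ofproto code), dp_hash maps a masked hash value
> to a bucket through a small table; with equal weights the table could
> simply be filled round-robin:
>
>     /* Sketch: map dp_hash values to buckets through a fixed table.
>      * The real implementation distributes the slots by bucket weight
>      * (Webster method) and falls back to "hash" when more than 64
>      * slots would be needed. */
>     #include <stdint.h>
>
>     #define N_HASH_VALUES 64                /* power of two */
>
>     struct select_group {
>         unsigned n_buckets;
>         uint8_t bucket_by_hash[N_HASH_VALUES];
>     };
>
>     static void
>     init_equal_weights(struct select_group *g, unsigned n_buckets)
>     {
>         g->n_buckets = n_buckets;
>         for (unsigned i = 0; i < N_HASH_VALUES; i++) {
>             g->bucket_by_hash[i] = i % n_buckets;  /* round-robin fill */
>         }
>     }
>
>     static unsigned
>     select_bucket(const struct select_group *g, uint32_t dp_hash)
>     {
>         return g->bucket_by_hash[dp_hash & (N_HASH_VALUES - 1)];
>     }
>
> For equal-weight buckets like OVN's, computing bucket = dp_hash %
> n_buckets directly would sidestep the 64-value cap entirely.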
>
> Thanks,
> Han
>
> > > >>
> > > >> >
> > > >> > Thanks,
> > > >> > Han
> > > >> >
> > > >> > On Tue, Apr 21, 2020 at 1:05 AM Daniel Alvarez Sanchez <
> dalvarez at redhat.com> wrote:
> > > >> >>
> > > >> >> Thanks Numan for the investigation and the great explanation!
> > > >> >>
> > > >> >> On Tue, Apr 21, 2020 at 9:38 AM Numan Siddique <numans at ovn.org>
> wrote:
> > > >> >>>
> > > >> >>> On Fri, Apr 17, 2020 at 12:56 PM Han Zhou <zhouhan at gmail.com>
> wrote:
> > > >> >>> >
> > > >> >>> >
> > > >> >>> >
> > > >> >>> > On Tue, Apr 7, 2020 at 7:03 AM Maciej Jozefczyk <
> mjozefcz at redhat.com> wrote:
> > > >> >>> > >
> > > >> >>> > > Hello!
> > > >> >>> > >
> > > >> >>> > > I would like to ask you to clarify how the OVN load
> balancing algorithm works.
> > > >> >>> > >
> > > >> >>> > > Based on the action [1]:
> > > >> >>> > > 1) If the connection is alive, the same 'backend' will be chosen.
> > > >> >>> > >
> > > >> >>> > > 2) If it is a new connection, the backend will be chosen
> based on selection_method=dp_hash [2].
> > > >> >>> > > Based on the changelog, dp_hash uses a '5-tuple hash' [3].
> > > >> >>> > > The hash is calculated from these values: source and
> destination IP, source port, protocol, and an arbitrary value, 42 [4].
> > > >> >>> > > Based on that information we could name it SOURCE_IP_PORT.
> > > >> >>> > >
> > > >> >>> > > Unfortunately, we recently got a bug report in the OVN
> Octavia provider driver project that load balancing in OVN works
> differently [5]. The report shows that even when the test uses the same
> source IP and port, but a new TCP connection, traffic is randomly
> distributed, although based on [2] it shouldn't be.
> > > >> >>> > >
> > > >> >>> > > Is it a bug?  Is something else taken into account while
> creating the hash? Can it be fixed in OVS/OVN?
> > > >> >>> > >
> > > >> >>> > >
> > > >> >>> > >
> > > >> >>> > > Thanks,
> > > >> >>> > > Maciej
> > > >> >>> > >
> > > >> >>> > >
> > > >> >>> > > [1]
> https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1017
> > > >> >>> > > [2]
> https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1059
> > > >> >>> > > [3]
> https://github.com/openvswitch/ovs/blob/d58b59c17c70137aebdde37d3c01c26a26b28519/NEWS#L364-L371
> > > >> >>> > > [4]
> https://github.com/openvswitch/ovs/blob/74286173f4d7f51f78e9db09b07a6d4d65263252/lib/flow.c#L2217
> > > >> >>> > > [5] https://bugs.launchpad.net/neutron/+bug/1871239
> > > >> >>> > >
> > > >> >>> > > --
> > > >> >>> > > Best regards,
> > > >> >>> > > Maciej Józefczyk
> > > >> >>> >
> > > >> >>> > Hi Maciej,
> > > >> >>> >
> > > >> >>> > Thanks for reporting. It is definitely strange that the same
> 5-tuple flow resulted in hitting different backends. I hadn't observed such
> behavior before (maybe I should try again myself to confirm). Can you make
> sure the group buckets didn't change during the testing? You can do so by:
> > > >> >>> > # ovs-ofctl dump-groups br-int
> > > >> >>> > and also check the group stats to see whether multiple
> buckets' counters increased during the test:
> > > >> >>> > # ovs-ofctl dump-group-stats br-int [group]
> > > >> >>> >
> > > >> >>> > For the 5-tuple hash function you are looking at,
> flow_hash_5tuple(), it uses all five tuple fields. It adds both ports (src
> and dst) at once:
> > > >> >>> >
> > > >> >>> >     /* Add both ports at once. */
> > > >> >>> >     hash = hash_add(hash,
> > > >> >>> >                     ((const uint32_t *) flow)[offsetof(struct flow, tp_src)
> > > >> >>> >                                               / sizeof(uint32_t)]);
> > > >> >>> >
> > > >> >>> > tp_src marks the starting offset, and the read is 32 bits
> wide, covering both src and dst ports, 16 bits each. (Although I am not
> sure whether the dp_hash method uses this function or not; I need to check
> more code.)
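> > > >> >>> >
> > > >> >>> > A tiny standalone demo of that trick (my own sketch, not OVS
> > > >> >>> > code): viewing the flow as an array of 32-bit words and
> > > >> >>> > indexing by the offset of tp_src picks up both 16-bit ports in
> > > >> >>> > a single load:
> > > >> >>> >
> > > >> >>> >     #include <stddef.h>
> > > >> >>> >     #include <stdint.h>
> > > >> >>> >     #include <stdio.h>
> > > >> >>> >
> > > >> >>> >     struct mini_flow {           /* stand-in for struct flow */
> > > >> >>> >         uint16_t tp_src, tp_dst; /* adjacent, as in struct flow */
> > > >> >>> >     };
> > > >> >>> >
> > > >> >>> >     int main(void)
> > > >> >>> >     {
> > > >> >>> >         union {                  /* union keeps the pun defined */
> > > >> >>> >             struct mini_flow flow;
> > > >> >>> >             uint32_t words[1];
> > > >> >>> >         } u = { .flow = { .tp_src = 0x1234, .tp_dst = 0x5678 } };
> > > >> >>> >
> > > >> >>> >         uint32_t both = u.words[offsetof(struct mini_flow, tp_src)
> > > >> >>> >                                 / sizeof(uint32_t)];
> > > >> >>> >         /* Prints both ports packed into one 32-bit word
> > > >> >>> >          * (host byte order). */
> > > >> >>> >         printf("0x%08x\n", (unsigned) both);
> > > >> >>> >         return 0;
> > > >> >>> >     }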
> > > >> >>> >
> > > >> >>> > BTW, I am not sure why Neutron gives it the name
> SOURCE_IP_PORT. Should it be called just 5-TUPLE, since the protocol,
> destination IP, and destination port are also considered in the hash?
> > > >> >>> >
> > > >> >>>
> > > >> >>>
> > > >> >>> Hi Maciej and Han,
> > > >> >>>
> > > >> >>> I did some testing and I can confirm what you're saying: OVN
> is not choosing the same backend when the src IP and src port are fixed.
> > > >> >>>
> > > >> >>> I think there is an issue with how OVN programs the group
> flows. OVN sets the selection_method to dp_hash, but when ovs-vswitchd
> receives the GROUP_MOD OpenFlow message, I noticed that the
> selection_method is not set. From the code I see that selection_method is
> encoded only if ovn-controller uses OpenFlow 1.5 [1].
> > > >> >>>
> > > >> >>> Since selection_method is NULL, vswitchd uses the dp_hash
> method [2]. dp_hash means it uses the hash calculated by the datapath; in
> the case of the kernel datapath, from what I understand, that is
> skb_get_hash().
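> > > >> >>>
> > > >> >>> The defaulting logic is roughly this shape (my paraphrase of the
> > > >> >>> code around [2]; the names are illustrative, not the verbatim
> > > >> >>> source):
> > > >> >>>
> > > >> >>>     /* Sketch: pick the selection method for a select group when
> > > >> >>>      * the GROUP_MOD carried none (pre-OF1.5 controllers cannot
> > > >> >>>      * encode one). */
> > > >> >>>     if (!selection_method || !selection_method[0]) {
> > > >> >>>         method = METHOD_DP_HASH;   /* OVS >= 2.10 default */
> > > >> >>>     } else if (!strcmp(selection_method, "dp_hash")) {
> > > >> >>>         method = METHOD_DP_HASH;   /* datapath-computed hash */
> > > >> >>>     } else {
> > > >> >>>         method = METHOD_HASH;      /* slow-path hash over the
> > > >> >>>                                     * selected fields */
> > > >> >>>     }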
> > > >> >>>
> > > >> >>> I modified the vswitchd code to use the selection_method
> "hash" if selection_method is not set. In that case the load balancer
> works as expected: for a fixed src IP, src port, dst IP, and dst port, the
> group action always selects the same bucket [3].
> > > >> >>>
> > > >> >>> I think we need to fix a few issues in OVN:
> > > >> >>>   - Use OpenFlow 1.5 so that OVN can set selection_method.
> > > >> >>>   - Use the "hash" method if dp_hash does not choose the same
> bucket for a given 5-tuple hash.
> > > >> >>>   - Maybe provide an option for the CMS to choose the
> algorithm, i.e. whether to use dp_hash or hash.
> > > >> >>>
> > > >> >> I'd rather not expose this to the CMS, as it depends on the
> datapath implementation as per [0], but maybe it makes sense to eventually
> abstract it to the CMS in a more LB-ish way (the common algorithm names
> used in load balancing) in case the LB feature is at some point enhanced to
> support more algorithms.
> > > >> >>
> > > >> >> I believe that for OVN LB users, using OF 1.5 to force the use
> of 'hash' would be the best solution now.
> > > >> >>
> > > >> >> My 2 cents as I'm not an LB expert.
> > > >> >>
> > > >> >> I also recall that we tested this in the past and it seemed to
> be working. I checked further in the doc [0] and found this paragraph:
> > > >> >>
> > > >> >> "If no selection method is specified, Open vSwitch up to release
> 2.9 applies the hash method with default fields. From 2.10 onwards Open
> vSwitch defaults to the dp_hash method with symmetric L3/L4 hash algorithm,
> unless the weighted group buck‐ ets cannot be mapped to a maximum of 64
> dp_hash values with sufficient accuracy. In those rare cases Open vSwitch
> 2.10 and later fall back to the hash method with the default set of hash
> fields."
> > > >> >>
> > > >> >> The explanation seems to be that when we tested the feature we
> relied on OVS 2.9, hence the confusion.
> > > >> >>
> > > >> >> Thanks a lot again!
> > > >> >> Daniel
> > > >> >>
> > > >> >> [0]
> http://www.openvswitch.org/support/dist-docs/ovs-ofctl.8.html
> > > >> >>
> > > >> >>>
> > > >> >>> I'll look into how to support this.
> > > >> >>>
> > > >> >>> [1] -
> https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2120
> > > >> >>>
> https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2082
> > > >> >>>
> > > >> >>> [2] -
> https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif.c#L5108
> > > >> >>> [3] -
> https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-xlate.c#L4553
> > > >> >>>
> > > >> >>>
> > > >> >>> Thanks
> > > >> >>> Numan
> > > >> >>>
> > > >> >>>
> > > >> >>> > Thanks,
> > > >> >>> > Han
> > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Maciej Józefczyk
> >
> >
> >
> > --
> > Best regards, Tonghao
>


-- 
Best regards,
Maciej Józefczyk