[ovs-discuss] [OVN] OVN Load balancing algorithm

Han Zhou zhouhan at gmail.com
Mon May 4 01:39:46 UTC 2020


On Thu, Apr 30, 2020 at 4:12 AM Tonghao Zhang <xiangxia.m.yue at gmail.com>
wrote:
>
> On Thu, Apr 30, 2020 at 2:58 PM Han Zhou <zhouhan at gmail.com> wrote:
> >
> > Thanks Maciej for testing.
> >
> > On Tue, Apr 28, 2020 at 5:36 AM Maciej Jozefczyk <mjozefcz at redhat.com>
wrote:
> > > Here are my findings.
> > >
> > > 1) Test LB and sleep 1 second between calls:
> > > ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 1
> > >
> > > result: http://paste.openstack.org/show/792818/
> > > Different backends are selected and different buckets are being hit in
group_id=3. Sometimes bucket1 is hit, sometimes bucket0.
> > > Output from groups dumps during the test:
http://paste.openstack.org/show/792820/
> > >
> > >
> > > 2) Test LB and sleep 60 second between calls:
> > > ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 60
> > >
> > > Result: http://paste.openstack.org/show/792822/
> > > Output from group stats: http://paste.openstack.org/show/792823/
> > > Always the same bucket is hit (bucket0) and requests are directed to the
same backend.
> > >
> >
> > This test result proved our earlier analysis: the different hash values
generated by the kernel fastpath and the userspace slowpath were the culprit.
> >
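
As a rough illustration of that mechanism (a minimal standalone sketch, not
OVS or kernel code): the same 5-tuple hashed by two different algorithms
generally maps to different buckets of a 2-bucket select group, so the
upcall path and the cached path can disagree.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Two stand-in hash functions over the same bytes, mimicking a slowpath
     * and a fastpath that do not agree on the hash algorithm. */
    static uint32_t hash_slowpath(const uint8_t *p, size_t n)
    {
        uint32_t h = 2166136261u;                /* FNV-1a style */
        for (size_t i = 0; i < n; i++) { h = (h ^ p[i]) * 16777619u; }
        return h;
    }

    static uint32_t hash_fastpath(const uint8_t *p, size_t n)
    {
        uint32_t h = 42;                         /* simple multiplicative hash */
        for (size_t i = 0; i < n; i++) { h = h * 31 + p[i]; }
        return h;
    }

    int main(void)
    {
        /* One fixed "5-tuple" (src IP, dst IP, src port, dst port, proto);
         * the values are illustrative. */
        const uint8_t tuple[] = { 172, 24, 4, 1, 172, 24, 4, 58,
                                  0xef, 0x12, 0xea, 0xac, 6 };
        const unsigned n_buckets = 2;

        printf("slowpath bucket: %u\n", hash_slowpath(tuple, sizeof tuple) % n_buckets);
        printf("fastpath bucket: %u\n", hash_fastpath(tuple, sizeof tuple) % n_buckets);
        return 0;
    }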
> > >
> > >
> > > On Fri, Apr 24, 2020 at 6:09 PM Ilya Maximets <i.maximets at ovn.org>
wrote:
> > >>
> > >> On 4/24/20 3:19 AM, Han Zhou wrote:
> > >> > Based on the discussion in OVN meeting today I did some more
testing, and here are my findings.
> > >> >
> > >> > - With ICMP (ping) between the same source and destination, the
same bucket is always selected by dp_hash.
> > >> > - With "nc" specifying the same TCP 5-tuple, the packets can end up
in different buckets. This is similar to what Numan and Maciej observed.
> > >> >
> > >> > However, I was using the OVN ECMP feature to test instead of LB.
Since the ECMP feature doesn't use conntrack, here are some more findings.
The bucket selection changes only between 2 buckets, and the change happens
when the packet's datapath changes between userspace and kernel. Say the
first packet of a flow (megaflow) goes to userspace and hits bucket1; if I
then immediately send more packets, they all hit bucket2. But if I wait
until the flow disappears from the megaflow cache and then send the next
packet, it hits bucket1 again. This behavior is consistent.
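
One way to observe this (a hedged sketch, assuming the OVN integration
bridge br-int): dump the datapath flow cache to see whether the megaflow is
still installed, and watch which bucket's counters increase in the group
stats.

    ovs-appctl dpctl/dump-flows
    ovs-ofctl -O OpenFlow13 dump-group-stats br-int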
> > >> >
> > >> > So I think the different buckets were selected because of the
different implementations of dp_hash in the userspace and kernel datapaths
(thanks Ilya for this hint in today's meeting).
> > >> > Numan/Maciej, in your tests did you see more than 2 buckets hit
for the same 5-tuple? If the above theory is right, you should see at most 2
buckets hit. For LB, since it uses CT and only the first packet uses the
group, all packets of the same flow would always be forwarded to the same LB
backend. I guess if you wait long enough between the tests, you should see
all tests hitting the same backend. It would be great if you could confirm
this.
> > >> >
> > >> > For ECMP, this behavior will cause occasional out-of-order packets
even within a single flow (for a burst of packets after some idle time),
because CT is not used (and we can't use it, because when peered with
physical ECMP router groups we can't ensure that the return traffic from the
physical routers hits the same LR).
> > >> >
> > >> > For LB, it causes the unexpected behavior reported in this thread.
> > >> >
> > >> > For the fix, I think we should figure out how to make sure dp_hash
always uses the same hash algorithm in both the userspace and kernel
implementations, if possible.
> > >> > I am ok with the patch from Numan that adds the capability of
configuring the desired hash method instead of always using the default.
However, using "hash" may be a huge performance sacrifice since the packets
are always handled in the slowpath, especially for ECMP. Even though LB uses
CT, for short-lived flow scenarios this is still a big performance penalty
(for long-lived LB flows it may be ok since the majority of packets are
still in the fastpath).
> > >> >
> > >> > I am not familiar with the dp_hash implementation. I will do some
more study, but any idea on how to ensure the consistency of dp_hash is
highly appreciated!
> > >>
> > >> I had an impression that the packet hash is routed from the datapath
to userspace
> > >> and back, but it turned out that it's a really recent change.  Seems
like the following
> > >> changes are required:
> > >>
> > >> 1. Linux kernel: bd1903b7c459 ("net: openvswitch: add hash info to
upcall")
> > >>    This is available starting from upstream kernel v5.5.
> > >>
> > >> 2. OVS: 0442bfb11d6c ("ofproto-dpif-upcall: Echo HASH attribute back
to datapath.")
> > >>    This is available on branch-2.13.
> > >>
> > >> With the above two patches, the first and subsequent packets should
have the same dp_hash
> > >> calculated by the kernel datapath.
> > >>
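
A hedged way to check whether an environment already has both pieces (the
userspace change shipped with branch-2.13, the kernel one with v5.5, unless
backported):

    ovs-vswitchd --version      # OVS userspace version (expect >= 2.13)
    uname -r                    # running kernel (expect >= 5.5, or a backport)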
> >
> > Thanks Ilya for this information! Do you know why the kernel fix (1)
was not backported to the OVS repo? Was it missed or on purpose? cc Tonghao
Zhang.
> > I ran a test using ovn-fake-multinode with OVS master, and the problem
disappeared. However, I am using an old kernel module and I don't think
patch (1) is there.
> Most of my patches were applied in the upstream Linux kernel, and I don't
> know who backports them, but I am fine with backporting this one after the
> holiday. Thanks.

Thanks Tonghao! I backported your kernel patch and related fix here, please
take a look:
https://patchwork.ozlabs.org/project/openvswitch/patch/1588554154-30608-1-git-send-email-hzhou@ovn.org/

I tested it on the same environment where I could reproduce the issue with
OVN ECMP. Here is the new behavior:

- With a single connection (e.g. SSH), all packets now hit the same bucket,
no matter whether the datapath megaflow exists or not. This solves the
potential out-of-order problem for ECMP.

- With "nc" to test same 5-tuple but different connections, still different
buckets were selected. I think it is probably because the hash used now by
dp_hash is generated by the socket, which is random, as mentioned by the
commit message of the kernel patch. It is not a problem for ECMP, but it
would not solve the problem initially brought up by this thread, for the
requirement of the special LB use cases.
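
A hedged sketch of such a test (the VIP and port are the ones from the
earlier pastes; the fixed source port is illustrative): each iteration opens
a new TCP connection with an identical 5-tuple, yet with a socket-generated
hash it may still land on a different bucket. Note that reusing the source
port can fail with "Address already in use" while the previous connection is
in TIME_WAIT.

    for i in 1 2 3 4 5; do
        echo test | nc -w 2 -p 61234 172.24.4.58 60092
        sleep 5
    done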

To solve the LB problem with dp_hash (as a further improvement), can we
support calculating the hash in the datapath instead of taking it from the
skb, while using the same mechanism as this patch to ensure that the same
hash value is used for the upcall handling?

P.S. Another problem for the LB use case is that there can be many backends
(more than 64). The current dp_hash implementation has a limitation: if more
than 64 hash values are required, it falls back to the original "hash"
algorithm, which satisfies the same-5-tuple-same-bucket requirement but may
have performance problems. The limitation mainly comes from the need to
support different bucket weights, using the Webster method. However, OVN LB
doesn't even use the weights (they are all equal).
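
As a minimal standalone sketch of that last point (not the OVS
implementation): with equal-weight buckets, no Webster-style apportionment
is needed at all; a plain modulo over a power-of-two hash space already
spreads any number of backends evenly to within one hash value.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        const unsigned n_backends = 100;   /* more than the 64-value limit */
        const uint32_t hash_space = 256;   /* any power-of-two hash space */
        unsigned hits[100] = { 0 };

        for (uint32_t h = 0; h < hash_space; h++) {
            hits[h % n_backends]++;        /* equal weights: plain modulo */
        }

        unsigned min = hits[0], max = hits[0];
        for (unsigned i = 1; i < n_backends; i++) {
            if (hits[i] < min) { min = hits[i]; }
            if (hits[i] > max) { max = hits[i]; }
        }
        printf("backends=%u hash_space=%u min_hits=%u max_hits=%u\n",
               n_backends, hash_space, min, max);
        return 0;
    }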

Thanks,
Han

> > >>
> > >> >
> > >> > Thanks,
> > >> > Han
> > >> >
> > >> > On Tue, Apr 21, 2020 at 1:05 AM Daniel Alvarez Sanchez <
dalvarez at redhat.com> wrote:
> > >> >>
> > >> >> Thanks Numan for the investigation and the great explanation!
> > >> >>
> > >> >>> On Tue, Apr 21, 2020 at 9:38 AM Numan Siddique <numans at ovn.org>
wrote:
> > >> >>>
> > >> >>> On Fri, Apr 17, 2020 at 12:56 PM Han Zhou <zhouhan at gmail.com>
wrote:
> > >> >>> >
> > >> >>> >
> > >> >>> >
> > >> >>> > On Tue, Apr 7, 2020 at 7:03 AM Maciej Jozefczyk <
mjozefcz at redhat.com> wrote:
> > >> >>> > >
> > >> >>> > > Hello!
> > >> >>> > >
> > >> >>> > > I would like to ask you to clarify how the OVN Load
balancing algorithm works.
> > >> >>> > >
> > >> >>> > > Based on the action [1]:
> > >> >>> > > 1) If the connection is alive, the same 'backend' will be chosen.
> > >> >>> > >
> > >> >>> > > 2) If it is a new connection, the backend will be chosen
based on selection_method=dp_hash [2].
> > >> >>> > > Based on the changelog, dp_hash uses a '5 tuple hash' [3].
> > >> >>> > > The hash is calculated from the following values: source and
destination IP, source port, protocol, and an arbitrary value, 42 [4].
> > >> >>> > > Based on that information we could name it SOURCE_IP_PORT.
> > >> >>> > >
> > >> >>> > > Unfortunately we recently got a bug report in the OVN Octavia
provider driver project that the load balancing in OVN
> > >> >>> > > works differently [5]. The report shows that even when the test
uses the same source IP and port, but a new TCP connection,
> > >> >>> > > traffic is randomly distributed, although based on [2] it
shouldn't be.
> > >> >>> > >
> > >> >>> > > Is it a bug?  Is something else taken into account while
creating the hash? Can it be fixed in OVS/OVN?
> > >> >>> > >
> > >> >>> > >
> > >> >>> > >
> > >> >>> > > Thanks,
> > >> >>> > > Maciej
> > >> >>> > >
> > >> >>> > >
> > >> >>> > > [1]
https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1017
> > >> >>> > > [2]
https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1059
> > >> >>> > > [3]
https://github.com/openvswitch/ovs/blob/d58b59c17c70137aebdde37d3c01c26a26b28519/NEWS#L364-L371
> > >> >>> > > [4]
https://github.com/openvswitch/ovs/blob/74286173f4d7f51f78e9db09b07a6d4d65263252/lib/flow.c#L2217
> > >> >>> > > [5] https://bugs.launchpad.net/neutron/+bug/1871239
> > >> >>> > >
> > >> >>> > > --
> > >> >>> > > Best regards,
> > >> >>> > > Maciej Józefczyk
> > >> >>> > > _______________________________________________
> > >> >>> > > discuss mailing list
> > >> >>> > > discuss at openvswitch.org
> > >> >>> > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> > >> >>> >
> > >> >>> > Hi Maciej,
> > >> >>> >
> > >> >>> > Thanks for reporting. It is definitely strange that the same
5-tuple flow resulted in hitting different backends. I didn't observe such
behavior before (maybe I should try again myself to confirm). Can you make
sure that the group buckets didn't change during the testing? You can do so by:
> > >> >>> > # ovs-ofctl dump-groups br-int
> > >> >>> > and also check the group stats to see if multiple buckets have
their counters increased during the test:
> > >> >>> > # ovs-ofctl dump-group-stats br-int [group]
> > >> >>> >
> > >> >>> > For the 5-tuple hash function you are seeing,
flow_hash_5tuple(), it uses all five tuple fields. It adds both ports (src
and dst) at once:
> > >> >>> >         /* Add both ports at once. */
> > >> >>> >         hash = hash_add(hash,
> > >> >>> >                         ((const uint32_t *)flow)[offsetof(struct flow, tp_src)
> > >> >>> >                                                  / sizeof(uint32_t)]);
> > >> >>> >
> > >> >>> > tp_src is the start offset, and 32 bits are read, covering both
the src and dst ports, 16 bits each. (Although I am not sure whether the
dp_hash method uses this function or not; I need to check more code.)
> > >> >>> >
> > >> >>> > BTW, I am not sure why Neutron gave it the name
SOURCE_IP_PORT. Should it be called just 5-TUPLE, since the protocol,
destination IP, and port are also considered in the hash?
> > >> >>> >
> > >> >>>
> > >> >>>
> > >> >>> Hi Maciej and Han,
> > >> >>>
> > >> >>> I did some testing and I can confirm what you're saying: OVN is not
> > >> >>> choosing the same backend even with the src IP and src port fixed.
> > >> >>>
> > >> >>> I think there is an issue with how OVN is programming the
group
> > >> >>> flows.  OVN is setting the selection_method to dp_hash.
> > >> >>> But when ovs-vswitchd receives the GROUP_MOD OpenFlow message, I
> > >> >>> noticed that the selection_method is not set.
> > >> >>> From the code I see that selection_method will be encoded only if
> > >> >>> ovn-controller uses openflow version 1.5 [1]
> > >> >>>
> > >> >>> Since selection_method is NULL, vswitchd uses the dp_hash method
[2].
> > >> >>> dp_hash means it uses the hash calculated by
> > >> >>> the datapath. In the case of kernel datapath, from what I
understand
> > >> >>> it uses skb_get_hash().
> > >> >>>
> > >> >>> I modified the vswitchd code to use the selection_method "hash"
if
> > >> >>> selection_method is not set. In this case the load balancer
> > >> >>> works as expected. For a fixed src ip, src port, dst ip and dst
port,
> > >> >>> the group action always selects the same bucket. [3]
> > >> >>>
> > >> >>> I think we need to fix a few issues in OVN:
> > >> >>>   - Use OpenFlow 1.5 so that OVN can set selection_method.
> > >> >>>   - Use the "hash" method if dp_hash does not choose the same
bucket for the same 5-tuple.
> > >> >>>   - Maybe provide an option for the CMS to choose the algorithm,
i.e. whether to use dp_hash or hash.
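
A hedged sketch of what the OpenFlow 1.5 path would look like from the
ovs-ofctl side (bridge name, group id, and output ports are illustrative,
not taken from this setup): with -O OpenFlow15 the selection method can be
carried in the group mod and inspected afterwards.

    ovs-ofctl -O OpenFlow15 add-group br-int \
        'group_id=100,type=select,selection_method=hash,bucket=output:1,bucket=output:2'
    ovs-ofctl -O OpenFlow15 dump-groups br-int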
> > >> >>>
> > >> >> I'd rather not expose this to the CMS, as it depends on the
datapath implementation as per [0]. But maybe it makes sense to eventually
abstract it to the CMS in a more LB-ish way (common algorithm names used in
load balancing), in case the LB feature is at some point enhanced to support
more algorithms.
> > >> >>
> > >> >> I believe that for OVN LB users, using OF 1.5 to force the use of
'hash' would be the best solution now.
> > >> >>
> > >> >> My 2 cents as I'm not an LB expert.
> > >> >>
> > >> >> I also recall that we tested this in the past and it seemed to be
working. I have been checking further in the doc [0] and found this
paragraph:
> > >> >>
> > >> >> "If no selection method is specified, Open vSwitch up to release
2.9 applies the hash method with default fields. From 2.10 onwards Open
vSwitch defaults to the dp_hash method with symmetric L3/L4 hash algorithm,
unless the weighted group buckets cannot be mapped to a maximum of 64
dp_hash values with sufficient accuracy. In those rare cases Open vSwitch
2.10 and later fall back to the hash method with the default set of hash
fields."
> > >> >>
> > >> >> The explanation seems to be that when we tested the feature we
relied on OVS 2.9 and hence the confusion.
> > >> >>
> > >> >> Thanks a lot again!
> > >> >> Daniel
> > >> >>
> > >> >> [0] http://www.openvswitch.org/support/dist-docs/ovs-ofctl.8.html
> > >> >>
> > >> >>>
> > >> >>> I'll look into it on how to support this.
> > >> >>>
> > >> >>> [1] -
https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2120
> > >> >>>
https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2082
> > >> >>>
> > >> >>> [2] -
https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif.c#L5108
> > >> >>> [3] -
https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-xlate.c#L4553
> > >> >>>
> > >> >>>
> > >> >>> Thanks
> > >> >>> Numan
> > >> >>>
> > >> >>>
> > >> >>> > Thanks,
> > >> >>> > Han
> > >> >>> > _______________________________________________
> > >> >>> > discuss mailing list
> > >> >>> > discuss at openvswitch.org
> > >> >>> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> > >> >>> _______________________________________________
> > >> >>> discuss mailing list
> > >> >>> discuss at openvswitch.org
> > >> >>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> > >>
> > >> _______________________________________________
> > >> discuss mailing list
> > >> discuss at openvswitch.org
> > >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Maciej Józefczyk
>
>
>
> --
> Best regards, Tonghao