[ovs-discuss] [OVN] OVN Load balancing algorithm

Tonghao Zhang xiangxia.m.yue at gmail.com
Thu Apr 30 11:11:38 UTC 2020


On Thu, Apr 30, 2020 at 2:58 PM Han Zhou <zhouhan at gmail.com> wrote:
>
> Thanks Maciej for testing.
>
> On Tue, Apr 28, 2020 at 5:36 AM Maciej Jozefczyk <mjozefcz at redhat.com> wrote:
> > Here are my findings.
> >
> > 1) Test LB and sleep 1 second between calls:
> > ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 1
> >
> > result: http://paste.openstack.org/show/792818/
> > Different backends are selected and different buckets are being hit in group_id=3. Sometimes bucket1 is hit, sometimes bucket0.
> > Output from group dumps during the test: http://paste.openstack.org/show/792820/
> >
> >
> > 2) Test LB and sleep 60 seconds between calls:
> > ./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 60
> >
> > Result: http://paste.openstack.org/show/792822/
> > Output from group stats: http://paste.openstack.org/show/792823/
> > Always the same bucket is hit (bucket0) and requests go to the same backend.
> >
>
> This test result proved our earlier analysis: different hash values generated by the kernel fastpath and the userspace slowpath were the culprit.
>
> >
> >
> > On Fri, Apr 24, 2020 at 6:09 PM Ilya Maximets <i.maximets at ovn.org> wrote:
> >>
> >> On 4/24/20 3:19 AM, Han Zhou wrote:
> >> > Based on the discussion in the OVN meeting today I did some more testing, and here are my findings.
> >> >
> >> > - With ICMP (ping) between the same source and destination, dp_hash always selects the same bucket.
> >> > - With "nc" specifying the same TCP 5-tuple, packets can end up in different buckets. This is similar to what Numan and Maciej observed.
> >> >
> >> > However, I was using the OVN ECMP feature to test instead of LB. Since the ECMP feature doesn't use conntrack, here are some more findings. The bucket selection changes only between 2 buckets, and the change happens when the packet's processing switches between the userspace and kernel datapaths. Say the first packet of a flow (megaflow) goes to userspace and hits bucket1; if I then immediately send more packets, they all hit bucket2, but if I wait until the flow disappears from the megaflow cache before sending the next packet, it hits bucket1 again. This behavior is consistent.
> >> >
> >> > So I think the different buckets were selected because of the different implementations of dp_hash in the userspace and kernel datapaths (thanks Ilya for this hint in today's meeting).
> >> > Numan/Maciej, in your tests did you see more than 2 buckets hit for the same 5-tuple? If the above theory is right, you should see at most 2 buckets hit. For LB, since it uses CT and only the first packet uses the group, all packets of the same connection are always forwarded to the same LB backend. I guess if you wait long enough between tests, you should see all tests hitting the same backend. It would be great if you could confirm this.
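> >> >
> >> > To confirm it, something along these lines could be used: a minimal C
> >> > client (a rough, untested sketch of mine; error handling omitted; the
> >> > 172.24.4.58:60092 destination is taken from Maciej's test and the
> >> > 50000 source port is arbitrary) that reopens TCP connections with the
> >> > exact same 5-tuple, waiting long enough between attempts for the
> >> > megaflow cache entry to expire:
> >> >
> >> > #include <arpa/inet.h>
> >> > #include <netinet/in.h>
> >> > #include <stdio.h>
> >> > #include <sys/socket.h>
> >> > #include <unistd.h>
> >> >
> >> > int main(void)
> >> > {
> >> >     for (int i = 0; i < 5; i++) {
> >> >         int s = socket(AF_INET, SOCK_STREAM, 0);
> >> >         int one = 1;
> >> >
> >> >         /* Allow rebinding the fixed source port between runs. */
> >> >         setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);
> >> >
> >> >         struct sockaddr_in src = { .sin_family = AF_INET,
> >> >                                    .sin_port = htons(50000) };
> >> >         bind(s, (struct sockaddr *) &src, sizeof src);
> >> >
> >> >         struct sockaddr_in dst = { .sin_family = AF_INET,
> >> >                                    .sin_port = htons(60092) };
> >> >         inet_pton(AF_INET, "172.24.4.58", &dst.sin_addr);
> >> >         if (connect(s, (struct sockaddr *) &dst, sizeof dst) == 0) {
> >> >             printf("connection %d established\n", i);
> >> >         }
> >> >         close(s);
> >> >
> >> >         /* Wait out both TIME_WAIT and the megaflow cache expiry, so
> >> >          * the next first packet goes through the slowpath again. */
> >> >         sleep(61);
> >> >     }
> >> >     return 0;
> >> > }
> >> >
> >> > Then watching the group stats between connections would show which
> >> > bucket counters move.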
> >> >
> >> > For ECMP, this behavior will cause occasional packet reordering even within a single flow (for a burst of packets after some idle time), because CT is not used (and we can't use it, because when peered with physical ECMP router groups we can't ensure that the return traffic from the physical routers hits the same LR).
> >> >
> >> > For LB it causes the unexpected behavior that is reported in this thread.
> >> >
> >> > For the fix, I think we should figure out how to make dp_hash always use the same hash algorithm in both the userspace and kernel implementations, if possible.
> >> > I am OK with the patch from Numan that adds the capability of configuring the desired hash method instead of always using the default. However, using "hash" may be a huge performance sacrifice, since the packets are always handled in the slowpath, especially for ECMP. Even though LB uses CT, for short-lived flows this is still a big performance penalty (for long-lived LB flows it may be OK, since the majority of packets stay in the fastpath).
> >> >
> >> > I am not familiar with the dp_hash implementation. I will do some more study, but any ideas on how to ensure the consistency of dp_hash are highly appreciated!
> >>
> >> I had the impression that the packet hash is passed from the datapath to userspace
> >> and back, but it turns out that this is a really recent change.  It seems the following
> >> changes are required:
> >>
> >> 1. Linux kernel: bd1903b7c459 ("net: openvswitch: add hash info to upcall")
> >>    This is available starting from upstream kernel v5.5.
> >>
> >> 2. OVS: 0442bfb11d6c ("ofproto-dpif-upcall: Echo HASH attribute back to datapath.")
> >>    This is available on branch-2.13.
> >>
> >> With the above two patches, the first and subsequent packets should have the same dp_hash
> >> calculated by the kernel datapath.
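> >>
> >> Conceptually (a simplified sketch of my own, not the actual code of
> >> those patches), the idea is just to carry the datapath's hash across
> >> the userspace round trip instead of recomputing it:
> >>
> >> #include <stdbool.h>
> >> #include <stdint.h>
> >>
> >> struct upcall_ctx {
> >>     uint32_t skb_hash;  /* hash the kernel reported in the upcall (1) */
> >>     bool has_hash;
> >> };
> >>
> >> /* When re-injecting the packet, echo the kernel's hash back (2)
> >>  * rather than a hash recomputed in userspace, so slowpath and
> >>  * fastpath packets of one flow select the same dp_hash bucket. */
> >> uint32_t hash_for_execute(const struct upcall_ctx *ctx,
> >>                           uint32_t userspace_hash)
> >> {
> >>     return ctx->has_hash ? ctx->skb_hash : userspace_hash;
> >> }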
> >>
>
> Thanks Ilya for this information! Do you know why the kernel fix (1) was not backported to the OVS repo? Was it missed, or was it on purpose? cc Tonghao Zhang.
> I did a test using ovn-fake-multinode with OVS master, and the problem disappeared. However, I am using an old kernel module, and I don't think patch (1) is there.
Most of my patches were applied in the upstream Linux kernel, and I don't
know who backports them, but I am fine with backporting this one after the holiday.
Thanks.
> >>
> >> >
> >> > Thanks,
> >> > Han
> >> >
> >> > On Tue, Apr 21, 2020 at 1:05 AM Daniel Alvarez Sanchez <dalvarez at redhat.com> wrote:
> >> >>
> >> >> Thanks Numan for the investigation and the great explanation!
> >> >>
> >> >> On Tue, Apr 21, 2020 at 9:38 AM Numan Siddique <numans at ovn.org> wrote:
> >> >>>
> >> >>> On Fri, Apr 17, 2020 at 12:56 PM Han Zhou <zhouhan at gmail.com> wrote:
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > On Tue, Apr 7, 2020 at 7:03 AM Maciej Jozefczyk <mjozefcz at redhat.com> wrote:
> >> >>> > >
> >> >>> > > Hello!
> >> >>> > >
> >> >>> > > I would like to ask you to clarify how the OVN load balancing algorithm works.
> >> >>> > >
> >> >>> > > Based on the action [1]:
> >> >>> > > 1) If a connection is alive, the same 'backend' will be chosen,
> >> >>> > >
> >> >>> > > 2) If it is a new connection, the backend will be chosen based on selection_method=dp_hash [2].
> >> >>> > > Based on the changelog, dp_hash uses a '5-tuple hash' [3].
> >> >>> > > The hash is calculated from these values: source and destination IP, source port, protocol, and an arbitrary value, 42. [4]
> >> >>> > > Based on that information we could name it SOURCE_IP_PORT.
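> >> >>> > >
> >> >>> > > For context, the load-balancing action from [1] looks roughly like
> >> >>> > > this in the logical flows (backend addresses invented for
> >> >>> > > illustration):
> >> >>> > >
> >> >>> > > ct_lb(172.16.0.10:8080,172.16.0.11:8080);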
> >> >>> > >
> >> >>> > > Unfortunately, we recently got a bug report in the OVN Octavia provider driver project that load balancing in OVN
> >> >>> > > works differently [5]. The report shows that even when the test uses the same source IP and port, but a new TCP connection,
> >> >>> > > traffic is randomly distributed, although based on [2] it shouldn't be.
> >> >>> > >
> >> >>> > > Is it a bug? Is something else taken into account while creating the hash? Can it be fixed in OVS/OVN?
> >> >>> > >
> >> >>> > >
> >> >>> > >
> >> >>> > > Thanks,
> >> >>> > > Maciej
> >> >>> > >
> >> >>> > >
> >> >>> > > [1] https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1017
> >> >>> > > [2] https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1059
> >> >>> > > [3] https://github.com/openvswitch/ovs/blob/d58b59c17c70137aebdde37d3c01c26a26b28519/NEWS#L364-L371
> >> >>> > > [4] https://github.com/openvswitch/ovs/blob/74286173f4d7f51f78e9db09b07a6d4d65263252/lib/flow.c#L2217
> >> >>> > > [5] https://bugs.launchpad.net/neutron/+bug/1871239
> >> >>> > >
> >> >>> > > --
> >> >>> > > Best regards,
> >> >>> > > Maciej Józefczyk
> >> >>> > > _______________________________________________
> >> >>> > > discuss mailing list
> >> >>> > > discuss at openvswitch.org
> >> >>> > > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> >> >>> >
> >> >>> > Hi Maciej,
> >> >>> >
> >> >>> > Thanks for reporting. It is definitely strange that the same 5-tuple flow resulted in hitting different backends. I didn't observe such behavior before (maybe I should try again myself to confirm). Can you make sure that the group buckets didn't change during the testing? You can do so by:
> >> >>> > # ovs-ofctl dump-groups br-int
> >> >>> > and also check the group stats to see if multiple buckets had their counters increased during the test:
> >> >>> > # ovs-ofctl dump-group-stats br-int [group]
> >> >>> >
> >> >>> > As for the 5-tuple hash function you are looking at, flow_hash_5tuple(): it uses all of the 5 tuples. It adds both ports at once:
> >> >>> >         /* Add both ports at once. */
> >> >>> >         hash = hash_add(hash,
> >> >>> >                         ((const uint32_t *)flow)[offsetof(struct flow, tp_src)
> >> >>> >                                                  / sizeof(uint32_t)]);
> >> >>> >
> >> >>> > Here tp_src marks the starting offset, and the read is 32 bits wide, so it covers both src and dst ports, each 16 bits. (Although I am not sure whether the dp_hash method uses this function or not; I need to check the code more.)
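> >> >>> >
> >> >>> > To illustrate the trick outside of OVS (a standalone toy of mine,
> >> >>> > with a made-up struct standing in for struct flow), reading 32 bits
> >> >>> > at the offset of tp_src picks up both adjacent 16-bit ports in one
> >> >>> > go:
> >> >>> >
> >> >>> > #include <stddef.h>
> >> >>> > #include <stdint.h>
> >> >>> > #include <stdio.h>
> >> >>> >
> >> >>> > struct mini_flow {      /* stand-in: tp_src and tp_dst are adjacent */
> >> >>> >     uint16_t tp_src;
> >> >>> >     uint16_t tp_dst;
> >> >>> > };
> >> >>> >
> >> >>> > int main(void)
> >> >>> > {
> >> >>> >     struct mini_flow f = { .tp_src = 0x1234, .tp_dst = 0x5678 };
> >> >>> >     uint32_t both = ((const uint32_t *) &f)
> >> >>> >         [offsetof(struct mini_flow, tp_src) / sizeof(uint32_t)];
> >> >>> >
> >> >>> >     /* Prints one word containing both ports (host byte order). */
> >> >>> >     printf("0x%08x\n", (unsigned) both);
> >> >>> >     return 0;
> >> >>> > }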
> >> >>> >
> >> >>> > BTW, I am not sure why Neutron gives it the name SOURCE_IP_PORT. Shouldn't it be called just 5-TUPLE, since the protocol, destination IP, and destination port are also considered in the hash?
> >> >>> >
> >> >>>
> >> >>>
> >> >>> Hi Maciej and Han,
> >> >>>
> >> >>> I did some testing and I can confirm what you're saying: OVN is not
> >> >>> choosing the same backend even with the src IP and src port fixed.
> >> >>>
> >> >>> I think there is an issue with how OVN programs the group
> >> >>> flows.  OVN sets the selection_method to dp_hash,
> >> >>> but when ovs-vswitchd receives the GROUP_MOD OpenFlow message, I
> >> >>> noticed that the selection_method is not set.
> >> >>> From the code I see that selection_method will be encoded only if
> >> >>> ovn-controller uses OpenFlow version 1.5 [1].
> >> >>>
> >> >>> Since selection_method is NULL, vswitchd uses the dp_hash method [2].
> >> >>> dp_hash means it uses the hash calculated by
> >> >>> the datapath. In the case of the kernel datapath, from what I
> >> >>> understand, it uses skb_get_hash().
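> >> >>>
> >> >>> In other words (an illustrative sketch of my own, not OVS code), the
> >> >>> essential difference between the two methods is where the hash comes
> >> >>> from:
> >> >>>
> >> >>> #include <stdbool.h>
> >> >>> #include <stdint.h>
> >> >>>
> >> >>> uint32_t select_bucket(bool use_dp_hash, uint32_t datapath_hash,
> >> >>>                        uint32_t userspace_5tuple_hash,
> >> >>>                        uint32_t n_buckets)
> >> >>> {
> >> >>>     /* dp_hash: trust whatever hash arrived with the packet (in the
> >> >>>      * kernel, skb_get_hash()).  If kernel fastpath and userspace
> >> >>>      * slowpath hash the same flow differently, the bucket differs
> >> >>>      * too.  "hash": recompute deterministically in userspace from
> >> >>>      * the selected fields, so the same 5-tuple always maps to the
> >> >>>      * same bucket. */
> >> >>>     uint32_t h = use_dp_hash ? datapath_hash : userspace_5tuple_hash;
> >> >>>     return h % n_buckets;  /* real OVS maps hashes to weighted buckets */
> >> >>> }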
> >> >>>
> >> >>> I modified the vswitchd code to use the selection_method "hash" if
> >> >>> selection_method is not set. In this case the load balancer
> >> >>> works as expected: for a fixed src IP, src port, dst IP and dst port,
> >> >>> the group action always selects the same bucket. [3]
> >> >>>
> >> >>> I think we need to fix a few issues in OVN:
> >> >>>   - Use OpenFlow 1.5 so that OVN can set selection_method (an example
> >> >>> of doing this by hand is sketched after this list).
> >> >>>   - Use the "hash" method if dp_hash does not choose the same bucket
> >> >>> for the same 5-tuple.
> >> >>>   - Maybe provide an option for the CMS to choose an algorithm, i.e.
> >> >>> to use dp_hash or hash.
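> >> >>>
> >> >>> For reference, the first item can be tried by hand with something
> >> >>> like the following (an illustrative group, not what ovn-controller
> >> >>> installs today; selection_method needs OpenFlow 1.5):
> >> >>>
> >> >>> # ovs-ofctl -O OpenFlow15 add-group br-int \
> >> >>>     "group_id=1,type=select,selection_method=hash,bucket=output:1,bucket=output:2"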
> >> >>>
> >> >> I'd rather not expose this to the CMS, as it depends on the datapath implementation as per [0]. But maybe it makes sense to eventually abstract it to the CMS in a more LB-ish way (common algorithm names used in load balancing), in case the LB feature is at some point enhanced to support more algorithms.
> >> >>
> >> >> I believe that for OVN LB users, using OF 1.5 to force the use of 'hash' would be the best solution for now.
> >> >>
> >> >> My 2 cents as I'm not an LB expert.
> >> >>
> >> >> I also recall that we tested this in the past and it seemed to be working. I have been checking further in the doc [0] and found this paragraph:
> >> >>
> >> >> "If no selection method is specified, Open vSwitch up to release 2.9 applies the hash method with default fields. From 2.10 onwards Open vSwitch defaults to the dp_hash method with symmetric L3/L4 hash algorithm, unless the weighted group buck‐ ets cannot be mapped to a maximum of 64 dp_hash values with sufficient accuracy. In those rare cases Open vSwitch 2.10 and later fall back to the hash method with the default set of hash fields."
> >> >>
> >> >> The explanation seems to be that when we tested the feature we relied on OVS 2.9, hence the confusion.
> >> >>
> >> >> Thanks a lot again!
> >> >> Daniel
> >> >>
> >> >> [0] http://www.openvswitch.org/support/dist-docs/ovs-ofctl.8.html
> >> >>
> >> >>>
> >> >>> I'll look into how to support this.
> >> >>>
> >> >>> [1] - https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2120
> >> >>>        https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2082
> >> >>>
> >> >>> [2] - https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif.c#L5108
> >> >>> [3] - https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-xlate.c#L4553
> >> >>>
> >> >>>
> >> >>> Thanks
> >> >>> Numan
> >> >>>
> >> >>>
> >> >>> > Thanks,
> >> >>> > Han
> >> >>> > _______________________________________________
> >> >>> > discuss mailing list
> >> >>> > discuss at openvswitch.org
> >> >>> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> >> >>> _______________________________________________
> >> >>> discuss mailing list
> >> >>> discuss at openvswitch.org
> >> >>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> >>
> >> _______________________________________________
> >> discuss mailing list
> >> discuss at openvswitch.org
> >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> >
> >
> >
> > --
> > Best regards,
> > Maciej Józefczyk



-- 
Best regards, Tonghao

