[ovs-discuss] [OVN] OVN Load balancing algorithm

Maciej Jozefczyk mjozefcz at redhat.com
Tue Apr 28 12:38:18 UTC 2020


Hello!

Thank you for the discussion and those findings.
I'm replying to Han's question:

Numan/Maciej, in your tests did you see more than 2 buckets hit for same
> 5-tuples? If the above theory is right, you should see at most 2 buckets
> hit. For LB, since it uses CT and only the first packet uses the group, all
> packets of the same flow would always be forwarded to same LB backend. I
> guess if you wait long enough between the tests, you should see all tests
> hitting same backend. It would be great if you could confirm this.
>

The scenario tested by the Octavia tempest test [1] is the following:

Load balancer:
stack at mjozefcz-devstack-ovn-lb-master-new-localconf:~$ ovn-nbctl list load_balancer
_uuid               : d63b6a9a-5971-4032-ad15-c14d3cb85e54
external_ids        : {enabled=True,...}
health_check        : []
ip_port_mappings    : {}
name                : "b019b27d-f4e3-4928-b2cb-8d63cbf8b499"
protocol            : tcp
vips                : {"10.1.1.122:60092"="10.2.1.144:80,10.2.2.103:80", "172.24.4.58:60092"="10.2.1.144:80,10.2.2.103:80"}

The load balancer VIP 10.1.1.122 has the Floating IP 172.24.4.58 assigned to it (NAT).
member_1: 10.2.1.144:80 -- this member always responds with "1"
member_2: 10.2.2.103:80 -- this member always responds with "5"

The list of Logical Switch Ports and NAT entries related to this LB:
http://paste.openstack.org/show/792816/
The test verifies the answers from 172.24.4.58:60092 while using the same
source port on each HTTP GET request.


For this scenario the following groups are configured:
===============================================================================================
 sudo ovs-ofctl dump-groups br-int ; sudo ovs-ofctl dump-group-stats br-int


NXST_GROUP_DESC reply (xid=0x2):

 group_id=3,type=select,bucket=bucket_id:0,weight:100,actions=ct(commit,table=15,zone=NXM_NX_REG11[0..15],nat(dst=10.2.1.144:80)),bucket=bucket_id:1,weight:100,actions=ct(commit,table=15,zone=NXM_NX_REG11[0..15],nat(dst=10.2.2.103:80))
 group_id=4,type=select,bucket=bucket_id:0,weight:100,actions=ct(commit,table=19,zone=NXM_NX_REG13[0..15],nat(dst=10.2.1.144:80)),bucket=bucket_id:1,weight:100,actions=ct(commit,table=19,zone=NXM_NX_REG13[0..15],nat(dst=10.2.2.103:80))

NXST_GROUP reply (xid=0x6):
 group_id=3,duration=10448.051s,ref_count=2,packet_count=0,byte_count=0,bucket0:packet_count=0,byte_count=0,bucket1:packet_count=0,byte_count=0

 group_id=4,duration=10448.051s,ref_count=6,packet_count=0,byte_count=0,bucket0:packet_count=0,byte_count=0,bucket1:packet_count=0,byte_count=0


===============================================================================================

I reproduced the issue (random selection of members even though the same
client source IP and port are used) with this simple Python script [2],
which can also be used for other scenarios.
Please note that it uses socket.SO_REUSEADDR [3] to perform HTTP GET
requests from the same source port without waiting for the connection
to be removed from conntrack.
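
For reference, here is a minimal sketch of that approach (illustrative
only; the real script is at [2], the VIP/port are from the test above,
and the fixed source port is a hypothetical value):

    import socket
    import time

    VIP = ("172.24.4.58", 60092)  # LB FIP:port from the test above
    SRC_PORT = 55555              # fixed source port (hypothetical)

    def get_once():
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Allow re-binding the same source port while the previous
        # connection is still in TIME_WAIT.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", SRC_PORT))
        s.connect(VIP)
        s.sendall(b"GET / HTTP/1.0\r\n\r\n")
        data = s.recv(4096)
        s.close()
        return data

    # Every request uses the exact same 5-tuple; only the sleep varies.
    for _ in range(5):
        print(get_once())
        time.sleep(1)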

Here are my findings.

1) Test LB and sleep 1 second between calls:
./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 1

result: http://paste.openstack.org/show/792818/
Different backends are selected and different buckets are hit in
group_id=3: sometimes bucket1 is hit, sometimes bucket0.
Output from groups dumps during the test:
http://paste.openstack.org/show/792820/
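
(To watch the bucket counters move while such a run is in progress,
something like the following should work:

    sudo watch -n1 'ovs-ofctl dump-group-stats br-int 3'
)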


2) Test LB and sleep 60 seconds between calls:
./get-data.py --lb-dest 172.24.4.58:60092 --sleep-time 60

Result: http://paste.openstack.org/show/792822/
Output from group stats: http://paste.openstack.org/show/792823/
Only one bucket (bucket0) is hit and all requests go to the same
backend.

If you need the OVN Northbound DB from this test, let me know.


[1] https://bugs.launchpad.net/neutron/+bug/1871239
[2] http://paste.openstack.org/show/792817/
[3] http://man7.org/linux/man-pages/man7/socket.7.html




On Fri, Apr 24, 2020 at 6:09 PM Ilya Maximets <i.maximets at ovn.org> wrote:

> On 4/24/20 3:19 AM, Han Zhou wrote:
> > Based on the discussion in the OVN meeting today I did some more
> > testing, and here are my findings.
> >
> > - With ICMP (ping) between the same source and destination, the same
> > bucket is always selected by dp_hash.
> > - With "nc" specifying the same TCP 5-tuple, the packets can end up in
> > different buckets. This is similar to what Numan and Maciej observed.
> >
> > However, I was using the OVN ECMP feature to test instead of LB. Since
> > the ECMP feature doesn't use conntrack, here are some more findings: the
> > bucket selection changes only between 2 buckets, and the change happens
> > when the packet datapath changes between the userspace and kernel
> > datapaths. Let's say the first packet of a flow (megaflow) goes to
> > userspace and hits bucket1; if I then send more packets immediately,
> > they will all hit bucket2, but if I wait until the flow disappears from
> > the megaflow cache and then send the next packet, it will hit bucket1
> > again. This behavior is consistent.
> >
> > So I think the different buckets were selected because of the different
> > implementations of dp_hash in the userspace and kernel datapaths (thanks
> > Ilya for this hint in today's meeting).
> > Numan/Maciej, in your tests did you see more than 2 buckets hit for same
> 5-tuples? If the above theory is right, you should see at most 2 buckets
> hit. For LB, since it uses CT and only the first packet uses the group, all
> packets of the same flow would always be forwarded to same LB backend. I
> guess if you wait long enough between the tests, you should see all tests
> hitting same backend. It would be great if you could confirm this.
> >
> > For ECMP, this behavior will cause occasional out-of-order packets even
> > for a single flow (for a burst of packets after some idle time), because
> > CT is not used (and we can't use it, because when peered with physical
> > ECMP router groups we can't ensure that the return traffic from the
> > physical routers hits the same LR).
> >
> > For LB it causes the unexpected behavior that is reported in this thread.
> >
> > For the fix, I think we should figure out how to make sure dp_hash
> > always uses the same hash algorithm in both the userspace and kernel
> > implementations, if possible.
> > I am OK with the patch from Numan for the capability of configuring the
> > desired hash method instead of always using the default. However, using
> > "hash" may be a huge performance sacrifice, since the packets are always
> > handled in the slowpath, especially for ECMP. Even though LB uses CT,
> > for short-lived flows this is still a big performance penalty (for
> > long-lived LB flows it may be OK, since the majority of packets are
> > still in the fastpath).
> >
> > I am not familiar with the dp_hash implementation. I will do some more
> > studying, but any idea on how to ensure the consistency of dp_hash is
> > highly appreciated!
>
> I had an impression that the packet hash was routed from the datapath to
> userspace and back, but it turned out that this is a really recent change.
> It seems the following changes are required:
>
> 1. Linux kernel: bd1903b7c459 ("net: openvswitch: add hash info to upcall")
>    This is available starting from upstream kernel v5.5.
>
> 2. OVS: 0442bfb11d6c ("ofproto-dpif-upcall: Echo HASH attribute back to
> datapath.")
>    This is available on branch-2.13.
>
> With the above two patches, the first and subsequent packets should have
> the same dp_hash calculated by the kernel datapath.
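>
> (A quick way to check whether a deployment has both pieces, assuming a
> stock upstream kernel and OVS build without backports:
>
>     uname -r                 # >= 5.5 for bd1903b7c459
>     ovs-vswitchd --version   # >= 2.13 for 0442bfb11d6c
> )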
>
>
> >
> > Thanks,
> > Han
> >
> > On Tue, Apr 21, 2020 at 1:05 AM Daniel Alvarez Sanchez <dalvarez at redhat.com> wrote:
> >>
> >> Thanks Numan for the investigation and the great explanation!
> >>
> >> On Tue, Apr 21, 2020 at 9:38 AM Numan Siddique <numans at ovn.org> wrote:
> >>>
> >>> On Fri, Apr 17, 2020 at 12:56 PM Han Zhou <zhouhan at gmail.com> wrote:
> >>> >
> >>> >
> >>> >
> >>> > On Tue, Apr 7, 2020 at 7:03 AM Maciej Jozefczyk <mjozefcz at redhat.com> wrote:
> >>> > >
> >>> > > Hello!
> >>> > >
> >>> > > I would like to ask you to clarify how the OVN Load balancing
> algorithm works.
> >>> > >
> >>> > > Based on the action [1]:
> >>> > > 1) If the connection is alive, the same 'backend' will be chosen.
> >>> > >
> >>> > > 2) If it is a new connection, the backend will be chosen based on
> >>> > > selection_method=dp_hash [2].
> >>> > > Based on the changelog, dp_hash uses a '5-tuple hash' [3].
> >>> > > The hash is calculated from the source and destination IPs, the
> >>> > > source port, the protocol, and an arbitrary basis value, 42 [4].
> >>> > > Based on that information we could name it SOURCE_IP_PORT.
> >>> > >
> >>> > > Unfortunately, we recently got a bug report in the OVN Octavia
> >>> > > provider driver project that the load balancing in OVN works
> >>> > > differently [5]. The report shows that even when the test uses the
> >>> > > same source IP and port, but a new TCP connection, the traffic is
> >>> > > randomly distributed; based on [2] it shouldn't be, right?
> >>> > >
> >>> > > Is it a bug? Is something else taken into account while creating
> >>> > > the hash? Can it be fixed in OVS/OVN?
> >>> > >
> >>> > >
> >>> > >
> >>> > > Thanks,
> >>> > > Maciej
> >>> > >
> >>> > >
> >>> > > [1]
> https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1017
> >>> > > [2]
> https://github.com/ovn-org/ovn/blob/branch-20.03/lib/actions.c#L1059
> >>> > > [3]
> https://github.com/openvswitch/ovs/blob/d58b59c17c70137aebdde37d3c01c26a26b28519/NEWS#L364-L371
> >>> > > [4]
> https://github.com/openvswitch/ovs/blob/74286173f4d7f51f78e9db09b07a6d4d65263252/lib/flow.c#L2217
> >>> > > [5] https://bugs.launchpad.net/neutron/+bug/1871239
> >>> > >
> >>> > > --
> >>> > > Best regards,
> >>> > > Maciej Józefczyk
> >>> >
> >>> > Hi Maciej,
> >>> >
> >>> > Thanks for reporting. It is definitely strange that the same 5-tuple
> >>> > flow resulted in hitting different backends. I didn't observe such
> >>> > behavior before (maybe I should try again myself to confirm). Can you
> >>> > make sure the group buckets didn't change during the testing? You can
> >>> > do so by:
> >>> > # ovs-ofctl dump-groups br-int
> >>> > and also check the group stats to see whether multiple buckets had
> >>> > their counters increased during the test:
> >>> > # ovs-ofctl dump-group-stats br-int [group]
> >>> >
> >>> > For the 5-tuple hash function you are looking at, flow_hash_5tuple(),
> >>> > it uses all of the 5-tuple. It adds both ports at once:
> >>> >
> >>> >     /* Add both ports at once. */
> >>> >     hash = hash_add(hash,
> >>> >                     ((const uint32_t *)flow)[offsetof(struct flow, tp_src)
> >>> >                                              / sizeof(uint32_t)]);
> >>> >
> >>> > tp_src is the start of the offset, and the size is 32 bits, meaning
> >>> > both the src and dst ports, each 16 bits. (Although I am not sure
> >>> > whether the dp_hash method uses this function or not; I need to check
> >>> > more code.)
> >>> >
> >>> > BTW, I am not sure why Neutron gave it the name SOURCE_IP_PORT.
> >>> > Shouldn't it be called just 5-TUPLE, since the protocol, destination
> >>> > IP and port are also considered in the hash?
> >>> >
> >>>
> >>>
> >>> Hi Maciej and Han,
> >>>
> >>> I did some testing and I can confirm what you're saying: OVN is not
> >>> choosing the same backend even with the src IP and src port fixed.
> >>>
> >>> I think there is an issue with OVN in how it programs the group
> >>> flows. OVN sets selection_method to dp_hash, but when ovs-vswitchd
> >>> receives the GROUP_MOD OpenFlow message, I noticed that
> >>> selection_method is not set.
> >>> From the code I see that selection_method will be encoded only if
> >>> ovn-controller uses OpenFlow version 1.5 [1].
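> >>>
> >>> (The OpenFlow versions a bridge accepts can be checked with, e.g.,
> >>>  ovs-vsctl get bridge br-int protocols.)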
> >>>
> >>> Since selection_method is NULL, vswitchd uses the dp_hash method [2].
> >>> dp_hash means it uses the hash calculated by the datapath; in the
> >>> case of the kernel datapath, from what I understand, it uses
> >>> skb_get_hash().
> >>>
> >>> I modified the vswitchd code to use the selection_method "hash" if
> >>> selection_method is not set. In this case the load balancer works as
> >>> expected: for a fixed src IP, src port, dst IP and dst port, the
> >>> group action always selects the same bucket. [3]
> >>>
> >>> I think we need to fix a few issues in OVN (see the sketch below):
> >>>   - Use OpenFlow 1.5 so that OVN can set selection_method.
> >>>   - Use the "hash" method if dp_hash does not choose the same bucket
> >>>     for the same 5-tuple.
> >>>   - Maybe provide an option for the CMS to choose an algorithm, i.e.
> >>>     dp_hash or hash.
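> >>>
> >>> (A rough sketch of what forcing "hash" would look like at the OpenFlow
> >>> level, reusing group_id=3 from Maciej's dump above; the fields() set is
> >>> illustrative, and the group spec is a single argument, wrapped here
> >>> only for readability:
> >>>
> >>>  ovs-ofctl -O OpenFlow15 mod-group br-int
> >>>    'group_id=3,type=select,selection_method=hash,
> >>>     fields(ip_src,ip_dst,tcp_src,tcp_dst),
> >>>     bucket=bucket_id:0,weight:100,actions=ct(commit,table=15,
> >>>       zone=NXM_NX_REG11[0..15],nat(dst=10.2.1.144:80)),
> >>>     bucket=bucket_id:1,weight:100,actions=ct(commit,table=15,
> >>>       zone=NXM_NX_REG11[0..15],nat(dst=10.2.2.103:80))'
> >>> )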
> >>>
> >> I'd rather not expose this to the CMS, as it depends on the datapath
> >> implementation as per [0], but maybe it makes sense to eventually
> >> abstract it to the CMS in a more LB-ish way (common algorithm names used
> >> in load balancing), in case at some point the LB feature is enhanced to
> >> support more algorithms.
> >>
> >> I believe that for OVN LB users, using OF 1.5 to force the use of
> 'hash' would be the best solution now.
> >>
> >> My 2 cents as I'm not an LB expert.
> >>
> >> I also recall that we tested this in the past and it seemed to be
> >> working. Checking further in the doc [0], I found this paragraph:
> >>
> >> "If no selection method is specified, Open vSwitch up to release 2.9
> applies the hash method with default fields. From 2.10 onwards Open vSwitch
> defaults to the dp_hash method with symmetric L3/L4 hash algorithm, unless
> the weighted group buck‐ ets cannot be mapped to a maximum of 64 dp_hash
> values with sufficient accuracy. In those rare cases Open vSwitch 2.10 and
> later fall back to the hash method with the default set of hash fields."
> >>
> >> The explanation seems to be that when we tested the feature we relied
> >> on OVS 2.9, hence the confusion.
> >>
> >> Thanks a lot again!
> >> Daniel
> >>
> >> [0] http://www.openvswitch.org/support/dist-docs/ovs-ofctl.8.html
> >>
> >>>
> >>> I'll look into how to support this.
> >>>
> >>> [1] -
> https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2120
> >>>
> https://github.com/openvswitch/ovs/blob/master/lib/ofp-group.c#L2082
> >>>
> >>> [2] -
> https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif.c#L5108
> >>> [3] -
> https://github.com/openvswitch/ovs/blob/master/ofproto/ofproto-dpif-xlate.c#L4553
> >>>
> >>>
> >>> Thanks
> >>> Numan
> >>>
> >>>
> >>> > Thanks,
> >>> > Han
>
>


-- 
Best regards,
Maciej Józefczyk

