[ovs-dev] 答复: [PATCH v2] userspace: fix bad UDP performance issue of veth

Aaron Conole aconole at redhat.com
Thu Sep 17 14:34:10 UTC 2020


"Yi Yang (杨燚)-云服务集团" <yangyi01 at inspur.com> writes:

> Aaron, thank you so much for comments, I'll update it to fix your comment in v3, replies for comments inline, please check them.

Thanks.

I have one more comment to consider.  SO_SNDBUF / SO_RCVBUF are
available on many OSes - does it make sense to make a similar change to
the BSD code as well since that is also a userspace datapath component?

> -----邮件原件-----
> 发件人: dev [mailto:ovs-dev-bounces at openvswitch.org] 代表 Aaron Conole
> 发送时间: 2020年9月17日 1:17
> 收件人: yang_y_yi at 163.com
> 抄送: ovs-dev at openvswitch.org; i.maximets at ovn.org; fbl at sysclose.org
> 主题: Re: [ovs-dev] [PATCH v2] userspace: fix bad UDP performance issue of veth
>
> yang_y_yi at 163.com writes:
>
>> From: Yi Yang <yangyi01 at inspur.com>
>>
>> iperf3 UDP performance of veth to veth case is very very bad because 
>> of too many packet loss, the root cause is rmem_default and 
>> wmem_default are just 212992, but iperf3 UDP test used 8K UDP size 
>> which resulted in many UDP fragment in case that MTU size is 1500, one 
>> 8K UDP send would enqueue 6 UDP fragments to socket receive queue, the 
>> default small socket buffer size can't cache so many packets that many 
>> packets are lost.
>>
>> This commit fixed packet loss issue, it allows users to set socket 
>> receive and send buffer size per their own system environment to 
>> proper value, therefore there will not be packet loss.
>>
>> Users can set system interface socket buffer size by command lines:
>>
>>   $ sudo sh -c "1073741823 > /proc/sys/net/core/wmem_max"
>>   $ sudo sh -c "1073741823 > /proc/sys/net/core/rmem_max"
>>
>> or
>>
>>   $ sudo ovs-vsctl set Open_vSwitch . \
>>         other_config:userspace-sock-buf-size=1073741823
>>
>> But final socket buffer size is minimum one among of them.
>> Possible value range is 212992 to 1073741823. Current default value 
>> for other_config:userspace-sock-buf-size is 212992, users need to 
>> increase it to improve UDP performance, the changed value will take 
>> effect after restarting ovs-vswitchd. More details about it is in the 
>> document Documentation/howto/userspace-udp-performance-tunning.rst.
>>
>> By the way, big socket buffer doesn't mean it will allocate big buffer 
>> on creating socket, actually it won't alocate any extra buffer 
>> compared to default socket buffer size, it just means more skbuffs can 
>> be enqueued to socket receive queue and send queue, therefore there 
>> will not be packet loss.
>>
>> The below is for your reference.
>>
>> The result before apply this commit
>> ===================================
>> $ ip netns exec ns02 iperf3 -t 5 -i 1 -u -b 100M -c 10.15.2.6 
>> --get-server-output -A 5 Connecting to host 10.15.2.6, port 5201 [  4] 
>> local 10.15.2.2 port 59053 connected to 10.15.2.6 port 5201
>> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
>> [  4]   0.00-1.00   sec  10.8 MBytes  90.3 Mbits/sec  1378
>> [  4]   1.00-2.00   sec  11.9 MBytes   100 Mbits/sec  1526
>> [  4]   2.00-3.00   sec  11.9 MBytes   100 Mbits/sec  1526
>> [  4]   3.00-4.00   sec  11.9 MBytes   100 Mbits/sec  1526
>> [  4]   4.00-5.00   sec  11.9 MBytes   100 Mbits/sec  1526
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>> [  4]   0.00-5.00   sec  58.5 MBytes  98.1 Mbits/sec  0.047 ms  357/531 (67%)
>> [  4] Sent 531 datagrams
>>
>> Server output:
>> -----------------------------------------------------------
>> Accepted connection from 10.15.2.2, port 60314 [  5] local 10.15.2.6 
>> port 5201 connected to 10.15.2.2 port 59053
>> [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>> [  5]   0.00-1.00   sec  1.36 MBytes  11.4 Mbits/sec  0.047 ms  357/531 (67%)
>> [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec  0.047 ms  0/0 (-nan%)
>> [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec  0.047 ms  0/0 (-nan%)
>> [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec  0.047 ms  0/0 (-nan%)
>> [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec  0.047 ms  0/0 (-nan%)
>>
>> iperf Done.
>>
>> The result after apply this commit
>> ===================================
>> $ sudo ip netns exec ns02 iperf3 -t 5 -i 1 -u -b 4G -c 10.15.2.6 
>> --get-server-output -A 5 Connecting to host 10.15.2.6, port 5201 [  4] 
>> local 10.15.2.2 port 48547 connected to 10.15.2.6 port 5201
>> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
>> [  4]   0.00-1.00   sec   440 MBytes  3.69 Gbits/sec  56276
>> [  4]   1.00-2.00   sec   481 MBytes  4.04 Gbits/sec  61579
>> [  4]   2.00-3.00   sec   474 MBytes  3.98 Gbits/sec  60678
>> [  4]   3.00-4.00   sec   480 MBytes  4.03 Gbits/sec  61452
>> [  4]   4.00-5.00   sec   480 MBytes  4.03 Gbits/sec  61441
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>> [  4]   0.00-5.00   sec  2.30 GBytes  3.95 Gbits/sec  0.024 ms  0/301426 (0%)
>> [  4] Sent 301426 datagrams
>>
>> Server output:
>> -----------------------------------------------------------
>> Accepted connection from 10.15.2.2, port 60320 [  5] local 10.15.2.6 
>> port 5201 connected to 10.15.2.2 port 48547
>> [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>> [  5]   0.00-1.00   sec   209 MBytes  1.75 Gbits/sec  0.021 ms  0/26704 (0%)
>> [  5]   1.00-2.00   sec   258 MBytes  2.16 Gbits/sec  0.025 ms  0/32967 (0%)
>> [  5]   2.00-3.00   sec   258 MBytes  2.16 Gbits/sec  0.022 ms  0/32987 (0%)
>> [  5]   3.00-4.00   sec   257 MBytes  2.16 Gbits/sec  0.023 ms  0/32954 (0%)
>> [  5]   4.00-5.00   sec   257 MBytes  2.16 Gbits/sec  0.021 ms  0/32937 (0%)
>> [  5]   5.00-6.00   sec   255 MBytes  2.14 Gbits/sec  0.026 ms  0/32685 (0%)
>> [  5]   6.00-7.00   sec   254 MBytes  2.13 Gbits/sec  0.025 ms  0/32453 (0%)
>> [  5]   7.00-8.00   sec   255 MBytes  2.14 Gbits/sec  0.026 ms  0/32679 (0%)
>> [  5]   8.00-9.00   sec   255 MBytes  2.14 Gbits/sec  0.022 ms  0/32669 (0%)
>>
>> iperf Done.
>>
>> Signed-off-by: Yi Yang <yangyi01 at inspur.com>
>> ---
>>
>> Changelog
>> ---------
>>   v2 -> v1: Add howto document
>>             Add other_config:userspace-sock-buf-size
>>
>> ---
>>  Documentation/automake.mk                          |   1 +
>>  Documentation/howto/index.rst                      |   1 +
>>  .../howto/userspace-udp-performance-tunning.rst    | 220 +++++++++++++++++++++
>>  lib/automake.mk                                    |   2 +
>>  lib/netdev-linux.c                                 |  55 ++++++
>>  lib/userspace-sock-buf-size.c                      |  75 +++++++
>>  lib/userspace-sock-buf-size.h                      |  23 +++
>>  vswitchd/bridge.c                                  |   2 +
>>  8 files changed, 379 insertions(+)
>>  create mode 100644 
>> Documentation/howto/userspace-udp-performance-tunning.rst
>>  create mode 100644 lib/userspace-sock-buf-size.c  create mode 100644 
>> lib/userspace-sock-buf-size.h
>>
>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk 
>> index f85c432..4431097 100644
>> --- a/Documentation/automake.mk
>> +++ b/Documentation/automake.mk
>> @@ -71,6 +71,7 @@ DOC_SOURCE = \
>>  	Documentation/howto/sflow.rst \
>>  	Documentation/howto/tunneling.png \
>>  	Documentation/howto/tunneling.rst \
>> +	Documentation/howto/userspace-udp-performance-tunning.rst \
>>  	Documentation/howto/userspace-tunneling.rst \
>>  	Documentation/howto/vlan.png \
>>  	Documentation/howto/vlan.rst \
>> diff --git a/Documentation/howto/index.rst 
>> b/Documentation/howto/index.rst index 60fb8a7..d5271f0 100644
>> --- a/Documentation/howto/index.rst
>> +++ b/Documentation/howto/index.rst
>> @@ -44,6 +44,7 @@ OVS
>>     lisp
>>     tunneling
>>     userspace-tunneling
>> +   userspace-udp-performance-tunning
>>     vlan
>>     qos
>>     vtep
>> diff --git a/Documentation/howto/userspace-udp-performance-tunning.rst 
>> b/Documentation/howto/userspace-udp-performance-tunning.rst
>> new file mode 100644
>> index 0000000..c5bd7f9
>> --- /dev/null
>> +++ b/Documentation/howto/userspace-udp-performance-tunning.rst
>> @@ -0,0 +1,220 @@
>> +..
>> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
>> +      not use this file except in compliance with the License. You may obtain
>> +      a copy of the License at
>> +
>> +          http://www.apache.org/licenses/LICENSE-2.0
>> +
>> +      Unless required by applicable law or agreed to in writing, software
>> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
>> +      License for the specific language governing permissions and limitations
>> +      under the License.
>> +
>> +      Convention for heading levels in Open vSwitch documentation:
>> +
>> +      =======  Heading 0 (reserved for the title in a document)
>> +      -------  Heading 1
>> +      ~~~~~~~  Heading 2
>> +      +++++++  Heading 3
>> +      '''''''  Heading 4
>> +
>> +      Avoid deeper levels because they do not render well.
>> +
>> +=================================
>> +Userspace UDP performance tunning
>> +=================================
>> +
>> +This document describes how to tune UDP performance for Open vSwitch 
>> +userspace. In Open vSwitch userspace case, if you run iperf3 to test 
>> +UDP performance, you will see bigger packet loss rate, sometimes, you 
>> +also will see iperf3 outputs some information as below.
>> +
>> +[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec  0.018 ms  0/0 (-nan%)
>> +[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec  0.018 ms  0/0 (-nan%)
>> +[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec  0.018 ms  0/0 (-nan%)
>> +[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec  0.018 ms  0/0 (-nan%)
>> +[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec  0.018 ms  0/0 (-nan%)
>> +[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec  0.018 ms  0/0 (-nan%)
>> +[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec  0.018 ms  0/0 (-nan%)
>> +[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec  0.018 ms  0/0 (-nan%)
>> +[  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec  0.018 ms  0/0 (-nan%)
>> +
>> +or
>> +
>> +iperf3: OUT OF ORDER - incoming packet = 70 and received packet = 97 
>> +AND SP = 5
>> +iperf3: OUT OF ORDER - incoming packet = 71 and received packet = 97 
>> +AND SP = 5
>> +iperf3: OUT OF ORDER - incoming packet = 72 and received packet = 99 
>> +AND SP = 5
>> +iperf3: OUT OF ORDER - incoming packet = 14 and received packet = 123 
>> +AND SP = 5
>> +iperf3: OUT OF ORDER - incoming packet = 15 and received packet = 125 
>> +AND SP = 5
>> +iperf3: OUT OF ORDER - incoming packet = 78 and received packet = 137 
>> +AND SP = 5
>> +iperf3: OUT OF ORDER - incoming packet = 79 and received packet = 137 
>> +AND SP = 5
>> +iperf3: OUT OF ORDER - incoming packet = 80 and received packet = 139 
>> +AND SP = 5
>> +iperf3: OUT OF ORDER - incoming packet = 82 and received packet = 172 
>> +AND SP = 5
>> +iperf3: OUT OF ORDER - incoming packet = 83 and received packet = 173 
>> +AND SP = 5
>> +
>> +There are many reasons resulting in such issues, for example, you 
>> +don't use -b to limit bandwidth, big packet(UDP packet data size is 
>> +8192 by default if you don't use -l to specify UDP payload size) 
>> +means many IP fragments if your MTU is 1500/1450, any one of them is 
>> +lost, that means the whole UDP packet is lost because TCP/IP protocol 
>> +stack can't reassemble original UDP packet, so big packet isn't 
>> +always good for performance. But among of them, the most important
> reason is socket buffer size of UDP send side and receive side.
>> +
>> +Here is iperf3 output if system interface added to OVS use default 
>> +buffer size (which is 212992 by default).
>> +
>> +$ sudo ip netns exec ns03 iperf3 -t 10 -i 1 -u -b 10G -c 10.15.2.3 
>> +--get-server-output Connecting to host 10.15.2.3, port 5201 [  4] 
>> +local 10.15.2.7 port 39415 connected to 10.15.2.3 port 5201
>> +[ ID] Interval           Transfer     Bandwidth       Total Datagrams
>> +[  4]   0.00-1.00   sec   572 MBytes  4.79 Gbits/sec  73154
>> +[  4]   1.00-2.00   sec   611 MBytes  5.12 Gbits/sec  78196
>> +[  4]   2.00-3.00   sec   588 MBytes  4.93 Gbits/sec  75248
>> +[  4]   3.00-4.00   sec   619 MBytes  5.19 Gbits/sec  79200
>> +[  4]   4.00-5.00   sec   625 MBytes  5.24 Gbits/sec  79937
>> +[  4]   5.00-6.00   sec   664 MBytes  5.57 Gbits/sec  85043
>> +[  4]   6.00-7.00   sec   636 MBytes  5.34 Gbits/sec  81417
>> +[  4]   7.00-8.00   sec   629 MBytes  5.27 Gbits/sec  80461
>> +[  4]   8.00-9.00   sec   635 MBytes  5.33 Gbits/sec  81326
>> +[  4]   9.00-10.00  sec   627 MBytes  5.26 Gbits/sec  80270
>> +- - - - - - - - - - - - - - - - - - - - - - - - -
>> +[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>> +[  4]   0.00-10.00  sec  6.06 GBytes  5.21 Gbits/sec  0.067 ms  3793/5791 (65%)
>> +[  4] Sent 5791 datagrams
>> +
>> +Server output:
>> +- - - - - - - -
>> +Accepted connection from 10.15.2.7, port 54090 [  5] local 10.15.2.3 
>> +port 5201 connected to 10.15.2.7 port 39415
>> +[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>> +[  5]   0.00-1.00   sec  15.6 MBytes   131 Mbits/sec  0.067 ms  3793/5791 (65%)
>> +[  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec  0.067 ms  0/0 (-nan%)
>> +[  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec  0.067 ms  0/0 (-nan%)
>> +[  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec  0.067 ms  0/0 (-nan%)
>> +[  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec  0.067 ms  0/0 (-nan%)
>> +[  5]   5.00-6.00   sec  0.00 Bytes  0.00 bits/sec  0.067 ms  0/0 (-nan%)
>> +[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec  0.067 ms  0/0 (-nan%)
>> +[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec  0.067 ms  0/0 (-nan%)
>> +[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec  0.067 ms  0/0 (-nan%)
>> +[  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec  0.067 ms  0/0 (-nan%)
>> +
>> +
>> +iperf Done.
>> +
>> +Test setup is below:
>> +
>> +  netns ns02                           netns ns03
>> ++------------+                       +------------+
>> +|10.15.2.3/24|                       |10.15.2.7/24|
>> +|            |                       |            |
>> +|   veth02   |                       |   veth03   |
>> ++------|-----+  +-----------------+  +-----|------+
>> +       |        |                 |        |
>> +       +--------|       br0       |--------+
>> +                |(datapath=netdev)|
>> +                +-----------------+
>> +
>> +
>> +But what if you increase socket buffer size? Let us increase it to 
>> +1073741823 and check it again.
>> +
>> +$ sudo ip netns exec ns03 iperf3 -t 10 -i 1 -u -b 3G -c 10.15.2.3 
>> +--get-server-output Connecting to host 10.15.2.3, port 5201 [  4] 
>> +local 10.15.2.7 port 52686 connected to 10.15.2.3 port 5201
>> +[ ID] Interval           Transfer     Bandwidth       Total Datagrams
>> +[  4]   0.00-1.00   sec   343 MBytes  2.88 Gbits/sec  43945
>> +[  4]   1.00-2.00   sec   357 MBytes  3.00 Gbits/sec  45742
>> +[  4]   2.00-3.00   sec   357 MBytes  3.00 Gbits/sec  45759
>> +[  4]   3.00-4.00   sec   357 MBytes  3.00 Gbits/sec  45716
>> +[  4]   4.00-5.00   sec   358 MBytes  3.01 Gbits/sec  45882
>> +[  4]   5.00-6.00   sec   360 MBytes  3.02 Gbits/sec  46046
>> +[  4]   6.00-7.00   sec   368 MBytes  3.09 Gbits/sec  47163
>> +[  4]   7.00-8.00   sec   357 MBytes  3.00 Gbits/sec  45734
>> +[  4]   8.00-9.00   sec   353 MBytes  2.97 Gbits/sec  45246
>> +[  4]   9.00-10.00  sec   356 MBytes  2.99 Gbits/sec  45630
>> +- - - - - - - - - - - - - - - - - - - - - - - - -
>> +[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>> +[  4]   0.00-10.00  sec  3.49 GBytes  2.99 Gbits/sec  0.027 ms  0/456861 (0%)
>> +[  4] Sent 456861 datagrams
>> +
>> +Server output:
>> +- - - - - - - -
>> +Accepted connection from 10.15.2.7, port 54096 [  5] local 10.15.2.3 
>> +port 5201 connected to 10.15.2.7 port 52686
>> +[ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
>> +[  5]   0.00-1.00   sec   190 MBytes  1.59 Gbits/sec  0.031 ms  0/24303 (0%)
>> +[  5]   1.00-2.00   sec   219 MBytes  1.84 Gbits/sec  0.023 ms  0/28025 (0%)
>> +[  5]   2.00-3.00   sec   219 MBytes  1.84 Gbits/sec  0.029 ms  0/28006 (0%)
>> +[  5]   3.00-4.00   sec   219 MBytes  1.83 Gbits/sec  0.030 ms  0/27990 (0%)
>> +[  5]   4.00-5.00   sec   218 MBytes  1.83 Gbits/sec  0.031 ms  0/27920 (0%)
>> +[  5]   5.00-6.00   sec   209 MBytes  1.76 Gbits/sec  0.094 ms  0/26807 (0%)
>> +[  5]   6.00-7.00   sec   185 MBytes  1.55 Gbits/sec  0.032 ms  0/23673 (0%)
>> +[  5]   7.00-8.00   sec   217 MBytes  1.82 Gbits/sec  0.030 ms  0/27721 (0%)
>> +[  5]   8.00-9.00   sec   208 MBytes  1.75 Gbits/sec  0.029 ms  0/26646 (0%)
>> +[  5]   9.00-10.00  sec   219 MBytes  1.84 Gbits/sec  0.029 ms  0/28007 (0%)
>> +[  5]  10.00-11.00  sec   217 MBytes  1.82 Gbits/sec  0.026 ms  0/27816 (0%)
>> +[  5]  11.00-12.00  sec   218 MBytes  1.83 Gbits/sec  0.024 ms  0/27936 (0%)
>> +[  5]  12.00-13.00  sec   213 MBytes  1.79 Gbits/sec  0.036 ms  0/27282 (0%)
>> +[  5]  13.00-14.00  sec   211 MBytes  1.77 Gbits/sec  0.035 ms  0/27018 (0%)
>> +[  5]  14.00-15.00  sec   212 MBytes  1.78 Gbits/sec  0.029 ms  0/27162 (0%)
>> +[  5]  15.00-16.00  sec   216 MBytes  1.81 Gbits/sec  0.025 ms  0/27605 (0%)
>> +
>> +
>> +iperf Done.
>> +
>> +You can see the performance number has huge improvement, packet loss 
>> +rate is 0.
>> +
>> +.. note::
>> +
>> +   This howto covers the steps required to tune UDP performance. The same
>> +   approach can be used for iperf3 client and iperf3 server in VMs or network
>> +   namespaces.
>> +
>> +Tunning Steps
>> +-------------
>> +
>> +Perform the following steps on OVS node to tune socket buffer for OVS 
>> +system interface.
>> +
>> +#. Change Linux system maximum socket buffer size for send and 
>> +receive sides
>> +
>> +       $ sudo sh -c "1073741823 > /proc/sys/net/core/wmem_max"
>> +       $ sudo sh -c "1073741823 > /proc/sys/net/core/rmem_max"
>> +
>> +   In order to ensure they are still set to the above value after your system
>> +   is rebooted, you also need change systctl config to persist these values.
>> +
>> +       $ sudo sh -c "echo net.core.rmem_max=1073741823 >> /etc/sysctl.conf"
>> +       $ sudo sh -c "echo net.core.wmem_max=1073741823 >> /etc/sysctl.conf"
>> +
>> +#. Change socket buffer size for OVS system interface
>> +
>> +       $ sudo ovs-vsctl set Open_vSwitch . 
>> + other_config:userspace-sock-buf-size=1073741823
>> +
>> +   Note: You can set it to smaller value per your system, final recv socket
>> +   buffer size for OVS system interface is minimum one of rmem_max and
>> +   this value, final send socket buffer size for OVS system interface is
>> +   minimum one of wmem_max and this value. So you can change it to the value
>> +   you want just by changing other_config:userspace-sock-buf-size, you also
>> +   can set other_config:userspace-sock-buf-size to 1073741823 and just change
>> +   /proc/sys/net/core/rmem_max and /proc/sys/net/core/wmem_max to set the
>> +   value you want, but the changed value will take effect only after you
>> +   restart ovs-vswitchd no matter which one you prefer to use.
>
> You should make it obvious that this sets both the read and write sockbuf to this value.
>
> [Yi Yang] Good idea, will add such statement in v3.
>
>
>> +#. Restart ovs-vswitchd
>> +
>> +   Note: The changed value will take effect only after you restart
>> +   ovs-vswitchd.
>
> Why this limitation?
> [Yi Yang] Maybe more explanation is needed here, only newly-added
> system interfaces will take the changed value, existing system
> interfaces still use old value. So restart is necessary if you want it
> to take effect for all the system interfaces in bridge.
>
>> +#. You need repeat the above steps on all the OVS nodes to make sure
>> +   cross-node veth-to-veth, veth-to-tap, or tap-to-tap UDP performance
>> +   can get improved.
>> +
>> +Potential Impact
>> +----------------
>> +
>> +Although this tunning can improve UDP performance, it possibly also 
>> +impacts on TCP performance, please reset the above values to default 
>> +values in your system if you see it hurts your TCP performance.
>
> You are setting the values explicitly in the code, regardless of the
> user's decision.  One other side effect is that it triggers 'spurious'
> wakes in the system.
> [Yi Yang] reasonable concern, it is good idea not to set socket buffer
> size if a user doesn't set other_config:userspace-sock-buf-size
> explicitly.
>
>> diff --git a/lib/automake.mk b/lib/automake.mk index 380a672..ffbc3e3 
>> 100644
>> --- a/lib/automake.mk
>> +++ b/lib/automake.mk
>> @@ -343,6 +343,8 @@ lib_libopenvswitch_la_SOURCES = \
>>  	lib/unicode.h \
>>  	lib/unixctl.c \
>>  	lib/unixctl.h \
>> +        lib/userspace-sock-buf-size.c \
>> +        lib/userspace-sock-buf-size.h \
>>  	lib/userspace-tso.c \
>>  	lib/userspace-tso.h \
>>  	lib/util.c \
>> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c index 
>> fe7fb9b..a374b43 100644
>> --- a/lib/netdev-linux.c
>> +++ b/lib/netdev-linux.c
>> @@ -78,6 +78,7 @@
>>  #include "timer.h"
>>  #include "unaligned.h"
>>  #include "openvswitch/vlog.h"
>> +#include "userspace-sock-buf-size.h"
>>  #include "userspace-tso.h"
>>  #include "util.h"
>>  
>> @@ -1103,6 +1104,18 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>>              ARRAY_SIZE(filt), (struct sock_filter *) filt
>>          };
>>  
>> +        /* sock_buf_size must be less than 1G, so maximum value is
>> +         * (1 << 30) - 1, i.e. 1073741823, this doesn't mean this
>> +         * socket will allocate so big buffer, it just means the
>> +         * packets client sends won't be dropped because of small
>> +         * default socket buffer, the result is we can get the best
>> +         * possible throughtput, no packet loss, this can improve
>> +         * UDP and TCP performance significantly, especially for
>> +         * fragmented UDP.
>> +         */
>> +        uint32_t sock_buf_size = userspace_get_sock_buf_size();
>> +        uint32_t sock_opt_len = sizeof(sock_buf_size);
>> +
>>          /* Create file descriptor. */
>>          rx->fd = socket(PF_PACKET, SOCK_RAW, 0);
>>          if (rx->fd < 0) {
>> @@ -1161,6 +1174,48 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>>                       netdev_get_name(netdev_), ovs_strerror(error));
>>              goto error;
>>          }
>> +
>> +        /* Set send socket buffer size */
>
> If the user has used systemctl to tune rmem_default to some value
> other than the hardcoded one you have, it will not be used by OvS in
> this case.  That means the user will not get expected behavior.
>
> The existing behavior isn't preserved.  Please don't set these unless the user explicitly pushes a value into the database.
> [Yi Yang] ok, will do this way in v3.
>
>> +        error = setsockopt(rx->fd, SOL_SOCKET, SO_SNDBUF, &sock_buf_size, 4);
>> +        if (error) {
>> +            error = errno;
>> +            VLOG_ERR("%s: failed to set send socket buffer size (%s)",
>> +                     netdev_get_name(netdev_), ovs_strerror(error));
>> +            goto error;
>> +        }
>> +
>> +        /* Set recv socket buffer size */
>> +        error = setsockopt(rx->fd, SOL_SOCKET, SO_RCVBUF, &sock_buf_size, 4);
>> +        if (error) {
>> +            error = errno;
>> +            VLOG_ERR("%s: failed to set recv socket buffer size (%s)",
>> +                     netdev_get_name(netdev_), ovs_strerror(error));
>> +            goto error;
>> +        }
>
> Please don't use the hardcoded '4' - use sock_opt_len.
> [Yi Yang] ok.
>
> I think we should only error here if we see EBADF or ENOTSOCK.
> [Yi Yang] Make sense, will check these error code in v3.
>
>> +        /* Get final recv socket buffer size, it should be
>> +         * 2 * ((1 << 30) - 1) (i.e. 2147483646) if successfully.
>> +         * Don't doubt it is wrong, Linux kernel does so, i.e.
>> +         * final sk_rcvbuf = val * 2.
>> +         */
>> +        error=  getsockopt(rx->fd, SOL_SOCKET, SO_RCVBUF, &sock_buf_size,
>> +                           &sock_opt_len);
>
> Whitespace error here
> [Yi Yang] will correct it in v3.
>
>> +        if (!error) {
>> +            VLOG_INFO("netdev %s socket recv buffer size: %d",
>> +                      netdev_get_name(netdev_), sock_buf_size);
>> +        }
>> +
>> +        /* Get final send socket buffer size, it should be
>> +         * 2 * ((1 << 30) - 1) (i.e. 2147483646) if successfully.
>> +         * Don't doubt it is wrong, Linux kernel does so, i.e.
>> +         * final sk_sndbuf = val * 2.
>> +         */
>> +        error = getsockopt(rx->fd, SOL_SOCKET, SO_SNDBUF, &sock_buf_size,
>> +                           &sock_opt_len);
>> +        if (!error) {
>> +            VLOG_INFO("netdev %s socket send buffer size: %d",
>> +                      netdev_get_name(netdev_), sock_buf_size);
>> +        }
>
> Maybe only print when the sndbuf isn't the size requested - otherwise
> for every port added we see this message.  We already know what the
> size will be - you log it when it is read from the DB, and it is
> present in the DB.
> [Yi Yang] Good idea, will do that way in v3.
>
>>      }
>>      ovs_mutex_unlock(&netdev->mutex);
>>  
>> diff --git a/lib/userspace-sock-buf-size.c 
>> b/lib/userspace-sock-buf-size.c new file mode 100644 index 
>> 0000000..e4c9381
>> --- /dev/null
>> +++ b/lib/userspace-sock-buf-size.c
>> @@ -0,0 +1,75 @@
>> +/*
>> + * Copyright (c) 2020 Inspur, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, 
>> +software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions 
>> +and
>> + * limitations under the License.
>> + */
>> +
>> +#include <config.h>
>> +
>> +#include "smap.h"
>> +#include "ovs-thread.h"
>> +#include "openvswitch/vlog.h"
>> +#include "userspace-sock-buf-size.h"
>> +
>> +VLOG_DEFINE_THIS_MODULE(userspace_sock_buf_size);
>> +
>> +/* Default socket buffer size for system interface is
>> + * 1073741823, i.e. 1024 * 1024 * 1024 - 1, it can help
>> + * improve UDP performance, you can tune it per your
>> + * system by the below command
>> + *   ovs-vsctl set Open_vSwitch . \
>> + *     other_config:userspace_sock_buf_size = XXXX
>> + *
>> + * 1073741823 is maximum possible value, the value you
>> + * set must be less than or equal to 1073741823.
>> + */
>> +
>> +/* Minimum socket buffer size, it is Linux default size */ #define 
>> +MIN_SOCK_BUF_SIZE 212992
>> +
>> +/* Maximum possible socket buffer size */ #define MAX_SOCK_BUF_SIZE 
>> +1073741823
>> +
>> +#define DEFAULT_SOCK_BUF_SIZE MIN_SOCK_BUF_SIZE
>> +
>> +static uint32_t userspace_sock_buf_size = DEFAULT_SOCK_BUF_SIZE;
>> +
>> +void
>> +userspace_sock_buf_size_init(const struct smap *ovs_other_config) {
>> +    static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
>> +
>> +    if (ovsthread_once_start(&once)) {
>> +        uint32_t sock_buf_size;
>
> I think this can easily be runtime configurable.  Why protect it to only be set once?
> [Yi Yang] Make sense, as I said above, it will take effect only on
> newly-added interfaces if it is changeable at runtime. This can make
> confusion,
> i.e. some have good performance and others have bad performance, they are using different socket buffer size.
>
>> +        sock_buf_size = smap_get_int(ovs_other_config,
>> +                                     "userspace-sock-buf-size",
>> +                                     DEFAULT_SOCK_BUF_SIZE);
>> +        if (sock_buf_size < MIN_SOCK_BUF_SIZE) {
>> +            sock_buf_size = MIN_SOCK_BUF_SIZE;
>> +        } else if (sock_buf_size > MAX_SOCK_BUF_SIZE) {
>> +            sock_buf_size = MAX_SOCK_BUF_SIZE;
>> +        }
>> +
>> +        userspace_sock_buf_size = sock_buf_size;
>> +        VLOG_INFO("Userspace socket buffer size for system interface: %d",
>> +                  userspace_sock_buf_size);
>> +        ovsthread_once_done(&once);
>> +    }
>> +}
>> +
>> +uint32_t
>> +userspace_get_sock_buf_size(void)
>> +{
>> +    return userspace_sock_buf_size;
>> +}
>> diff --git a/lib/userspace-sock-buf-size.h 
>> b/lib/userspace-sock-buf-size.h new file mode 100644 index 
>> 0000000..80385ba
>> --- /dev/null
>> +++ b/lib/userspace-sock-buf-size.h
>> @@ -0,0 +1,23 @@
>> +/*
>> + * Copyright (c) 2020 Inspur Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, 
>> +software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions 
>> +and
>> + * limitations under the License.
>> + */
>> +
>> +#ifndef USERSPACE_SOCK_SIZE_H
>> +#define USERSPACE_SOCK_SIZE_H 1
>> +
>> +void userspace_sock_buf_size_init(const struct smap 
>> +*ovs_other_config); uint32_t userspace_get_sock_buf_size(void);
>> +
>> +#endif /* userspace-sock-buf-size.h */
>> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c index 
>> a3e7fac..8ab33ee 100644
>> --- a/vswitchd/bridge.c
>> +++ b/vswitchd/bridge.c
>> @@ -65,6 +65,7 @@
>>  #include "system-stats.h"
>>  #include "timeval.h"
>>  #include "tnl-ports.h"
>> +#include "userspace-sock-buf-size.h"
>>  #include "userspace-tso.h"
>>  #include "util.h"
>>  #include "unixctl.h"
>> @@ -3291,6 +3292,7 @@ bridge_run(void)
>>          netdev_set_flow_api_enabled(&cfg->other_config);
>>          dpdk_init(&cfg->other_config);
>>          userspace_tso_init(&cfg->other_config);
>> +        userspace_sock_buf_size_init(&cfg->other_config);
>>      }
>>  
>>      /* Initialize the ofproto library.  This only needs to run once, 
>> but
>
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev



More information about the dev mailing list