[ovs-dev] Re: [PATCH] userspace: fix bad UDP performance issue of veth

Yi Yang (杨燚) - Cloud Service Group yangyi01 at inspur.com
Wed Aug 26 00:47:43 UTC 2020


Aaron, thanks for your comments. Actually, the final value depends on
/proc/sys/net/core/rmem_max and /proc/sys/net/core/wmem_max, so it is still
configurable: setsockopt(...) will set it to the minimum of 1073741823 and
rmem_max/wmem_max.
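
As a quick illustration of that clamping behavior (a standalone sketch, not
part of the patch; the doubling comes from the kernel's internal accounting):

    /* Sketch: the kernel caps the requested SO_RCVBUF at rmem_max and
     * then doubles it, so getsockopt() reports roughly
     * min(requested, rmem_max) * 2. */
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) {
            perror("socket");
            return 1;
        }

        int requested = (1 << 30) - 1;   /* 1073741823, as in the patch. */

        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                       &requested, sizeof requested) < 0) {
            perror("setsockopt");
            return 1;
        }

        int effective = 0;
        socklen_t len = sizeof effective;
        if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len) == 0) {
            /* Without raising rmem_max this prints 2 * rmem_max (e.g.
             * 425984 for the default 212992), not 2147483646. */
            printf("effective SO_RCVBUF: %d\n", effective);
        }
        return 0;
    }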

-----Original Message-----
From: dev [mailto:ovs-dev-bounces at openvswitch.org] On Behalf Of Aaron Conole
Sent: August 25, 2020 23:26
To: yang_y_yi at 163.com
Cc: ovs-dev at openvswitch.org; i.maximets at ovn.org; fbl at sysclose.org
Subject: Re: [ovs-dev] [PATCH] userspace: fix bad UDP performance issue of veth

yang_y_yi at 163.com writes:

> From: Yi Yang <yangyi01 at inspur.com>
>
> iperf3 UDP performance in the veth-to-veth case is very bad because of
> heavy packet loss. The root cause is that rmem_default and wmem_default
> are only 212992 bytes, while the iperf3 UDP test uses an 8K UDP payload,
> which is fragmented when the MTU is 1500: one 8K UDP send enqueues 6 UDP
> fragments onto the socket receive queue, and the small default socket
> buffer cannot hold that many packets, so many of them are dropped.
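
For reference, the fragment arithmetic above works out as follows (a rough
standalone sketch, not part of the patch):

    /* Rough arithmetic behind "one 8K UDP send -> 6 fragments"
     * (illustrative only; real kernel accounting uses skb truesize,
     * which is larger than the on-wire size). */
    #include <stdio.h>

    int main(void)
    {
        int udp_payload = 8192;            /* 8K iperf3 UDP write size. */
        int udp_total   = udp_payload + 8; /* Plus the UDP header. */
        int frag_data   = 1500 - 20;       /* MTU minus the IPv4 header. */
        int fragments   = (udp_total + frag_data - 1) / frag_data;

        printf("fragments per 8K datagram: %d\n", fragments); /* Prints 6. */
        return 0;
    }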
>
> This commit fixes the packet loss by setting the socket receive and send
> buffers to the maximum possible value, so packets are no longer dropped.
> It also improves TCP performance because retransmits are avoided.
>
> By the way, a big socket buffer does not mean a big buffer is allocated
> when the socket is created; no extra memory is allocated compared to the
> default socket buffer size. It just means more skbuffs can be enqueued
> onto the socket receive and send queues, so packets are not dropped.
>
> The test results below are for reference.
>
> The result before applying this commit
> ======================================
> $ ip netns exec ns02 iperf3 -t 5 -i 1 -u -b 100M -c 10.15.2.6 --get-server-output -A 5
> Connecting to host 10.15.2.6, port 5201
> [  4] local 10.15.2.2 port 59053 connected to 10.15.2.6 port 5201
> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
> [  4]   0.00-1.00   sec  10.8 MBytes  90.3 Mbits/sec  1378
> [  4]   1.00-2.00   sec  11.9 MBytes   100 Mbits/sec  1526
> [  4]   2.00-3.00   sec  11.9 MBytes   100 Mbits/sec  1526
> [  4]   3.00-4.00   sec  11.9 MBytes   100 Mbits/sec  1526
> [  4]   4.00-5.00   sec  11.9 MBytes   100 Mbits/sec  1526
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> [  4]   0.00-5.00   sec  58.5 MBytes  98.1 Mbits/sec  0.047 ms  357/531 (67%)
> [  4] Sent 531 datagrams
>
> Server output:
> -----------------------------------------------------------
> Accepted connection from 10.15.2.2, port 60314
> [  5] local 10.15.2.6 port 5201 connected to 10.15.2.2 port 59053
> [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> [  5]   0.00-1.00   sec  1.36 MBytes  11.4 Mbits/sec  0.047 ms  357/531 (67%)
> [  5]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec  0.047 ms  0/0 (-nan%)
> [  5]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec  0.047 ms  0/0 (-nan%)
> [  5]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec  0.047 ms  0/0 (-nan%)
> [  5]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec  0.047 ms  0/0 (-nan%)
>
> iperf Done.
>
> The result after applying this commit
> =====================================
> $ sudo ip netns exec ns02 iperf3 -t 5 -i 1 -u -b 4G -c 10.15.2.6 --get-server-output -A 5
> Connecting to host 10.15.2.6, port 5201
> [  4] local 10.15.2.2 port 48547 connected to 10.15.2.6 port 5201
> [ ID] Interval           Transfer     Bandwidth       Total Datagrams
> [  4]   0.00-1.00   sec   440 MBytes  3.69 Gbits/sec  56276
> [  4]   1.00-2.00   sec   481 MBytes  4.04 Gbits/sec  61579
> [  4]   2.00-3.00   sec   474 MBytes  3.98 Gbits/sec  60678
> [  4]   3.00-4.00   sec   480 MBytes  4.03 Gbits/sec  61452
> [  4]   4.00-5.00   sec   480 MBytes  4.03 Gbits/sec  61441
> - - - - - - - - - - - - - - - - - - - - - - - - -
> [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> [  4]   0.00-5.00   sec  2.30 GBytes  3.95 Gbits/sec  0.024 ms  0/301426 (0%)
> [  4] Sent 301426 datagrams
>
> Server output:
> -----------------------------------------------------------
> Accepted connection from 10.15.2.2, port 60320
> [  5] local 10.15.2.6 port 5201 connected to 10.15.2.2 port 48547
> [ ID] Interval           Transfer     Bandwidth       Jitter    Lost/Total Datagrams
> [  5]   0.00-1.00   sec   209 MBytes  1.75 Gbits/sec  0.021 ms  0/26704 (0%)
> [  5]   1.00-2.00   sec   258 MBytes  2.16 Gbits/sec  0.025 ms  0/32967 (0%)
> [  5]   2.00-3.00   sec   258 MBytes  2.16 Gbits/sec  0.022 ms  0/32987 (0%)
> [  5]   3.00-4.00   sec   257 MBytes  2.16 Gbits/sec  0.023 ms  0/32954 (0%)
> [  5]   4.00-5.00   sec   257 MBytes  2.16 Gbits/sec  0.021 ms  0/32937 (0%)
> [  5]   5.00-6.00   sec   255 MBytes  2.14 Gbits/sec  0.026 ms  0/32685 (0%)
> [  5]   6.00-7.00   sec   254 MBytes  2.13 Gbits/sec  0.025 ms  0/32453 (0%)
> [  5]   7.00-8.00   sec   255 MBytes  2.14 Gbits/sec  0.026 ms  0/32679 (0%)
> [  5]   8.00-9.00   sec   255 MBytes  2.14 Gbits/sec  0.022 ms  0/32669 (0%)
>
> iperf Done.
>
> Signed-off-by: Yi Yang <yangyi01 at inspur.com>
> ---

I think we should make it configurable.  Each RXQ will potentially allow a huge number of skbuffs to be enqueued after this.  That might, ironically, lead to worse performance (since there could be some kind of buffer bloat effect at higher rmem values as documented at https://serverfault.com/questions/410230/higher-rmem-max-value-leading-to-more-packet-loss).

I think it should be a decision that the operator can take.  Currently, they
could modify it anyway via procfs, so we shouldn't break that.  Instead, I
think there could be a config knob (or maybe reuse the 'n_{r,t}xq_desc'?)
that is used when set, with the kernel default applying otherwise.

WDYT?
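
Purely as an illustration of that idea (the helper below and any option name
are hypothetical, not OVS's actual configuration interface), such a knob
might look roughly like this:

    /* Hypothetical sketch: apply a socket buffer size only when the
     * operator has configured one; otherwise leave the kernel's
     * rmem_default/wmem_default untouched. */
    #include <errno.h>
    #include <sys/socket.h>

    static int
    apply_sock_buf_size(int fd, int configured_size)
    {
        if (configured_size <= 0) {
            return 0;            /* Not configured: keep kernel defaults. */
        }
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                       &configured_size, sizeof configured_size)
            || setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                          &configured_size, sizeof configured_size)) {
            return errno;
        }
        return 0;
    }

The open question above is whether such a value would come from
'n_{r,t}xq_desc' or from a dedicated setting.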

>  lib/netdev-linux.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
>
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index fe7fb9b..3c45191 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -1103,6 +1103,18 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>              ARRAY_SIZE(filt), (struct sock_filter *) filt
>          };
>  
> +        /* sock_buf_size must be less than 1G, so the maximum value
> +         * is (1 << 30) - 1, i.e. 1073741823.  This does not mean the
> +         * socket will allocate such a big buffer; it just means that
> +         * the packets the client sends will not be dropped because
> +         * of a small default socket buffer.  The result is the best
> +         * possible throughput with no packet loss, which improves
> +         * UDP and TCP performance significantly, especially for
> +         * fragmented UDP.
> +         */
> +        unsigned int sock_buf_size = (1 << 30) - 1;
> +        unsigned int sock_opt_len = sizeof(sock_buf_size);
> +
>          /* Create file descriptor. */
>          rx->fd = socket(PF_PACKET, SOCK_RAW, 0);
>          if (rx->fd < 0) {
> @@ -1161,6 +1173,48 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
>                       netdev_get_name(netdev_), ovs_strerror(error));
>              goto error;
>          }
> +
> +        /* Set send socket buffer size */
> +        error = setsockopt(rx->fd, SOL_SOCKET, SO_SNDBUF, &sock_buf_size, 4);
> +        if (error) {
> +            error = errno;
> +            VLOG_ERR("%s: failed to set send socket buffer size (%s)",
> +                     netdev_get_name(netdev_), ovs_strerror(error));
> +            goto error;
> +        }
> +
> +        /* Set recv socket buffer size */
> +        error = setsockopt(rx->fd, SOL_SOCKET, SO_RCVBUF, &sock_buf_size, 4);
> +        if (error) {
> +            error = errno;
> +            VLOG_ERR("%s: failed to set recv socket buffer size (%s)",
> +                     netdev_get_name(netdev_), ovs_strerror(error));
> +            goto error;
> +        }
> +
> +        /* Get the final recv socket buffer size; it should be
> +         * 2 * ((1 << 30) - 1), i.e. 2147483646, on success.  This is
> +         * not a bug: the Linux kernel doubles the requested value, so
> +         * the final sk_rcvbuf = val * 2.
> +         */
> +        error = getsockopt(rx->fd, SOL_SOCKET, SO_RCVBUF, &sock_buf_size,
> +                           &sock_opt_len);
> +        if (!error) {
> +            VLOG_INFO("netdev %s socket recv buffer size: %d",
> +                      netdev_get_name(netdev_), sock_buf_size);
> +        }
> +
> +        /* Get the final send socket buffer size; it should be
> +         * 2 * ((1 << 30) - 1), i.e. 2147483646, on success.  This is
> +         * not a bug: the Linux kernel doubles the requested value, so
> +         * the final sk_sndbuf = val * 2.
> +         */
> +        error = getsockopt(rx->fd, SOL_SOCKET, SO_SNDBUF, &sock_buf_size,
> +                           &sock_opt_len);
> +        if (!error) {
> +            VLOG_INFO("netdev %s socket send buffer size: %d",
> +                      netdev_get_name(netdev_), sock_buf_size);
> +        }
>      }
>      ovs_mutex_unlock(&netdev->mutex);

_______________________________________________
dev mailing list
dev at openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

