[ovs-dev] [PATCH v6 0/7] Output packet batching.
jan.scheurich at ericsson.com
Tue Dec 5 17:26:32 UTC 2017
We have now repeated our earlier iperf3 tests for this patch series.
We use an iperf3 server as a representative of a typical IO-intensive kernel application. The iperf3 server executes in a VM with 2 vCPUs where both the virtio interrupts and the iperf3 process are pinned to the same vCPU for best performance. We run two iperf3 clients in parallel on a different server so that the client does not become the bottleneck when tx batching is enabled.
OVS        tx-flush-   iperf3   Avg. PMD     PMD      iperf      ping -f
version    interval    Gbps     cycles/pkt   util     CPU load   avg rtt
-------------------------------------------------------------------------
master         -       7.24     1778         46.5%     99.7%     23 us
Patch v6       0       7.18     1873         47.7%    100.0%     29 us
Patch v6      50       8.99     1108         36.3%     99.7%     38 us
Patch v6     100       ----     ----         ----      -----     88 us
In all cases the vCPU capacity of the server VM handling the virtio interrupts and the iperf3 server thread is the bottleneck. The TCP throughput is throttled by packets being dropped on Tx to the vhostuser port of the server VM. The Linux kernel is not fast enough to handle the interrupts and poll the incoming packets.
As expected, the tx batching patch alone with tx-flush-interval=0 does not provide any benefit, as it doesn't reduce the virtio interrupt rate.
Setting the tx-flush-interval to 50 microseconds immediately improves the throughput: The PMD utilization drops from 47% to 36% due to the reduced rate of write calls to the virtio kick fd. (I believe the more pronounced drop in processing cycles/pkt is an artifact of the patch. The cycles used for delayed tx to vhostuser are no longer counted as packet processing cost. To be checked in the individual patch review.)
More importantly, the iperf3 server VM can now receive 8.99 instead of 7.24 Gbit/s, an increase of 24%. I am sure that 10G line rate could be reached with vhost multi-queue in the server VM.
Compared to the v4 version of the patches, the impact on latency is now reduced a lot. Packets with an inter-arrival time larger than the configured tx-flush-interval are not affected at all. For a 50 us tx-flush-interval this means packet flows with a packet rate of up to 20 Kpps!
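The 20 Kpps figure is simply the inverse of the flush interval. As a quick sanity check, here is a small Python sketch of that arithmetic. The function name is made up for illustration, and it assumes evenly spaced packets on the port:

```python
def max_unaffected_pps(tx_flush_interval_us):
    """Packet rate (packets/s) at which the inter-arrival time of an
    evenly spaced flow equals the flush interval. Flows slower than
    this have an inter-arrival time larger than the interval."""
    if tx_flush_interval_us <= 0:
        raise ValueError("tx_flush_interval_us must be positive")
    return 1_000_000 / tx_flush_interval_us


print(max_unaffected_pps(50))   # 20000.0, i.e. 20 Kpps
```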
Hence the average RTT reported by "ping -f" experiences only a small increase from 23 us on master to 38 us with tx-flush-interval=50. Only when tx-flush-interval is increased well beyond the intrinsic average inter-arrival time does it translate directly into increased latency.
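To illustrate why sparse flows are not delayed while dense flows get batched, here is a toy model in Python. It is not code from the patch series. The function name and the simplifications are my assumptions: the next flush time of a port is re-armed to "now + interval" at every flush, the PMD polls often enough to flush exactly at the deadline, and all timestamps are in microseconds.

```python
def simulate_tx_batching(arrivals_us, flush_interval_us):
    """Toy model of time-based tx batching on a single output port.

    Returns (per-packet queueing delay in us, number of flushes).
    Assumption: each flush re-arms the deadline to flush time + interval.
    """
    delays = {}
    pending = []          # (packet index, arrival time) waiting for a flush
    next_flush = 0        # earliest time the next flush may happen
    flushes = 0

    def flush(at):
        nonlocal pending, next_flush, flushes
        for idx, arrived in pending:
            delays[idx] = at - arrived
        pending = []
        next_flush = at + flush_interval_us
        flushes += 1

    for idx, t in enumerate(arrivals_us):
        # Packets held from earlier go out when their deadline passes.
        if pending and next_flush < t:
            flush(next_flush)
        pending.append((idx, t))
        # Deadline already reached: send right away, together with
        # anything still queued.
        if next_flush <= t:
            flush(t)

    if pending:
        flush(next_flush)
    return [delays[i] for i in range(len(arrivals_us))], flushes


# Inter-arrival time 100 us > interval 50 us: no added delay.
print(simulate_tx_batching([0, 100, 200, 300], 50))
# Inter-arrival time 10 us < interval 50 us: packets wait and share a flush.
print(simulate_tx_batching([0, 10, 20, 30, 40, 50, 60], 50))
```

In this model the first flow is flushed once per packet with zero delay, while the second needs only three flushes for seven packets at the cost of up to 40 us of queueing per packet.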
Conclusion: Time-based tx batching fulfills the expectations for interrupt-driven kernel workloads, while avoiding a latency impact even on moderately loaded ports.
From: Jan Scheurich
Sent: Tuesday, 05 December, 2017 00:21
To: Ilya Maximets <i.maximets at samsung.com>; ovs-dev at openvswitch.org; Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com>
Cc: Heetae Ahn <heetae82.ahn at samsung.com>; Antonio Fischetti <antonio.fischetti at intel.com>; Eelco Chaudron <echaudro at redhat.com>; Ciara Loftus <ciara.loftus at intel.com>; Kevin Traynor <ktraynor at redhat.com>; Ian Stokes <ian.stokes at intel.com>
Subject: RE: [PATCH v6 0/7] Output packet batching.
I have retested your "Output packet batching" v6 in our standard PVP L3-VPN/VXLAN benchmark setup. The configuration is a single PMD serving a physical 10G port and a VM running DPDK testpmd as IP reflector with 4 equally loaded vhostuser ports. The tests are run with 64 byte packets. Below are Mpps values averaged over four 10 second runs:
          master    patch                  patch
Flows     Mpps      tx-flush-interval=0    tx-flush-interval=50
----------------------------------------------------------------
8         4.419     4.342   -1.7%          4.749    7.5%
100       4.026     3.956   -1.7%          4.281    6.3%
1000      3.630     3.632    0.1%          3.760    3.6%
2000      3.394     3.390   -0.1%          3.490    2.8%
5000      2.989     2.938   -1.7%          2.994    0.2%
10000     2.756     2.711   -1.6%          2.746   -0.4%
20000     2.641     2.598   -1.6%          2.622   -0.7%
50000     2.604     2.558   -1.8%          2.579   -1.0%
100000    2.598     2.552   -1.8%          2.572   -1.0%
500000    2.598     2.550   -1.8%          2.571   -1.0%
As expected, output batching within rx bursts (tx-flush-interval=0) provides little or no benefit in this scenario. The test results reflect roughly a 1.7% performance penalty due to the tx batching overhead. This overhead is measurable, but should in my eyes not be a blocker for merging this patch series.
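For anyone who wants to recompute the percentage columns, the following Python snippet derives them from the Mpps values in the table above. The variable and function names are mine, chosen for illustration; the numbers are copied from the table.

```python
# (flows, master, patch with interval=0, patch with interval=50), all in Mpps.
PVP_MPPS = [
    (8,      4.419, 4.342, 4.749),
    (100,    4.026, 3.956, 4.281),
    (1000,   3.630, 3.632, 3.760),
    (2000,   3.394, 3.390, 3.490),
    (5000,   2.989, 2.938, 2.994),
    (10000,  2.756, 2.711, 2.746),
    (20000,  2.641, 2.598, 2.622),
    (50000,  2.604, 2.558, 2.579),
    (100000, 2.598, 2.552, 2.572),
    (500000, 2.598, 2.550, 2.571),
]


def delta_pct(new, base):
    """Relative change of `new` against `base`, in percent, one decimal."""
    return round((new - base) / base * 100, 1)


for flows, master, p0, p50 in PVP_MPPS:
    print(f"{flows:>7}  {delta_pct(p0, master):+5.1f}%  {delta_pct(p50, master):+5.1f}%")

# Largest tested flow count at which tx-flush-interval=50 still beats master.
break_even = max(f for f, master, _, p50 in PVP_MPPS if p50 > master)
print(break_even)
```

With these inputs the last line reports 5000 flows as the largest tested flow count where tx-flush-interval=50 is still ahead of master.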
Interestingly, tests with time-based tx batching and a minimum flush interval of 50 microseconds show a consistent and significant performance increase for a small number of flows (in the regime where EMC is effective) and a reduced penalty of 1% for many flows. I don't have a good explanation yet for this phenomenon. I would be interested to see if other benchmark results support the general positive impact of time-based tx batching on throughput also for synthetic DPDK applications in the VM. The average ping RTT increases by 20-30 us as expected.
We will also retest the performance improvement of time-based tx batching on interrupt driven Linux kernel applications (such as iperf3).
> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets at samsung.com]
> Sent: Friday, 01 December, 2017 16:44
> To: ovs-dev at openvswitch.org; Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com>
> Cc: Heetae Ahn <heetae82.ahn at samsung.com>; Antonio Fischetti <antonio.fischetti at intel.com>; Eelco Chaudron <echaudro at redhat.com>; Ciara Loftus <ciara.loftus at intel.com>; Kevin Traynor <ktraynor at redhat.com>; Jan Scheurich <jan.scheurich at ericsson.com>; Ian Stokes <ian.stokes at intel.com>; Ilya Maximets <i.maximets at samsung.com>
> Subject: [PATCH v6 0/7] Output packet batching.
> This patch-set inspired by  from Bhanuprakash Bodireddy.
> Implementation of  looks very complex and introduces many pitfalls 
> for later code modifications like possible packet stucks.
> This version targeted to make simple and flexible output packet batching on
> higher level without introducing and even simplifying netdev layer.
> Basic testing of 'PVP with OVS bonding on phy ports' scenario shows
> significant performance improvement.
> Test results for time-based batching for v3:
> Test results for v4:
>  [PATCH v4 0/5] netdev-dpdk: Use intermediate queue during packet transmission.
>  For example:
> Version 6:
> * Rebased on current master:
> - Added new patch to refactor dp_netdev_pmd_thread structure
> according to following suggestion:
> NOTE: I still prefer reverting of the padding related patch.
> Rebase done to not block accepting of this series.
> Revert patch and discussion here:
> * Added comment about pmd_thread_ctx_time_update() usage.
> Version 5:
> * pmd_thread_ctx_time_update() calls moved to different places to
> call them only from dp_netdev_process_rxq_port() and main
> polling functions:
> pmd_thread_main, dpif_netdev_run and dpif_netdev_execute.
> All other functions should use cached time from pmd->ctx.now.
> It's guaranteed to be updated at least once per polling cycle.
> * 'may_steal' patch returned to version from v3 because
> 'may_steal' in qos is a completely different variable. This
> patch only removes 'may_steal' from netdev API.
> * 2 more usec functions added to timeval to have complete public API.
> * Checking of 'output_cnt' turned to assertion.
> Version 4:
> * Rebased on current master.
> * Rebased on top of "Keep latest measured time for PMD thread."
> (Jan Scheurich)
> * Microsecond resolution related patches integrated.
> * Time-based batching without RFC tag.
> * 'output_time' renamed to 'flush_time'. (Jan Scheurich)
> * 'flush_time' update moved to 'dp_netdev_pmd_flush_output_on_port'.
> (Jan Scheurich)
> * 'output-max-latency' renamed to 'tx-flush-interval'.
> * Added patch for output batching statistics.
> Version 3:
> * Rebased on current master.
> * Time based RFC: fixed assert on n_output_batches <= 0.
> Version 2:
> * Rebased on current master.
> * Added time based batching RFC patch.
> * Fixed mixing packets with different sources in same batch.
> Ilya Maximets (7):
> dpif-netdev: Refactor PMD thread structure for further extension.
> dpif-netdev: Keep latest measured time for PMD thread.
> dpif-netdev: Output packet batching.
> netdev: Remove unused may_steal.
> netdev: Remove useless cutlen.
> dpif-netdev: Time based output batching.
> dpif-netdev: Count sent packets and batches.
> lib/dpif-netdev.c | 412 +++++++++++++++++++++++++++++++++++++-------------
> lib/netdev-bsd.c | 6 +-
> lib/netdev-dpdk.c | 30 ++--
> lib/netdev-dummy.c | 6 +-
> lib/netdev-linux.c | 8 +-
> lib/netdev-provider.h | 7 +-
> lib/netdev.c | 12 +-
> lib/netdev.h | 2 +-
> vswitchd/vswitch.xml | 16 ++
> 9 files changed, 349 insertions(+), 150 deletions(-)