[ovs-dev] [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.

Jan Scheurich jan.scheurich at ericsson.com
Sat Sep 2 15:14:01 UTC 2017


Hi,

Vishal and I have been benchmarking the impact of the various Tx-batching patches on OVS performance in the phy-VM-phy scenario with different applications running in the VM:

The OVS versions we tested are:

(master):	OVS master
(Ilya-3): 	Output batching within one Rx batch :
		(master) + [PATCH v3 1-3/4] Output packet batching
(Ilya-6): 	Time-based output batching with us resolution using CLOCK_MONOTONIC
		(Ilya-3) +  [PATCH RFC v3 4/4] dpif-netdev: Time based output batching + 
		[PATCH RFC 1/2] timeval: Introduce time_usec() + 
		[PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.
(Ilya-4-Jan):	Time-based output batching with us resolution using TSC cycles
		(Ilya-3) +  [PATCH RFC v3 4/4] dpif-netdev: Time based output batching + 
		Incremental patch using TSC cycles in 
		https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337402.html

Application 1: iperf server as representative of kernel-based applications:

The iperf server executes in a VM with 2 vCPUs where both virtio interrupts and iperf process are pinned to the same vCPU for best performance. The iperf client also runs in a VM on a different server. OVS nodes on client and server are configured identically.

                Iperf                                 iperf CPU  Ping
OVS version      Gbps   Avg.PMD cycles/pkt  PMD util  host util  rtt
------------------------------------------------------------------------
Master           6.83        1708.63        43.50%      100%     39 us
Ilya-3           6.88        1951.35        47.17%      100%     40 us
Ilya-6 50 us     7.83        1049.21        31.74%      99.7%   228 us
Ilya-4-Jan 50 us 7.75        1086.2         30.65%      99.7%   230 us

Discussion:
- Without time-based Tx batching the iperf server CPU is the bottleneck due to virtio interrupt load.
- Ilya-3 does not provide any benefit.
- With 50 us time-based batching the PMD load drops by one third (fewer kicks of the virtio eventfd).
- The iperf throughput increases by 15%, still limited by the vCPU capacity, but the bottleneck moves from the virtio interrupt handlers in the guest kernel to the TCP stack and the iperf process. With multiple iperf threads the VM can fully load the 10G physical link.
- As expected, the RTT latency increases by ~190 us ≈ 4 * 50 us (two OVS hops each on the client and server side).
- There is no significant difference between the CLOCK_MONOTONIC and the TSC-based implementations.


Application 2: dpdk pktgen as representative of DPDK applications:

OVS version  max-latency  Mpps   Avg.PMD cycles/pkt  PMD utilization
----------------------------------------------------------------------
Master       n/a          3.92        305.43         99.65%
Ilya-3       n/a          3.84        310.58         99.31%
Ilya-6       0 us         3.82        312.47         99.67%
Ilya-6       50 us        3.80        314.60         99.65%
Ilya-4-Jan   50 us        3.78        313.65         98.86%

Discussion:
- For DPDK applications in the VM, Tx batching does not provide any throughput benefit.
- At full PMD load the output batching overhead causes a capacity drop of 2-3%.
- There is no significant difference between CLOCK_MONOTONIC and TSC implementations.
- perf top measurements indicate that the clock_gettime call consumes about 0.6% of the PMD cycles. That does not seem enough to justify replacing it with a TSC-based time implementation.

A zip file with the detailed measurement results can be downloaded from 
https://drive.google.com/open?id=0ByBuumQUR_NYNlRzbUhJX2R6NW8


Conclusions:
------------
1. Time based Tx-batching provides significant performance improvements for kernel-based applications.
2. DPDK applications do not benefit in throughput but suffer from the latency increase.
3. The worst case overhead implied by Tx batching is about 3% and should be acceptable.
4. As there is an obvious trade-off between throughput improvement and latency increase, the maximum output latency should be a configuration option. Ideally OVS would have a per-switch default parameter and an additional per-interface parameter to override the default.
5. Ilya's CLOCK_MONOTONIC implementation seems efficient enough. There is no urgent need to replace it with a TSC-based clock.
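
For example, the per-switch default could be exposed through other_config with a per-interface override; the key name below is an assumption based on the patch title, not a confirmed OVS interface:

```shell
# Assumed per-switch default (key name taken from the RFC patch title,
# hypothetical interface):
ovs-vsctl set Open_vSwitch . other_config:output-max-latency=50

# Hypothetical per-interface override, e.g. to disable batching on a
# port serving a latency-sensitive DPDK guest:
ovs-vsctl set Interface dpdk0 other_config:output-max-latency=0
```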

Regards, Jan and Vishal

> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets at samsung.com]
> Sent: Monday, 14 August, 2017 14:10
> To: ovs-dev at openvswitch.org; Jan Scheurich
> <jan.scheurich at ericsson.com>
> Cc: Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com>;
> Heetae Ahn <heetae82.ahn at samsung.com>; Vishal Deep Ajmera
> <vishal.deep.ajmera at ericsson.com>; Ilya Maximets
> <i.maximets at samsung.com>
> Subject: [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for
> output-max-latency.

