[ovs-dev] [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.

Jan Scheurich jan.scheurich at ericsson.com
Wed Sep 20 14:57:13 UTC 2017


Hi Ilya,

I have spent some more time analyzing and thinking through your latest proposed patch set for time-based Tx batching:

> (Ilya-6): 	Time-based output batching with us resolution using CLOCK_MONOTONIC
> 		(master) + [PATCH v3 1-3/4] Output packet batching +
> 		[PATCH RFC v3 4/4] dpif-netdev: Time based output batching +
> 		[PATCH RFC 1/2] timeval: Introduce time_usec() +
> 		[PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.

I would like to suggest that you re-spin a new version in which you integrate the last three RFC patches as non-RFC, with the following changes/additions:

1. Fold in patch http://patchwork.ozlabs.org/patch/800276/ (dpif-netdev: Keep latest measured time for PMD thread) to store the time with us resolution in the PMD struct. That may seem like a small optimization, but it makes the code much cleaner and helps avoid unnecessary extra system calls to read CLOCK_MONOTONIC.
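
A minimal sketch of the idea (names are illustrative, not the actual patch):

	/* Sketch only: cache the time once per PMD iteration so that all
	 * datapath code can read it without extra clock_gettime() calls. */
	struct dp_netdev_pmd_thread {
	    /* ... existing members ... */
	    long long int now_us;   /* Latest measured time, in microseconds. */
	};

	static inline void
	pmd_thread_update_time(struct dp_netdev_pmd_thread *pmd)
	{
	    pmd->now_us = time_usec();  /* One CLOCK_MONOTONIC read per iteration. */
	}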

2. Don't set port->output_time when you enqueue a new batch to an output port in dp_execute_cb(), but when you actually send a batch to the netdev in dp_netdev_pmd_flush_output_on_port(). This still ensures that we don't flush more frequently than specified by cur_max_latency (unless the batch-size limit is reached first), but it avoids any unnecessary delay when packets arrive at intervals larger than cur_max_latency (at 50 us this is the case for packet rates below 20 Kpps!). Each packet (batch) is then flushed immediately at the end of its iteration, just as with non-time-based Tx batching; see the sketch below.

In this context it might be good to rename the configuration parameter to something like "tx-batch-gap".
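
To illustrate point 2, a rough sketch (untested; assumes the cached pmd->now_us from point 1 and the cur_max_latency value mentioned above):

	/* Sketch: arm the timer when a batch is actually sent, not when it
	 * is enqueued.  A batch that arrives after a gap longer than
	 * cur_max_latency finds output_time already expired and is flushed
	 * at the end of the same iteration. */
	static void
	dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
	                                   struct tx_port *p)
	{
	    /* ... send the accumulated packets to the netdev ... */
	    p->output_time = pmd->now_us + cur_max_latency;
	}

	/* Flush check at the end of each PMD iteration: */
	if (!dp_packet_batch_is_empty(&p->output_pkts)
	    && pmd->now_us >= p->output_time) {
	    dp_netdev_pmd_flush_output_on_port(pmd, p);
	}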

3. Considering that time-based Tx batching is beneficial if and only if the guest virtio driver is interrupt-based, I believe it would be best if OVS automatically applied time-based Tx batching to vhostuser Tx queues for which the driver has requested interrupts. Unfortunately, this information is today hidden deep inside DPDK's rte_vhost library (file virtio_net.c):

	/* Kick the guest if necessary. */
	if (!(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
			&& (vq->callfd >= 0))
		eventfd_write(vq->callfd, (eventfd_t)1);
	return count;

So to automate this we'd need a new library function in rte_vhost that lets OVS query this queue property. Perhaps it is not too late to get this into DPDK 17.11. How this would interact with the vhostuser PMD is an open question.
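
Such a library function could look roughly like the following (purely hypothetical; the name and the use of rte_vhost internals are invented for illustration):

	/* Hypothetical rte_vhost addition: report whether the guest driver
	 * currently requests interrupts (call notifications) on a vring. */
	int
	rte_vhost_vring_call_enabled(int vid, uint16_t vring_idx)
	{
	    struct virtio_net *dev = get_device(vid);
	    struct vhost_virtqueue *vq;

	    if (!dev || vring_idx >= VHOST_MAX_VRING)
	        return -1;
	    vq = dev->virtqueue[vring_idx];
	    if (!vq || !vq->avail)
	        return -1;
	    /* Same condition the library uses before kicking the guest. */
	    return !(vq->avail->flags & VRING_AVAIL_F_NO_INTERRUPT)
	           && vq->callfd >= 0;
	}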

Having to configure time-based Tx batching per port is only a second-best option. Nova in OpenStack, for example, does not know whether time-based Tx batching is appropriate for a vhostuser port, and there is no Neutron port attribute today that would help determine that.

Thanks, Jan


> -----Original Message-----
> From: ovs-dev-bounces at openvswitch.org [mailto:ovs-dev-bounces at openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Saturday, 02 September, 2017 17:14
> To: dev at openvswitch.org; Ilya Maximets <i.maximets at samsung.com>
> Subject: Re: [ovs-dev] [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.
> 
> Hi,
> 
> Vishal and I have been benchmarking the impact of the several Tx-batching patches on the performance of OVS in the phy-VM-phy
> scenario with different applications in the VM:
> 
> The OVS versions we tested are:
> 
> (master):	OVS master
> (Ilya-3): 	Output batching within one Rx batch:
> 		(master) + [PATCH v3 1-3/4] Output packet batching
> (Ilya-6): 	Time-based output batching with us resolution using CLOCK_MONOTONIC
> 		(Ilya-3) +  [PATCH RFC v3 4/4] dpif-netdev: Time based output batching +
> 		[PATCH RFC 1/2] timeval: Introduce time_usec() +
> 		[PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for output-max-latency.
> (Ilya-4-Jan):	Time-based output batching with us resolution using TSC cycles
> 		(Ilya-3) +  [PATCH RFC v3 4/4] dpif-netdev: Time based output batching +
> 		Incremental patch using TSC cycles in
> 		https://mail.openvswitch.org/pipermail/ovs-dev/2017-August/337402.html
> 
> Application 1: iperf server, representative of kernel applications:
> 
> The iperf server executes in a VM with 2 vCPUs, where both the virtio interrupts and the iperf process are pinned to the same vCPU for best
> performance. The iperf client also runs in a VM on a different server. The OVS nodes on the client and server sides are configured identically.
> 
>                 Iperf                                 iperf CPU  Ping
> OVS version      Gbps   Avg.PMD cycles/pkt  PMD util  host util  rtt
> ------------------------------------------------------------------------
> Master           6.83        1708.63        43.50%      100%     39 us
> Ilya-3           6.88        1951.35        47.17%      100%     40 us
> Ilya-6 50 us     7.83        1049.21        31.74%      99.7%   228 us
> Ilya-4-Jan 50 us 7.75        1086.2         30.65%      99.7%   230 us
> 
> Discussion:
> - Without time-based Tx batching the iperf server CPU is the bottleneck due to virtio interrupt load.
> - Ilya-3 does not provide any benefit.
> - With 50 us time-based batching, the PMD load drops by one third (fewer kicks to the virtio eventfd).
> - The iperf throughput increases by 15%, still limited by the vCPU capacity, but the bottleneck moves from the virtio interrupt handlers
> in the guest kernel to the TCP stack and the iperf process. With multiple iperf threads the guest can fully load the 10G physical link.
> - As expected, the RTT latency increases by roughly 190 us ~= 4 * 50 us: the ping round trip crosses two OVS hops on each of the client
> and server sides, and each hop adds up to 50 us of batching delay.
> - There is no significant difference between the CLOCK_MONOTONIC and the TSC-based implementations.
> 
> 
> Application 2: dpdk pktgen, representative of DPDK applications:
> 
> OVS version  max-latency  Mpps   Avg.PMD cycles/pkt  PMD utilization
> ----------------------------------------------------------------------
> Master       n/a          3.92        305.43         99.65%
> Ilya-3       n/a          3.84        310.58         99.31%
> Ilya-6       0 us         3.82        312.47         99.67%
> Ilya-6       50 us        3.80        314.60         99.65%
> Ilya-4-Jan   50 us        3.78        313.65         98.86%
> 
> Discussion:
> - For DPDK applications in the VM, Tx batching does not provide any throughput benefit.
> - At full PMD load the output batching overhead causes a capacity drop of 2-3%.
> - There is no significant difference between CLOCK_MONOTONIC and TSC implementations.
> - perf top measurements indicate that the clock_gettime system call eats about 0.6% of the PMD cycles. This does not seem enough
> to justify replacing it with a TSC-based time implementation.
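> 
> (For reference, the us-resolution helper of [PATCH RFC 1/2] boils down to roughly the following; sketch only, the actual patch may differ:)
> 
> 	/* Sketch, needs <time.h>: current CLOCK_MONOTONIC time in us. */
> 	static long long int
> 	time_usec(void)
> 	{
> 	    struct timespec ts;
> 
> 	    clock_gettime(CLOCK_MONOTONIC, &ts);
> 	    return (long long int) ts.tv_sec * 1000 * 1000 + ts.tv_nsec / 1000;
> 	}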
> 
> A zip file with the detailed measurement results can be downloaded from
> https://drive.google.com/open?id=0ByBuumQUR_NYNlRzbUhJX2R6NW8
> 
> 
> Conclusions:
> -----------------
> 1. Time-based Tx batching provides significant performance improvements for kernel-based applications.
> 2. DPDK applications do not benefit in throughput but suffer from the latency increase.
> 3. The worst-case overhead implied by Tx batching is about 3% and should be acceptable.
> 4. As there is an obvious trade-off between throughput improvement and latency increase, the maximum output latency should be a
> configuration option. Ideally OVS should have a default parameter per switch and an additional per-interface parameter that
> overrides the default; see the example below.
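> 
> For example (sketch only; the global knob follows the RFC's current naming, the per-interface override is hypothetical):
> 
> 	# Switch-wide default, in microseconds:
> 	ovs-vsctl set Open_vSwitch . other_config:output-max-latency=50
> 	# Hypothetical per-interface override, e.g. to exempt a DPDK guest:
> 	ovs-vsctl set Interface vhu0 other_config:output-max-latency=0
> 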
> 5. Ilya's CLOCK_MONOTONIC implementation seems efficient enough. There is no urgent need to replace it with a TSC-based clock.
> 
> Regards, Jan and Vishal
> 
> > -----Original Message-----
> > From: Ilya Maximets [mailto:i.maximets at samsung.com]
> > Sent: Monday, 14 August, 2017 14:10
> > To: ovs-dev at openvswitch.org; Jan Scheurich
> > <jan.scheurich at ericsson.com>
> > Cc: Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com>;
> > Heetae Ahn <heetae82.ahn at samsung.com>; Vishal Deep Ajmera
> > <vishal.deep.ajmera at ericsson.com>; Ilya Maximets
> > <i.maximets at samsung.com>
> > Subject: [PATCH RFC 2/2] dpif-netdev: Use microseconds granularity for
> > output-max-latency.

