[ovs-dev] [PATCH 0/3] Output packet batching.

Ilya Maximets i.maximets at samsung.com
Mon Jul 3 13:30:59 UTC 2017

On 03.07.2017 13:15, Bodireddy, Bhanuprakash wrote:
>> This patch-set was inspired by [1] from Bhanuprakash Bodireddy.
>> The implementation of [1] looks very complex and introduces many pitfalls for
>> later code modifications, like packets possibly getting stuck.
>> This version aims to provide simple and flexible output packet batching at a
>> higher level, without touching (and even simplifying) the netdev layer.
> I didn't test the patches yet. In this series, the batching is done at the dpif layer, whereas in [1] it's in the netdev layer.
> In [1], batching was implemented by introducing an intermediate queue in the netdev layer and had some added
> complexity due to XPS.
> However, I think [1] is more flexible and can be easily tweaked to suit different use cases, with some of the
> APIs potentially consumed by future implementations.

> 1. Why is [1] flexible?
> A PMD thread polls the rxq[s] mapped to it and, post classification, transmits the packets on the tx ports.
> For optimum performance, we need to queue and burst the maximum number of packets to mitigate the transmission (MMIO) cost.
> As it is now (on master), we end up transmitting fewer packets due to the current instant-send logic.
> A bit of background on this patch series:
> In *v1* of [1], we added an intermediate queue and tried flushing the packets once every 1024 PMD polling
> cycles, as below.
> ----------------------pmd_thread_main()--------------------------------------------------
> pmd_thread_main(void *f_) {
>     /* Main polling loop body (simplified): */
>     for (i = 0; i < poll_cnt; i++) {
>         dp_netdev_process_rxq_port(pmd, poll_list[i].rx, poll_list[i].port_no);
>     }
>     if (lc++ > 1024) {
>         lc = 0;
>         if ((now - prev) > DRAIN_TSC) {
>             /* Drain the intermediate txq of every port mapped to this PMD. */
>             HMAP_FOR_EACH (tx_port, node, &pmd->send_port_cache) {
>                 dp_netdev_flush_txq_port(pmd, tx_port->port, now);
>             }
>             prev = now;
>         }
>     }
>     ..
> }
> ---------------------------------------------------------------------------------------------------
> Pros:
> -  The idea behind bursting them once (lc > 1024 cycles), instead of per rxq-port processing, was to queue more
>    packets in the respective txq[s] of the ports and burst them, greatly improving throughput.
> Cons:
> -  Increases latency, as flushing happens only once every 1024 polling cycles.
> Minimizing latency:
>    To minimize latency, 'INTERIM_QUEUE_BURST_THRESHOLD' was introduced, which can be tuned
>    based on the use case (throughput-hungry vs. latency-sensitive). In fact, we also published numbers with
>    BURST_THRESHOLD set to 16 and flushing triggered every 1024 polling cycles. This was done to
>    allow users to tune the thresholds for their respective use cases.
> However, in *V3 [1]* the flushing was performed per rxq processed, to get the patches accepted,
> as latency was raised as the primary concern.

It was introduced in V2. And as soon as you flush all the queues right after each
receive, an implementation on the dpif-netdev layer actually has no logical
difference from yours [1].

Anyway, the ability to collect more packets in output queues can easily be implemented
on the dpif-netdev layer too, in exactly the same way, because it is not a netdev-dependent thing.

So, IMHO, the implementation on the dpif-netdev layer is more flexible, because it has the
same configuration abilities as yours [1] and additionally doesn't require handling
queueing implementations for each netdev separately (i.e. it is more maintainable).
> 2. Why flush APIs in the netdev layer?
> The flush APIs were introduced for DPDK and vHost User ports, as they can be consumed in the future
> by other implementations (e.g. QoS priority queues).

I think that if QoS priority queues are implemented, they should be implemented
using the 'librte_sched' DPDK library. But this library doesn't need any additional
software queues. This means that implementing queueing on the netdev layer will
not help, but will additionally complicate the logic in netdev-dpdk.

On the other hand, if 'librte_sched' will not be used, then such priority queueing
should be implemented on the dpif-netdev layer to be generic and useful for all port
types at once. And it can be implemented that way, because it will be netdev-independent.

> Also, the current queueing and flushing logic can be abstracted using the DPDK APIs rte_eth_tx_buffer() and
> rte_eth_tx_buffer_flush(), further simplifying the code a lot.

So, why are you implementing what is already implemented instead of using this API
in your patch-set [1]? I assume it is because of vhost-user ports, which aren't
ethdevs and don't have such an API. But it's still not clear why you're not using it
for hardware NICs, because there is no code reuse between the queueing implementations
for vhost and HW NICs in [1].

My solution is still useful even if we use the tx_buffer() API in netdev-dpdk, because
it works for all netdevs, including vhost-user, and will also reduce the number of DPDK
library calls for HW NICs.

> We were targeting some optimizations in the future, like using rte functions for buffering and
> flushing, and further introducing timer-based flushing logic that invokes flushing in a timely manner
> instead of per rxq port, thereby striking a balance between throughput and latency.

I expect huge difficulties with the implementation of such a solution, because you
will send packets from a different thread (the timer thread). So, there will be
performance issues connected with cache locality and cross-NUMA memory movement.

Actually, your current patch-set [1] has the same issue: sends can be
invoked from threads not local to the buffered packets.

For example: one thread constantly sends a large number of packets to a tx port.
At the same time, another thread pings that port, sending 1 packet periodically.
In an ideal world, this situation leads to the queue being flushed by the second thread
once every few send attempts. But the packets are in the cache of the first thread.
In this scenario, the second thread will load all the buffered packets into its cache and
then send them. This can be very slow, especially if the second thread is located
on a different NUMA node. For all that time, the first thread will not be able
to send packets, because the transmission queue is locked.
The situation becomes even worse if you try to increase the size of the
intermediate queue, because sends from a different thread become more likely.
The same concern is described in the DPDK QoS framework guide (see paragraph

I think the above concern is a big issue with the implementation on the netdev layer.

My implementation at a higher level doesn't have this issue, because each thread
sends only the packets that were initially received by that thread.

Additionally, I think that simplicity is a big advantage of the implementation on the
dpif-netdev layer.

Best regards, Ilya Maximets.

>> The patch set consists of 3 patches. All the functionality is introduced in the first
>> patch. The other two are just cleanups of netdevs, removing unnecessary work.
> The cleanup of code in netdevs is good.
> - Bhanuprakash.
