[ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during packet transmission.

Bodireddy, Bhanuprakash bhanuprakash.bodireddy at intel.com
Mon Dec 19 18:05:14 UTC 2016


Thanks Ilya and Aaron for reviewing this patch and providing your comments; my replies are inline.

>-----Original Message-----
>From: Ilya Maximets [mailto:i.maximets at samsung.com]
>Sent: Monday, December 19, 2016 8:41 AM
>To: Aaron Conole <aconole at redhat.com>; Bodireddy, Bhanuprakash
><bhanuprakash.bodireddy at intel.com>
>Cc: dev at openvswitch.org; Daniele Di Proietto <diproiettod at vmware.com>;
>Thadeu Lima de Souza Cascardo <cascardo at redhat.com>; Fischetti, Antonio
><antonio.fischetti at intel.com>; Markus Magnusson
><markus.magnusson at ericsson.com>
>Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during
>packet transmission.
>
>Hi,
>
>I didn't test this patch yet. Bhanu, could you please describe your test scenario
>and performance results in more details.

During the recent classifier performance analysis we found that flow batching was also a bottleneck, due to too few packets per batch.
To reproduce this, a simple P2P test case can be used with 30 IXIA streams and matching IP flow rules.
E.g., for an IXIA stream with src IP 2.2.2.1, dst IP 5.5.5.1:
        ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=2.2.2.1,actions=output:2

        For an IXIA stream with src IP 4.4.4.1, dst IP 15.15.15.1:
        ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=4.4.4.1,actions=output:2

This leaves fewer packets in each batch, and packet_batch_per_flow_execute() is invoked for every batch.
With 30 flows, I see throughput drop to ~7.0 Mpps in the PHY2PHY case for 64-byte UDP packets.
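
For context, the gist of the queueing approach, as a simplified standalone sketch (illustrative names, not the actual patch code): buffer packets per tx queue and call rte_eth_tx_burst() only once NETDEV_MAX_BURST packets have accumulated, so the expensive MMIO doorbell write is amortized over a full burst.

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define NETDEV_MAX_BURST 32

    struct txq_buf {
        struct rte_mbuf *pkts[NETDEV_MAX_BURST];
        uint16_t count;
    };

    /* Queue a packet; hit the NIC only when a full burst is ready. */
    static void
    txq_enqueue(struct txq_buf *b, uint16_t port_id, uint16_t qid,
                struct rte_mbuf *pkt)
    {
        b->pkts[b->count++] = pkt;
        if (b->count == NETDEV_MAX_BURST) {
            uint16_t sent = rte_eth_tx_burst(port_id, qid, b->pkts, b->count);
            while (sent < b->count) {
                rte_pktmbuf_free(b->pkts[sent++]);  /* Tx ring full: drop. */
            }
            b->count = 0;
        }
    }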

>It'll be nice if you provide throughput and latency measurement results for
>different scenarios and packet sizes. Latency is important here.
We have yet to do latency measurements for this case. With 30 IXIA streams of 64-byte UDP packets, we measured a
throughput improvement of 30% in the P2P case and 13-15% in the PVP case (single queue). We will try to get latency stats
with and without this patch.

>
>About the patch itself:
>
>  1. 'drain' called only for PMD threads. This will lead to
>     broken connection with non-PMD ports.
I see that non-PMD ports are handled by the vswitchd thread.
I tested a PVP loopback case with tap ports and found it to be working as expected. Can you let me know the specific case
you are referring to, so that I can verify whether the patch breaks it?

>
>  2. 'xps_get_tx_qid()' called twice. First time on send and
>     the second time on drain. This may lead to different
>     returned 'tx_qid's and packets will stuck forever in
>     tx buffer.

You are right: xps_get_tx_qid() can return different tx_qids on send and drain.
I was always testing with XPS disabled and completely overlooked this case.
This is a potential problem for us, and any suggestions would be very helpful.
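
One direction we are considering (untested sketch; 'last_used_qid' would be a new field in 'struct tx_port', initialized to -1): remember the qid that packets were actually queued on, and drain that queue as soon as XPS hands out a different one:

        if (dynamic_txqs) {
            tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
            if (tx->last_used_qid != -1 && tx->last_used_qid != tx_qid) {
                /* XPS remapped the queue: flush the old queue first so
                 * buffered packets are not stranded on it. */
                netdev_txq_drain(port->netdev, tx->last_used_qid);
            }
            tx->last_used_qid = tx_qid;
        } else {
            tx_qid = pmd->static_tx_qid;
        }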

>
>  3. 'txq_drain()' must take the 'tx_lock' for queue in case
>     of dynamic tx queues.

Agree, will handle this in the next version; see the sketch below.
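
Roughly along these lines in the DPDK drain callback ('count' and 'burst_pkts' stand in for whatever fields the intermediate queue ends up using):

    static int
    netdev_dpdk_txq_drain(struct netdev *netdev, int qid)
    {
        struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
        struct dpdk_tx_queue *txq = &dev->tx_q[qid];

        /* With dynamic (XPS) tx queues several PMD threads may share a
         * queue, so the buffer must be flushed under the queue lock. */
        rte_spinlock_lock(&txq->tx_lock);
        if (txq->count) {
            netdev_dpdk_eth_tx_burst(dev, qid, txq->burst_pkts, txq->count);
            txq->count = 0;
        }
        rte_spinlock_unlock(&txq->tx_lock);

        return 0;
    }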

>
>  4. Waiting for 1024 polling cycles of PMD thread may cause
>     a huge latency if we have few packets per second on one
>     port and intensive traffic on others.

I agree with you. We discussed this here and thought that invoking the drain
logic once every 1024 polling cycles was a reasonable trade-off. But if the community thinks otherwise,
we can move this into the main 'for' loop so that it is invoked more often, whenever
'DRAIN_TSC' cycles have elapsed.
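
That would look roughly like this at the top of the PMD polling loop (sketch; the 'DRAIN_TSC' value and the 'last_drain_tsc' field are assumptions still to be tuned and named properly):

    uint64_t cycles = rte_get_tsc_cycles();

    if (cycles - pmd->last_drain_tsc > DRAIN_TSC) {
        struct tx_port *tx;
        long long now = time_msec();

        /* Flush the intermediate queue of every port this PMD sends to. */
        HMAP_FOR_EACH (tx, node, &pmd->send_port_cache) {
            dp_netdev_drain_txq_port(pmd, tx->port, now);
        }
        pmd->last_drain_tsc = cycles;
    }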

>
>  5. This patch breaks the counting of 'tx_dropped' packets
>     in netdev-dpdk.

I presume you are referring to the lines below in the code.

- dropped += netdev_dpdk_eth_tx_burst() 
+ netdev_dpdk_eth_tx_queue()

Will handle this in v2 of this patch.
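
The intent is to keep the accounting by having the queueing path report the packets it could not queue or transmit, e.g. (sketch, assuming netdev_dpdk_eth_tx_queue() is changed to return that count):

    int dropped = netdev_dpdk_eth_tx_queue(dev, qid, pkts, cnt);

    if (OVS_UNLIKELY(dropped)) {
        rte_spinlock_lock(&dev->stats_lock);
        dev->stats.tx_dropped += dropped;
        rte_spinlock_unlock(&dev->stats_lock);
    }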

>
>  6. Comments in netdev-provider.h should be fixed to reflect
>     all the changes.

Will make necessary changes in netdev-provider.h

>
>  7. At last, I agree with Aaron that explicit allowing only
>     'dpdk' ports is not a good style. Also, mentioning name of
>     exact netdev inside the common code is a bad style too.

Will handle this appropriately.
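
The plan is to drop the strcmp() and let the netdev layer dispatch, in the same way other optional netdev ops are handled (sketch):

    /* lib/netdev.c */
    int
    netdev_txq_drain(struct netdev *netdev, int qid)
    {
        const struct netdev_class *class = netdev->netdev_class;

        return class->txq_drain
               ? class->txq_drain(netdev, qid)
               : EOPNOTSUPP;
    }

dpif-netdev can then call netdev_txq_drain() unconditionally; netdevs that do not buffer simply leave the hook NULL.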

Regards,
Bhanu Prakash. 

>Best regards, Ilya Maximets.
>
>On 16.12.2016 22:24, Aaron Conole wrote:
>> Hi Bhanu,
>>
>> Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com> writes:
>>
>>> In exact match cache processing on an EMC hit, packets are queued in
>>> to batches matching the flow. Thereafter, packets are processed in
>>> batches for faster packet processing. This particularly is
>>> inefficient if there are fewer packets in a batch as
>>> rte_eth_tx_burst() incurs expensive MMIO write.
>>>
>>> This commit adds back intermediate queue implementation. Packets are
>>> queued and burst when the packet count >= NETDEV_MAX_BURST. Also
>>> drain logic is refactored to handle fewer packets in the tx queues.
>>> Testing shows significant performance gains with queueing.
>>>
>>> Fixes: b59cc14e032d("netdev-dpdk: Use instant sending instead of
>>> queueing of packets.")
>>> Signed-off-by: Bhanuprakash Bodireddy
>>> <bhanuprakash.bodireddy at intel.com>
>>> Signed-off-by: Antonio Fischetti <antonio.fischetti at intel.com>
>>> Co-authored-by: Antonio Fischetti <antonio.fischetti at intel.com>
>>> Signed-off-by: Markus Magnusson <markus.magnusson at ericsson.com>
>>> Co-authored-by: Markus Magnusson <markus.magnusson at ericsson.com>
>>> ---
>>
>> I've Cc'd Ilya just in hopes that the patch gets a better review than
>> I could give.  As a general comment, I like the direction - batched
>> operations are usually a better way of going.
>>
>> Just minor below.
>>
>> ... snip ...
>>>
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>> index 3509493..65dff83 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
>>>  static inline bool emc_entry_alive(struct emc_entry *ce);
>>>  static void emc_clear_entry(struct emc_entry *ce);
>>>
>>> +static struct tx_port *pmd_send_port_cache_lookup
>>> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>>> +
>>>  static void
>>>  emc_cache_init(struct emc_cache *flow_cache)
>>>  {
>>> @@ -2877,6 +2880,31 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
>>>  }
>>>
>>>  static void
>>> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
>>> +                         struct dp_netdev_port *port,
>>> +                         uint64_t now)
>>> +{
>>> +    int tx_qid;
>>> +
>>> +    if (!strcmp(port->type, "dpdk")) {
>>
>> Any reason to restrict this only to dpdk ports?  It looks like you've
>> added a new netdev operation, so why not just call the
>> netdev_txq_drain unconditionally?
>>
>> Also, bit of a nit, but tq_qid can be reduced in scope down to the if
>> block below.
>>
>>> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
>>> +                 u32_to_odp(port->port_no));
>>> +
>>> +        if (OVS_LIKELY(tx)) {
>>> +            bool dynamic_txqs = tx->port->dynamic_txqs;
>>> +
>>> +            if (dynamic_txqs) {
>>> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
>>> +            } else {
>>> +                tx_qid = pmd->static_tx_qid;
>>> +            }
>>> +
>>> +            netdev_txq_drain(port->netdev, tx_qid);
>>> +        }
>>> +    }
>>> +}
>>> +
>>
>>
>>

