[ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during packet transmission.

Ilya Maximets i.maximets at samsung.com
Tue Dec 20 08:09:02 UTC 2016


On 19.12.2016 21:05, Bodireddy, Bhanuprakash wrote:
> Thanks Ilya and Aaron for reviewing this patch and providing your comments; my replies are inline.
> 
>> -----Original Message-----
>> From: Ilya Maximets [mailto:i.maximets at samsung.com]
>> Sent: Monday, December 19, 2016 8:41 AM
>> To: Aaron Conole <aconole at redhat.com>; Bodireddy, Bhanuprakash
>> <bhanuprakash.bodireddy at intel.com>
>> Cc: dev at openvswitch.org; Daniele Di Proietto <diproiettod at vmware.com>;
>> Thadeu Lima de Souza Cascardo <cascardo at redhat.com>; Fischetti, Antonio
>> <antonio.fischetti at intel.com>; Markus Magnusson
>> <markus.magnusson at ericsson.com>
>> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during
>> packet transmission.
>>
>> Hi,
>>
>> I didn't test this patch yet. Bhanu, could you please describe your test scenario
>> and performance results in more detail.
> 
> During the recent performance analysis improvements for the classifier, we found that a bottleneck was also observed at flow batching.
> This was due to fewer packets per batch. To reproduce this, a simple P2P test case can be used with 30 IXIA streams and matching IP flow rules.
> Eg:  For an IXIA stream with src IP: 2.2.2.1, dst IP: 5.5.5.1
>         ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=2.2.2.1,actions=output:2
> 
>         For an IXIA stream with src IP: 4.4.4.1, dst IP: 15.15.15.1
>         ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=4.4.4.1,actions=output:2
> 
> This leaves fewer packets in batches, and packet_batch_per_flow_execute() is invoked for every batch. 
> With 30 flows, I see the throughput drop to ~7.0 Mpps in the PHY2PHY case for 64 byte UDP packets. 
> 
>> It would be nice if you provided throughput and latency measurement results for
>> different scenarios and packet sizes. Latency is important here.
> We have yet to do latency measurements for this case. With 30 IXIA streams comprising 64 byte UDP packets, there was
> a throughput improvement of 30% in the P2P case and 13-15% in the PVP case (single queue). We will try to get the latency stats
> with and without this patch.
> 
>>
>> About the patch itself:
>>
>>  1. 'drain' is called only for PMD threads. This will break
>>     connectivity for non-PMD ports.
> I see that non-PMD ports are handled by the vswitchd thread. 
> I tested the PVP loopback case with tap ports and found it working as expected. Can you let me know the specific case
> you are referring to here so that I can verify whether the patch breaks it.

I meant something like this:


 *-----------HOST-1(br-int)-----*               *---------HOST-2(br-int)------*
 |                              |               |                             |
 |   internal_port <--> dpdk0 <-------------------> dpdk0 <--> internal_port  |
 |   192.168.0.1/24             |               |              192.168.0.2/24 |
 *------------------------------*               *-----------------------------*

 (HOST-1)# ping 192.168.0.2

In this case I'm expecting that the first (NETDEV_MAX_BURST - 1) ICMP packets will
get stuck in the TX queue. The next packet will cause the whole batch of
NETDEV_MAX_BURST ICMP packets to be sent. That is not good behaviour.

>>  2. 'xps_get_tx_qid()' is called twice. First time on send and
>>     the second time on drain. This may lead to different
>>     returned 'tx_qid's, and packets will be stuck forever in
>>     the tx buffer.
> 
> You are right, and xps_get_tx_qid() can return different tx_qids.
> I was always testing with XPS disabled and completely overlooked this case.
> This could be a potential problem for us, and any suggestions would be very helpful.
> 
>>
>>  3. 'txq_drain()' must take the 'tx_lock' for queue in case
>>     of dynamic tx queues.
> 
> Agree, will handle this in next version. 
> 
>>
>>  4. Waiting for 1024 polling cycles of PMD thread may cause
>>     a huge latency if we have few packets per second on one
>>     port and intensive traffic on others.
> 
> I agree with you. We discussed this here and thought invoking the drain
> logic once every 1024 polling cycles was an optimal solution. But if the community thinks otherwise,
> we can move this into the main 'for' loop so that it can be invoked more often, provided that
> 'DRAIN_TSC' cycles have elapsed. 
> 
>>
>>  5. This patch breaks the counting of 'tx_dropped' packets
>>     in netdev-dpdk.
> 
> I presume you are referring to the line below in the code. 
> 
> - dropped += netdev_dpdk_eth_tx_burst() 
> + netdev_dpdk_eth_tx_queue()
> 
> Will handle this in v2 of this patch.
> 
>>
>>  6. Comments in netdev-provider.h should be fixed to reflect
>>     all the changes.
> 
> Will make necessary changes in netdev-provider.h
> 
>>
>>  7. At last, I agree with Aaron that explicitly allowing only
>>     'dpdk' ports is not a good style. Also, mentioning the name of
>>     an exact netdev inside the common code is a bad style too.
> 
> Will handle this appropriately.
> 
> Regards,
> Bhanu Prakash. 
> 
>> Best regards, Ilya Maximets.
>>
>> On 16.12.2016 22:24, Aaron Conole wrote:
>>> Hi Bhanu,
>>>
>>> Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com> writes:
>>>
>>>> In exact match cache processing, on an EMC hit, packets are queued
>>>> into batches matching the flow. Thereafter, packets are processed in
>>>> batches for faster packet processing. This is particularly
>>>> inefficient if there are fewer packets in a batch, as
>>>> rte_eth_tx_burst() incurs an expensive MMIO write.
>>>>
>>>> This commit adds back the intermediate queue implementation. Packets
>>>> are queued and burst when the packet count >= NETDEV_MAX_BURST. Also,
>>>> the drain logic is refactored to handle fewer packets in the tx queues.
>>>> Testing shows significant performance gains with queueing.
>>>>
>>>> Fixes: b59cc14e032d ("netdev-dpdk: Use instant sending instead of
>>>> queueing of packets.")
>>>> Signed-off-by: Bhanuprakash Bodireddy
>>>> <bhanuprakash.bodireddy at intel.com>
>>>> Signed-off-by: Antonio Fischetti <antonio.fischetti at intel.com>
>>>> Co-authored-by: Antonio Fischetti <antonio.fischetti at intel.com>
>>>> Signed-off-by: Markus Magnusson <markus.magnusson at ericsson.com>
>>>> Co-authored-by: Markus Magnusson <markus.magnusson at ericsson.com>
>>>> ---
>>>
>>> I've Cc'd Ilya just in hopes that the patch gets a better review than
>>> I could give.  As a general comment, I like the direction - batched
>>> operations are usually a better way of going.
>>>
>>> Just a minor comment below.
>>>
>>> ... snip ...
>>>>
>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>> index 3509493..65dff83 100644
>>>> --- a/lib/dpif-netdev.c
>>>> +++ b/lib/dpif-netdev.c
>>>> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
>>>>  static inline bool emc_entry_alive(struct emc_entry *ce);
>>>>  static void emc_clear_entry(struct emc_entry *ce);
>>>>
>>>> +static struct tx_port *pmd_send_port_cache_lookup
>>>> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>>>> +
>>>>  static void
>>>>  emc_cache_init(struct emc_cache *flow_cache)
>>>>  {
>>>> @@ -2877,6 +2880,31 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
>>>>  }
>>>>
>>>>  static void
>>>> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
>>>> +                         struct dp_netdev_port *port,
>>>> +                         uint64_t now)
>>>> +{
>>>> +    int tx_qid;
>>>> +
>>>> +    if (!strcmp(port->type, "dpdk")) {
>>>
>>> Any reason to restrict this only to dpdk ports?  It looks like you've
>>> added a new netdev operation, so why not just call the
>>> netdev_txq_drain unconditionally?
>>>
>>> Also, bit of a nit, but tx_qid can be reduced in scope down to the if
>>> block below.
>>>
>>>> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
>>>> +                 u32_to_odp(port->port_no));
>>>> +
>>>> +        if (OVS_LIKELY(tx)) {
>>>> +            bool dynamic_txqs = tx->port->dynamic_txqs;
>>>> +
>>>> +            if (dynamic_txqs) {
>>>> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
>>>> +            } else {
>>>> +                tx_qid = pmd->static_tx_qid;
>>>> +            }
>>>> +
>>>> +            netdev_txq_drain(port->netdev, tx_qid);
>>>> +        }
>>>> +    }
>>>> +}
>>>> +

