[ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during packet transmission.

Fischetti, Antonio antonio.fischetti at intel.com
Thu Jan 12 17:40:00 UTC 2017


Hi Ilya,
thanks for your detailed feedback. I've added a couple of replies 
inline about when the instant send happens and the may_steal param.


Regards,
Antonio

> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets at samsung.com]
> Sent: Tuesday, December 20, 2016 12:47 PM
> To: Bodireddy, Bhanuprakash <bhanuprakash.bodireddy at intel.com>; Aaron
> Conole <aconole at redhat.com>
> Cc: dev at openvswitch.org; Daniele Di Proietto <diproiettod at vmware.com>;
> Thadeu Lima de Souza Cascardo <cascardo at redhat.com>; Fischetti, Antonio
> <antonio.fischetti at intel.com>; Markus Magnusson
> <markus.magnusson at ericsson.com>
> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue during
> packet transmission.
> 
> On 20.12.2016 15:19, Bodireddy, Bhanuprakash wrote:
> >> -----Original Message-----
> >> From: Ilya Maximets [mailto:i.maximets at samsung.com]
> >> Sent: Tuesday, December 20, 2016 8:09 AM
> >> To: Bodireddy, Bhanuprakash <bhanuprakash.bodireddy at intel.com>; Aaron
> >> Conole <aconole at redhat.com>
> >> Cc: dev at openvswitch.org; Daniele Di Proietto <diproiettod at vmware.com>;
> >> Thadeu Lima de Souza Cascardo <cascardo at redhat.com>; Fischetti, Antonio
> >> <antonio.fischetti at intel.com>; Markus Magnusson
> >> <markus.magnusson at ericsson.com>
> >> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue
> during
> >> packet transmission.
> >>
> >> On 19.12.2016 21:05, Bodireddy, Bhanuprakash wrote:
> >>> Thanks Ilya and Aaron for reviewing this patch and providing your
> >> comments, my reply inline.
> >>>
> >>>> -----Original Message-----
> >>>> From: Ilya Maximets [mailto:i.maximets at samsung.com]
> >>>> Sent: Monday, December 19, 2016 8:41 AM
> >>>> To: Aaron Conole <aconole at redhat.com>; Bodireddy, Bhanuprakash
> >>>> <bhanuprakash.bodireddy at intel.com>
> >>>> Cc: dev at openvswitch.org; Daniele Di Proietto
> >>>> <diproiettod at vmware.com>; Thadeu Lima de Souza Cascardo
> >>>> <cascardo at redhat.com>; Fischetti, Antonio
> >>>> <antonio.fischetti at intel.com>; Markus Magnusson
> >>>> <markus.magnusson at ericsson.com>
> >>>> Subject: Re: [ovs-dev] [PATCH] netdev-dpdk: Use intermediate queue
> >>>> during packet transmission.
> >>>>
> >>>> Hi,
> >>>>
> >>>> I didn't test this patch yet. Bhanu, could you please describe your
> >>>> test scenario and performance results in more details.
> >>>
> >>> During the recent performance analysis improvements for the classifier,
> >>> we found that a bottleneck was also observed at flow batching.
> >>> This was due to fewer packets in a batch. To reproduce this, a simple
> >>> P2P test case can be used with 30 IXIA streams and matching IP flow rules.
> >>> E.g.: For an IXIA stream with src IP: 2.2.2.1, dst IP: 5.5.5.1
> >>>       ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=2.2.2.1,actions=output:2
> >>>
> >>>       For an IXIA stream with src IP: 4.4.4.1, dst IP: 15.15.15.1
> >>>       ovs-ofctl add-flow br0 dl_type=0x0800,nw_src=4.4.4.1,actions=output:2
> >>>
> >>> This leaves fewer packets in batches, and packet_batch_per_flow_execute()
> >>> is invoked for every batch.
> >>> With 30 flows, I see the throughput drop to ~7.0 Mpps in the PHY2PHY case
> >>> for 64 byte UDP packets.
> >>>
> >>>> It'll be nice if you provide throughput and latency measurement
> >>>> results for different scenarios and packet sizes. Latency is
> >>>> important here.
> >>> We are yet to do latency measurements in this case. With 30 IXIA
> >>> streams comprising 64 byte UDP packets there was a throughput
> >>> improvement of 30% in the P2P case and 13-15% in the PVP case (single
> >>> queue). We will try to get the latency stats with and without this patch.
> >>>
> >>>>
> >>>> About the patch itself:
> >>>>
> >>>>  1. 'drain' called only for PMD threads. This will lead to
> >>>>     broken connection with non-PMD ports.
> >>> I see that non-PMD ports are handled by the vswitchd thread.
> >>> I tested the PVP loopback case with tap ports and found it to be working
> >>> as expected. Can you let me know the specific case you are referring to
> >>> here so that I can verify whether the patch breaks it.
> >>
> >> I meant something like this:
> >>
> >>
> >> *-----------HOST-1(br-int)-----*               *---------HOST-2(br-int)------*
> >> |                              |               |                             |
> >> |   internal_port <--> dpdk0 <-------------------> dpdk0 <--> internal_port  |
> >> |   192.168.0.1/24             |               |             192.168.0.2/24  |
> >> *------------------------------*               *-----------------------------*
> >>
> >> (HOST-1)# ping 192.168.0.2
> >>
> >> In this case I'm expecting that the first (NETDEV_MAX_BURST - 1) ICMP
> >> packets will get stuck in the TX queue. The next packet will cause the
> >> whole batch of NETDEV_MAX_BURST ICMP packets to be sent. That is not
> >> good behaviour.
> >
> > Thanks for the test case. I tested this and found it to be working with
> > the patch. The reason is that PMD threads use the intermediate queue
> > implementation by invoking 'netdev_dpdk_eth_tx_queue()', whereas the
> > non-PMD thread (vswitchd) continues to call 'netdev_dpdk_eth_tx_burst()'
> > in the patch, so there is no change from the previous behavior here.
> >
> > I set up a case similar to the one in the diagram above using 2 hosts
> > and assigning IP addresses to the respective bridges.
> > When I ping Host 2 (192.168.0.2) from Host 1 (192.168.0.1), below is
> > the call path.
> >
> > (HOST-1)# ping 192.168.0.2 -c 1  (send only one packet)
> >
> > When sending the ICMP echo request packet:
> > Vswitchd thread: dp_execute_cb()
> >                    netdev_send()              [pkt sent on dpdk0, qid: 0]
> >                      netdev_dpdk_send__()
> >                        dpdk_do_tx_copy()
> >                          netdev_dpdk_eth_tx_burst()
> >
> > On receiving the ICMP echo reply packet:
> > PMD thread: dp_execute_cb()
> >               netdev_send()                   [pkt sent on br0, qid: 1]
> >                 netdev_linux_send()
> >
> 
> OK. Maybe in this exact case it works, but it isn't guaranteed that
> non-PMD threads will always call send with 'may_steal=false'.
> 

[Antonio F] In my test case (details below) the non-PMD thread
is calling send with 'may_steal=true' and the packet is sent instantly.

Regardless of the may_steal value, netdev_dpdk_send__() calls
dpdk_do_tx_copy() for non-PMD threads because the condition
   (batch->packets[0]->source != DPBUF_DPDK)
is true, as the buffer data is not from DPDK-allocated memory.

In my test case I have one host with
- 2 dpdk ports: dpdk0, dpdk1,
- an IP address assigned to the internal port 'br0', and
- a single flow set as
  ovs-ofctl add-flow br0 actions=NORMAL

When an ARP request is sent from an IXIA generator to dpdk0:
1. the PMD thread forwards it to the internal port br0 with may_steal=0,
   via netdev_linux_send().
2. the PMD thread also forwards the ARP request to dpdk1 with may_steal=1.
   As the buffer data is from DPDK-allocated memory,
   if (OVS_UNLIKELY(!may_steal ||
                     batch->packets[0]->source != DPBUF_DPDK))
   is false => the ARP request is put into the queue.


When an ARP reply is received from br0:
1. the non-PMD thread forwards it to dpdk0 with may_steal=1.
   As the buffer data is NOT from DPDK-allocated memory, the condition
   if (OVS_UNLIKELY(!may_steal ||
                     batch->packets[0]->source != DPBUF_DPDK))
   is true => dpdk_do_tx_copy() is called and the packet is sent out
   immediately.


> Also, I see another issue here:
> 
>     8. netdev_send(.., may_steal=true, ...) uses instant send while
>        netdev_send(.., may_steal=false, ...) uses queueing of packets.
>        This behaviour will lead to packet reordering issues in some
>        cases.

[Antonio F] The instant send happens when, in netdev_dpdk_send__(), the condition

   if (OVS_UNLIKELY(!may_steal ||
                     batch->packets[0]->source != DPBUF_DPDK))

is true => dpdk_do_tx_copy() => netdev_dpdk_eth_tx_burst() is called.
Can you please provide more details about the packet reordering issue you mentioned?

> 
>     9. You're decreasing 'txq->count' inside 'netdev_dpdk_eth_tx_burst()'
>        while it can be called with an argument different from
>        'txq->burst_pkts'. Some of the queued packets will be lost + memory
>        leak.
> 
> Best regards, Ilya Maximets.
> 
> >
> >>
> >>>>  2. 'xps_get_tx_qid()' called twice. First time on send and
> >>>>     the second time on drain. This may lead to different
> >>>>     returned 'tx_qid's and packets will stuck forever in
> >>>>     tx buffer.
> >>>
> >>> You are right and xps_get_tx_qid() can return different tx_qids.
> >>> I was always testing XPS-disabled cases and completely overlooked
> >>> this one.
> >>> This could be a potential problem for us, and any suggestions would be
> >>> very helpful.
> >>>
> >>>>
> >>>>  3. 'txq_drain()' must take the 'tx_lock' for queue in case
> >>>>     of dynamic tx queues.
> >>>
> >>> Agree, will handle this in next version.
> >>>
> >>>>
> >>>>  4. Waiting for 1024 polling cycles of PMD thread may cause
> >>>>     a huge latency if we have few packets per second on one
> >>>>     port and intensive traffic on others.
> >>>
> >>> I agree with you. We discussed this here and thought invoking the
> >>> drain logic once every 1024 cycles was an optimal solution. But if the
> >>> community thinks otherwise we can move this into the main 'for' loop
> >>> so that it can be invoked more often, provided that 'DRAIN_TSC' cycles
> >>> have elapsed.
> >>>
> >>>>
> >>>>  5. This patch breaks the counting of 'tx_dropped' packets
> >>>>     in netdev-dpdk.
> >>>
> >>> I presume you are referring to below line in the code.
> >>>
> >>> - dropped += netdev_dpdk_eth_tx_burst()
> >>> + netdev_dpdk_eth_tx_queue()
> >>>
> >>> Will handle this in v2 of this patch.
> >>>
> >>>>
> >>>>  6. Comments in netdev-provider.h should be fixed to reflect
> >>>>     all the changes.
> >>>
> >>> Will make necessary changes in netdev-provider.h
> >>>
> >>>>
> >>>>  7. At last, I agree with Aaron that explicit allowing only
> >>>>     'dpdk' ports is not a good style. Also, mentioning name of
> >>>>     exact netdev inside the common code is a bad style too.
> >>>
> >>> Will handle this appropriately.
> >>>
> >>> Regards,
> >>> Bhanu Prakash.
> >>>
> >>>> Best regards, Ilya Maximets.
> >>>>
> >>>> On 16.12.2016 22:24, Aaron Conole wrote:
> >>>>> Hi Bhanu,
> >>>>>
> >>>>> Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com> writes:
> >>>>>
> >>>>>> In exact match cache processing, on an EMC hit, packets are queued
> >>>>>> into batches matching the flow. Thereafter, packets are processed
> >>>>>> in batches for faster packet processing. This is particularly
> >>>>>> inefficient if there are fewer packets in a batch, as
> >>>>>> rte_eth_tx_burst() incurs an expensive MMIO write.
> >>>>>>
> >>>>>> This commit adds back intermediate queue implementation. Packets
> >>>>>> are queued and burst when the packet count >= NETDEV_MAX_BURST.
> >>>>>> Also drain logic is refactored to handle fewer packets in the tx
> queues.
> >>>>>> Testing shows significant performance gains with queueing.
> >>>>>>
> >>>>>> Fixes: b59cc14e032d ("netdev-dpdk: Use instant sending instead of
> >>>>>> queueing of packets.")
> >>>>>> Signed-off-by: Bhanuprakash Bodireddy
> >>>>>> <bhanuprakash.bodireddy at intel.com>
> >>>>>> Signed-off-by: Antonio Fischetti <antonio.fischetti at intel.com>
> >>>>>> Co-authored-by: Antonio Fischetti <antonio.fischetti at intel.com>
> >>>>>> Signed-off-by: Markus Magnusson <markus.magnusson at ericsson.com>
> >>>>>> Co-authored-by: Markus Magnusson <markus.magnusson at ericsson.com>
> >>>>>> ---
> >>>>>
> >>>>> I've Cc'd Ilya just in hopes that the patch gets a better review
> >>>>> than I could give.  As a general comment, I like the direction -
> >>>>> batched operations are usually a better way of going.
> >>>>>
> >>>>> Just minor below.
> >>>>>
> >>>>> ... snip ...
> >>>>>>
> >>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> >>>>>> index 3509493..65dff83 100644
> >>>>>> --- a/lib/dpif-netdev.c
> >>>>>> +++ b/lib/dpif-netdev.c
> >>>>>> @@ -622,6 +622,9 @@ static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread *pmd,
> >>>>>>  static inline bool emc_entry_alive(struct emc_entry *ce);
> >>>>>>  static void emc_clear_entry(struct emc_entry *ce);
> >>>>>>
> >>>>>> +static struct tx_port *pmd_send_port_cache_lookup
> >>>>>> +    (const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
> >>>>>> +
> >>>>>>  static void
> >>>>>>  emc_cache_init(struct emc_cache *flow_cache)
> >>>>>>  {
> >>>>>> @@ -2877,6 +2880,31 @@ cycles_count_end(struct dp_netdev_pmd_thread *pmd,
> >>>>>>  }
> >>>>>>
> >>>>>>  static void
> >>>>>> +dp_netdev_drain_txq_port(struct dp_netdev_pmd_thread *pmd,
> >>>>>> +                         struct dp_netdev_port *port,
> >>>>>> +                         uint64_t now)
> >>>>>> +{
> >>>>>> +    int tx_qid;
> >>>>>> +
> >>>>>> +    if (!strcmp(port->type, "dpdk")) {
> >>>>>
> >>>>> Any reason to restrict this only to dpdk ports?  It looks like
> >>>>> you've added a new netdev operation, so why not just call the
> >>>>> netdev_txq_drain unconditionally?
> >>>>>
> >>>>> Also, bit of a nit, but tx_qid can be reduced in scope down to the
> >>>>> if block below.
> >>>>>
> >>>>>> +        struct tx_port *tx = pmd_send_port_cache_lookup(pmd,
> >>>>>> +                 u32_to_odp(port->port_no));
> >>>>>> +
> >>>>>> +        if (OVS_LIKELY(tx)) {
> >>>>>> +            bool dynamic_txqs = tx->port->dynamic_txqs;
> >>>>>> +
> >>>>>> +            if (dynamic_txqs) {
> >>>>>> +                tx_qid = dpif_netdev_xps_get_tx_qid(pmd, tx, now);
> >>>>>> +            } else {
> >>>>>> +                tx_qid = pmd->static_tx_qid;
> >>>>>> +            }
> >>>>>> +
> >>>>>> +            netdev_txq_drain(port->netdev, tx_qid);
> >>>>>> +        }
> >>>>>> +    }
> >>>>>> +}
> >>>>>> +
> >>>>>

