[ovs-dev] [PATCH v2] Do not flush tx queue which is shared among CPUs since it is always flushed

Gray, Mark D mark.d.gray at intel.com
Fri Jun 26 13:30:12 UTC 2015


> From: Pravin Shelar [mailto:pshelar at nicira.com]
> 
> On Wed, Jun 17, 2015 at 3:17 AM, Gray, Mark D <mark.d.gray at intel.com>
> wrote:
> >
> >
> >> -----Original Message-----
> >> From: Daniele Di Proietto [mailto:diproiettod at vmware.com]
> >> Sent: Tuesday, June 16, 2015 6:45 PM
> >> To: Pravin Shelar; Wei li; Gray, Mark D
> >> Cc: dev at openvswitch.com
> >> Subject: Re: [ovs-dev] [PATCH v2] Do not flush tx queue which is
> >> shared among CPUs since it is always flushed
> >>
> >>
> >>
> >> On 16/06/2015 07:40, "Pravin Shelar" <pshelar at nicira.com> wrote:
> >>
> >> >On Mon, Jun 8, 2015 at 7:42 PM, Pravin Shelar <pshelar at nicira.com>
> wrote:
> >> >> On Mon, Jun 8, 2015 at 6:13 PM, Wei li <liw at dtdream.com> wrote:
> >> >>> When the tx queue is shared among CPUs, the packets are always
> >> >>> flushed in 'netdev_dpdk_eth_send', so it is unnecessary to flush
> >> >>> in netdev_dpdk_rxq_recv. Otherwise tx will be accessed without
> >> >>> locking.
> >> >>>
> >> >>> Signed-off-by: Wei li <liw at dtdream.com>
> >> >>> ---
> >> >>> v1->v2: fix typo
> >> >>>
> >> >>>  lib/netdev-dpdk.c | 7 +++++--
> >> >>>  1 file changed, 5 insertions(+), 2 deletions(-)
> >> >>>
> >> >>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >> >>> index 63243d8..1e0a483 100644
> >> >>> --- a/lib/netdev-dpdk.c
> >> >>> +++ b/lib/netdev-dpdk.c
> >> >>> @@ -892,8 +892,11 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet **packets,
> >> >>>      int nb_rx;
> >> >>>
> >> >>>      /* There is only one tx queue for this core.  Do not flush other
> >> >>> -     * queueus. */
> >> >>> -    if (rxq_->queue_id == rte_lcore_id()) {
> >> >>> +     * queues.
> >> >>> +     * Do not flush tx queue which is shared among CPUs
> >> >>> +     * since it is always flushed */
> >> >>> +    if (rxq_->queue_id == rte_lcore_id() &&
> >> >>> +               OVS_LIKELY(!dev->txq_needs_locking)) {
> >> >>>          dpdk_queue_flush(dev, rxq_->queue_id);
> >> >>>      }
> >> >>
> >> >> Patch looks good, But Daniele has plan to get rid of queue
> >> >> flushing logic. do you see problem with doing this?
> >> >
> >> >Daniele,
> >> >I am not sure if we should fix this queue flushing logic if we are
> >> >going to remove it. So I would like to discuss it first.
> >> >You mentioned that the removing flushing logic results in
> >> >performance drop. Can you explain it?  How much is performance drop
> >> >and which is the case?
> >>
> >> Hi Pravin,
> >>
> >> sorry for the late reply.  I suspected that removing the queues in
> >> netdev-dpdk was going to have a performance impact in the following
> >> scenario: a batch of packets from different megaflows with the same
> >> action (output on port 1).
> >> Here's what happens:
> >>
> >> 1) With the queues in netdev-dpdk.  dpif-netdev calls netdev_send()
> >>    for each packet.  netdev_dpdk_send() just copies the pointer into
> >>    the queue.  Before receiving the next batch, dpdk_queue_flush()
> >>    calls rte_eth_tx_burst() with a whole batch of packets.
> >> 2) Without the queues in netdev-dpdk.  dpif-netdev calls netdev_send()
> >>    for each packet.  netdev_dpdk_send() calls rte_eth_tx_burst() for
> >>    each packet.
> >>
> >> Scenario 2 could be slower because rte_eth_tx_burst() is called for
> >> each packet (instead of once per batch).  After some testing, this
> >> turns out not to be the case.  It appears that the bottleneck is not
> >> rte_eth_tx_burst(), and that copying the pointers into the temporary
> >> queue makes the code slower.
> >>
> >> What do you think?  Should we just remove the queues?
> >> CC'ing Mark since he expressed interest in the issue
> >
> > Every call to rte_eth_tx_burst() will generate an expensive MMIO
> > write. Best practice would be to burst as many packets as you can to
> > amortize the cost of the MMIO write. I am surprised at your findings
> > for (2) above. Maybe in this case the cost of queuing is greater than
> > the MMIO write?
> >
> 
> Hi Mark,
> Can you explain your test case where you do see a performance slowdown if
> there is no queuing in netdev-dpdk?

Hi Pravin,

I have seen this before from work we did in OVDK. However, to help justify
the assertion here, I have tried to reproduce it with DPDK and OVS.

I was able to reproduce it with the DPDK l2fwd app using the following patch.
I also had to disable the vector PMD (CONFIG_RTE_IXGBE_INC_VECTOR=n in the
DPDK .config file) to allow me to receive mbufs one at a time.

--- a/examples/l2fwd/main.c
+++ b/examples/l2fwd/main.c
@@ -222,11 +222,17 @@ l2fwd_send_packet(struct rte_mbuf *m, uint8_t port)
        qconf->tx_mbufs[port].m_table[len] = m;
        len++;

+#define ENABLE_QUEUES
+#ifdef ENABLE_QUEUES
        /* enough pkts to be sent */
        if (unlikely(len == MAX_PKT_BURST)) {
                l2fwd_send_burst(qconf, MAX_PKT_BURST, port);
                len = 0;
        }
+#else
+	l2fwd_send_burst(qconf, 1, port);
+	len = 0;
+#endif

        qconf->tx_mbufs[port].len = len;
        return 0;
@@ -294,6 +300,7 @@ l2fwd_main_loop(void)
                diff_tsc = cur_tsc - prev_tsc;
                if (unlikely(diff_tsc > drain_tsc)) {

+#ifdef ENABLE_QUEUES
                        for (portid = 0; portid < RTE_MAX_ETHPORTS; portid++) {
                                if (qconf->tx_mbufs[portid].len == 0)
                                        continue;
@@ -302,6 +309,7 @@ l2fwd_main_loop(void)
                                                 (uint8_t) portid);
                                qconf->tx_mbufs[portid].len = 0;
                        }
+#endif

                        /* if timer is enabled */
                        if (timer_period > 0) {
@@ -331,7 +339,7 @@ l2fwd_main_loop(void)

                        portid = qconf->rx_port_list[i];
                        nb_rx = rte_eth_rx_burst((uint8_t) portid, 0,
-                                                pkts_burst, MAX_PKT_BURST);
+                                                pkts_burst, 1);

                        port_statistics[portid].rx += nb_rx;

With these results on my setup:
ENABLE_QUEUES defined:     25 Mpps
ENABLE_QUEUES not defined: 11.7 Mpps

I tried something similar in OVS but didn't see a change in performance! This
may be because there is a bottleneck somewhere else that is masking the cost
of the MMIO writes:

--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -62,6 +62,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
 #define OVS_VPORT_DPDK "ovs_dpdk"

+//#define ENABLE_QUEUES
+
 /*
  * need to reserve tons of extra space in the mbufs so we can align the
  * DMA addresses to 4KB.
@@ -927,15 +929,17 @@ netdev_dpdk_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet **packets,
     struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
     int nb_rx;

+#ifdef ENABLE_QUEUES
     /* There is only one tx queue for this core.  Do not flush other
      * queueus. */
     if (rxq_->queue_id == rte_lcore_id()) {
         dpdk_queue_flush(dev, rxq_->queue_id);
     }
+#endif

     nb_rx = rte_eth_rx_burst(rx->port_id, rxq_->queue_id,
                              (struct rte_mbuf **) packets,
-                             NETDEV_MAX_BURST);
+                             1);
     if (!nb_rx) {
         return EAGAIN;
     }
@@ -1034,7 +1038,7 @@ dpdk_queue_pkts(struct netdev_dpdk *dev, int qid,

         txq->count += tocopy;
         i += tocopy;
-
+#ifdef ENABLE_QUEUES
         if (txq->count == MAX_TX_QUEUE_LEN || txq->flush_tx) {
             dpdk_queue_flush__(dev, qid);
         }
@@ -1042,6 +1046,9 @@ dpdk_queue_pkts(struct netdev_dpdk *dev, int qid,
         if (diff_tsc >= DRAIN_TSC) {
             dpdk_queue_flush__(dev, qid);
         }
+#else
+        dpdk_queue_flush__(dev, qid);
+#endif
     }
 }

With these results on my setup:
ENABLE_QUEUES defined:     3.3 Mpps
ENABLE_QUEUES not defined: 3.3 Mpps

If we were to change the flushing design to remove these queues, we may not
see a performance drop now, but we may inadvertently be introducing another
bottleneck. MMIO writes are expensive: they are ordered, usually not
write-combining, and have to traverse the PCI bus.
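
For reference, the queueing pattern we are discussing looks roughly like the
sketch below. This is a minimal illustration only, not the actual netdev-dpdk
code: MAX_BURST, struct tx_queue and the helper names are made up for this
example. The point is that the doorbell (MMIO) write happens once per
rte_eth_tx_burst() call, so buffering pointers and flushing in batches
amortizes it across up to MAX_BURST packets.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define MAX_BURST 32

struct tx_queue {
    struct rte_mbuf *pkts[MAX_BURST];   /* Software queue of mbuf pointers. */
    uint16_t count;                     /* Number of packets queued. */
};

/* Hand the whole batch to the NIC with a single burst call (and thus a
 * single MMIO doorbell write). */
static void
tx_queue_flush(struct tx_queue *q, uint8_t port_id, uint16_t queue_id)
{
    uint16_t sent = rte_eth_tx_burst(port_id, queue_id, q->pkts, q->count);

    /* Drop whatever the NIC could not accept. */
    while (sent < q->count) {
        rte_pktmbuf_free(q->pkts[sent++]);
    }
    q->count = 0;
}

/* Queue a packet; only flush once a full batch has accumulated. */
static void
tx_queue_send(struct tx_queue *q, struct rte_mbuf *m,
              uint8_t port_id, uint16_t queue_id)
{
    q->pkts[q->count++] = m;
    if (q->count == MAX_BURST) {
        tx_queue_flush(q, port_id, queue_id);
    }
}

In OVS the dpdk_queue_pkts()/dpdk_queue_flush() pair plays this role, with
the additional DRAIN_TSC timeout (visible in the diff above) so that a
partial batch is not held back indefinitely.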

> 
> Thanks.

