[ovs-dev] [PATCH v8 5/6] dpif-netdev: Time based output batching.
Ilya Maximets
i.maximets at samsung.com
Wed Dec 20 13:43:55 UTC 2017
On 20.12.2017 14:03, Ilya Maximets wrote:
>>>> +
>>>> +        if (need_to_flush) {
>>>> +            /* We didn't receive anything in the process loop.
>>>> +             * Check if we need to send something.
>>>> +             * There was no time updates on current iteration. */
>>>> +            pmd_thread_ctx_time_update(pmd);
>>>> +            process_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
>>>> +            cycles_count_intermediate(pmd, NULL,
>>>> +                                      process_packets ? PMD_CYCLES_PROCESSING
>>>> +                                                      : PMD_CYCLES_IDLE);
>>>
>>> I noticed the processing cycles for an rxq are not counted here. It means
>>> the processing cycles spent to tx pkts will no longer be reflected in the
>>> rxq-to-pmd assignment (or any rxq stats). I realize the tx processing
>>> cycles are now shared, so we will have some inaccuracy anyway, but for an
>>> individual rxq that has to send to vhost, it could be a significant % of
>>> its measured cycles.
>>>
>>> OTOH, this code seems like it would only be hit at low rates (no
>>> packets on any queue during the last poll)? So I'm not sure how
>>> significant it would be in the overall rxq-pmd assignment. E.g. if the rxq
>>> is measured as using 2% or 7% of a pmd it's probably fine for rxq-pmd
>>> assignment, whereas 20% or 70% could create a very imbalanced
>>> distribution.
>>>
>>> If it were significant for rxq-pmd assignment, I'm thinking the best way
>>> would be to add the cycles required to tx flush each port to each rxq
>>> that has packets in the flush. It's over-counting rather than
>>> under-counting, but we can't assume any shared batching after a new assignment.
>>>
>>> Let me know what you think? Do you think it would only impact the rxq
>>> measurements during low traffic rates?
>>>
>>> Kevin.
>
> Hmm. I ran some tests on my PVP setup with bonded PHY, with
> tx-flush-interval=50, packet_size=512B and n_flows=1M.
> (Full environment description available here:
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-December/341700.html )
> Results:
>
>   Packet rate            % of packets flushed
>   (% of 10G line rate)   with rxq = NULL
>   --------------------   --------------------
>           10%                   27.44%
>           25%                   24.03%
>           50%                   23.34%
>           60%                   18.63%
>           65%                   18.40%
>           70%                    3.57%
>           75%                    1.21%
>           95%                    0.04%
>
> Almost all of these flushes happened on the core that polls HW NICs only.
> Somewhere between 65% and 70% of 10G line rate the HW NICs (10G ixgbe)
> start to receive at least one packet on each polling cycle.
>
> Do you think the above numbers are significant enough?
>
> About a possible fix:
> It's hard to find out at send time from which rx queue a packet was
> received. We don't even pass this information to the processing functions.
> One way is to store last_rxq in the pmd context on receive, and to add this
> rxq to a list kept alongside output_pkts inside OVS_ACTION_ATTR_OUTPUT.
> After that we'll have to move 'cycles_count_intermediate' inside
> 'dp_netdev_process_rxq_port' and 'dp_netdev_pmd_flush_output_on_port', where
> we'll count cycles for receiving+processing and sending respectively.
> In this case we'll be able to collect almost accurate cycles for each
> rxq separately, and even to account for the impact of each rxq on the
> batched send and charge it the appropriate share of the shared send cycles.
>
> To me, the above approach looks like the only possible way to keep nearly
> accurate per-rxq statistics, in case we need them.
>
> Thoughts?
>
>>
>> Good catch,
>>
>> I was going to push this today, but I'll hold off until this issue is resolved. I don't want an existing feature such as rxq balancing to be negatively impacted if we can avoid it.
>>
>> Ian
>
> Ian, in general, I'm fine with pushing the series without this patch, because
> simple output batching provides a significant performance improvement by itself
> for bonding/multiflow cases. Maybe we can delay only the time-based approach
> until we have a solution for the rxq-cycles issue.
>
> Best regards, Ilya Maximets.
>
The following incremental should implement the above approach (compile tested only):
-------------------------------------------------------------------
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index a612f99..e490793 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -540,6 +540,7 @@ struct tx_port {
struct hmap_node node;
long long flush_time;
struct dp_packet_batch output_pkts;
+ struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
};
/* A set of properties for the current processing loop that is not directly
@@ -551,6 +552,10 @@ struct dp_netdev_pmd_thread_ctx {
long long now;
/* Used to count cycles. See 'cycles_count_end()' */
unsigned long long last_cycles;
+ /* RX queue from which the last packet was received. */
+ struct dp_netdev_rxq *last_rxq;
+ /* Indicates how the last counted cycles should be treated. */
+ enum pmd_cycles_counter_type current_pmd_cycles_type;
};
/* PMD: Poll modes drivers. PMD accesses devices via polling to eliminate
@@ -3219,42 +3224,53 @@ cycles_counter(void)
/* Fake mutex to make sure that the calls to cycles_count_* are balanced */
extern struct ovs_mutex cycles_counter_fake_mutex;
-/* Start counting cycles. Must be followed by 'cycles_count_end()' */
+/* Start counting cycles. Must be followed by 'cycles_count_end()'.
+ * Counting starts in the idle state. */
static inline void
cycles_count_start(struct dp_netdev_pmd_thread *pmd)
OVS_ACQUIRES(&cycles_counter_fake_mutex)
OVS_NO_THREAD_SAFETY_ANALYSIS
{
+ pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
pmd->ctx.last_cycles = cycles_counter();
}
-/* Stop counting cycles and add them to the counter 'type' */
+/* Stop counting cycles and add them to the counter of the current type. */
static inline void
-cycles_count_end(struct dp_netdev_pmd_thread *pmd,
- enum pmd_cycles_counter_type type)
+cycles_count_end(struct dp_netdev_pmd_thread *pmd)
OVS_RELEASES(&cycles_counter_fake_mutex)
OVS_NO_THREAD_SAFETY_ANALYSIS
{
unsigned long long interval = cycles_counter() - pmd->ctx.last_cycles;
+ enum pmd_cycles_counter_type type = pmd->ctx.current_pmd_cycles_type;
non_atomic_ullong_add(&pmd->cycles.n[type], interval);
}
-/* Calculate the intermediate cycle result and add to the counter 'type' */
+/* Calculate the intermediate cycle result and add it to the counter of
+ * the current type. */
static inline void
cycles_count_intermediate(struct dp_netdev_pmd_thread *pmd,
- struct dp_netdev_rxq *rxq,
- enum pmd_cycles_counter_type type)
+ struct dp_netdev_rxq **rxqs, int n_rxqs)
OVS_NO_THREAD_SAFETY_ANALYSIS
{
unsigned long long new_cycles = cycles_counter();
unsigned long long interval = new_cycles - pmd->ctx.last_cycles;
+ enum pmd_cycles_counter_type type = pmd->ctx.current_pmd_cycles_type;
+ int i;
+
pmd->ctx.last_cycles = new_cycles;
non_atomic_ullong_add(&pmd->cycles.n[type], interval);
- if (rxq && (type == PMD_CYCLES_PROCESSING)) {
+ if (n_rxqs && (type == PMD_CYCLES_PROCESSING)) {
/* Add to the amount of current processing cycles. */
- non_atomic_ullong_add(&rxq->cycles[RXQ_CYCLES_PROC_CURR], interval);
+ interval /= n_rxqs;
+ for (i = 0; i < n_rxqs; i++) {
+ if (rxqs[i]) {
+ non_atomic_ullong_add(&rxqs[i]->cycles[RXQ_CYCLES_PROC_CURR],
+ interval);
+ }
+ }
}
}
@@ -3299,6 +3315,16 @@ dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
int output_cnt;
bool dynamic_txqs;
uint32_t tx_flush_interval;
+ enum pmd_cycles_counter_type save_pmd_cycles_type;
+
+ /* In case we're in the PMD_CYCLES_PROCESSING state, we need to count
+ * cycles for the rxq we're processing now. */
+ cycles_count_intermediate(pmd, &pmd->ctx.last_rxq, 1);
+
+ /* Save the current cycle counting state so it can be restored after
+ * accounting the send cycles. */
+ save_pmd_cycles_type = pmd->ctx.current_pmd_cycles_type;
+ pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_PROCESSING;
dynamic_txqs = p->port->dynamic_txqs;
if (dynamic_txqs) {
@@ -3323,6 +3349,9 @@ dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
dp_netdev_count_packet(pmd, DP_STAT_SENT_PKTS, output_cnt);
dp_netdev_count_packet(pmd, DP_STAT_SENT_BATCHES, 1);
+ /* Update send cycles for all the rx queues and restore previous state. */
+ cycles_count_intermediate(pmd, p->output_pkts_rxqs, output_cnt);
+ pmd->ctx.current_pmd_cycles_type = save_pmd_cycles_type;
return output_cnt;
}
@@ -3348,7 +3377,7 @@ dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
static int
dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
- struct netdev_rxq *rx,
+ struct dp_netdev_rxq *rxq,
odp_port_t port_no)
{
struct dp_packet_batch batch;
@@ -3356,20 +3385,30 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
int batch_cnt = 0, output_cnt = 0;
dp_packet_batch_init(&batch);
- error = netdev_rxq_recv(rx, &batch);
+
+ cycles_count_intermediate(pmd, NULL, 0);
+ pmd->ctx.last_rxq = rxq;
+ pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_PROCESSING;
+ error = netdev_rxq_recv(rxq->rx, &batch);
+ cycles_count_intermediate(pmd, &rxq, 1);
+
if (!error) {
*recirc_depth_get() = 0;
pmd_thread_ctx_time_update(pmd);
batch_cnt = batch.count;
dp_netdev_input(pmd, &batch, port_no);
+ cycles_count_intermediate(pmd, &rxq, 1);
+
output_cnt = dp_netdev_pmd_flush_output_packets(pmd, false);
} else if (error != EAGAIN && error != EOPNOTSUPP) {
static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
VLOG_ERR_RL(&rl, "error receiving data from %s: %s",
- netdev_rxq_get_name(rx), ovs_strerror(error));
+ netdev_rxq_get_name(rxq->rx), ovs_strerror(error));
}
+ pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
+ pmd->ctx.last_rxq = NULL;
return batch_cnt + output_cnt;
}
@@ -3994,7 +4033,6 @@ dpif_netdev_run(struct dpif *dpif)
struct dp_netdev *dp = get_dp_netdev(dpif);
struct dp_netdev_pmd_thread *non_pmd;
uint64_t new_tnl_seq;
- int process_packets;
bool need_to_flush = true;
ovs_mutex_lock(&dp->port_mutex);
@@ -4007,15 +4045,9 @@ dpif_netdev_run(struct dpif *dpif)
int i;
for (i = 0; i < port->n_rxq; i++) {
- process_packets =
- dp_netdev_process_rxq_port(non_pmd,
- port->rxqs[i].rx,
- port->port_no);
- cycles_count_intermediate(non_pmd, NULL,
- process_packets
- ? PMD_CYCLES_PROCESSING
- : PMD_CYCLES_IDLE);
- if (process_packets) {
+ if (dp_netdev_process_rxq_port(non_pmd,
+ &port->rxqs[i],
+ port->port_no)) {
need_to_flush = false;
}
}
@@ -4026,14 +4058,10 @@ dpif_netdev_run(struct dpif *dpif)
* Check if we need to send something.
* There was no time updates on current iteration. */
pmd_thread_ctx_time_update(non_pmd);
- process_packets = dp_netdev_pmd_flush_output_packets(non_pmd,
- false);
- cycles_count_intermediate(non_pmd, NULL, process_packets
- ? PMD_CYCLES_PROCESSING
- : PMD_CYCLES_IDLE);
+ dp_netdev_pmd_flush_output_packets(non_pmd, false);
}
- cycles_count_end(non_pmd, PMD_CYCLES_IDLE);
+ cycles_count_end(non_pmd);
dpif_netdev_xps_revalidate_pmd(non_pmd, false);
ovs_mutex_unlock(&dp->non_pmd_mutex);
@@ -4213,17 +4241,11 @@ reload:
cycles_count_start(pmd);
for (;;) {
- int process_packets;
bool need_to_flush = true;
for (i = 0; i < poll_cnt; i++) {
- process_packets =
- dp_netdev_process_rxq_port(pmd, poll_list[i].rxq->rx,
- poll_list[i].port_no);
- cycles_count_intermediate(pmd, poll_list[i].rxq,
- process_packets ? PMD_CYCLES_PROCESSING
- : PMD_CYCLES_IDLE);
- if (process_packets) {
+ if (dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
+ poll_list[i].port_no)) {
need_to_flush = false;
}
}
@@ -4233,10 +4255,7 @@ reload:
* Check if we need to send something.
* There was no time updates on current iteration. */
pmd_thread_ctx_time_update(pmd);
- process_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
- cycles_count_intermediate(pmd, NULL,
- process_packets ? PMD_CYCLES_PROCESSING
- : PMD_CYCLES_IDLE);
+ dp_netdev_pmd_flush_output_packets(pmd, false);
}
if (lc++ > 1024) {
@@ -4257,7 +4276,7 @@ reload:
}
}
- cycles_count_end(pmd, PMD_CYCLES_IDLE);
+ cycles_count_end(pmd);
poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
exiting = latch_is_set(&pmd->exit_latch);
@@ -4697,6 +4716,8 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
ovs_mutex_init(&pmd->port_mutex);
cmap_init(&pmd->flow_table);
cmap_init(&pmd->classifiers);
+ pmd->ctx.last_rxq = NULL;
+ pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
pmd_thread_ctx_time_update(pmd);
pmd->next_optimization = pmd->ctx.now + DPCLS_OPTIMIZATION_INTERVAL;
pmd->rxq_next_cycle_store = pmd->ctx.now + PMD_RXQ_INTERVAL_LEN;
@@ -5575,6 +5596,8 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
}
DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
+ p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
+ pmd->ctx.last_rxq;
dp_packet_batch_add(&p->output_pkts, packet);
}
return;
-------------------------------------------------------------------
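As a side note, here is a minimal standalone sketch of the apportioning model
used in 'cycles_count_intermediate()' above: the shared send interval is split
evenly across the per-packet rxq pointers, so an rxq that contributed more
packets to the batch is charged a proportionally larger share. The names here
(fake_rxq, apportion_send_cycles) are invented purely for illustration and are
not OVS APIs:

#include <stdio.h>

#define MAX_BURST 32

struct fake_rxq {
    const char *name;
    unsigned long long proc_cycles;   /* Accumulated processing cycles. */
};

/* Split 'interval' cycles evenly across the (possibly repeated)
 * per-packet rxq pointers, mirroring 'interval /= n_rxqs' above. */
static void
apportion_send_cycles(struct fake_rxq **rxqs, int n_rxqs,
                      unsigned long long interval)
{
    if (!n_rxqs) {
        return;
    }
    interval /= n_rxqs;
    for (int i = 0; i < n_rxqs; i++) {
        if (rxqs[i]) {
            rxqs[i]->proc_cycles += interval;
        }
    }
}

int
main(void)
{
    struct fake_rxq a = { "rxq-a", 0 };
    struct fake_rxq b = { "rxq-b", 0 };
    /* A batch of 4 packets: 3 received from rxq-a, 1 from rxq-b,
     * so rxq-a is charged 3/4 of the shared send cost. */
    struct fake_rxq *per_pkt[MAX_BURST] = { &a, &a, &b, &a };

    apportion_send_cycles(per_pkt, 4, 1000);
    printf("%s: %llu cycles\n", a.name, a.proc_cycles);   /* 750 */
    printf("%s: %llu cycles\n", b.name, b.proc_cycles);   /* 250 */
    return 0;
}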
What do you think?
Best regards, Ilya Maximets.