[ovs-dev] [PATCH v8 5/6] dpif-netdev: Time based output batching.

Kevin Traynor ktraynor at redhat.com
Wed Dec 20 19:17:14 UTC 2017


On 12/20/2017 02:58 PM, Ilya Maximets wrote:
> On 20.12.2017 16:43, Ilya Maximets wrote:
>>
>> On 20.12.2017 14:03, Ilya Maximets wrote:
>>>>>> +
>>>>>> +        if (need_to_flush) {
>>>>>> +            /* We didn't receive anything in the process loop.
>>>>>> +             * Check if we need to send something.
>>>>>> +             * There was no time updates on current iteration. */
>>>>>> +            pmd_thread_ctx_time_update(pmd);
>>>>>> +            process_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
>>>>>> +            cycles_count_intermediate(pmd, NULL,
>>>>>> +                                      process_packets ? PMD_CYCLES_PROCESSING
>>>>>> +                                                      : PMD_CYCLES_IDLE);
>>>>>
>>>>> I noticed the processing cycles for an rxq are not counted here. It means
>>>>> the processing cycles spent to tx pkts will no longer be reflected in the
>>>>> rxq to pmd assignment (or any rxq stats). I realize the tx processing
>>>>> cycles are now shared, so we will have some inaccuracy anyway, but for an
>>>>> individual rxq that has to send to vhost, they could be a significant % of
>>>>> its measured cycles.
>>>>>
>>>>> OTOH, this code seems like it would only be hit at low rates (no packets
>>>>> on any queue during the last poll), so I'm not sure how significant it
>>>>> would be in the overall rxq-pmd assignment. e.g. if the rxq is measured as
>>>>> using 2% or 7% of a pmd it's probably fine for rxq-pmd assignment, whereas
>>>>> 20% or 70% could create a very imbalanced distribution.
>>>>>
>>>>> If it was significant for rxq-pmd assignment, I'm thinking the best way
>>>>> would be to add the cycles required to tx flush each port to each rxq
>>>>> that has packets in the flush. It's over-counting rather than
>>>>> under-counting, but we can't assume any shared batching after a new assignment.
>>>>>
>>>>> Let me know what you think. Do you think it would only impact the rxq
>>>>> measurements at low traffic rates?
>>>>>
>>>>> Kevin.
>>>
>>> Hmm. I ran some tests on my PVP with bonded PHY setup with
>>> tx-flush-interval=50, packet_size=512B and n_flows=1M.
>>> (full environment description available here:
>>> 	https://mail.openvswitch.org/pipermail/ovs-dev/2017-December/341700.html )
>>> Results:
>>>
>>>       Packet rate         % of packets flushed
>>>  (% of 10G line rate)         with rxq = NULL
>>> ---------------------    ------------------------
>>>          10%                      27.44%
>>>          25%                      24.03%
>>>          50%                      23.34%
>>>          60%                      18.63%
>>>          65%                      18.40%
>>>          70%                       3.57%
>>>          75%                       1.21%
>>>          95%                       0.04%
>>>
>>> Almost all of these flushes happened on the core which polls the HW NICs only.
>>> Somewhere between 65% and 70% of 10G line rate the HW NICs (10G ixgbe) start
>>> to receive at least one packet on each polling cycle.
>>>
>>> Do you think the above numbers are big enough?
>>>
>>> About a possible fix:
>>> It's hard to find out from which rx queue a packet was received at send
>>> time. We don't even pass this information to the processing functions.
>>> One way is to add last_rxq to the pmd context on receive and record this
>>> rxq in a list next to output_pkts inside OVS_ACTION_ATTR_OUTPUT.
>>> After that we'll have to move 'cycles_count_intermediate' inside
>>> 'dp_netdev_process_rxq_port' and 'dp_netdev_pmd_flush_output_on_port', where
>>> we'll count cycles for receiving+processing and sending respectively.
>>> In this case we'll be able to collect almost accurate cycle counts for each
>>> rxq separately and even account for the impact of each rxq on the batched
>>> send, adding an appropriate portion of the shared send cycles.
>>>
>>> For me, the above approach looks like the only possible way to keep nearly
>>> accurate per-rxq statistics, in case we need them.
>>>
>>> Thoughts?
>>>
>>>>
>>>> Good catch,
>>>>
>>>> I was going to push this today but I'll hold off until this issue is
>>>> resolved. I don't want an existing feature such as rxq balancing to be
>>>> negatively impacted if we can avoid it.
>>>>
>>>> Ian
>>>
>>> Ian, in general, I'm good with pushing the series without this patch because
>>> simple output batching provides a significant performance improvement by
>>> itself for bonding/multiflow cases. Maybe we can delay only the time-based
>>> approach until we have a solution for the rxq-cycles issue.
>>>
>>> Best regards, Ilya Maximets.
>>>
>>
> 
> I tested the incremental below with a small fix inline. From the batching
> point of view it works. There is no performance difference in the PVP with
> bonded PHY test. All the results from the following link are the same:
> 	https://mail.openvswitch.org/pipermail/ovs-dev/2017-December/341700.html
> 
> RX queue cycles look good at first glance.
> More review and testing of the rxq cycles is needed.
> 
> I'll start formatting proper patches.
> 
> Best regards, Ilya Maximets.
> 
> 
>> The following incremental should implement the above approach (compile tested only):
>>
>> -------------------------------------------------------------------
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index a612f99..e490793 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -540,6 +540,7 @@ struct tx_port {
>>      struct hmap_node node;
>>      long long flush_time;
>>      struct dp_packet_batch output_pkts;
>> +    struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
>>  };
>>  
>>  /* A set of properties for the current processing loop that is not directly
>> @@ -551,6 +552,10 @@ struct dp_netdev_pmd_thread_ctx {
>>      long long now;
>>      /* Used to count cycles. See 'cycles_count_end()' */
>>      unsigned long long last_cycles;
>> +    /* RX queue from which last packet was received. */
>> +    struct dp_netdev_rxq *last_rxq;
>> +    /* Indicates how the last counted cycles should be treated. */
>> +    enum pmd_cycles_counter_type current_pmd_cycles_type;
>>  };
>>  
>>  /* PMD: Poll modes drivers.  PMD accesses devices via polling to eliminate
>> @@ -3219,42 +3224,53 @@ cycles_counter(void)
>>  /* Fake mutex to make sure that the calls to cycles_count_* are balanced */
>>  extern struct ovs_mutex cycles_counter_fake_mutex;
>>  
>> -/* Start counting cycles.  Must be followed by 'cycles_count_end()' */
>> +/* Start counting cycles.  Must be followed by 'cycles_count_end()'.
>> + * Counting starts from the idle type state.  */
>>  static inline void
>>  cycles_count_start(struct dp_netdev_pmd_thread *pmd)
>>      OVS_ACQUIRES(&cycles_counter_fake_mutex)
>>      OVS_NO_THREAD_SAFETY_ANALYSIS
>>  {
>> +    pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
>>      pmd->ctx.last_cycles = cycles_counter();
>>  }
>>  
>> -/* Stop counting cycles and add them to the counter 'type' */
>> +/* Stop counting cycles and add them to the counter of the current type. */
>>  static inline void
>> -cycles_count_end(struct dp_netdev_pmd_thread *pmd,
>> -                 enum pmd_cycles_counter_type type)
>> +cycles_count_end(struct dp_netdev_pmd_thread *pmd)
>>      OVS_RELEASES(&cycles_counter_fake_mutex)
>>      OVS_NO_THREAD_SAFETY_ANALYSIS
>>  {
>>      unsigned long long interval = cycles_counter() - pmd->ctx.last_cycles;
>> +    enum pmd_cycles_counter_type type = pmd->ctx.current_pmd_cycles_type;
>>  
>>      non_atomic_ullong_add(&pmd->cycles.n[type], interval);
>>  }
>>  
>> -/* Calculate the intermediate cycle result and add to the counter 'type' */
>> +/* Calculate the intermediate cycle result and add to the counter of
>> + * the current type */
>>  static inline void
>>  cycles_count_intermediate(struct dp_netdev_pmd_thread *pmd,
>> -                          struct dp_netdev_rxq *rxq,
>> -                          enum pmd_cycles_counter_type type)
>> +                          struct dp_netdev_rxq **rxqs, int n_rxqs)
>>      OVS_NO_THREAD_SAFETY_ANALYSIS
>>  {
>>      unsigned long long new_cycles = cycles_counter();
>>      unsigned long long interval = new_cycles - pmd->ctx.last_cycles;
>> +    enum pmd_cycles_counter_type type = pmd->ctx.current_pmd_cycles_type;
>> +    int i;
>> +
>>      pmd->ctx.last_cycles = new_cycles;
>>  
>>      non_atomic_ullong_add(&pmd->cycles.n[type], interval);
>> -    if (rxq && (type == PMD_CYCLES_PROCESSING)) {
>> +    if (n_rxqs && (type == PMD_CYCLES_PROCESSING)) {
>>          /* Add to the amount of current processing cycles. */
>> -        non_atomic_ullong_add(&rxq->cycles[RXQ_CYCLES_PROC_CURR], interval);
>> +        interval /= n_rxqs;
>> +        for (i = 0; i < n_rxqs; i++) {
>> +            if (rxqs[i]) {
>> +                non_atomic_ullong_add(&rxqs[i]->cycles[RXQ_CYCLES_PROC_CURR],
>> +                                      interval);
>> +            }
>> +        }
>>      }
>>  }
>>  
>> @@ -3299,6 +3315,16 @@ dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
>>      int output_cnt;
>>      bool dynamic_txqs;
>>      uint32_t tx_flush_interval;
>> +    enum pmd_cycles_counter_type save_pmd_cycles_type;
>> +
>> +    /* In case we're in the PMD_CYCLES_PROCESSING state we need to count
>> +     * cycles for the rxq we're processing now. */
>> +    cycles_count_intermediate(pmd, &pmd->ctx.last_rxq, 1);
>> +
>> +    /* Save current cycles counting state to restore after accounting
>> +     * send cycles. */
>> +    save_pmd_cycles_type = pmd->ctx.current_pmd_cycles_type;
>> +    pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_PROCESSING;
>>  
>>      dynamic_txqs = p->port->dynamic_txqs;
>>      if (dynamic_txqs) {
>> @@ -3323,6 +3349,9 @@ dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
>>      dp_netdev_count_packet(pmd, DP_STAT_SENT_PKTS, output_cnt);
>>      dp_netdev_count_packet(pmd, DP_STAT_SENT_BATCHES, 1);
>>  
>> +    /* Update send cycles for all the rx queues and restore previous state. */
>> +    cycles_count_intermediate(pmd, p->output_pkts_rxqs, output_cnt);
>> +    pmd->ctx.current_pmd_cycles_type = save_pmd_cycles_type;
>>      return output_cnt;
>>  }
>>  
>> @@ -3348,7 +3377,7 @@ dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
>>  
>>  static int
>>  dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>> -                           struct netdev_rxq *rx,
>> +                           struct dp_netdev_rxq *rxq,
>>                             odp_port_t port_no)
>>  {
>>      struct dp_packet_batch batch;
>> @@ -3356,20 +3385,30 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>>      int batch_cnt = 0, output_cnt = 0;
>>  
>>      dp_packet_batch_init(&batch);
>> -    error = netdev_rxq_recv(rx, &batch);
>> +
>> +    cycles_count_intermediate(pmd, NULL, 0);
>> +    pmd->ctx.last_rxq = rxq;
>> +    pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_PROCESSING;
>> +    error = netdev_rxq_recv(rxq->rx, &batch);
>> +    cycles_count_intermediate(pmd, &rxq, 1);
> 
> The above line should be removed. We don't need to count processing cycles here.
> 
>> +
>>      if (!error) {
>>          *recirc_depth_get() = 0;
>>          pmd_thread_ctx_time_update(pmd);
>>  
>>          batch_cnt = batch.count;
>>          dp_netdev_input(pmd, &batch, port_no);
>> +        cycles_count_intermediate(pmd, &rxq, 1);
>> +
>>          output_cnt = dp_netdev_pmd_flush_output_packets(pmd, false);
>>      } else if (error != EAGAIN && error != EOPNOTSUPP) {
>>          static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
>>  
>>          VLOG_ERR_RL(&rl, "error receiving data from %s: %s",
>> -                    netdev_rxq_get_name(rx), ovs_strerror(error));
>> +                    netdev_rxq_get_name(rxq->rx), ovs_strerror(error));
>>      }
>> +    pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
>> +    pmd->ctx.last_rxq = NULL;
>>  
>>      return batch_cnt + output_cnt;
>>  }
>> @@ -3994,7 +4033,6 @@ dpif_netdev_run(struct dpif *dpif)
>>      struct dp_netdev *dp = get_dp_netdev(dpif);
>>      struct dp_netdev_pmd_thread *non_pmd;
>>      uint64_t new_tnl_seq;
>> -    int process_packets;
>>      bool need_to_flush = true;
>>  
>>      ovs_mutex_lock(&dp->port_mutex);
>> @@ -4007,15 +4045,9 @@ dpif_netdev_run(struct dpif *dpif)
>>                  int i;
>>  
>>                  for (i = 0; i < port->n_rxq; i++) {
>> -                    process_packets =
>> -                        dp_netdev_process_rxq_port(non_pmd,
>> -                                                   port->rxqs[i].rx,
>> -                                                   port->port_no);
>> -                    cycles_count_intermediate(non_pmd, NULL,
>> -                                              process_packets
>> -                                              ? PMD_CYCLES_PROCESSING
>> -                                              : PMD_CYCLES_IDLE);
>> -                    if (process_packets) {
>> +                    if (dp_netdev_process_rxq_port(non_pmd,
>> +                                                   &port->rxqs[i],
>> +                                                   port->port_no)) {
>>                          need_to_flush = false;
>>                      }
>>                  }
>> @@ -4026,14 +4058,10 @@ dpif_netdev_run(struct dpif *dpif)
>>               * Check if we need to send something.
>>               * There was no time updates on current iteration. */
>>              pmd_thread_ctx_time_update(non_pmd);
>> -            process_packets = dp_netdev_pmd_flush_output_packets(non_pmd,
>> -                                                                 false);
>> -            cycles_count_intermediate(non_pmd, NULL, process_packets
>> -                                                     ? PMD_CYCLES_PROCESSING
>> -                                                     : PMD_CYCLES_IDLE);
>> +            dp_netdev_pmd_flush_output_packets(non_pmd, false);
>>          }
>>  
>> -        cycles_count_end(non_pmd, PMD_CYCLES_IDLE);
>> +        cycles_count_end(non_pmd);
>>          dpif_netdev_xps_revalidate_pmd(non_pmd, false);
>>          ovs_mutex_unlock(&dp->non_pmd_mutex);
>>  
>> @@ -4213,17 +4241,11 @@ reload:
>>  
>>      cycles_count_start(pmd);
>>      for (;;) {
>> -        int process_packets;
>>          bool need_to_flush = true;
>>  
>>          for (i = 0; i < poll_cnt; i++) {
>> -            process_packets =
>> -                dp_netdev_process_rxq_port(pmd, poll_list[i].rxq->rx,
>> -                                           poll_list[i].port_no);
>> -            cycles_count_intermediate(pmd, poll_list[i].rxq,
>> -                                      process_packets ? PMD_CYCLES_PROCESSING
>> -                                                      : PMD_CYCLES_IDLE);
>> -            if (process_packets) {
>> +            if (dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
>> +                                           poll_list[i].port_no)) {
>>                  need_to_flush = false;
>>              }
>>          }
>> @@ -4233,10 +4255,7 @@ reload:
>>               * Check if we need to send something.
>>               * There was no time updates on current iteration. */
>>              pmd_thread_ctx_time_update(pmd);
>> -            process_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
>> -            cycles_count_intermediate(pmd, NULL,
>> -                                      process_packets ? PMD_CYCLES_PROCESSING
>> -                                                      : PMD_CYCLES_IDLE);
>> +            dp_netdev_pmd_flush_output_packets(pmd, false);
>>          }
>>  
>>          if (lc++ > 1024) {
>> @@ -4257,7 +4276,7 @@ reload:
>>          }
>>      }
>>  
>> -    cycles_count_end(pmd, PMD_CYCLES_IDLE);
>> +    cycles_count_end(pmd);
>>  
>>      poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
>>      exiting = latch_is_set(&pmd->exit_latch);
>> @@ -4697,6 +4716,8 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>>      ovs_mutex_init(&pmd->port_mutex);
>>      cmap_init(&pmd->flow_table);
>>      cmap_init(&pmd->classifiers);
>> +    pmd->ctx.last_rxq = NULL;
>> +    pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
>>      pmd_thread_ctx_time_update(pmd);
>>      pmd->next_optimization = pmd->ctx.now + DPCLS_OPTIMIZATION_INTERVAL;
>>      pmd->rxq_next_cycle_store = pmd->ctx.now + PMD_RXQ_INTERVAL_LEN;
>> @@ -5575,6 +5596,8 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>>              }
>>  
>>              DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
>> +                p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
>> +                                                            pmd->ctx.last_rxq;
>>                  dp_packet_batch_add(&p->output_pkts, packet);
>>              }
>>              return;
>> -------------------------------------------------------------------
>>
>> What do you think?
>>

Thanks Ilya. I didn't test or check the cycle counts, but on first review
it looks good. I had something similar in mind about saving the rxqs
that contributed packets. I'm not sure if there's any performance hit
from saving the rxq and attributing the cycles on a per-packet basis - if
so, we could do something a bit less accurate. In my original email I also
missed the case where a flush is done with pkts from multiple rxqs but the
cycles are attributed solely to one rxq; this patch fixes that too.
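
To double-check my own reading of the attribution model, here is a minimal
standalone sketch (my interpretation of the incremental, not the actual OVS
code - the names fake_rxq and attribute_send_cycles are made up for
illustration): on flush, the measured send interval is divided evenly over
the per-packet source rxq entries, so an rxq that contributed more packets
to the batch picks up a proportionally larger share of the shared send cycles.

-------------------------------------------------------------------
#include <stdio.h>

#define NETDEV_MAX_BURST 32

struct fake_rxq {
    const char *name;
    unsigned long long proc_cycles;   /* stands in for RXQ_CYCLES_PROC_CURR */
};

/* Split 'interval' send cycles evenly across the per-packet rxq entries of
 * the flushed batch, mirroring the 'interval /= n_rxqs' logic above. */
static void
attribute_send_cycles(struct fake_rxq **rxqs, int n_rxqs,
                      unsigned long long interval)
{
    int i;

    if (!n_rxqs) {
        return;
    }
    interval /= n_rxqs;
    for (i = 0; i < n_rxqs; i++) {
        if (rxqs[i]) {
            rxqs[i]->proc_cycles += interval;
        }
    }
}

int
main(void)
{
    struct fake_rxq rxq_a = { "phy0-rxq0", 0 };
    struct fake_rxq rxq_b = { "vhost0-rxq0", 0 };

    /* A flushed batch of 4 packets: 3 received on rxq_a, 1 on rxq_b. */
    struct fake_rxq *sources[NETDEV_MAX_BURST] = {
        &rxq_a, &rxq_a, &rxq_a, &rxq_b,
    };

    /* Say the whole send took 4000 cycles: each packet's source rxq is
     * charged 1000, so rxq_a accumulates 3000 and rxq_b gets 1000. */
    attribute_send_cycles(sources, 4, 4000);

    printf("%s: %llu cycles\n", rxq_a.name, rxq_a.proc_cycles);
    printf("%s: %llu cycles\n", rxq_b.name, rxq_b.proc_cycles);
    return 0;
}
-------------------------------------------------------------------

If that reading is right, an rxq that only occasionally shares a batch won't
be charged much of the send cost, which seems like the right behaviour for
the rxq-pmd assignment.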

I'm not sure yet how much rework this means for the patches Jan sent for
pmd debug or the ones I sent for balance stats. It seems a lot of people are
finishing up for the holidays in the next day or two, so I think we have
some more time for discussion and testing before it is merged anyway.

btw, Bhanu/Ilya thanks for tx batching :-)

Kevin.

>> Best regards, Ilya Maximets.
>>


