[ovs-dev] [PATCH] RFC for support of PMD Auto load balancing

Nitin Katiyar nitin.katiyar at ericsson.com
Mon Oct 29 02:37:03 UTC 2018


Thanks, Kevin, for reviewing it. I will look into your comments and send a new version for review.

I would like to clarify that the load is sampled every 10 seconds, and a dry run is triggered only if the criteria for triggering it (i.e. load threshold and/or drops) are met for 6 consecutive iterations. This is what rxq->overloading_pmd is needed for.

Regards,
Nitin

-----Original Message-----
From: Kevin Traynor [mailto:ktraynor at redhat.com] 
Sent: Friday, October 26, 2018 3:53 PM
To: Nitin Katiyar <nitin.katiyar at ericsson.com>; ovs-dev at openvswitch.org
Subject: Re: [ovs-dev] [PATCH] RFC for support of PMD Auto load balancing

Hi Nitin,

Thanks for your work on this and sharing the RFC. Initial comments below.

On 10/11/2018 08:59 PM, Nitin Katiyar wrote:
> Port rx queues that have not been statically assigned to PMDs are 
> currently assigned based on periodically sampled load measurements.
> The assignment is performed at specific instances – port addition, 
> port deletion, upon reassignment request via CLI etc.
> 
> Over time it can cause uneven load among the PMDs due to changes in
> traffic pattern, resulting in lower overall throughput.
> 
> This patch enables the support of auto load balancing of PMDs based on 
> measured load of RX queues. Each PMD measures the processing load for 
> each of its associated queues every 10 seconds. If the aggregated PMD 
> load exceeds a configured threshold for 6 consecutive intervals and if 
> there are receive packet drops at the NIC, the PMD considers itself to be overloaded.
> 
> If any PMD considers itself to be overloaded, a dry-run of the PMD 
> assignment algorithm is performed by OVS main thread. The dry-run does 
> NOT change the existing queue to PMD assignments.
> 
> If the resultant mapping of the dry run indicates an improved distribution
> of the load then the actual reassignment will be performed. The
> automatic rebalancing is disabled by default and has to be
> enabled via a configuration option. Load thresholds, improvement factor,
> etc. are also configurable.
> 
> The following example commands can be used to set the auto-lb parameters:
> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb="true"
> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-thresh="80"
> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-min-improvement="5"
> ovs-vsctl set open_vswitch . other_config:pmd-auto-lb-drop-check="true"
> 

As you mentioned in the follow-up, this will never be perfect, so there needs to be a way for the user to limit how often it happens even if the criteria above are met. Something like allowing the user to set a maximum number of rebalances for some time period, e.g. pmd-auto-lb-max-num and pmd-auto-lb-time.
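For illustration, a rough sketch of such a limit that the main thread could consult just before acting on a successful dry run (the knob names and values are only placeholders, not an agreed interface):

static bool
pmd_rebalance_allowed(long long int now)
{
    /* Hypothetical limits; in practice these would come from other_config,
     * e.g. pmd-auto-lb-max-num and pmd-auto-lb-time. */
    const unsigned int max_num = 3;
    const long long int window_ms = 600000;
    static long long int window_start;
    static unsigned int done_in_window;

    if (now - window_start >= window_ms) {
        window_start = now;
        done_in_window = 0;
    }
    if (done_in_window >= max_num) {
        return false;
    }
    /* Only call this when about to perform an actual reassignment. */
    done_in_window++;
    return true;
}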

> Co-authored-by: Rohith Basavaraja <rohith.basavaraja at gmail.com>
> 
> Signed-off-by: Nitin Katiyar <nitin.katiyar at ericsson.com>
> Signed-off-by: Rohith Basavaraja <rohith.basavaraja at gmail.com>
> ---
>  lib/dpif-netdev.c | 589 
> +++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 561 insertions(+), 28 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index e322f55..28593cc 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -80,6 +80,26 @@
>  
>  VLOG_DEFINE_THIS_MODULE(dpif_netdev);
>  
> +/* Auto Load Balancing Defaults */
> +#define ACCEPT_IMPROVE_DEFAULT   (25)
> +#define PMD_LOAD_THRE_DEFAULT    (99)
> +#define PMD_AUTO_LB_DISABLE      false
> +#define SKIP_DROP_CHECK_DEFAULT  false
> +
> +//TODO: Should we make it configurable??
> +#define PMD_MIN_NUM_DROPS        (1)
> +#define PMD_MIN_NUM_QFILLS       (1)

It seems like a very small default. This would indicate that one dropped pkt or one vhost q full is considered enough to do a dry run.

> +#define PMD_REBALANCE_POLL_TIMER_INTERVAL 60000
> +
> +extern uint32_t log_q_thr;
> +
> +static bool pmd_auto_lb = PMD_AUTO_LB_DISABLE;
> +static bool auto_lb_skip_drop_check = SKIP_DROP_CHECK_DEFAULT;
> +static float auto_lb_pmd_load_ther = PMD_LOAD_THRE_DEFAULT;
> +static unsigned int auto_lb_accept_improve = ACCEPT_IMPROVE_DEFAULT;
> +static long long int pmd_rebalance_poll_timer = 0;
> +

I think these can be in 'struct dp_netdev', like the other config items, instead of globals.
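As a sketch of what that could look like (field names are only illustrative), the knobs could be grouped into one struct embedded in 'struct dp_netdev':

struct pmd_auto_lb {
    bool is_enabled;                    /* pmd-auto-lb */
    bool drop_check;                    /* pmd-auto-lb-drop-check */
    unsigned int load_threshold;        /* pmd-auto-lb-thresh, percent */
    unsigned int accept_improvement;    /* pmd-auto-lb-min-improvement, percent */
    long long int rebalance_poll_timer; /* Last periodic check, in msec. */
};

/* ...and in struct dp_netdev, something like: struct pmd_auto_lb pmd_alb; */

That way dpif_netdev_set_config() can update them alongside the other per-datapath options.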

> +
>  #define FLOW_DUMP_MAX_BATCH 50
>  /* Use per thread recirc_depth to prevent recirculation loop. */
>  #define MAX_RECIRC_DEPTH 6
> @@ -393,6 +413,8 @@ enum rxq_cycles_counter_type {
>                                     interval. */
>      RXQ_CYCLES_PROC_HIST,       /* Total cycles of all intervals that are used
>                                     during rxq to pmd assignment. */
> +    RXQ_CYCLES_IDLE_CURR,       /* Cycles spent in idling. */
> +    RXQ_CYCLES_IDLE_HIST,       /* Total cycles of all idle intervals. */

I'm not sure if it's really needed to measure this, or if you can just use 'pmd->intrvl_cycles - sum(rxq intrvl's on that pmd)' like in pmd_info_show_rxq(). It would be worth trying that and seeing if there's any noticeable difference.
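Something along these lines (a sketch only, reusing the per-rxq busy intervals that are already recorded; 'total_cycles' would be the pmd's cycle count over the same window, as in pmd_info_show_rxq()):

static uint64_t
pmd_interval_idle_cycles(struct dp_netdev_pmd_thread *pmd,
                         uint64_t total_cycles)
    OVS_REQUIRES(pmd->port_mutex)
{
    struct rxq_poll *poll;
    uint64_t busy_cycles = 0;

    /* Sum the busy cycles already recorded for each polled rxq. */
    HMAP_FOR_EACH (poll, node, &pmd->poll_list) {
        for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
            busy_cycles += dp_netdev_rxq_get_intrvl_cycles(poll->rxq, i);
        }
    }
    return total_cycles > busy_cycles ? total_cycles - busy_cycles : 0;
}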

>      RXQ_N_CYCLES
>  };
>  
> @@ -429,6 +451,14 @@ static struct ovsthread_once offload_thread_once
>  
>  #define XPS_TIMEOUT 500000LL    /* In microseconds. */
>  
> +typedef struct {
> +    unsigned long long prev_drops;
> +} q_drops;

Doesn't seem like the struct is needed here

> +typedef struct {
> +    unsigned int num_vhost_qfill;
> +    unsigned int prev_num_vhost_qfill;
> +} vhost_qfill;
> +
>  /* Contained by struct dp_netdev_port's 'rxqs' member.  */
>  struct dp_netdev_rxq {
>      struct dp_netdev_port *port;
> @@ -439,6 +469,10 @@ struct dp_netdev_rxq {
>                                            particular core. */
>      unsigned intrvl_idx;               /* Write index for 'cycles_intrvl'. */
>      struct dp_netdev_pmd_thread *pmd;  /* pmd thread that polls this 
> queue. */
> +    struct dp_netdev_pmd_thread *dry_run_pmd;
> +                                       /* During auto lb trigger, pmd thread
> +                                          associated with this q during dry
> +                                          run. */
>      bool is_vhost;                     /* Is rxq of a vhost port. */
>  
>      /* Counters of cycles spent successfully polling and processing pkts. */
> @@ -446,6 +480,16 @@ struct dp_netdev_rxq {
>      /* We store PMD_RXQ_INTERVAL_MAX intervals of data for an rxq and then
>         sum them to yield the cycles used for an rxq. */
>      atomic_ullong cycles_intrvl[PMD_RXQ_INTERVAL_MAX];
> +
> +    /* Following param are used to determine the load on the PMD
> +     * for automatic load balance
> +     */
> +    atomic_ullong idle_intrvl[PMD_RXQ_INTERVAL_MAX];
> +    union {
> +        q_drops rxq_drops;
> +        vhost_qfill rxq_vhost_qfill;
> +    } rxq_drops_or_qfill;
> +    atomic_uint   overloading_pmd;
>  };
>  
>  /* A port in a netdev-based datapath. */
> @@ -682,6 +726,12 @@ struct dp_netdev_pmd_thread {
>      struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
>      /* List of rx queues to poll. */
>      struct hmap poll_list OVS_GUARDED;
> +
> +    /* List of rx queues got associated during
> +       pmd load balance dry run. These queues are
> +       not polled by pmd. */
> +    struct hmap dry_poll_list OVS_GUARDED;
> +
>      /* Map of 'tx_port's used for transmission.  Written by the main thread,
>       * read by the pmd thread. */
>      struct hmap tx_ports OVS_GUARDED;
> @@ -702,6 +752,11 @@ struct dp_netdev_pmd_thread {
>      /* Keep track of detailed PMD performance statistics. */
>      struct pmd_perf_stats perf_stats;
>  
> +    /* Some stats from previous iteration used by automatic pmd
> +       load balance logic. */
> +    uint64_t prev_stats[PMD_N_STATS];
> +    bool pmd_overloaded; /* Need to make it volatile ?? */
> +
>      /* Set to true if the pmd thread needs to be reloaded. */
>      bool need_reload;
>  };
> @@ -764,7 +819,8 @@ static void dp_netdev_del_port_tx_from_pmd(struct dp_netdev_pmd_thread *pmd,
>                                             struct tx_port *tx)
>      OVS_REQUIRES(pmd->port_mutex);
>  static void dp_netdev_add_rxq_to_pmd(struct dp_netdev_pmd_thread *pmd,
> -                                     struct dp_netdev_rxq *rxq)
> +                                     struct dp_netdev_rxq *rxq,
> +                                     bool dry_run)
>      OVS_REQUIRES(pmd->port_mutex);
>  static void dp_netdev_del_rxq_from_pmd(struct dp_netdev_pmd_thread *pmd,
>                                         struct rxq_poll *poll)
> @@ -780,6 +836,8 @@ static void dp_netdev_pmd_unref(struct dp_netdev_pmd_thread *pmd);
>  static void dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd);
>  static void pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
>      OVS_REQUIRES(pmd->port_mutex);
> +
> +/* PMD AUTO_LB calls */
>  static inline void
>  dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
>                             struct polled_queue *poll_list, int poll_cnt);
> @@ -792,9 +850,11 @@ dp_netdev_rxq_get_cycles(struct dp_netdev_rxq *rx,
>                           enum rxq_cycles_counter_type type);
>  static void
>  dp_netdev_rxq_set_intrvl_cycles(struct dp_netdev_rxq *rx,
> -                           unsigned long long cycles);
> +                                unsigned long long cycles,
> +                                unsigned idx);
>  static uint64_t
> -dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx, unsigned 
> idx);
> +dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx,
> +                                unsigned idx);
>  static void
>  dpif_netdev_xps_revalidate_pmd(const struct dp_netdev_pmd_thread *pmd,
>                                 bool purge);
> @@ -3736,6 +3796,11 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>                          DEFAULT_EM_FLOW_INSERT_INV_PROB);
>      uint32_t insert_min, cur_min;
>      uint32_t tx_flush_interval, cur_tx_flush_interval;
> +    bool drop_check;
> +    bool pmd_auto_lb_config;
> +    unsigned long long pmd_load_config;
> +    float pmd_load;
> +    unsigned int accept_variance;
>  
>      tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
>                                       DEFAULT_TX_FLUSH_INTERVAL);
> @@ -3807,6 +3872,60 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>                    pmd_rxq_assign);
>          dp_netdev_request_reconfigure(dp);
>      }
> +
> +    /* Handle Automatic PMD load balancing parameters. */
> +    pmd_auto_lb_config = smap_get_bool(other_config, "pmd-auto-lb",
> +                              PMD_AUTO_LB_DISABLE);
> +    if (pmd_auto_lb_config && pmd_rxq_assign_cyc) {
> +        drop_check = smap_get_bool(other_config, "pmd-auto-lb-drop-check",
> +                              SKIP_DROP_CHECK_DEFAULT);
> +
> +        pmd_load_config = smap_get_ullong(other_config, "pmd-auto-lb-thresh",
> +                        PMD_LOAD_THRE_DEFAULT);
> +
> +        if (pmd_load_config == PMD_LOAD_THRE_DEFAULT) {
> +            pmd_load = 99.9f;

I don't think floating point is really needed

> +        } else {
> +            pmd_load = (float) pmd_load_config;
> +        }
> +
> +        accept_variance = smap_get_int(other_config,
> +                                       "pmd-auto-lb-min-improvement",
> +                                        ACCEPT_IMPROVE_DEFAULT);
> +
> +        if ((pmd_auto_lb_config != pmd_auto_lb) ||
> +            (drop_check != auto_lb_skip_drop_check) ||
> +            (pmd_load != auto_lb_pmd_load_ther) ||
> +            (accept_variance != auto_lb_accept_improve)) {
> +
> +            VLOG_INFO("PMD auto lb  curr params : "
> +                "pmd-auto-lb-drop-check (%s) "
> +                "pmd-auto-lb-thresh (%.02f%%) "
> +                "pmd-auto-lb-min-improvement (%d%%)\n",
> +                (auto_lb_skip_drop_check ? "true" : "false"),
> +                auto_lb_pmd_load_ther, auto_lb_accept_improve);
> +
> +            VLOG_INFO("PMD auto lb  params changed : "
> +                "pmd-auto-lb-drop-check (%s) "
> +                "pmd-auto-lb-thresh (%.02f%%) "
> +                "pmd-auto-lb-min-improvement (%d%%)\n",
> +                (drop_check ? "true" : "false"),
> +                pmd_load, accept_variance);
> +
> +            auto_lb_skip_drop_check = drop_check;

This looks incorrect - setting pmd-auto-lb-drop-check="false" will result in the drops being checked. It would be better to stick to drop_check naming consistently, rather than mixing it with the seemingly reverse logic of skip_drop_check.

It would also need some additional warnings, e.g. if qfill is set but pmd-perf is not set.
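A minimal sketch of the consistent naming, assuming the global is renamed to auto_lb_drop_check and the default macro is adjusted to match (both names are just placeholders):

#define PMD_AUTO_LB_DROP_CHECK_DEFAULT true

    /* In dpif_netdev_set_config(): */
    bool drop_check = smap_get_bool(other_config, "pmd-auto-lb-drop-check",
                                    PMD_AUTO_LB_DROP_CHECK_DEFAULT);
    if (drop_check != auto_lb_drop_check) {
        auto_lb_drop_check = drop_check;
        dp_netdev_request_reconfigure(dp);
    }

The per-interval check then becomes 'if (auto_lb_drop_check)' instead of 'if (!auto_lb_skip_drop_check)', so "false" really does skip the drop/qfill check.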

> +            auto_lb_pmd_load_ther = pmd_load;
> +            auto_lb_accept_improve = accept_variance;
> +            pmd_auto_lb = pmd_auto_lb_config;
> +
> +            dp_netdev_request_reconfigure(dp);
> +        }
> +    } else if (pmd_auto_lb) {
> +        VLOG_INFO("PMD auto lb is disabled now.\n");
> +        pmd_auto_lb = false;
> +        pmd_rebalance_poll_timer = 0;
> +        dp_netdev_request_reconfigure(dp);

You'd only need this if it has changed since the last time. You can use the example from the pmd-rxq-assign section above.

> +    }
> +
>      return 0;
>  }
>  
> @@ -3934,7 +4053,7 @@ dp_netdev_actions_free(struct dp_netdev_actions *actions)
>  {
>      free(actions);
>  }
> -

> +

You deleted a page break here

>  static void
>  dp_netdev_rxq_set_cycles(struct dp_netdev_rxq *rx,
>                           enum rxq_cycles_counter_type type,
> @@ -3962,9 +4081,9 @@ dp_netdev_rxq_get_cycles(struct dp_netdev_rxq *rx,
>  
>  static void
>  dp_netdev_rxq_set_intrvl_cycles(struct dp_netdev_rxq *rx,
> -                                unsigned long long cycles)
> +                                unsigned long long cycles,
> +                                unsigned idx)
>  {
> -    unsigned int idx = rx->intrvl_idx++ % PMD_RXQ_INTERVAL_MAX;
>      atomic_store_relaxed(&rx->cycles_intrvl[idx], cycles);
>  }
>  
> @@ -3976,6 +4095,23 @@ dp_netdev_rxq_get_intrvl_cycles(struct dp_netdev_rxq *rx, unsigned idx)
>      return processing_cycles;
>  }
>  
> +static void
> +dp_netdev_rxq_set_idle_cycles(struct dp_netdev_rxq *rx,
> +                              unsigned long long cycles,
> +                              unsigned idx)
> +{
> +    atomic_store_relaxed(&rx->idle_intrvl[idx], cycles);
> +}
> +
> +static uint64_t
> +dp_netdev_rxq_get_idle_cycles(struct dp_netdev_rxq *rx,
> +                              unsigned idx)
> +{
> +    unsigned long long idle_cycles;
> +    atomic_read_relaxed(&rx->idle_intrvl[idx], &idle_cycles);
> +    return idle_cycles;
> +}
> +
>  #if ATOMIC_ALWAYS_LOCK_FREE_8B
>  static inline bool
>  pmd_perf_metrics_enabled(const struct dp_netdev_pmd_thread *pmd)
> @@ -4105,6 +4241,12 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>                  if (qfill > s->current.max_vhost_qfill) {
>                      s->current.max_vhost_qfill = qfill;
>                  }
> +                if (qfill >= log_q_thr) {
> +                    vhost_qfill *rxq_vhost_qfill =
> +                            &rxq->rxq_drops_or_qfill.rxq_vhost_qfill;
> +                    atomic_count_inc((struct atomic_count *)
> +                                     &rxq_vhost_qfill->num_vhost_qfill);
> +                }
>              }
>          }
>          /* Process packet batch. */
> @@ -4117,12 +4259,14 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
>          dp_netdev_pmd_flush_output_packets(pmd, false);
>      } else {
>          /* Discard cycles. */
> -        cycle_timer_stop(&pmd->perf_stats, &timer);
> +        cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
>          if (error != EAGAIN && error != EOPNOTSUPP) {
>              static struct vlog_rate_limit rl = 
> VLOG_RATE_LIMIT_INIT(1, 5);
>  
>              VLOG_ERR_RL(&rl, "error receiving data from %s: %s",
>                      netdev_rxq_get_name(rxq->rx), 
> ovs_strerror(error));
> +        } else {
> +            dp_netdev_rxq_add_cycles(rxq, RXQ_CYCLES_IDLE_CURR, 
> + cycles);
>          }
>      }
>  
> @@ -4182,6 +4326,7 @@ port_reconfigure(struct dp_netdev_port *port)
>          }
>  
>          port->rxqs[i].port = port;
> +        port->rxqs[i].dry_run_pmd = NULL;
>          port->rxqs[i].is_vhost = !strncmp(port->type, "dpdkvhost", 
> 9);
>  
>          err = netdev_rxq_open(netdev, &port->rxqs[i].rx, i);
> @@ -4366,7 +4511,8 @@ compare_rxq_cycles(const void *a, const void *b)
>   * The function doesn't touch the pmd threads, it just stores the assignment
>   * in the 'pmd' member of each rxq. */
>  static void
> -rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
> +rxq_scheduling(struct dp_netdev *dp, bool pinned, bool dry_run)
> +    OVS_REQUIRES(dp->port_mutex)
>  {
>      struct dp_netdev_port *port;
>      struct rr_numa_list rr;
> @@ -4400,6 +4546,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                  }
>              } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
>                  uint64_t cycle_hist = 0;
> +                uint64_t cycle_hist_idle = 0;
>  
>                  if (n_rxqs == 0) {
>                      rxqs = xmalloc(sizeof *rxqs); @@ -4411,9 +4558,12 
>                      rxqs = xmalloc(sizeof *rxqs);
> @@ -4411,9 +4558,12 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                      for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
>                          cycle_hist += 
>                          cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> +                        cycle_hist_idle += dp_netdev_rxq_get_idle_cycles(q, i);
>                      dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
>                                               cycle_hist);
> +                    dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_IDLE_HIST,
> +                                             cycle_hist_idle);
>                  }
>                  /* Store the queue. */
>                  rxqs[n_rxqs++] = q;
> @@ -4430,6 +4580,12 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>      rr_numa_list_populate(dp, &rr);
>      /* Assign the sorted queues to pmds in round robin. */
>      for (int i = 0; i < n_rxqs; i++) {
> +        if (!dry_run && rxqs[i]->dry_run_pmd) {
> +            rxqs[i]->pmd = rxqs[i]->dry_run_pmd;
> +            rxqs[i]->dry_run_pmd = NULL;
> +            continue;

It looks like the logs will be printed when the dry run is done but not during the actual assignment.

> +        }
> +
>          numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
>          numa = rr_numa_list_lookup(&rr, numa_id);
>          if (!numa) {
> @@ -4445,28 +4601,48 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                           netdev_rxq_get_queue_id(rxqs[i]->rx));
>                  continue;
>              }
> -            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa, assign_cyc);
> +
> +            if (dry_run) {
> +                rxqs[i]->dry_run_pmd = rr_numa_get_pmd(non_local_numa,
> +                                                       assign_cyc);
> +            } else {
> +                rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa, assign_cyc);
> +            }
> +

I think you could use a local ptr and set it once to point to dry_run_pmd or pmd, to avoid repeatedly checking which one you should use.
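Roughly like this, inside the assignment loop (sketch only, showing the local-NUMA branch):

    struct dp_netdev_pmd_thread **target;

    target = dry_run ? &rxqs[i]->dry_run_pmd : &rxqs[i]->pmd;
    *target = rr_numa_get_pmd(numa, assign_cyc);
    VLOG_INFO("Core %d on numa node %d assigned port \'%s\' rx queue %d.",
              (*target)->core_id, numa_id,
              netdev_rxq_get_name(rxqs[i]->rx),
              netdev_rxq_get_queue_id(rxqs[i]->rx));

The same pointer can then be used in the non-local-NUMA branch and in all the log messages.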

>              VLOG_WARN("There's no available (non-isolated) pmd thread "
>                        "on numa node %d. Queue %d on port \'%s\' will "
>                        "be assigned to the pmd on core %d "
>                        "(numa node %d). Expect reduced performance.",
>                        numa_id, netdev_rxq_get_queue_id(rxqs[i]->rx),
>                        netdev_rxq_get_name(rxqs[i]->rx),
> -                      rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
> +                      (dry_run) ? rxqs[i]->dry_run_pmd->core_id :
> +                                  rxqs[i]->pmd->core_id,
> +                      (dry_run) ? rxqs[i]->dry_run_pmd->numa_id :
> +                                  rxqs[i]->pmd->numa_id);
>          } else {
> -            rxqs[i]->pmd = rr_numa_get_pmd(numa, assign_cyc);
> +            if (dry_run) {
> +                rxqs[i]->dry_run_pmd = rr_numa_get_pmd(numa, assign_cyc);
> +            } else {
> +                rxqs[i]->pmd = rr_numa_get_pmd(numa, assign_cyc);
> +            }
> +
>              if (assign_cyc) {
>                  VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>                            "rx queue %d "
>                            "(measured processing cycles %"PRIu64").",
> -                          rxqs[i]->pmd->core_id, numa_id,
> +                          (dry_run) ? rxqs[i]->dry_run_pmd->core_id :
> +                                      rxqs[i]->pmd->core_id,
> +                          numa_id,
>                            netdev_rxq_get_name(rxqs[i]->rx),
>                            netdev_rxq_get_queue_id(rxqs[i]->rx),
>                            dp_netdev_rxq_get_cycles(rxqs[i],
>                                                     RXQ_CYCLES_PROC_HIST));
>              } else {
>                  VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
> -                          "rx queue %d.", rxqs[i]->pmd->core_id, numa_id,
> +                          "rx queue %d.",
> +                          (dry_run) ? rxqs[i]->dry_run_pmd->core_id :
> +                                      rxqs[i]->pmd->core_id,
> +                          numa_id,
>                            netdev_rxq_get_name(rxqs[i]->rx),
>                            netdev_rxq_get_queue_id(rxqs[i]->rx));
>              }
> @@ -4696,10 +4872,10 @@ reconfigure_datapath(struct dp_netdev *dp)
>      }
>  
>      /* Add pinned queues and mark pmd threads isolated. */
> -    rxq_scheduling(dp, true);
> +    rxq_scheduling(dp, true, false);
>  
>      /* Add non-pinned queues. */
> -    rxq_scheduling(dp, false);
> +    rxq_scheduling(dp, false, false);
>  
>      /* Step 5: Remove queues not compliant with new scheduling. */
>      CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> @@ -4730,7 +4906,7 @@ reconfigure_datapath(struct dp_netdev *dp)
>  
>              if (q->pmd) {
>                  ovs_mutex_lock(&q->pmd->port_mutex);
> -                dp_netdev_add_rxq_to_pmd(q->pmd, q);
> +                dp_netdev_add_rxq_to_pmd(q->pmd, q, false);
>                  ovs_mutex_unlock(&q->pmd->port_mutex);
>              }
>          }
> @@ -4768,6 +4944,189 @@ ports_require_restart(const struct dp_netdev *dp)
>      return false;
>  }
>  
> +/* Function for calculating variance */
> +static uint64_t
> +variance(uint64_t a[], int n)
> +{
> +    /* Compute mean (average of elements) */
> +    uint64_t sum = 0;
> +    uint64_t mean;
> +    uint64_t sqDiff = 0;
> +
> +    if (!n) {
> +        return 0;
> +    }
> +
> +    for (int i = 0; i < n; i++) {
> +        VLOG_INFO("PMD_AUTO_LB_MON pmd_load[%d]=%"PRIu64"",
> +                i, a[i]);
> +        sum += a[i];
> +    }
> +    mean = (uint64_t)sum / (uint64_t)n;
> +
> +    /* Compute sum squared differences with mean. */
> +    for (int i = 0; i < n; i++) {
> +        sqDiff += (a[i] - mean)*(a[i] - mean);
> +    }
> +    VLOG_INFO("PMD_AUTO_LB_MON variance %"PRIu64"",
> +              (uint64_t)sqDiff / (uint64_t)n);
> +
> +    return (uint64_t)sqDiff / (uint64_t)n;
> +}
> +
> +static bool
> +pmd_rebalance_dry_run(struct dp_netdev *dp)
> +{
> +    struct dp_netdev_pmd_thread *pmd;
> +    struct dp_netdev_port *port;
> +    struct rxq_poll *poll, *poll_next;
> +    uint64_t *curr_pmd_usage;
> +    uint64_t *new_pmd_usage;
> +
> +    uint64_t new_variance;
> +    uint64_t curr_variance;
> +    uint64_t improvement = 0;
> +    uint32_t num_pmds = 0;
> +    bool pmd_mapping_changed = false;
> +
> +    rxq_scheduling(dp, false, true);
> +

You may want to think about abandoning this process if there are rxqs that are assigned cross-NUMA. I'm not sure that it would work in that case, as processing cycles could change drastically. It's probably a misconfiguration, so reduced performance is expected anyway (as noted by the logs).

> +    /* Checking mapping of PMD to q's.
> +     * If it remains same then don't do anything.
> +     */
> +    HMAP_FOR_EACH (port, node, &dp->ports) {
> +        if (!netdev_is_pmd(port->netdev)) {
> +            /* Port is not polled by PMD */
> +            continue;
> +        }
> +
> +        for (int qid = 0; qid < port->n_rxq; qid++) {
> +            struct dp_netdev_rxq *q = &port->rxqs[qid];
> +
> +            if (q->dry_run_pmd) {
> +                dp_netdev_add_rxq_to_pmd(q->dry_run_pmd, q, true);
> +                if (q->dry_run_pmd != q->pmd) {
> +                    pmd_mapping_changed = true;
> +                }
> +            }
> +        }
> +    }
> +
> +    if (!pmd_mapping_changed) {
> +        VLOG_INFO("PMD_AUTO_LB_MON DRY RUN indicating no pmd-q mapping change"
> +                 "So skipping reconfiguration");
> +
> +        CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +            if (pmd->pmd_overloaded) {
> +                pmd->pmd_overloaded = false;
> +            }
> +            HMAP_FOR_EACH_POP (poll, node, &pmd->dry_poll_list) {
> +                free(poll);
> +            }
> +        }
> +
> +        goto UNDO_DRYRUN;
> +    }
> +
> +    num_pmds = cmap_count(&dp->poll_threads);
> +    curr_pmd_usage = xcalloc(num_pmds, sizeof(uint64_t));
> +    new_pmd_usage = xcalloc(num_pmds, sizeof(uint64_t));
> +
> +    num_pmds = 0;
> +    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +        uint64_t tot_proc = 0;
> +        uint64_t tot_idle = 0;
> +        uint64_t pmd_usage = 0;
> +
> +        if ((pmd->core_id == NON_PMD_CORE_ID) || (pmd->isolated)) {
> +            continue;
> +        }
> +

> +        HMAP_FOR_EACH_SAFE (poll, poll_next, node, &pmd->dry_poll_list) {
> +            tot_proc += dp_netdev_rxq_get_cycles(poll->rxq,
> +                                                 RXQ_CYCLES_PROC_HIST);
> +            tot_idle += dp_netdev_rxq_get_cycles(poll->rxq,
> +                                                 RXQ_CYCLES_IDLE_HIST);
> +#if 0
> +            VLOG_DBG("PMD_AUTO_LB_MON tot_proc %"PRIu64", tot_idle %"PRIu64"",
> +                     tot_proc, tot_idle);
> +            VLOG_DBG("Core %d processing port \'%s\' with q-id %d q core:%d\n",
> +                pmd->core_id, netdev_rxq_get_name(poll->rxq->rx),
> +                netdev_rxq_get_queue_id(poll->rxq->rx), poll->rxq->core_id);
> +#endif
> +        }
> +
> +        if (tot_proc) {
> +            pmd_usage = (tot_proc * 100) / (double)(tot_proc + tot_idle) ;
> +            VLOG_INFO("PMD_AUTO_LB_MON new_pmd_usage(%d) %"PRIu64"",
> +                      pmd->core_id, pmd_usage);
> +        }
> +        new_pmd_usage[num_pmds] = pmd_usage;
> +

This seems like a good candidate for a function, as you repeat it below.
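For example (sketch only), a helper that computes the busy percentage of a pmd from the rxqs on a given poll list could serve both dry_poll_list and poll_list:

static uint64_t
pmd_usage_from_poll_list(struct hmap *poll_list)
{
    struct rxq_poll *poll;
    uint64_t tot_proc = 0, tot_idle = 0;

    HMAP_FOR_EACH (poll, node, poll_list) {
        tot_proc += dp_netdev_rxq_get_cycles(poll->rxq, RXQ_CYCLES_PROC_HIST);
        tot_idle += dp_netdev_rxq_get_cycles(poll->rxq, RXQ_CYCLES_IDLE_HIST);
    }
    return tot_proc ? (tot_proc * 100) / (tot_proc + tot_idle) : 0;
}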

> +        tot_proc = 0;
> +        tot_idle = 0;
> +        pmd_usage = 0;
> +        HMAP_FOR_EACH_SAFE (poll, poll_next, node, &pmd->poll_list) {
> +            tot_proc += dp_netdev_rxq_get_cycles(poll->rxq,
> +                                                 RXQ_CYCLES_PROC_HIST);
> +            tot_idle += dp_netdev_rxq_get_cycles(poll->rxq,
> +                                                 RXQ_CYCLES_IDLE_HIST);
> +        }
> +
> +        if (tot_proc) {
> +            pmd_usage = (tot_proc * 100) / (double)(tot_proc + tot_idle) ;
> +            VLOG_INFO("PMD_AUTO_LB_MON curr_pmd_usage(%d)` %"PRIu64"",
> +                      pmd->core_id, pmd_usage);
> +        }
> +
> +        curr_pmd_usage[num_pmds] = pmd_usage;
> +
> +        if (pmd->pmd_overloaded) {
> +            pmd->pmd_overloaded = false;
> +        }
> +        HMAP_FOR_EACH_POP (poll, node, &pmd->dry_poll_list) {
> +            free(poll);
> +        }
> +        num_pmds++;
> +    }
> +
> +    if (num_pmds) {
> +        curr_variance = variance(curr_pmd_usage, num_pmds);
> +        new_variance = variance(new_pmd_usage, num_pmds);
> +        VLOG_DBG("PMD_AUTO_LB_MON new variance: %"PRIu64",
> +                   curr_variance: %"PRIu64"",
> +                  new_variance, curr_variance);
> +

It looks like this line was split incorrectly; it needs the change below to compile.

--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -5105,7 +5105,6 @@ pmd_rebalance_dry_run(struct dp_netdev *dp)
         curr_variance = variance(curr_pmd_usage, num_pmds);
         new_variance = variance(new_pmd_usage, num_pmds);
-        VLOG_DBG("PMD_AUTO_LB_MON new variance: %"PRIu64",
-                   curr_variance: %"PRIu64"",
-                  new_variance, curr_variance);
+        VLOG_DBG("PMD_AUTO_LB_MON new variance: %"PRIu64", "
+                 "curr_variance: %"PRIu64"", new_variance, curr_variance);

         if (new_variance < curr_variance) {


> +        if (new_variance < curr_variance) {
> +            improvement =
> +                ((curr_variance - new_variance) * 100) / curr_variance;
> +
> +            VLOG_DBG("PMD_AUTO_LB_MON improvement %"PRIu64"", improvement);
> +        }
> +    }
> +
> +    free(curr_pmd_usage);
> +    free(new_pmd_usage);
> +
> +    if (improvement >= auto_lb_accept_improve) {
> +        return true;
> +    }
> +
> +UNDO_DRYRUN:
> +    HMAP_FOR_EACH (port, node, &dp->ports) {
> +        if (!netdev_is_pmd(port->netdev)) {
> +            continue;
> +         }
> +
> +         for (int qid = 0; qid < port->n_rxq; qid++) {
> +            struct dp_netdev_rxq *q = &port->rxqs[qid];
> +            q->dry_run_pmd = NULL;
> +         }
> +    }
> +    return false;
> +}
> +
> +
>  /* Return true if needs to revalidate datapath flows. */
>  static bool
>  dpif_netdev_run(struct dpif *dpif)
> @@ -4777,6 +5136,9 @@ dpif_netdev_run(struct dpif *dpif)
>      struct dp_netdev_pmd_thread *non_pmd;
>      uint64_t new_tnl_seq;
>      bool need_to_flush = true;
> +    bool pmd_rebalance = false;
> +    long long int now = time_msec();
> +    struct dp_netdev_pmd_thread *pmd;
>  
>      ovs_mutex_lock(&dp->port_mutex);
>      non_pmd = dp_netdev_get_pmd(dp, NON_PMD_CORE_ID);
> @@ -4809,6 +5171,37 @@ dpif_netdev_run(struct dpif *dpif)
>          dp_netdev_pmd_unref(non_pmd);
>      }
>  
> +    if (pmd_auto_lb) {
> +        if (!pmd_rebalance_poll_timer) {
> +            pmd_rebalance_poll_timer = now;
> +        }
> +
> +        if ((pmd_rebalance_poll_timer +
> +             PMD_REBALANCE_POLL_TIMER_INTERVAL) < now) {
> +            pmd_rebalance_poll_timer = now;
> +            CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
> +                if (pmd->pmd_overloaded) {
> +                    pmd_rebalance = true;
> +                    break;
> +                }
> +            }
> +            VLOG_DBG("PMD_AUTO_LB_MON periodic check:pmd rebalance:%d",
> +                      pmd_rebalance);
> +
> +            if (pmd_rebalance && !dp_netdev_is_reconf_required(dp) &&
> +                !ports_require_restart(dp)) {
> +                if (pmd_rebalance_dry_run(dp)) {
> +                    ovs_mutex_unlock(&dp->port_mutex);
> +                    ovs_mutex_lock(&dp_netdev_mutex);
> +                    VLOG_INFO("PMD_AUTO_LB_MON Invoking PMD RECONFIGURE");
> +                    dp_netdev_request_reconfigure(dp);
> +                    ovs_mutex_unlock(&dp_netdev_mutex);
> +                    ovs_mutex_lock(&dp->port_mutex);
> +                }
> +            }
> +        }
> +    }
> +
>      if (dp_netdev_is_reconf_required(dp) || ports_require_restart(dp)) {
>          reconfigure_datapath(dp);
>      }
> @@ -4967,13 +5360,31 @@ pmd_thread_main(void *f_)
>  reload:
>      pmd_alloc_static_tx_qid(pmd);
>  
> +    VLOG_DBG("PMD_AUTO_LB_MON PMD RELOAD START %"PRId64"", time_msec());
> +    pmd->pmd_overloaded = 0;
> +
> +//Nitin: Reset prev_stats also ??
> +
>      /* List port/core affinity */
>      for (i = 0; i < poll_cnt; i++) {
> +       struct dp_netdev_rxq *rxq = poll_list[i].rxq;
> +       vhost_qfill *rxq_vhost_qfill = &rxq->rxq_drops_or_qfill.rxq_vhost_qfill;
>         VLOG_DBG("Core %d processing port \'%s\' with queue-id %d\n",
> -                pmd->core_id, netdev_rxq_get_name(poll_list[i].rxq->rx),
> -                netdev_rxq_get_queue_id(poll_list[i].rxq->rx));
> +                pmd->core_id, netdev_rxq_get_name(rxq->rx),
> +                netdev_rxq_get_queue_id(rxq->rx));
>         /* Reset the rxq current cycles counter. */
> -       dp_netdev_rxq_set_cycles(poll_list[i].rxq, RXQ_CYCLES_PROC_CURR, 0);
> +       dp_netdev_rxq_set_cycles(rxq, RXQ_CYCLES_PROC_CURR, 0);
> +       dp_netdev_rxq_set_cycles(rxq, RXQ_CYCLES_IDLE_CURR, 0);
> +       dp_netdev_rxq_set_cycles(rxq, RXQ_CYCLES_PROC_HIST, 0);
> +       dp_netdev_rxq_set_cycles(rxq, RXQ_CYCLES_IDLE_HIST, 0);
> +       rxq->overloading_pmd = 0;
> +       rxq_vhost_qfill->num_vhost_qfill = 0;
> +       rxq_vhost_qfill->prev_num_vhost_qfill = 0;
> +
> +       for (unsigned j = 0; j < PMD_RXQ_INTERVAL_MAX; j++) {
> +            dp_netdev_rxq_set_intrvl_cycles(rxq, 0, j);
> +            dp_netdev_rxq_set_idle_cycles(rxq, 0, j);
> +       }
>      }
>  
>      if (!poll_cnt) {
> @@ -5465,6 +5876,7 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      pmd->next_optimization = pmd->ctx.now + DPCLS_OPTIMIZATION_INTERVAL;
>      pmd->rxq_next_cycle_store = pmd->ctx.now + PMD_RXQ_INTERVAL_LEN;
>      hmap_init(&pmd->poll_list);
> +    hmap_init(&pmd->dry_poll_list);
>      hmap_init(&pmd->tx_ports);
>      hmap_init(&pmd->tnl_port_cache);
>      hmap_init(&pmd->send_port_cache);
> @@ -5489,6 +5901,7 @@ dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
>      hmap_destroy(&pmd->tnl_port_cache);
>      hmap_destroy(&pmd->tx_ports);
>      hmap_destroy(&pmd->poll_list);
> +    hmap_destroy(&pmd->dry_poll_list);
>      /* All flows (including their dpcls_rules) have been deleted already */
>      CMAP_FOR_EACH (cls, node, &pmd->classifiers) {
>          dpcls_destroy(cls);
> @@ -5585,25 +5998,33 @@ dp_netdev_pmd_clear_ports(struct dp_netdev_pmd_thread *pmd)
>  /* Adds rx queue to poll_list of PMD thread, if it's not there already. */
>  static void
>  dp_netdev_add_rxq_to_pmd(struct dp_netdev_pmd_thread *pmd,
> -                         struct dp_netdev_rxq *rxq)
> +                         struct dp_netdev_rxq *rxq,
> +                         bool dry_run)
>      OVS_REQUIRES(pmd->port_mutex)
>  {
>      int qid = netdev_rxq_get_queue_id(rxq->rx);
>      uint32_t hash = hash_2words(odp_to_u32(rxq->port->port_no), qid);
>      struct rxq_poll *poll;
> +    struct hmap *poll_list = dry_run ? &pmd->dry_poll_list : &pmd->poll_list;
>  
> -    HMAP_FOR_EACH_WITH_HASH (poll, node, hash, &pmd->poll_list) {
> +    HMAP_FOR_EACH_WITH_HASH (poll, node, hash, poll_list) {
>          if (poll->rxq == rxq) {
>              /* 'rxq' is already polled by this thread. Do nothing. */
> +            VLOG_DBG("rxq(%s) is already polled by this pmd(%d)\n",
> +                     netdev_rxq_get_name(rxq->rx), pmd->core_id);
>              return;
>          }
>      }
> +    VLOG_DBG("Adding rxq(%s) to pmd(%d)\n",
> +                     netdev_rxq_get_name(rxq->rx), pmd->core_id);
>  
>      poll = xmalloc(sizeof *poll);
>      poll->rxq = rxq;
> -    hmap_insert(&pmd->poll_list, &poll->node, hash);
> +    hmap_insert(poll_list, &poll->node, hash);
>  
> -    pmd->need_reload = true;
> +    if (!dry_run) {
> +        pmd->need_reload = true;
> +    }
>  }
>  
>  /* Delete 'poll' from poll_list of PMD thread. */
> @@ -7176,17 +7597,129 @@ dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd,
>                             struct polled_queue *poll_list, int poll_cnt)
>  {
>      struct dpcls *cls;
> +    struct netdev_stats stats;
> +    struct dp_netdev_port *port;
> +    uint32_t drops;
> +    uint32_t num_qfull;
> +    uint64_t prev_dropped;
> +    uint32_t prev_qfill, curr_qfill;
> +    uint64_t tot_idle = 0, tot_proc = 0;
> +    unsigned int idx;
> +    float pmd_load = 0;
>  
>      if (pmd->ctx.now > pmd->rxq_next_cycle_store) {
>          uint64_t curr_tsc;
> +
> +        if (pmd_auto_lb) {
> +            tot_idle = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE] -
> +                       pmd->prev_stats[PMD_CYCLES_ITER_IDLE];
> +            tot_proc = pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY] -
> +                       pmd->prev_stats[PMD_CYCLES_ITER_BUSY];
> +
> +            if (tot_proc) {
> +                pmd_load = ((tot_proc * 100) / (double)(tot_idle + tot_proc));
> +            }
> +
> +            VLOG_DBG("PMD_AUTO_LB_MON PMD LOAD in  (%.02f%%)"
> +                     "tot_idle %"PRIu64" tot_proc %"PRIu64"", pmd_load,
> +                      tot_idle, tot_proc);
> +        }
> +
> +        pmd->prev_stats[PMD_CYCLES_ITER_IDLE] =
> +                        pmd->perf_stats.counters.n[PMD_CYCLES_ITER_IDLE];
> +        pmd->prev_stats[PMD_CYCLES_ITER_BUSY] =
> +                        pmd->perf_stats.counters.n[PMD_CYCLES_ITER_BUSY];
> +
>          /* Get the cycles that were used to process each queue and store. */
>          for (unsigned i = 0; i < poll_cnt; i++) {
> -            uint64_t rxq_cyc_curr = dp_netdev_rxq_get_cycles(poll_list[i].rxq,
> -                                                        RXQ_CYCLES_PROC_CURR);
> -            dp_netdev_rxq_set_intrvl_cycles(poll_list[i].rxq, rxq_cyc_curr);
> -            dp_netdev_rxq_set_cycles(poll_list[i].rxq, RXQ_CYCLES_PROC_CURR,
> -                                     0);
> +            uint64_t rxq_cyc_curr, rxq_idle_curr;
> +            struct dp_netdev_rxq *rxq;
> +
> +            rxq = poll_list[i].rxq;
> +            idx = rxq->intrvl_idx++ % PMD_RXQ_INTERVAL_MAX;
> +
> +            rxq_cyc_curr = dp_netdev_rxq_get_cycles(rxq, RXQ_CYCLES_PROC_CURR);
> +            dp_netdev_rxq_set_intrvl_cycles(rxq, rxq_cyc_curr, idx);
> +            dp_netdev_rxq_set_cycles(rxq, RXQ_CYCLES_PROC_CURR, 0);
> +
> +            rxq_idle_curr = dp_netdev_rxq_get_cycles(rxq,
> +                                                     RXQ_CYCLES_IDLE_CURR);
> +            dp_netdev_rxq_set_idle_cycles(rxq, rxq_idle_curr, idx);
> +            dp_netdev_rxq_set_cycles(rxq, RXQ_CYCLES_IDLE_CURR, 0);
> +
> +
> +            if (pmd_auto_lb && !pmd->isolated) {
> +                bool pkt_drops = false;
> +
> +                VLOG_DBG("PMD_AUTO_LB_MON rxq_cyc_curr %"PRIu64" rxq_idle_curr"
> +                         " %"PRIu64"", rxq_cyc_curr, rxq_idle_curr);
> +
> +                if (!auto_lb_skip_drop_check) {
> +                    if (rxq->is_vhost && (PMD_MIN_NUM_QFILLS > 0)) {
> +                        /* It is dpdk vhost port */
> +                        vhost_qfill *rxq_vhost_qfill =
> +                            &rxq->rxq_drops_or_qfill.rxq_vhost_qfill;
> +                        atomic_read_relaxed(
> +                            &rxq_vhost_qfill->prev_num_vhost_qfill,
> +                            &prev_qfill);
> +                        atomic_read_relaxed(
> +                            &rxq_vhost_qfill->num_vhost_qfill,
> +                            &curr_qfill);
> +
> +                        num_qfull = curr_qfill - prev_qfill;
> +                        VLOG_DBG("PMD_AUTO_LB_MON log_q_thr %d prev_qfill %d "
> +                                 "curr_qfill %d", log_q_thr, prev_qfill,
> +                                 curr_qfill);
> +                        if (curr_qfill != prev_qfill) {
> +                            atomic_store_relaxed(
> +                                    &rxq_vhost_qfill->prev_num_vhost_qfill,
> +                                    curr_qfill);
> +                        }
> +
> +                        if (num_qfull >= PMD_MIN_NUM_QFILLS) {
> +                            pkt_drops = true;
> +                        }
> +
> +                    } else if (PMD_MIN_NUM_DROPS > 0) {
> +                        /* DPDK port */
> +                        port = rxq->port;
> +                        port->netdev->netdev_class->get_stats(port->netdev,
> +                                                          &stats);

This will result in getting a lock before getting the stats. That could potentially block the pmd thread and its processing. I think some other way of getting the stats would be needed, even if they are not as accurate or are a little older.
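One possible direction (all names hypothetical, just to illustrate the idea): keep a cached drop counter per port that the main thread refreshes outside the pmd fast path, and let the pmd compare against that snapshot instead of calling the stats op itself:

/* Hypothetical new member of struct dp_netdev_port:
 *     atomic_ullong cached_rx_dropped;
 */

/* Main thread, e.g. from dpif_netdev_run(), at most once per interval. */
static void
port_refresh_drop_snapshot(struct dp_netdev_port *port)
{
    struct netdev_stats stats;

    if (!netdev_get_stats(port->netdev, &stats)) {
        atomic_store_relaxed(&port->cached_rx_dropped, stats.rx_dropped);
    }
}

/* Pmd thread only reads the snapshot; no device lock is taken here. */
static unsigned long long
port_get_drop_snapshot(struct dp_netdev_port *port)
{
    unsigned long long dropped;

    atomic_read_relaxed(&port->cached_rx_dropped, &dropped);
    return dropped;
}

The value can be a little stale, but that should be fine for deciding whether to trigger a dry run.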

> +                        q_drops *rxq_drop =
> +                            &rxq->rxq_drops_or_qfill.rxq_drops;
> +                        atomic_read_relaxed(&rxq_drop->prev_drops,
> +                                            &prev_dropped);
> +                        drops = stats.rx_dropped - prev_dropped;
> +                        VLOG_DBG("PMD_AUTO_LB_MON curr_drops %"PRIu64" "
> +                                 "prev_drops %"PRIu64"", stats.rx_dropped,
> +                                 prev_dropped);
> +                        if (drops) {
> +                            atomic_store_relaxed(&rxq_drop->prev_drops,
> +                                                 stats.rx_dropped);
> +                        }
> +                        if (drops >= PMD_MIN_NUM_DROPS) {
> +                            pkt_drops = true;
> +                        }
> +                    }
> +                }
> +                if ((pmd_load >= auto_lb_pmd_load_ther) &&
> +                    (auto_lb_skip_drop_check || pkt_drops) &&
> +                    (!pmd->pmd_overloaded)) {
> +                    rxq->overloading_pmd++;

I'm not sure why rxq->overloading_pmd is needed. Maybe I missed it.

> +
> +                    VLOG_DBG("PMD_AUTO_LB_MON PMD OVERLOAD DETECT iter %d",
> +                              rxq->overloading_pmd);
> +                    if (rxq->overloading_pmd >= PMD_RXQ_INTERVAL_MAX) {
> +                       pmd->pmd_overloaded = true; //TODO: No need to reset overloading flag??
> +                       rxq->overloading_pmd = 0;
> +                    }
> +                } else {
> +                    rxq->overloading_pmd = 0;
> +                }
> +            } // if (!pmd->isolated)

The code added above is very nested. Perhaps some new function(s) could be used.

>          }
> +
>          curr_tsc = cycles_counter_update(&pmd->perf_stats);
>          if (pmd->intrvl_tsc_prev) {
>              /* There is a prev timestamp, store a new intrvl cycle 
> count. */
> 


