[ovs-dev] [PATCH v4 2/7] dpif-netdev: Make PMD auto load balance use common rxq scheduling.

Kevin Traynor ktraynor at redhat.com
Tue Jul 13 16:49:54 UTC 2021


Hi Jan,

On 13/07/2021 09:38, Jan Scheurich wrote:
>> -----Original Message-----
>> From: dev <ovs-dev-bounces at openvswitch.org> On Behalf Of Kevin Traynor
>> Sent: Thursday, 8 July, 2021 15:54
>> To: dev at openvswitch.org
>> Cc: david.marchand at redhat.com
>> Subject: [ovs-dev] [PATCH v4 2/7] dpif-netdev: Make PMD auto load balance
>> use common rxq scheduling.
>>
>> PMD auto load balance had its own separate implementation of the rxq
>> scheduling that it used for dry runs. This was done because previously the rxq
>> scheduling was not made reusable for a dry run.
>>
>> Apart from the code duplication (which is a good enough reason to replace it
>> alone), this meant that if any further rxq scheduling changes or assignment
>> types were added, they would also have to be duplicated in the auto load
>> balance code.
>>
>> This patch replaces the current PMD auto load balance rxq scheduling code to
>> reuse the common rxq scheduling code.
>>
>> The behaviour does not change from a user perspective, except the logs are
>> updated to be more consistent.
>>
>> As the dry run will compare the pmd load variances for current and estimated
>> assignments, new functions are added to populate the current assignments and
>> use the rxq scheduling data structs for variance calculations.
>>
>> Now that the new rxq scheduling data structures are being used in PMD auto
>> load balance, the older rr_* data structs and associated functions can be
>> removed.
>>
>> Signed-off-by: Kevin Traynor <ktraynor at redhat.com>
>> ---
>>  lib/dpif-netdev.c | 508 +++++++++++++++-------------------------------
>>  1 file changed, 161 insertions(+), 347 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index beafa00a0..338ffd971 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -4903,138 +4903,4 @@ port_reconfigure(struct dp_netdev_port *port)
>>  }
>>
>> -struct rr_numa_list {
>> -    struct hmap numas;  /* Contains 'struct rr_numa' */
>> -};
>> -
>> -struct rr_numa {
>> -    struct hmap_node node;
>> -
>> -    int numa_id;
>> -
>> -    /* Non isolated pmds on numa node 'numa_id' */
>> -    struct dp_netdev_pmd_thread **pmds;
>> -    int n_pmds;
>> -
>> -    int cur_index;
>> -    bool idx_inc;
>> -};
>> -
>> -static size_t
>> -rr_numa_list_count(struct rr_numa_list *rr)
>> -{
>> -    return hmap_count(&rr->numas);
>> -}
>> -
>> -static struct rr_numa *
>> -rr_numa_list_lookup(struct rr_numa_list *rr, int numa_id)
>> -{
>> -    struct rr_numa *numa;
>> -
>> -    HMAP_FOR_EACH_WITH_HASH (numa, node, hash_int(numa_id, 0), &rr->numas) {
>> -        if (numa->numa_id == numa_id) {
>> -            return numa;
>> -        }
>> -    }
>> -
>> -    return NULL;
>> -}
>> -
>> -/* Returns the next node in numa list following 'numa' in round-robin fashion.
>> - * Returns first node if 'numa' is a null pointer or the last node in 'rr'.
>> - * Returns NULL if 'rr' numa list is empty. */
>> -static struct rr_numa *
>> -rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa)
>> -{
>> -    struct hmap_node *node = NULL;
>> -
>> -    if (numa) {
>> -        node = hmap_next(&rr->numas, &numa->node);
>> -    }
>> -    if (!node) {
>> -        node = hmap_first(&rr->numas);
>> -    }
>> -
>> -    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL;
>> -}
>> -
>> -static void
>> -rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
>> -{
>> -    struct dp_netdev_pmd_thread *pmd;
>> -    struct rr_numa *numa;
>> -
>> -    hmap_init(&rr->numas);
>> -
>> -    CMAP_FOR_EACH (pmd, node, &dp->poll_threads) {
>> -        if (pmd->core_id == NON_PMD_CORE_ID || pmd->isolated) {
>> -            continue;
>> -        }
>> -
>> -        numa = rr_numa_list_lookup(rr, pmd->numa_id);
>> -        if (!numa) {
>> -            numa = xzalloc(sizeof *numa);
>> -            numa->numa_id = pmd->numa_id;
>> -            hmap_insert(&rr->numas, &numa->node, hash_int(pmd->numa_id, 0));
>> -        }
>> -        numa->n_pmds++;
>> -        numa->pmds = xrealloc(numa->pmds, numa->n_pmds * sizeof *numa->pmds);
>> -        numa->pmds[numa->n_pmds - 1] = pmd;
>> -        /* At least one pmd so initialise curr_idx and idx_inc. */
>> -        numa->cur_index = 0;
>> -        numa->idx_inc = true;
>> -    }
>> -}
>> -
>> -/*
>> - * Returns the next pmd from the numa node.
>> - *
>> - * If 'updown' is 'true' it will alternate between selecting the next pmd in
>> - * either an up or down walk, switching between up/down when the first or last
>> - * core is reached. e.g. 1,2,3,3,2,1,1,2...
>> - *
>> - * If 'updown' is 'false' it will select the next pmd wrapping around when last
>> - * core reached. e.g. 1,2,3,1,2,3,1,2...
>> - */
>> -static struct dp_netdev_pmd_thread *
>> -rr_numa_get_pmd(struct rr_numa *numa, bool updown)
>> -{
>> -    int numa_idx = numa->cur_index;
>> -
>> -    if (numa->idx_inc == true) {
>> -        /* Incrementing through list of pmds. */
>> -        if (numa->cur_index == numa->n_pmds-1) {
>> -            /* Reached the last pmd. */
>> -            if (updown) {
>> -                numa->idx_inc = false;
>> -            } else {
>> -                numa->cur_index = 0;
>> -            }
>> -        } else {
>> -            numa->cur_index++;
>> -        }
>> -    } else {
>> -        /* Decrementing through list of pmds. */
>> -        if (numa->cur_index == 0) {
>> -            /* Reached the first pmd. */
>> -            numa->idx_inc = true;
>> -        } else {
>> -            numa->cur_index--;
>> -        }
>> -    }
>> -    return numa->pmds[numa_idx];
>> -}
>> -
>> -static void
>> -rr_numa_list_destroy(struct rr_numa_list *rr)
>> -{
>> -    struct rr_numa *numa;
>> -
>> -    HMAP_FOR_EACH_POP (numa, node, &rr->numas) {
>> -        free(numa->pmds);
>> -        free(numa);
>> -    }
>> -    hmap_destroy(&rr->numas);
>> -}
>> -
>>  struct sched_numa_list {
>>      struct hmap numas;  /* Contains 'struct sched_numa'. */
>> @@ -5197,4 +5063,37 @@ sched_pmd_add_rxq(struct sched_pmd *sched_pmd, struct dp_netdev_rxq *rxq,
>>  }
>>
>> +static void
>> +sched_numa_list_assignments(struct sched_numa_list *numa_list,
>> +                            struct dp_netdev *dp)
>> +    OVS_REQUIRES(dp->port_mutex)
>> +{
>> +    struct dp_netdev_port *port;
>> +
>> +    /* For each port. */
>> +    HMAP_FOR_EACH (port, node, &dp->ports) {
>> +        if (!netdev_is_pmd(port->netdev)) {
>> +            continue;
>> +        }
>> +        /* For each rxq on the port. */
>> +        for (unsigned qid = 0; qid < port->n_rxq; qid++) {
>> +            struct dp_netdev_rxq *rxq = &port->rxqs[qid];
>> +            struct sched_pmd *sched_pmd;
>> +            uint64_t proc_cycles = 0;
>> +
>> +            for (int i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
>> +                proc_cycles  += dp_netdev_rxq_get_intrvl_cycles(rxq, i);
>> +            }
>> +
>> +            sched_pmd = sched_pmd_find_by_pmd(numa_list, rxq->pmd);
>> +            if (sched_pmd) {
>> +                if (rxq->core_id != OVS_CORE_UNSPEC) {
>> +                    sched_pmd->isolated = true;
>> +                }
>> +                sched_pmd_add_rxq(sched_pmd, rxq, proc_cycles);
>> +            }
>> +        }
>> +    }
>> +}
>> +
>>  static void
>>  sched_numa_list_put_in_place(struct sched_numa_list *numa_list)
>> @@ -5220,4 +5119,31 @@ sched_numa_list_put_in_place(struct sched_numa_list *numa_list)
>>  }
>>
>> +static bool
>> +sched_numa_list_cross_numa_polling(struct sched_numa_list *numa_list)
>> +{
>> +    struct sched_numa *numa;
>> +
>> +    /* For each numa */
>> +    HMAP_FOR_EACH (numa, node, &numa_list->numas) {
>> +        /* For each pmd */
>> +        for (int i = 0; i < numa->n_pmds; i++) {
>> +            struct sched_pmd *sched_pmd;
>> +
>> +            sched_pmd = &numa->pmds[i];
>> +            /* For each rxq. */
>> +            for (unsigned k = 0; k < sched_pmd->n_rxq; k++) {
>> +                struct dp_netdev_rxq *rxq = sched_pmd->rxqs[k];
>> +
>> +                if (!sched_pmd->isolated &&
>> +                    rxq->pmd->numa_id !=
>> +                        netdev_get_numa_id(rxq->port->netdev)) {
>> +                    return true;
>> +                }
>> +            }
>> +        }
>> +    }
>> +    return false;
>> +}
>> +
>>  static unsigned
>>  sched_numa_noniso_pmd_count(struct sched_numa *numa)
>> @@ -5525,4 +5451,103 @@ rxq_scheduling(struct dp_netdev *dp) OVS_REQUIRES(dp->port_mutex)
>>  }
>>
>> +static uint64_t variance(uint64_t a[], int n);
>> +
>> +static uint64_t
>> +sched_numa_list_variance(struct sched_numa_list *numa_list)
>> +{
>> +    struct sched_numa *numa;
>> +    uint64_t *percent_busy = NULL;
>> +    unsigned total_pmds = 0;
>> +    int n_proc = 0;
>> +    uint64_t var;
>> +
>> +    HMAP_FOR_EACH (numa, node, &numa_list->numas) {
>> +        total_pmds += numa->n_pmds;
>> +        percent_busy = xrealloc(percent_busy,
>> +                                total_pmds * sizeof *percent_busy);
>> +
>> +        for (unsigned i = 0; i < numa->n_pmds; i++) {
>> +            struct sched_pmd *sched_pmd;
>> +            uint64_t total_cycles = 0;
>> +
>> +            sched_pmd = &numa->pmds[i];
>> +            /* Exclude isolated PMDs from variance calculations. */
>> +            if (sched_pmd->isolated == true) {
>> +                continue;
>> +            }
>> +            /* Get the total pmd cycles for an interval. */
>> +            atomic_read_relaxed(&sched_pmd->pmd->intrvl_cycles, &total_cycles);
>> +
>> +            if (total_cycles) {
>> +                /* Estimate the cycles to cover all intervals. */
>> +                total_cycles *= PMD_RXQ_INTERVAL_MAX;
>> +                percent_busy[n_proc++] = (sched_pmd->pmd_proc_cycles * 100)
>> +                                             / total_cycles;
>> +            } else {
>> +                percent_busy[n_proc++] = 0;
>> +            }
>> +        }
>> +    }
>> +    var = variance(percent_busy, n_proc);
>> +    free(percent_busy);
>> +    return var;
>> +}
>> +
>> +static bool
>> +pmd_rebalance_dry_run(struct dp_netdev *dp)
>> +    OVS_REQUIRES(dp->port_mutex)
>> +{
>> +    struct sched_numa_list numa_list_cur;
>> +    struct sched_numa_list numa_list_est;
>> +    bool thresh_met = false;
>> +    uint64_t current_var, estimate_var;
>> +    uint64_t improvement = 0;
>> +
>> +    VLOG_DBG("PMD auto load balance performing dry run.");
>> +
>> +    /* Populate current assignments. */
>> +    sched_numa_list_populate(&numa_list_cur, dp);
>> +    sched_numa_list_assignments(&numa_list_cur, dp);
>> +
>> +    /* Populate estimated assignments. */
>> +    sched_numa_list_populate(&numa_list_est, dp);
>> +    sched_numa_list_schedule(&numa_list_est, dp,
>> +                             dp->pmd_rxq_assign_type, VLL_DBG);
>> +
>> +    /* Check if cross-numa polling, there is only one numa with PMDs. */
>> +    if (!sched_numa_list_cross_numa_polling(&numa_list_est) ||
>> +            sched_numa_list_count(&numa_list_est) == 1) {
>> +
>> +        /* Calculate variances. */
>> +        current_var = sched_numa_list_variance(&numa_list_cur);
>> +        estimate_var = sched_numa_list_variance(&numa_list_est);
>> +
>> +        if (estimate_var < current_var) {
>> +             improvement = ((current_var - estimate_var) * 100) / current_var;
>> +        }
>> +        VLOG_DBG("Current variance %"PRIu64" Estimated variance
>> %"PRIu64".",
>> +                 current_var, estimate_var);
>> +        VLOG_DBG("Variance improvement %"PRIu64"%%.", improvement);
>> +
>> +        if (improvement >= dp->pmd_alb.rebalance_improve_thresh) {
>> +            thresh_met = true;
>> +            VLOG_DBG("PMD load variance improvement threshold %u%% "
>> +                     "is met.", dp->pmd_alb.rebalance_improve_thresh);
>> +        } else {
>> +            VLOG_DBG("PMD load variance improvement threshold "
>> +                     "%u%% is not met.",
>> +                      dp->pmd_alb.rebalance_improve_thresh);
>> +        }
>> +    } else {
>> +        VLOG_DBG("PMD auto load balance detected cross-numa polling with "
>> +                 "multiple numa nodes. Unable to accurately estimate.");	
>> +    }
> 
> In our patch series we decided to skip the check on cross-numa polling during auto-load balancing. The rationale is as follows: 
> 
> If the estimated PMD-rxq distribution includes cross-NUMA rxq assignments, the same must apply for the current distribution, as none of the scheduling algorithms would voluntarily assign rxqs across NUMA nodes. So, current and estimated rxq assignments are comparable and it makes sense to consider rebalancing when the variance improves.
> 
> Please consider removing this check.
> 

The first thing is that this patch does not change any behaviour, it just
re-implements the existing code to reuse the common rxq scheduling code, so
it would not be the place to change this functionality.

About the proposed change itself, just to be clear about what is currently
allowed: a rebalance is allowed when there are local pmds, OR when there are
no local pmds and there is one other NUMA node with pmds available for
cross-numa polling.
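
For reference, a minimal sketch of that gating check, using the same names
as in the patch above (only the shape of the logic, not the full dry-run
code):

    /* Compare variances only when the estimated assignment has no
     * cross-NUMA polling, or when only a single NUMA node has PMDs. */
    if (!sched_numa_list_cross_numa_polling(&numa_list_est) ||
            sched_numa_list_count(&numa_list_est) == 1) {
        /* Calculate the current and estimated variances and check the
         * improvement against the threshold. */
    } else {
        /* Cross-NUMA polling with multiple NUMA nodes: skip, as the
         * estimate may not be accurate. */
    }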

The rationale for not doing a rebalance when there are no local pmds but
multiple other NUMAs are available for cross-NUMA polling is that the
estimate may be incorrect, due to a different cross-NUMA node being chosen
for an Rxq than the one currently used.

I thought about some things like making an Rxq sticky to a particular
cross-NUMA node for this case, but that brings a whole new set of problems,
e.g. what happens if that NUMA gets overloaded or loses cores, how can it
ever be reset, etc., so I decided not to pursue it as I think it is probably
a corner case (at least for now).

I know the case of no local pmd and one NUMA with pmds is not a corner case,
as I'm aware of users doing that.

We can discuss the multiple non-local NUMA case further and maybe there are
some improvements we can think of, or maybe I've made some wrong assumptions,
but it would be a follow-on from the current patchset.

> Otherwise, looks good to me,
> 

Thanks for reviewing.

Kevin.

> Jan
> 


