[ovs-dev] [PATCH v2] dpif-netdev: Allow PMD auto load balance with cross-numa.

Wed Mar 17 15:59:21 UTC 2021

On 3/15/21 4:43 PM, Kevin Traynor wrote:
> Previously auto load balance did not trigger a reassignment when
> there was any cross-numa polling as an rxq could be polled from a
> different numa after reassign and it could impact estimates.
> 
> In the case where there is only one numa with pmds available, the
> same numa will always poll before and after reassignment, so estimates
> are valid. Allow PMD auto load balance to trigger a reassignment in
> this case.
> 
> Signed-off-by: Kevin Traynor <ktraynor at redhat.com>
> Acked-by: Eelco Chaudron <echaudro at redhat.com>
> 
> ---
> v2:
> - Same logic as v1, combined two "ifs" as per David suggestion
> - Updated comments/logs
> - Updated the doc note that said it will not work for cross NUMA to
>   include new condition
> - Kept Eelco's Ack, as no logic changed
> ---
>  Documentation/topics/dpdk/pmd.rst |  9 ++++++---
>  lib/dpif-netdev.c                 | 16 +++++++++++++---
>  2 files changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index caa7d97be..1f61bddb6 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -238,7 +238,10 @@ If not set, the default variance improvement threshold is 25%.
>  .. note::
>  
> -    PMD Auto Load Balancing doesn't currently work if queues are assigned
> -    cross NUMA as actual processing load could get worse after assignment
> -    as compared to what dry run predicts.
> +    PMD Auto Load Balancing doesn't request a reassignment if queues are
> +    assigned cross NUMA and there are multiple NUMA nodes available for
> +    reassignment. This is because reassignment to a different NUMA node could
> +    lead to an unpredictable change in processing cycles required for a queue.
> +    However, if there is only one cross NUMA node available then a dry run and
> +    possible request to reassign may continue as normal.

This note looks very cryptic.  What is a 'cross NUMA node'?  Request a
reassignment from whom?  What is a dry run (this was understandable from the
old version of the note)?

Way too complex.
I wouldn't expect a normal user who doesn't know the internals of the code to
understand what is going on here.

Maybe we can keep the old note and only add the exception?  E.g.:

    PMD Auto Load Balancing doesn't currently work if queues are assigned
    cross NUMA as actual processing load could get worse after assignment
    as compared to what dry run predicts.  The only exception is when all
    PMD threads are running on cores from a single NUMA node.  In this case
    Auto Load Balancing is still possible.

>  
>  The minimum time between 2 consecutive PMD auto load balancing iterations can
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 816945375..29e74ee43 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4888,4 +4888,10 @@ struct rr_numa {
>  };
>  
> +static size_t
> +rr_numa_list_count(struct rr_numa_list *rr)
> +{
> +    return hmap_count(&rr->numas);
> +}
> +
>  static struct rr_numa *
>  rr_numa_list_lookup(struct rr_numa_list *rr, int numa_id)
> @@ -5600,8 +5606,12 @@ get_dry_run_variance(struct dp_netdev *dp, uint32_t *core_list,
>          int numa_id = netdev_get_numa_id(rxqs[i]->port->netdev);
>          numa = rr_numa_list_lookup(&rr, numa_id);
> +        /* If there is no available pmd on the local numa but there is only one
> +         * numa for cross-numa polling, we can estimate the dry run. */
> +        if (!numa && rr_numa_list_count(&rr) == 1) {
> +            numa = rr_numa_list_next(&rr, NULL);
> +        }
>          if (!numa) {
> -            /* Abort if cross NUMA polling. */
> -            VLOG_DBG("PMD auto lb dry run."
> -                     " Aborting due to cross-numa polling.");
> +            VLOG_DBG("PMD auto lb dry run. Aborting due to "
> +                     "multiple numa nodes available for cross-numa polling.");

Same here.  This message is hard to understand.
Maybe:
            VLOG_DBG("PMD auto lb dry run: "
                     "There's no available (non-isolated) PMD thread on NUMA "
                     "node %d for port '%s' and there are PMD threads on more "
                     "than one NUMA node available for cross-NUMA polling. "
                     "Aborting.", numa_id, netdev_rxq_get_name(rxqs[i]->rx));

What do you think?
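
To make the two conditions in that message concrete, here is a minimal
standalone sketch of the selection logic as I read it from the patch.  It is
not the OVS code: 'struct numa_entry' and dry_run_pick_numa() are hypothetical
names, and a plain array stands in for 'struct rr_numa_list', which in the
patch only holds NUMA nodes that have available PMD threads.

```c
#include <stddef.h>

/* Hypothetical stand-in for a NUMA node that has at least one available
 * (non-isolated) PMD thread.  Not the real 'struct rr_numa'. */
struct numa_entry {
    int numa_id;
};

/* Returns the NUMA entry whose PMD threads would poll an rxq that is local
 * to 'rxq_numa_id' during a dry run, or NULL if the dry run should abort
 * because the estimate could not be trusted. */
static const struct numa_entry *
dry_run_pick_numa(const struct numa_entry *numas, size_t n_numas,
                  int rxq_numa_id)
{
    /* Prefer a PMD on the rxq's local NUMA node. */
    for (size_t i = 0; i < n_numas; i++) {
        if (numas[i].numa_id == rxq_numa_id) {
            return &numas[i];
        }
    }

    /* No local PMD.  If exactly one NUMA node has PMDs, the rxq is polled
     * from that node both before and after reassignment, so the estimate
     * is still valid. */
    if (n_numas == 1) {
        return &numas[0];
    }

    /* Several candidate nodes for cross-NUMA polling (or none at all):
     * abort the dry run. */
    return NULL;
}
```

The NULL return corresponds to the case where the debug message above would
be logged.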

Best regards, Ilya Maximets.