[ovs-dev] [PATCH 3/5] dpif-netdev: Add group rxq scheduling assignment type.

David Marchand david.marchand at redhat.com
Thu Jun 24 14:52:37 UTC 2021


On Fri, Jun 4, 2021 at 11:19 PM Kevin Traynor <ktraynor at redhat.com> wrote:
>
> Add an rxq scheduling option that allows rxqs to be grouped
> on a pmd based purely on their load.
>
> The current default 'cycles' assignment sorts rxqs by measured
> processing load and then assigns them to a round-robin list of PMDs.
> This helps to keep the rxqs that require the most processing on
> different cores, but as it selects the PMDs in round-robin order, it
> distributes rxqs equally across the PMDs.
>
> 'cycles' assignment has the advantage that it keeps the most loaded
> rxqs off the same core, while still spreading the rxqs across a broad
> range of PMDs to mitigate against changes in traffic pattern.
>
> 'cycles' assignment has the disadvantage that, in order to trade off
> between optimising for the current traffic load and mitigating against
> future changes, it tries to assign an equal number of rxqs per PMD in
> a round-robin manner, and this can lead to a less than optimal balance
> of the processing load.
>
> Now that PMD auto load balance can help mitigate future changes in
> traffic patterns, a 'group' assignment can be used to assign rxqs based
> on their measured cycles and the estimated running total of load on
> each PMD.
>
> In this case, there is no restriction on keeping an equal number of
> rxqs per PMD, as the assignment is purely load based.
>
> This means that one PMD may have a group of low-load rxqs assigned to
> it while another PMD has a single high-load rxq assigned to it, if that
> gives the best balance of the measured loads across the PMDs.
>
> Signed-off-by: Kevin Traynor <ktraynor at redhat.com>
> ---
>  Documentation/topics/dpdk/pmd.rst |  26 ++++++
>  lib/dpif-netdev.c                 | 141 +++++++++++++++++++++++++-----
>  vswitchd/vswitch.xml              |   5 +-
>  3 files changed, 148 insertions(+), 24 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index e481e7941..d1c45cdfb 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -137,4 +137,30 @@ The Rx queues will be assigned to the cores in the following order::
>      Core 8: Q3 (60%) | Q0 (30%)
>
> +``group`` assignment is similar to ``cycles`` in that the Rxqs will be
> +ordered by their measured processing cycles before being assigned to PMDs.
> +It differs from ``cycles`` in that it uses a running estimate of the cycles
> +that will be on each PMD to select the PMD with the lowest load for each Rxq.
> +
> +This means that there can be a group of low traffic Rxqs on one PMD, while a
> +high traffic Rxq may have a PMD to itself. Whereas ``cycles`` keeps the number
> +of Rxqs per PMD as equal as possible, with ``group`` this restriction is
> +removed in favour of a better balance of the workload across PMDs.
> +
> +For example, where there are five Rx queues and three cores - 3, 7, and 8 -
> +available and the measured usage of core cycles per Rx queue over the last
> +interval is seen to be:
> +
> +- Queue #0: 10%
> +- Queue #1: 80%
> +- Queue #3: 50%
> +- Queue #4: 70%
> +- Queue #5: 10%
> +
> +The Rx queues will be assigned to the cores in the following order::
> +
> +    Core 3: Q1 (80%) |
> +    Core 7: Q4 (70%) |
> +    Core 8: Q3 (50%) | Q0 (10%) | Q5 (10%)
> +
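The example reads well. Fwiw, here is a small standalone sketch (plain C,
names invented for the sketch, not OVS code) of the selection it describes:
rxqs sorted by decreasing measured cycles, each assigned to the core with
the lowest running estimate. It reproduces the assignment above:

/* Standalone illustration of the "group" selection using the numbers
 * from the example above. */
#include <stdio.h>

int
main(void)
{
    const char *rxq[] = { "Q1", "Q4", "Q3", "Q0", "Q5" };
    int load[] = { 80, 70, 50, 10, 10 };   /* Measured cycles, sorted. */
    int core[] = { 3, 7, 8 };
    int est[] = { 0, 0, 0 };               /* Running estimate per core. */

    for (int i = 0; i < 5; i++) {
        int lowest = 0;

        /* Pick the core with the lowest running estimate. */
        for (int j = 1; j < 3; j++) {
            if (est[j] < est[lowest]) {
                lowest = j;
            }
        }
        est[lowest] += load[i];
        printf("%s (%d%%) -> core %d (estimate now %d%%)\n",
               rxq[i], load[i], core[lowest], est[lowest]);
    }
    return 0;
}
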
>  Alternatively, ``roundrobin`` assignment can be used, where the Rxqs are
>  assigned to PMDs in a round-robined fashion. This algorithm was used by
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index eaa4e9733..61e0a516f 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -306,4 +306,11 @@ struct pmd_auto_lb {
>  };
>
> +enum sched_assignment_type {
> +    SCHED_ROUNDROBIN,
> +    SCHED_CYCLES, /* Default. */
> +    SCHED_GROUP,
> +    SCHED_MAX
> +};
> +
>  /* Datapath based on the network device interface from netdev.h.
>   *
> @@ -367,5 +374,5 @@ struct dp_netdev {
>      struct ovs_mutex tx_qid_pool_mutex;
>      /* Use measured cycles for rxq to pmd assignment. */
> -    bool pmd_rxq_assign_cyc;
> +    enum sched_assignment_type pmd_rxq_assign_cyc;
>
>      /* Protects the access of the 'struct dp_netdev_pmd_thread'
> @@ -1799,5 +1806,5 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>
>      cmap_init(&dp->poll_threads);
> -    dp->pmd_rxq_assign_cyc = true;
> +    dp->pmd_rxq_assign_cyc = SCHED_CYCLES;
>
>      ovs_mutex_init(&dp->tx_qid_pool_mutex);
> @@ -4223,5 +4230,5 @@ set_pmd_auto_lb(struct dp_netdev *dp, bool always_log)
>      bool enable_alb = false;
>      bool multi_rxq = false;
> -    bool pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
> +    enum sched_assignment_type pmd_rxq_assign_cyc = dp->pmd_rxq_assign_cyc;
>
>      /* Ensure that there is at least 2 non-isolated PMDs and
> @@ -4242,6 +4249,6 @@ set_pmd_auto_lb(struct dp_netdev *dp, bool always_log)
>      }
>
> -    /* Enable auto LB if it is requested and cycle based assignment is true. */
> -    enable_alb = enable_alb && pmd_rxq_assign_cyc &&
> +    /* Enable auto LB if requested and not using roundrobin assignment. */
> +    enable_alb = enable_alb && pmd_rxq_assign_cyc != SCHED_ROUNDROBIN &&
>                      pmd_alb->auto_lb_requested;
>
> @@ -4284,4 +4291,5 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>      uint8_t rebalance_improve;
>      bool log_autolb = false;
> +    enum sched_assignment_type pmd_rxq_assign_cyc;
>
>      tx_flush_interval = smap_get_int(other_config, "tx-flush-interval",
> @@ -4342,9 +4350,15 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>      }
>
> -    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
> -    if (!pmd_rxq_assign_cyc && strcmp(pmd_rxq_assign, "roundrobin")) {
> -        VLOG_WARN("Unsupported Rxq to PMD assignment mode in pmd-rxq-assign. "
> -                      "Defaulting to 'cycles'.");
> -        pmd_rxq_assign_cyc = true;
> +    if (!strcmp(pmd_rxq_assign, "roundrobin")) {
> +        pmd_rxq_assign_cyc = SCHED_ROUNDROBIN;
> +    } else if (!strcmp(pmd_rxq_assign, "cycles")) {
> +        pmd_rxq_assign_cyc = SCHED_CYCLES;
> +    } else if (!strcmp(pmd_rxq_assign, "group")) {
> +        pmd_rxq_assign_cyc = SCHED_GROUP;
> +    } else {
> +        /* default */
> +        VLOG_WARN("Unsupported rx queue to PMD assignment mode in "
> +                  "pmd-rxq-assign. Defaulting to 'cycles'.");
> +        pmd_rxq_assign_cyc = SCHED_CYCLES;
>          pmd_rxq_assign = "cycles";
>      }
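
For anyone trying the series, the new mode is selected the same way as the
existing ones, e.g. (assuming the value name stays "group" as in this patch):

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group
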
> @@ -5171,4 +5185,61 @@ compare_rxq_cycles(const void *a, const void *b)
>  }
>
> +static struct sched_pmd *
> +get_lowest_num_rxq_pmd(struct sched_numa *numa)
> +{
> +    struct sched_pmd *lowest_rxqs_sched_pmd = NULL;
> +    unsigned lowest_rxqs = UINT_MAX;
> +
> +    /* find the pmd with lowest number of rxqs */
> +    for (unsigned i = 0; i < numa->n_pmds; i++) {
> +        struct sched_pmd *sched_pmd;
> +        unsigned num_rxqs;
> +
> +        sched_pmd = &numa->pmds[i];
> +        num_rxqs = sched_pmd->n_rxq;
> +        if (sched_pmd->isolated) {
> +            continue;
> +        }
> +
> +        /* If this current load is higher we can go to the next one */
> +        if (num_rxqs > lowest_rxqs) {

Testing for >= avoids the (wrongly indented) test below.
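
Something like (fragment only, untested; the same simplification would apply
to get_lowest_proc_pmd() below):

        if (num_rxqs >= lowest_rxqs) {
            continue;
        }
        lowest_rxqs = num_rxqs;
        lowest_rxqs_sched_pmd = sched_pmd;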


> +            continue;
> +        }
> +       if (num_rxqs < lowest_rxqs) {
> +           lowest_rxqs = num_rxqs;
> +           lowest_rxqs_sched_pmd = sched_pmd;
> +        }
> +    }
> +    return lowest_rxqs_sched_pmd;
> +}
> +
> +static struct sched_pmd *
> +get_lowest_proc_pmd(struct sched_numa *numa)
> +{
> +    struct sched_pmd *lowest_loaded_sched_pmd = NULL;
> +    uint64_t lowest_load = UINT64_MAX;
> +
> +    /* find the pmd with the lowest load */
> +    for (unsigned i = 0; i < numa->n_pmds; i++) {
> +        struct sched_pmd *sched_pmd;
> +        uint64_t pmd_load;
> +
> +        sched_pmd = &numa->pmds[i];
> +        if (sched_pmd->isolated) {
> +            continue;
> +        }
> +        pmd_load = sched_pmd->pmd_proc_cycles;
> +        /* If this current load is higher we can go to the next one */
> +        if (pmd_load > lowest_load) {
> +            continue;
> +        }
> +       if (pmd_load < lowest_load) {
> +           lowest_load = pmd_load;
> +           lowest_loaded_sched_pmd = sched_pmd;
> +        }
> +    }
> +    return lowest_loaded_sched_pmd;
> +}
> +
>  /*
>   * Returns the next pmd from the numa node.
> @@ -5229,16 +5300,40 @@ get_available_rr_pmd(struct sched_numa *numa, bool updown)
>
>  static struct sched_pmd *
> -get_next_pmd(struct sched_numa *numa, bool algo)
> +get_next_pmd(struct sched_numa *numa, enum sched_assignment_type algo,
> +             bool has_proc)
>  {
> -    return get_available_rr_pmd(numa, algo);
> +    if (algo == SCHED_GROUP) {
> +        struct sched_pmd *sched_pmd = NULL;
> +
> +        /* Check if the rxq has associated cycles. This is handled differently
> +         * as adding a zero-cycles rxq to a PMD would mean that the lowest
> +         * loaded PMD would not change on a subsequent call and all zero-cycle
> +         * rxqs would be assigned to the same PMD. */
> +        if (has_proc) {
> +            sched_pmd = get_lowest_proc_pmd(numa);
> +        } else {
> +            sched_pmd = get_lowest_num_rxq_pmd(numa);
> +        }
> +        /* If there is a pmd selected, return it now. */
> +        if (sched_pmd) {
> +            return sched_pmd;
> +        }

The only case where sched_pmd == NULL is when n_pmds == 0, in which
case the rr fallback would not find a pmd either.
And it seems unintuitive that, in non-rr mode, we would still call the
rr pmd selector.

I would simply return sched_pmd here.
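
E.g. the SCHED_GROUP branch could then reduce to something like (sketch,
untested):

    if (algo == SCHED_GROUP) {
        /* Zero-cycles rxqs are spread by rxq count, see comment above. */
        if (has_proc) {
            return get_lowest_proc_pmd(numa);
        }
        return get_lowest_num_rxq_pmd(numa);
    }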



> +    }
> +
> +    /* By default or as a last resort, just RR the PMDs. */
> +    return get_available_rr_pmd(numa, algo == SCHED_CYCLES ? true : false);

This ternary is unneeded.
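
i.e.:

    return get_available_rr_pmd(numa, algo == SCHED_CYCLES);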


>  }
>

[snip]


-- 
David Marchand


