[ovs-dev] [PATCH] netdev-dpdk: Add port/queue based rxq to pmd assignment.

Eelco Chaudron echaudro at redhat.com
Wed Aug 22 07:32:04 UTC 2018



On 21 Aug 2018, at 19:15, Kevin Traynor wrote:

> Prior to OVS 2.9 automatic assignment of Rxqs to PMDs
> (i.e. CPUs) was done by assigning Rxqs in an ascending
> port/queue order, round robined across the available
> PMDs.
>
> That was changed in OVS 2.9 to order the Rxqs by the
> measured processing cycles for each, in order to try
> and keep the busiest Rxqs on different PMDs.
>
> For the most part the new scheme should be better, but
> there could be situations where a user prefers a
> port/queue scheme because Rxqs from a single port are
> more likely to be spread across multiple cores, and/or
> traffic is very bursty/unpredictable.
>
> Add the option to have a port/queue based assignment.

Some small comments below… Rest looks good.

//Eelco

> Signed-off-by: Kevin Traynor <ktraynor at redhat.com>
> ---
>  Documentation/topics/dpdk/pmd.rst |  34 +++++++++--
>  NEWS                              |   2 +
>  lib/dpif-netdev.c                 | 115 +++++++++++++++++++++++++++-----------
>  tests/pmd.at                      |  15 ++++-
>  vswitchd/vswitch.xml              |  23 ++++++++
>  5 files changed, 147 insertions(+), 42 deletions(-)
>
> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
> index 5f0671e..e8628cc 100644
> --- a/Documentation/topics/dpdk/pmd.rst
> +++ b/Documentation/topics/dpdk/pmd.rst
> @@ -113,10 +113,15 @@ means that this thread will only poll the *pinned* Rx queues.
>
>  If ``pmd-rxq-affinity`` is not set for Rx queues, they will be assigned to PMDs
> -(cores) automatically. Where known, the processing cycles that have been stored
> -for each Rx queue will be used to assign Rx queue to PMDs based on a round
> -robin of the sorted Rx queues. For example, take the following example, where
> -there are five Rx queues and three cores - 3, 7, and 8 - available and the
> -measured usage of core cycles per Rx queue over the last interval is seen to
> -be:
> +(cores) automatically.
> +
> +The algorithm used to automatically assign Rxqs to PMDs can be set by::
> +
> +    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=<assignment>
> +
> +By default, ``cycles`` assignment is used where the Rxqs will be ordered by
> +their measured processing cycles, and then be assigned in descending order to
> +PMDs based on a round robin up/down walk of the PMDs. For example, where there
> +are five Rx queues and three cores - 3, 7, and 8 - available and the measured
> +usage of core cycles per Rx queue over the last interval is seen to be:
>
>  - Queue #0: 30%
> @@ -132,4 +137,21 @@ The Rx queues will be assigned to the cores in the following order::
>      Core 8: Q3 (60%) | Q0 (30%)
>
> +Alternatively, ``portqueue`` assignment can be used, where the Rxqs are
> +ordered by their port/queue number, and then assigned in ascending order to
> +PMDs based on a round robin of the PMDs. This algorithm was used by default
> +prior to OVS 2.9. For example, given the following ports and queues:
> +
> +- Port #0 Queue #0 (P0Q0)
> +- Port #0 Queue #1 (P0Q1)
> +- Port #1 Queue #0 (P1Q0)
> +- Port #1 Queue #1 (P1Q1)
> +- Port #1 Queue #2 (P1Q2)
> +
> +The Rx queues will be assigned to the cores in the following order::
> +
> +    Core 3: P0Q0 | P1Q1
> +    Core 7: P0Q1 | P1Q2
> +    Core 8: P1Q0 |
> +
>  To see the current measured usage history of PMD core cycles for each Rx
>  queue::
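
As an aside, for anyone wanting to sanity-check the two orderings outside
of OVS: a minimal standalone sketch in plain C (illustrative only - the
core IDs match the example above, the cycle percentages are partly made
up since the mail trims the list, and none of the names are OVS ones):

#include <stdio.h>
#include <stdlib.h>

/* Illustrative rxq: queue id plus measured cycles (% of a core). */
struct rxq {
    int queue;
    int cycles;
};

/* qsort callback: descending cycles, as used for "cycles" mode. */
static int
cmp_cycles(const void *a, const void *b)
{
    const struct rxq *qa = a, *qb = b;
    return qb->cycles - qa->cycles;
}

int
main(void)
{
    int cores[] = { 3, 7, 8 };
    int n_cores = 3;
    struct rxq rxqs[] = {
        { 0, 30 }, { 1, 80 }, { 2, 60 }, { 3, 70 }, { 4, 10 },
    };
    int n_rxqs = 5;

    /* "cycles": sort descending, then an up/down walk of the cores,
     * i.e. 3, 7, 8, 8, 7, 3, 3, ... */
    qsort(rxqs, n_rxqs, sizeof rxqs[0], cmp_cycles);
    int idx = 0, inc = 1;
    for (int i = 0; i < n_rxqs; i++) {
        printf("cycles:    Core %d <- Q%d (%d%%)\n",
               cores[idx], rxqs[i].queue, rxqs[i].cycles);
        if (inc) {
            if (idx == n_cores - 1) { inc = 0; } else { idx++; }
        } else {
            if (idx == 0) { inc = 1; } else { idx--; }
        }
    }

    /* "portqueue": keep port/queue order, plain wrap-around walk,
     * i.e. 3, 7, 8, 3, 7, ... */
    for (int i = 0; i < n_rxqs; i++) {
        printf("portqueue: Core %d <- Q%d\n", cores[i % n_cores], i);
    }
    return 0;
}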
> diff --git a/NEWS b/NEWS
> index 8987f9a..9e809d1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -5,4 +5,6 @@ Post-v2.10.0
>     - The environment variable OVS_CTL_TIMEOUT, if set, is now used
>       as the default timeout for control utilities.
> +   - DPDK:
> +     * Add option for port/queue based Rxq to PMD assignment.
>
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 7f836bb..507906c 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -342,4 +342,7 @@ struct dp_netdev {
>      struct id_pool *tx_qid_pool;
>      struct ovs_mutex tx_qid_pool_mutex;
> +    /* Stores whether an rxq's used cycles should be
> +     * used to influence assignment to pmds. */
> +    atomic_bool pmd_rxq_assign_cyc;
>
>      /* Protects the access of the 'struct dp_netdev_pmd_thread'
> @@ -1491,4 +1494,5 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>      atomic_init(&dp->emc_insert_min, DEFAULT_EM_FLOW_INSERT_MIN);
>      atomic_init(&dp->tx_flush_interval, DEFAULT_TX_FLUSH_INTERVAL);
> +    atomic_init(&dp->pmd_rxq_assign_cyc, true);
>
>      cmap_init(&dp->poll_threads);
> @@ -3717,4 +3721,6 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>      struct dp_netdev *dp = get_dp_netdev(dpif);
>      const char *cmask = smap_get(other_config, "pmd-cpu-mask");
> +    const char *pmd_rxq_assign = smap_get_def(other_config, "pmd-rxq-assign",
> +                                             "cycles");
>      unsigned long long insert_prob =
>          smap_get_ullong(other_config, "emc-insert-inv-prob",
> @@ -3779,4 +3785,20 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>          }
>      }
> +
> +    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
> +    if (pmd_rxq_assign_cyc || !strcmp(pmd_rxq_assign, "portqueue")) {
> +        bool cur_pmd_rxq_assign_cyc;
> +        atomic_read_relaxed(&dp->pmd_rxq_assign_cyc, &cur_pmd_rxq_assign_cyc);
> +        if (pmd_rxq_assign_cyc != cur_pmd_rxq_assign_cyc) {
> +            atomic_store_relaxed(&dp->pmd_rxq_assign_cyc, pmd_rxq_assign_cyc);
> +            if (pmd_rxq_assign_cyc) {
> +                VLOG_INFO("Rxq to PMD assignment mode: "
> +                          "Sort queues by processing cycles.");
> +            } else {
> +                VLOG_INFO("Rxq to PMD assignment mode: "
> +                          "Sort queues by port/queue number.");
> +            }
> +        }
> +    }
>      return 0;
>  }
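
Side note for anyone following the config path: the read/compare/store on
dp->pmd_rxq_assign_cyc above is the usual relaxed-atomic toggle, so the
log only fires on an actual change. Roughly the same idiom in
self-contained C11, using stdatomic directly rather than the OVS atomic
wrappers (all names here are illustrative, not the OVS ones):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the datapath's assignment-mode flag. */
static atomic_bool rxq_assign_cyc = true;

/* Mirrors the set_config logic: ignore unknown modes, and only
 * store (and log) when the value actually changes. */
static void
set_rxq_assign(const char *mode)
{
    bool cyc = !strcmp(mode, "cycles");

    if (cyc || !strcmp(mode, "portqueue")) {
        bool cur = atomic_load_explicit(&rxq_assign_cyc,
                                        memory_order_relaxed);
        if (cyc != cur) {
            atomic_store_explicit(&rxq_assign_cyc, cyc,
                                  memory_order_relaxed);
            printf("Rxq to PMD assignment mode: %s\n",
                   cyc ? "cycles" : "port/queue");
        }
    }
}

int
main(void)
{
    set_rxq_assign("portqueue");  /* logs: mode changed */
    set_rxq_assign("portqueue");  /* silent: unchanged */
    set_rxq_assign("cycles");     /* logs: mode changed */
    return 0;
}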
> @@ -4249,27 +4271,40 @@ rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
>  }
>
> -/* Returns the next pmd from the numa node in
> - * incrementing or decrementing order. */
> +/* Returns the next pmd from the numa node.
> + *
> + * If updown is 'true' it will alternate between
> + * selecting the next pmd in either an up or down
> + * walk, switching between up/down each time the
> + * min or max core is reached. e.g. 1,2,3,3,2,1,1,2...
> + *
> + * If updown is 'false' it will select the next pmd
> + * in ascending order, wrapping around when max core
> + * is reached. e.g. 1,2,3,1,2,3,1,2... */
>  static struct dp_netdev_pmd_thread *
> -rr_numa_get_pmd(struct rr_numa *numa)
> +rr_numa_get_pmd(struct rr_numa *numa, bool updown)
>  {
> -    int numa_idx = numa->cur_index;
> +    int numa_idx;
>
> -    if (numa->idx_inc == true) {
> -        /* Incrementing through list of pmds. */
> -        if (numa->cur_index == numa->n_pmds-1) {
> -            /* Reached the last pmd. */
> -            numa->idx_inc = false;
> +    if (updown) {
> +        numa_idx = numa->cur_index;
> +        if (numa->idx_inc == true) {
> +            /* Incrementing through list of pmds. */
> +            if (numa->cur_index == numa->n_pmds-1) {
> +                /* Reached the last pmd. */
> +                numa->idx_inc = false;
> +            } else {
> +                numa->cur_index++;
> +            }
>          } else {
> -            numa->cur_index++;
> +            /* Decrementing through list of pmds. */
> +            if (numa->cur_index == 0) {
> +                /* Reached the first pmd. */
> +                numa->idx_inc = true;
> +            } else {
> +                numa->cur_index--;
> +            }
>          }
>      } else {
> -        /* Decrementing through list of pmds. */
> -        if (numa->cur_index == 0) {
> -            /* Reached the first pmd. */
> -            numa->idx_inc = true;
> -        } else {
> -            numa->cur_index--;
> -        }
> +        numa_idx = numa->cur_index++ % numa->n_pmds;
>      }

This looks like a big change; how about the following instead:

@@ -4259,7 +4259,11 @@ rr_numa_get_pmd(struct rr_numa *numa)
          /* Incrementing through list of pmds. */
          if (numa->cur_index == numa->n_pmds-1) {
              /* Reached the last pmd. */
-            numa->idx_inc = false;
+            if (updown) {
+                numa->idx_inc = false;
+            } else {
+                numa->cur_index = 0;
+            }
          } else {
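
To convince myself this simplified variant walks the pmds the same way in
both modes, a quick throwaway harness (stand-in struct and three pmds,
not the OVS types):

#include <stdbool.h>
#include <stdio.h>

/* Stripped-down stand-in for struct rr_numa: just the walk state. */
struct walk {
    int cur_index;
    bool idx_inc;
    int n_pmds;
};

/* Same selection logic as the suggestion above: one increment path,
 * with updown deciding whether to reverse or wrap at the last pmd. */
static int
next_idx(struct walk *w, bool updown)
{
    int idx = w->cur_index;

    if (w->idx_inc) {
        if (w->cur_index == w->n_pmds - 1) {
            if (updown) {
                w->idx_inc = false;
            } else {
                w->cur_index = 0;
            }
        } else {
            w->cur_index++;
        }
    } else {
        if (w->cur_index == 0) {
            w->idx_inc = true;
        } else {
            w->cur_index--;
        }
    }
    return idx;
}

int
main(void)
{
    struct walk w = { 0, true, 3 };
    for (int i = 0; i < 8; i++) {
        printf("%d ", next_idx(&w, true));   /* 0 1 2 2 1 0 0 1 */
    }
    printf("\n");

    w = (struct walk) { 0, true, 3 };
    for (int i = 0; i < 8; i++) {
        printf("%d ", next_idx(&w, false));  /* 0 1 2 0 1 2 0 1 */
    }
    printf("\n");
    return 0;
}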


>      return numa->pmds[numa_idx];
> @@ -4323,7 +4358,4 @@ compare_rxq_cycles(const void *a, const void *b)
>   * pmds to unpinned queues.
>   *
> - * If 'pinned' is false queues will be sorted by processing cycles they are
> - * consuming and then assigned to pmds in round robin order.
> - *
> - * The function doesn't touch the pmd threads, it just stores the assignment
>   * in the 'pmd' member of each rxq. */
> @@ -4338,5 +4370,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>      struct rr_numa *numa = NULL;
>      int numa_id;
> +    bool assign_cyc;
>
> +    atomic_read_relaxed(&dp->pmd_rxq_assign_cyc, &assign_cyc);
>      HMAP_FOR_EACH (port, node, &dp->ports) {
>          if (!netdev_is_pmd(port->netdev)) {
> @@ -4368,10 +4402,13 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                      rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
>                  }
> -                /* Sum the queue intervals and store the cycle history. */
> -                for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> -                    cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> -                }
> -                dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, cycle_hist);
>
> +                if (assign_cyc) {
> +                    /* Sum the queue intervals and store the cycle history. */
> +                    for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
> +                        cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
> +                    }
> +                    dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
> +                                             cycle_hist);
> +                }
>                  /* Store the queue. */
>                  rxqs[n_rxqs++] = q;
> @@ -4380,5 +4417,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>      }
>
> -    if (n_rxqs > 1) {
> +    if (n_rxqs > 1 && assign_cyc) {
>          /* Sort the queues in order of the processing cycles
>           * they consumed during their last pmd interval. */
> @@ -4404,5 +4441,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                  continue;
>              }
> -            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
> +            rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa, assign_cyc);
>              VLOG_WARN("There's no available (non-isolated) pmd thread "
>                        "on numa node %d. Queue %d on port \'%s\' will "
> @@ -4413,11 +4450,21 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>                        rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
>          } else {
> -        rxqs[i]->pmd = rr_numa_get_pmd(numa);
> -        VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
> -                  "rx queue %d (measured processing cycles %"PRIu64").",
> -                  rxqs[i]->pmd->core_id, numa_id,
> -                  netdev_rxq_get_name(rxqs[i]->rx),
> -                  netdev_rxq_get_queue_id(rxqs[i]->rx),
> -                  dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
> +            rxqs[i]->pmd = rr_numa_get_pmd(numa, assign_cyc);
> +            if (assign_cyc) {
> +                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
> +                          "rx queue %d "
> +                          "(measured processing cycles %"PRIu64").",
> +                          rxqs[i]->pmd->core_id, numa_id,
> +                          netdev_rxq_get_name(rxqs[i]->rx),
> +                          netdev_rxq_get_queue_id(rxqs[i]->rx),
> +                          dp_netdev_rxq_get_cycles(rxqs[i],
> +                                                   RXQ_CYCLES_PROC_HIST));
> +            } else {
> +                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
> +                          "rx queue %d.", rxqs[i]->pmd->core_id, numa_id,
> +                          netdev_rxq_get_name(rxqs[i]->rx),
> +                          netdev_rxq_get_queue_id(rxqs[i]->rx));
> +            }
> +
>          }
>      }
> diff --git a/tests/pmd.at b/tests/pmd.at
> index 4cae6c8..7cbaf41 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -62,5 +62,6 @@ m4_define([CHECK_PMD_THREADS_CREATED], [
>
>  m4_define([SED_NUMA_CORE_PATTERN], ["s/\(numa_id \)[[0-9]]*\( core_id \)[[0-9]]*:/\1<cleared>\2<cleared>:/"])
> -m4_define([SED_NUMA_CORE_QUEUE_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
> +m4_define([SED_NUMA_CORE_QUEUE_CYC_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
> +m4_define([SED_NUMA_CORE_QUEUE_PQ_PATTERN], ["s/1 3 5 7/<group>/;s/0 2 4 6/<group>/"])
>  m4_define([DUMMY_NUMA], [--dummy-numa="0,0,0,0"])
>
> @@ -146,9 +147,19 @@ pmd thread numa_id <cleared> core_id <cleared>:
>  ])
>
> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles])

Shouldn't you do a rebalance here, like the one you added below for the
portqueue case? You are “theoretically” changing the config at this point.

>  TMP=$(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])
>  AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3])
>  CHECK_PMD_THREADS_CREATED([2], [], [+$TMP])
>
> -AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PATTERN], [0], [dnl
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_CYC_PATTERN], [0], [dnl
> +port: p0 queue-id: <group>
> +port: p0 queue-id: <group>
> +])
> +
> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=portqueue])
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-rebalance], [0], [dnl
> +pmd rxq rebalance requested.
> +])
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PQ_PATTERN], [0], [dnl
>  port: p0 queue-id: <group>
>  port: p0 queue-id: <group>
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 0cd8520..b5789a5 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -2810,4 +2810,27 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>        </column>
>
> +      <column name="other_config" key="pmd-rxq-assign"
> +              type='{"type": "string"}'>
> +        <p>
> +            Specifies how RX queues are automatically assigned to CPU cores.
> +            Options:
> +                <code>cycles</code> - Sort queues in descending order of
> +                measured processing cycles before assigning round robin
> +                to CPUs.
> +                <code>portqueue</code> - Sort queues in ascending order
> +                of port/queue number before assigning round robin to CPUs.
> +        </p>
> +        <p>
> +          The default value is <code>cycles</code>.
> +        </p>
> +        <p>
> +          Note that this will not automatically perform a reschedule.
> +          When the mode is set to <code>portqueue</code> the
> +          <code>ovs-appctl dpif-netdev/pmd-rxq-rebalance</code> command will
> +          not change the Rxq to PMD assignment, as the Port/Queue/PMDs config
> +          will be unchanged from the last reconfiguration.

The last part, about running a rebalance in portqueue mode, is a bit
confusing. I initially thought you meant the case where the mode is being
set to portqueue, but it's actually the case where the mode is already
portqueue and rebalance is run again.

> +        </p>
> +      </column>
> +
>        <column name="options" key="vhost-server-path"
>                type='{"type": "string"}'>
> -- 
> 1.8.3.1

