[ovs-dev] [PATCH v4 5/7] dpif-netdev: Add group rxq scheduling assignment type.

Kevin Traynor ktraynor at redhat.com
Thu Jul 8 13:53:42 UTC 2021


Add an rxq scheduling option that allows rxqs to be grouped
on a pmd based purely on their load.

The current default 'cycles' assignment sorts rxqs by measured
processing load and then assigns them to PMDs in round-robin order.
This helps to keep the rxqs that require the most processing on
different cores, but because the PMDs are selected in round-robin
order, rxqs are distributed equally across them.

'cycles' assignment has the advantage that it keeps the most loaded
rxqs off the same core while still spreading the rxqs across a broad
range of PMDs, to mitigate against changes in traffic patterns.

'cycles' assignment has the disadvantage that, in making this
trade-off between optimising for the current traffic load and
mitigating against future changes, it tries to assign an equal number
of rxqs per PMD in a round-robin manner, which can lead to a less than
optimal balance of the processing load.

Now that PMD auto load balance can help mitigate future changes in
traffic patterns, a 'group' assignment can be used to assign rxqs based
on their measured cycles and the estimated running total load of the
PMDs.

In this case, there is no restriction about keeping equal number of
rxqs per PMD as it is purely load based.

This means that one PMD may have a group of low load rxqs assigned to it
while another PMD has one high load rxq assigned to it, as that is the
best balance of their measured loads across the PMDs.

Signed-off-by: Kevin Traynor <ktraynor at redhat.com>
---
 Documentation/topics/dpdk/pmd.rst | 26 +++++++++++++++++++
 NEWS                              |  2 ++
 lib/dpif-netdev.c                 | 42 +++++++++++++++++++++++++++++--
 tests/pmd.at                      | 19 ++++++++++++--
 vswitchd/vswitch.xml              |  5 +++-
 5 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
index 065bd16ef..29ba53954 100644
--- a/Documentation/topics/dpdk/pmd.rst
+++ b/Documentation/topics/dpdk/pmd.rst
@@ -137,4 +137,30 @@ The Rx queues will be assigned to the cores in the following order::
     Core 8: Q3 (60%) | Q0 (30%)
 
+``group`` assignment is similar to ``cycles`` in that the Rxqs will be
+ordered by their measured processing cycles before being assigned to PMDs.
+It differs from ``cycles`` in that it uses a running estimate of the cycles
+that will be on each PMD to select the PMD with the lowest load for each Rxq.
+
+This means that there can be a group of low traffic Rxqs on one PMD, while a
+high traffic Rxq may have a PMD to itself. Whereas ``cycles`` keeps as close
+to the same number of Rxqs per PMD as possible, ``group`` removes this
+restriction for a better balance of the workload across PMDs.
+
+For example, where there are five Rx queues and three cores - 3, 7, and 8 -
+available and the measured usage of core cycles per Rx queue over the last
+interval is seen to be:
+
+- Queue #0: 10%
+- Queue #1: 80%
+- Queue #3: 50%
+- Queue #4: 70%
+- Queue #5: 10%
+
+The Rx queues will be assigned to the cores in the following order::
+
+    Core 3: Q1 (80%) |
+    Core 7: Q4 (70%) |
+    Core 8: Q3 (50%) | Q0 (10%) | Q5 (10%)
+
 Alternatively, ``roundrobin`` assignment can be used, where the Rxqs are
 assigned to PMDs in a round-robined fashion. This algorithm was used by
diff --git a/NEWS b/NEWS
index dddd57fc2..b1e186d49 100644
--- a/NEWS
+++ b/NEWS
@@ -18,4 +18,6 @@ Post-v2.15.0
      * Userspace datapath now supports up to 2^18 meters.
      * Added support for systems with non-contiguous NUMA nodes and core ids.
+     * Added new 'group' option to pmd-rxq-assign. This will assign rxqs to
+       pmds based purely on rxq and pmd load.
    - ovs-ctl:
      * New option '--no-record-hostname' to disable hostname configuration
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 0f22fdb0a..c372aa48c 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -312,4 +312,5 @@ enum sched_assignment_type {
     SCHED_ROUNDROBIN,
     SCHED_CYCLES, /* Default.*/
+    SCHED_GROUP,
     SCHED_MAX
 };
@@ -4373,4 +4374,6 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
     } else if (!strcmp(pmd_rxq_assign, "cycles")) {
         pmd_rxq_assign_type = SCHED_CYCLES;
+    } else if (!strcmp(pmd_rxq_assign, "group")) {
+        pmd_rxq_assign_type = SCHED_GROUP;
     } else {
         /* Default. */
@@ -5211,4 +5214,32 @@ compare_rxq_cycles(const void *a, const void *b)
 }
 
+static struct sched_pmd *
+sched_pmd_get_lowest(struct sched_numa *numa, bool has_cyc)
+{
+    struct sched_pmd *lowest_sched_pmd = NULL;
+    uint64_t lowest_num = UINT64_MAX;
+
+    for (unsigned i = 0; i < numa->n_pmds; i++) {
+        struct sched_pmd *sched_pmd;
+        uint64_t pmd_num;
+
+        sched_pmd = &numa->pmds[i];
+        if (sched_pmd->isolated) {
+            continue;
+        }
+        if (has_cyc) {
+            pmd_num = sched_pmd->pmd_proc_cycles;
+        } else {
+            pmd_num = sched_pmd->n_rxq;
+        }
+
+        if (pmd_num < lowest_num) {
+            lowest_num = pmd_num;
+            lowest_sched_pmd = sched_pmd;
+        }
+    }
+    return lowest_sched_pmd;
+}
+
 /*
  * Returns the next pmd from the numa node.
@@ -5269,6 +5300,12 @@ sched_pmd_next_noniso_rr(struct sched_numa *numa, bool updown)
 
 static struct sched_pmd *
-sched_pmd_next(struct sched_numa *numa, enum sched_assignment_type algo)
+sched_pmd_next(struct sched_numa *numa, enum sched_assignment_type algo,
+               bool has_proc)
 {
+    if (algo == SCHED_GROUP) {
+        return sched_pmd_get_lowest(numa, has_proc);
+    }
+
+    /* By default RR the PMDs. */
     return sched_pmd_next_noniso_rr(numa, algo == SCHED_CYCLES ? true : false);
 }
@@ -5280,4 +5317,5 @@ get_assignment_type_string(enum sched_assignment_type algo)
     case SCHED_ROUNDROBIN: return "roundrobin";
     case SCHED_CYCLES: return "cycles";
+    case SCHED_GROUP: return "group";
     case SCHED_MAX:
     default: return "Unknown";
@@ -5442,5 +5480,5 @@ sched_numa_list_schedule(struct sched_numa_list *numa_list,
 
             /* Select the PMD that should be used for this rxq. */
-            sched_pmd = sched_pmd_next(numa, algo);
+            sched_pmd = sched_pmd_next(numa, algo, proc_cycles ? true : false);
             if (sched_pmd) {
                 VLOG(level, "Core %2u on numa node %d assigned port \'%s\' "
diff --git a/tests/pmd.at b/tests/pmd.at
index 650aa5300..677620777 100644
--- a/tests/pmd.at
+++ b/tests/pmd.at
@@ -145,9 +145,21 @@ pmd thread numa_id <cleared> core_id <cleared>:
 ])
 
-AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles])
+TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
+AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=roundrobin])
+OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "Performing pmd to rx queue assignment using roundrobin algorithm"])
+
 TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
 AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3])
 CHECK_PMD_THREADS_CREATED([2], [], [+$TMP])
 
+AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/AVAIL$/ { printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl
+port: p0 queue-id: 0 2 4 6
+port: p0 queue-id: 1 3 5 7
+])
+
+TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
+AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles])
+OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "Performing pmd to rx queue assignment using cycles algorithm"])
+
 AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/AVAIL$/ { printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl
 port: p0 queue-id: 0 3 4 7
@@ -155,5 +167,8 @@ port: p0 queue-id: 1 2 5 6
 ])
 
-AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=roundrobin])
+TMP=$(($(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])+1))
+AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=group])
+OVS_WAIT_UNTIL([tail -n +$TMP ovs-vswitchd.log | grep "Performing pmd to rx queue assignment using group algorithm"])
+
 AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | awk '/AVAIL$/ { printf("%s\t", $0); next } 1' | parse_pmd_rxq_show_group | sort], [0], [dnl
 port: p0 queue-id: 0 2 4 6
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 3522b2497..a2dc74a09 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -520,5 +520,5 @@
       <column name="other_config" key="pmd-rxq-assign"
               type='{"type": "string",
-                     "enum": ["set", ["cycles", "roundrobin"]]}'>
+                     "enum": ["set", ["cycles", "roundrobin", "group"]]}'>
         <p>
           Specifies how RX queues will be automatically assigned to CPU cores.
@@ -530,4 +530,7 @@
             <dt><code>roundrobin</code></dt>
             <dd>Rxqs will be round-robined across CPU cores.</dd>
+            <dt><code>group</code></dt>
+            <dd>Rxqs will be sorted by order of measured processing cycles
+            before being assigned to the CPU core with the lowest estimated
+            load.</dd>
           </dl>
         </p>
-- 
2.31.1


