[ovs-dev] [PATCH] netdev-dpdk: Add port/queue based rxq to pmd assignment.
Kevin Traynor
ktraynor at redhat.com
Thu Aug 23 16:18:17 UTC 2018
On 08/22/2018 08:32 AM, Eelco Chaudron wrote:
>
>
> On 21 Aug 2018, at 19:15, Kevin Traynor wrote:
>
>> Prior to OVS 2.9 automatic assignment of Rxqs to PMDs
>> (i.e. CPUs) was done by assigning Rxqs in an ascending
>> port/queue order, round robined across the available
>> PMDs.
>>
>> That was changed in OVS 2.9 to order the Rxqs by the
>> measured processing cycles for each, in order to try
>> and keep the busiest Rxqs on different PMDs.
>>
>> For the most part the new scheme should be better, but
>> there could be situations where a user prefers a
>> port/queue scheme because Rxqs from a single port are
>> more likely to be spread across multiple cores, and/or
>> traffic is very bursty/unpredictable.
>>
>> Add the option to have a port/queue based assignment.
>
> Some small comments below… Rest looks good.
>
Thanks for reviewing
> //Eelco
>
>> Signed-off-by: Kevin Traynor <ktraynor at redhat.com>
>> ---
>>  Documentation/topics/dpdk/pmd.rst |  34 +++++++++--
>>  NEWS                              |   2 +
>>  lib/dpif-netdev.c                 | 115 +++++++++++++++++++++++++++-----------
>>  tests/pmd.at                      |  15 ++++-
>>  vswitchd/vswitch.xml              |  23 ++++++++
>>  5 files changed, 147 insertions(+), 42 deletions(-)
>>
>> diff --git a/Documentation/topics/dpdk/pmd.rst b/Documentation/topics/dpdk/pmd.rst
>> index 5f0671e..e8628cc 100644
>> --- a/Documentation/topics/dpdk/pmd.rst
>> +++ b/Documentation/topics/dpdk/pmd.rst
>> @@ -113,10 +113,15 @@ means that this thread will only poll the *pinned* Rx queues.
>>
>>  If ``pmd-rxq-affinity`` is not set for Rx queues, they will be assigned to PMDs
>> -(cores) automatically. Where known, the processing cycles that have been stored
>> -for each Rx queue will be used to assign Rx queue to PMDs based on a round
>> -robin of the sorted Rx queues. For example, take the following example, where
>> -there are five Rx queues and three cores - 3, 7, and 8 - available and the
>> -measured usage of core cycles per Rx queue over the last interval is seen to
>> -be:
>> +(cores) automatically.
>> +
>> +The algorithm used to automatically assign Rxqs to PMDs can be set by::
>> +
>> +    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=<assignment>
>> +
>> +By default, ``cycles`` assignment is used where the Rxqs will be ordered by
>> +their measured processing cycles, and then be assigned in descending order to
>> +PMDs based on a round robin up/down walk of the PMDs. For example, where there
>> +are five Rx queues and three cores - 3, 7, and 8 - available and the measured
>> +usage of core cycles per Rx queue over the last interval is seen to be:
>>
>> - Queue #0: 30%
>> @@ -132,4 +137,21 @@ The Rx queues will be assigned to the cores in the following order::
>> Core 8: Q3 (60%) | Q0 (30%)
>>
>> +Alternatively, ``portqueue`` assignment can be used, where the Rxqs are
>> +ordered by their port/queue number, and then assigned in ascending order to
>> +PMDs based on a round robin of the PMDs. This algorithm was used by default
>> +prior to OVS 2.9. For example, given the following ports and queues:
>> +
>> +- Port #0 Queue #0 (P0Q0)
>> +- Port #0 Queue #1 (P0Q1)
>> +- Port #1 Queue #0 (P1Q0)
>> +- Port #1 Queue #1 (P1Q1)
>> +- Port #1 Queue #2 (P1Q2)
>> +
>> +The Rx queues will be assigned to the cores in the following order::
>> +
>> + Core 3: P0Q0 | P1Q1
>> + Core 7: P0Q1 | P1Q2
>> + Core 8: P1Q0 |
>> +
>> To see the current measured usage history of PMD core cycles for each Rx
>> queue::
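
As an aside, both walk orders are easy to sanity-check outside OVS. Below is a
minimal standalone C sketch (not OVS code): the core IDs and the portqueue list
come from the examples above, while the loads for Q1, Q4 and Q5 are assumed
values based on the full pmd.rst example, which the hunk quotes only partially
(Q3 at 60% and Q0 at 30% are visible above). Compiled and run, it prints the
same groupings as the two documented examples.

/* Standalone sketch of the two assignment orders; mirrors the patch's
 * rr_numa_get_pmd() walk logic but shares no code with OVS.  Loads for
 * Q1/Q4/Q5 are assumed values, see note above. */
#include <stdio.h>
#include <stdbool.h>

struct walk {
    int cur_index;
    bool idx_inc;
};

/* 'cycles' mode: up/down walk (0,1,2,2,1,0,...).
 * 'portqueue' mode: ascending walk with wraparound (0,1,2,0,1,2,...). */
static int
next_pmd(struct walk *w, int n_pmds, bool updown)
{
    int idx;

    if (updown) {
        idx = w->cur_index;
        if (w->idx_inc) {
            if (w->cur_index == n_pmds - 1) {
                w->idx_inc = false;   /* Reached the last pmd; turn around. */
            } else {
                w->cur_index++;
            }
        } else {
            if (w->cur_index == 0) {
                w->idx_inc = true;    /* Reached the first pmd; turn around. */
            } else {
                w->cur_index--;
            }
        }
    } else {
        idx = w->cur_index++ % n_pmds;
    }
    return idx;
}

int
main(void)
{
    int cores[] = { 3, 7, 8 };
    /* cycles: queues pre-sorted by descending measured load. */
    const char *cyc[] = { "Q1 (80%)", "Q4 (70%)", "Q3 (60%)",
                          "Q0 (30%)", "Q5 (10%)" };
    /* portqueue: queues in ascending port/queue order. */
    const char *pq[] = { "P0Q0", "P0Q1", "P1Q0", "P1Q1", "P1Q2" };
    struct walk w = { 0, true };

    for (int i = 0; i < 5; i++) {
        printf("cycles:    Core %d <- %s\n",
               cores[next_pmd(&w, 3, true)], cyc[i]);
    }
    w = (struct walk) { 0, true };
    for (int i = 0; i < 5; i++) {
        printf("portqueue: Core %d <- %s\n",
               cores[next_pmd(&w, 3, false)], pq[i]);
    }
    return 0;
}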
>> diff --git a/NEWS b/NEWS
>> index 8987f9a..9e809d1 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -5,4 +5,6 @@ Post-v2.10.0
>> - The environment variable OVS_CTL_TIMEOUT, if set, is now used
>> as the default timeout for control utilities.
>> + - DPDK:
>> + * Add option for port/queue based Rxq to PMD assignment.
>>
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 7f836bb..507906c 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -342,4 +342,7 @@ struct dp_netdev {
>>      struct id_pool *tx_qid_pool;
>>      struct ovs_mutex tx_qid_pool_mutex;
>> +    /* Stores whether an rxq's used cycles should be
>> +     * used to influence assignment to pmds. */
>> +    atomic_bool pmd_rxq_assign_cyc;
>>
>>      /* Protects the access of the 'struct dp_netdev_pmd_thread'
>> @@ -1491,4 +1494,5 @@ create_dp_netdev(const char *name, const struct dpif_class *class,
>>      atomic_init(&dp->emc_insert_min, DEFAULT_EM_FLOW_INSERT_MIN);
>>      atomic_init(&dp->tx_flush_interval, DEFAULT_TX_FLUSH_INTERVAL);
>> +    atomic_init(&dp->pmd_rxq_assign_cyc, true);
>>
>>      cmap_init(&dp->poll_threads);
>> @@ -3717,4 +3721,6 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>>      struct dp_netdev *dp = get_dp_netdev(dpif);
>>      const char *cmask = smap_get(other_config, "pmd-cpu-mask");
>> +    const char *pmd_rxq_assign = smap_get_def(other_config, "pmd-rxq-assign",
>> +                                              "cycles");
>>      unsigned long long insert_prob =
>>          smap_get_ullong(other_config, "emc-insert-inv-prob",
>> @@ -3779,4 +3785,20 @@ dpif_netdev_set_config(struct dpif *dpif, const struct smap *other_config)
>>          }
>>      }
>> +
>> +    bool pmd_rxq_assign_cyc = !strcmp(pmd_rxq_assign, "cycles");
>> +    if (pmd_rxq_assign_cyc || !strcmp(pmd_rxq_assign, "portqueue")) {
>> +        bool cur_pmd_rxq_assign_cyc;
>> +        atomic_read_relaxed(&dp->pmd_rxq_assign_cyc,
>> +                            &cur_pmd_rxq_assign_cyc);
>> +        if (pmd_rxq_assign_cyc != cur_pmd_rxq_assign_cyc) {
>> +            atomic_store_relaxed(&dp->pmd_rxq_assign_cyc,
>> +                                 pmd_rxq_assign_cyc);
>> +            if (pmd_rxq_assign_cyc) {
>> +                VLOG_INFO("Rxq to PMD assignment mode: "
>> +                          "Sort queues by processing cycles.");
>> +            } else {
>> +                VLOG_INFO("Rxq to PMD assignment mode: "
>> +                          "Sort queues by port/queue number.");
>> +            }
>> +        }
>> +    }
>>      return 0;
>>  }
>> @@ -4249,27 +4271,40 @@ rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
>>  }
>>
>> -/* Returns the next pmd from the numa node in
>> - * incrementing or decrementing order. */
>> +/* Returns the next pmd from the numa node.
>> + *
>> + * If updown is 'true' it will alternate between
>> + * selecting the next pmd in either an up or down
>> + * walk, switching between up/down each time the
>> + * min or max core is reached. e.g. 1,2,3,3,2,1,1,2...
>> + *
>> + * If updown is 'false' it will select the next pmd
>> + * in ascending order, wrapping around when max core
>> + * is reached. e.g. 1,2,3,1,2,3,1,2... */
>>  static struct dp_netdev_pmd_thread *
>> -rr_numa_get_pmd(struct rr_numa *numa)
>> +rr_numa_get_pmd(struct rr_numa *numa, bool updown)
>>  {
>> -    int numa_idx = numa->cur_index;
>> +    int numa_idx;
>>
>> -    if (numa->idx_inc == true) {
>> -        /* Incrementing through list of pmds. */
>> -        if (numa->cur_index == numa->n_pmds-1) {
>> -            /* Reached the last pmd. */
>> -            numa->idx_inc = false;
>> +    if (updown) {
>> +        numa_idx = numa->cur_index;
>> +        if (numa->idx_inc == true) {
>> +            /* Incrementing through list of pmds. */
>> +            if (numa->cur_index == numa->n_pmds-1) {
>> +                /* Reached the last pmd. */
>> +                numa->idx_inc = false;
>> +            } else {
>> +                numa->cur_index++;
>> +            }
>>          } else {
>> -            numa->cur_index++;
>> +            /* Decrementing through list of pmds. */
>> +            if (numa->cur_index == 0) {
>> +                /* Reached the first pmd. */
>> +                numa->idx_inc = true;
>> +            } else {
>> +                numa->cur_index--;
>> +            }
>>          }
>>      } else {
>> -        /* Decrementing through list of pmds. */
>> -        if (numa->cur_index == 0) {
>> -            /* Reached the first pmd. */
>> -            numa->idx_inc = true;
>> -        } else {
>> -            numa->cur_index--;
>> -        }
>> +        numa_idx = numa->cur_index++ % numa->n_pmds;
>>      }
>
> This looks like a big change, how about the following:
>
> @@ -4259,7 +4259,11 @@ rr_numa_get_pmd(struct rr_numa *numa)
>          /* Incrementing through list of pmds. */
>          if (numa->cur_index == numa->n_pmds-1) {
>              /* Reached the last pmd. */
> -            numa->idx_inc = false;
> +            if (updown) {
> +                numa->idx_inc = false;
> +            } else {
> +                numa->cur_index = 0;
> +            }
>          } else {
>
I had thought it would be simpler to reuse the previous line of code for
that case, but it is probably better to do as you suggest and not have a
mix of cases where the mask is and isn't required. Changed to the suggested
version.
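
For readers following the thread, folding that suggestion in would leave the
merged function reading roughly as below; this is a sketch reconstructed from
the v1 hunk plus the fragment above, not the actual v2 code:

static struct dp_netdev_pmd_thread *
rr_numa_get_pmd(struct rr_numa *numa, bool updown)
{
    int numa_idx = numa->cur_index;

    if (numa->idx_inc == true) {
        /* Incrementing through list of pmds. */
        if (numa->cur_index == numa->n_pmds-1) {
            /* Reached the last pmd. */
            if (updown) {
                /* Up/down walk: turn around and walk back down. */
                numa->idx_inc = false;
            } else {
                /* Plain round robin: wrap to the first pmd. */
                numa->cur_index = 0;
            }
        } else {
            numa->cur_index++;
        }
    } else {
        /* Decrementing through list of pmds; only reachable in
         * up/down mode, as idx_inc never goes false otherwise. */
        if (numa->cur_index == 0) {
            /* Reached the first pmd. */
            numa->idx_inc = true;
        } else {
            numa->cur_index--;
        }
    }
    return numa->pmds[numa_idx];
}

This keeps a single walk and drops the separate modulo path, at the cost of
depending on idx_inc staying true in round-robin mode.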
>
>>      return numa->pmds[numa_idx];
>> @@ -4323,7 +4358,4 @@ compare_rxq_cycles(const void *a, const void *b)
>>   * pmds to unpinned queues.
>>   *
>> - * If 'pinned' is false queues will be sorted by processing cycles they are
>> - * consuming and then assigned to pmds in round robin order.
>> - *
>>   * The function doesn't touch the pmd threads, it just stores the assignment
>>   * in the 'pmd' member of each rxq. */
>> @@ -4338,5 +4370,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>      struct rr_numa *numa = NULL;
>>      int numa_id;
>> +    bool assign_cyc;
>>
>> +    atomic_read_relaxed(&dp->pmd_rxq_assign_cyc, &assign_cyc);
>>      HMAP_FOR_EACH (port, node, &dp->ports) {
>>          if (!netdev_is_pmd(port->netdev)) {
>> @@ -4368,10 +4402,13 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>                      rxqs = xrealloc(rxqs, sizeof *rxqs * (n_rxqs + 1));
>>                  }
>> -                /* Sum the queue intervals and store the cycle history. */
>> -                for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
>> -                    cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
>> -                }
>> -                dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST, cycle_hist);
>>
>> +                if (assign_cyc) {
>> +                    /* Sum the queue intervals and store the cycle history. */
>> +                    for (unsigned i = 0; i < PMD_RXQ_INTERVAL_MAX; i++) {
>> +                        cycle_hist += dp_netdev_rxq_get_intrvl_cycles(q, i);
>> +                    }
>> +                    dp_netdev_rxq_set_cycles(q, RXQ_CYCLES_PROC_HIST,
>> +                                             cycle_hist);
>> +                }
>>                  /* Store the queue. */
>>                  rxqs[n_rxqs++] = q;
>> @@ -4380,5 +4417,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>      }
>>
>> -    if (n_rxqs > 1) {
>> +    if (n_rxqs > 1 && assign_cyc) {
>>          /* Sort the queues in order of the processing cycles
>>           * they consumed during their last pmd interval. */
>> @@ -4404,5 +4441,5 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>              continue;
>>          }
>> -        rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa);
>> +        rxqs[i]->pmd = rr_numa_get_pmd(non_local_numa, assign_cyc);
>>          VLOG_WARN("There's no available (non-isolated) pmd thread "
>>                    "on numa node %d. Queue %d on port \'%s\' will "
>> @@ -4413,11 +4450,21 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>                        rxqs[i]->pmd->core_id, rxqs[i]->pmd->numa_id);
>>          } else {
>> -            rxqs[i]->pmd = rr_numa_get_pmd(numa);
>> -            VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>> -                      "rx queue %d (measured processing cycles %"PRIu64").",
>> -                      rxqs[i]->pmd->core_id, numa_id,
>> -                      netdev_rxq_get_name(rxqs[i]->rx),
>> -                      netdev_rxq_get_queue_id(rxqs[i]->rx),
>> -                      dp_netdev_rxq_get_cycles(rxqs[i], RXQ_CYCLES_PROC_HIST));
>> +            rxqs[i]->pmd = rr_numa_get_pmd(numa, assign_cyc);
>> +            if (assign_cyc) {
>> +                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>> +                          "rx queue %d "
>> +                          "(measured processing cycles %"PRIu64").",
>> +                          rxqs[i]->pmd->core_id, numa_id,
>> +                          netdev_rxq_get_name(rxqs[i]->rx),
>> +                          netdev_rxq_get_queue_id(rxqs[i]->rx),
>> +                          dp_netdev_rxq_get_cycles(rxqs[i],
>> +                                                   RXQ_CYCLES_PROC_HIST));
>> +            } else {
>> +                VLOG_INFO("Core %d on numa node %d assigned port \'%s\' "
>> +                          "rx queue %d.", rxqs[i]->pmd->core_id, numa_id,
>> +                          netdev_rxq_get_name(rxqs[i]->rx),
>> +                          netdev_rxq_get_queue_id(rxqs[i]->rx));
>> +            }
>> +
>>          }
>>      }
>> diff --git a/tests/pmd.at b/tests/pmd.at
>> index 4cae6c8..7cbaf41 100644
>> --- a/tests/pmd.at
>> +++ b/tests/pmd.at
>> @@ -62,5 +62,6 @@ m4_define([CHECK_PMD_THREADS_CREATED], [
>>
>>  m4_define([SED_NUMA_CORE_PATTERN], ["s/\(numa_id \)[[0-9]]*\( core_id \)[[0-9]]*:/\1<cleared>\2<cleared>:/"])
>> -m4_define([SED_NUMA_CORE_QUEUE_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
>> +m4_define([SED_NUMA_CORE_QUEUE_CYC_PATTERN], ["s/1 2 5 6/<group>/;s/0 3 4 7/<group>/"])
>> +m4_define([SED_NUMA_CORE_QUEUE_PQ_PATTERN], ["s/1 3 5 7/<group>/;s/0 2 4 6/<group>/"])
>>  m4_define([DUMMY_NUMA], [--dummy-numa="0,0,0,0"])
>>
>> @@ -146,9 +147,19 @@ pmd thread numa_id <cleared> core_id <cleared>:
>> ])
>>
>> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=cycles])
>
> Shouldn't you do a rebalance here? You are “theoretically” changing the
> config.
>
It won't be needed here or below now that re-assignment will
automatically happen when the assignment mode changes.
>> TMP=$(cat ovs-vswitchd.log | wc -l | tr -d [[:blank:]])
>> AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x3])
>> CHECK_PMD_THREADS_CREATED([2], [], [+$TMP])
>>
>> -AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PATTERN], [0], [dnl
>> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_CYC_PATTERN], [0], [dnl
>> +port: p0 queue-id: <group>
>> +port: p0 queue-id: <group>
>> +])
>> +
>> +AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=portqueue])
>> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-rebalance], [0], [dnl
>> +pmd rxq rebalance requested.
>> +])
>> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | sed ':a;/AVAIL$/{N;s/\n//;ba;}' | parse_pmd_rxq_show_group | sed SED_NUMA_CORE_QUEUE_PQ_PATTERN], [0], [dnl
>> port: p0 queue-id: <group>
>> port: p0 queue-id: <group>
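
A note on the expected groups in the sed patterns above, assuming the test's
eight Rx queues are spread over the two PMDs: a plain ascending round robin
alternates cores, so even-numbered queues land on one core and odd-numbered
queues on the other, while the up/down walk doubles back at each end:

    up/down walk (cycles):   q0->A q1->B q2->B q3->A q4->A q5->B q6->B q7->A
    round robin (portqueue): q0->A q1->B q2->A q3->B q4->A q5->B q6->A q7->B

which matches the '0 3 4 7'/'1 2 5 6' and '0 2 4 6'/'1 3 5 7' groupings that
SED_NUMA_CORE_QUEUE_CYC_PATTERN and SED_NUMA_CORE_QUEUE_PQ_PATTERN check for.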
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index 0cd8520..b5789a5 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -2810,4 +2810,27 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>> </column>
>>
>> +      <column name="other_config" key="pmd-rxq-assign"
>> +              type='{"type": "string"}'>
>> +        <p>
>> +          Specifies how RX queues are automatically assigned to CPU cores.
>> +          Options:
>> +          <code>cycles</code> - Sort queues in descending order of
>> +          measured processing cycles before assigning round robin
>> +          to CPUs.
>> +          <code>portqueue</code> - Sort queues in ascending order
>> +          of port/queue number before assigning round robin to CPUs.
>> +        </p>
>> +        <p>
>> +          The default value is <code>cycles</code>.
>> +        </p>
>> +        <p>
>> +          Note that this will not automatically perform a reschedule.
>> +          When the mode is set to <code>portqueue</code> the
>> +          <code>ovs-appctl dpif-netdev/pmd-rxq-rebalance</code> command will
>> +          not change the Rxq to PMD assignment, as the Port/Queue/PMDs config
>> +          will be unchanged from the last reconfiguration.
>
> The last part, about rebalance in portqueue mode, is a bit confusing. I
> initially thought you meant when setting it to portqueue mode, but it's
> actually the case where it's already set to portqueue mode and rebalance
> is run again.
>
You're right and the last sentence was not really needed. Anyway, the
operation has changed in v2 and I've replaced this paragraph.
thanks,
Kevin.
>> +        </p>
>> +      </column>
>> +
>>      <column name="options" key="vhost-server-path"
>>              type='{"type": "string"}'>
>> --
>> 1.8.3.1
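
For completeness, exercising the new option end to end with this v1 looks like
the following (commands as they appear in the docs and tests above; per the
discussion, v2 re-assigns automatically when the mode changes, so the explicit
rebalance step would no longer be needed):

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-rxq-assign=portqueue
    $ ovs-appctl dpif-netdev/pmd-rxq-rebalance
    $ ovs-appctl dpif-netdev/pmd-rxq-show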