[ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-local numa node.
Ilya Maximets
i.maximets at samsung.com
Mon Jul 10 08:11:54 UTC 2017
On 08.07.2017 22:09, Stokes, Ian wrote:
>> Previously if there is no available (non-isolated) pmd on the numa node
>> for a port then the port is not polled at all. This can result in a
>> non-operational system until such time as nics are physically
>> repositioned. It is preferable to operate with a pmd on the 'wrong' numa
>> node albeit with lower performance. Local pmds are still chosen when
>> available.
>>
>> Signed-off-by: Billy O'Mahony <billy.o.mahony at intel.com>
>> Signed-off-by: Ilya Maximets <i.maximets at samsung.com>
>> Co-authored-by: Ilya Maximets <i.maximets at samsung.com>
>> ---
>> v9: v8 missed some comments on v7
>> v8: Some coding style issues; doc tweak
>> v7: Incorporate review comments on docs and implementation
>> v6: Change 'port' to 'queue' in a warning msg
>> v5: Fix warning msg; Update same in docs
>> v4: Fix a checkpatch error
>> v3: Fix warning messages not appearing when using multiqueue
>> v2: Add details of warning messages into docs
>>
>> Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
>> lib/dpif-netdev.c                    | 41 +++++++++++++++++++++++++++++++++---
>> 2 files changed, 56 insertions(+), 6 deletions(-)
>>
>> diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
>> index e83f852..89775d6 100644
>> --- a/Documentation/intro/install/dpdk.rst
>> +++ b/Documentation/intro/install/dpdk.rst
>> @@ -449,7 +449,7 @@ affinitized accordingly.
>>
>> A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
>> assigned to it. A pmd thread shall poll the ports for incoming packets,
>> -  switch the packets and send to tx port. pmd thread is CPU bound, and needs
>> +  switch the packets and send to tx port. A pmd thread is CPU bound, and needs
>>    to be affinitized to isolated cores for optimum performance.
>>
>> By setting a bit in the mask, a pmd thread is created and pinned to the
>> @@ -458,8 +458,23 @@ affinitized accordingly.
>> $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
>>
>> .. note::
>> -  pmd thread on a NUMA node is only created if there is at least one DPDK
>> -  interface from that NUMA node added to OVS.
>> +  A pmd thread on a NUMA node is only created if there is at least one DPDK
>> +  interface from that NUMA node added to OVS. A pmd thread is created by
>> +  default on a core of a NUMA node or when a specified pmd-cpu-mask has
>> +  indicated so. Even though a PMD thread may exist, the thread only starts
>> +  consuming CPU cycles if there is at least one receive queue assigned to
>> +  the pmd.
>> +
>> +  .. note::
>> +    On NUMA systems PCI devices are also local to a NUMA node. Unbound rx
>> +    queues for a PCI device will assigned to a pmd on it's local NUMA node if a
>
> Minor point but should read 'will be assigned'
>> +    non-isolated PMD exists on that NUMA node. If not, the queue will be
>> +    assigned to a non-isolated pmd on a remote NUMA node. This will result in
>> +    reduced maximum throughput on that device and possibly on other devices
>> +    assigned to that pmd thread. In the case such, a queue assignment is made a
>> +    warning message will be logged: "There's no available (non-isolated) pmd
>
> Above should read 'In the case where such a queue assignment is made, a warning message will be logged'
>> +    thread on numa node N. Queue Q on port P will be assigned to the pmd on
>> +    core C (numa node N'). Expect reduced performance."
>>
>> - QEMU vCPU thread Affinity
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 4e29085..7557f32 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list *rr, int numa_id)
>>      return NULL;
>>  }
>>
>>
>> +/* Returns next NUMA from rr list in round-robin fashion. Returns the first
>> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if list is
>> + * empty. */
>> +static struct rr_numa *
>> +rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa)
>> +{
>
> The comment above can be tidied up a little to better clarify the behavior of this function.
> I ended up reading the comments for hmap_next() and hmap_first() before it made sense, and even then it's a bit ambiguous, it ends up being the code that explains the comments.
>
> You could clarify the following 2 statements:
>
> (1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I assume you mean the function parameter 'const struct rr_numa *numa' but it's not clear on first reading.
>
> (2) " or the last node passed" - again this makes sense only when you look into the behavior of the call 'hmap_next(&rr->numas, &numa->node)'.
>
> You could say something like:
>
> "Attempt to return the next NUMA from a numa list in a round robin fashion. Return the first NUMA node if the struct rr_numa *numa argument passed to the function is NULL or if the numa node argument passed to hmap_next is already the last node. Return NULL if the numa list is empty."
I'm not sure that referencing the implementation is a good way
to write comments (I mean the 'passed to hmap_next' part).
How about this:
"""
/* Returns the next node in numa list following 'numa' in round-robin fashion.
* Returns first node if 'numa' is a null pointer or the last node in 'rr'. */
"""
or
"""
/* The next node in numa list following 'numa' in round-robin fashion.
* Returns:
* - 'NULL' if 'rr' is an empty numa list.
* - First node in 'rr' if 'numa' is a null pointer.
* - First node in 'rr' if 'numa' is the last node in 'rr'.
* - Otherwise, the next node in numa list following 'numa'. */
"""
?
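
Whichever wording we go with, the behavior itself is just a cyclic
"next" over the populated numa nodes. For illustration, a minimal
standalone sketch with the same semantics (an array-based stand-in,
not the real hmap-backed 'rr_numa_list'; 'next_numa' and the values
are made up for the example):

"""
#include <stdio.h>

/* Stand-in for a populated rr_numa_list: the numa ids that have pmds. */
static const int numas[] = { 0, 1 };
static const int n_numas = sizeof numas / sizeof numas[0];

/* Returns the numa id following position '*pos', starting from the first
 * entry when '*pos' is -1 (i.e. no previous node) and wrapping around
 * after the last one.  Returns -1 if the list is empty. */
static int
next_numa(int *pos)
{
    if (!n_numas) {
        return -1;
    }
    *pos = (*pos + 1) % n_numas;
    return numas[*pos];
}

int
main(void)
{
    int pos = -1;

    for (int i = 0; i < 5; i++) {
        printf("%d\n", next_numa(&pos));   /* Prints 0, 1, 0, 1, 0. */
    }
    return 0;
}
"""

i.e. first node when there was no previous node, wrap-around after the
last node, and "no node" only when the list is empty, which is exactly
what both proposed comments are trying to say.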
>
>> +    struct hmap_node *node = NULL;
>> +
>> +    if (numa) {
>> +        node = hmap_next(&rr->numas, &numa->node);
>> +    }
>> +    if (!node) {
>> +        node = hmap_first(&rr->numas);
>> +    }
>> +
>> +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL;
>> +}
>> +
>> static void
>> rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
>> {
>> @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>  {
>>      struct dp_netdev_port *port;
>>      struct rr_numa_list rr;
>> +    struct rr_numa *non_local_numa = NULL;
>>
>>      rr_numa_list_populate(dp, &rr);
>>
>> @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>              }
>>          } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
>>              if (!numa) {
>> -                VLOG_WARN("There's no available (non isolated) pmd thread "
>> +                /* There are no pmds on the queue's local NUMA node.
>> +                   Round-robin on the NUMA nodes that do have pmds. */
>> +                non_local_numa = rr_numa_list_next(&rr, non_local_numa);
>> +                if (!non_local_numa) {
>> +                    VLOG_ERR("There is no available (non-isolated) pmd "
>> +                             "thread for port \'%s\' queue %d. This queue "
>> +                             "will not be polled. Is pmd-cpu-mask set to "
>> +                             "zero? Or are all PMDs isolated to other "
>> +                             "queues?", netdev_get_name(port->netdev),
>> +                             qid);
>> +                    continue;
>> +                }
>> +                q->pmd = rr_numa_get_pmd(non_local_numa);
>> +                VLOG_WARN("There's no available (non-isolated) pmd thread "
>>                            "on numa node %d. Queue %d on port \'%s\' will "
>> -                          "not be polled.",
>> -                          numa_id, qid, netdev_get_name(port->netdev));
>> +                          "be assigned to the pmd on core %d "
>> +                          "(numa node %d). Expect reduced performance.",
>> +                          numa_id, qid, netdev_get_name(port->netdev),
>> +                          q->pmd->core_id, q->pmd->numa_id);
>>              } else {
>> +                /* Assign queue to the next (round-robin) PMD on it's local
>> +                   NUMA node. */
>>                  q->pmd = rr_numa_get_pmd(numa);
>>              }
>>          }
>> --
>> 2.7.4
> This tested fine for me: tested with multiple rxqs distributed and isolated over pmds on 2 different numa nodes with varying pmd masks. It also passed sanity checks (clang, sparse compilation, etc.).
>
> You can add the Tested-by tag for me, but I'd like to see the changes to the documentation and function comments above before acking.
>
> Tested-by: Ian Stokes <ian.stokes at intel.com>
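
For anyone who wants to reproduce a similar test, a setup along these
lines should exercise the new code paths (the port name 'dpdk0' and the
mask/queue/affinity values below are only an example):

    $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x2
    $ ovs-vsctl set Interface dpdk0 options:n_rxq=2
    $ ovs-vsctl set Interface dpdk0 other_config:pmd-rxq-affinity="0:1"

With all pmds restricted to a single numa node, unpinned queues of a
port attached to the other node should get the new cross-numa warning
instead of being left unpolled, and isolating every pmd via
pmd-rxq-affinity should trigger the new error message.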
>>
>> _______________________________________________
>> dev mailing list
>> dev at openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
>
>