[ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-local numa node.

Ilya Maximets i.maximets at samsung.com
Mon Jul 10 10:53:00 UTC 2017


On 10.07.2017 13:42, O Mahony, Billy wrote:
> 
> 
>> -----Original Message-----
>> From: Stokes, Ian
>> Sent: Monday, July 10, 2017 10:41 AM
>> To: Ilya Maximets <i.maximets at samsung.com>; O Mahony, Billy
>> <billy.o.mahony at intel.com>; dev at openvswitch.org
>> Cc: dball at vmware.com
>> Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-local numa node.
>>
>>> On 08.07.2017 22:09, Stokes, Ian wrote:
>>>>> Previously if there is no available (non-isolated) pmd on the numa
>>>>> node for a port then the port is not polled at all. This can result
>>>>> in a non-operational system until such time as nics are physically
>>>>> repositioned. It is preferable to operate with a pmd on the 'wrong'
>>>>> numa node albeit with lower performance. Local pmds are still chosen
>>>>> when available.
>>>>>
>>>>> Signed-off-by: Billy O'Mahony <billy.o.mahony at intel.com>
>>>>> Signed-off-by: Ilya Maximets <i.maximets at samsung.com>
>>>>> Co-authored-by: Ilya Maximets <i.maximets at samsung.com>
>>>>> ---
>>>>> v9: v8 missed some comments on v7
>>>>> v8: Some coding style issues; doc tweak
>>>>> v7: Incorporate review comments on docs and implementation
>>>>> v6: Change 'port' to 'queue' in a warning msg
>>>>> v5: Fix warning msg; Update same in docs
>>>>> v4: Fix a checkpatch error
>>>>> v3: Fix warning messages not appearing when using multiqueue
>>>>> v2: Add details of warning messages into docs
>>>>>
>>>>>  Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
>>>>>  lib/dpif-netdev.c                    | 41 +++++++++++++++++++++++++++++++++---
>>>>>  2 files changed, 56 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
>>>>> index e83f852..89775d6 100644
>>>>> --- a/Documentation/intro/install/dpdk.rst
>>>>> +++ b/Documentation/intro/install/dpdk.rst
>>>>> @@ -449,7 +449,7 @@ affinitized accordingly.
>>>>>
>>>>>    A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
>>>>>    assigned to it. A pmd thread shall poll the ports for incoming packets,
>>>>> -  switch the packets and send to tx port.  pmd thread is CPU bound, and needs
>>>>> +  switch the packets and send to tx port.  A pmd thread is CPU bound, and needs
>>>>>    to be affinitized to isolated cores for optimum performance.
>>>>>
>>>>>    By setting a bit in the mask, a pmd thread is created and pinned to the
>>>>> @@ -458,8 +458,23 @@ affinitized accordingly.
>>>>>        $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
>>>>>
>>>>>    .. note::
>>>>> -    pmd thread on a NUMA node is only created if there is at least one DPDK
>>>>> -    interface from that NUMA node added to OVS.
>>>>> +    A pmd thread on a NUMA node is only created if there is at least one DPDK
>>>>> +    interface from that NUMA node added to OVS.  A pmd thread is created by
>>>>> +    default on a core of a NUMA node or when a specified pmd-cpu-mask has
>>>>> +    indicated so.  Even though a PMD thread may exist, the thread only starts
>>>>> +    consuming CPU cycles if there is least one receive queue assigned to
>>>>> +    the pmd.
>>>>> +
>>>>> +  .. note::
>>>>> +    On NUMA systems PCI devices are also local to a NUMA node.  Unbound rx
>>>>> +    queues for a PCI device will assigned to a pmd on it's local NUMA
>>>>
>>>> Minor point but should read 'will be assigned'
> 
> [[BO'M]] 
> +1
> 
>>>>> node if a
>>>>> +    non-isolated PMD exists on that NUMA node.  If not, the queue will be
>>>>> +    assigned to a non-isolated pmd on a remote NUMA node.  This will result in
>>>>> +    reduced maximum throughput on that device and possibly on other devices
>>>>> +    assigned to that pmd thread. In the case such, a queue assignment is made a
>>>>> +    warning message will be logged: "There's no available (non-isolated) pmd
>>>>
>>>> Above should read 'In the case where such a queue assignment is made, a
>>>> warning message will be logged'
> 
> [[BO'M]] 
> Suggesting a simpler:
> 'If such a queue assignment is made a warning message ..."
> 
>>>>> +    thread on numa node N. Queue Q on port P will be assigned to the pmd on
>>>>> +    core C (numa node N'). Expect reduced performance."
>>>>>
>>>>>  - QEMU vCPU thread Affinity
>>>>>
>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>>> index 4e29085..7557f32 100644
>>>>> --- a/lib/dpif-netdev.c
>>>>> +++ b/lib/dpif-netdev.c
>>>>> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list *rr, int numa_id)
>>>>>      return NULL;
>>>>>  }
>>>>>
>>>>> +/* Returns next NUMA from rr list in round-robin fashion. Returns the first
>>>>> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if list is empty. */
>>>>> +static struct rr_numa *
>>>>> +rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa *numa)
>>>>> +{
>>>>
>>>> The comment above can be tidied up a little to better clarify the
>>>> behavior of this function. I ended up reading the comments for
>>>> hmap_next() and hmap_first() before it made sense, and even then it's a
>>>> bit ambiguous; it ends up being the code that explains the comments.
>>>>
>>>> You could clarify the following 2 statements:
>>>>
>>>> (1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I assume
>>>> you mean the function parameter 'const struct rr_numa *numa' but it's
>>>> not clear on first reading.
>>>>
>>>> (2) "or the last node passed" - again this makes sense only when you
>>>> look into the behavior of the call 'hmap_next(&rr->numas, &numa->node)'.
>>>>
>>>> You could say something like:
>>>>
>>>> "Attempt to return the next NUMA from a numa list in a round robin
>>>> fashion. Return the first NUMA node if the struct rr_numa *numa
>>>> argument passed to the function is NULL or if the numa node argument
>>>> passed to hmap_next is already the last node. Return NULL if the numa
>>>> list is empty."
>>>
>>> I'm not sure that references to implementation are a good way to write
>>> comments (I mean the 'passed to hmap_next' part).
>>>
>>> How about this:
>>> """
>>> /* Returns the next node in numa list following 'numa' in round-robin fashion.
>>>  * Returns first node if 'numa' is a null pointer or the last node in 'rr'. */
>>> """
>>>
>>> or
>>>
>>> """
>>> /* The next node in numa list following 'numa' in round-robin fashion.
>>>  * Returns:
>>>  *     - 'NULL' if 'rr' is an empty numa list.
>>>  *     - First node in 'rr' if 'numa' is a null pointer.
>>>  *     - First node in 'rr' if 'numa' is the last node in 'rr'.
>>>  *     - Otherwise, the next node in numa list following 'numa'. */
>>> """
>>>
>>> ?
>>
>> I'm happy with the first option you provided above. Could you append
>> returning NULL if the list is empty? Then I think we're good.
>>
>> /* Returns the next node in numa list following 'numa' in round-robin fashion.
>>  * Returns first node if 'numa' is a null pointer or the last node in 'rr'.
>>  * Returns NULL if 'rr' numa list is empty. */
> 
> [[BO'M]] 
> Sounds good to me. Anyone object to this wording? 

I don't like three 'Returns' in a row but LGTM in general.
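
In case the wording still feels abstract, here is a tiny standalone sketch of
the round-robin behaviour the comment describes.  It deliberately uses a plain
array and made-up names (toy_numa, toy_numa_next) instead of the real
'struct rr_numa_list'/'hmap' machinery, so treat it as an illustration of the
agreed comment, not as dpif-netdev code:

/* Toy illustration only: plain array instead of the real 'hmap'.
 * Returns the element after 'cur', wraps to the first element when 'cur'
 * is NULL or the last element, and returns NULL for an empty list. */
#include <stddef.h>
#include <stdio.h>

struct toy_numa {
    int numa_id;
};

static struct toy_numa *
toy_numa_next(struct toy_numa *numas, size_t n, struct toy_numa *cur)
{
    if (n == 0) {
        return NULL;                   /* Empty list. */
    }
    if (!cur || cur == &numas[n - 1]) {
        return &numas[0];              /* NULL or last node -> first node. */
    }
    return cur + 1;                    /* Otherwise, the following node. */
}

int
main(void)
{
    struct toy_numa numas[] = { { 0 }, { 1 } };
    struct toy_numa *cur = NULL;
    int i;

    for (i = 0; i < 5; i++) {
        cur = toy_numa_next(numas, 2, cur);
        printf("%d ", cur->numa_id);
    }
    putchar('\n');
    return 0;
}

The loop prints "0 1 0 1 0", i.e. queues that have no local pmd get spread
over the NUMA nodes that do have pmds, wrapping around exactly as the
comment says.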

>>
>> Thanks
>> Ian
>>>
>>>>
>>>>> +    struct hmap_node *node = NULL;
>>>>> +
>>>>> +    if (numa) {
>>>>> +        node = hmap_next(&rr->numas, &numa->node);
>>>>> +    }
>>>>> +    if (!node) {
>>>>> +        node = hmap_first(&rr->numas);
>>>>> +    }
>>>>> +
>>>>> +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL;
>>>>> +}
>>>>> +
>>>>>  static void
>>>>>  rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)
>>>>>  {
>>>>> @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>>>>  {
>>>>>      struct dp_netdev_port *port;
>>>>>      struct rr_numa_list rr;
>>>>> +    struct rr_numa *non_local_numa = NULL;
>>>>>
>>>>>      rr_numa_list_populate(dp, &rr);
>>>>>
>>>>> @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned) OVS_REQUIRES(dp->port_mutex)
>>>>>                  }
>>>>>              } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
>>>>>                  if (!numa) {
>>>>> -                    VLOG_WARN("There's no available (non isolated) pmd thread "
>>>>> +                    /* There are no pmds on the queue's local NUMA node.
>>>>> +                       Round-robin on the NUMA nodes that do have pmds. */
>>>>> +                    non_local_numa = rr_numa_list_next(&rr, non_local_numa);
>>>>> +                    if (!non_local_numa) {
>>>>> +                        VLOG_ERR("There is no available (non-isolated) pmd "
>>>>> +                                 "thread for port \'%s\' queue %d. This queue "
>>>>> +                                 "will not be polled. Is pmd-cpu-mask set to "
>>>>> +                                 "zero? Or are all PMDs isolated to other "
>>>>> +                                 "queues?", netdev_get_name(port->netdev),
>>>>> +                                 qid);
>>>>> +                        continue;
>>>>> +                    }
>>>>> +                    q->pmd = rr_numa_get_pmd(non_local_numa);
>>>>> +                    VLOG_WARN("There's no available (non-isolated) pmd thread "
>>>>>                                "on numa node %d. Queue %d on port \'%s\' will "
>>>>> -                              "not be polled.",
>>>>> -                              numa_id, qid, netdev_get_name(port->netdev));
>>>>> +                              "be assigned to the pmd on core %d "
>>>>> +                              "(numa node %d). Expect reduced performance.",
>>>>> +                              numa_id, qid, netdev_get_name(port->netdev),
>>>>> +                              q->pmd->core_id, q->pmd->numa_id);
>>>>>                  } else {
>>>>> +                    /* Assign queue to the next (round-robin) PMD on it's local
>>>>> +                       NUMA node. */
>>>>>                      q->pmd = rr_numa_get_pmd(numa);
>>>>>                  }
>>>>>              }
>>>>> --
>>>>> 2.7.4
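
For completeness, the fallback policy of the rxq_scheduling() hunk above can
be summarised with a small standalone sketch.  Everything below (toy_pmd,
pick_local_pmd(), assign_queue()) is invented for illustration and is not
dpif-netdev code; a simple modulo index stands in for the real round-robin
over rr_numa_list:

#include <stddef.h>
#include <stdio.h>

struct toy_pmd {
    int core_id;
    int numa_id;
};

/* Hypothetical helper: first pmd on 'numa_id', or NULL if there is none. */
static struct toy_pmd *
pick_local_pmd(struct toy_pmd *pmds, size_t n, int numa_id)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (pmds[i].numa_id == numa_id) {
            return &pmds[i];
        }
    }
    return NULL;
}

/* Mirrors the control flow above: prefer a local pmd, otherwise fall back
 * to any pmd and warn, and give up with an error when no pmd exists. */
static struct toy_pmd *
assign_queue(struct toy_pmd *pmds, size_t n, int port_numa, int qid)
{
    struct toy_pmd *pmd = pick_local_pmd(pmds, n, port_numa);

    if (!pmd && n > 0) {
        pmd = &pmds[qid % n];   /* Stand-in for rr_numa_list_next(). */
        fprintf(stderr, "warn: queue %d: no pmd on numa %d, using core %d "
                "(numa %d), expect reduced performance\n",
                qid, port_numa, pmd->core_id, pmd->numa_id);
    } else if (!pmd) {
        fprintf(stderr, "error: queue %d will not be polled\n", qid);
    }
    return pmd;
}

int
main(void)
{
    struct toy_pmd pmds[] = { { 2, 0 }, { 4, 0 } };  /* Only numa 0 has pmds. */

    assign_queue(pmds, 2, 0, 0);   /* Local assignment, no warning. */
    assign_queue(pmds, 2, 1, 0);   /* Falls back to numa 0 with a warning. */
    return 0;
}

Running it assigns the first queue locally and prints the "expect reduced
performance" warning for the second, mirroring the new VLOG_WARN path; with
an empty pmd array it would take the VLOG_ERR-style path instead.
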
>>>> This tested fine for me, tested with multiple rxqs distributed and
>>>> isolated over pmds on 2 different numa nodes with varying pmd masks.
>>>> Also passed sanity checks (clang, sparse compilation etc.).
>>>>
>>>> You can add the tested by tag for me but I'd like to see the changes for
>>>> the documentation and function comments above before acking.
>>>>
>>>> Tested-by: Ian Stokes <ian.stokes at intel.com>
>>>>>
>>>>> _______________________________________________
>>>>> dev mailing list
>>>>> dev at openvswitch.org
>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev