[ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-local numa node.

O Mahony, Billy billy.o.mahony at intel.com
Mon Jul 10 11:04:13 UTC 2017


Thanks everybody. Will post v10 after lunch.

/Billy

> -----Original Message-----
> From: Stokes, Ian
> Sent: Monday, July 10, 2017 12:02 PM
> To: Ilya Maximets <i.maximets at samsung.com>; O Mahony, Billy
> <billy.o.mahony at intel.com>; dev at openvswitch.org
> Cc: dball at vmare.com
> Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-
> local numa node.
> 
> > On 10.07.2017 13:42, O Mahony, Billy wrote:
> > >
> > >
> > >> -----Original Message-----
> > >> From: Stokes, Ian
> > >> Sent: Monday, July 10, 2017 10:41 AM
> > >> To: Ilya Maximets <i.maximets at samsung.com>; O Mahony, Billy
> > >> <billy.o.mahony at intel.com>; dev at openvswitch.org
> > >> Cc: dball at vmare.com
> > >> Subject: RE: [ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds
> > >> on non- local numa node.
> > >>
> > >>> On 08.07.2017 22:09, Stokes, Ian wrote:
> > >>>>> Previously if there is no available (non-isolated) pmd on the
> > >>>>> numa node for a port then the port is not polled at all. This
> > >>>>> can result in a non- operational system until such time as nics
> > >>>>> are physically repositioned. It is preferable to operate with a
> > >>>>> pmd on
> > the 'wrong'
> > >>>>> numa node albeit with lower performance. Local pmds are still
> > >>>>> chosen
> > >>> when available.
> > >>>>>
> > >>>>> Signed-off-by: Billy O'Mahony <billy.o.mahony at intel.com>
> > >>>>> Signed-off-by: Ilya Maximets <i.maximets at samsung.com>
> > >>>>> Co-authored-by: Ilya Maximets <i.maximets at samsung.com>
> > >>>>> ---
> > >>>>> v9: v8 missed some comments on v7
> > >>>>> v8: Some coding style issues; doc tweak
> > >>>>> v7: Incorporate review comments on docs and implementation
> > >>>>> v6: Change 'port' to 'queue' in a warning msg
> > >>>>> v5: Fix warning msg; Update same in docs
> > >>>>> v4: Fix a checkpatch error
> > >>>>> v3: Fix warning messages not appearing when using multiqueue
> > >>>>> v2: Add details of warning messages into docs
> > >>>>>
> > >>>>>  Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
> > >>>>>  lib/dpif-netdev.c                    | 41
> > >>>>> +++++++++++++++++++++++++++++++++---
> > >>>>>  2 files changed, 56 insertions(+), 6 deletions(-)
> > >>>>>
> > >>>>> diff --git a/Documentation/intro/install/dpdk.rst
> > >>>>> b/Documentation/intro/install/dpdk.rst
> > >>>>> index e83f852..89775d6 100644
> > >>>>> --- a/Documentation/intro/install/dpdk.rst
> > >>>>> +++ b/Documentation/intro/install/dpdk.rst
> > >>>>> @@ -449,7 +449,7 @@ affinitized accordingly.
> > >>>>>
> > >>>>>    A poll mode driver (pmd) thread handles the I/O of all DPDK
> > >>> interfaces
> > >>>>>    assigned to it. A pmd thread shall poll the ports for
> > >>>>> incoming packets,
> > >>>>> -  switch the packets and send to tx port.  pmd thread is CPU
> > >>>>> bound, and needs
> > >>>>> +  switch the packets and send to tx port.  A pmd thread is CPU
> > >>>>> + bound, and needs
> > >>>>>    to be affinitized to isolated cores for optimum performance.
> > >>>>>
> > >>>>>    By setting a bit in the mask, a pmd thread is created and
> > >>>>> pinned to the @@ -458,8 +458,23 @@ affinitized accordingly.
> > >>>>>        $ ovs-vsctl set Open_vSwitch .
> > >>>>> other_config:pmd-cpu-mask=0x4
> > >>>>>
> > >>>>>    .. note::
> > >>>>> -    pmd thread on a NUMA node is only created if there is at least
> > one
> > >>>>> DPDK
> > >>>>> -    interface from that NUMA node added to OVS.
> > >>>>> +    A pmd thread on a NUMA node is only created if there is at
> > >>>>> + least one
> > >>>>> DPDK
> > >>>>> +    interface from that NUMA node added to OVS.  A pmd thread
> > >>>>> + is created
> > >>>>> by
> > >>>>> +    default on a core of a NUMA node or when a specified
> > >>>>> + pmd-cpu-mask
> > >>> has
> > >>>>> +    indicated so.  Even though a PMD thread may exist, the
> > >>>>> + thread only
> > >>>>> starts
> > >>>>> +    consuming CPU cycles if there is least one receive queue
> > >>>>> + assigned
> > >>> to
> > >>>>> +    the pmd.
> > >>>>> +
> > >>>>> +  .. note::
> > >>>>> +    On NUMA systems PCI devices are also local to a NUMA node.
> > >>>>> + Unbound
> > >>>>> rx
> > >>>>> +    queues for a PCI device will assigned to a pmd on it's
> > >>>>> + local NUMA
> > >>>>
> > >>>> Minor point but should read 'will be assigned'
> > >
> > > [[BO'M]]
> > > +1
> > >
> > >>>>> node if a
> > >>>>> +    non-isolated PMD exists on that NUMA node.  If not, the
> > >>>>> + queue will
> > >>> be
> > >>>>> +    assigned to a non-isolated pmd on a remote NUMA node.  This
> > >>>>> + will
> > >>>>> result in
> > >>>>> +    reduced maximum throughput on that device and possibly on
> > >>>>> + other
> > >>>>> devices
> > >>>>> +    assigned to that pmd thread. In the case such, a queue
> > >>>>> + assignment is
> > >>>>> made a
> > >>>>> +    warning message will be logged: "There's no available
> > >>>>> + (non-isolated)
> > >>>>> pmd
> > >>>>
> > >>>> Above should read 'In the case where such a queue assignment is
> > >>>> made, a
> > >>> warning message will be logged'
> > >
> > > [[BO'M]]
> > > Suggesting a simpler:
> > > 'If such a queue assignment is made a warning message ..."
> > >
> > >>>>> +    thread on numa node N. Queue Q on port P will be assigned
> > >>>>> + to the pmd
> > >>>>> on
> > >>>>> +    core C (numa node N'). Expect reduced performance."
> > >>>>>
> > >>>>>  - QEMU vCPU thread Affinity
> > >>>>>
> > >>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > >>>>> 4e29085..7557f32
> > >>>>> 100644
> > >>>>> --- a/lib/dpif-netdev.c
> > >>>>> +++ b/lib/dpif-netdev.c
> > >>>>> @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list
> > >>>>> *rr, int
> > >>>>> numa_id)
> > >>>>>      return NULL;
> > >>>>>  }
> > >>>>>
> > >>>>> +/* Returns next NUMA from rr list in round-robin fashion.
> > >>>>> +Returns the first
> > >>>>> + * NUMA node if 'NULL' or the last node passed, and 'NULL' if
> > >>>>> +list is empty. */ static struct rr_numa *
> > >>>>> +rr_numa_list_next(struct rr_numa_list *rr, const struct rr_numa
> > >>>>> +*numa) {
> > >>>>
> > >>>> The comment above can be tidied up a little to better clarify the
> > >>> behavior of this function.
> > >>>> I ended up reading the comments for hmap_next() and hmap_first()
> > >>>> before
> > >>> it made sense, and even then it's a bit ambiguous; it ends up
> > >>> being the code that explains the comments.
> > >>>>
> > >>>> You could clarify the following 2 statements:
> > >>>>
> > >>>> (1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I
> > >>>> assume
> > >>> you mean the function parameter 'const struct rr_numa *numa' but
> > >>> it's not clear on first reading.
> > >>>>
> > >>>> (2) " or the last node passed" - again this makes sense only when
> > >>>> you
> > >>> look into the behavior of the call 'hmap_next(&rr->numas, &numa-
> > >>> node)'.
> > >>>>
> > >>>> You could say something like:
> > >>>>
> > >>>> "Attempt to return the next NUMA from a numa list in a round
> > >>>> robin
> > >>> fashion. Return the first NUMA node if the struct rr_numa *numa
> > >>> argument passed to the function is NULL or if the numa node
> > >>> argument passed to hmap_next is already the last node. Return NULL
> > >>> if the numa list is empty."
> > >>>
> > >>> I'm not sure that referring to the implementation is a good way to
> > >>> write comments (I mean the 'passed to hmap_next' part).
> > >>>
> > >>> How about this:
> > >>> """
> > >>> /* Returns the next node in numa list following 'numa' in
> > >>> round-robin fashion.
> > >>>  * Returns first node if 'numa' is a null pointer or the last node
> > >>> in 'rr'. */ """
> > >>>
> > >>> or
> > >>>
> > >>> """
> > >>> /* The next node in numa list following 'numa' in round-robin fashion.
> > >>>  * Returns:
> > >>>  *     - 'NULL' if 'rr' is an empty numa list.
> > >>>  *     - First node in 'rr' if 'numa' is a null pointer.
> > >>>  *     - First node in 'rr' if 'numa' is the last node in 'rr'.
> > >>>  *     - Otherwise, the next node in numa list following 'numa'. */
> > >>> """
> > >>>
> > >>> ?
> > >>
> > >> I'm happy with the first option you provided above. Could you
> > >> append returning NULL if the list is empty? Then I think we're good.
> > >>
> > >> /* Returns the next node in numa list following 'numa' in
> > >> round-robin fashion.
> > >>  * Returns first node if 'numa' is a null pointer or the last node
> > >> in
> > 'rr'.
> > >>  * Returns NULL if 'rr' numa list is empty. */
> > >
> > > [[BO'M]]
> > > Sounds good to me. Anyone object to this wording?
> >
> > I don't like three 'Returns' in a row but LGTM in general.
> >
> 
> Once these changes are made for the next patch version, feel free to add
> my Tested-by and Acked-by tags.
> 
> Thanks
> Ian
> > >>
> > >> Thanks
> > >> Ian
> > >>>
> > >>>>
> > >>>>> +    struct hmap_node *node = NULL;
> > >>>>> +
> > >>>>> +    if (numa) {
> > >>>>> +        node = hmap_next(&rr->numas, &numa->node);
> > >>>>> +    }
> > >>>>> +    if (!node) {
> > >>>>> +        node = hmap_first(&rr->numas);
> > >>>>> +    }
> > >>>>> +
> > >>>>> +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) :
> > >>>>> + NULL; }
> > >>>>> +
> > >>>>>  static void
> > >>>>>  rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list
> > >>>>> *rr) { @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev
> > >>>>> *dp, bool
> > >>>>> pinned)
> > >>>>> OVS_REQUIRES(dp->port_mutex)  {
> > >>>>>      struct dp_netdev_port *port;
> > >>>>>      struct rr_numa_list rr;
> > >>>>> +    struct rr_numa *non_local_numa = NULL;
> > >>>>>
> > >>>>>      rr_numa_list_populate(dp, &rr);
> > >>>>>
> > >>>>> @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp,
> > >>>>> bool
> > >>>>> pinned)
> > >>>>> OVS_REQUIRES(dp->port_mutex)
> > >>>>>                  }
> > >>>>>              } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
> > >>>>>                  if (!numa) {
> > >>>>> -                    VLOG_WARN("There's no available (non isolated)
> > pmd
> > >>>>> thread "
> > >>>>> +                    /* There are no pmds on the queue's local
> > >>>>> + NUMA
> > >>> node.
> > >>>>> +                       Round-robin on the NUMA nodes that do
> > >>>>> + have
> > >>> pmds.
> > >>>>> */
> > >>>>> +                    non_local_numa = rr_numa_list_next(&rr,
> > >>>>> non_local_numa);
> > >>>>> +                    if (!non_local_numa) {
> > >>>>> +                        VLOG_ERR("There is no available
> > >>>>> + (non-isolated)
> > >>>>> pmd "
> > >>>>> +                                 "thread for port \'%s\' queue %d.
> > >>>>> + This
> > >>>>> queue "
> > >>>>> +                                 "will not be polled. Is
> > >>>>> + pmd-cpu-mask set
> > >>>>> to "
> > >>>>> +                                 "zero? Or are all PMDs
> > >>>>> + isolated to other
> > >>>>> "
> > >>>>> +                                 "queues?",
> > >>>>> + netdev_get_name(port-
> > >>>>>> netdev),
> > >>>>> +                                 qid);
> > >>>>> +                        continue;
> > >>>>> +                    }
> > >>>>> +                    q->pmd = rr_numa_get_pmd(non_local_numa);
> > >>>>> +                    VLOG_WARN("There's no available
> > >>>>> + (non-isolated) pmd
> > >>>>> thread "
> > >>>>>                                "on numa node %d. Queue %d on
> > >>>>> port
> > >>> \'%s\'
> > >>>>> will "
> > >>>>> -                              "not be polled.",
> > >>>>> -                              numa_id, qid, netdev_get_name(port-
> > >>>>>> netdev));
> > >>>>> +                              "be assigned to the pmd on core %d "
> > >>>>> +                              "(numa node %d). Expect reduced
> > >>>>> performance.",
> > >>>>> +                              numa_id, qid,
> > >>>>> + netdev_get_name(port-
> > >>>>>> netdev),
> > >>>>> +                              q->pmd->core_id,
> > >>>>> + q->pmd->numa_id);
> > >>>>>                  } else {
> > >>>>> +                    /* Assign queue to the next (round-robin)
> > >>>>> + PMD on it's
> > >>>>> local
> > >>>>> +                       NUMA node. */
> > >>>>>                      q->pmd = rr_numa_get_pmd(numa);
> > >>>>>                  }
> > >>>>>              }
> > >>>>> --
> > >>>>> 2.7.4
> > >>>> This tested fine for me, tested with multiple rxqs distributed
> > >>>> and
> > >>> isolated over pmds on 2 different numa nodes with varying pmd
> masks.
> > >>> Also passed sanity checks (clang, sparse compilation etc.).
> > >>>>
> > >>>> You can add the tested by tag for me but I'd like to see the
> > >>>> changes for
> > >>> the documentation and function comments above before acking.
> > >>>>
> > >>>> Tested-by: Ian Stokes <ian.stokes at intel.com>
> > >>>>>
> > >>>>> _______________________________________________
> > >>>>> dev mailing list
> > >>>>> dev at openvswitch.org
> > >>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > >>>>
> > >>>>
> > >>>>
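For reference, the agreed comment wording maps onto the round-robin behaviour
as follows. This is a minimal, self-contained sketch (a plain C array instead
of OVS's hmap; rr_numa_next() and the demo data are illustrative only, not the
actual dpif-netdev code):

#include <stdio.h>
#include <stddef.h>

/* Simplified stand-in for OVS's 'struct rr_numa' (illustrative only). */
struct rr_numa {
    int numa_id;
};

/* Returns the next node in the numa list following 'numa' in round-robin
 * fashion.  Returns the first node if 'numa' is a null pointer or the last
 * node in the list.  Returns NULL if the numa list is empty. */
static struct rr_numa *
rr_numa_next(struct rr_numa *nodes, size_t n, const struct rr_numa *numa)
{
    if (!n) {
        return NULL;              /* Empty list: nothing to return. */
    }
    if (!numa) {
        return &nodes[0];         /* No previous node: start at the first. */
    }
    /* Advance by one, wrapping from the last node back to the first. */
    return &nodes[((size_t) (numa - nodes) + 1) % n];
}

int
main(void)
{
    /* Two NUMA nodes that still have non-isolated pmds (made-up data). */
    struct rr_numa numas[] = { { .numa_id = 0 }, { .numa_id = 1 } };
    const struct rr_numa *non_local_numa = NULL;

    /* Four rx queues whose own NUMA node has no free pmd: spread them over
     * the remote nodes in round-robin order, the way rxq_scheduling() uses
     * its 'non_local_numa' cursor. */
    for (int qid = 0; qid < 4; qid++) {
        non_local_numa = rr_numa_next(numas, 2, non_local_numa);
        printf("queue %d -> pmd on numa node %d\n", qid,
               non_local_numa->numa_id);
    }
    return 0;
}

With two candidate nodes the four queues come out as 0, 1, 0, 1, matching the
intent of keeping a single 'non_local_numa' cursor across ports. On a running
system the resulting queue-to-pmd mapping can be checked with
'ovs-appctl dpif-netdev/pmd-rxq-show'.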

