[ovs-dev] [PATCH v9] dpif-netdev: Assign ports to pmds on non-local numa node.

Darrell Ball dball at vmware.com
Sat Jul 8 20:43:25 UTC 2017


This is one of the patches I have looked into today or will be looking into.
After comment improvements Ian has requested, I’ll fold it into an ovs-dpdk merge repo.

On 7/8/17, 12:09 PM, "ovs-dev-bounces at openvswitch.org on behalf of Stokes, Ian" <ovs-dev-bounces at openvswitch.org on behalf of ian.stokes at intel.com> wrote:

    > Previously if there is no available (non-isolated) pmd on the numa node
    > for a port then the port is not polled at all. This can result in a non-
    > operational system until such time as nics are physically repositioned. It
    > is preferable to operate with a pmd on the 'wrong' numa node albeit with
    > lower performance. Local pmds are still chosen when available.
    > 
    > Signed-off-by: Billy O'Mahony <billy.o.mahony at intel.com>
    > Signed-off-by: Ilya Maximets <i.maximets at samsung.com>
    > Co-authored-by: Ilya Maximets <i.maximets at samsung.com>
    > ---
    > v9: v8 missed some comments on v7
    > v8: Some coding style issues; doc tweak
    > v7: Incorporate review comments on docs and implementation
    > v6: Change 'port' to 'queue' in a warning msg
    > v5: Fix warning msg; Update same in docs
    > v4: Fix a checkpatch error
    > v3: Fix warning messages not appearing when using multiqueue
    > v2: Add details of warning messages into docs
    > 
    >  Documentation/intro/install/dpdk.rst | 21 +++++++++++++++---
    >  lib/dpif-netdev.c                    | 41
    > +++++++++++++++++++++++++++++++++---
    >  2 files changed, 56 insertions(+), 6 deletions(-)
    > 
    > diff --git a/Documentation/intro/install/dpdk.rst
    > b/Documentation/intro/install/dpdk.rst
    > index e83f852..89775d6 100644
    > --- a/Documentation/intro/install/dpdk.rst
    > +++ b/Documentation/intro/install/dpdk.rst
    > @@ -449,7 +449,7 @@ affinitized accordingly.
    > 
    >    A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces
    >    assigned to it. A pmd thread shall poll the ports for incoming packets,
    > -  switch the packets and send to tx port.  pmd thread is CPU bound, and
    > needs
    > +  switch the packets and send to tx port.  A pmd thread is CPU bound,
    > + and needs
    >    to be affinitized to isolated cores for optimum performance.
    > 
    >    By setting a bit in the mask, a pmd thread is created and pinned to the
    > @@ -458,8 +458,23 @@ affinitized accordingly.
    >        $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4
    > 
    >    .. note::
    > -    pmd thread on a NUMA node is only created if there is at least one
    > DPDK
    > -    interface from that NUMA node added to OVS.
    > +    A pmd thread on a NUMA node is only created if there is at least one
    > DPDK
    > +    interface from that NUMA node added to OVS.  A pmd thread is created
    > by
    > +    default on a core of a NUMA node or when a specified pmd-cpu-mask has
    > +    indicated so.  Even though a PMD thread may exist, the thread only
    > starts
    > +    consuming CPU cycles if there is least one receive queue assigned to
    > +    the pmd.
    > +
    > +  .. note::
    > +    On NUMA systems PCI devices are also local to a NUMA node.  Unbound
    > rx
    > +    queues for a PCI device will assigned to a pmd on it's local NUMA
    
    Minor point but should read 'will be assigned'
    > node if a
    > +    non-isolated PMD exists on that NUMA node.  If not, the queue will be
    > +    assigned to a non-isolated pmd on a remote NUMA node.  This will
    > result in
    > +    reduced maximum throughput on that device and possibly on other
    > devices
    > +    assigned to that pmd thread. In the case such, a queue assignment is
    > made a
    > +    warning message will be logged: "There's no available (non-isolated)
    > pmd
    
    Above should read 'In the case where such a queue assignment is made, a warning message will be logged'

There was a similar suggestion that got incorporated into v7 and then accidently reverted in v9.


    > +    thread on numa node N. Queue Q on port P will be assigned to the pmd
    > on
    > +    core C (numa node N'). Expect reduced performance."
    > 
    >  - QEMU vCPU thread Affinity
    > 
    > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 4e29085..7557f32
    > 100644
    > --- a/lib/dpif-netdev.c
    > +++ b/lib/dpif-netdev.c
    > @@ -3195,6 +3195,23 @@ rr_numa_list_lookup(struct rr_numa_list *rr, int
    > numa_id)
    >      return NULL;
    >  }
    > 
    > +/* Returns next NUMA from rr list in round-robin fashion. Returns the
    > +first
    > + * NUMA node if 'NULL' or the last node passed, and 'NULL' if list is
    > +empty. */ static struct rr_numa * rr_numa_list_next(struct rr_numa_list
    > +*rr, const struct rr_numa *numa) {
    
    The comment above can be tidied up a little to better clarify the behavior of this function.
    I ended up reading the comments for hmap_next() and hmap_first() before it made sense, and even then it's a bit ambiguous, it ends up being the code that explains the comments.
    
    You could clarify the following 2 statements:
    
    (1) "Returns the first NUMA node if 'NULL'" - If what is NULL? I assume you mean the function parameter 'const struct rr_numa *numa' but it's not clear on first reading.
    
    (2) " or the last node passed" - again this makes sense only when you look into the behavior of the call 'hmap_next(&rr->numas, &numa->node)'.
    
    You could say something like:
    
    "Attempt to return the next NUMA from a numa list in a round robin fashion. Return the first NUMA node if the struct rr_numa *numa argument passed to the function is NULL or if the numa node argument passed to hmap_next is already the last node. Return NULL if the numa list is empty."
    
    > +    struct hmap_node *node = NULL;
    > +
    > +    if (numa) {
    > +        node = hmap_next(&rr->numas, &numa->node);
    > +    }
    > +    if (!node) {
    > +        node = hmap_first(&rr->numas);
    > +    }
    > +
    > +    return (node) ? CONTAINER_OF(node, struct rr_numa, node) : NULL; }
    > +
    >  static void
    >  rr_numa_list_populate(struct dp_netdev *dp, struct rr_numa_list *rr)  {
    > @@ -3249,6 +3266,7 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned)
    > OVS_REQUIRES(dp->port_mutex)  {
    >      struct dp_netdev_port *port;
    >      struct rr_numa_list rr;
    > +    struct rr_numa *non_local_numa = NULL;
    > 
    >      rr_numa_list_populate(dp, &rr);
    > 
    > @@ -3281,11 +3299,28 @@ rxq_scheduling(struct dp_netdev *dp, bool pinned)
    > OVS_REQUIRES(dp->port_mutex)
    >                  }
    >              } else if (!pinned && q->core_id == OVS_CORE_UNSPEC) {
    >                  if (!numa) {
    > -                    VLOG_WARN("There's no available (non isolated) pmd
    > thread "
    > +                    /* There are no pmds on the queue's local NUMA node.
    > +                       Round-robin on the NUMA nodes that do have pmds.
    > */
    > +                    non_local_numa = rr_numa_list_next(&rr,
    > non_local_numa);
    > +                    if (!non_local_numa) {
    > +                        VLOG_ERR("There is no available (non-isolated)
    > pmd "
    > +                                 "thread for port \'%s\' queue %d. This
    > queue "
    > +                                 "will not be polled. Is pmd-cpu-mask set
    > to "
    > +                                 "zero? Or are all PMDs isolated to other
    > "
    > +                                 "queues?", netdev_get_name(port-
    > >netdev),
    > +                                 qid);
    > +                        continue;
    > +                    }
    > +                    q->pmd = rr_numa_get_pmd(non_local_numa);
    > +                    VLOG_WARN("There's no available (non-isolated) pmd
    > thread "
    >                                "on numa node %d. Queue %d on port \'%s\'
    > will "
    > -                              "not be polled.",
    > -                              numa_id, qid, netdev_get_name(port-
    > >netdev));
    > +                              "be assigned to the pmd on core %d "
    > +                              "(numa node %d). Expect reduced
    > performance.",
    > +                              numa_id, qid, netdev_get_name(port-
    > >netdev),
    > +                              q->pmd->core_id, q->pmd->numa_id);
    >                  } else {
    > +                    /* Assign queue to the next (round-robin) PMD on it's
    > local
    > +                       NUMA node. */
    >                      q->pmd = rr_numa_get_pmd(numa);
    >                  }
    >              }
    > --
    > 2.7.4
    This tested fine for me, tested with multiple rxqs distributed and isolated over pmds on 2 different numa nodes with varying pmd masks. Also passed sanity checks (clang, sparse compilation etc.).
    
    You can add the tested by tag for me but I'd like to see the changes for the documentation and function comments above before acking.
    
    Tested-by: Ian Stokes <ian.stokes at intel.com>
    > 
    > _______________________________________________
    > dev mailing list
    > dev at openvswitch.org
    > https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.openvswitch.org_mailman_listinfo_ovs-2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-uZnsw&m=p_za6xWBiGasOHJeLuHr5VrfE701rCPbFaS87JPIrNs&s=tsSKfYpne63_JC6ML3S1NQYPtRAu4TTsp8ZY6WFXSdE&e= 
    _______________________________________________
    dev mailing list
    dev at openvswitch.org
    https://urldefense.proofpoint.com/v2/url?u=https-3A__mail.openvswitch.org_mailman_listinfo_ovs-2Ddev&d=DwICAg&c=uilaK90D4TOVoH58JNXRgQ&r=BVhFA09CGX7JQ5Ih-uZnsw&m=p_za6xWBiGasOHJeLuHr5VrfE701rCPbFaS87JPIrNs&s=tsSKfYpne63_JC6ML3S1NQYPtRAu4TTsp8ZY6WFXSdE&e= 
    



More information about the dev mailing list