[ovs-dev] [PATCH v5 1/2] ovs-numa: Support non-contiguous numa nodes and offline CPU cores

Kevin Traynor ktraynor at redhat.com
Fri Jun 4 09:31:43 UTC 2021


On 25/05/2021 20:27, David Wilder wrote:
> This change removes the assumption that numa nodes and cores are numbered
> contiguously in linux.  This change is required to support some Power
> systems.
> 
> An additional check has been added to verify that cores are online,
> offline cores result in non-contiguously numbered cores.
> 
> I manually verified that all the exported functions defined in
> ovs-numa.h function correctly with these changes on Power and amd64
> systems.
> 
> Tests for dpif-netdev and pmd that define numa systems using the
> --dummy-numa option have been modified to use both contiguous and
> non-contiguous numbered nodes.
> 
> Signed-off-by: David Wilder <dwilder at us.ibm.com>
> Reviewed-by: David Christensen <drc at linux.vnet.ibm.com>
> Tested-by: Emma Finn <emma.finn at intel.com>
> ---
>  lib/ovs-numa.c       | 50 ++++++++++++++++++++++++++++++++++----------------
>  tests/dpif-netdev.at | 12 ++++++------
>  tests/pmd.at         | 23 ++++++++++++++++++++---
>  3 files changed, 60 insertions(+), 25 deletions(-)
> 
> diff --git a/lib/ovs-numa.c b/lib/ovs-numa.c
> index 6d0a685..8c7d25b 100644
> --- a/lib/ovs-numa.c
> +++ b/lib/ovs-numa.c
> @@ -42,16 +42,18 @@ VLOG_DEFINE_THIS_MODULE(ovs_numa);
>   * This module stores the affinity information of numa nodes and cpu cores.
>   * It also provides functions to bookkeep the pin of threads on cpu cores.
>   *
> - * It is assumed that the numa node ids and cpu core ids all start from 0 and
> - * range continuously.  So, for example, if 'ovs_numa_get_n_cores()' returns N,
> - * user can assume core ids from 0 to N-1 are all valid and there is a
> - * 'struct cpu_core' for each id.
> + * It is assumed that the numa node ids and cpu core ids all start from 0.
> + * There is no guarantee that node and cpu ids are numbered consecutively
> + * (this is a change from earlier version of the code). So, for example,
> + * if two nodes exist with ids 0 and 8, 'ovs_numa_get_n_nodes()' will
> + * return 2, no assumption of node numbering should be made.
>   *
>   * NOTE, this module should only be used by the main thread.
>   *
> - * NOTE, the assumption above will fail when cpu hotplug is used.  In that
> - * case ovs-numa will not function correctly.  For now, add a TODO entry
> - * for addressing it in the future.
> + * NOTE, if cpu hotplug is used 'all_numa_nodes' and 'all_cpu_cores' must be
> + * invalidated when ever the system topology changes. Support for detecting
> + * topology changes has not been included. For now, add a TODO entry for
> + * addressing it in the future.
>   *
>   * TODO: Fix ovs-numa when cpu hotplug is used.
>   */
> @@ -130,15 +132,14 @@ insert_new_cpu_core(struct numa_node *n, unsigned core_id)
>   * - "0,0,0,0": four cores on numa socket 0.
>   * - "0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1": 16 cores on two numa sockets.
>   * - "0,0,0,0,1,1,1,1": 8 cores on two numa sockets.
> - *
> - * The different numa ids must be consecutives or the function will abort. */
> + * - "0,0,0,0,8,8,8,8": 8 cores on two numa sockets, non-contiguous.
> + */
>  static void
>  discover_numa_and_core_dummy(void)
>  {
>      char *conf = xstrdup(dummy_config);
>      char *id, *saveptr = NULL;
>      unsigned i = 0;
> -    long max_numa_id = 0;
>  
>      for (id = strtok_r(conf, ",", &saveptr); id;
>           id = strtok_r(NULL, ",", &saveptr)) {
> @@ -152,8 +153,6 @@ discover_numa_and_core_dummy(void)
>              continue;
>          }
>  
> -        max_numa_id = MAX(max_numa_id, numa_id);
> -
>          hnode = hmap_first_with_hash(&all_numa_nodes, hash_int(numa_id, 0));
>  
>          if (hnode) {
> @@ -169,10 +168,27 @@ discover_numa_and_core_dummy(void)
>  
>      free(conf);
>  
> -    if (max_numa_id + 1 != hmap_count(&all_numa_nodes)) {
> -        ovs_fatal(0, "dummy numa contains non consecutive numa ids");
> +}
> +
> +#ifdef __linux__
> +/* Check if a cpu is detected and online */
> +static int
> +cpu_detected(unsigned int core_id)
> +{
> +    char path[PATH_MAX];
> +    int len = snprintf(path, sizeof(path),
> +                       "/sys/devices/system/cpu/cpu%d/topology/core_id",
> +                       core_id);
> +    if (len <= 0 || (unsigned) len >= sizeof(path)) {
> +        return 0;
> +    }
> +    if (access(path, F_OK) != 0) {
> +        return 0;
>      }
> +
> +    return 1;
>  }
> +#endif /* __linux__ */
>  
>  /* Discovers all numa nodes and the corresponding cpu cores.
>   * Constructs the 'struct numa_node' and 'struct cpu_core'. */
> @@ -219,7 +235,9 @@ discover_numa_and_core(void)
>                      unsigned core_id;
>  
>                      core_id = strtoul(subdir->d_name + 3, NULL, 10);
> -                    insert_new_cpu_core(n, core_id);
> +                    if (cpu_detected(core_id)) {
> +                        insert_new_cpu_core(n, core_id);
> +                    }
>                  }
>              }
>              closedir(dir);
> @@ -229,7 +247,7 @@ discover_numa_and_core(void)
>          }
>  
>          free(path);
> -        if (!dir || !numa_supported) {
> +        if (!numa_supported) {
>              break;
>          }
>      }
> diff --git a/tests/dpif-netdev.at b/tests/dpif-netdev.at
> index 57cae38..8d88626 100644
> --- a/tests/dpif-netdev.at
> +++ b/tests/dpif-netdev.at
> @@ -98,7 +98,7 @@ m4_define([DPIF_NETDEV_DUMMY_IFACE],
>                       fail-mode=secure -- \
>        add-port br1 p2 -- set interface p2 type=$1 options:stream=unix:$OVS_RUNDIR/p0.sock ofport_request=2 -- \
>        add-port br1 p8 -- set interface p8 ofport_request=8 type=$1 --], [], [],
> -      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,1,1,1,1"], [])])
> +      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,8,8,8,8"], [])])
>     AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg])
>  
>     AT_CHECK([ovs-ofctl add-flow br0 action=normal])
> @@ -130,7 +130,7 @@ m4_define([DPIF_NETDEV_MISS_FLOW_INSTALL],
>        -- set interface p1 type=$1 options:pstream=punix:$OVS_RUNDIR/p0.sock \
>        -- set bridge br0 datapath-type=dummy \
>                          other-config:datapath-id=1234 fail-mode=secure], [], [],
> -      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,1,1,1,1"], [])])
> +      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,8,8,8,8"], [])])
>     AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg])
>  
>     AT_CHECK([ovs-ofctl add-flow br0 action=normal])
> @@ -172,7 +172,7 @@ m4_define([DPIF_NETDEV_FLOW_PUT_MODIFY],
>        add-port br0 p2 -- set interface p2 type=$1 ofport_request=2 options:pstream=punix:$OVS_RUNDIR/p2.sock -- \
>        set bridge br0 datapath-type=dummy \
>                       other-config:datapath-id=1234 fail-mode=secure], [], [],
> -      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,1,1,1,1"], [])])
> +      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,8,8,8,8"], [])])
>     AT_CHECK([ovs-appctl vlog/set dpif:file:dbg dpif_netdev:file:dbg])
>  
>     # Add a flow that directs some packets received on p1 to p2 and the
> @@ -226,7 +226,7 @@ m4_define([DPIF_NETDEV_MISS_FLOW_DUMP],
>        -- set interface p1 type=$1 options:pstream=punix:$OVS_RUNDIR/p0.sock \
>        -- set bridge br0 datapath-type=dummy \
>                          other-config:datapath-id=1234 fail-mode=secure], [], [],
> -      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,1,1,1,1"], [])])
> +      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,8,8,8,8"], [])])
>     AT_CHECK([ovs-appctl upcall/disable-ufid], [0], [Datapath dumping tersely using UFID disabled
>  ], [])
>     AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg])
> @@ -384,7 +384,7 @@ m4_define([DPIF_NETDEV_FLOW_HW_OFFLOAD],
>        set interface p1 type=$1 ofport_request=1 options:pstream=punix:$OVS_RUNDIR/p1.sock options:ifindex=1100 -- \
>        set bridge br0 datapath-type=dummy \
>                       other-config:datapath-id=1234 fail-mode=secure], [], [],
> -      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,1,1,1,1"], [])])
> +      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,8,8,8,8"], [])])
>     AT_CHECK([ovs-appctl vlog/set dpif:file:dbg dpif_netdev:file:dbg netdev_dummy:file:dbg])
>  
>     AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:hw-offload=true])
> @@ -447,7 +447,7 @@ m4_define([DPIF_NETDEV_FLOW_HW_OFFLOAD_OFFSETS],
>        set interface p1 type=$1 ofport_request=1 options:pcap=p1.pcap options:ifindex=1101 -- \
>        set bridge br0 datapath-type=dummy \
>                       other-config:datapath-id=1234 fail-mode=secure], [], [],
> -      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,1,1,1,1"], [])])
> +      [m4_if([$1], [dummy-pmd], [--dummy-numa="0,0,0,0,8,8,8,8"], [])])
>     AT_CHECK([ovs-appctl vlog/set dpif:file:dbg dpif_netdev:file:dbg netdev_dummy:file:dbg])
>  
>     AT_CHECK([ovs-vsctl set Open_vSwitch . other_config:hw-offload=true])

Probably one test changed above would be enough as they don't seem to do
anything other than the initial discovery. That said, I'm not sure it
really matters if the majority of the tests are contig or non-contig
numa as they are not the focus of the test.

> diff --git a/tests/pmd.at b/tests/pmd.at
> index cc5371d..3e6d6bb 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -584,7 +584,7 @@ AT_CLEANUP
>  
>  AT_SETUP([PMD - rxq affinity - NUMA])
>  OVS_VSWITCHD_START(
> -  [], [], [], [--dummy-numa 0,0,0,1,1])
> +  [], [], [], [--dummy-numa 0,0,0,1,1,8,8])
>  AT_CHECK([ovs-appctl vlog/set dpif:dbg dpif_netdev:dbg])
>  
>  AT_CHECK([ovs-ofctl add-flow br0 actions=controller])
> @@ -601,21 +601,38 @@ p1 1 0 2
>  
>  AT_CHECK([ovs-vsctl set Interface p1 other_config:pmd-rxq-affinity="0:3,1:4"])
>  
> -dnl We moved the queues to different numa node. Expecting threads on
> +dnl We moved the queues to different contiguous numa node. Expecting threads on
>  dnl NUMA node 1 to be created.
>  AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show], [0], [dnl
>  p1 0 1 3
>  p1 1 1 4
>  ])
>  
> +AT_CHECK([ovs-vsctl set Interface p1 other_config:pmd-rxq-affinity="0:5,1:6"])
> +
> +dnl We moved the queues to different non-contiguous numa node. Expecting threads on
> +dnl NUMA node 8 to be created.
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show], [0], [dnl
> +p1 0 8 5
> +p1 1 8 6
> +])
> +
>  AT_CHECK([ovs-vsctl set Interface p1 other_config:pmd-rxq-affinity="0:3,1:1"])
>  
> -dnl Queues splitted between NUMA nodes.
> +dnl Queues splitted between contiguous NUMA nodes.
>  AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show], [0], [dnl
>  p1 0 1 3
>  p1 1 0 1
>  ])
>  
> +AT_CHECK([ovs-vsctl set Interface p1 other_config:pmd-rxq-affinity="0:5,1:1"])
> +
> +dnl Queues splitted between non-contiguous NUMA nodes.
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-rxq-show | parse_pmd_rxq_show], [0], [dnl
> +p1 0 8 5
> +p1 1 0 1
> +])
> +
>  AT_CHECK([ovs-vsctl remove Interface p1 other_config pmd-rxq-affinity])
>  
>  dnl We removed the rxq-affinity request.  dpif-netdev should assign queues
> 

Affinity tests look good. This is where the core is pinned by the user.
The default method of assigning an rxq to a pmd on a numa node is done
automatically in OVS when the user does not pin.

In that case, OVS has to generate a list of numas and tries to assign a
pmd from the same numa node as the port, so it's good to add a test to
make sure there is no contig numa dependency here. I modified a test to
do that, the lines are a bit long to paste, I'll send it in reply.



More information about the dev mailing list