[ovs-dev] [PATCH RFC 1/1] netdev-dpdk: NUMA Aware vHost

Daniele Di Proietto diproiettod at vmware.com
Thu Mar 10 23:24:36 UTC 2016


ovs-numa reads the CPUs and NUMA nodes configuration from sysfs.
It is only used to decide how to set the affinity of pmd threads.

This new code just needs get_mempolicy() from libnuma: it is used
to get the NUMA node of a virtual address and it is simply a wrapper
to a system call.  Want do you think? Should we implement that in OVS?

If we implement the system call wrapper in ovs-numa we could avoid
depending on libnuma-dev for the build, but (I guess) we would still
need to link with -lnuma, as it is a dependency of the statically linked
DPDK library

Thanks

On 10/03/2016 14:38, "Ben Pfaff" <blp at ovn.org> wrote:

>How does this numa library relate to lib/ovs-numa.[ch]?
>
>On Thu, Mar 10, 2016 at 01:22:42AM +0000, Daniele Di Proietto wrote:
>> Thanks for the patch, I'll put this in the use case list for
>> my series if I need to resend it!
>> 
>> It would be nice to get the numa socket information without
>> linking OVS with libnuma, maybe using some DPDK api. From
>> a quick look I didn't find any way, but maybe you know a
>> better way.
>> 
>> Some preliminary comments inline
>> 
>> On 04/03/2016 02:08, "dev on behalf of Ciara Loftus"
>> <dev-bounces at openvswitch.org on behalf of ciara.loftus at intel.com> wrote:
>> 
>> >This commit allows for vHost memory from QEMU, DPDK and OVS, as well
>> >as the servicing PMD, to all come from the same socket.
>> >
>> >DPDK v2.2 introduces a new configuration option:
>> >CONFIG_RTE_LIBRTE_VHOST_NUMA. If enabled, DPDK detects the socket
>> >from which a vhost device's memory has been allocated by QEMU, and
>> >accordingly reallocates device memory managed by DPDK to that same
>> >socket.
>> >
>> >OVS by default sets the socket id of a vhost port to that of the
>> >master lcore. This commit introduces the ability to update the
>> >socket id of the port if it is detected (during VM boot) that the
>> >port memory is not on the default NUMA node. If this is the case, the
>> >mempool of the port is also changed to the new node, and a PMD
>> >thread currently servicing the port will no longer, in favour of a
>> >thread from the new node (if enabled in the CPU mask).
>> >
>> >Signed-off-by: Ciara Loftus <ciara.loftus at intel.com>
>> >---
>> > INSTALL.DPDK.md   |  6 +++++-
>> > acinclude.m4      |  2 +-
>> > lib/netdev-dpdk.c | 25 +++++++++++++++++++++++--
>> > 3 files changed, 29 insertions(+), 4 deletions(-)
>> >
>> >diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>> >index dca79bd..82e6908 100644
>> >--- a/INSTALL.DPDK.md
>> >+++ b/INSTALL.DPDK.md
>> >@@ -33,6 +33,10 @@ on Debian/Ubuntu)
>> > 
>> >      `CONFIG_RTE_BUILD_COMBINE_LIBS=y`
>> > 
>> >+     Enable NUMA-aware vHost by modifying the following in the same
>>file:
>> >+
>> >+     `CONFIG_RTE_LIBRTE_VHOST_NUMA=y`
>> >+
>> 
>> I guess we should also update install_dpdk() in ./travis/build.sh to do
>> this if it's required
>> 
>> >      Then run `make install` to build and install the library.
>> >      For default install without IVSHMEM:
>> > 
>> >@@ -383,7 +387,7 @@ Performance Tuning:
>> > 
>> > 	It is good practice to ensure that threads that are in the datapath
>>are
>> > 	pinned to cores in the same NUMA area. e.g. pmd threads and QEMU
>>vCPUs
>> >-	responsible for forwarding.
>> >+	responsible for forwarding. This is now default behavior for vHost
>> >ports.
>> > 
>> >   9. Rx Mergeable buffers
>> > 
>> >diff --git a/acinclude.m4 b/acinclude.m4
>> >index 11c7787..432bdbd 100644
>> >--- a/acinclude.m4
>> >+++ b/acinclude.m4
>> >@@ -199,7 +199,7 @@ AC_DEFUN([OVS_CHECK_DPDK], [
>> >     found=false
>> >     save_LIBS=$LIBS
>> >     for extras in "" "-ldl"; do
>> >-        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB"
>> >+        LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB -lnuma"
>> 
>> I guess we should also list libnuma-dev in .travis.yml and something
>> similar
>> in rhel/openvswitch-fedora.spec
>> 
>> >         AC_LINK_IFELSE(
>> >            [AC_LANG_PROGRAM([#include <rte_config.h>
>> >                              #include <rte_eal.h>],
>> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> >index 17b8d51..4e1ce53 100644
>> >--- a/lib/netdev-dpdk.c
>> >+++ b/lib/netdev-dpdk.c
>> >@@ -29,6 +29,7 @@
>> > #include <stdio.h>
>> > #include <sys/types.h>
>> > #include <sys/stat.h>
>> >+#include <numaif.h>
>> > 
>> > #include "dirs.h"
>> > #include "dp-packet.h"
>> >@@ -1878,6 +1879,8 @@ new_device(struct virtio_net *dev)
>> > {
>> >     struct netdev_dpdk *netdev;
>> >     bool exists = false;
>> >+    int newnode = 0;
>> >+    long err = 0;
>> > 
>> >     ovs_mutex_lock(&dpdk_mutex);
>> >     /* Add device to the vhost port with the same name as that passed
>> >down. */
>> >@@ -1891,6 +1894,24 @@ new_device(struct virtio_net *dev)
>> >             }
>> >             ovsrcu_set(&netdev->virtio_dev, dev);
>> >             exists = true;
>> >+
>> >+            /* Get NUMA information */
>> >+            err = get_mempolicy(&newnode, NULL, 0, dev, MPOL_F_NODE |
>> >MPOL_F_ADDR);
>> >+            if (err) {
>> >+                VLOG_INFO("Error getting NUMA info for vHost Device
>> >'%s'",
>> >+                        dev->ifname);
>> >+                newnode = netdev->socket_id;
>> >+            } else if (newnode != netdev->socket_id) {
>> >+                netdev->socket_id = newnode;
>> >+                /* Change mempool to new NUMA Node */
>> >+                dpdk_mp_put(netdev->dpdk_mp);
>> >+                netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>> >netdev->mtu);
>> >+                /* Request netdev reconfiguration. The port may now be
>> >+                 * serviced by a PMD on the new node if enabled in the
>> >cpu
>> >+                 * mask */
>> >+                netdev_request_reconfigure(&netdev->up);
>> 
>> I think here I would prefer:
>> 
>> 1) Remembering the configuration change request
>>     netdev->requested_socket_id = newnode
>> 2) Calling netdev_request_reconfigure()
>> 3) In the netdev_dpdk_vhost_user_reconfigure() method:
>>     netdev->socket_id = netdev_requested_socket_id
>> 
>> Otherwise the datapath might be confused, because it assumes that the
>> socket_id doesn't change while the device is polled by the pmd threads.
>> 
>> It's safe to change almost everything inside the
>> netdev_dpdk_vhost_user_reconfigure(), because the device will not be
>> polled by the datapath when the function is called.
>> 
>> The same applies for the mempool: the actual change should be done
>> in netdev_dpdk_vhost_user_reconfigure(), because there might still be
>> threads using it.
>> 
>> >+            }
>> >+
>> >             dev->flags |= VIRTIO_DEV_RUNNING;
>> >             /* Disable notifications. */
>> >             set_irq_status(dev);
>> >@@ -1907,8 +1928,8 @@ new_device(struct virtio_net *dev)
>> >         return -1;
>> >     }
>> > 
>> >-    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added",
>>dev->ifname,
>> >-              dev->device_fh);
>> >+    VLOG_INFO("vHost Device '%s' %"PRIu64" has been added on socket
>>%i",
>> >+              dev->ifname, dev->device_fh, newnode);
>> >     return 0;
>> > }
>> > 
>> >-- 
>> >2.4.3
>> >
>> >_______________________________________________
>> >dev mailing list
>> >dev at openvswitch.org
>> >http://openvswitch.org/mailman/listinfo/dev
>> 
>> _______________________________________________
>> dev mailing list
>> dev at openvswitch.org
>> http://openvswitch.org/mailman/listinfo/dev




More information about the dev mailing list