[ovs-dev] [PATCH RFC 1/1] netdev-dpdk: NUMA Aware vHost
Daniele Di Proietto
diproiettod at vmware.com
Thu Mar 10 23:24:36 UTC 2016
ovs-numa reads the CPU and NUMA node configuration from sysfs.
It is only used to decide how to set the affinity of pmd threads.
This new code just needs get_mempolicy() from libnuma: it is used
to get the NUMA node of a virtual address and is simply a wrapper
around a system call. What do you think? Should we implement that in OVS?
If we implemented the system call wrapper in ovs-numa we could avoid
depending on libnuma-dev for the build, but (I guess) we would still
need to link with -lnuma, as it is a dependency of the statically linked
DPDK library.
Thanks
On 10/03/2016 14:38, "Ben Pfaff" <blp at ovn.org> wrote:
>How does this numa library relate to lib/ovs-numa.[ch]?
>
>On Thu, Mar 10, 2016 at 01:22:42AM +0000, Daniele Di Proietto wrote:
>> Thanks for the patch, I'll put this in the use case list for
>> my series if I need to resend it!
>>
>> It would be nice to get the numa socket information without
>> linking OVS with libnuma, maybe using some DPDK api. From
>> a quick look I didn't find any way, but maybe you know a
>> better way.
>>
>> Some preliminary comments inline
>>
>> On 04/03/2016 02:08, "dev on behalf of Ciara Loftus"
>> <dev-bounces at openvswitch.org on behalf of ciara.loftus at intel.com> wrote:
>>
>> >This commit allows for vHost memory from QEMU, DPDK and OVS, as well
>> >as the servicing PMD, to all come from the same socket.
>> >
>> >DPDK v2.2 introduces a new configuration option:
>> >CONFIG_RTE_LIBRTE_VHOST_NUMA. If enabled, DPDK detects the socket
>> >from which a vhost device's memory has been allocated by QEMU, and
>> >accordingly reallocates device memory managed by DPDK to that same
>> >socket.
>> >
>> >OVS by default sets the socket id of a vhost port to that of the
>> >master lcore. This commit introduces the ability to update the
>> >socket id of the port if it is detected (during VM boot) that the
>> >port memory is not on the default NUMA node. If this is the case, the
>> >mempool of the port is also changed to the new node, and a PMD
>> >thread currently servicing the port will no longer do so, in favour
>> >of a thread from the new node (if enabled in the CPU mask).
>> >
>> >Signed-off-by: Ciara Loftus <ciara.loftus at intel.com>
>> >---
>> > INSTALL.DPDK.md | 6 +++++-
>> > acinclude.m4 | 2 +-
>> > lib/netdev-dpdk.c | 25 +++++++++++++++++++++++--
>> > 3 files changed, 29 insertions(+), 4 deletions(-)
>> >
>> >diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>> >index dca79bd..82e6908 100644
>> >--- a/INSTALL.DPDK.md
>> >+++ b/INSTALL.DPDK.md
>> >@@ -33,6 +33,10 @@ on Debian/Ubuntu)
>> >
>> > `CONFIG_RTE_BUILD_COMBINE_LIBS=y`
>> >
>> >+ Enable NUMA-aware vHost by modifying the following in the same file:
>> >+
>> >+ `CONFIG_RTE_LIBRTE_VHOST_NUMA=y`
>> >+
>>
>> I guess we should also update install_dpdk() in ./travis/build.sh to do
>> this if it's required
>>
>> > Then run `make install` to build and install the library.
>> > For default install without IVSHMEM:
>> >
>> >@@ -383,7 +387,7 @@ Performance Tuning:
>> >
>> > It is good practice to ensure that threads that are in the datapath are
>> > pinned to cores in the same NUMA area. e.g. pmd threads and QEMU vCPUs
>> >- responsible for forwarding.
>> >+ responsible for forwarding. This is now default behavior for vHost ports.
>> >
>> > 9. Rx Mergeable buffers
>> >
>> >diff --git a/acinclude.m4 b/acinclude.m4
>> >index 11c7787..432bdbd 100644
>> >--- a/acinclude.m4
>> >+++ b/acinclude.m4
>> >@@ -199,7 +199,7 @@ AC_DEFUN([OVS_CHECK_DPDK], [
>> > found=false
>> > save_LIBS=$LIBS
>> > for extras in "" "-ldl"; do
>> >- LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB"
>> >+ LIBS="$DPDK_LIB $extras $save_LIBS $DPDK_EXTRA_LIB -lnuma"
>>
>> I guess we should also list libnuma-dev in .travis.yml and something
>> similar
>> in rhel/openvswitch-fedora.spec
>>
>> > AC_LINK_IFELSE(
>> > [AC_LANG_PROGRAM([#include <rte_config.h>
>> > #include <rte_eal.h>],
>> >diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> >index 17b8d51..4e1ce53 100644
>> >--- a/lib/netdev-dpdk.c
>> >+++ b/lib/netdev-dpdk.c
>> >@@ -29,6 +29,7 @@
>> > #include <stdio.h>
>> > #include <sys/types.h>
>> > #include <sys/stat.h>
>> >+#include <numaif.h>
>> >
>> > #include "dirs.h"
>> > #include "dp-packet.h"
>> >@@ -1878,6 +1879,8 @@ new_device(struct virtio_net *dev)
>> > {
>> > struct netdev_dpdk *netdev;
>> > bool exists = false;
>> >+ int newnode = 0;
>> >+ long err = 0;
>> >
>> > ovs_mutex_lock(&dpdk_mutex);
>> > /* Add device to the vhost port with the same name as that passed
>> >down. */
>> >@@ -1891,6 +1894,24 @@ new_device(struct virtio_net *dev)
>> > }
>> > ovsrcu_set(&netdev->virtio_dev, dev);
>> > exists = true;
>> >+
>> >+ /* Get NUMA information */
>> >+ err = get_mempolicy(&newnode, NULL, 0, dev, MPOL_F_NODE | MPOL_F_ADDR);
>> >+ if (err) {
>> >+ VLOG_INFO("Error getting NUMA info for vHost Device '%s'",
>> >+ dev->ifname);
>> >+ newnode = netdev->socket_id;
>> >+ } else if (newnode != netdev->socket_id) {
>> >+ netdev->socket_id = newnode;
>> >+ /* Change mempool to new NUMA Node */
>> >+ dpdk_mp_put(netdev->dpdk_mp);
>> >+ netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>> >+ /* Request netdev reconfiguration. The port may now be
>> >+ * serviced by a PMD on the new node if enabled in the cpu
>> >+ * mask */
>> >+ netdev_request_reconfigure(&netdev->up);
>>
>> I think here I would prefer:
>>
>> 1) Remembering the configuration change request
>> netdev->requested_socket_id = newnode
>> 2) Calling netdev_request_reconfigure()
>> 3) In the netdev_dpdk_vhost_user_reconfigure() method:
>> netdev->socket_id = netdev_requested_socket_id
>>
>> Otherwise the datapath might be confused, because it assumes that the
>> socket_id doesn't change while the device is polled by the pmd threads.
>>
>> It's safe to change almost everything inside the
>> netdev_dpdk_vhost_user_reconfigure(), because the device will not be
>> polled by the datapath when the function is called.
>>
>> The same applies for the mempool: the actual change should be done
>> in netdev_dpdk_vhost_user_reconfigure(), because there might still be
>> threads using it.
>>
>> >+ }
>> >+
>> > dev->flags |= VIRTIO_DEV_RUNNING;
>> > /* Disable notifications. */
>> > set_irq_status(dev);
>> >@@ -1907,8 +1928,8 @@ new_device(struct virtio_net *dev)
>> > return -1;
>> > }
>> >
>> >- VLOG_INFO("vHost Device '%s' %"PRIu64" has been added", dev->ifname,
>> >- dev->device_fh);
>> >+ VLOG_INFO("vHost Device '%s' %"PRIu64" has been added on socket %i",
>> >+ dev->ifname, dev->device_fh, newnode);
>> > return 0;
>> > }
>> >
>> >--
>> >2.4.3
>> >
>> >_______________________________________________
>> >dev mailing list
>> >dev at openvswitch.org
>> >http://openvswitch.org/mailman/listinfo/dev
>>