[ovs-dev] [PATCH v7] netdev-dpdk: Add support for vHost dequeue zero copy (experimental)
Ilya Maximets
i.maximets at samsung.com
Tue Dec 26 11:05:34 UTC 2017
On 19.12.2017 19:44, Ciara Loftus wrote:
> Enabled per port like so:
> ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
>
> If the device is already active at the point of enabling/disabling the
> feature via ovs-vsctl, one must reload the guest driver or reboot the VM
> in order for the feature to be successfully enabled/disabled.
>
> When packets from a vHost device with zero copy enabled are destined for
> a single 'dpdk' port, the number of tx descriptors on that 'dpdk' port
> must be set to a smaller value. 128 is recommended. This can be achieved
> like so:
>
> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
>
> Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
> to should not exceed 128. Due to this requirement, the feature is
> considered 'experimental'.
>
> Signed-off-by: Ciara Loftus <ciara.loftus at intel.com>
> ---
> v7:
> * Remove support for zc for dpdkvhostuser ports & patch 1 in the series.
> * Remove entry from howto Documentation as the document in its current
> state is only relevant to dpdkvhostuser ports which do not support the
> feature.
> * Use ref: link to queue size documentation.
> * Elaborate on details of NIC descriptors limitation eg. cases where
> packets are destined to >1 NICs.
> * Fix comment about padding in netdev-dpdk struct.
>
> Documentation/intro/install/dpdk.rst | 2 +
> Documentation/topics/dpdk/vhost-user.rst | 71 +++++++++++++++++++++++
> NEWS | 1 +
> lib/netdev-dpdk.c | 97 ++++++++++++++++++++++++++++----
> vswitchd/vswitch.xml | 13 +++++
> 5 files changed, 174 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
> index 3fecb5c..087eb88 100644
> --- a/Documentation/intro/install/dpdk.rst
> +++ b/Documentation/intro/install/dpdk.rst
> @@ -518,6 +518,8 @@ The above command sets the number of rx queues for DPDK physical interface.
> The rx queues are assigned to pmd threads on the same NUMA node in a
> round-robin fashion.
>
> +.. _dpdk-queues-sizes:
> +
> DPDK Physical Port Queue Sizes
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst
> index 8447e2d..d82ca40 100644
> --- a/Documentation/topics/dpdk/vhost-user.rst
> +++ b/Documentation/topics/dpdk/vhost-user.rst
> @@ -458,3 +458,74 @@ Sample XML
> </domain>
>
> .. _QEMU documentation: http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;h=7890d7169;hb=HEAD
> +
> +vhost-user Dequeue Zero Copy (experimental)
> +-------------------------------------------
> +
> +Normally when dequeuing a packet from a vHost User device, a memcpy operation
> +must be used to copy that packet from guest address space to host address
> +space. This memcpy can be removed by enabling dequeue zero-copy like so::
> +
> + $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
> +
> +With this feature enabled, a reference (pointer) to the packet is passed to
> +the host, instead of a copy of the packet. Removing this memcpy can give a
> +performance improvement for some use cases, for example switching large packets
> +between different VMs. However, additional packet loss may be observed.
> +
> +Note that the feature is disabled by default and must be explicitly enabled
> +by using the command above.
> +
> +The feature cannot be enabled when the device is active. If the device is
> +active when enabling the feature via ovs-vsctl, a driver reload or reboot
> +is required for the changes to take effect.
> +
> +The same logic applies for disabling the feature - it must be disabled when
> +the device is inactive. To disable the feature::
> +
> + $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=false
> +
> +The feature is only available to dpdkvhostuserclient port types.
> +
> +A limitation exists whereby if packets from a vHost port with dq-zero-copy=true
> +are destined for a 'dpdk' type port, the number of tx descriptors (n_txq_desc)
> +for that port must be reduced, with 128 being the recommended
> +value. This can be achieved by issuing the following command::
> +
> + $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> +
> +Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send to
> +should not exceed 128. For example, in case of a bond over two physical ports
> +in balance-tcp mode, one must divide 128 by the number of links in the bond.
> +
> +Refer to :ref:`dpdk-queues-sizes` for more information.
> +
> +This limitation exists because of how the zero copy functionality is
> +implemented. The vHost device's 'tx used vring', a virtio structure used for
> +tracking used (i.e. sent) descriptors, will only be updated when the NIC frees
> +the corresponding mbuf. If we don't free the mbufs frequently enough, that
> +vring will be starved and packets will no longer be processed. One way to
> +ensure we don't encounter this scenario is to configure n_txq_desc to a small
> +enough number such that the 'mbuf free threshold' for the NIC will be hit more
> +often and thus free mbufs more frequently. The value of 128 is suggested, but
> +values of 64 and 256 have been tested and verified to work too, with differing
> +performance characteristics. A value of 512 can be used too, if the virtio
> +queue size in the guest is increased to 1024 (available to configure in QEMU
> +versions v2.10 and greater). This value can be set like so::
> +
> + $ qemu-system-x86_64 ... -chardev socket,id=char1,path=<sockpath>,server
> + -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
> + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
> + tx_queue_size=1024
> +
> +Because of this limitation, this feature is considered 'experimental'.
> +
> +The feature currently does not fully work with QEMU >= v2.7 due to a bug in
> +DPDK which will be addressed in an upcoming release. The patch to fix this
> +issue can be found on `Patchwork
> +<http://dpdk.org/dev/patchwork/patch/32198/>`__.
> +
> +Further information can be found in the
> +`DPDK documentation
> +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__.
> diff --git a/NEWS b/NEWS
> index 752e98f..1eaa32c 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -24,6 +24,7 @@ Post-v2.8.0
> - DPDK:
> * Add support for DPDK v17.11
> * Add support for vHost IOMMU
> + * Add support for vHost dequeue zero copy (experimental)
>
> v2.8.0 - 31 Aug 2017
> --------------------
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 8f22264..4b82a06 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -378,7 +378,10 @@ struct netdev_dpdk {
>
> /* True if vHost device is 'up' and has been reconfigured at least once */
> bool vhost_reconfigured;
> - /* 3 pad bytes here. */
> +
> + /* True if the dq-zero-copy feature has successfully been enabled. */
> + bool dq_zc_enabled;
> + /* 2 pad bytes here. */
> );
>
> PADDED_MEMBERS(CACHE_LINE_SIZE,
> @@ -889,6 +892,7 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
> dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> ovsrcu_index_init(&dev->vid, -1);
> dev->vhost_reconfigured = false;
> + dev->dq_zc_enabled = false;
> dev->attached = false;
>
> ovsrcu_init(&dev->qos_conf, NULL);
> @@ -1371,6 +1375,27 @@ netdev_dpdk_ring_set_config(struct netdev *netdev, const struct smap *args,
> return 0;
> }
>
> +static bool
> +dpdk_vhost_check_zero_copy_config(struct netdev_dpdk *dev,
> + const struct smap *args)
> +{
> + bool needs_reconfigure = false;
> + bool zc_requested = smap_get_bool(args, "dq-zero-copy", false);
> +
> + if (zc_requested &&
> + !(dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
> + dev->vhost_driver_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> + needs_reconfigure = true;
> + } else if (!zc_requested &&
> + (dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
> + dev->vhost_driver_flags &= ~RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> + needs_reconfigure = true;
> + }
> +
> + /* Only try to change ZC mode when the device is down. */
> + return needs_reconfigure && (netdev_dpdk_get_vid(dev) == -1);
> +}
> +
> static int
> netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
> const struct smap *args,
> @@ -1378,15 +1403,23 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
> {
> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> const char *path;
> + bool needs_reconfigure = false;
>
> ovs_mutex_lock(&dev->mutex);
> if (!(dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT)) {
> path = smap_get(args, "vhost-server-path");
> if (path && strcmp(path, dev->vhost_id)) {
> strcpy(dev->vhost_id, path);
> - netdev_request_reconfigure(netdev);
> + needs_reconfigure = true;
> }
> }
> +
> + needs_reconfigure |= dpdk_vhost_check_zero_copy_config(dev, args);
> +
> + if (needs_reconfigure) {
> + netdev_request_reconfigure(netdev);
> + }
> +
> ovs_mutex_unlock(&dev->mutex);
>
> return 0;
> @@ -2719,6 +2752,22 @@ netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
> }
> }
>
> +static bool
> +vhost_zero_copy_status_changed(struct netdev_dpdk *dev, bool *enable_zc)
> +{
> + if ((dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)
> + && !dev->dq_zc_enabled) {
> + *enable_zc = true;
> + return true;
> + } else if (!(dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)
> + && dev->dq_zc_enabled) {
> + *enable_zc = false;
> + return true;
> + }
> +
> + return false;
> +}
> +
> /*
> * Remove a virtio-net device from the specific vhost port. Use dev->remove
> * flag to stop any more packets from being sent or received to/from a VM and
> @@ -2764,6 +2813,7 @@ destroy_device(int vid)
> */
> ovsrcu_quiesce_start();
> VLOG_INFO("vHost Device '%s' has been removed", ifname);
> + netdev_request_reconfigure(&dev->up);
> } else {
> VLOG_INFO("vHost Device '%s' not found", ifname);
> }
> @@ -3278,24 +3328,39 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
> {
> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> int err;
> - uint64_t vhost_flags = 0;
> + uint64_t vhost_flags = dev->vhost_driver_flags;
> + bool enable_zc = false;
> + bool zc_status_changed = false;
>
> ovs_mutex_lock(&dev->mutex);
>
> - /* Configure vHost client mode if requested and if the following criteria
> - * are met:
> - * 1. Device hasn't been registered yet.
> - * 2. A path has been specified.
> + zc_status_changed = vhost_zero_copy_status_changed(dev, &enable_zc);
> +
> + /* Reconfigure the device if either:
> + * 1. Device hasn't been registered and a path has been specified.
> + * 2. Requested zero copy mode has changed from current configuration.
> */
> - if (!(dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT)
> - && strlen(dev->vhost_id)) {
> - /* Register client-mode device. */
> + if ((!(dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT)
> + && strlen(dev->vhost_id))
> + || zc_status_changed) {
> +
> vhost_flags |= RTE_VHOST_USER_CLIENT;
>
> /* Enable IOMMU support, if explicitly requested. */
> if (dpdk_vhost_iommu_enabled()) {
> vhost_flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
> }
> +
> + /* Device has already been registered. Need to unregister. */
> + if (dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT) {
> + err = rte_vhost_driver_unregister(dev->vhost_id);
This causes a deadlock if a VM is started between setting "dq-zero-copy=true" and
the actual reconfiguration of the netdev, but before the "new_device" callback.
It can easily be reproduced by inserting a sufficiently long sleep before taking
the mutex in 'netdev_dpdk_vhost_client_reconfigure' (for example, sleep(10)).
Running the following command with this change will cause an abort on double
locking of 'dev->mutex':
ovs-vsctl --no-wait set interface vhost1 options:dq-zero-copy=true && \
sleep 2 && ./run_vm.sh
2017-12-26T07:36:43Z|00169|netdev_dpdk|INFO|vHost Device './vhost1' has been added on numa node 0
2017-12-26T07:36:43Z|00170|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
2017-12-26T07:36:43Z|00171|dpdk|INFO|VHOST_CONFIG: vring call idx:7 file:89
<...>
2017-12-26T07:36:47Z|00221|dpdk|INFO|VHOST_CONFIG: free connfd = 67 for device './vhost1'
ovs-vswitchd: lib/netdev-dpdk.c:2790: pthread_mutex_lock failed (Resource deadlock avoided)
Program received signal SIGABRT, Aborted.
0x0000007fb7a79340 in raise () from /lib64/libc.so.6
(gdb) bt
#0 raise () from /lib64/libc.so.6
#1 abort () from /lib64/libc.so.6
#2 ovs_abort_valist (err_no=0, format=0xbe81 , args=...) at lib/util.c:347
#3 ovs_abort (err_no=0, format=0xbe81) at lib/util.c:339
#4 ovs_mutex_lock_at (l_=<optimized out>, where=0x91154a "lib/netdev-dpdk.c:2790") at lib/ovs-thread.c:76
#5 destroy_device (vid=<optimized out>) at lib/netdev-dpdk.c:2790
#6 vhost_destroy_device (vid=0) at lib/librte_vhost/vhost.c:335
#7 rte_vhost_driver_unregister (path=0x7f4a41cf80 "./vhost1") at lib/librte_vhost/socket.c:771
#8 netdev_dpdk_vhost_client_reconfigure (netdev=<optimized out>) at lib/netdev-dpdk.c:3362
#9 port_reconfigure (port=<optimized out>) at lib/dpif-netdev.c:3276
#10 reconfigure_datapath (dp=0xd6e420) at lib/dpif-netdev.c:3757
#11 dpif_netdev_run (dpif=<optimized out>) at lib/dpif-netdev.c:3895
#12 type_run (type=<optimized out>) at ofproto/ofproto-dpif.c:335
#13 ofproto_type_run (datapath_type=0xd37580 "netdev") at ofproto/ofproto.c:1702
#14 bridge_run__ () at vswitchd/bridge.c:2913
#15 bridge_reconfigure (ovs_cfg=<optimized out>) at vswitchd/bridge.c:719
#16 bridge_run () at vswitchd/bridge.c:2998
#17 main (argc=<optimized out>, argv=<optimized out>) at vswitchd/ovs-vswitchd.c:119
There are a few more complex deadlocks possible due to the race condition between
VM start and the unregistering of the vhost device.
Right now I see no way to safely call 'rte_vhost_driver_unregister' while holding
the lock.
See commit 3f891bbea61d ("netdev-dpdk: Fix deadlock in destroy_device().") for a
similar issue with 'netdev_dpdk_vhost_destruct()'.
Best regards, Ilya Maximets.
> +
> + if (err) {
> + VLOG_ERR("Error unregistering vHost socket %s", dev->vhost_id);
> + goto unlock;
> + }
> + }
> +
> err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
> if (err) {
> VLOG_ERR("vhost-user device setup failure for device %s\n",
> @@ -3333,6 +3398,18 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
> "client port: %s\n", dev->up.name);
> goto unlock;
> }
> +
> + if (zc_status_changed) {
> + if (enable_zc) {
> + dev->dq_zc_enabled = true;
> + VLOG_INFO("Zero copy enabled for vHost port %s",
> + dev->up.name);
> + } else {
> + dev->dq_zc_enabled = false;
> + VLOG_INFO("Zero copy disabled for vHost port %s",
> + dev->up.name);
> + }
> + }
> }
>
> err = dpdk_vhost_reconfigure_helper(dev);
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 018d644..e51795c 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -2669,6 +2669,19 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
> </p>
> </column>
>
> + <column name="options" key="dq-zero-copy"
> + type='{"type": "boolean"}'>
> + <p>
> + The value specifies whether or not to enable dequeue zero copy on
> + the given interface.
> + The port must be in an inactive state in order to enable or disable
> + this feature. If the device is active, the guest driver must be
> + reloaded or the VM rebooted for a change to take effect.
> + Only supported by dpdkvhostuserclient interfaces.
> + The feature is considered experimental.
> + </p>
> + </column>
> +
> <column name="options" key="n_rxq_desc"
> type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
> <p>
>