[ovs-dev] [PATCH v7] netdev-dpdk: Add support for vHost dequeue zero copy (experimental)

Ilya Maximets i.maximets at samsung.com
Tue Dec 26 11:05:34 UTC 2017


On 19.12.2017 19:44, Ciara Loftus wrote:
> Enabled per port like so:
> ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
> 
> If the device is already active at the point of enabling/disabling the
> feature via ovs-vsctl, one must reload the guest driver or reboot the VM
> in order for the feature to be successfully enabled/disabled.
> 
> When packets from a vHost device with zero copy enabled are destined for
> a single 'dpdk' port, the number of tx descriptors on that 'dpdk' port
> must be set to a smaller value. 128 is recommended. This can be achieved
> like so:
> 
> ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> 
> Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send
> to should not exceed 128. Due to this requirement, the feature is
> considered 'experimental'.
> 
> Signed-off-by: Ciara Loftus <ciara.loftus at intel.com>
> ---
> v7:
> * Remove support for zc for dpdkvhostuser ports & patch 1 in the series.
> * Remove entry from howto Documentation as the document in its current
> state is only relevant to dpdkvhostuser ports which do not support the
> feature.
> * Use ref: link to queue size documentation.
> * Elaborate on details of the NIC descriptors limitation, e.g. cases where
> packets are destined for >1 NICs.
> * Fix comment about padding in netdev-dpdk struct.
> 
>  Documentation/intro/install/dpdk.rst     |  2 +
>  Documentation/topics/dpdk/vhost-user.rst | 71 +++++++++++++++++++++++
>  NEWS                                     |  1 +
>  lib/netdev-dpdk.c                        | 97 ++++++++++++++++++++++++++++----
>  vswitchd/vswitch.xml                     | 13 +++++
>  5 files changed, 174 insertions(+), 10 deletions(-)
> 
> diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
> index 3fecb5c..087eb88 100644
> --- a/Documentation/intro/install/dpdk.rst
> +++ b/Documentation/intro/install/dpdk.rst
> @@ -518,6 +518,8 @@ The above command sets the number of rx queues for DPDK physical interface.
>  The rx queues are assigned to pmd threads on the same NUMA node in a
>  round-robin fashion.
>  
> +.. _dpdk-queues-sizes:
> +
>  DPDK Physical Port Queue Sizes
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  
> diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst
> index 8447e2d..d82ca40 100644
> --- a/Documentation/topics/dpdk/vhost-user.rst
> +++ b/Documentation/topics/dpdk/vhost-user.rst
> @@ -458,3 +458,74 @@ Sample XML
>      </domain>
>  
>  .. _QEMU documentation: http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;h=7890d7169;hb=HEAD
> +
> +vhost-user Dequeue Zero Copy (experimental)
> +-------------------------------------------
> +
> +Normally when dequeuing a packet from a vHost User device, a memcpy operation
> +must be used to copy that packet from guest address space to host address
> +space. This memcpy can be removed by enabling dequeue zero-copy like so::
> +
> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=true
> +
> +With this feature enabled, a reference (pointer) to the packet is passed to
> +the host, instead of a copy of the packet. Removing this memcpy can give a
> +performance improvement for some use cases, for example switching large packets
> +between different VMs. However, additional packet loss may be observed.
> +
> +Note that the feature is disabled by default and must be explicitly enabled
> +by using the command above.
> +
> +The feature cannot be enabled when the device is active. If the device is
> +active when enabling the feature via ovs-vsctl, a driver reload or reboot
> +is required for the changes to take effect.
> +
> +The same logic applies for disabling the feature - it must be disabled when
> +the device is inactive. To disable the feature::
> +
> +    $ ovs-vsctl set Interface dpdkvhostuserclient0 options:dq-zero-copy=false
> +
> +The feature is only available to dpdkvhostuserclient port types.
> +
> +A limitation exists whereby if packets from a vHost port with dq-zero-copy=true
> +are destined for a 'dpdk' type port, the number of tx descriptors (n_txq_desc)
> +for that port must be reduced to a smaller number, 128 being the recommended
> +value. This can be achieved by issuing the following command::
> +
> +    $ ovs-vsctl set Interface dpdkport options:n_txq_desc=128
> +
> +Note: The sum of the tx descriptors of all 'dpdk' ports the VM will send to
> +should not exceed 128. For example, in case of a bond over two physical ports
> +in balance-tcp mode, one must divide 128 by the number of links in the bond.
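
The divide-the-budget arithmetic above can be sketched with a toy helper (not part of the patch; the 128 total is the recommendation from this document):

```c
#include <assert.h>

/* Per-port tx descriptor budget when a VM's zero-copy traffic fans out
 * over several 'dpdk' ports, e.g. the links of a bond.  With two links
 * in balance-tcp mode, each port gets 128 / 2 = 64 descriptors. */
static int
per_port_txq_desc(int total_budget, int n_ports)
{
    return n_ports > 0 ? total_budget / n_ports : total_budget;
}
```

Each resulting value would then be applied with `ovs-vsctl set Interface <port> options:n_txq_desc=<value>` per bond link.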
> +
> +Refer to :ref:`dpdk-queues-sizes` for more information.
> +
> +The reason for this limitation is due to how the zero copy functionality is
> +implemented. The vHost device's 'tx used vring', a virtio structure used for
> +tracking used (i.e. sent) descriptors, will only be updated when the NIC frees
> +the corresponding mbuf. If we don't free the mbufs frequently enough, that
> +vring will be starved and packets will no longer be processed. One way to
> +ensure we don't encounter this scenario, is to configure n_txq_desc to a small
> +enough number such that the 'mbuf free threshold' for the NIC will be hit more
> +often and thus free mbufs more frequently. The value of 128 is suggested, but
> +values of 64 and 256 have been tested and verified to work too, with differing
> +performance characteristics. A value of 512 can be used too, if the virtio
> +queue size in the guest is increased to 1024 (available to configure in QEMU
> +versions v2.10 and greater). This value can be set like so::
> +
> +    $ qemu-system-x86_64 ... -chardev socket,id=char1,path=<sockpath>,server
> +      -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce
> +      -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,
> +      tx_queue_size=1024
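
The starvation mechanism described above can be sketched with a toy model (the numbers and the single free-threshold pass are simplifications of what a real PMD does; this is not code from the patch):

```c
#include <assert.h>

/* Toy model of the zero-copy starvation scenario: with dequeue zero
 * copy, a guest tx descriptor is only returned once the NIC driver
 * frees the corresponding mbuf, and the driver only frees mbufs when
 * its tx ring fills up to a free threshold.  If the NIC can hold more
 * in-flight mbufs than the guest ring has descriptors, the guest ring
 * drains first and packet processing stops. */
static int
vring_starves(int virtio_ring_size, int nic_txq_desc, int nic_free_thresh)
{
    /* Mbufs the NIC may hold before its first free pass. */
    int held_before_free = nic_txq_desc - nic_free_thresh;

    return held_before_free >= virtio_ring_size;
}
```

With the default 256-descriptor virtio tx ring, a large (e.g. 2048-entry) NIC tx ring starves the guest under this model, while 128 does not; likewise 512 is safe once the guest ring is raised to 1024.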
> +
> +Because of this limitation, this feature is considered 'experimental'.
> +
> +The feature currently does not fully work with QEMU >= v2.7 due to a bug in
> +DPDK which will be addressed in an upcoming release. The patch to fix this
> +issue can be found on
> +`Patchwork
> +<http://dpdk.org/dev/patchwork/patch/32198/>`__.
> +
> +Further information can be found in the
> +`DPDK documentation
> +<http://dpdk.readthedocs.io/en/v17.05/prog_guide/vhost_lib.html>`__.
> diff --git a/NEWS b/NEWS
> index 752e98f..1eaa32c 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -24,6 +24,7 @@ Post-v2.8.0
>     - DPDK:
>       * Add support for DPDK v17.11
>       * Add support for vHost IOMMU
> +     * Add support for vHost dequeue zero copy (experimental)
>  
>  v2.8.0 - 31 Aug 2017
>  --------------------
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 8f22264..4b82a06 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -378,7 +378,10 @@ struct netdev_dpdk {
>  
>          /* True if vHost device is 'up' and has been reconfigured at least once */
>          bool vhost_reconfigured;
> -        /* 3 pad bytes here. */
> +
> +        /* True if dq-zero-copy feature has successfully been enabled */
> +        bool dq_zc_enabled;
> +        /* 2 pad bytes here. */
>      );
>  
>      PADDED_MEMBERS(CACHE_LINE_SIZE,
> @@ -889,6 +892,7 @@ common_construct(struct netdev *netdev, dpdk_port_t port_no,
>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>      ovsrcu_index_init(&dev->vid, -1);
>      dev->vhost_reconfigured = false;
> +    dev->dq_zc_enabled = false;
>      dev->attached = false;
>  
>      ovsrcu_init(&dev->qos_conf, NULL);
> @@ -1371,6 +1375,27 @@ netdev_dpdk_ring_set_config(struct netdev *netdev, const struct smap *args,
>      return 0;
>  }
>  
> +static bool
> +dpdk_vhost_check_zero_copy_config(struct netdev_dpdk *dev,
> +                                  const struct smap *args)
> +{
> +    bool needs_reconfigure = false;
> +    bool zc_requested = smap_get_bool(args, "dq-zero-copy", false);
> +
> +    if (zc_requested &&
> +            !(dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
> +        dev->vhost_driver_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> +        needs_reconfigure = true;
> +    } else if (!zc_requested &&
> +            (dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)) {
> +        dev->vhost_driver_flags &= ~RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> +        needs_reconfigure = true;
> +    }
> +
> +    /* Only try to change ZC mode when device is down */
> +    return needs_reconfigure && (netdev_dpdk_get_vid(dev) == -1);
> +}
> +
>  static int
>  netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>                                      const struct smap *args,
> @@ -1378,15 +1403,23 @@ netdev_dpdk_vhost_client_set_config(struct netdev *netdev,
>  {
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>      const char *path;
> +    bool needs_reconfigure = false;
>  
>      ovs_mutex_lock(&dev->mutex);
>      if (!(dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT)) {
>          path = smap_get(args, "vhost-server-path");
>          if (path && strcmp(path, dev->vhost_id)) {
>              strcpy(dev->vhost_id, path);
> -            netdev_request_reconfigure(netdev);
> +            needs_reconfigure = true;
>          }
>      }
> +
> +    needs_reconfigure |= dpdk_vhost_check_zero_copy_config(dev, args);
> +
> +    if (needs_reconfigure) {
> +        netdev_request_reconfigure(netdev);
> +    }
> +
>      ovs_mutex_unlock(&dev->mutex);
>  
>      return 0;
> @@ -2719,6 +2752,22 @@ netdev_dpdk_txq_map_clear(struct netdev_dpdk *dev)
>      }
>  }
>  
> +static bool
> +vhost_zero_copy_status_changed(struct netdev_dpdk *dev, bool *enable_zc)
> +{
> +    if ((dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)
> +            && !dev->dq_zc_enabled) {
> +        *enable_zc = true;
> +        return true;
> +    } else if (!(dev->vhost_driver_flags & RTE_VHOST_USER_DEQUEUE_ZERO_COPY)
> +            && dev->dq_zc_enabled) {
> +        *enable_zc = false;
> +        return true;
> +    }
> +
> +    return false;
> +}
> +
>  /*
>   * Remove a virtio-net device from the specific vhost port.  Use dev->remove
>   * flag to stop any more packets from being sent or received to/from a VM and
> @@ -2764,6 +2813,7 @@ destroy_device(int vid)
>           */
>          ovsrcu_quiesce_start();
>          VLOG_INFO("vHost Device '%s' has been removed", ifname);
> +        netdev_request_reconfigure(&dev->up);
>      } else {
>          VLOG_INFO("vHost Device '%s' not found", ifname);
>      }
> @@ -3278,24 +3328,39 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>  {
>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>      int err;
> -    uint64_t vhost_flags = 0;
> +    uint64_t vhost_flags = dev->vhost_driver_flags;
> +    bool enable_zc = false;
> +    bool zc_status_changed = false;
>  
>      ovs_mutex_lock(&dev->mutex);
>  
> -    /* Configure vHost client mode if requested and if the following criteria
> -     * are met:
> -     *  1. Device hasn't been registered yet.
> -     *  2. A path has been specified.
> +    zc_status_changed = vhost_zero_copy_status_changed(dev, &enable_zc);
> +
> +    /* Reconfigure the device if either:
> +     *  1. Device hasn't been registered and a path has been specified.
> +     *  2. Requested zero copy mode has changed from current configuration.
>       */
> -    if (!(dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT)
> -            && strlen(dev->vhost_id)) {
> -        /* Register client-mode device. */
> +    if ((!(dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT)
> +           && strlen(dev->vhost_id))
> +           || zc_status_changed) {
> +
>          vhost_flags |= RTE_VHOST_USER_CLIENT;
>  
>          /* Enable IOMMU support, if explicitly requested. */
>          if (dpdk_vhost_iommu_enabled()) {
>              vhost_flags |= RTE_VHOST_USER_IOMMU_SUPPORT;
>          }
> +
> +        /* Device has already been registered. Need to unregister */
> +        if (dev->vhost_driver_flags & RTE_VHOST_USER_CLIENT) {
> +            err = rte_vhost_driver_unregister(dev->vhost_id);

This causes a deadlock if the VM is started between setting "dq-zero-copy=true" and the
actual reconfiguration of the netdev, but before the "new_device" callback fires.

This can easily be reproduced by inserting a sufficiently long sleep before taking the mutex
in 'netdev_dpdk_vhost_client_reconfigure' (for example, sleep(10)).
With that change, running the following command will cause an abort on double locking of
'dev->mutex':
    ovs-vsctl --no-wait set interface vhost1 options:dq-zero-copy=true && \
        sleep 2 && ./run_vm.sh

2017-12-26T07:36:43Z|00169|netdev_dpdk|INFO|vHost Device './vhost1' has been added on numa node 0
2017-12-26T07:36:43Z|00170|dpdk|INFO|VHOST_CONFIG: read message VHOST_USER_SET_VRING_CALL
2017-12-26T07:36:43Z|00171|dpdk|INFO|VHOST_CONFIG: vring call idx:7 file:89
<...>
2017-12-26T07:36:47Z|00221|dpdk|INFO|VHOST_CONFIG: free connfd = 67 for device './vhost1'
ovs-vswitchd: lib/netdev-dpdk.c:2790: pthread_mutex_lock failed (Resource deadlock avoided)

Program received signal SIGABRT, Aborted.
0x0000007fb7a79340 in raise () from /lib64/libc.so.6
(gdb) bt
#0  raise () from /lib64/libc.so.6
#1  abort () from /lib64/libc.so.6
#2  ovs_abort_valist (err_no=0, format=0xbe81 , args=...) at lib/util.c:347
#3  ovs_abort (err_no=0, format=0xbe81) at lib/util.c:339
#4  ovs_mutex_lock_at (l_=<optimized out>, where=0x91154a "lib/netdev-dpdk.c:2790") at lib/ovs-thread.c:76
#5  destroy_device (vid=<optimized out>) at lib/netdev-dpdk.c:2790
#6  vhost_destroy_device (vid=0) at lib/librte_vhost/vhost.c:335
#7  rte_vhost_driver_unregister (path=0x7f4a41cf80 "./vhost1") at lib/librte_vhost/socket.c:771
#8  netdev_dpdk_vhost_client_reconfigure (netdev=<optimized out>) at lib/netdev-dpdk.c:3362
#9  port_reconfigure (port=<optimized out>) at lib/dpif-netdev.c:3276
#10 reconfigure_datapath (dp=0xd6e420) at lib/dpif-netdev.c:3757
#11 dpif_netdev_run (dpif=<optimized out>) at lib/dpif-netdev.c:3895
#12 type_run (type=<optimized out>) at ofproto/ofproto-dpif.c:335
#13 ofproto_type_run (datapath_type=0xd37580 "netdev") at ofproto/ofproto.c:1702
#14 bridge_run__ () at vswitchd/bridge.c:2913
#15 bridge_reconfigure (ovs_cfg=<optimized out>) at vswitchd/bridge.c:719
#16 bridge_run () at vswitchd/bridge.c:2998
#17 main (argc=<optimized out>, argv=<optimized out>) at vswitchd/ovs-vswitchd.c:119

A few more complex deadlocks are possible as well, due to the race condition between VM
start and unregistering of the vhost device.
Right now I see no way to safely call 'rte_vhost_driver_unregister' while holding the lock.
See commit 3f891bbea61d ("netdev-dpdk: Fix deadlock in destroy_device().") for similar
issue with 'netdev_dpdk_vhost_destruct()'.


Best regards, Ilya Maximets.


> +
> +            if (err) {
> +                VLOG_ERR("Error unregistering vHost socket %s", dev->vhost_id);
> +                goto unlock;
> +            }
> +        }
> +
>          err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
>          if (err) {
>              VLOG_ERR("vhost-user device setup failure for device %s\n",
> @@ -3333,6 +3398,18 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
>                       "client port: %s\n", dev->up.name);
>              goto unlock;
>          }
> +
> +        if (zc_status_changed) {
> +            if (enable_zc) {
> +                 dev->dq_zc_enabled = true;
> +                 VLOG_INFO("Zero copy enabled for vHost port %s",
> +                           dev->up.name);
> +            } else {
> +                 dev->dq_zc_enabled = false;
> +                 VLOG_INFO("Zero copy disabled for vHost port %s",
> +                           dev->up.name);
> +            }
> +        }
>      }
>  
>      err = dpdk_vhost_reconfigure_helper(dev);
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 018d644..e51795c 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -2669,6 +2669,19 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \
>          </p>
>        </column>
>  
> +      <column name="options" key="dq-zero-copy"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          The value specifies whether or not to enable dequeue zero copy on
> +          the given interface.
> +          The port must be in an inactive state in order to enable or disable
> +          this feature. One must reload the guest driver or reboot the VM if
> +          they wish to enable/disable the feature when the device is active.
> +          Only supported by dpdkvhostuserclient interfaces.
> +          The feature is considered experimental.
> +        </p>
> +      </column>
> +
>        <column name="options" key="n_rxq_desc"
>                type='{"type": "integer", "minInteger": 1, "maxInteger": 4096}'>
>          <p>
> 

