[ovs-dev] [PATCH v9 11/14] netdev-dpdk: support multi-segment jumbo frames

Tiago Lam tiago.lam at intel.com
Fri Aug 24 16:24:27 UTC 2018


From: Mark Kavanagh <mark.b.kavanagh at intel.com>

Currently, jumbo frame support for OvS-DPDK is implemented by
increasing the size of mbufs within a mempool, such that each mbuf
within the pool is large enough to contain an entire jumbo frame of
a user-defined size. Typically, for each user-defined MTU,
'requested_mtu', a new mempool is created, containing mbufs of size
~requested_mtu.

With the multi-segment approach, a port uses a single mempool,
(containing standard/default-sized mbufs of ~2k bytes), irrespective
of the user-requested MTU value. To accommodate jumbo frames, mbufs
are chained together, where each mbuf in the chain stores a portion of
the jumbo frame. Each mbuf in the chain is termed a segment, hence the
name.
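
As a rough illustration (not part of this patch), the chaining arithmetic
amounts to rounding the frame length up to a whole number of standard-sized
mbufs; the 2176B figure below matches the mbuf size used in the documentation
added by this patch:

    #include <stdio.h>

    /* Illustrative constant only: a 'standard' mbuf of 2176B, i.e. a 2048B
     * data room plus 128B of headroom. */
    #define EXAMPLE_MBUF_SIZE 2176

    int
    main(void)
    {
        unsigned int data_len = 9000;  /* A jumbo frame's worth of data. */
        unsigned int nb_segs = data_len / EXAMPLE_MBUF_SIZE;

        if (data_len % EXAMPLE_MBUF_SIZE) {
            nb_segs += 1;  /* Round up: a partial segment needs an mbuf too. */
        }

        printf("%uB of data -> %u chained mbufs of %uB each\n",
               data_len, nb_segs, EXAMPLE_MBUF_SIZE);
        return 0;
    }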

== Enabling multi-segment mbufs ==
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide which approach to adopt at init time. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

    ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
    ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
    ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true

Co-authored-by: Tiago Lam <tiago.lam at intel.com>

Signed-off-by: Mark Kavanagh <mark.b.kavanagh at intel.com>
Signed-off-by: Tiago Lam <tiago.lam at intel.com>
Acked-by: Eelco Chaudron <echaudro at redhat.com>
---
 Documentation/topics/dpdk/jumbo-frames.rst | 67 ++++++++++++++++++++++++++++++
 Documentation/topics/dpdk/memory.rst       | 36 ++++++++++++++++
 NEWS                                       |  1 +
 lib/dpdk.c                                 |  8 ++++
 lib/netdev-dpdk.c                          | 66 +++++++++++++++++++++++++----
 lib/netdev-dpdk.h                          |  2 +
 vswitchd/vswitch.xml                       | 22 ++++++++++
 7 files changed, 194 insertions(+), 8 deletions(-)

diff --git a/Documentation/topics/dpdk/jumbo-frames.rst b/Documentation/topics/dpdk/jumbo-frames.rst
index 00360b4..07bf3ca 100644
--- a/Documentation/topics/dpdk/jumbo-frames.rst
+++ b/Documentation/topics/dpdk/jumbo-frames.rst
@@ -71,3 +71,70 @@ Jumbo frame support has been validated against 9728B frames, which is the
 largest frame size supported by Fortville NIC using the DPDK i40e driver, but
 larger frames and other DPDK NIC drivers may be supported. These cases are
 common for use cases involving East-West traffic only.
+
+-------------------
+Multi-segment mbufs
+-------------------
+
+Instead of increasing the size of mbufs within a mempool, such that each mbuf
+within the pool is large enough to contain an entire jumbo frame of a
+user-defined size, mbufs can be chained together. In this approach each mbuf
+in the chain stores a portion of the jumbo frame, by default ~2K bytes,
+irrespective of the user-requested MTU value. Since each mbuf in the chain is
+termed a segment, this approach is named "multi-segment mbufs".
+
+This approach brings more flexibility in use cases where the maximum packet
+length is hard to predict. For example, when packets originate from sources
+marked for offload (such as TSO), each packet may be larger than the MTU, in
+which case a single mbuf may not be enough to hold all of the packet's data
+when forwarding it to a DPDK port.
+
+Multi-segment and single-segment mbufs are mutually exclusive, and the user
+must decide which approach to adopt at initialisation. Multi-segment mbufs
+can be enabled with the following command::
+
+    $ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
+
+Single-segment mbufs remain the default when using OvS-DPDK; the
+`dpdk-multi-seg-mbufs` option above must be explicitly set to `true` if
+multi-segment mbufs are to be used.
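+
+As a quick sanity check, ovs-vswitchd logs an informational message at
+start-up when the option takes effect. Assuming the default log location
+(which may differ depending on how OvS was started), this can be confirmed
+with::
+
+    $ grep "DPDK multi-segment mbufs enabled" \
+        /var/log/openvswitch/ovs-vswitchd.log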
+
+~~~~~~~~~~~~~~~~~
+Performance notes
+~~~~~~~~~~~~~~~~~
+
+When using multi-segment mbufs some PMDs may not support vectorized Tx
+functions, due to the non-contiguous nature of the data. As a result this can
+hurt performance for smaller packet sizes. For example, on a setup sending 64B
+packets at line rate, a decrease of ~20% has been observed. The performance
+impact stops being noticeable for larger packet sizes, although the exact size
+at which this happens varies between PMDs and depends on the architecture in
+use.
+
+Tests performed with the i40e PMD driver showed this limitation only for 64B
+packets, and the same rate was observed when comparing multi-segment and
+single-segment mbufs for 128B packets. In other words, the 20% drop in
+performance was not observed for packets >= 128B during this test case.
+
+Because of this, multi-segment mbufs are not advised for use with smaller
+packet sizes, such as 64B.
+
+Also, note that using multi-segment mbufs won't improve memory usage. For
+example, a packet of 9000B, which would be stored in a single mbuf when using
+the single-segment approach, needs 5 mbufs of 2176B each (9000/2176, rounded
+up) to store the same data using the multi-segment approach (refer to
+:doc:`/topics/dpdk/memory` for examples).
+
+~~~~~~~~~~~
+Limitations
+~~~~~~~~~~~
+
+Because multi-segment mbufs store data non-contiguously in memory, a
+performance drop is expected when they are used across DPDK and non-DPDK
+ports, as the mbufs' content needs to be copied into a contiguous region in
+memory before it can be used by operations such as write(). Exchanging traffic
+between DPDK ports (such as vhost and physical ports) doesn't have this
+limitation, however.
+
+Other operations may also take a performance hit under the current
+implementation. For example, operations that require a checksum to be computed
+over the data, such as pushing / popping a VXLAN header, also require a copy
+of the data (if it hasn't been copied already).
diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
index e5fb166..d8a952a 100644
--- a/Documentation/topics/dpdk/memory.rst
+++ b/Documentation/topics/dpdk/memory.rst
@@ -82,6 +82,14 @@ Users should be aware of the following:
 Below are a number of examples of memory requirement calculations for both
 shared and per port memory models.
 
+.. note::
+
+   If multi-segment mbufs are enabled (:doc:`/topics/dpdk/jumbo-frames`), both
+   the **number of mbufs** and the **size of each mbuf** might be adjusted,
+   which might slightly change the amount of memory required for a given
+   mempool. Examples of how these calculations are performed are also provided
+   below, for the higher MTU case of each memory model.
+
 Shared Memory Calculations
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -142,6 +150,20 @@ Example 4
  Mbuf size = 10176 Bytes
  Memory required = 262144 * 10176 = 2667 MB
 
+Example 5 (multi-segment mbufs enabled)
++++++++++++++++++++++++++++++++++++++++
+::
+
+ MTU = 9000 Bytes
+ Number of mbufs = 262144
+ Mbuf size = 2176 Bytes
+ Memory required = 262144 * (2176 * 5) = 2852 MB
+
+.. note::
+
+   In order to hold 9000B of data, 5 mbufs of 2176B each will be needed, hence
+   the "5" above in 2176 * 5.
+
 Per Port Memory Calculations
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -214,3 +236,17 @@ Example 3: (2 rxq, 2 PMD, 9000 MTU)
  Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
  Mbuf size = 10176 Bytes
  Memory required = 26656 * 10176 = 271 MB
+
+Example 4: (2 rxq, 2 PMD, 9000 MTU, multi-segment mbufs enabled)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+::
+
+ MTU = 9000
+ Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
+ Mbuf size = 2176 Bytes
+ Memory required = 26656 * (2176 * 5) = 290 MB
+
+.. note::
+
+   In order to hold 9000B of data, 5 mbufs of 2176B each will be needed, hence
+   the "5" above in 2176 * 5.
diff --git a/NEWS b/NEWS
index 8077c9e..c0e4c45 100644
--- a/NEWS
+++ b/NEWS
@@ -53,6 +53,7 @@ v2.10.0 - xx xxx xxxx
      * Allow init to fail and record DPDK status/version in OVS database.
      * Add experimental flow hardware offload support
      * Support both shared and per port mempools for DPDK devices.
+     * Add support for multi-segment mbufs.
    - Userspace datapath:
      * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
      * Detailed PMD performance metrics available with new command
diff --git a/lib/dpdk.c b/lib/dpdk.c
index 0ee3e19..ac89fd8 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -497,6 +497,14 @@ dpdk_init__(const struct smap *ovs_other_config)
 
     /* Finally, register the dpdk classes */
     netdev_dpdk_register();
+
+    bool multi_seg_mbufs_enable = smap_get_bool(ovs_other_config,
+            "dpdk-multi-seg-mbufs", false);
+    if (multi_seg_mbufs_enable) {
+        VLOG_INFO("DPDK multi-segment mbufs enabled\n");
+        netdev_dpdk_multi_segment_mbufs_enable();
+    }
+
     return true;
 }
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 5ab1af3..d4a2b0c 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -70,6 +70,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
 
 VLOG_DEFINE_THIS_MODULE(netdev_dpdk);
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+static bool dpdk_multi_segment_mbufs = false;
 
 #define DPDK_PORT_WATCHDOG_INTERVAL 5
 
@@ -521,6 +522,18 @@ is_dpdk_class(const struct netdev_class *class)
            || class->destruct == netdev_dpdk_vhost_destruct;
 }
 
+bool
+netdev_dpdk_is_multi_segment_mbufs_enabled(void)
+{
+    return dpdk_multi_segment_mbufs;
+}
+
+void
+netdev_dpdk_multi_segment_mbufs_enable(void)
+{
+    dpdk_multi_segment_mbufs = true;
+}
+
 /* DPDK NIC drivers allocate RX buffers at a particular granularity, typically
  * aligned at 1k or less. If a declared mbuf size is not a multiple of this
  * value, insufficient buffers are allocated to accomodate the packet in its
@@ -636,14 +649,17 @@ dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
     }
 }
 
-/* Calculating the required number of mbufs differs depending on the
- * mempool model being used. Check if per port memory is in use before
- * calculating.
- */
+/* Calculating the required number of mbufs differs depending on the mempool
+ * model (per port vs shared mempools) being used.
+ * In case multi-segment mbufs are being used, the number of mbufs is also
+ * increased, to account for the multiple mbufs needed to hold each packet's
+ * data. */
 static uint32_t
-dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
+dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size,
+                     bool per_port_mp)
 {
     uint32_t n_mbufs;
+    uint16_t max_frame_len = 0;
 
     if (!per_port_mp) {
         /* Shared memory are being used.
@@ -672,6 +688,22 @@ dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
                   + MIN_NB_MBUF;
     }
 
+    /* If multi-segment mbufs are used, the number of mbufs is increased
+     * accordingly. This is done by calculating how many mbufs are needed to
+     * hold the data of a single packet of MTU size. For example, for a
+     * received packet of 9000B, 5 mbufs (9000 / 2048, rounded up) are needed
+     * to hold the data - 4 more than with the single-segment approach, where
+     * the mbuf size is extended to hold all of the data. */
+    max_frame_len = MTU_TO_MAX_FRAME_LEN(dev->requested_mtu);
+    if (dpdk_multi_segment_mbufs && mbuf_size < max_frame_len) {
+        uint16_t nb_segs = max_frame_len / mbuf_size;
+        if (max_frame_len % mbuf_size) {
+            nb_segs += 1;
+        }
+
+        n_mbufs *= nb_segs;
+    }
+
     return n_mbufs;
 }
 
@@ -700,8 +732,12 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 
     /* Get the size of each mbuf, based on the MTU */
     mbuf_size = dpdk_buf_size(dev->requested_mtu);
+    /* multi-segment mbufs - use standard mbuf size */
+    if (dpdk_multi_segment_mbufs) {
+        mbuf_size = dpdk_buf_size(ETHER_MTU);
+    }
 
-    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
+    n_mbufs = dpdk_calculate_mbufs(dev, mtu, mbuf_size, per_port_mp);
 
     do {
         /* Full DPDK memory pool name must be unique and cannot be
@@ -959,6 +995,7 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
     int diag = 0;
     int i;
     struct rte_eth_conf conf = port_conf;
+    struct rte_eth_txconf txconf;
     struct rte_eth_dev_info info;
     uint16_t conf_mtu;
 
@@ -975,6 +1012,18 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
         }
     }
 
+    /* Multi-segment-mbuf-specific setup. */
+    if (dpdk_multi_segment_mbufs) {
+        /* DPDK PMDs typically attempt to use simple or vectorized
+         * transmit functions, neither of which are compatible with
+         * multi-segment mbufs. Ensure that these are disabled when
+         * multi-segment mbufs are enabled.
+         */
+        rte_eth_dev_info_get(dev->port_id, &info);
+        txconf = info.default_txconf;
+        txconf.txq_flags &= ~ETH_TXQ_FLAGS_NOMULTSEGS;
+    }
+
     conf.intr_conf.lsc = dev->lsc_interrupt_mode;
     conf.rxmode.hw_ip_checksum = (dev->hw_ol_features &
                                   NETDEV_RX_CHECKSUM_OFFLOAD) != 0;
@@ -1019,7 +1068,9 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
 
         for (i = 0; i < n_txq; i++) {
             diag = rte_eth_tx_queue_setup(dev->port_id, i, dev->txq_size,
-                                          dev->socket_id, NULL);
+                                          dev->socket_id,
+                                          dpdk_multi_segment_mbufs ? &txconf
+                                                                   : NULL);
             if (diag) {
                 VLOG_INFO("Interface %s unable to setup txq(%d): %s",
                           dev->up.name, i, rte_strerror(-diag));
@@ -4108,7 +4159,6 @@ unlock:
     return err;
 }
 
-
 /* Find rte_flow with @ufid */
 static struct rte_flow *
 ufid_to_rte_flow_find(const ovs_u128 *ufid) {
diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
index b7d02a7..19aa5c6 100644
--- a/lib/netdev-dpdk.h
+++ b/lib/netdev-dpdk.h
@@ -25,6 +25,8 @@ struct dp_packet;
 
 #ifdef DPDK_NETDEV
 
+bool netdev_dpdk_is_multi_segment_mbufs_enabled(void);
+void netdev_dpdk_multi_segment_mbufs_enable(void);
 void netdev_dpdk_register(void);
 void free_dpdk_buf(struct dp_packet *);
 
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 0cd8520..253cfc9 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -338,6 +338,28 @@
         </p>
       </column>
 
+      <column name="other_config" key="dpdk-multi-seg-mbufs"
+              type='{"type": "boolean"}'>
+        <p>
+          Specifies if DPDK uses multi-segment mbufs for handling jumbo frames.
+        </p>
+        <p>
+          If true, DPDK allocates a single mempool per port, irrespective of
+          the ports' requested MTU sizes. The elements of this mempool are
+          'standard'-sized mbufs (typically ~2KB), which may be chained
+          together to accommodate jumbo frames. In this approach, each mbuf
+          typically stores a fragment of the overall jumbo frame.
+        </p>
+        <p>
+          If not specified, defaults to <code>false</code>, in which case the
+          size of each mbuf within a DPDK port's mempool will be grown to
+          accommodate jumbo frames within a single mbuf.
+        </p>
+        <p>
+          Changing this value requires restarting the daemon.
+        </p>
+      </column>
+
       <column name="other_config" key="vhost-sock-dir"
               type='{"type": "string"}'>
         <p>
-- 
2.7.4


