[ovs-dev] [PATCH v16 08/14] netdev-dpdk: support multi-segment jumbo frames.

Michal Obrembski michalx.obrembski at intel.com
Thu Sep 12 08:56:04 UTC 2019


From: Mark Kavanagh <mark.b.kavanagh at intel.com>

Currently, jumbo frame support for OvS-DPDK is implemented by
increasing the size of mbufs within a mempool, such that each mbuf
within the pool is large enough to contain an entire jumbo frame of
a user-defined size. Typically, for each user-defined MTU,
'requested_mtu', a new mempool is created, containing mbufs of size
~requested_mtu.

With the multi-segment approach, a port uses a single mempool,
(containing standard/default-sized mbufs of ~2k bytes), irrespective
of the user-requested MTU value. To accommodate jumbo frames, mbufs
are chained together, where each mbuf in the chain stores a portion of
the jumbo frame. Each mbuf in the chain is termed a segment, hence the
name.
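
To illustrate the segment concept, here is a minimal sketch (not part of this
patch; it only uses standard DPDK mbuf accessors) of how a consumer walks
such a chain:

    #include <rte_mbuf.h>

    /* Illustrative only: total the bytes stored across a chain's segments. */
    static uint32_t
    count_chain_bytes(const struct rte_mbuf *pkt)
    {
        const struct rte_mbuf *seg;
        uint32_t bytes = 0;

        for (seg = pkt; seg != NULL; seg = seg->next) {
            bytes += rte_pktmbuf_data_len(seg);  /* Bytes in this segment. */
        }
        /* For a well-formed chain, 'bytes' equals rte_pktmbuf_pkt_len(pkt)
         * and the loop runs pkt->nb_segs times. */
        return bytes;
    }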

== Enabling multi-segment mbufs ==
Multi-segment and single-segment mbufs are mutually exclusive, and the
user must decide which approach to adopt at init. The introduction
of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
is a global boolean value, which determines how jumbo frames are
represented across all DPDK ports. In the absence of a user-supplied
value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
mbufs must be explicitly enabled / single-segment mbufs remain the
default.

Setting the field is identical to setting existing DPDK-specific OVSDB
fields:

    ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
    ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
    ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
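
Once enabled, a jumbo frame is spread over several default-sized mbufs
rather than stored in one oversized mbuf. As a rough sketch of the scaling
(illustrative only; 'segs_per_frame' is not a helper introduced by this
patch, and an 18B Ethernet header + CRC overhead is assumed):

    #include <stdint.h>

    /* Illustrative only: segments needed per frame when each mbuf holds
     * roughly 'seg_size' bytes of packet data (~2048B by default). */
    static uint16_t
    segs_per_frame(uint32_t mtu, uint32_t seg_size)
    {
        uint32_t frame_len = mtu + 18;                 /* Assumed L2 + CRC. */

        return (frame_len + seg_size - 1) / seg_size;  /* Ceiling division. */
    }

    /* e.g. segs_per_frame(9000, 2048) == 5, matching the examples in
     * Documentation/topics/dpdk/memory.rst below. */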

Co-authored-by: Tiago Lam <tiago.lam at intel.com>

Signed-off-by: Mark Kavanagh <mark.b.kavanagh at intel.com>
Signed-off-by: Tiago Lam <tiago.lam at intel.com>
Acked-by: Eelco Chaudron <echaudro at redhat.com>
---
 Documentation/topics/dpdk/jumbo-frames.rst | 73 ++++++++++++++++++++++++++++++
 Documentation/topics/dpdk/memory.rst       | 36 +++++++++++++++
 NEWS                                       |  1 +
 lib/dpdk.c                                 |  8 ++++
 lib/netdev-dpdk.c                          | 68 +++++++++++++++++++++++++---
 lib/netdev-dpdk.h                          |  1 +
 vswitchd/vswitch.xml                       | 22 +++++++++
 7 files changed, 202 insertions(+), 7 deletions(-)
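
The copy cost called out under "Limitations" in jumbo-frames.rst below
amounts to linearising the chain whenever a consumer needs contiguous data;
a minimal sketch (illustrative only, not part of this patch):

    #include <rte_mbuf.h>

    /* Illustrative only: obtain a contiguous view of a possibly chained
     * packet. 'scratch' must hold at least rte_pktmbuf_pkt_len(pkt) bytes;
     * the returned pointer refers either to data in the first segment (if
     * already contiguous) or to 'scratch' after copying. */
    static const void *
    linearize(const struct rte_mbuf *pkt, void *scratch)
    {
        return rte_pktmbuf_read(pkt, 0, rte_pktmbuf_pkt_len(pkt), scratch);
    }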

diff --git a/Documentation/topics/dpdk/jumbo-frames.rst b/Documentation/topics/dpdk/jumbo-frames.rst
index 00360b4..9804bbb 100644
--- a/Documentation/topics/dpdk/jumbo-frames.rst
+++ b/Documentation/topics/dpdk/jumbo-frames.rst
@@ -71,3 +71,76 @@ Jumbo frame support has been validated against 9728B frames, which is the
 largest frame size supported by Fortville NIC using the DPDK i40e driver, but
 larger frames and other DPDK NIC drivers may be supported. These cases are
 common for use cases involving East-West traffic only.
+
+-------------------
+Multi-segment mbufs
+-------------------
+
+Instead of increasing the size of mbufs within a mempool, such that each mbuf
+within the pool is large enough to contain an entire jumbo frame of a
+user-defined size, mbufs can be chained together. In this approach each mbuf
+in the chain stores a portion of the jumbo frame, by default ~2K bytes,
+irrespective of the user-requested MTU value. Each mbuf in the chain is termed
+a segment, hence the name "multi-segment mbufs".
+
+This approach brings more flexibility in use cases where the maximum packet
+length is hard to guess. For example, when packets originate from sources
+marked for offload (such as TSO), each packet may be larger than the MTU, in
+which case a single mbuf may not be enough to hold all of the packet's data
+when forwarding it to a DPDK port.
+
+Multi-segment and single-segment mbufs are mutually exclusive, and the user
+must decide which approach to adopt on initialisation. Multi-segment mbufs
+can be enabled with the following command::
+
+    $ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
+
+Single-segment mbufs remain the default when using OvS-DPDK, and the above
+option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
+multi-segment mbufs are to be used.
+
+~~~~~~~~~~~~~~~~~
+Performance notes
+~~~~~~~~~~~~~~~~~
+
+When using multi-segment mbufs, some PMDs may not support vectorized Tx
+functions, due to the mbufs' non-contiguous nature. As a result this can hurt
+performance for smaller packet sizes. For example, on a setup sending 64B
+packets at line rate, a decrease of ~20% has been observed. The performance
+impact stops being noticeable for larger packet sizes, although the exact size
+will depend on each PMD and vary between architectures.
+
+Tests performed with the i40e PMD driver showed this limitation only for 64B
+packets, and the same rate was observed when comparing multi-segment and
+single-segment mbufs for 128B packets. In other words, the 20% drop in
+performance was not observed for packets >= 128B during this test case.
+
+Because of this, multi-segment mbufs are not advised for use with smaller
+packet sizes, such as 64B.
+
+Also, note that using multi-segment mbufs won't improve memory usage. For a
+packet of 9000B, for example, which would be stored on a single mbuf when using
+the single-segment approach, 5 mbufs (9000/2176) of 2176B would be needed to
+store the same data using the multi-segment mbufs approach (refer to
+:doc:`/topics/dpdk/memory` for examples).
+
+~~~~~~~~~~~
+Limitations
+~~~~~~~~~~~
+
+Because multi-segment mbufs store data non-contiguously in memory, a
+performance drop is expected when they are used across DPDK and non-DPDK
+ports, as the mbufs' content needs to be copied into a contiguous region in
+memory for operations such as write(). Exchanging traffic between DPDK ports
+(such as vhost and physical ports) doesn't have this limitation, however.
+
+Other operations may take a performance hit as well, under the current
+implementation. For example, operations that require a checksum to be computed
+on the data, such as pushing / popping a VXLAN header or passing packets
+through the userspace connection tracker, will also require a copy of the data
+(if it hasn't been copied already).
+
+Finally, it is assumed that, when multi-segment mbufs are enabled, a packet's
+headers fall within the first mbuf, which is 2K in size. This is required
+because, at the moment, the miniflow extraction and the setting of the layer
+headers (l2_5, l3, l4) assume contiguous access to memory.
diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
index 9ebfd11..7f414ef 100644
--- a/Documentation/topics/dpdk/memory.rst
+++ b/Documentation/topics/dpdk/memory.rst
@@ -82,6 +82,14 @@ Users should be aware of the following:
 Below are a number of examples of memory requirement calculations for both
 shared and per port memory models.
 
+.. note::
+
+   If multi-segment mbufs are enabled (:doc:`/topics/dpdk/jumbo-frames`),
+   both the **number of mbufs** and the **size of each mbuf** may be
+   adjusted, which may slightly change the amount of memory required for a
+   given mempool. Examples of how these calculations are performed are also
+   provided below, for the higher-MTU case of each memory model.
+
 Shared Memory Calculations
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -142,6 +150,20 @@ Example 4
  Mbuf size = 10176 Bytes
  Memory required = 262144 * 10176 = 2667 MB
 
+Example 5 (multi-segment mbufs enabled)
++++++++++++++++++++++++++++++++++++++++
+::
+
+ MTU = 9000 Bytes
+ Number of mbufs = 262144
+ Mbuf size = 2048 Bytes
+ Memory required = 262144 * (2048 * 5) = 2684 MB
+
+.. note::
+
+   In order to hold 9000B of data, 5 mbufs of 2048B each will be needed, hence
+   the "5" above in 2048 * 5.
+
 Per Port Memory Calculations
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -214,3 +236,17 @@ Example 3: (2 rxq, 2 PMD, 9000 MTU)
  Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
  Mbuf size = 10176 Bytes
  Memory required = 26656 * 10176 = 271 MB
+
+Example 4: (2 rxq, 2 PMD, 9000 MTU, multi-segment mbufs enabled)
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+::
+
+ MTU = 9000
+ Number of mbufs = (2 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 26656
+ Mbuf size = 2048 Bytes
+ Memory required = 26656 * (2048 * 5) = 273 MB
+
+.. note::
+
+   In order to hold 9000B of data, 5 mbufs of 2048B each will be needed, hence
+   the "5" above in 2048 * 5.
diff --git a/NEWS b/NEWS
index c5caa13..1278ada 100644
--- a/NEWS
+++ b/NEWS
@@ -164,6 +164,7 @@ v2.10.0 - 18 Aug 2018
      * Allow init to fail and record DPDK status/version in OVS database.
      * Add experimental flow hardware offload support
      * Support both shared and per port mempools for DPDK devices.
+     * Add support for multi-segment mbufs.
    - Userspace datapath:
      * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
      * Detailed PMD performance metrics available with new command
diff --git a/lib/dpdk.c b/lib/dpdk.c
index f31e158..2fa1630 100644
--- a/lib/dpdk.c
+++ b/lib/dpdk.c
@@ -456,6 +456,14 @@ dpdk_init__(const struct smap *ovs_other_config)
     /* Finally, register the dpdk classes */
     netdev_dpdk_register();
     netdev_register_flow_api_provider(&netdev_offload_dpdk);
+
+    bool multi_seg_mbufs_enable = smap_get_bool(ovs_other_config,
+            "dpdk-multi-seg-mbufs", false);
+    if (multi_seg_mbufs_enable) {
+        VLOG_INFO("DPDK multi-segment mbufs enabled");
+        netdev_dpdk_multi_segment_mbufs_enable();
+    }
+
     return true;
 }
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 6b66fc3..084a54e 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -71,6 +71,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
 
 VLOG_DEFINE_THIS_MODULE(netdev_dpdk);
 static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
+static bool dpdk_multi_segment_mbufs = false;
 
 #define DPDK_PORT_WATCHDOG_INTERVAL 5
 
@@ -497,6 +498,12 @@ is_dpdk_class(const struct netdev_class *class)
            || class->destruct == netdev_dpdk_vhost_destruct;
 }
 
+void
+netdev_dpdk_multi_segment_mbufs_enable(void)
+{
+    dpdk_multi_segment_mbufs = true;
+}
+
 /* DPDK NIC drivers allocate RX buffers at a particular granularity, typically
  * aligned at 1k or less. If a declared mbuf size is not a multiple of this
  * value, insufficient buffers are allocated to accomodate the packet in its
@@ -610,14 +617,17 @@ dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
     }
 }
 
-/* Calculating the required number of mbufs differs depending on the
- * mempool model being used. Check if per port memory is in use before
- * calculating.
- */
+/* Calculating the required number of mbufs differs depending on the mempool
+ * model (per port vs shared mempools) being used.
+ * In case multi-segment mbufs are being used, the number of mbufs is also
+ * increased, to account for the multiple mbufs needed to hold each packet's
+ * data. */
 static uint32_t
-dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
+dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size,
+                     bool per_port_mp)
 {
     uint32_t n_mbufs;
+    uint16_t max_frame_len = 0;
 
     if (!per_port_mp) {
         /* Shared memory are being used.
@@ -646,6 +656,22 @@ dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
                   + MIN_NB_MBUF;
     }
 
+    /* If multi-segment mbufs are used, we also increase the number of
+     * mbufs used. This is done by calculating how many mbufs are needed to
+     * hold the data of a single packet of MTU size. For example, for a
+     * received packet of 9000B, 5 mbufs (9000 / 2048) are needed to hold
+     * the data - 4 more than with single-segment mbufs (whose size is
+     * extended to hold the entire packet). */
+    max_frame_len = MTU_TO_MAX_FRAME_LEN(dev->requested_mtu);
+    if (dpdk_multi_segment_mbufs && mbuf_size < max_frame_len) {
+        uint16_t nb_segs = max_frame_len / mbuf_size;
+        if (max_frame_len % mbuf_size) {
+            nb_segs += 1;
+        }
+
+        n_mbufs *= nb_segs;
+    }
+
     return n_mbufs;
 }
 
@@ -674,8 +700,12 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
 
     /* Get the size of each mbuf, based on the MTU */
     mbuf_size = MTU_TO_FRAME_LEN(mtu);
+    /* Multi-segment mbufs: use the standard mbuf size. */
+    if (dpdk_multi_segment_mbufs) {
+        mbuf_size = dpdk_buf_size(ETHER_MTU);
+    }
 
-    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
+    n_mbufs = dpdk_calculate_mbufs(dev, mtu, mbuf_size, per_port_mp);
 
     do {
         /* Full DPDK memory pool name must be unique and cannot be
@@ -933,6 +963,7 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
     int diag = 0;
     int i;
     struct rte_eth_conf conf = port_conf;
+    struct rte_eth_txconf txconf;
     struct rte_eth_dev_info info;
     uint16_t conf_mtu;
 
@@ -948,6 +979,27 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
         }
     }
 
+    /* Multi-segment-mbuf-specific setup. */
+    if (dpdk_multi_segment_mbufs) {
+        if (info.tx_offload_capa & DEV_TX_OFFLOAD_MULTI_SEGS) {
+            /* DPDK PMDs typically attempt to use simple or vectorized
+             * transmit functions, neither of which are compatible with
+             * multi-segment mbufs. Ensure that these are disabled when
+             * multi-segment mbufs are enabled. */
+            conf.txmode.offloads |= DEV_TX_OFFLOAD_MULTI_SEGS;
+        } else {
+            VLOG_ERR("Interface %s doesn't support multi-segment mbufs",
+                     dev->up.name);
+            conf.txmode.offloads &= ~DEV_TX_OFFLOAD_MULTI_SEGS;
+
+            /* Fail interface init if OFFLOAD_MULTI_SEGS is not supported. */
+            return -ENOTSUP;
+        }
+
+        txconf = info.default_txconf;
+        txconf.offloads = conf.txmode.offloads;
+    }
+
     conf.intr_conf.lsc = dev->lsc_interrupt_mode;
 
     if (dev->hw_ol_features & NETDEV_RX_CHECKSUM_OFFLOAD) {
@@ -999,7 +1051,9 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
 
         for (i = 0; i < n_txq; i++) {
             diag = rte_eth_tx_queue_setup(dev->port_id, i, dev->txq_size,
-                                          dev->socket_id, NULL);
+                                          dev->socket_id,
+                                          dpdk_multi_segment_mbufs ? &txconf
+                                                                   : NULL);
             if (diag) {
                 VLOG_INFO("Interface %s unable to setup txq(%d): %s",
                           dev->up.name, i, rte_strerror(-diag));
diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
index 60631c4..2c2ca09 100644
--- a/lib/netdev-dpdk.h
+++ b/lib/netdev-dpdk.h
@@ -32,6 +32,7 @@ struct rte_flow_attr;
 struct rte_flow_item;
 struct rte_flow_action;
 
+void netdev_dpdk_multi_segment_mbufs_enable(void);
 void netdev_dpdk_register(void);
 void free_dpdk_buf(struct dp_packet *);
 
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index 9a743c0..c360d2c 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -408,6 +408,28 @@
         </p>
       </column>
 
+      <column name="other_config" key="dpdk-multi-seg-mbufs"
+              type='{"type": "boolean"}'>
+        <p>
+          Specifies if DPDK uses multi-segment mbufs for handling jumbo frames.
+        </p>
+        <p>
+          If true, DPDK allocates a single mempool per port, irrespective of
+          the ports' requested MTU sizes. The elements of this mempool are
+          'standard'-sized mbufs (typically ~2KB), which may be chained
+          together to accommodate jumbo frames. In this approach, each mbuf
+          typically stores a fragment of the overall jumbo frame.
+        </p>
+        <p>
+          If not specified, defaults to <code>false</code>, in which case the
+          size of each mbuf within a DPDK port's mempool will be grown to
+          accommodate jumbo frames within a single mbuf.
+        </p>
+        <p>
+          Changing this value requires restarting the daemon.
+        </p>
+      </column>
+
       <column name="other_config" key="vhost-sock-dir"
               type='{"type": "string"}'>
         <p>
-- 
2.7.4


