[ovs-dev] [PATCH] netdev-dpdk: Add Jumbo Frame Support.

Mark Kavanagh mark.b.kavanagh at intel.com
Wed Nov 11 15:06:02 UTC 2015


Add support for Jumbo Frames to DPDK-enabled port types,
using single-segment mbufs.

Using this approach, the amount of memory allocated for each mbuf
to store frame data is increased to a value greater than 1518B
(typical Ethernet maximum frame length). The increased space
available in the mbuf means that an entire Jumbo Frame can be carried
in a single mbuf, as opposed to partitioning it across multiple mbuf
segments.

The amount of space allocated to each mbuf to hold frame data is
defined by the user at compile time; if this frame length is not a
multiple of the DPDK NIC driver's minimum Rx buffer length, it is
rounded up to the nearest such multiple.
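
For example, with the igb_uio driver's 1024B minimum Rx buffer size, a
requested frame length of 9018B (i.e. a 9000B MTU) would be rounded up to
9216B (9 x 1024B).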

Signed-off-by: Mark Kavanagh <mark.b.kavanagh at intel.com>
---
 INSTALL.DPDK.md   |   67 ++++++++++++++++++++-
 lib/netdev-dpdk.c |  176 ++++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 207 insertions(+), 36 deletions(-)

diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
index 96b686c..9a30f88 100644
--- a/INSTALL.DPDK.md
+++ b/INSTALL.DPDK.md
@@ -859,10 +859,70 @@ by adding the following string:
 to <interface> sections of all network devices used by DPDK. Parameter 'N'
 determines how many queues can be used by the guest.
 
+
+Jumbo Frames
+------------
+
+Support for Jumbo Frames may be enabled at compile-time for DPDK-type ports.
+Note that, if enabled, the mbuf segment size for all DPDK ports is increased to
+accommodate a full Jumbo Frame inside a single mbuf segment; this size is fixed
+at compile time and cannot be changed at runtime. If non-datapath ports are
+added to a bridge, their MTU does not affect that of the DPDK ports; this is in
+keeping with the current functionality of DPDK-enabled ports.
+
+To enable Jumbo Frame support, the following source code modifications are
+required in `lib/netdev-dpdk.c`:
+
+  1. Uncomment the following line to enable Jumbo Frame support:
+
+     ```
+     #define NETDEV_DPDK_JUMBO
+     ```
+
+  2. Adjust the value of `NETDEV_DPDK_MAX_FRAME_LEN` to the required Jumbo
+     Frame size. Consult the datasheet for the NIC in use to determine the
+     maximum frame size supported by your hardware. Also take into account that
+     the DPDK NIC driver allocates Rx buffers at a particular granularity
+     (currently 1024B, i.e. NETDEV_DPDK_DEFAULT_RX_BUFSIZE, for both the
+     `igb_uio` and `i40e` drivers). Consequently, the value assigned to
+     NETDEV_DPDK_MAX_FRAME_LEN at compile time should be a multiple of the
+     driver's buffer size; if it is not, the value used to configure the 'dpdk'
+     ports is rounded up to the next compatible value (see the example below).
+     Jumbo Frame support has been validated against 13312B frames, using the
+     DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
+     theoretically be supported.
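+
+     For example, a hypothetical deployment targeting a 9000B MTU on the
+     `igb_uio` driver (i.e. a 9018B frame) could set:
+
+     ```
+     #define NETDEV_DPDK_MAX_FRAME_LEN    9216
+     ```
+
+     where 9216B is 9018B rounded up to the next multiple of the driver's
+     1024B Rx buffer size; specifying 9018 directly would simply be rounded
+     up to the same value when the port is configured.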
+
+NOTE: The use of Jumbo Frames may affect throughput of lower-sized packets; if
+throughput for small-packet workloads is critical, then do not enable this
+feature.
+
+vHost Ports and Jumbo Frames
+----------------------------
+vHost ports require additional configuration to enable Jumbo Frame support.
+
+  1. `mergeable buffers` must be enabled for all vHost port types,
+      as demonstrated in the QEMU command line snippet, below:
+
+      ```
+      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
+      '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
+      ```
+
+  2. Guests utilizing vHost ports with `virtio-net` backend (as opposed to
+     `virtio-pmd`) must also increase the MTU of their network interfaces,
+     to avoid segmentation of Jumbo Frames in the guest. Note that 'MTU' refers
+     to the length of the IP packet only, and not that of the entire frame. To
+     calculate the exact MTU, subtract the L2 header and trailer lengths
+     (i.e. 18B) from the max supported frame size.
+     e.g. to set the MTU for a 13312B Jumbo Frame:
+
+      ```
+      ifconfig eth1 mtu 13294
+      ```
+
+
 Restrictions:
 -------------
 
-  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
   - Currently DPDK port does not make use any offload functionality.
   - DPDK-vHost support works with 1G huge pages.
 
@@ -903,6 +963,11 @@ Restrictions:
     the next release of DPDK (which includes the above patch) is available and
     integrated into OVS.
 
+  Jumbo Frames:
+  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The source of
+    this issue is currently being investigated.
+
+
 Bug Reporting:
 --------------
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index 4658416..c835303 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -62,20 +62,30 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
 #define OVS_VPORT_DPDK "ovs_dpdk"
 
+/* Uncomment to enable Jumbo Frame support */
+/* #define NETDEV_DPDK_JUMBO */
+
+#define NETDEV_DPDK_JUMBO_DISABLE      0
+#define NETDEV_DPDK_JUMBO_ENABLE       1
+#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE 1024
+
 /*
  * need to reserve tons of extra space in the mbufs so we can align the
  * DMA addresses to 4KB.
  * The minimum mbuf size is limited to avoid scatter behaviour and drop in
  * performance for standard Ethernet MTU.
  */
-#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
-#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
-                              + sizeof(struct dp_packet) \
-                              + RTE_PKTMBUF_HEADROOM)
-#define MBUF_SIZE_DRIVER     (2048                       \
-                              + sizeof (struct rte_mbuf) \
-                              + RTE_PKTMBUF_HEADROOM)
-#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
+#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
+#define FRAME_LEN_TO_MTU(frame_len) ((frame_len) - ETHER_HDR_LEN - ETHER_CRC_LEN)
+#define MBUF_SEGMENT_SIZE(mtu)      (MTU_TO_FRAME_LEN(mtu)       \
+                                     + sizeof(struct dp_packet)  \
+                                     + RTE_PKTMBUF_HEADROOM)
+/* This value should be specified as a multiple of the DPDK NIC driver's
+ * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio'). If the value
+ * specified is not such a multiple, the value used to configure the netdev
+ * will be rounded up to the next compatible value, via the
+ * 'dpdk_frame_len' function; in that case, this value will be ignored. */
+#define NETDEV_DPDK_MAX_FRAME_LEN    13312
 
 /* Max and min number of packets in the mempool.  OVS tries to allocate a
  * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
@@ -114,7 +124,13 @@ static const struct rte_eth_conf port_conf = {
         .header_split   = 0, /* Header Split disabled */
         .hw_ip_checksum = 0, /* IP checksum offload disabled */
         .hw_vlan_filter = 0, /* VLAN filtering disabled */
-        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
+#ifdef NETDEV_DPDK_JUMBO
+        .jumbo_frame    = NETDEV_DPDK_JUMBO_ENABLE, /* Jumbo Frames enabled */
+        .max_rx_pkt_len = UINT32_MAX, /* Set later in a copy of this struct,
+                                       * based on the netdev's MTU. */
+#else
+        .jumbo_frame    = NETDEV_DPDK_JUMBO_DISABLE, /* Jumbo Frames disabled */
+#endif
         .hw_strip_crc   = 0,
     },
     .rx_adv_conf = {
@@ -254,6 +270,43 @@ is_dpdk_class(const struct netdev_class *class)
     return class->construct == netdev_dpdk_construct;
 }
 
+/* DPDK NIC drivers allocate Rx buffers at a particular granularity
+ * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for igb_uio).
+ * If 'frame_len' is not a multiple of this value, insufficient
+ * buffers will be allocated to accommodate the packet in its entirety.
+ * Return the smallest multiple of the driver's 'min_rx_bufsize' that is
+ * greater than or equal to 'frame_len', which enables the driver to receive
+ * the entire packet.
+ */
+static uint32_t
+dpdk_frame_len(struct netdev_dpdk *netdev, int frame_len)
+{
+    struct rte_eth_dev_info info;
+    uint32_t buf_size;
+    int len = 0;
+
+    /* All VHost ports currently use '-1' as their port_id. */
+    if (netdev->type != DPDK_DEV_VHOST) {
+        rte_eth_dev_info_get(netdev->port_id, &info);
+        buf_size = info.min_rx_bufsize;
+    } else {
+        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
+    }
+
+    if (frame_len % buf_size != 0) {
+        len = buf_size * ((frame_len / buf_size) + 1);
+#ifdef NETDEV_DPDK_JUMBO
+        VLOG_WARN("User-specified frame length %d is not a multiple of the "
+                  "DPDK NIC driver's minimum Rx buffer length, and will be "
+                  "increased to %d", frame_len, len);
+#endif
+    } else {
+        len = frame_len;
+    }
+
+    return len;
+}
+
 /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
  * for all other segments data, bss and text. */
 
@@ -280,31 +333,70 @@ free_dpdk_buf(struct dp_packet *p)
 }
 
 static void
-__rte_pktmbuf_init(struct rte_mempool *mp,
-                   void *opaque_arg OVS_UNUSED,
-                   void *_m,
-                   unsigned i OVS_UNUSED)
+ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
 {
-    struct rte_mbuf *m = _m;
-    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
+    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
+    struct rte_pktmbuf_pool_private default_mbp_priv;
+    uint16_t roomsz;
 
     RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
 
-    memset(m, 0, mp->elt_size);
+    /* If no structure is provided, assume no mbuf private area. */
 
-    /* start of buffer is just after mbuf structure */
-    m->buf_addr = (char *)m + sizeof(struct dp_packet);
-    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
-                    sizeof(struct dp_packet);
-    m->buf_len = (uint16_t)buf_len;
+    user_mbp_priv = opaque_arg;
+    if (user_mbp_priv == NULL) {
+        default_mbp_priv.mbuf_priv_size = 0;
+        if (mp->elt_size > sizeof(struct dp_packet)) {
+            roomsz = mp->elt_size - sizeof(struct dp_packet);
+        } else {
+            roomsz = 0;
+        }
+        default_mbp_priv.mbuf_data_room_size = roomsz;
+        user_mbp_priv = &default_mbp_priv;
+    }
 
-    /* keep some headroom between start of buffer and data */
-    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
+    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
+        user_mbp_priv->mbuf_data_room_size +
+        user_mbp_priv->mbuf_priv_size);
 
-    /* init some constant fields */
-    m->pool = mp;
-    m->nb_segs = 1;
-    m->port = 0xff;
+    mbp_priv = rte_mempool_get_priv(mp);
+    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
+}
+
+/* Initialize some fields in the mbuf structure that are not modified by the
+ * user once created (origin pool, buffer start address, etc.). */
+static void
+__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
+                       void *opaque_arg OVS_UNUSED,
+                       void *_m,
+                       unsigned i OVS_UNUSED)
+{
+    struct rte_mbuf *m = _m;
+    uint32_t buf_size, buf_len, priv_size;
+
+    priv_size = rte_pktmbuf_priv_size(mp);
+    buf_size = sizeof(struct dp_packet) + priv_size;
+    buf_len = rte_pktmbuf_data_room_size(mp);
+
+    RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);
+    RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
+    RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
+
+    memset(m, 0, mp->elt_size);
+
+    /* Start of buffer is after the dp_packet structure and priv data. */
+    m->priv_size = priv_size;
+    m->buf_addr = (char *)m + buf_size;
+    m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
+    m->buf_len = (uint16_t)buf_len;
+
+    /* Keep some headroom between start of buffer and data. */
+    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
+
+    /* Init some constant fields. */
+    m->pool = mp;
+    m->nb_segs = 1;
+    m->port = 0xff;
 }
 
 static void
@@ -315,7 +407,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
 {
     struct rte_mbuf *m = _m;
 
-    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
+    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
 
     dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
 }
@@ -326,6 +418,7 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
     struct dpdk_mp *dmp = NULL;
     char mp_name[RTE_MEMPOOL_NAMESIZE];
     unsigned mp_size;
+    struct rte_pktmbuf_pool_private mbp_priv;
 
     LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
         if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
@@ -338,6 +431,8 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
     dmp->socket_id = socket_id;
     dmp->mtu = mtu;
     dmp->refcount = 1;
+    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM;
+    mbp_priv.mbuf_priv_size = 0;
 
     mp_size = MAX_NB_MBUF;
     do {
@@ -346,10 +441,10 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
             return NULL;
         }
 
-        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
+        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SEGMENT_SIZE(mtu),
                                      MP_CACHE_SZ,
                                      sizeof(struct rte_pktmbuf_pool_private),
-                                     rte_pktmbuf_pool_init, NULL,
+                                     ovs_rte_pktmbuf_pool_init, &mbp_priv,
                                      ovs_rte_pktmbuf_init, NULL,
                                      socket_id, 0);
     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= MIN_NB_MBUF);
@@ -433,6 +528,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
 {
     int diag = 0;
     int i;
+    struct rte_eth_conf conf = port_conf;
 
     /* A device may report more queues than it makes available (this has
      * been observed for Intel xl710, which reserves some of them for
@@ -444,7 +540,11 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
             VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
         }
 
-        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
+#ifdef NETDEV_DPDK_JUMBO
+        conf.rxmode.max_rx_pkt_len = dpdk_frame_len(dev,
+                                                    NETDEV_DPDK_MAX_FRAME_LEN);
+#endif
+        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
         if (diag) {
             break;
         }
@@ -586,6 +686,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
     int sid;
     int err = 0;
+    uint32_t max_frame_len;
 
     ovs_mutex_init(&netdev->mutex);
     ovs_mutex_lock(&netdev->mutex);
@@ -605,8 +706,13 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
     netdev->port_id = port_no;
     netdev->type = type;
     netdev->flags = 0;
-    netdev->mtu = ETHER_MTU;
-    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
+#ifdef NETDEV_DPDK_JUMBO
+    max_frame_len = dpdk_frame_len(netdev, NETDEV_DPDK_MAX_FRAME_LEN);
+#else
+    max_frame_len = dpdk_frame_len(netdev, ETHER_MAX_LEN);
+#endif
+    netdev->mtu = FRAME_LEN_TO_MTU(max_frame_len);
+    netdev->max_packet_len = max_frame_len;
 
     netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
     if (!netdev->dpdk_mp) {
@@ -1386,14 +1492,14 @@ netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
     old_mp = dev->dpdk_mp;
     dev->dpdk_mp = mp;
     dev->mtu = mtu;
-    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
 
     err = dpdk_eth_dev_init(dev);
     if (err) {
         dpdk_mp_put(mp);
         dev->mtu = old_mtu;
         dev->dpdk_mp = old_mp;
-        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
+        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
         dpdk_eth_dev_init(dev);
         goto out;
     }
-- 
1.7.4.1



