[ovs-dev] [PATCH V5 7/7] netdev-dpdk: add support for Jumbo Frames

Kavanagh, Mark B mark.b.kavanagh at intel.com
Tue Aug 9 14:52:50 UTC 2016


>Few minor comments inline.

Thanks Ilya - responses inline.

>
>On 09.08.2016 15:03, Mark Kavanagh wrote:
>> Add support for Jumbo Frames to DPDK-enabled port types,
>> using single-segment-mbufs.
>>
>> Using this approach, the amount of memory allocated to each mbuf
>> to store frame data is increased to a value greater than 1518B
>> (typical Ethernet maximum frame length). The increased space
>> available in the mbuf means that an entire Jumbo Frame of a specific
>> size can be carried in a single mbuf, as opposed to partitioning
>> it across multiple mbuf segments.
>>
>> The amount of space allocated to each mbuf to hold frame data is
>> defined dynamically by the user with ovs-vsctl, via the 'mtu_request'
>> parameter.
>>
>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh at intel.com>
>> [diproiettod at vmware.com rebased]
>> Signed-off-by: Daniele Di Proietto <diproiettod at vmware.com>
>> ---
>>
>> v5:
>>     - rename dpdk_mp_configure to netdev_dpdk_mempool_configure
>>     - consolidate socket_id and mtu changes within
>>       netdev_dpdk_mempool_configure
>>     - add lower bounds check for user-supplied MTU
>>     - add socket_id and mtu fields to mempool configure error report
>>     - minor cosmetic changes
>>
>> v4:
>>     - restore error reporting in *_reconfigure functions (for
>>       non-mtu-configuration based errors)
>>     - remove 'goto' in the event of dpdk_mp_configure failure
>>     - remove superfluous error variables
>>
>>  v3:
>>     - replace netdev_dpdk.last_mtu with local variable
>>     - add comment for dpdk_mp_configure
>>
>>  v2:
>>      - rebase to HEAD of master
>>      - fall back to previous 'good' MTU if reconfigure fails
>>      - introduce new field 'last_mtu' in struct netdev_dpdk to facilitate
>>        fall-back
>>      - rename 'mtu_request' to 'requested_mtu' in struct netdev_dpdk
>>      - remove rebasing artifact in INSTALL.DPDK-Advanced.md
>>      - remove superflous variable in dpdk_mp_configure
>>      - fix minor coding style infraction
>>
>>  INSTALL.DPDK-ADVANCED.md |  58 ++++++++++++++++++-
>>  INSTALL.DPDK.md          |   1 -
>>  NEWS                     |   1 +
>>  lib/netdev-dpdk.c        | 145 +++++++++++++++++++++++++++++++++++++++--------
>>  4 files changed, 176 insertions(+), 29 deletions(-)
>>
>> diff --git a/INSTALL.DPDK-ADVANCED.md b/INSTALL.DPDK-ADVANCED.md
>> index 0ab43d4..5e758ce 100755
>> --- a/INSTALL.DPDK-ADVANCED.md
>> +++ b/INSTALL.DPDK-ADVANCED.md
>> @@ -1,5 +1,5 @@
>>  OVS DPDK ADVANCED INSTALL GUIDE
>> -=================================
>> +===============================
>>
>>  ## Contents
>>
>> @@ -12,7 +12,8 @@ OVS DPDK ADVANCED INSTALL GUIDE
>>  7. [QOS](#qos)
>>  8. [Rate Limiting](#rl)
>>  9. [Flow Control](#fc)
>> -10. [Vsperf](#vsperf)
>> +10. [Jumbo Frames](#jumbo)
>> +11. [Vsperf](#vsperf)
>>
>>  ## <a name="overview"></a> 1. Overview
>>
>> @@ -862,7 +863,58 @@ respective parameter. To disable the flow control at tx side,
>>
>>  `ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false`
>>
>> -## <a name="vsperf"></a> 10. Vsperf
>> +## <a name="jumbo"></a> 10. Jumbo Frames
>> +
>> +By default, DPDK ports are configured with standard Ethernet MTU (1500B). To
>> +enable Jumbo Frames support for a DPDK port, change the Interface's `mtu_request`
>> +attribute to a sufficiently large value.
>> +
>> +e.g. Add a DPDK Phy port with MTU of 9000:
>> +
>> +`ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0
>mtu_request=9000`
>> +
>> +e.g. Change the MTU of an existing port to 6200:
>> +
>> +`ovs-vsctl set Interface dpdk0 mtu_request=6200`
>> +
>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
>> +increased, such that a full Jumbo Frame of a specific size may be accommodated
>> +within a single mbuf segment.
>> +
>> +Jumbo frame support has been validated against 9728B frames (largest frame size
>> +supported by Fortville NIC), using the DPDK `i40e` driver, but larger frames
>> +(particularly in use cases involving East-West traffic only), and other DPDK NIC
>> +drivers may be supported.
>> +
>> +### 9.1 vHost Ports and Jumbo Frames
>> +
>> +Some additional configuration is needed to take advantage of jumbo frames with
>> +vhost ports:
>> +
>> +    1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
>> +        the QEMU command line snippet below:
>> +
>> +        ```
>> +        '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>> +        '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>> +        ```
>> +
>> +    2. Where virtio devices are bound to the Linux kernel driver in a guest
>> +       environment (i.e. interfaces are not bound to an in-guest DPDK driver),
>> +       the MTU of those logical network interfaces must also be increased to a
>> +       sufficiently large value. This avoids segmentation of Jumbo Frames
>> +       received in the guest. Note that 'MTU' refers to the length of the IP
>> +       packet only, and not that of the entire frame.
>> +
>> +       To calculate the exact MTU of a standard IPv4 frame, subtract the L2
>> +       header and CRC lengths (i.e. 18B) from the max supported frame size.
>> +       So, to set the MTU for a 9018B Jumbo Frame:
>> +
>> +       ```
>> +       ifconfig eth1 mtu 9000
>> +       ```
>> +
>> +## <a name="vsperf"></a> 11. Vsperf
>>
>>  Vsperf project goal is to develop vSwitch test framework that can be used to
>>  validate the suitability of different vSwitch implementations in a Telco deployment
>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>> index 253d022..a810ac8 100644
>> --- a/INSTALL.DPDK.md
>> +++ b/INSTALL.DPDK.md
>> @@ -590,7 +590,6 @@ can be found in [Vhost Walkthrough].
>>
>>  ## <a name="ovslimits"></a> 6. Limitations
>>
>> -  - Supports MTU size 1500, MTU setting for DPDK netdevs will be in future OVS release.
>>    - Currently DPDK ports does not use HW offload functionality.
>>    - Network Interface Firmware requirements:
>>      Each release of DPDK is validated against a specific firmware version for
>> diff --git a/NEWS b/NEWS
>> index ce10982..53c816b 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -69,6 +69,7 @@ Post-v2.5.0
>>       * Basic connection tracking for the userspace datapath (no ALG,
>>         fragmentation or NAT support yet)
>>       * Support for DPDK 16.07
>> +     * Jumbo frame support
>>     - Increase number of registers to 16.
>>     - ovs-benchmark: This utility has been removed due to lack of use and
>>       bitrot.
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 60db568..c1d4cdf 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -81,6 +81,7 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>                                      + sizeof(struct dp_packet)    \
>>                                      + RTE_PKTMBUF_HEADROOM)
>>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>> +#define NETDEV_DPDK_MAX_PKT_LEN     9728
>>
>>  /* Max and min number of packets in the mempool.  OVS tries to allocate a
>>   * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
>> @@ -368,6 +369,7 @@ struct netdev_dpdk {
>>      /* The following properties cannot be changed when a device is running,
>>       * so we remember the request and update them next time
>>       * netdev_dpdk*_reconfigure() is called */
>> +    int requested_mtu;
>>      int requested_n_txq;
>>      int requested_n_rxq;
>>
>> @@ -477,10 +479,19 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>>      dmp->mtu = mtu;
>>      dmp->refcount = 1;
>>      mbp_priv.mbuf_data_room_size = MBUF_SIZE(mtu) - sizeof(struct dp_packet);
>> -    mbp_priv.mbuf_priv_size = sizeof (struct dp_packet) -
>> -                              sizeof (struct rte_mbuf);
>> +    mbp_priv.mbuf_priv_size = sizeof (struct dp_packet)
>> +                              - sizeof (struct rte_mbuf);
>> +    /* XXX: this is a really rough method of provisioning memory.
>> +     * It's impossible to determine what the exact memory requirements are when
>> +     * the number of ports and rxqs that utilize a particular mempool can change
>> +     * dynamically at runtime. For the moment, use this rough heurisitic.
>> +     */
>> +    if (mtu >= ETHER_MTU) {
>> +        mp_size = MAX_NB_MBUF;
>> +    } else {
>> +        mp_size = MIN_NB_MBUF;
>> +    }
>>
>> -    mp_size = MAX_NB_MBUF;
>>      do {
>>          if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u",
>>                       dmp->mtu, dmp->socket_id, mp_size) < 0) {
>> @@ -520,6 +531,35 @@ dpdk_mp_put(struct dpdk_mp *dmp) OVS_REQUIRES(dpdk_mutex)
>>      }
>>  }
>>
>> +/* Tries to allocate new mempool on requested_socket_id with
>> + * mbuf size corresponding to requested_mtu.
>> + * On success new configuration will be applied.
>> + * On error, device will be left unchanged. */
>> +static int
>> +netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>> +    OVS_REQUIRES(dpdk_mutex)
>> +    OVS_REQUIRES(dev->mutex)
>> +{
>> +    uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>> +    struct dpdk_mp *mp;
>> +
>> +    mp = dpdk_mp_get(dev->requested_socket_id, FRAME_LEN_TO_MTU(buf_size));
>> +    if (!mp) {
>> +        VLOG_ERR("Insufficient memory to create memory pool for "
>> +                 "netdev %s, with MTU %d on socket %d\n",
>
>Please, don't split the string that way. This breaks attempts to grep for it.
>Splitting is safe near to qualifiers. Like this:
>
>        VLOG_ERR("Insufficient memory to create memory pool for netdev "
>                 "%s, with MTU %d on socket %d\n",

Fair point - I'll refactor accordingly in V6.

>
>> +                 dev->up.name, dev->requested_mtu, dev->requested_socket_id);
>> +        return ENOMEM;
>> +    } else {
>> +        dpdk_mp_put(dev->dpdk_mp);
>> +        dev->dpdk_mp = mp;
>> +        dev->mtu = dev->requested_mtu;
>> +        dev->socket_id = dev->requested_socket_id;
>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>  static void
>>  check_link_status(struct netdev_dpdk *dev)
>>  {
>> @@ -571,7 +611,15 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int
>n_txq)
>>  {
>>      int diag = 0;
>>      int i;
>> +    struct rte_eth_conf conf = port_conf;
>>
>> +    if (dev->mtu > ETHER_MTU) {
>> +        conf.rxmode.jumbo_frame = 1;
>> +        conf.rxmode.max_rx_pkt_len = dev->max_packet_len;
>> +    } else {
>> +        conf.rxmode.jumbo_frame = 0;
>> +        conf.rxmode.max_rx_pkt_len = 0;
>> +    }
>>      /* A device may report more queues than it makes available (this has
>>       * been observed for Intel xl710, which reserves some of them for
>>       * SRIOV):  rte_eth_*_queue_setup will fail if a queue is not
>> @@ -582,8 +630,10 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int
>n_txq)
>>              VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>>          }
>>
>> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>>          if (diag) {
>> +            VLOG_WARN("Interface %s eth_dev setup error %s\n",
>> +                      dev->up.name, rte_strerror(-diag));
>>              break;
>>          }
>>
>> @@ -736,7 +786,6 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int port_no,
>>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>      int sid;
>>      int err = 0;
>> -    uint32_t buf_size;
>>
>>      ovs_mutex_init(&dev->mutex);
>>      ovs_mutex_lock(&dev->mutex);
>> @@ -757,15 +806,13 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int port_no,
>>      dev->port_id = port_no;
>>      dev->type = type;
>>      dev->flags = 0;
>> -    dev->mtu = ETHER_MTU;
>> +    dev->requested_mtu = dev->mtu = ETHER_MTU;
>>      dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>      ovsrcu_index_init(&dev->vid, -1);
>>      dev->vhost_reconfigured = false;
>>
>> -    buf_size = dpdk_buf_size(dev->mtu);
>> -    dev->dpdk_mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size));
>> -    if (!dev->dpdk_mp) {
>> -        err = ENOMEM;
>> +    err = netdev_dpdk_mempool_configure(dev);
>> +    if (err) {
>>          goto unlock;
>>      }
>>
>> @@ -991,6 +1038,7 @@ netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
>>      smap_add_format(args, "configured_rx_queues", "%d", netdev->n_rxq);
>>      smap_add_format(args, "requested_tx_queues", "%d", dev->requested_n_txq);
>>      smap_add_format(args, "configured_tx_queues", "%d", netdev->n_txq);
>> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>>      ovs_mutex_unlock(&dev->mutex);
>>
>>      return 0;
>> @@ -1366,6 +1414,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>>      struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
>>      unsigned int total_pkts = cnt;
>>      unsigned int qos_pkts = cnt;
>> +    unsigned int mtu_dropped = 0;
>>      int retries = 0;
>>
>>      qid = dev->tx_q[qid % netdev->n_txq].map;
>> @@ -1387,25 +1436,41 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
>>      do {
>>          int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
>>          unsigned int tx_pkts;
>> +        unsigned int try_tx_pkts = cnt;
>>
>> +        for (unsigned int i = 0; i < cnt; i++) {
>> +            if (cur_pkts[i]->pkt_len > dev->max_packet_len) {
>> +                try_tx_pkts = i;
>> +                break;
>> +            }
>> +        }
>> +        if (!try_tx_pkts) {
>> +            cur_pkts++;
>> +            mtu_dropped++;
>> +            cnt--;
>> +            continue;
>> +        }
>>          tx_pkts = rte_vhost_enqueue_burst(netdev_dpdk_get_vid(dev),
>> -                                          vhost_qid, cur_pkts, cnt);
>> +                                          vhost_qid, cur_pkts, try_tx_pkts);
>>          if (OVS_LIKELY(tx_pkts)) {
>>              /* Packets have been sent.*/
>>              cnt -= tx_pkts;
>>              /* Prepare for possible retry.*/
>>              cur_pkts = &cur_pkts[tx_pkts];
>> +            if (tx_pkts != try_tx_pkts) {
>> +                retries++;
>> +            }
>>          } else {
>>              /* No packets sent - do not retry.*/
>>              break;
>>          }
>> -    } while (cnt && (retries++ < VHOST_ENQ_RETRY_NUM));
>> +    } while (cnt && (retries <= VHOST_ENQ_RETRY_NUM));
>>
>>      rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>>
>>      rte_spinlock_lock(&dev->stats_lock);
>> -    cnt += qos_pkts;
>> -    netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts, cnt);
>> +    netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts,
>> +                                         cnt + mtu_dropped + qos_pkts);
>>      rte_spinlock_unlock(&dev->stats_lock);
>>
>>  out:
>> @@ -1639,6 +1704,27 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
>>  }
>>
>>  static int
>> +netdev_dpdk_set_mtu(struct netdev *netdev, int mtu)
>> +{
>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +
>> +    if (MTU_TO_FRAME_LEN(mtu) > NETDEV_DPDK_MAX_PKT_LEN ||
>> +            mtu < ETHER_MIN_MTU) {
>
>Should be:
>
>  +    if (MTU_TO_FRAME_LEN(mtu) > NETDEV_DPDK_MAX_PKT_LEN
>  +        || mtu < ETHER_MIN_MTU) {

You're right -  I'll change it to this way in the next version.
Incidentally, checkpatch.py didn't detect this - the addition of such a check would ensure consistency across all future commits.

>
>> +        VLOG_WARN("Unsupported MTU (%d)\n", mtu);
>
>Did you missed my suggestion about device name in this message or
>you just don't like it?

I actually missed it - I'll add it in the next version.

>
>> +        return EINVAL;
>> +    }
>> +
>> +    ovs_mutex_lock(&dev->mutex);
>> +    if (dev->requested_mtu != mtu) {
>> +        dev->requested_mtu = mtu;
>> +        netdev_request_reconfigure(netdev);
>> +    }
>> +    ovs_mutex_unlock(&dev->mutex);
>> +
>> +    return 0;
>> +}
>> +
>> +static int
>>  netdev_dpdk_get_carrier(const struct netdev *netdev, bool *carrier);
>>
>>  static int
>> @@ -2803,7 +2889,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>>      ovs_mutex_lock(&dev->mutex);
>>
>>      if (netdev->n_txq == dev->requested_n_txq
>> -        && netdev->n_rxq == dev->requested_n_rxq) {
>> +        && netdev->n_rxq == dev->requested_n_rxq
>> +        && dev->mtu == dev->requested_mtu) {
>>          /* Reconfiguration is unnecessary */
>>
>>          goto out;
>> @@ -2811,6 +2898,10 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>>
>>      rte_eth_dev_stop(dev->port_id);
>>
>> +    if (dev->mtu != dev->requested_mtu) {
>> +        netdev_dpdk_mempool_configure(dev);
>> +    }
>> +
>>      netdev->n_txq = dev->requested_n_txq;
>>      netdev->n_rxq = dev->requested_n_rxq;
>>
>> @@ -2818,6 +2909,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>>      err = dpdk_eth_dev_init(dev);
>>      netdev_dpdk_alloc_txq(dev, netdev->n_txq);
>>
>> +    netdev_change_seq_changed(netdev);
>> +
>>  out:
>>
>>      ovs_mutex_unlock(&dev->mutex);
>> @@ -2830,7 +2923,6 @@ static int
>>  netdev_dpdk_vhost_user_reconfigure(struct netdev *netdev)
>>  {
>>      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> -    int err = 0;
>>
>>      ovs_mutex_lock(&dpdk_mutex);
>>      ovs_mutex_lock(&dev->mutex);
>> @@ -2845,13 +2937,10 @@ netdev_dpdk_vhost_user_reconfigure(struct netdev *netdev)
>>
>>      netdev_dpdk_remap_txqs(dev);
>>
>> -    if (dev->requested_socket_id != dev->socket_id) {
>> -        dev->socket_id = dev->requested_socket_id;
>> -        /* Change mempool to new NUMA Node */
>> -        dpdk_mp_put(dev->dpdk_mp);
>> -        dev->dpdk_mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>> -        if (!dev->dpdk_mp) {
>> -            err = ENOMEM;
>> +    if (dev->requested_socket_id != dev->socket_id
>> +        || dev->requested_mtu != dev->mtu) {
>> +        if (!netdev_dpdk_mempool_configure(dev)) {
>> +            netdev_change_seq_changed(netdev);
>>          }
>>      }
>>
>> @@ -2862,7 +2951,7 @@ netdev_dpdk_vhost_user_reconfigure(struct netdev *netdev)
>>      ovs_mutex_unlock(&dev->mutex);
>>      ovs_mutex_unlock(&dpdk_mutex);
>>
>> -    return err;
>> +    return 0;
>>  }
>>
>>  static int
>> @@ -2876,6 +2965,12 @@ netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev)
>>      netdev->n_txq = dev->requested_n_txq;
>>      netdev->n_rxq = 1;
>>
>> +    if (dev->requested_mtu != dev->mtu) {
>> +        if (!netdev_dpdk_mempool_configure(dev)) {
>> +            netdev_change_seq_changed(netdev);
>> +        }
>> +    }
>> +
>>      ovs_mutex_unlock(&dev->mutex);
>>      ovs_mutex_unlock(&dpdk_mutex);
>>
>> @@ -2913,7 +3008,7 @@ netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev)
>>      netdev_dpdk_set_etheraddr,                                \
>>      netdev_dpdk_get_etheraddr,                                \
>>      netdev_dpdk_get_mtu,                                      \
>> -    NULL,                       /* set_mtu */                 \
>> +    netdev_dpdk_set_mtu,                                      \
>>      netdev_dpdk_get_ifindex,                                  \
>>      GET_CARRIER,                                              \
>>      netdev_dpdk_get_carrier_resets,                           \
>>



More information about the dev mailing list