[ovs-dev] [ovs-dev, 7/7] netdev-dpdk: add support for Jumbo Frames
Ilya Maximets
i.maximets at samsung.com
Mon Aug 1 13:14:54 UTC 2016
Hi Daniele. Thanks for posting this.
I have an almost identical patch in my local branch.
I haven't tested this with physical DPDK NICs yet, but I have a few
high-level comments:
1. Have you thought about renaming 'mtu_request' inside netdev-dpdk
   to 'requested_mtu'? I think this would be clearer and more
   consistent with the other configurable parameters (n_rxq, n_txq, ...).
2. I'd prefer not to fail reconfiguration if there is not enough memory
   for the new mempool. I think it will be a common situation to request
   more memory than we have. Failure leads to destruction of the port
   and, if vhost is in server mode, the inability to reconnect to the
   vhost-user port after re-creation. We could just keep the old
   mempool and inform the user via VLOG_ERR.
3. Minor issues inline.
What do you think?
Best regards, Ilya Maximets.
On 30.07.2016 04:22, Daniele Di Proietto wrote:
> From: Mark Kavanagh <mark.b.kavanagh at intel.com>
>
> Add support for Jumbo Frames to DPDK-enabled port types,
> using single-segment-mbufs.
>
> Using this approach, the amount of memory allocated to each mbuf
> to store frame data is increased to a value greater than 1518B
> (typical Ethernet maximum frame length). The increased space
> available in the mbuf means that an entire Jumbo Frame of a specific
> size can be carried in a single mbuf, as opposed to partitioning
> it across multiple mbuf segments.
>
> The amount of space allocated to each mbuf to hold frame data is
> defined dynamically by the user with ovs-vsctl, via the 'mtu_request'
> parameter.
>
> Signed-off-by: Mark Kavanagh <mark.b.kavanagh at intel.com>
> [diproiettod at vmware.com rebased]
> Signed-off-by: Daniele Di Proietto <diproiettod at vmware.com>
> ---
> INSTALL.DPDK-ADVANCED.md | 59 +++++++++++++++++-
> INSTALL.DPDK.md | 1 -
> NEWS | 1 +
> lib/netdev-dpdk.c | 151 +++++++++++++++++++++++++++++++++++++++--------
> 4 files changed, 185 insertions(+), 27 deletions(-)
>
> diff --git a/INSTALL.DPDK-ADVANCED.md b/INSTALL.DPDK-ADVANCED.md
> index 191e69e..5cd64bf 100755
> --- a/INSTALL.DPDK-ADVANCED.md
> +++ b/INSTALL.DPDK-ADVANCED.md
> @@ -1,5 +1,5 @@
> OVS DPDK ADVANCED INSTALL GUIDE
> -=================================
> +===============================
>
> ## Contents
>
> @@ -12,7 +12,8 @@ OVS DPDK ADVANCED INSTALL GUIDE
> 7. [QOS](#qos)
> 8. [Rate Limiting](#rl)
> 9. [Flow Control](#fc)
> -10. [Vsperf](#vsperf)
> +10. [Jumbo Frames](#jumbo)
> +11. [Vsperf](#vsperf)
>
> ## <a name="overview"></a> 1. Overview
>
> @@ -862,7 +863,59 @@ respective parameter. To disable the flow control at tx side,
>
> `ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false`
>
> -## <a name="vsperf"></a> 10. Vsperf
> +## <a name="jumbo"></a> 10. Jumbo Frames
> +
> +By default, DPDK ports are configured with standard Ethernet MTU (1500B). To
> +enable Jumbo Frames support for a DPDK port, change the Interface's `mtu_request`
> +attribute to a sufficiently large value.
> +
> +e.g. Add a DPDK Phy port with MTU of 9000:
> +
> +`ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set Interface dpdk0 mtu_request=9000`
> +
> +e.g. Change the MTU of an existing port to 6200:
> +
> +`ovs-vsctl set Interface dpdk0 mtu_request=6200`
> +
> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
> +increased, such that a full Jumbo Frame of a specific size may be accommodated
> +within a single mbuf segment.
> +
> +Jumbo frame support has been validated against 9728B frames (largest frame size
> +supported by Fortville NIC), using the DPDK `i40e` driver, but larger frames
> +(particularly in use cases involving East-West traffic only), and other DPDK NIC
> +drivers may be supported.
> +
> +### 10.1 vHost Ports and Jumbo Frames
> +
> +Some additional configuration is needed to take advantage of jumbo frames with
> +vhost ports:
> +
> + 1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
> + the QEMU command line snippet below:
> +
> + ```
> + '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
> + '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
> + ```
> +
> + 2. Where virtio devices are bound to the Linux kernel driver in a guest
> + environment (i.e. interfaces are not bound to an in-guest DPDK driver),
> + the MTU of those logical network interfaces must also be increased to a
> + sufficiently large value. This avoids segmentation of Jumbo Frames
> + received in the guest. Note that 'MTU' refers to the length of the IP
> + packet only, and not that of the entire frame.
> +
> + To calculate the exact MTU of a standard IPv4 frame, subtract the L2
> + header and CRC lengths (i.e. 18B) from the max supported frame size.
> + So, to set the MTU for a 9018B Jumbo Frame:
> +
> + ```
> + ifconfig eth1 mtu 9000
> + ```
> +>>>>>>> 5ec921d... netdev-dpdk: add support for Jumbo Frames
Looks like a rebasing artefact.
> +
> +## <a name="vsperf"></a> 11. Vsperf
>
> Vsperf project goal is to develop vSwitch test framework that can be used to
> validate the suitability of different vSwitch implementations in a Telco deployment
> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> index 7609aa7..25c79de 100644
> --- a/INSTALL.DPDK.md
> +++ b/INSTALL.DPDK.md
> @@ -590,7 +590,6 @@ can be found in [Vhost Walkthrough].
>
> ## <a name="ovslimits"></a> 6. Limitations
>
> - - Supports MTU size 1500, MTU setting for DPDK netdevs will be in future OVS release.
> - Currently DPDK ports does not use HW offload functionality.
> - Network Interface Firmware requirements:
> Each release of DPDK is validated against a specific firmware version for
> diff --git a/NEWS b/NEWS
> index 0ff5616..c004e5f 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -68,6 +68,7 @@ Post-v2.5.0
> is enabled in DPDK.
> * Basic connection tracking for the userspace datapath (no ALG,
> fragmentation or NAT support yet)
> + * Jumbo frame support
> - Increase number of registers to 16.
> - ovs-benchmark: This utility has been removed due to lack of use and
> bitrot.
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 0b6e410..68639ae 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -82,6 +82,7 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> + sizeof(struct dp_packet) \
> + RTE_PKTMBUF_HEADROOM)
> #define NETDEV_DPDK_MBUF_ALIGN 1024
> +#define NETDEV_DPDK_MAX_PKT_LEN 9728
>
> /* Max and min number of packets in the mempool. OVS tries to allocate a
> * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
> @@ -336,6 +337,7 @@ struct netdev_dpdk {
> struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
>
> struct dpdk_mp *dpdk_mp;
> + int mtu_request;
> int mtu;
> int socket_id;
> int buf_size;
> @@ -474,10 +476,19 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
> dmp->mtu = mtu;
> dmp->refcount = 1;
> mbp_priv.mbuf_data_room_size = MBUF_SIZE(mtu) - sizeof(struct dp_packet);
> - mbp_priv.mbuf_priv_size = sizeof (struct dp_packet) -
> - sizeof (struct rte_mbuf);
> + mbp_priv.mbuf_priv_size = sizeof (struct dp_packet)
> + - sizeof (struct rte_mbuf);
> + /* XXX: this is a really rough method of provisioning memory.
> + * It's impossible to determine what the exact memory requirements are when
> + * the number of ports and rxqs that utilize a particular mempool can change
> + * dynamically at runtime. For the moment, use this rough heurisitic.
> + */
> + if (mtu >= ETHER_MTU) {
> + mp_size = MAX_NB_MBUF;
> + } else {
> + mp_size = MIN_NB_MBUF;
> + }
>
> - mp_size = MAX_NB_MBUF;
> do {
> if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u",
> dmp->mtu, dmp->socket_id, mp_size) < 0) {
> @@ -522,6 +533,32 @@ dpdk_mp_put(struct dpdk_mp *dmp)
> #endif
> }
>
> +static int
> +dpdk_mp_configure(struct netdev_dpdk *dev)
> + OVS_REQUIRES(dpdk_mutex)
> + OVS_REQUIRES(dev->mutex)
> +{
> + uint32_t buf_size = dpdk_buf_size(dev->mtu);
> + struct dpdk_mp *mp_old, *mp;
> +
> + mp_old = dev->dpdk_mp;
Do we need this variable? We could just dpdk_mp_put() the old mempool
before assigning the new one.
> +
> + mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size));
> + if (!mp) {
> + VLOG_ERR("Insufficient memory to create memory pool for netdev %s\n",
> + dev->up.name);
The continuation line here is off by one space character.
> + return ENOMEM;
> + }
> +
> + dev->dpdk_mp = mp;
> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> +
> + dpdk_mp_put(mp_old);
> +
> + return 0;
> +}
> +
> +
> static void
> check_link_status(struct netdev_dpdk *dev)
> {
> @@ -573,7 +610,15 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
> {
> int diag = 0;
> int i;
> + struct rte_eth_conf conf = port_conf;
>
> + if (dev->mtu > ETHER_MTU) {
> + conf.rxmode.jumbo_frame = 1;
> + conf.rxmode.max_rx_pkt_len = dev->max_packet_len;
> + } else {
> + conf.rxmode.jumbo_frame = 0;
> + conf.rxmode.max_rx_pkt_len = 0;
I know it was implemented this way in the original patch, but I'm not
sure that all DPDK drivers will handle a zero value of
'max_rx_pkt_len' correctly.
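If a zero value turns out to be problematic, one option (just a sketch, I
haven't tested this against all PMDs) would be to fall back to the standard
maximum frame length instead:

```c
conf.rxmode.jumbo_frame = 0;
conf.rxmode.max_rx_pkt_len = ETHER_MAX_LEN;  /* 1518B instead of 0. */
```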
> + }
> /* A device may report more queues than it makes available (this has
> * been observed for Intel xl710, which reserves some of them for
> * SRIOV): rte_eth_*_queue_setup will fail if a queue is not
> @@ -584,8 +629,10 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
> }
>
> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
> + diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
> if (diag) {
> + VLOG_WARN("Interface %s eth_dev setup error %s\n",
> + dev->up.name, rte_strerror(-diag));
> break;
> }
>
> @@ -738,7 +785,6 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int port_no,
> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> int sid;
> int err = 0;
> - uint32_t buf_size;
>
> ovs_mutex_init(&dev->mutex);
> ovs_mutex_lock(&dev->mutex);
> @@ -759,13 +805,11 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int port_no,
> dev->port_id = port_no;
> dev->type = type;
> dev->flags = 0;
> - dev->mtu = ETHER_MTU;
> + dev->mtu_request = dev->mtu = ETHER_MTU;
> dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>
> - buf_size = dpdk_buf_size(dev->mtu);
> - dev->dpdk_mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size));
> - if (!dev->dpdk_mp) {
> - err = ENOMEM;
> + err = dpdk_mp_configure(dev);
> + if (err) {
> goto unlock;
> }
>
> @@ -986,6 +1030,7 @@ netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
> smap_add_format(args, "configured_rx_queues", "%d", netdev->n_rxq);
> smap_add_format(args, "requested_tx_queues", "%d", dev->requested_n_txq);
> smap_add_format(args, "configured_tx_queues", "%d", netdev->n_txq);
> + smap_add_format(args, "mtu", "%d", dev->mtu);
> ovs_mutex_unlock(&dev->mutex);
>
> return 0;
> @@ -1362,6 +1407,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
> unsigned int total_pkts = cnt;
> unsigned int qos_pkts = cnt;
> + unsigned int mtu_dropped = 0;
> int retries = 0;
>
> qid = dev->tx_q[qid % netdev->n_txq].map;
> @@ -1383,25 +1429,41 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> do {
> int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ;
> unsigned int tx_pkts;
> + unsigned int try_tx_pkts = cnt;
>
> + for (unsigned int i = 0; i < cnt; i++) {
> + if (cur_pkts[i]->pkt_len > dev->max_packet_len) {
> + try_tx_pkts = i;
> + break;
> + }
> + }
> + if (!try_tx_pkts) {
> + cur_pkts++;
> + mtu_dropped++;
> + cnt--;
> + continue;
> + }
> tx_pkts = rte_vhost_enqueue_burst(virtio_dev, vhost_qid,
> - cur_pkts, cnt);
> + cur_pkts, try_tx_pkts);
> if (OVS_LIKELY(tx_pkts)) {
> /* Packets have been sent.*/
> cnt -= tx_pkts;
> /* Prepare for possible retry.*/
> cur_pkts = &cur_pkts[tx_pkts];
> + if (tx_pkts != try_tx_pkts) {
> + retries++;
> + }
> } else {
> /* No packets sent - do not retry.*/
> break;
> }
> - } while (cnt && (retries++ < VHOST_ENQ_RETRY_NUM));
> + } while (cnt && (retries <= VHOST_ENQ_RETRY_NUM));
>
> rte_spinlock_unlock(&dev->tx_q[qid].tx_lock);
>
> rte_spinlock_lock(&dev->stats_lock);
> - cnt += qos_pkts;
> - netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts, cnt);
> + netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts,
> + cnt + mtu_dropped + qos_pkts);
> rte_spinlock_unlock(&dev->stats_lock);
>
> out:
> @@ -1635,6 +1697,26 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
> }
>
> static int
> +netdev_dpdk_set_mtu(struct netdev *netdev, int mtu)
> +{
> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +
> + if (MTU_TO_FRAME_LEN(mtu) > NETDEV_DPDK_MAX_PKT_LEN) {
> + VLOG_WARN("Unsupported MTU (%d)\n", mtu);
> + return EINVAL;
> + }
> +
> + ovs_mutex_lock(&dev->mutex);
> + if (dev->mtu_request != mtu) {
> + dev->mtu_request = mtu;
> + netdev_request_reconfigure(netdev);
> + }
> + ovs_mutex_unlock(&dev->mutex);
> +
> + return 0;
> +}
> +
> +static int
> netdev_dpdk_get_carrier(const struct netdev *netdev, bool *carrier);
>
> static int
> @@ -2787,7 +2869,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
> ovs_mutex_lock(&dev->mutex);
>
> if (netdev->n_txq == dev->requested_n_txq
> - && netdev->n_rxq == dev->requested_n_rxq) {
> + && netdev->n_rxq == dev->requested_n_rxq
> + && dev->mtu == dev->mtu_request) {
> /* Reconfiguration is unnecessary */
>
> goto out;
> @@ -2795,6 +2878,14 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>
> rte_eth_dev_stop(dev->port_id);
>
> + if (dev->mtu != dev->mtu_request) {
> + dev->mtu = dev->mtu_request;
> + err = dpdk_mp_configure(dev);
> + if (err) {
> + goto out;
> + }
> + }
> +
> netdev->n_txq = dev->requested_n_txq;
> netdev->n_rxq = dev->requested_n_rxq;
>
> @@ -2802,6 +2893,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
> err = dpdk_eth_dev_init(dev);
> netdev_dpdk_alloc_txq(dev, netdev->n_txq);
>
> + netdev_change_seq_changed(netdev);
> +
> out:
>
> ovs_mutex_unlock(&dev->mutex);
> @@ -2830,20 +2923,23 @@ netdev_dpdk_vhost_user_reconfigure(struct netdev *netdev)
>
> netdev_dpdk_remap_txqs(dev);
>
> - if (dev->requested_socket_id != dev->socket_id) {
> + if (dev->requested_socket_id != dev->socket_id
> + || dev->mtu_request != dev->mtu) {
> dev->socket_id = dev->requested_socket_id;
> - /* Change mempool to new NUMA Node */
> - dpdk_mp_put(dev->dpdk_mp);
> - dev->dpdk_mp = dpdk_mp_get(dev->socket_id, dev->mtu);
> - if (!dev->dpdk_mp) {
> - err = ENOMEM;
> + dev->mtu = dev->mtu_request;
> + /* Change mempool to new NUMA Node and to new MTU. */
> + err = dpdk_mp_configure(dev);
> + if (err) {
> + goto out;
> }
> + netdev_change_seq_changed(netdev);
> }
>
> if (virtio_dev) {
> virtio_dev->flags |= VIRTIO_DEV_RUNNING;
> }
>
> +out:
> ovs_mutex_unlock(&dev->mutex);
> ovs_mutex_unlock(&dpdk_mutex);
>
> @@ -2854,6 +2950,7 @@ static int
> netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev)
> {
> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> + int err = 0;
>
> ovs_mutex_lock(&dpdk_mutex);
> ovs_mutex_lock(&dev->mutex);
> @@ -2861,10 +2958,18 @@ netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev)
> netdev->n_txq = dev->requested_n_txq;
> netdev->n_rxq = 1;
>
> + if (dev->mtu_request != dev->mtu) {
> + /* Change mempool to new MTU. */
> + err = dpdk_mp_configure(dev);
> + if (!err) {
> + netdev_change_seq_changed(netdev);
> + }
> + }
> +
> ovs_mutex_unlock(&dev->mutex);
> ovs_mutex_unlock(&dpdk_mutex);
>
> - return 0;
> + return err;
> }
>
> #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, \
> @@ -2898,7 +3003,7 @@ netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev)
> netdev_dpdk_set_etheraddr, \
> netdev_dpdk_get_etheraddr, \
> netdev_dpdk_get_mtu, \
> - NULL, /* set_mtu */ \
> + netdev_dpdk_set_mtu, \
> netdev_dpdk_get_ifindex, \
> GET_CARRIER, \
> netdev_dpdk_get_carrier_resets, \
>