[ovs-dev] [PATCH v4 12/15] netdev-dpdk: support multi-segment jumbo frames

Ilya Maximets i.maximets at samsung.com
Tue Jul 10 12:40:37 UTC 2018


On 10.07.2018 15:30, Eelco Chaudron wrote:
> 
> 
> On 10 Jul 2018, at 13:06, Tiago Lam wrote:
> 
>> From: Mark Kavanagh <mark.b.kavanagh at intel.com>
>>
>> Currently, jumbo frame support for OvS-DPDK is implemented by
>> increasing the size of mbufs within a mempool, such that each mbuf
>> within the pool is large enough to contain an entire jumbo frame of
>> a user-defined size. Typically, for each user-defined MTU,
>> 'requested_mtu', a new mempool is created, containing mbufs of size
>> ~requested_mtu.
>>
>> With the multi-segment approach, a port uses a single mempool,
>> (containing standard/default-sized mbufs of ~2k bytes), irrespective
>> of the user-requested MTU value. To accommodate jumbo frames, mbufs
>> are chained together, where each mbuf in the chain stores a portion of
>> the jumbo frame. Each mbuf in the chain is termed a segment, hence the
>> name.
>>
>> == Enabling multi-segment mbufs ==
>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>> user must decide on which approach to adopt on init. The introduction
>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
>> is a global boolean value, which determines how jumbo frames are
>> represented across all DPDK ports. In the absence of a user-supplied
>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>> mbufs must be explicitly enabled / single-segment mbufs remain the
>> default.
>>
>> Setting the field is identical to setting existing DPDK-specific OVSDB
>> fields:
>>
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>
>> Co-authored-by: Tiago Lam <tiago.lam at intel.com>
>>
>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh at intel.com>
>> Signed-off-by: Tiago Lam <tiago.lam at intel.com>
>> Acked-by: Eelco Chaudron <echaudro at redhat.com>
>> ---
>>  Documentation/topics/dpdk/jumbo-frames.rst | 51 +++++++++++++++++
>>  NEWS                                       |  2 +
>>  lib/dpdk.c                                 |  8 +++
>>  lib/netdev-dpdk.c                          | 92 +++++++++++++++++++++++++-----
>>  lib/netdev-dpdk.h                          |  2 +
>>  vswitchd/vswitch.xml                       | 23 ++++++++
>>  6 files changed, 164 insertions(+), 14 deletions(-)
>>
>> diff --git a/Documentation/topics/dpdk/jumbo-frames.rst b/Documentation/topics/dpdk/jumbo-frames.rst
>> index 00360b4..d1fb111 100644
>> --- a/Documentation/topics/dpdk/jumbo-frames.rst
>> +++ b/Documentation/topics/dpdk/jumbo-frames.rst
>> @@ -71,3 +71,54 @@ Jumbo frame support has been validated against 9728B frames, which is the
>>  largest frame size supported by Fortville NIC using the DPDK i40e driver, but
>>  larger frames and other DPDK NIC drivers may be supported. These cases are
>>  common for use cases involving East-West traffic only.
>> +
>> +-------------------
>> +Multi-segment mbufs
>> +-------------------
>> +
>> +Instead of increasing the size of mbufs within a mempool, such that each mbuf
>> +within the pool is large enough to contain an entire jumbo frame of a
>> +user-defined size, mbufs can be chained together. In this approach each
>> +mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
>> +irrespective of the user-requested MTU value. Since each mbuf in the chain is
>> +termed a segment, this approach is named "multi-segment mbufs".
>> +
>> +This approach may bring more flexibility in use cases where the maximum packet
>> +length is hard to predict. For example, in cases where packets originate from
>> +sources marked for offload (such as TSO), each packet may be larger than the
>> +MTU, and as such, when forwarding it to a DPDK port, a single mbuf may not be
>> +enough to hold all of the packet's data.
>> +
>> +Multi-segment and single-segment mbufs are mutually exclusive, and the user
>> +must decide on which approach to adopt on initialisation. If multi-segment
>> +mbufs are to be enabled, this can be done with the following command::
>> +
>> +    $ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>> +
>> +Single-segment mbufs still remain the default when using OvS-DPDK, and the
>> +above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
>> +multi-segment mbufs are to be used.
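>> +
>> +The value currently in effect can be read back using the standard
>> +``ovs-vsctl get`` syntax::
>> +
>> +    $ ovs-vsctl get Open_vSwitch . other_config:dpdk-multi-seg-mbufs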
>> +
>> +~~~~~~~~~~~~~~~~~
>> +Performance notes
>> +~~~~~~~~~~~~~~~~~
>> +
>> +When using multi-segment mbufs some PMDs may not support vectorized Tx
>> +functions, due to the non-contiguous nature of the mbuf chains. As a result
>> +this can hurt performance for smaller packet sizes. For example, on a setup
>> +sending 64B packets at line rate, a decrease of ~20% has been observed. The
>> +performance impact stops being noticeable for larger packet sizes, although
>> +the exact size will vary between PMDs and the architecture in use.
>> +
>> +Tests performed with the i40e PMD driver only showed this limitation for 64B
>> +packets, and the same rate was observed when comparing multi-segment and
>> +single-segment mbufs for 128B packets. In other words, the 20% drop in
>> +performance was not observed for packets >= 128B during this test case.
>> +
>> +Because of this, using multi-segment mbufs with smaller packet sizes, such as
>> +64B, is not advised.
>> +
>> +Also, note that using multi-segment mbufs won't improve memory usage. For a
>> +packet of 9000B, for example, which would be stored in a single mbuf when
>> +using the single-segment approach, 5 mbufs of 2048B each (9000/2048, rounded
>> +up) would be needed to store the same data using the multi-segment mbufs
>> +approach.
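>> +
>> +As an illustrative sketch, the per-packet segment count follows a simple
>> +ceiling division; the snippet below mirrors the calculation performed by
>> +``dpdk_calculate_mbufs()`` in ``lib/netdev-dpdk.c``::
>> +
>> +    /* Number of standard-sized (~2KB) mbufs needed for one max-sized frame,
>> +     * rounding up for any remainder, e.g. 9000B / 2048B -> 5 segments. */
>> +    uint16_t nb_segs = max_frame_len / mbuf_size;
>> +    if (max_frame_len % mbuf_size) {
>> +        nb_segs += 1;
>> +    }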
>> diff --git a/NEWS b/NEWS
>> index 92e9b92..c7facac 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -38,6 +38,7 @@ Post-v2.9.0
>>       * Allow init to fail and record DPDK status/version in OVS database.
>>       * Add experimental flow hardware offload support
>>       * Support both shared and per port mempools for DPDK devices.
>> +     * Add support for multi-segment mbufs.
>>     - Userspace datapath:
>>       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>>       * Detailed PMD performance metrics available with new command
>> @@ -115,6 +116,7 @@ v2.9.0 - 19 Feb 2018
>>       * New appctl command 'dpif-netdev/pmd-rxq-rebalance' to rebalance rxq to
>>         pmd assignments.
>>       * Add rxq utilization of pmd to appctl 'dpif-netdev/pmd-rxq-show'.
>> +     * Add support for vHost dequeue zero copy (experimental)
> 
> Guess this was added in error?

I guess Ian accidentally removed this line in commit
c3c722d2c7ee ("Documentation: document ovs-dpdk flow offload").

But, sure, this change should not be in this patch.

> 
>>     - Userspace datapath:
>>       * Output packet batching support.
>>     - vswitchd:
>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>> index 0ee3e19..ac89fd8 100644
>> --- a/lib/dpdk.c
>> +++ b/lib/dpdk.c
>> @@ -497,6 +497,14 @@ dpdk_init__(const struct smap *ovs_other_config)
>>
>>      /* Finally, register the dpdk classes */
>>      netdev_dpdk_register();
>> +
>> +    bool multi_seg_mbufs_enable = smap_get_bool(ovs_other_config,
>> +            "dpdk-multi-seg-mbufs", false);
>> +    if (multi_seg_mbufs_enable) {
>> +        VLOG_INFO("DPDK multi-segment mbufs enabled");
>> +        netdev_dpdk_multi_segment_mbufs_enable();
>> +    }
>> +
>>      return true;
>>  }
>>
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 81dd1ed..6fc95d6 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -70,6 +70,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
>>
>>  VLOG_DEFINE_THIS_MODULE(netdev_dpdk);
>>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>> +static bool dpdk_multi_segment_mbufs = false;
>>
>>  #define DPDK_PORT_WATCHDOG_INTERVAL 5
>>
>> @@ -521,6 +522,18 @@ is_dpdk_class(const struct netdev_class *class)
>>             || class->destruct == netdev_dpdk_vhost_destruct;
>>  }
>>
>> +bool
>> +netdev_dpdk_is_multi_segment_mbufs_enabled(void)
>> +{
>> +    return dpdk_multi_segment_mbufs;
>> +}
>> +
>> +void
>> +netdev_dpdk_multi_segment_mbufs_enable(void)
>> +{
>> +    dpdk_multi_segment_mbufs = true;
>> +}
>> +
>>  /* DPDK NIC drivers allocate RX buffers at a particular granularity, typically
>>   * aligned at 1k or less. If a declared mbuf size is not a multiple of this
>>   * value, insufficient buffers are allocated to accomodate the packet in its
>> @@ -628,14 +641,19 @@ dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>>      }
>>  }
>>
>> -/* Calculating the required number of mbufs differs depending on the
>> - * mempool model being used. Check if per port memory is in use before
>> - * calculating.
>> - */
>> +/* Calculating the required number of mbufs differs depending on the mempool
>> + * model (per port vs shared mempools) being used.
>> + * In case multi-segment mbufs are being used, the number of mbufs is also
>> + * increased when using the per port model, to account for the multiple mbufs
>> + * needed to hold each packet's data. This doesn't happen for the shared
>> + * mempool model, since the number of mbufs assigned (MAX_NB_MBUF) should
>> + * already be sufficient. */
>>  static uint32_t
>> -dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size,
>> +                     bool per_port_mp)
>>  {
>>      uint32_t n_mbufs;
>> +    uint16_t max_frame_len = 0;
>>
>>      if (!per_port_mp) {
>>          /* Shared memory are being used.
>> @@ -662,6 +680,22 @@ dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>                    + dev->requested_n_txq * dev->requested_txq_size
>>                    + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>>                    + MIN_NB_MBUF;
>> +
>> +        /* If multi-segment mbufs are used, we also increase the number of
>> +         * mbufs used. This is done by calculating how many mbufs are needed to
>> +         * hold the data of a single packet of MTU size. For example, for a
>> +         * received packet of 9000B, 5 mbufs (9000 / 2048, rounded up) are
>> +         * needed to hold the data - 4 more than with single-segment mbufs,
>> +         * whose size is extended to hold all of the data. */
>> +        max_frame_len = MTU_TO_MAX_FRAME_LEN(mtu);
>> +        if (dpdk_multi_segment_mbufs && mbuf_size < max_frame_len) {
>> +            uint16_t nb_segs = max_frame_len / mbuf_size;
>> +            if (max_frame_len % mbuf_size) {
>> +                nb_segs += 1;
>> +            }
>> +
>> +            n_mbufs *= nb_segs;
>> +        }
>>      }
>>
>>      return n_mbufs;
>> @@ -688,7 +722,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size,
>>      dmp->mtu = mtu;
>>      dmp->refcount = 1;
>>
>> -    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, mbuf_size, per_port_mp);
>>
>>      do {
>>          /* Full DPDK memory pool name must be unique and cannot be
>> @@ -843,16 +877,22 @@ dpdk_mp_put(struct dpdk_mp *dmp)
>>      ovs_mutex_unlock(&dpdk_mp_mutex);
>>  }
>>
>> -/* Depending on the memory model being used this function tries to
>> - * identify and reuse an existing mempool or tries to allocate a new
>> - * mempool on requested_socket_id with mbuf size corresponding to the
>> - * requested_mtu. On success, a new configuration will be applied.
>> - * On error, device will be left unchanged. */
>> +/* Depending on the memory model (per port vs shared mempools) being used, this
>> + * function tries to identify and reuse an existing mempool or tries to
>> + * allocate a new mempool on requested_socket_id, with the mbuf size set as
>> + * follows, depending on whether 'dpdk_multi_segment_mbufs' is enabled:
>> + * - if 'true', then the mempool contains standard-sized mbufs that are chained
>> + *   together to accommodate packets of size 'requested_mtu'.
>> + * - if 'false', then the members of the allocated mempool are
>> + *   non-standard-sized mbufs. Each mbuf in the mempool is large enough to
>> + *   fully accommodate packets of size 'requested_mtu'.
>> + * On success, a new configuration will be applied. On error, the device will
>> + * be left unchanged. */
>>  static int
>>  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>      OVS_REQUIRES(dev->mutex)
>>  {
>> -    uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>> +    uint16_t buf_size = 0;
>>      struct dpdk_mp *dmp;
>>      int ret = 0;
>>      bool per_port_mp = dpdk_per_port_memory();
>> @@ -865,6 +905,16 @@ netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>          return ret;
>>      }
>>
>> +    dpdk_mp_sweep();
> 
> Is this sweep needed, as it's part of dpdk_mp_get()?
> 
>> +
>> +    /* Contiguous mbufs in use - permit oversized mbufs */
>> +    if (!dpdk_multi_segment_mbufs) {
>> +        buf_size = dpdk_buf_size(dev->requested_mtu);
>> +    } else {
>> +        /* multi-segment mbufs - use standard mbuf size */
>> +        buf_size = dpdk_buf_size(ETHER_MTU);
>> +    }
>> +
>>      dmp = dpdk_mp_get(dev, dev->requested_mtu, buf_size, per_port_mp);
>>      if (!dmp) {
>>          VLOG_ERR("Failed to create memory pool for netdev "
>> @@ -949,6 +999,7 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>      int diag = 0;
>>      int i;
>>      struct rte_eth_conf conf = port_conf;
>> +    struct rte_eth_txconf txconf;
>>      struct rte_eth_dev_info info;
>>      uint16_t conf_mtu;
>>
>> @@ -965,6 +1016,18 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>          }
>>      }
>>
>> +    /* Multi-segment-mbuf-specific setup. */
>> +    if (dpdk_multi_segment_mbufs) {
>> +        /* DPDK PMDs typically attempt to use simple or vectorized
>> +         * transmit functions, neither of which are compatible with
>> +         * multi-segment mbufs. Ensure that these are disabled when
>> +         * multi-segment mbufs are enabled.
>> +         */
>> +        rte_eth_dev_info_get(dev->port_id, &info);
>> +        txconf = info.default_txconf;
>> +        txconf.txq_flags &= ~ETH_TXQ_FLAGS_NOMULTSEGS;
>> +    }
>> +
>>      conf.intr_conf.lsc = dev->lsc_interrupt_mode;
>>      conf.rxmode.hw_ip_checksum = (dev->hw_ol_features &
>>                                    NETDEV_RX_CHECKSUM_OFFLOAD) != 0;
>> @@ -1009,7 +1072,9 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>
>>          for (i = 0; i < n_txq; i++) {
>>              diag = rte_eth_tx_queue_setup(dev->port_id, i, dev->txq_size,
>> -                                          dev->socket_id, NULL);
>> +                                          dev->socket_id,
>> +                                          dpdk_multi_segment_mbufs ? &txconf
>> +                                                                   : NULL);
>>              if (diag) {
>>                  VLOG_INFO("Interface %s unable to setup txq(%d): %s",
>>                            dev->up.name, i, rte_strerror(-diag));
>> @@ -4100,7 +4165,6 @@ unlock:
>>      return err;
>>  }
>>
>> -
>>  /* Find rte_flow with @ufid */
>>  static struct rte_flow *
>>  ufid_to_rte_flow_find(const ovs_u128 *ufid) {
>> diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
>> index b7d02a7..19aa5c6 100644
>> --- a/lib/netdev-dpdk.h
>> +++ b/lib/netdev-dpdk.h
>> @@ -25,6 +25,8 @@ struct dp_packet;
>>
>>  #ifdef DPDK_NETDEV
>>
>> +bool netdev_dpdk_is_multi_segment_mbufs_enabled(void);
>> +void netdev_dpdk_multi_segment_mbufs_enable(void);
>>  void netdev_dpdk_register(void);
>>  void free_dpdk_buf(struct dp_packet *);
>>
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index 63a3a2e..b77849e 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -332,6 +332,29 @@
>>          </p>
>>        </column>
>>
>> +      <column name="other_config" key="dpdk-multi-seg-mbufs"
>> +              type='{"type": "boolean"}'>
>> +        <p>
>> +          Specifies if DPDK uses multi-segment mbufs for handling jumbo frames.
>> +        </p>
>> +        <p>
>> +          If true, DPDK allocates a single mempool per port, irrespective of
>> +          the ports' requested MTU sizes. The elements of this mempool are
>> +          'standard'-sized mbufs (typically ~2KB), which may be chained
>> +          together to accommodate jumbo frames. In this approach, each mbuf
>> +          typically stores a fragment of the overall jumbo frame.
>> +        </p>
>> +        <p>
>> +          If not specified, defaults to <code>false</code>, in which case the
>> +          size of each mbuf within a DPDK port's mempool will be grown to
>> +          accommodate jumbo frames within a single mbuf.
>> +        </p>
>> +        <p>
>> +          Changing this value requires restarting the daemon.
>> +        </p>
>> +      </column>
>> +
>>        <column name="other_config" key="vhost-sock-dir"
>>                type='{"type": "string"}'>
>>          <p>
>> -- 
>> 2.7.4
> 
> 
> 

