[ovs-dev] [PATCH v4 12/15] netdev-dpdk: support multi-segment jumbo frames

Lam, Tiago tiago.lam at intel.com
Tue Jul 10 12:48:38 UTC 2018


On 10/07/2018 13:40, Ilya Maximets wrote:
> On 10.07.2018 15:30, Eelco Chaudron wrote:
>>
>>
>> On 10 Jul 2018, at 13:06, Tiago Lam wrote:
>>
>>> From: Mark Kavanagh <mark.b.kavanagh at intel.com>
>>>
>>> Currently, jumbo frame support for OvS-DPDK is implemented by
>>> increasing the size of mbufs within a mempool, such that each mbuf
>>> within the pool is large enough to contain an entire jumbo frame of
>>> a user-defined size. Typically, for each user-defined MTU,
>>> 'requested_mtu', a new mempool is created, containing mbufs of size
>>> ~requested_mtu.
>>>
>>> With the multi-segment approach, a port uses a single mempool,
>>> (containing standard/default-sized mbufs of ~2k bytes), irrespective
>>> of the user-requested MTU value. To accommodate jumbo frames, mbufs
>>> are chained together, where each mbuf in the chain stores a portion of
>>> the jumbo frame. Each mbuf in the chain is termed a segment, hence the
>>> name.
>>>
>>> == Enabling multi-segment mbufs ==
>>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>>> user must decide on which approach to adopt on init. The introduction
>>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
>>> is a global boolean value, which determines how jumbo frames are
>>> represented across all DPDK ports. In the absence of a user-supplied
>>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>>> mbufs must be explicitly enabled / single-segment mbufs remain the
>>> default.
>>>
>>> Setting the field is identical to setting existing DPDK-specific OVSDB
>>> fields:
>>>
>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>>
>>> Co-authored-by: Tiago Lam <tiago.lam at intel.com>
>>>
>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh at intel.com>
>>> Signed-off-by: Tiago Lam <tiago.lam at intel.com>
>>> Acked-by: Eelco Chaudron <echaudro at redhat.com>
>>> ---
>>>  Documentation/topics/dpdk/jumbo-frames.rst | 51 +++++++++++++++++
>>>  NEWS                                       |  2 +
>>>  lib/dpdk.c                                 |  8 +++
>>>  lib/netdev-dpdk.c                          | 92 +++++++++++++++++++++++++-----
>>>  lib/netdev-dpdk.h                          |  2 +
>>>  vswitchd/vswitch.xml                       | 23 ++++++++
>>>  6 files changed, 164 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/Documentation/topics/dpdk/jumbo-frames.rst b/Documentation/topics/dpdk/jumbo-frames.rst
>>> index 00360b4..d1fb111 100644
>>> --- a/Documentation/topics/dpdk/jumbo-frames.rst
>>> +++ b/Documentation/topics/dpdk/jumbo-frames.rst
>>> @@ -71,3 +71,54 @@ Jumbo frame support has been validated against 9728B frames, which is the
>>>  largest frame size supported by Fortville NIC using the DPDK i40e driver, but
>>>  larger frames and other DPDK NIC drivers may be supported. These cases are
>>>  common for use cases involving East-West traffic only.
>>> +
>>> +-------------------
>>> +Multi-segment mbufs
>>> +-------------------
>>> +
>>> +Instead of increasing the size of mbufs within a mempool, such that each mbuf
>>> +within the pool is large enough to contain an entire jumbo frame of a
>>> +user-defined size, mbufs can instead be chained together. In this approach each
>>> +mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
>>> +irrespective of the user-requested MTU value. Since each mbuf in the chain is
>>> +termed a segment, this approach is named "multi-segment mbufs".
>>> +
>>> +This approach may bring more flexibility in use cases where the maximum packet
>>> +length may be hard to guess. For example, in cases where packets originate from
>>> +sources marked for offload (such as TSO), each packet may be larger than the
>>> +MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
>>> +enough to hold all of the packet's data.
>>> +
>>> +Multi-segment and single-segment mbufs are mutually exclusive, and the user
>>> +must decide on which approach to adopt on initialisation. Multi-segment
>>> +mbufs can be enabled with the following command::
>>> +
>>> +    $ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>> +
>>> +Single-segment mbufs remain the default when using OvS-DPDK, and the
>>> +above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
>>> +multi-segment mbufs are to be used.
>>> +
>>> +~~~~~~~~~~~~~~~~~
>>> +Performance notes
>>> +~~~~~~~~~~~~~~~~~
>>> +
>>> +When using multi-segment mbufs some PMDs may not support vectorized Tx
>>> +functions, due to the non-contiguous nature of the mbuf chains. As a result
>>> +this can hurt performance for smaller packet sizes. For example, on a setup
>>> +sending 64B packets at line rate, a decrease of ~20% has been observed. The
>>> +performance impact stops being noticeable for larger packet sizes, although
>>> +the exact size will vary between PMDs and depend on the architecture in use.
>>> +
>>> +Tests performed with the i40e PMD driver only showed this limitation for 64B
>>> +packets, and the same rate was observed when comparing multi-segment mbufs and
>>> +single-segment mbufs for 128B packets. In other words, the 20% drop in
>>> +performance was not observed for packets >= 128B during this test case.
>>> +
>>> +Because of this, the use of multi-segment mbufs is not advised for smaller
>>> +packet sizes, such as 64B.
>>> +
>>> +Also, note that using multi-segment mbufs won't improve memory usage. For a
>>> +packet of 9000B, for example, which would be stored in a single mbuf when
>>> +using the single-segment approach, 5 mbufs of 2048B (9000/2048, rounded up)
>>> +would be needed to store the same data using the multi-segment mbufs approach.
>>> diff --git a/NEWS b/NEWS
>>> index 92e9b92..c7facac 100644
>>> --- a/NEWS
>>> +++ b/NEWS
>>> @@ -38,6 +38,7 @@ Post-v2.9.0
>>>       * Allow init to fail and record DPDK status/version in OVS database.
>>>       * Add experimental flow hardware offload support
>>>       * Support both shared and per port mempools for DPDK devices.
>>> +     * Add support for multi-segment mbufs.
>>>     - Userspace datapath:
>>>       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>>>       * Detailed PMD performance metrics available with new command
>>> @@ -115,6 +116,7 @@ v2.9.0 - 19 Feb 2018
>>>       * New appctl command 'dpif-netdev/pmd-rxq-rebalance' to rebalance rxq to
>>>         pmd assignments.
>>>       * Add rxq utilization of pmd to appctl 'dpif-netdev/pmd-rxq-show'.
>>> +     * Add support for vHost dequeue zero copy (experimental)
>>
>> Guess this was added on error?
> 
> I guess, Ian accidentally removed this line by commit
> c3c722d2c7ee ("Documentation: document ovs-dpdk flow offload").
> 
> But, sure, this change should not be in this patch.
> 

Thanks for tracking it, Ilya (and for finding it, Eelco). I was wrapping
my head around it myself as this was still present in v3.

>>
>>>     - Userspace datapath:
>>>       * Output packet batching support.
>>>     - vswitchd:
>>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>>> index 0ee3e19..ac89fd8 100644
>>> --- a/lib/dpdk.c
>>> +++ b/lib/dpdk.c
>>> @@ -497,6 +497,14 @@ dpdk_init__(const struct smap *ovs_other_config)
>>>
>>>      /* Finally, register the dpdk classes */
>>>      netdev_dpdk_register();
>>> +
>>> +    bool multi_seg_mbufs_enable = smap_get_bool(ovs_other_config,
>>> +            "dpdk-multi-seg-mbufs", false);
>>> +    if (multi_seg_mbufs_enable) {
>>> +        VLOG_INFO("DPDK multi-segment mbufs enabled");
>>> +        netdev_dpdk_multi_segment_mbufs_enable();
>>> +    }
>>> +
>>>      return true;
>>>  }
>>>
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index 81dd1ed..6fc95d6 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -70,6 +70,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
>>>
>>>  VLOG_DEFINE_THIS_MODULE(netdev_dpdk);
>>>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>> +static bool dpdk_multi_segment_mbufs = false;
>>>
>>>  #define DPDK_PORT_WATCHDOG_INTERVAL 5
>>>
>>> @@ -521,6 +522,18 @@ is_dpdk_class(const struct netdev_class *class)
>>>             || class->destruct == netdev_dpdk_vhost_destruct;
>>>  }
>>>
>>> +bool
>>> +netdev_dpdk_is_multi_segment_mbufs_enabled(void)
>>> +{
>>> +    return dpdk_multi_segment_mbufs;
>>> +}
>>> +
>>> +void
>>> +netdev_dpdk_multi_segment_mbufs_enable(void)
>>> +{
>>> +    dpdk_multi_segment_mbufs = true;
>>> +}
>>> +
>>>  /* DPDK NIC drivers allocate RX buffers at a particular granularity, typically
>>>   * aligned at 1k or less. If a declared mbuf size is not a multiple of this
>>>   * value, insufficient buffers are allocated to accomodate the packet in its
>>> @@ -628,14 +641,19 @@ dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>>>      }
>>>  }
>>>
>>> -/* Calculating the required number of mbufs differs depending on the
>>> - * mempool model being used. Check if per port memory is in use before
>>> - * calculating.
>>> - */
>>> +/* Calculating the required number of mbufs differs depending on the mempool
>>> + * model (per port vs shared mempools) being used.
>>> + * In case multi-segment mbufs are being used, the number of mbufs is also
>>> + * increased when using the per port model, to account for the multiple mbufs
>>> + * needed to hold each packet's data. This doesn't happen for the shared
>>> + * model, since the number of mbufs assigned should already be sufficient
>>> + * (MAX_NB_MBUF). */
>>>  static uint32_t
>>> -dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size,
>>> +                     bool per_port_mp)
>>>  {
>>>      uint32_t n_mbufs;
>>> +    uint16_t max_frame_len = 0;
>>>
>>>      if (!per_port_mp) {
>>>          /* Shared memory are being used.
>>> @@ -662,6 +680,22 @@ dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>>                    + dev->requested_n_txq * dev->requested_txq_size
>>>                    + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>>>                    + MIN_NB_MBUF;
>>> +
>>> +        /* If multi-segment mbufs are used, we also increase the number of
>>> +         * mbufs used. This is done by calculating how many mbufs are needed
>>> +         * to hold the data of a single packet of MTU size. For example, for
>>> +         * a received packet of 9000B, 5 mbufs (9000 / 2048, rounded up) are
>>> +         * needed to hold the data - 4 more than with single-segment mbufs,
>>> +         * whose size is extended to hold all the data. */
>>> +        max_frame_len = MTU_TO_MAX_FRAME_LEN(mtu);
>>> +        if (dpdk_multi_segment_mbufs && mbuf_size < max_frame_len) {
>>> +            uint16_t nb_segs = max_frame_len / mbuf_size;
>>> +            if (max_frame_len % mbuf_size) {
>>> +                nb_segs += 1;
>>> +            }
>>> +
>>> +            n_mbufs *= nb_segs;
>>> +        }
>>>      }
>>>
>>>      return n_mbufs;
>>> @@ -688,7 +722,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size,
>>>      dmp->mtu = mtu;
>>>      dmp->refcount = 1;
>>>
>>> -    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>>> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, mbuf_size, per_port_mp);
>>>
>>>      do {
>>>          /* Full DPDK memory pool name must be unique and cannot be
>>> @@ -843,16 +877,22 @@ dpdk_mp_put(struct dpdk_mp *dmp)
>>>      ovs_mutex_unlock(&dpdk_mp_mutex);
>>>  }
>>>
>>> -/* Depending on the memory model being used this function tries to
>>> - * identify and reuse an existing mempool or tries to allocate a new
>>> - * mempool on requested_socket_id with mbuf size corresponding to the
>>> - * requested_mtu. On success, a new configuration will be applied.
>>> - * On error, device will be left unchanged. */
>>> +/* Depending on the memory model (per port vs shared mempools) being used,
>>> + * this function tries to identify and reuse an existing mempool or tries to
>>> + * allocate a new mempool on requested_socket_id, with mbuf size set as
>>> + * follows, depending on whether 'dpdk_multi_segment_mbufs' is enabled:
>>> + * - if 'true', then the mempool contains standard-sized mbufs that are chained
>>> + *   together to accommodate packets of size 'requested_mtu'.
>>> + * - if 'false', then the members of the allocated mempool are
>>> + *   non-standard-sized mbufs. Each mbuf in the mempool is large enough to
>>> + *   fully accommodate packets of size 'requested_mtu'.
>>> + * On success, a new configuration will be applied. On error, the device will
>>> + * be left unchanged. */
>>>  static int
>>>  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>>      OVS_REQUIRES(dev->mutex)
>>>  {
>>> -    uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>>> +    uint16_t buf_size = 0;
>>>      struct dpdk_mp *dmp;
>>>      int ret = 0;
>>>      bool per_port_mp = dpdk_per_port_memory();
>>> @@ -865,6 +905,16 @@ netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>>          return ret;
>>>      }
>>>
>>> +    dpdk_mp_sweep();
>>
>> Is this sweep needed, as its part of dpdk_mp_get()
>>

A blatant leftover from the rebase. Thanks, Eelco.

I'll wait for Ian's reply in patch 01/15 and take this into account for v5.

Tiago.

