[ovs-dev] [PATCH v4 12/15] netdev-dpdk: support multi-segment jumbo frames

Eelco Chaudron echaudro at redhat.com
Tue Jul 10 12:53:46 UTC 2018



On 10 Jul 2018, at 14:48, Lam, Tiago wrote:

> On 10/07/2018 13:40, Ilya Maximets wrote:
>> On 10.07.2018 15:30, Eelco Chaudron wrote:
>>>
>>>
>>> On 10 Jul 2018, at 13:06, Tiago Lam wrote:
>>>
>>>> From: Mark Kavanagh <mark.b.kavanagh at intel.com>
>>>>
>>>> Currently, jumbo frame support for OvS-DPDK is implemented by
>>>> increasing the size of mbufs within a mempool, such that each mbuf
>>>> within the pool is large enough to contain an entire jumbo frame of
>>>> a user-defined size. Typically, for each user-defined MTU,
>>>> 'requested_mtu', a new mempool is created, containing mbufs of size
>>>> ~requested_mtu.
>>>>
>>>> With the multi-segment approach, a port uses a single mempool
>>>> (containing standard/default-sized mbufs of ~2k bytes), irrespective
>>>> of the user-requested MTU value. To accommodate jumbo frames, mbufs
>>>> are chained together, where each mbuf in the chain stores a portion of
>>>> the jumbo frame. Each mbuf in the chain is termed a segment, hence the
>>>> name.
>>>>
>>>> == Enabling multi-segment mbufs ==
>>>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>>>> user must decide on which approach to adopt on init. The introduction
>>>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This
>>>> is a global boolean value, which determines how jumbo frames are
>>>> represented across all DPDK ports. In the absence of a user-supplied
>>>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>>>> mbufs must be explicitly enabled / single-segment mbufs remain the
>>>> default.
>>>>
>>>> Setting the field is identical to setting existing DPDK-specific OVSDB
>>>> fields:
>>>>
>>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>>>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>>>
>>>> Co-authored-by: Tiago Lam <tiago.lam at intel.com>
>>>>
>>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh at intel.com>
>>>> Signed-off-by: Tiago Lam <tiago.lam at intel.com>
>>>> Acked-by: Eelco Chaudron <echaudro at redhat.com>
>>>> ---
>>>>  Documentation/topics/dpdk/jumbo-frames.rst | 51 +++++++++++++++++
>>>>  NEWS                                       |  2 +
>>>>  lib/dpdk.c                                 |  8 +++
>>>>  lib/netdev-dpdk.c                          | 92 +++++++++++++++++++++++++-----
>>>>  lib/netdev-dpdk.h                          |  2 +
>>>>  vswitchd/vswitch.xml                       | 23 ++++++++
>>>>  6 files changed, 164 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/Documentation/topics/dpdk/jumbo-frames.rst b/Documentation/topics/dpdk/jumbo-frames.rst
>>>> index 00360b4..d1fb111 100644
>>>> --- a/Documentation/topics/dpdk/jumbo-frames.rst
>>>> +++ b/Documentation/topics/dpdk/jumbo-frames.rst
>>>> @@ -71,3 +71,54 @@ Jumbo frame support has been validated against 9728B frames, which is the
>>>>  largest frame size supported by Fortville NIC using the DPDK i40e driver, but
>>>>  larger frames and other DPDK NIC drivers may be supported. These cases are
>>>>  common for use cases involving East-West traffic only.
>>>> +
>>>> +-------------------
>>>> +Multi-segment mbufs
>>>> +-------------------
>>>> +
>>>> +Instead of increasing the size of mbufs within a mempool, such that each mbuf
>>>> +within the pool is large enough to contain an entire jumbo frame of a
>>>> +user-defined size, mbufs can be chained together. In this approach each
>>>> +mbuf in the chain stores a portion of the jumbo frame, by default ~2K bytes,
>>>> +irrespective of the user-requested MTU value. Since each mbuf in the chain is
>>>> +termed a segment, this approach is named "multi-segment mbufs".
>>>> +
>>>> +This approach may bring more flexibility in use cases where the maximum packet
>>>> +length may be hard to guess. For example, in cases where packets originate from
>>>> +sources marked for offload (such as TSO), each packet may be larger than the
>>>> +MTU, and as such, when forwarding it to a DPDK port a single mbuf may not be
>>>> +enough to hold all of the packet's data.
>>>> +
>>>> +Multi-segment and single-segment mbufs are mutually exclusive, and the user
>>>> +must decide on which approach to adopt on initialisation. If multi-segment
>>>> +mbufs are to be enabled, this can be done with the following command::
>>>> +
>>>> +    $ ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>>> +
>>>> +Single-segment mbufs still remain the default when using OvS-DPDK, and the
>>>> +above option `dpdk-multi-seg-mbufs` must be explicitly set to `true` if
>>>> +multi-segment mbufs are to be used.
>>>> +
>>>> +~~~~~~~~~~~~~~~~~
>>>> +Performance notes
>>>> +~~~~~~~~~~~~~~~~~
>>>> +
>>>> +When using multi-segment mbufs some PMDs may not support vectorized Tx
>>>> +functions, due to their non-contiguous nature. As a result this can hit
>>>> +performance for smaller packet sizes. For example, on a setup sending 64B
>>>> +packets at line rate, a decrease of ~20% has been observed. The performance
>>>> +impact stops being noticeable for larger packet sizes, although the exact size
>>>> +will vary between PMDs, and depending on the architecture in use.
>>>> +
>>>> +Tests performed with the i40e PMD driver only showed this limitation for 64B
>>>> +packets, and the same rate was observed when comparing multi-segment mbufs and
>>>> +single-segment mbufs for 128B packets. In other words, the 20% drop in
>>>> +performance was not observed for packets >= 128B during this test case.
>>>> +
>>>> +Because of this, using multi-segment mbufs is not advised with smaller
>>>> +packet sizes, such as 64B.
>>>> +
>>>> +Also, note that using multi-segment mbufs won't improve memory usage. For a
>>>> +packet of 9000B, for example, which would be stored on a single mbuf when using
>>>> +the single-segment approach, 5 mbufs (9000/2048) of 2048B would be needed to
>>>> +store the same data using the multi-segment mbufs approach.
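
As an aside (illustrative only, not part of the patch; the helper name below
is made up), the "5 mbufs for a 9000B packet" figure above is simply a ceiling
division of the frame length by the per-mbuf data size:

    /* Sketch: number of segments needed to store 'frame_len' bytes in
     * 'mbuf_size'-byte mbufs, i.e. a ceiling division. */
    #include <stdint.h>
    #include <stdio.h>

    static uint16_t
    nb_segs_for_frame(uint16_t frame_len, uint16_t mbuf_size)
    {
        uint16_t nb_segs = frame_len / mbuf_size;

        if (frame_len % mbuf_size) {
            nb_segs += 1;
        }
        return nb_segs;
    }

    int
    main(void)
    {
        /* A 9000B frame in ~2048B mbufs needs 5 segments. */
        printf("%u\n", nb_segs_for_frame(9000, 2048));
        return 0;
    }
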
>>>> diff --git a/NEWS b/NEWS
>>>> index 92e9b92..c7facac 100644
>>>> --- a/NEWS
>>>> +++ b/NEWS
>>>> @@ -38,6 +38,7 @@ Post-v2.9.0
>>>>       * Allow init to fail and record DPDK status/version in OVS database.
>>>>       * Add experimental flow hardware offload support
>>>>       * Support both shared and per port mempools for DPDK devices.
>>>> +     * Add support for multi-segment mbufs.
>>>>     - Userspace datapath:
>>>>       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>>>>       * Detailed PMD performance metrics available with new command
>>>> @@ -115,6 +116,7 @@ v2.9.0 - 19 Feb 2018
>>>>       * New appctl command 'dpif-netdev/pmd-rxq-rebalance' to rebalance rxq to
>>>>         pmd assignments.
>>>>       * Add rxq utilization of pmd to appctl 'dpif-netdev/pmd-rxq-show'.
>>>> +     * Add support for vHost dequeue zero copy (experimental)
>>>
>>> Guess this was added on error?
>>
>> I guess, Ian accidentally removed this line by commit
>> c3c722d2c7ee ("Documentation: document ovs-dpdk flow offload").
>>
>> But, sure, this change should not be in this patch.
>>
>
> Thanks for tracking it, Ilya (and for finding it, Eelco). I was wrapping
> my head around it myself as this was still present in v3.
>
>>>
>>>>     - Userspace datapath:
>>>>       * Output packet batching support.
>>>>     - vswitchd:
>>>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>>>> index 0ee3e19..ac89fd8 100644
>>>> --- a/lib/dpdk.c
>>>> +++ b/lib/dpdk.c
>>>> @@ -497,6 +497,14 @@ dpdk_init__(const struct smap *ovs_other_config)
>>>>
>>>>      /* Finally, register the dpdk classes */
>>>>      netdev_dpdk_register();
>>>> +
>>>> +    bool multi_seg_mbufs_enable = smap_get_bool(ovs_other_config,
>>>> +            "dpdk-multi-seg-mbufs", false);
>>>> +    if (multi_seg_mbufs_enable) {
>>>> +        VLOG_INFO("DPDK multi-segment mbufs enabled");
>>>> +        netdev_dpdk_multi_segment_mbufs_enable();
>>>> +    }
>>>> +
>>>>      return true;
>>>>  }
>>>>
>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>> index 81dd1ed..6fc95d6 100644
>>>> --- a/lib/netdev-dpdk.c
>>>> +++ b/lib/netdev-dpdk.c
>>>> @@ -70,6 +70,7 @@ enum {VIRTIO_RXQ, VIRTIO_TXQ, VIRTIO_QNUM};
>>>>
>>>>  VLOG_DEFINE_THIS_MODULE(netdev_dpdk);
>>>>  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>>> +static bool dpdk_multi_segment_mbufs = false;
>>>>
>>>>  #define DPDK_PORT_WATCHDOG_INTERVAL 5
>>>>
>>>> @@ -521,6 +522,18 @@ is_dpdk_class(const struct netdev_class *class)
>>>>             || class->destruct == netdev_dpdk_vhost_destruct;
>>>>  }
>>>>
>>>> +bool
>>>> +netdev_dpdk_is_multi_segment_mbufs_enabled(void)
>>>> +{
>>>> +    return dpdk_multi_segment_mbufs;
>>>> +}
>>>> +
>>>> +void
>>>> +netdev_dpdk_multi_segment_mbufs_enable(void)
>>>> +{
>>>> +    dpdk_multi_segment_mbufs = true;
>>>> +}
>>>> +
>>>>  /* DPDK NIC drivers allocate RX buffers at a particular granularity, typically
>>>>   * aligned at 1k or less. If a declared mbuf size is not a multiple of this
>>>>   * value, insufficient buffers are allocated to accomodate the packet in its
>>>> @@ -628,14 +641,19 @@ dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>>>>      }
>>>>  }
>>>>
>>>> -/* Calculating the required number of mbufs differs depending on the
>>>> - * mempool model being used. Check if per port memory is in use before
>>>> - * calculating.
>>>> - */
>>>> +/* Calculating the required number of mbufs differs depending on the mempool
>>>> + * model (per port vs shared mempools) being used.
>>>> + * In case multi-segment mbufs are being used, the number of mbufs is also
>>>> + * increased when using the per port model, to account for the multiple mbufs
>>>> + * needed to hold each packet's data. This doesn't happen for the shared
>>>> + * mempool model since the number of mbufs assigned should already be
>>>> + * sufficient (MAX_NB_MBUF). */
>>>>  static uint32_t
>>>> -dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size,
>>>> +                     bool per_port_mp)
>>>>  {
>>>>      uint32_t n_mbufs;
>>>> +    uint16_t max_frame_len = 0;
>>>>
>>>>      if (!per_port_mp) {
>>>>          /* Shared memory are being used.
>>>> @@ -662,6 +680,22 @@ dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>>>                    + dev->requested_n_txq * dev->requested_txq_size
>>>>                    + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>>>>                    + MIN_NB_MBUF;
>>>> +
>>>> +        /* If multi-segment mbufs are used, we also increase the number of
>>>> +         * mbufs used. This is done by calculating how many mbufs are
>>>> +         * needed to hold the data of a single packet of MTU size. For
>>>> +         * example, for a received packet of 9000B, 5 mbufs (9000 / 2048)
>>>> +         * are needed to hold the data - 4 more than with single-segment
>>>> +         * mbufs (where the mbuf size is extended to hold all the data). */
>>>> +        max_frame_len = MTU_TO_MAX_FRAME_LEN(mtu);
>>>> +        if (dpdk_multi_segment_mbufs && mbuf_size < max_frame_len) {
>>>> +            uint16_t nb_segs = max_frame_len / mbuf_size;
>>>> +            if (max_frame_len % mbuf_size) {
>>>> +                nb_segs += 1;
>>>> +            }
>>>> +
>>>> +            n_mbufs *= nb_segs;
>>>> +        }
>>>>      }
>>>>
>>>>      return n_mbufs;
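
(Again purely as an illustration, not part of the patch and with made-up
values: the per-port estimate above simply gets multiplied by the segment
count once multi-segment mbufs are enabled.)

    /* Sketch of the scaling above for a hypothetical port: 2 rxqs/txqs of
     * 2048 descriptors each, MTU 9000, ~2048B mbufs. The constants are
     * stand-ins, not the real OVS values. */
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_BURST   32      /* stand-in for NETDEV_MAX_BURST */
    #define MIN_MBUFS   16384   /* stand-in for MIN_NB_MBUF */

    int
    main(void)
    {
        uint32_t n_rxq = 2, n_txq = 2, rxq_size = 2048, txq_size = 2048;
        uint32_t n_mbufs = n_rxq * rxq_size
                           + n_txq * txq_size
                           + n_rxq * MAX_BURST
                           + MIN_MBUFS;

        uint16_t max_frame_len = 9000 + 18;  /* MTU plus Ethernet overhead */
        uint16_t mbuf_size = 2048;

        if (mbuf_size < max_frame_len) {
            uint16_t nb_segs = max_frame_len / mbuf_size;

            if (max_frame_len % mbuf_size) {
                nb_segs += 1;
            }
            n_mbufs *= nb_segs;  /* 5x the single-segment estimate here */
        }
        printf("estimated mbufs: %u\n", n_mbufs);
        return 0;
    }
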
>>>> @@ -688,7 +722,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, uint32_t mbuf_size,
>>>>      dmp->mtu = mtu;
>>>>      dmp->refcount = 1;
>>>>
>>>> -    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>>>> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, mbuf_size, per_port_mp);
>>>>
>>>>      do {
>>>>          /* Full DPDK memory pool name must be unique and cannot be
>>>> @@ -843,16 +877,22 @@ dpdk_mp_put(struct dpdk_mp *dmp)
>>>>      ovs_mutex_unlock(&dpdk_mp_mutex);
>>>>  }
>>>>
>>>> -/* Depending on the memory model being used this function tries to
>>>> - * identify and reuse an existing mempool or tries to allocate a new
>>>> - * mempool on requested_socket_id with mbuf size corresponding to the
>>>> - * requested_mtu. On success, a new configuration will be applied.
>>>> - * On error, device will be left unchanged. */
>>>> +/* Depending on the memory model (per port vs shared mempools) being used this
>>>> + * function tries to identify and reuse an existing mempool or tries to
>>>> + * allocate a new mempool on requested_socket_id, with mbuf size set to the
>>>> + * following, depending on whether 'dpdk_multi_segment_mbufs' is enabled:
>>>> + * - if 'true', then the mempool contains standard-sized mbufs that are chained
>>>> + *   together to accommodate packets of size 'requested_mtu'.
>>>> + * - if 'false', then the members of the allocated mempool are
>>>> + *   non-standard-sized mbufs. Each mbuf in the mempool is large enough to
>>>> + *   fully accommodate packets of size 'requested_mtu'.
>>>> + * On success, a new configuration will be applied. On error, device will be
>>>> + * left unchanged. */
>>>>  static int
>>>>  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>>>      OVS_REQUIRES(dev->mutex)
>>>>  {
>>>> -    uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>>>> +    uint16_t buf_size = 0;
>>>>      struct dpdk_mp *dmp;
>>>>      int ret = 0;
>>>>      bool per_port_mp = dpdk_per_port_memory();
>>>> @@ -865,6 +905,16 @@ netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>>>          return ret;
>>>>      }
>>>>
>>>> +    dpdk_mp_sweep();
>>>
>>> Is this sweep needed, as it's part of dpdk_mp_get()?
>>>
>
> A blatant leftover from the rebase. Thanks, Eelco.
>
> I'll wait for Ian's reply in patch 01/15 and take this into account 
> for v5.
>
> Tiago.

I’m on PTO starting tomorrow evening for 3 weeks, so I might not have a
chance to review your v5.

So if you are just fixing this, keep my ack :)


