[ovs-dev] [PATCH v10 01/14] netdev-dpdk: fix mbuf sizing

Lam, Tiago tiago.lam at intel.com
Fri Oct 5 14:50:36 UTC 2018


On 03/10/2018 19:25, Flavio Leitner wrote:
> 
> Hi Tiago,
> 
> Thanks for working on this. This is my first pass trying to wrap my head
> around the whole thing, so I may have missed something.
> 
> See below.
> 
> On Fri, Sep 28, 2018 at 05:15:02PM +0100, Tiago Lam wrote:
>> From: Mark Kavanagh <mark.b.kavanagh at intel.com>
>>
>> There are numerous factors that must be considered when calculating
>> the size of an mbuf:
>> - the data portion of the mbuf must be sized in accordance with Rx
>>   buffer alignment (typically 1024B). So, for example, in order to
>>   successfully receive and capture a 1500B packet, mbufs with a
>>   data portion of size 2048B must be used.
>> - in OvS, the elements that comprise an mbuf are:
>>   * the dp_packet, which includes a struct rte_mbuf (704B)
>>   * RTE_PKTMBUF_HEADROOM (128B)
>>   * packet data (aligned to 1k, as previously described)
>>   * RTE_PKTMBUF_TAILROOM (typically 0)
>>
>> Some PMDs require that the total mbuf size (i.e. the total sum of all
>> of the above-listed components' lengths) is cache-aligned. To satisfy
>> this requirement, it may be necessary to round up the total mbuf size
>> with respect to cacheline size. In doing so, it's possible that the
>> dp_packet's data portion is inadvertently increased in size, such that
>> it no longer adheres to Rx buffer alignment. Consequently, the
>> following property of the mbuf no longer holds true:
>>
>>     mbuf.data_len == mbuf.buf_len - mbuf.data_off
>>
>> This creates a problem in the case of multi-segment mbufs, where that
>> property is assumed to hold true for all but the final segment in an
>> mbuf chain. Resolve this issue by adjusting the size of the mbuf's
>> private data portion, as opposed to the packet data portion, when
>> aligning mbuf size to cachelines.
> 
> Do you have a pointer to the DPDK documentation about this issue?
> Because this looks like an internal DPDK requirement which shouldn't
> be exposed to OVS. Maybe we can take it now, but yeah, doesn't make
> sense to have every DPDK application out there worrying about mbuf
> internal constraints/alignments.

Unfortunately, no (if someone has one, please link me to it). This seems
to happen mostly ad hoc for each PMD. Looking at the history, I can see
that this alignment fix was initially added for the thunderx vNIC PMD
(commit 31b88c9), which seems to fail initialization unless the
mbuf_size is a multiple of the cache line size. This patch is just
rationalizing that odd logic for setting buf_size and priv_size.

DPDK already checks that the passed priv_size is aligned to
RTE_MBUF_PRIV_ALIGN when creating the pool, but doesn't seem to do any
check on the total mbuf size; the PMD above then just fails if it isn't
aligned. I would agree that this has room for improvement.
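
To make that concrete, here's a rough sketch of the two constraints as I
understand them (illustrative only, not verbatim DPDK or OvS code;
data_room_size stands in for the data room passed to the pool):

    /* Checked by DPDK itself: rte_pktmbuf_pool_create() fails with
     * EINVAL unless the private area is suitably aligned. */
    uint16_t priv_size = sizeof(struct dp_packet) - sizeof(struct rte_mbuf);
    ovs_assert(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);

    /* NOT checked by the generic mbuf code; only some PMDs (e.g. the
     * thunderx one above) enforce this on their own at init time. */
    uint32_t total_size = sizeof(struct rte_mbuf) + priv_size
                          + RTE_PKTMBUF_HEADROOM + data_room_size;
    ovs_assert(total_size % RTE_CACHE_LINE_SIZE == 0);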

> 
> 
>> Fixes: 4be4d22 ("netdev-dpdk: clean up mbuf initialization")
>> Fixes: 31b88c9 ("netdev-dpdk: round up mbuf_size to cache_line_size")
>> CC: Santosh Shukla <santosh.shukla at caviumnetworks.com>
>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh at intel.com>
>> Acked-by: Santosh Shukla <santosh.shukla at caviumnetworks.com>
>> Acked-by: Eelco Chaudron <echaudro at redhat.com>
>> ---
>>  lib/netdev-dpdk.c | 56 +++++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 38 insertions(+), 18 deletions(-)
>>
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index f91aa27..1e19622 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -88,10 +88,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>  #define MTU_TO_MAX_FRAME_LEN(mtu)   ((mtu) + ETHER_HDR_MAX_LEN)
>>  #define FRAME_LEN_TO_MTU(frame_len) ((frame_len)                    \
>>                                       - ETHER_HDR_LEN - ETHER_CRC_LEN)
>> -#define MBUF_SIZE(mtu)              ROUND_UP((MTU_TO_MAX_FRAME_LEN(mtu) \
>> -                                             + sizeof(struct dp_packet) \
>> -                                             + RTE_PKTMBUF_HEADROOM),   \
>> -                                             RTE_CACHE_LINE_SIZE)
>>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>>  #define NETDEV_DPDK_MAX_PKT_LEN     9728
>>  
>> @@ -637,7 +633,11 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>      const char *netdev_name = netdev_get_name(&dev->up);
>>      int socket_id = dev->requested_socket_id;
>> -    uint32_t n_mbufs;
>> +    uint32_t n_mbufs = 0;
>> +    uint32_t mbuf_size = 0;
>> +    uint32_t aligned_mbuf_size = 0;
>> +    uint32_t mbuf_priv_data_len = 0;
>> +    uint32_t pkt_size = 0;
>>      uint32_t hash = hash_string(netdev_name, 0);
>>      struct dpdk_mp *dmp = NULL;
>>      int ret;
>> @@ -650,6 +650,9 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>      dmp->mtu = mtu;
>>      dmp->refcount = 1;
>>  
>> +    /* Get the size of each mbuf, based on the MTU */
>> +    mbuf_size = dpdk_buf_size(dev->requested_mtu);
>> +
> 
> Here mbuf_size accounts for RTE_PKTMBUF_HEADROOM.
> 
>>      n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>>  
>>      do {
>> @@ -661,8 +664,8 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>           * so this is not an issue for tasks such as debugging.
>>           */
>>          ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>> -                           "ovs%08x%02d%05d%07u",
>> -                           hash, socket_id, mtu, n_mbufs);
>> +                       "ovs%08x%02d%05d%07u",
>> +                        hash, socket_id, mtu, n_mbufs);
>>          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>>              VLOG_DBG("snprintf returned %d. "
>>                       "Failed to generate a mempool name for \"%s\". "
>> @@ -671,17 +674,34 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>              break;
>>          }
>>  
>> -        VLOG_DBG("Port %s: Requesting a mempool of %u mbufs "
>> -                  "on socket %d for %d Rx and %d Tx queues.",
>> -                  netdev_name, n_mbufs, socket_id,
>> -                  dev->requested_n_rxq, dev->requested_n_txq);
>> -
>> -        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
>> -                                          MP_CACHE_SZ,
>> -                                          sizeof (struct dp_packet)
>> -                                          - sizeof (struct rte_mbuf),
>> -                                          MBUF_SIZE(mtu)
>> -                                          - sizeof(struct dp_packet),
>> +        VLOG_DBG("Port %s: Requesting a mempool of %u mbufs of size %u "
>> +                  "on socket %d for %d Rx and %d Tx queues, "
>> +                  "cache line size of %u",
>> +                  netdev_name, n_mbufs, mbuf_size, socket_id,
>> +                  dev->requested_n_rxq, dev->requested_n_txq,
>> +                  RTE_CACHE_LINE_SIZE);
>> +
>> +        mbuf_priv_data_len = sizeof(struct dp_packet) -
>> +                                 sizeof(struct rte_mbuf);
>> +        /* The size of the entire dp_packet. */
>> +        pkt_size = sizeof (struct dp_packet) +
>> +                                mbuf_size + RTE_PKTMBUF_HEADROOM;
>> +        /* mbuf size, rounded up to cacheline size. */
>> +        aligned_mbuf_size = ROUND_UP(pkt_size, RTE_CACHE_LINE_SIZE);
>> +        /* If there is a size discrepancy, add padding to mbuf_priv_data_len.
>> +         * This maintains mbuf size cache alignment, while also honoring RX
>> +         * buffer alignment in the data portion of the mbuf. If this adjustment
>> +         * is not made, there is a possibility later on that for an element of
>> +         * the mempool, buf, buf->data_len < (buf->buf_len - buf->data_off).
>> +         * This is problematic in the case of multi-segment mbufs, particularly
>> +         * when an mbuf segment needs to be resized (when [push|pop]ping a VLAN
>> +         * header, for example).
>> +         */
>> +        mbuf_priv_data_len += (aligned_mbuf_size - pkt_size);
>> +
>> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
>> +                                          mbuf_priv_data_len,
>> +                                          mbuf_size + RTE_PKTMBUF_HEADROOM,
>>                                            socket_id);
> 
> And here it's accounted for again. I can't see why.

Yeah, there's no need really. The only reason was to make the whole
available data room 2048B, and then still have 128B of headroom. But
I'll get rid of this one in the next iteration, thanks.
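
Roughly, something like this for the next iteration (just a sketch of
the intent; dpdk_buf_size() already accounts for RTE_PKTMBUF_HEADROOM,
so it isn't added again on top):

    mbuf_size = dpdk_buf_size(dev->requested_mtu);

    /* Pad the private area so the total mbuf size stays cache-line
     * aligned, without growing the data room. */
    mbuf_priv_data_len = sizeof(struct dp_packet) -
                             sizeof(struct rte_mbuf);
    pkt_size = sizeof(struct dp_packet) + mbuf_size;
    aligned_mbuf_size = ROUND_UP(pkt_size, RTE_CACHE_LINE_SIZE);
    mbuf_priv_data_len += aligned_mbuf_size - pkt_size;

    dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
                                      mbuf_priv_data_len,
                                      mbuf_size, socket_id);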

Tiago.

> Thanks,
> fbl
> 
>>  
>>          if (dmp->mp) {
>> -- 
>> 2.7.4
>>
> 

