[ovs-dev] [PATCH v3] netdev-dpdk: Add mbuf HEADROOM after alignment.

Lam, Tiago tiago.lam at intel.com
Thu Nov 29 13:04:29 UTC 2018


On 28/11/2018 15:56, Stokes, Ian wrote:
>> Commit dfaf00e started using the result of dpdk_buf_size() to calculate
>> the available size on each mbuf, as opposed to using the previous
>> MBUF_SIZE macro. However, it calculated the mbuf size by adding
>> RTE_PKTMBUF_HEADROOM to the MTU and only then aligning to
>> NETDEV_DPDK_MBUF_ALIGN. Instead, RTE_PKTMBUF_HEADROOM should only be
>> accounted for after alignment, as per below.
>>
>> Before alignment:
>> ROUNDUP(MTU(1500) + RTE_PKTMBUF_HEADROOM(128), 1024) = 2048
>>
>> After alignment:
>> ROUNDUP(MTU(1500), 1024) + 128 = 2176
>>
>> This might seem insignificant; however, it can have performance
>> implications in DPDK, where each mbuf is expected to have 2k +
>> RTE_PKTMBUF_HEADROOM of available space. This is because not only do
>> some NICs have coarse-grained alignments of 1k, they will also take
>> RTE_PKTMBUF_HEADROOM bytes from the overall available space in an mbuf
>> when setting up their Rx requirements. Thus, only the "After alignment"
>> case above guarantees 2k of available room; the "Before alignment" case
>> leaves only 1920B.
>>
>> Some extra information can be found at:
>> https://mails.dpdk.org/archives/dev/2018-November/119219.html
>>
>> Note: This has been found by Ian Stokes while going through some af_packet
>> checks.
>>
>> Reported-by: Ian Stokes <ian.stokes at intel.com>
>> Fixes: dfaf00e ("netdev-dpdk: fix mbuf sizing")
>> Signed-off-by: Tiago Lam <tiago.lam at intel.com>
> 
> Thanks Tiago, I've applied this to master. If I'm not mistaken, I believe it should be backported also; previous releases were using the incorrect calculation for dpdk_buf_size(). Thoughts?

Thanks for that, Ian.

On backporting, that's a good point. I'm not too sure about this one,
though.

From what I can gather, the previous branches back to 2.6.x (i.e. 2.6.x,
2.7.x, 2.8.x, 2.9.x and 2.10.x) won't have the same problem. This is
because there we are over-allocating each mbuf: the size passed to
`rte_pktmbuf_pool_create()` includes RTE_PKTMBUF_HEADROOM a second time,
since it had already been added in dpdk_buf_size() and is added again by
the MBUF_SIZE macro. So, backporting could potentially save some space
there, but I don't think we ever pass DPDK a smaller size than we
should.

So, I think the TL;DR here is that this incorrect behavior was
introduced in commit dfaf00e ("netdev-dpdk: fix mbuf sizing") only,
where the MBUF_SIZE macro was removed. As a result, we started relying
on the result of dpdk_buf_size() alone, which adds RTE_PKTMBUF_HEADROOM
first and aligns to NETDEV_DPDK_MBUF_ALIGN afterwards, meaning we can
end up passing only 2k when calling DPDK's rte_pktmbuf_pool_create(),
which may not be enough.

Backporting could still tidy this up a bit, though. Let me know if you
think it would make sense.

Regards,
Tiago.

(As an aside, on branches below 2.7.x, OvS-DPDK doesn't use
`rte_pktmbuf_pool_create()` and calls `rte_mempool_create()` directly
(likely because the former wasn't available in DPDK yet). There, I
think accounting for RTE_PKTMBUF_HEADROOM twice is correct behavior,
since the raw mempool API doesn't add `sizeof(struct rte_mbuf)` the way
`rte_pktmbuf_pool_create()` does. So, this may very well be a leftover.)
