[ovs-dev] [PATCH v7 00/13] Support multi-segment mbufs

Lam, Tiago tiago.lam at intel.com
Tue Aug 14 09:45:16 UTC 2018


On 09/08/2018 09:27, Ilya Maximets wrote:
> Hmm. I found that this series modifies only dpdk related components
> and doesn't pay any attention to others.
> dp_packet API was modified to respect the segmented packets, but
> there are many places, where code uses packet data directly using
> only dp_packet_data() pointer and dp_packet_size().
> For example, at least following functions will read the wrong memory
> and probably generate segfault:
> 
>   * All the netdev-{linux, dummy, bsd} send functions, because
>     they are using code like this:
> 
>     write(netdev->tap_fd, dp_packet_data(packet), dp_packet_size(packet));
> 
>   * Tunnel extracting functions like 'udp_extract_tnl_md', because
>     they're trying to calculate packet checksums using raw calls to
>     'csum_continue'.
> 
>   * CONTROLLER_UPCALL will aslo be broken, because it uses direct xmemdup.
> 
>   * Possibly, much more functions.
> 

Hi Ilya (and All),

I went through the uses that call dp_packet_data(), and some of the
cases were covered already, like the "udp_extract_tnl_md" case you
mention - because the dp_packet API is changed to not let callers pull
headers after the first segment.

The other cases I consider all to be of the same category: they all need
the data contiguous in memory, such as the write() syscall you mention
above. Do you see any other corner cases that I may be missing?

So, to cover this category, where data is expected to be contiguous in
memory, we'd need to copy it to a someplace contigous (most likely
system's memory). The approach I have in mind is to extend the dp_packet
API in order to add a new function that "linearizes" the multi-segmented
data. This new function would be called from places such as the write()
call in netdev-linux you mention above, instead of the call to
dp_packet_data() in place now, and would convert between a DPBUF_DPDK
packet into a DPBUF_MALLOC packet. The packet would then continue to be
used on as a regular DPBUF_MALLOC packet (instead of a DPBUF_DPDK
packet) which would be properly free'ed and returned back to the DPDK
mempool's upon dp_packet_delete() / dp_packet_uninit().

An upside of this approach is that it would also enable us to follow the
ideia proposed on [1] (by yourself), where we could "easily" tranform a
DPDK packet into a system packet if no more space exists when calling
dp_packet_reuse__().

This would enable us to use multi-segment mbufs, and later on TSO,
between DPDK ports, while still enabling the use cases between DPDK
ports and non-DPDK ports. For this latter case, there'd be a one-off
(expensive) operation where system's memory needs to be allocated and
copied to, but documentation would be further improved to explain this,
and I believe warning the user with a WARN log message would also be
helpful here as well.

What do you think?

[1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-July/350009.html

> If It's true, when I'm afraid that this patch set is far from being
> ready for merge.
> 
> On 25.07.2018 17:19, Tiago Lam wrote:
>> Overview
>> ========
>> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
>> Multi-segment mbufs are typically used when the size of an mbuf is
>> insufficient to contain the entirety of a packet's data. Instead, the
>> data is split across numerous mbufs, each carrying a portion, or
>> 'segment', of the packet data. Mbufs are chained via their 'next'
>> attribute (an mbuf pointer).
>>
>> The main motivation behind the support for multi-segment mbufs is to
>> later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is
>> planned to be introduced after this series.
>>
>> Use Cases
>> =========
>> i.  Handling oversized (guest-originated) frames, which are marked
>>     for hardware accelration/offload (TSO, for example).
>>
>>     Packets which originate from a non-DPDK source may be marked for
>>     offload; as such, they may be larger than the permitted ingress
>>     interface's MTU, and may be stored in an oversized dp-packet. In
>>     order to transmit such packets over a DPDK port, their contents
>>     must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>>     its current implementation, that function only copies data into
>>     a single mbuf; if the space available in the mbuf is exhausted,
>>     but not all packet data has been copied, then it is lost.
>>     Similarly, when cloning a DPDK mbuf, it must be considered
>>     whether that mbuf contains multiple segments. Both issues are
>>     resolved within this patchset.
>>
>> ii. Handling jumbo frames.
>>
>>     While OvS already supports jumbo frames, it does so by increasing
>>     mbuf size, such that the entirety of a jumbo frame may be handled
>>     in a single mbuf. This is certainly the preferred, and most
>>     performant approach (and remains the default).
>>
>> Enabling multi-segment mbufs
>> ============================
>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>> user must decide on which approach to adopt on init. The introduction
>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>
>> This is a global boolean value, which determines how jumbo frames are
>> represented across all DPDK ports. In the absence of a user-supplied
>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>> mbufs must be explicitly enabled / single-segment mbufs remain the
>> default.
>>
>> Setting the field is identical to setting existing DPDK-specific OVSDB
>> fields:
>>
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>     ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>
>> Performance notes (based on v8)
>> =================
>> In order to test for regressions in performance, tests were run on top
>> of master 88125d6 and v8 of this patchset, both with the multi-segment
>> mbufs option enabled and disabled.
>>
>> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
>> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>>
>> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
>> -------------------------------------------------------------
>> p2p  |  64  | ~22.7  |      ~22.65        |       ~18.3
>> p2p  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>> p2p  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>> pvp  |  64  |  ~6.7  |        ~6.7        |        ~6.3
>> pvp  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
>> pvp  | 7000 | ~0.36  |       ~0.36        |       ~0.36
>>
>> Packet size is in bytes, while all packet rates are reported in mpps
>> (aggregated).
>>
>> No noticeable regression has been observed (certainly everything is
>> within the ± 5% margin of existing performance), aside from the 64B
>> packet size case when multi-segment mbuf is enabled. This is
>> expected, however, because of how Tx vectoriszed functions are
>> incompatible with multi-segment mbufs on some PMDs. The PMD under
>> use during these tests was the i40e (on a Intel X710 NIC), which
>> indeed doesn't support vectorized Tx functions with multi-segment
>> mbufs.
>>
>> ---
>> v7: - Rebase on master 024810c ("Prepare for post-2.10.0 (2.10.90).");
>>     - Add Ben's proposed fix for automake's warning;
>>     - Add a note to cover letter to explain this is preperatory work for TSO /
>>       GRO.
>>
>> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command parser
>>       error paths.");
>>     - Address Darrell's comments:
>>       - The changes in dp_packet_resize__() were trying to alleviate the call
>>         to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the available
>>         tailroom space when no more headroom space is available, and vice-versa.
>>         However, this was breaking the API for the dp_packet_resize__()
>>         function (only called in dp_packet_prealloc_tailroom() and
>>         dp_packet_prealloc_headroom()), which doesn't seem to suit the purposes
>>         for DPDK packets.
>>         Instead, and because this is isolate funtionality, revert to the
>>         previous state where dp_packet_resize__() is not supported for DPDK
>>         packets. Hence, then patch 08/14 has been dropped.
>>     - Additionally, fix the tests that were relying on the removed
>>       functionality.
>>
>> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
>>       after free.");
>>     - Address Eelco's comments:
>>       - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>>         leftover from rebase. Only call should be in dpdk_mp_get();
>>       - Remove NEWS line added by mistake during rebase (about adding
>>         experimental vhost zero copy support).
>>     - Address Ian's comments:
>>       - Drop patch 01 from previous series entirely;
>>       - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>>         dpdk_mp_create() to get the correct "mbuf_size" to be used;
>>       - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
>>         mbufs is enabled, in which case it calculates the new "mbuf_size" to be
>>         used;
>>       - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>>         conditionally.
>>     - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current
>>       DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using
>>       the shared mempool model (runs out of memory);
>>     - Add new examples for when multi-segment mbufs are enabled in
>>       topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
>>       (now patch 11/14).
>>
>> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
>>       command."):
>>       - A new patch (now 01/15) has been introduced to differentiate between
>>         MTU and mbuf size when creating the mempool. This is because as part of
>>         the new support for both per port and shared mempools, mempools were
>>         being reused based on the mbuf size, and for multi-segment mbufs the
>>         mbuf size can end up being the same for all mbufs;
>>       - A couple of other patches (now 02/15 and 12/15) have been modified as
>>         part of the rebase, but only to adapt to the code changes to "Support
>>         both shared and per port mempools.", no functionality should have been
>>         changed.
>>
>> v3:
>>     - Address Eelco's comment:
>>         - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
>>           wasn't correctly asserting that the passed 'v' was smaller than the
>>           first mbuf's buf_len.
>>
>> v2:
>>     - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>>       error.");
>>     - Address Eelco's comments:
>>         - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
>>           casting;
>>         - Remove unneeded variable in dp_packet_set_size(), pointing to the
>>           first mbuf in the chain;
>>         - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>>           always true for DPBUF_DPDK packets;
>>         - Assert in __packet_set_data() to enforce that data_off never goes
>>           beyond the first mbuf in the chain.
>>
>> v1:
>>     - v8 should have been sent as v1 really, as that's usually the approach
>>       followed in OvS. That clearly didn't happen, so restarting the series
>>       now. This also helps making it clear it is no longer an RFC series;
>>     - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>>       output.");
>>     - Address Eelco's comments:
>>         - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>>           `pkt_len` and `data_len` can't be set to values bigger than the
>>           available space. Also fix assigment to `data_len` which was
>>           incorrectly being set to just`pkt_len`;
>>         - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>>           why the mutex is needed;
>>         - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>>           `DPBUF_DPDK` packets;
>>         - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>>         - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for
>>           copying `DPBUF_DPDK` packets' data. Also, change it to return
>>           appropriate and meaningful errors, instead of just "0" or "1";
>>         - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
>>           and reuse dp_packet_mbuf_write() instead of manual copy;
>>         - Add note vswitchd/vswitch.xml to make it clear the enabling of
>>           dpdk-multi-seg-mbufs requires a restart;
>>     - Change dpdk_mp_create() to increase # mbufs used under the multi-segment
>>       mbufs approach;
>>     - Increase test coverage by adding "end-to-end" tests that verify that
>>       "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>>       successfully sent out;
>>     - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved
>>       back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
>>       minimise changes;
>>     - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>>       after further testing;
>>     - Structure changes:
>>       - Drop patch 07/13 which is now unneeded;
>>       - Two more patches added for extra test coverage. This is what accounts
>>         for the increase in size (+1 patch) in the series.
>>
>> v8 (non-RFC): 
>>     - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>>       temporary directory (also for RHEL)");
>>     - Address Ciara's and Ilya's comments:
>>         - Drop the dp_packet_mbuf_tail() function and use only the
>>           already existing dp_packet_tail();
>>         - Fix bug in dpdk_do_tx_copy() where the error return from
>>           dpdk_prep_tx_buf() was being wrongly checked;
>>         - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>>           rte_pktmbuf_alloc() and rte_pktmbuf_free();
>>         - Fix some other code style and duplication issues pointed out.
>>     - Refactored dp_packet_shift(), dp_packet_resize__() and
>>       dp_packet_put*() functions to work within the bounds of existing
>>       mbufs only;
>>     - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>>       other mbufs in the chain for chains with more than a single mbuf;
>>     - dp_packet layer functions (such as dp_packet_l3()) now check if
>>       the header is within the first mbuf, when using mbufs;
>>     - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>>       was refactored to use free_dpdk_buf();
>>     - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>>       leading to memory corruption; 
>>     - Modified the added tests to account for some of the above changes;
>>     - Run performance tests, compiling results and adding them to the
>>       cover letter;
>>     - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>>       together with a "Performance notes" sub-section reflecting the
>>       findings mentioned above in the cover letter.
>>
>> v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>>      - Address Ilya comments;
>>         - Fix non-DPDK build;
>>         - Serialise the access of non pmds to allocation and free of mbufs by
>>           using a newly introduced mutex.
>>      - Add a new set of tests that integrates with the recently added DPDK
>>        testsuite. These focus on allocating dp_packets, with a single or
>>        multiple mbufs, from an instantiated mempool and performing several
>>        operations on those, verifying if the data at the end matches what's
>>        expected;
>>      - Fix bugs found by the new tests:
>>         - dp_packet_shift() wasn't taking into account shift lefts;
>>         - dp_packet_resize__() was misusing and miscalculating the tailrooms
>>           and headrooms, ending up calculating the wrong number of segments
>>           that needed allocation;
>>         - An mbuf's end was being miscalculated both in dp_packet_tail,
>>           dp_packet_mbuf_tail() and dp_packet_end();
>>         - dp_packet_set_size() was not updating the number of chained segments
>>           'nb_segs';
>>      - Add support for multi-seg mbufs in dp_packet_clear().
>>
>> v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>>        reuse.");
>>      - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>>        support multi-seg mbufs;
>>      - Add support for multi-seg mbufs in dp_packet_l4_size() and
>>        improve other helper funcs, such as dp_packet_tail() and dp_
>>        packet_tailroom().
>>      - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>>        put_zeros(), as well as dp_packet_resize__() - allocating new
>>        mbufs and linking them together;
>>      Restructured patchset:
>>      - Squash patch 5 into patch 6, since they were both related to
>>        copying data while handling multi-seg mbufs;
>>      - Split patch 4 into two separate patches - one that introduces the
>>        changes in helper functions to deal with multi-seg mbufs and
>>        two others for the shift() and put_uninit() functionality;
>>      - Move patch 4 to before patch 3, so that ihelper functions come
>>        before functionality improvement that rely on those helpers.
>>
>> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>>       from getting redundantly incremented");
>>     - Sugesh's comments have been addressed:
>>       - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>>         make them independent of each other;
>>       - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>>         size() are independent;
>>       - dp_packet_clone_with_headroom() now has split functions for
>>         handling DPDK sourced packets and non-DPDK packets;
>>     - Modified various functions in dp-packet.h to account for multi-seg
>>       mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>>       and dp_packet_at();
>>     - Added support for shifting packet data in multi-seg mbufs, using
>>       dp_packet_shift();
>>     - Fixed some minor inconsistencies.
>>
>>     Note that some of the changes in v5 have been contributed by Mark
>>     Kavanagh as well.
>>
>> v4: - restructure patchset
>>     - account for 128B ARM cacheline when sizing mbufs
>>
>> Mark Kavanagh (4):
>>   netdev-dpdk: fix mbuf sizing
>>   dp-packet: Init specific mbuf fields.
>>   netdev-dpdk: copy large packet to multi-seg. mbufs
>>   netdev-dpdk: support multi-segment jumbo frames
>>
>> Michael Qiu (1):
>>   dp-packet: copy data from multi-seg. DPDK mbuf
>>
>> Tiago Lam (8):
>>   dp-packet: Fix allocated size on DPDK init.
>>   netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>>   dp-packet: Fix data_len handling multi-seg mbufs.
>>   dp-packet: Handle multi-seg mbufs in helper funcs.
>>   dp-packet: Handle multi-seg mubfs in shift() func.
>>   dpdk-tests: Add uni-tests for multi-seg mbufs.
>>   dpdk-tests: Accept other configs in OVS_DPDK_START
>>   dpdk-tests: End-to-end tests for multi-seg mbufs.
>>
>>  Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
>>  Documentation/topics/dpdk/memory.rst       |  36 ++
>>  NEWS                                       |   1 +
>>  lib/dp-packet.c                            | 173 +++++++++-
>>  lib/dp-packet.h                            | 214 ++++++++++--
>>  lib/dpdk.c                                 |   8 +
>>  lib/netdev-dpdk.c                          | 244 +++++++++++---
>>  lib/netdev-dpdk.h                          |   2 +
>>  tests/automake.mk                          |  10 +-
>>  tests/dpdk-packet-mbufs.at                 |   7 +
>>  tests/system-dpdk-macros.at                |   6 +-
>>  tests/system-dpdk-testsuite.at             |   1 +
>>  tests/system-dpdk.at                       |  65 ++++
>>  tests/test-dpdk-mbufs.c                    | 513 +++++++++++++++++++++++++++++
>>  vswitchd/vswitch.xml                       |  22 ++
>>  15 files changed, 1277 insertions(+), 77 deletions(-)
>>  create mode 100644 tests/dpdk-packet-mbufs.at
>>  create mode 100644 tests/test-dpdk-mbufs.c
>>


More information about the dev mailing list