[ovs-dev] [PATCH v5 00/14] Support multi-segment mbufs

Ian Stokes ian.stokes at intel.com
Fri Jul 13 14:47:10 UTC 2018


On 7/11/2018 7:23 PM, Tiago Lam wrote:
> Overview
> ========
> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
> Multi-segment mbufs are typically used when the size of an mbuf is
> insufficient to contain the entirety of a packet's data. Instead, the
> data is split across numerous mbufs, each carrying a portion, or
> 'segment', of the packet data. mbufs are chained via their 'next'
> attribute (an mbuf pointer).
> 

Thanks to all for the work on this series. I've pushed this to 
dpdk_merge and it will be part of the pull request this week.

Ian
> Use Cases
> =========
> i.  Handling oversized (guest-originated) frames, which are marked
>      for hardware accelration/offload (TSO, for example).
> 
>      Packets which originate from a non-DPDK source may be marked for
>      offload; as such, they may be larger than the permitted ingress
>      interface's MTU, and may be stored in an oversized dp-packet. In
>      order to transmit such packets over a DPDK port, their contents
>      must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>      its current implementation, that function only copies data into
>      a single mbuf; if the space available in the mbuf is exhausted,
>      but not all packet data has been copied, then it is lost.
>      Similarly, when cloning a DPDK mbuf, it must be considered
>      whether that mbuf contains multiple segments. Both issues are
>      resolved within this patchset.
> 
> ii. Handling jumbo frames.
> 
>      While OvS already supports jumbo frames, it does so by increasing
>      mbuf size, such that the entirety of a jumbo frame may be handled
>      in a single mbuf. This is certainly the preferred, and most
>      performant approach (and remains the default).
> 
> Enabling multi-segment mbufs
> ============================
> Multi-segment and single-segment mbufs are mutually exclusive, and the
> user must decide on which approach to adopt on init. The introduction
> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
> 
> This is a global boolean value, which determines how jumbo frames are
> represented across all DPDK ports. In the absence of a user-supplied
> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
> mbufs must be explicitly enabled / single-segment mbufs remain the
> default.
> 
> Setting the field is identical to setting existing DPDK-specific OVSDB
> fields:
> 
>      ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>      ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>      ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
> 
> Performance notes (based on v8)
> =================
> In order to test for regressions in performance, tests were run on top
> of master 88125d6 and v8 of this patchset, both with the multi-segment
> mbufs option enabled and disabled.
> 
> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
> 
> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
> -------------------------------------------------------------
> p2p  |  64  | ~22.7  |      ~22.65        |       ~18.3
> p2p  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
> p2p  | 7000 | ~0.36  |       ~0.36        |       ~0.36
> pvp  |  64  |  ~6.7  |        ~6.7        |        ~6.3
> pvp  | 1500 |  ~1.6  |        ~1.6        |        ~1.6
> pvp  | 7000 | ~0.36  |       ~0.36        |       ~0.36
> 
> Packet size is in bytes, while all packet rates are reported in mpps
> (aggregated).
> 
> No noticeable regression has been observed (certainly everything is
> within the ± 5% margin of existing performance), aside from the 64B
> packet size case when multi-segment mbuf is enabled. This is
> expected, however, because of how Tx vectoriszed functions are
> incompatible with multi-segment mbufs on some PMDs. The PMD under
> use during these tests was the i40e (on a Intel X710 NIC), which
> indeed doesn't support vectorized Tx functions with multi-segment
> mbufs.
> 
> ---
> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
>        after free.");
>      - Address Eelco's comments:
>        - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>          leftover from rebase. Only call should be in dpdk_mp_get();
>        - Remove NEWS line added by mistake during rebase (about adding
>          experimental vhost zero copy support).
>      - Address Ian's comments:
>        - Drop patch 01 from previous series entirely;
>        - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>          dpdk_mp_create() to get the correct "mbuf_size" to be used;
>        - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
>          mbufs is enabled, in which case it calculates the new "mbuf_size" to be
>          used;
>        - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>          conditionally.
>      - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current
>        DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using
>        the shared mempool model (runs out of memory);
>      - Add new examples for when multi-segment mbufs are enabled in
>        topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
>        (now patch 11/14).
> 
> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
>        command."):
>        - A new patch (now 01/15) has been introduced to differentiate between
>          MTU and mbuf size when creating the mempool. This is because as part of
>          the new support for both per port and shared mempools, mempools were
>          being reused based on the mbuf size, and for multi-segment mbufs the
>          mbuf size can end up being the same for all mbufs;
>        - A couple of other patches (now 02/15 and 12/15) have been modified as
>          part of the rebase, but only to adapt to the code changes to "Support
>          both shared and per port mempools.", no functionality should have been
>          changed.
> 
> v3:
>      - Address Eelco's comment:
>          - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
>            wasn't correctly asserting that the passed 'v' was smaller than the
>            first mbuf's buf_len.
> 
> v2:
>      - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>        error.");
>      - Address Eelco's comments:
>          - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
>            casting;
>          - Remove unneeded variable in dp_packet_set_size(), pointing to the
>            first mbuf in the chain;
>          - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>            always true for DPBUF_DPDK packets;
>          - Assert in __packet_set_data() to enforce that data_off never goes
>            beyond the first mbuf in the chain.
> 
> v1:
>      - v8 should have been sent as v1 really, as that's usually the approach
>        followed in OvS. That clearly didn't happen, so restarting the series
>        now. This also helps making it clear it is no longer an RFC series;
>      - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>        output.");
>      - Address Eelco's comments:
>          - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>            `pkt_len` and `data_len` can't be set to values bigger than the
>            available space. Also fix assigment to `data_len` which was
>            incorrectly being set to just`pkt_len`;
>          - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>            why the mutex is needed;
>          - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>            `DPBUF_DPDK` packets;
>          - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>          - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for
>            copying `DPBUF_DPDK` packets' data. Also, change it to return
>            appropriate and meaningful errors, instead of just "0" or "1";
>          - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
>            and reuse dp_packet_mbuf_write() instead of manual copy;
>          - Add note vswitchd/vswitch.xml to make it clear the enabling of
>            dpdk-multi-seg-mbufs requires a restart;
>      - Change dpdk_mp_create() to increase # mbufs used under the multi-segment
>        mbufs approach;
>      - Increase test coverage by adding "end-to-end" tests that verify that
>        "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>        successfully sent out;
>      - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved
>        back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
>        minimise changes;
>      - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>        after further testing;
>      - Structure changes:
>        - Drop patch 07/13 which is now unneeded;
>        - Two more patches added for extra test coverage. This is what accounts
>          for the increase in size (+1 patch) in the series.
> 
> v8 (non-RFC):
>      - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>        temporary directory (also for RHEL)");
>      - Address Ciara's and Ilya's comments:
>          - Drop the dp_packet_mbuf_tail() function and use only the
>            already existing dp_packet_tail();
>          - Fix bug in dpdk_do_tx_copy() where the error return from
>            dpdk_prep_tx_buf() was being wrongly checked;
>          - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>            rte_pktmbuf_alloc() and rte_pktmbuf_free();
>          - Fix some other code style and duplication issues pointed out.
>      - Refactored dp_packet_shift(), dp_packet_resize__() and
>        dp_packet_put*() functions to work within the bounds of existing
>        mbufs only;
>      - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>        other mbufs in the chain for chains with more than a single mbuf;
>      - dp_packet layer functions (such as dp_packet_l3()) now check if
>        the header is within the first mbuf, when using mbufs;
>      - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>        was refactored to use free_dpdk_buf();
>      - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>        leading to memory corruption;
>      - Modified the added tests to account for some of the above changes;
>      - Run performance tests, compiling results and adding them to the
>        cover letter;
>      - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>        together with a "Performance notes" sub-section reflecting the
>        findings mentioned above in the cover letter.
>   
> v7:  - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>       - Address Ilya comments;
>          - Fix non-DPDK build;
>          - Serialise the access of non pmds to allocation and free of mbufs by
>            using a newly introduced mutex.
>       - Add a new set of tests that integrates with the recently added DPDK
>         testsuite. These focus on allocating dp_packets, with a single or
>         multiple mbufs, from an instantiated mempool and performing several
>         operations on those, verifying if the data at the end matches what's
>         expected;
>       - Fix bugs found by the new tests:
>          - dp_packet_shift() wasn't taking into account shift lefts;
>          - dp_packet_resize__() was misusing and miscalculating the tailrooms
>            and headrooms, ending up calculating the wrong number of segments
>            that needed allocation;
>          - An mbuf's end was being miscalculated both in dp_packet_tail,
>            dp_packet_mbuf_tail() and dp_packet_end();
>          - dp_packet_set_size() was not updating the number of chained segments
>            'nb_segs';
>       - Add support for multi-seg mbufs in dp_packet_clear().
> 
> v6:  - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>         reuse.");
>       - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>         support multi-seg mbufs;
>       - Add support for multi-seg mbufs in dp_packet_l4_size() and
>         improve other helper funcs, such as dp_packet_tail() and dp_
>         packet_tailroom().
>       - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>         put_zeros(), as well as dp_packet_resize__() - allocating new
>         mbufs and linking them together;
>       Restructured patchset:
>       - Squash patch 5 into patch 6, since they were both related to
>         copying data while handling multi-seg mbufs;
>       - Split patch 4 into two separate patches - one that introduces the
>         changes in helper functions to deal with multi-seg mbufs and
>         two others for the shift() and put_uninit() functionality;
>       - Move patch 4 to before patch 3, so that ihelper functions come
>         before functionality improvement that rely on those helpers.
> 
> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>        from getting redundantly incremented");
>      - Sugesh's comments have been addressed:
>        - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>          make them independent of each other;
>        - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>          size() are independent;
>        - dp_packet_clone_with_headroom() now has split functions for
>          handling DPDK sourced packets and non-DPDK packets;
>      - Modified various functions in dp-packet.h to account for multi-seg
>        mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>        and dp_packet_at();
>      - Added support for shifting packet data in multi-seg mbufs, using
>        dp_packet_shift();
>      - Fixed some minor inconsistencies.
> 
>      Note that some of the changes in v5 have been contributed by Mark
>      Kavanagh as well.
> 
> v4: - restructure patchset
>      - account for 128B ARM cacheline when sizing mbufs
> 
> Mark Kavanagh (4):
>    netdev-dpdk: fix mbuf sizing
>    dp-packet: Init specific mbuf fields.
>    netdev-dpdk: copy large packet to multi-seg. mbufs
>    netdev-dpdk: support multi-segment jumbo frames
> 
> Michael Qiu (1):
>    dp-packet: copy data from multi-seg. DPDK mbuf
> 
> Tiago Lam (9):
>    dp-packet: Fix allocated size on DPDK init.
>    netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>    dp-packet: Fix data_len handling multi-seg mbufs.
>    dp-packet: Handle multi-seg mbufs in helper funcs.
>    dp-packet: Handle multi-seg mubfs in shift() func.
>    dp-packet: Handle multi-seg mbufs in resize__().
>    dpdk-tests: Add uni-tests for multi-seg mbufs.
>    dpdk-tests: Accept other configs in OVS_DPDK_START
>    dpdk-tests: End-to-end tests for multi-seg mbufs.
> 
>   Documentation/topics/dpdk/jumbo-frames.rst |  52 +++
>   Documentation/topics/dpdk/memory.rst       |  36 ++
>   NEWS                                       |   1 +
>   lib/dp-packet.c                            | 221 +++++++++++-
>   lib/dp-packet.h                            | 214 ++++++++++--
>   lib/dpdk.c                                 |   8 +
>   lib/netdev-dpdk.c                          | 244 +++++++++++---
>   lib/netdev-dpdk.h                          |   2 +
>   tests/automake.mk                          |  10 +-
>   tests/dpdk-packet-mbufs.at                 |   7 +
>   tests/system-dpdk-macros.at                |   6 +-
>   tests/system-dpdk-testsuite.at             |   1 +
>   tests/system-dpdk.at                       |  65 ++++
>   tests/test-dpdk-mbufs.c                    | 518 +++++++++++++++++++++++++++++
>   vswitchd/vswitch.xml                       |  22 ++
>   15 files changed, 1328 insertions(+), 79 deletions(-)
>   create mode 100644 tests/dpdk-packet-mbufs.at
>   create mode 100644 tests/test-dpdk-mbufs.c
> 



More information about the dev mailing list