[ovs-dev] [PATCH v5 00/14] Support multi-segment mbufs
Ian Stokes
ian.stokes at intel.com
Fri Jul 13 14:47:10 UTC 2018
On 7/11/2018 7:23 PM, Tiago Lam wrote:
> Overview
> ========
> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
> Multi-segment mbufs are typically used when the size of an mbuf is
> insufficient to contain the entirety of a packet's data. Instead, the
> data is split across numerous mbufs, each carrying a portion, or
> 'segment', of the packet data. mbufs are chained via their 'next'
> attribute (an mbuf pointer).
>
Thanks to all for the work on this series. I've pushed this to
dpdk_merge and it will be part of the pull request this week.
Ian
> Use Cases
> =========
> i. Handling oversized (guest-originated) frames, which are marked
> for hardware accelration/offload (TSO, for example).
>
> Packets which originate from a non-DPDK source may be marked for
> offload; as such, they may be larger than the permitted ingress
> interface's MTU, and may be stored in an oversized dp-packet. In
> order to transmit such packets over a DPDK port, their contents
> must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
> its current implementation, that function only copies data into
> a single mbuf; if the space available in the mbuf is exhausted,
> but not all packet data has been copied, then it is lost.
> Similarly, when cloning a DPDK mbuf, it must be considered
> whether that mbuf contains multiple segments. Both issues are
> resolved within this patchset.
>
> ii. Handling jumbo frames.
>
> While OvS already supports jumbo frames, it does so by increasing
> mbuf size, such that the entirety of a jumbo frame may be handled
> in a single mbuf. This is certainly the preferred, and most
> performant approach (and remains the default).
>
> Enabling multi-segment mbufs
> ============================
> Multi-segment and single-segment mbufs are mutually exclusive, and the
> user must decide on which approach to adopt on init. The introduction
> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>
> This is a global boolean value, which determines how jumbo frames are
> represented across all DPDK ports. In the absence of a user-supplied
> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
> mbufs must be explicitly enabled / single-segment mbufs remain the
> default.
>
> Setting the field is identical to setting existing DPDK-specific OVSDB
> fields:
>
> ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
> ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
> ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>
> Performance notes (based on v8)
> =================
> In order to test for regressions in performance, tests were run on top
> of master 88125d6 and v8 of this patchset, both with the multi-segment
> mbufs option enabled and disabled.
>
> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>
> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
> -------------------------------------------------------------
> p2p | 64 | ~22.7 | ~22.65 | ~18.3
> p2p | 1500 | ~1.6 | ~1.6 | ~1.6
> p2p | 7000 | ~0.36 | ~0.36 | ~0.36
> pvp | 64 | ~6.7 | ~6.7 | ~6.3
> pvp | 1500 | ~1.6 | ~1.6 | ~1.6
> pvp | 7000 | ~0.36 | ~0.36 | ~0.36
>
> Packet size is in bytes, while all packet rates are reported in mpps
> (aggregated).
>
> No noticeable regression has been observed (certainly everything is
> within the ± 5% margin of existing performance), aside from the 64B
> packet size case when multi-segment mbuf is enabled. This is
> expected, however, because of how Tx vectoriszed functions are
> incompatible with multi-segment mbufs on some PMDs. The PMD under
> use during these tests was the i40e (on a Intel X710 NIC), which
> indeed doesn't support vectorized Tx functions with multi-segment
> mbufs.
>
> ---
> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
> after free.");
> - Address Eelco's comments:
> - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
> leftover from rebase. Only call should be in dpdk_mp_get();
> - Remove NEWS line added by mistake during rebase (about adding
> experimental vhost zero copy support).
> - Address Ian's comments:
> - Drop patch 01 from previous series entirely;
> - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
> dpdk_mp_create() to get the correct "mbuf_size" to be used;
> - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
> mbufs is enabled, in which case it calculates the new "mbuf_size" to be
> used;
> - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
> conditionally.
> - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the current
> DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes using
> the shared mempool model (runs out of memory);
> - Add new examples for when multi-segment mbufs are enabled in
> topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
> (now patch 11/14).
>
> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
> command."):
> - A new patch (now 01/15) has been introduced to differentiate between
> MTU and mbuf size when creating the mempool. This is because as part of
> the new support for both per port and shared mempools, mempools were
> being reused based on the mbuf size, and for multi-segment mbufs the
> mbuf size can end up being the same for all mbufs;
> - A couple of other patches (now 02/15 and 12/15) have been modified as
> part of the rebase, but only to adapt to the code changes to "Support
> both shared and per port mempools.", no functionality should have been
> changed.
>
> v3:
> - Address Eelco's comment:
> - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
> wasn't correctly asserting that the passed 'v' was smaller than the
> first mbuf's buf_len.
>
> v2:
> - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
> error.");
> - Address Eelco's comments:
> - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead of
> casting;
> - Remove unneeded variable in dp_packet_set_size(), pointing to the
> first mbuf in the chain;
> - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
> always true for DPBUF_DPDK packets;
> - Assert in __packet_set_data() to enforce that data_off never goes
> beyond the first mbuf in the chain.
>
> v1:
> - v8 should have been sent as v1 really, as that's usually the approach
> followed in OvS. That clearly didn't happen, so restarting the series
> now. This also helps making it clear it is no longer an RFC series;
> - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
> output.");
> - Address Eelco's comments:
> - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
> `pkt_len` and `data_len` can't be set to values bigger than the
> available space. Also fix assigment to `data_len` which was
> incorrectly being set to just`pkt_len`;
> - Improve `nonpmd_mp_mutex` comment with a better explanation as to
> why the mutex is needed;
> - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
> `DPBUF_DPDK` packets;
> - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
> - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read() for
> copying `DPBUF_DPDK` packets' data. Also, change it to return
> appropriate and meaningful errors, instead of just "0" or "1";
> - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
> and reuse dp_packet_mbuf_write() instead of manual copy;
> - Add note vswitchd/vswitch.xml to make it clear the enabling of
> dpdk-multi-seg-mbufs requires a restart;
> - Change dpdk_mp_create() to increase # mbufs used under the multi-segment
> mbufs approach;
> - Increase test coverage by adding "end-to-end" tests that verify that
> "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
> successfully sent out;
> - Some helper funcs such as dp_packet_tail() and dp_packet_end() were moved
> back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
> minimise changes;
> - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
> after further testing;
> - Structure changes:
> - Drop patch 07/13 which is now unneeded;
> - Two more patches added for extra test coverage. This is what accounts
> for the increase in size (+1 patch) in the series.
>
> v8 (non-RFC):
> - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
> temporary directory (also for RHEL)");
> - Address Ciara's and Ilya's comments:
> - Drop the dp_packet_mbuf_tail() function and use only the
> already existing dp_packet_tail();
> - Fix bug in dpdk_do_tx_copy() where the error return from
> dpdk_prep_tx_buf() was being wrongly checked;
> - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
> rte_pktmbuf_alloc() and rte_pktmbuf_free();
> - Fix some other code style and duplication issues pointed out.
> - Refactored dp_packet_shift(), dp_packet_resize__() and
> dp_packet_put*() functions to work within the bounds of existing
> mbufs only;
> - Fix dp_packet_clear() which wasn't correctly clearing / freeing
> other mbufs in the chain for chains with more than a single mbuf;
> - dp_packet layer functions (such as dp_packet_l3()) now check if
> the header is within the first mbuf, when using mbufs;
> - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
> was refactored to use free_dpdk_buf();
> - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
> leading to memory corruption;
> - Modified the added tests to account for some of the above changes;
> - Run performance tests, compiling results and adding them to the
> cover letter;
> - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
> together with a "Performance notes" sub-section reflecting the
> findings mentioned above in the cover letter.
>
> v7: - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
> - Address Ilya comments;
> - Fix non-DPDK build;
> - Serialise the access of non pmds to allocation and free of mbufs by
> using a newly introduced mutex.
> - Add a new set of tests that integrates with the recently added DPDK
> testsuite. These focus on allocating dp_packets, with a single or
> multiple mbufs, from an instantiated mempool and performing several
> operations on those, verifying if the data at the end matches what's
> expected;
> - Fix bugs found by the new tests:
> - dp_packet_shift() wasn't taking into account shift lefts;
> - dp_packet_resize__() was misusing and miscalculating the tailrooms
> and headrooms, ending up calculating the wrong number of segments
> that needed allocation;
> - An mbuf's end was being miscalculated both in dp_packet_tail,
> dp_packet_mbuf_tail() and dp_packet_end();
> - dp_packet_set_size() was not updating the number of chained segments
> 'nb_segs';
> - Add support for multi-seg mbufs in dp_packet_clear().
>
> v6: - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
> reuse.");
> - Further improve dp_packet_put_uninit() and dp_packet_shift() to
> support multi-seg mbufs;
> - Add support for multi-seg mbufs in dp_packet_l4_size() and
> improve other helper funcs, such as dp_packet_tail() and dp_
> packet_tailroom().
> - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
> put_zeros(), as well as dp_packet_resize__() - allocating new
> mbufs and linking them together;
> Restructured patchset:
> - Squash patch 5 into patch 6, since they were both related to
> copying data while handling multi-seg mbufs;
> - Split patch 4 into two separate patches - one that introduces the
> changes in helper functions to deal with multi-seg mbufs and
> two others for the shift() and put_uninit() functionality;
> - Move patch 4 to before patch 3, so that ihelper functions come
> before functionality improvement that rely on those helpers.
>
> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
> from getting redundantly incremented");
> - Sugesh's comments have been addressed:
> - Changed dp_packet_set_data() and dp_packet_set_size() logic to
> make them independent of each other;
> - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
> size() are independent;
> - dp_packet_clone_with_headroom() now has split functions for
> handling DPDK sourced packets and non-DPDK packets;
> - Modified various functions in dp-packet.h to account for multi-seg
> mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
> and dp_packet_at();
> - Added support for shifting packet data in multi-seg mbufs, using
> dp_packet_shift();
> - Fixed some minor inconsistencies.
>
> Note that some of the changes in v5 have been contributed by Mark
> Kavanagh as well.
>
> v4: - restructure patchset
> - account for 128B ARM cacheline when sizing mbufs
>
> Mark Kavanagh (4):
> netdev-dpdk: fix mbuf sizing
> dp-packet: Init specific mbuf fields.
> netdev-dpdk: copy large packet to multi-seg. mbufs
> netdev-dpdk: support multi-segment jumbo frames
>
> Michael Qiu (1):
> dp-packet: copy data from multi-seg. DPDK mbuf
>
> Tiago Lam (9):
> dp-packet: Fix allocated size on DPDK init.
> netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
> dp-packet: Fix data_len handling multi-seg mbufs.
> dp-packet: Handle multi-seg mbufs in helper funcs.
> dp-packet: Handle multi-seg mubfs in shift() func.
> dp-packet: Handle multi-seg mbufs in resize__().
> dpdk-tests: Add uni-tests for multi-seg mbufs.
> dpdk-tests: Accept other configs in OVS_DPDK_START
> dpdk-tests: End-to-end tests for multi-seg mbufs.
>
> Documentation/topics/dpdk/jumbo-frames.rst | 52 +++
> Documentation/topics/dpdk/memory.rst | 36 ++
> NEWS | 1 +
> lib/dp-packet.c | 221 +++++++++++-
> lib/dp-packet.h | 214 ++++++++++--
> lib/dpdk.c | 8 +
> lib/netdev-dpdk.c | 244 +++++++++++---
> lib/netdev-dpdk.h | 2 +
> tests/automake.mk | 10 +-
> tests/dpdk-packet-mbufs.at | 7 +
> tests/system-dpdk-macros.at | 6 +-
> tests/system-dpdk-testsuite.at | 1 +
> tests/system-dpdk.at | 65 ++++
> tests/test-dpdk-mbufs.c | 518 +++++++++++++++++++++++++++++
> vswitchd/vswitch.xml | 22 ++
> 15 files changed, 1328 insertions(+), 79 deletions(-)
> create mode 100644 tests/dpdk-packet-mbufs.at
> create mode 100644 tests/test-dpdk-mbufs.c
>
More information about the dev
mailing list