[ovs-dev] [PATCH v4 3/3] netdev-dpdk: Add TCP Segmentation Offload support
Stokes, Ian
ian.stokes at intel.com
Fri Jan 17 14:23:21 UTC 2020
On 1/16/2020 5:00 PM, Flavio Leitner wrote:
> Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
> the network stack to delegate the TCP segmentation to the NIC reducing
> the per packet CPU overhead.
>
> A guest using vhostuser interface with TSO enabled can send TCP packets
> much bigger than the MTU, which saves CPU cycles normally used to break
> the packets down to MTU size and to calculate checksums.
>
> It also saves CPU cycles used to parse multiple packets/headers during
> the packet processing inside virtual switch.
>
> If the destination of the packet is another guest in the same host, then
> the same big packet can be sent through a vhostuser interface skipping
> the segmentation completely. However, if the destination is not local,
> the NIC hardware is instructed to do the TCP segmentation and checksum
> calculation.
>
> It is recommended to check if NIC hardware supports TSO before enabling
> the feature, which is off by default. For additional information please
> check the userspace-tso.rst document.
Thanks for the patch, Flavio.
Ciara has validated the series and no issues were found on our side.
@Ilya, any concerns on your side? Any reason to block this from merging?
Regards
Ian
>
> Signed-off-by: Flavio Leitner <fbl at sysclose.org>
> ---
> Documentation/automake.mk | 1 +
> Documentation/topics/index.rst | 1 +
> Documentation/topics/userspace-tso.rst | 98 +++++++
> NEWS | 1 +
> lib/automake.mk | 2 +
> lib/conntrack.c | 29 +-
> lib/dp-packet.h | 186 +++++++++++-
> lib/ipf.c | 32 +-
> lib/netdev-dpdk.c | 349 +++++++++++++++++++---
> lib/netdev-linux-private.h | 5 +
> lib/netdev-linux.c | 386 ++++++++++++++++++++++---
> lib/netdev-provider.h | 9 +
> lib/netdev.c | 78 ++++-
> lib/userspace-tso.c | 48 +++
> lib/userspace-tso.h | 23 ++
> vswitchd/bridge.c | 2 +
> vswitchd/vswitch.xml | 17 ++
> 17 files changed, 1143 insertions(+), 124 deletions(-)
> create mode 100644 Documentation/topics/userspace-tso.rst
> create mode 100644 lib/userspace-tso.c
> create mode 100644 lib/userspace-tso.h
>
> Changelog:
> - v4
> * rebased on top of master (recvmmsg)
> * fixed URL in doc to point to 19.11
> * renamed tso to userspace-tso
> * renamed the option to userspace-tso-enable
> * removed a prototype left over from v2
> * fixed function style declaration
> * renamed dp_packet_hwol_tx_ip_checksum to dp_packet_hwol_tx_ipv4_checksum
> * dp_packet_hwol_tx_ipv4_checksum now checks for PKT_TX_IPV4.
> * account for drops while prepping the batch for TX.
> * don't prep the batch for TX if TSO is disabled.
> * simplified setsockopt error checking
> * fixed af_packet_sock error checking to not call setsockopt on
> closed sockets.
> * fixed ol_flags comment.
> * used VLOG_ERR_BUF() to pass error messages.
> * fixed packet leak at netdev_send_prepare_batch()
> * added a coverage counter to account drops while preparing a batch
> at netdev.c
> * fixed netdev_send() to not call ->send() if the batch is empty.
> * fixed packet leak at netdev_push_header and account for the drops.
> * removed DPDK requirement to enable userspace TSO support.
> * fixed parameter documentation in vswitch.xml.
> * renamed tso.rst to userspace-tso.rst and moved to topics/
> * added comments documenting the functions in dp-packet.h
> * fixed dp_packet_hwol_is_tso to check only PKT_TX_TCP_SEG
>
> - v3
> * Improved the documentation.
> * Updated copyright year to 2020.
> * TSO offloaded msg now includes the netdev's name.
> * Added period at the end of all code comments.
> * Warn and drop encapsulation of TSO packets.
> * Fixed travis issue with restricted virtio types.
> * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
> which caused packet corruption.
> * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
> PKT_TX_IP_CKSUM only for IPv4 packets.
>
>
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index f2ca17bad..22976a3cd 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -57,6 +57,7 @@ DOC_SOURCE = \
> Documentation/topics/ovsdb-replication.rst \
> Documentation/topics/porting.rst \
> Documentation/topics/tracing.rst \
> + Documentation/topics/userspace-tso.rst \
> Documentation/topics/windows.rst \
> Documentation/howto/index.rst \
> Documentation/howto/dpdk.rst \
> diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst
> index 34c4b10e0..08af3a24d 100644
> --- a/Documentation/topics/index.rst
> +++ b/Documentation/topics/index.rst
> @@ -50,5 +50,6 @@ OVS
> language-bindings
> testing
> tracing
> + userspace-tso
> idl-compound-indexes
> ovs-extensions
> diff --git a/Documentation/topics/userspace-tso.rst b/Documentation/topics/userspace-tso.rst
> new file mode 100644
> index 000000000..893c64839
> --- /dev/null
> +++ b/Documentation/topics/userspace-tso.rst
> @@ -0,0 +1,98 @@
> +..
> + Copyright 2020, Red Hat, Inc.
> +
> + Licensed under the Apache License, Version 2.0 (the "License"); you may
> + not use this file except in compliance with the License. You may obtain
> + a copy of the License at
> +
> + http://www.apache.org/licenses/LICENSE-2.0
> +
> + Unless required by applicable law or agreed to in writing, software
> + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> + License for the specific language governing permissions and limitations
> + under the License.
> +
> + Convention for heading levels in Open vSwitch documentation:
> +
> + ======= Heading 0 (reserved for the title in a document)
> + ------- Heading 1
> + ~~~~~~~ Heading 2
> + +++++++ Heading 3
> + ''''''' Heading 4
> +
> + Avoid deeper levels because they do not render well.
> +
> +========================
> +Userspace Datapath - TSO
> +========================
> +
> +**Note:** This feature is considered experimental.
> +
> +TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
> +of an oversized TCP segment to the underlying physical NIC. Offload of frame
> +segmentation achieves computational savings in the core, freeing up CPU cycles
> +for more useful work.
> +
> +A common use case for TSO is when using virtualization, where traffic that's
> +coming in from a VM can offload the TCP segmentation, thus avoiding the
> +fragmentation in software. Additionally, if the traffic is headed to a VM
> +within the same host, further optimization can be expected. As the traffic never
> +leaves the machine, no MTU needs to be accounted for, and thus no segmentation
> +and checksum calculations are required, which saves yet more cycles. Only when
> +the traffic actually leaves the host does the segmentation need to happen, in
> +which case it is performed by the egress NIC.
> +
> +To use this feature, the NIC must support TSO; consult your controller's
> +datasheet for compatibility. The NIC must also have an associated DPDK
> +Poll Mode Driver (PMD) which supports `TSO`. For a list of features per PMD,
> +refer to the `DPDK documentation`__.
> +
> +__ https://doc.dpdk.org/guides-19.11/nics/overview.html
> +
> +Enabling TSO
> +~~~~~~~~~~~~
> +
> +TSO support may be enabled via a global config value
> +``userspace-tso-enable``. Setting this to ``true`` enables TSO support for
> +all ports::
> +
> + $ ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true
> +
> +The default value is ``false``.
> +
> +Changing ``userspace-tso-enable`` requires restarting the daemon.
> +
> +When using :doc:`vHost User ports <dpdk/vhost-user>`, TSO may be enabled
> +as follows.
> +
> +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
> +connection is established, `TSO` is thus advertised to the guest as an
> +available feature:
> +
> +1. QEMU Command Line Parameter::
> +
> + $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
> + ...
> + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
> + csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
> + ...
> +
> +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be
> + used to enable the same::
> +
> + $ ethtool -K eth0 sg on # scatter-gather is a prerequisite for TSO
> + $ ethtool -K eth0 tso on
> + $ ethtool -k eth0
> +
> +Limitations
> +~~~~~~~~~~~
> +
> +The current OvS userspace `TSO` implementation supports flat and VLAN networks
> +only (i.e. there is no support for `TSO` over tunneled connections such as
> +VxLAN, GRE, or IPinIP).
> +
> +There is no software fallback for TSO, so all ports attached to the
> +datapath must support TSO; packets using the feature are dropped on ports
> +without TSO support. That also means guests using vhost-user in client
> +mode will receive TSO packets regardless of whether TSO is enabled or
> +disabled within the guest.
> diff --git a/NEWS b/NEWS
> index e8d662a0c..586d81173 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -26,6 +26,7 @@ Post-v2.12.0
> * DPDK ring ports (dpdkr) are deprecated and will be removed in next
> releases.
> * Add support for DPDK 19.11.
> + * Add experimental support for TSO.
> - RSTP:
> * The rstp_statistics column in Port table will only be updated every
> stats-update-interval configured in Open_vSwtich table.
> diff --git a/lib/automake.mk b/lib/automake.mk
> index ebf714501..b80de9fc4 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -304,6 +304,8 @@ lib_libopenvswitch_la_SOURCES = \
> lib/tnl-neigh-cache.h \
> lib/tnl-ports.c \
> lib/tnl-ports.h \
> + lib/userspace-tso.c \
> + lib/userspace-tso.h \
> lib/netdev-native-tnl.c \
> lib/netdev-native-tnl.h \
> lib/token-bucket.c \
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index b80080e72..742d2ad4f 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
> if (hwol_bad_l3_csum) {
> ok = false;
> } else {
> - bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
> + bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
> + || dp_packet_hwol_tx_ipv4_checksum(pkt);
> /* Validate the checksum only when hwol is not supported. */
> ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL,
> !hwol_good_l3_csum);
> @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
> if (ok) {
> bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
> if (!hwol_bad_l4_csum) {
> - bool hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
> + bool hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
> + || dp_packet_hwol_tx_l4_checksum(pkt);
> /* Validate the checksum only when hwol is not supported. */
> if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
> &ctx->icmp_related, l3, !hwol_good_l4_csum,
> @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> }
> if (seq_skew) {
> ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
> - l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> - l3_hdr->ip_tot_len, htons(ip_len));
> + if (!dp_packet_hwol_tx_ipv4_checksum(pkt)) {
> + l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> + l3_hdr->ip_tot_len,
> + htons(ip_len));
> + }
> l3_hdr->ip_tot_len = htons(ip_len);
> }
> }
> @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> }
>
> th->tcp_csum = 0;
> - if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> - th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> - dp_packet_l4_size(pkt));
> - } else {
> - uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> - th->tcp_csum = csum_finish(
> - csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> + if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
> + if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> + th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> + dp_packet_l4_size(pkt));
> + } else {
> + uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> + th->tcp_csum = csum_finish(
> + csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> + }
> }
>
> if (seq_skew) {
> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> index 133942155..3e995f505 100644
> --- a/lib/dp-packet.h
> +++ b/lib/dp-packet.h
> @@ -456,7 +456,7 @@ dp_packet_init_specific(struct dp_packet *p)
> {
> /* This initialization is needed for packets that do not come from DPDK
> * interfaces, when vswitchd is built with --with-dpdk. */
> - p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> + p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> p->mbuf.nb_segs = 1;
> p->mbuf.next = NULL;
> }
> @@ -519,6 +519,96 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
> b->mbuf.buf_len = s;
> }
>
> +/* Return true if packet 'b' offloads TCP segmentation. */
> +static inline bool
> +dp_packet_hwol_is_tso(const struct dp_packet *b)
> +{
> + return !!(b->mbuf.ol_flags & PKT_TX_TCP_SEG);
> +}
> +
> +/* Return true if packet 'b' is IPv4. The flag is required when
> + * offload is requested. */
> +static inline bool
> +dp_packet_hwol_is_ipv4(const struct dp_packet *b)
> +{
> + return !!(b->mbuf.ol_flags & PKT_TX_IPV4);
> +}
> +
> +/* Return the L4 cksum offload bitmask. */
> +static inline uint64_t
> +dp_packet_hwol_l4_mask(const struct dp_packet *b)
> +{
> + return b->mbuf.ol_flags & PKT_TX_L4_MASK;
> +}
> +
> +/* Return true if the packet 'b' offloads TCP checksum calculation. */
> +static inline bool
> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
> +{
> + return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM;
> +}
> +
> +/* Return true if the packet 'b' offloads UDP checksum calculation. */
> +static inline bool
> +dp_packet_hwol_l4_is_udp(struct dp_packet *b)
> +{
> + return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM;
> +}
> +
> +/* Return true if the packet 'b' offloads SCTP checksum calculation. */
> +static inline bool
> +dp_packet_hwol_l4_is_sctp(struct dp_packet *b)
> +{
> + return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM;
> +}
> +
> +/* Mark packet 'b' as IPv4, which is necessary when offload is used. */
> +static inline void
> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b)
> +{
> + b->mbuf.ol_flags |= PKT_TX_IPV4;
> +}
> +
> +/* Mark packet 'b' as IPv6, which is necessary when offload is used. */
> +static inline void
> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b)
> +{
> + b->mbuf.ol_flags |= PKT_TX_IPV6;
> +}
> +
> +/* Request TCP checksum offload for packet 'b'. It implies that
> + * packet 'b' is flagged as either IPv4 or IPv6. */
> +static inline void
> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b)
> +{
> + b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM;
> +}
> +
> +/* Request UDP checksum offload for packet 'b'. It implies that
> + * packet 'b' is flagged as either IPv4 or IPv6. */
> +static inline void
> +dp_packet_hwol_set_csum_udp(struct dp_packet *b)
> +{
> + b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM;
> +}
> +
> +/* Request SCTP checksum offload for packet 'b'. It implies that
> + * packet 'b' is flagged as either IPv4 or IPv6. */
> +static inline void
> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b)
> +{
> + b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM;
> +}
> +
> +/* Request TCP segmentation offload for packet 'b'. It implies that
> + * packet 'b' is flagged as either IPv4 or IPv6, and that TCP checksum
> + * offload is also flagged. */
> +static inline void
> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b)
> +{
> + b->mbuf.ol_flags |= PKT_TX_TCP_SEG;
> +}
> +
> /* Returns the RSS hash of the packet 'p'. Note that the returned value is
> * correct only if 'dp_packet_rss_valid(p)' returns true */
> static inline uint32_t
> @@ -648,6 +738,84 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
> b->allocated_ = s;
> }
>
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline bool
> +dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
> +{
> + return false;
> +}
> +
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline bool
> +dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
> +{
> + return false;
> +}
> +
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline uint64_t
> +dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
> +{
> + return 0;
> +}
> +
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline bool
> +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
> +{
> + return false;
> +}
> +
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline bool
> +dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
> +{
> + return false;
> +}
> +
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline bool
> +dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
> +{
> + return false;
> +}
> +
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline void
> +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline void
> +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline void
> +dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline void
> +dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline void
> +dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> +/* There is no implementation when the datapath is not DPDK-enabled. */
> +static inline void
> +dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED)
> +{
> +}
> +
> /* Returns the RSS hash of the packet 'p'. Note that the returned value is
> * correct only if 'dp_packet_rss_valid(p)' returns true */
> static inline uint32_t
> @@ -939,6 +1107,22 @@ dp_packet_batch_reset_cutlen(struct dp_packet_batch *batch)
> }
> }
>
> +/* Return true if the packet 'b' requested IPv4 checksum offload. */
> +static inline bool
> +dp_packet_hwol_tx_ipv4_checksum(const struct dp_packet *b)
> +{
> + return !!dp_packet_hwol_is_ipv4(b);
> +}
> +
> +/* Return true if the packet 'b' requested L4 checksum offload. */
> +static inline bool
> +dp_packet_hwol_tx_l4_checksum(const struct dp_packet *b)
> +{
> + return !!dp_packet_hwol_l4_mask(b);
> +}
> +
> #ifdef __cplusplus
> }
> #endif
> diff --git a/lib/ipf.c b/lib/ipf.c
> index 45c489122..14df04374 100644
> --- a/lib/ipf.c
> +++ b/lib/ipf.c
> @@ -433,9 +433,11 @@ ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
> len += rest_len;
> l3 = dp_packet_l3(pkt);
> ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
> - l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> - new_ip_frag_off);
> - l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> + if (!dp_packet_hwol_tx_ipv4_checksum(pkt)) {
> + l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> + new_ip_frag_off);
> + l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> + }
> l3->ip_tot_len = htons(len);
> l3->ip_frag_off = new_ip_frag_off;
> dp_packet_set_l2_pad_size(pkt, 0);
> @@ -606,6 +608,7 @@ ipf_is_valid_v4_frag(struct ipf *ipf, struct dp_packet *pkt)
> }
>
> if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
> + && !dp_packet_hwol_tx_ipv4_checksum(pkt)
> && csum(l3, ip_hdr_len) != 0)) {
> goto invalid_pkt;
> }
> @@ -1181,16 +1184,21 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
> } else {
> struct ip_header *l3_frag = dp_packet_l3(frag_0->pkt);
> struct ip_header *l3_reass = dp_packet_l3(pkt);
> - ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
> - ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
> - l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> - frag_ip, reass_ip);
> - l3_frag->ip_src = l3_reass->ip_src;
> + if (!dp_packet_hwol_tx_ipv4_checksum(frag_0->pkt)) {
> + ovs_be32 reass_ip =
> + get_16aligned_be32(&l3_reass->ip_src);
> + ovs_be32 frag_ip =
> + get_16aligned_be32(&l3_frag->ip_src);
> +
> + l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> + frag_ip, reass_ip);
> + reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> + frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> + l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> + frag_ip, reass_ip);
> + }
>
> - reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> - frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> - l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> - frag_ip, reass_ip);
> + l3_frag->ip_src = l3_reass->ip_src;
> l3_frag->ip_dst = l3_reass->ip_dst;
> }
>
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index d1469f6f2..48fd6c184 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -70,6 +70,7 @@
> #include "smap.h"
> #include "sset.h"
> #include "timeval.h"
> +#include "userspace-tso.h"
> #include "unaligned.h"
> #include "unixctl.h"
> #include "util.h"
> @@ -201,6 +202,8 @@ struct netdev_dpdk_sw_stats {
> uint64_t tx_qos_drops;
> /* Packet drops in ingress policer processing. */
> uint64_t rx_qos_drops;
> + /* Packet drops in HWOL processing. */
> + uint64_t tx_invalid_hwol_drops;
> };
>
> enum { DPDK_RING_SIZE = 256 };
> @@ -410,7 +413,8 @@ struct ingress_policer {
> enum dpdk_hw_ol_features {
> NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
> NETDEV_RX_HW_CRC_STRIP = 1 << 1,
> - NETDEV_RX_HW_SCATTER = 1 << 2
> + NETDEV_RX_HW_SCATTER = 1 << 2,
> + NETDEV_TX_TSO_OFFLOAD = 1 << 3,
> };
>
> /*
> @@ -992,6 +996,12 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
> conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
> }
>
> + if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> + conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
> + conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
> + conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
> + }
> +
> /* Limit configured rss hash functions to only those supported
> * by the eth device. */
> conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads;
> @@ -1093,6 +1103,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
> DEV_RX_OFFLOAD_TCP_CKSUM |
> DEV_RX_OFFLOAD_IPV4_CKSUM;
> + uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
> + DEV_TX_OFFLOAD_TCP_CKSUM |
> + DEV_TX_OFFLOAD_IPV4_CKSUM;
>
> rte_eth_dev_info_get(dev->port_id, &info);
>
> @@ -1119,6 +1132,14 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
> }
>
> + if (info.tx_offload_capa & tx_tso_offload_capa) {
> + dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> + } else {
> + dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
> + VLOG_WARN("Tx TSO offload is not supported on %s port "
> + DPDK_PORT_ID_FMT, netdev_get_name(&dev->up), dev->port_id);
> + }
> +
> n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
> n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
>
> @@ -1369,14 +1390,16 @@ netdev_dpdk_vhost_construct(struct netdev *netdev)
> goto out;
> }
>
> - err = rte_vhost_driver_disable_features(dev->vhost_id,
> - 1ULL << VIRTIO_NET_F_HOST_TSO4
> - | 1ULL << VIRTIO_NET_F_HOST_TSO6
> - | 1ULL << VIRTIO_NET_F_CSUM);
> - if (err) {
> - VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> - "port: %s\n", name);
> - goto out;
> + if (!userspace_tso_enabled()) {
> + err = rte_vhost_driver_disable_features(dev->vhost_id,
> + 1ULL << VIRTIO_NET_F_HOST_TSO4
> + | 1ULL << VIRTIO_NET_F_HOST_TSO6
> + | 1ULL << VIRTIO_NET_F_CSUM);
> + if (err) {
> + VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> + "port: %s\n", name);
> + goto out;
> + }
> }
>
> err = rte_vhost_driver_start(dev->vhost_id);
> @@ -1711,6 +1734,11 @@ netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
> } else {
> smap_add(args, "rx_csum_offload", "false");
> }
> + if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> + smap_add(args, "tx_tso_offload", "true");
> + } else {
> + smap_add(args, "tx_tso_offload", "false");
> + }
> smap_add(args, "lsc_interrupt_mode",
> dev->lsc_interrupt_mode ? "true" : "false");
> }
> @@ -2138,6 +2166,67 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
> rte_free(rx);
> }
>
> +/* Prepare the packet for HWOL.
> + * Return true if the packet is OK to continue. */
> +static bool
> +netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
> +{
> + struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
> +
> + if (mbuf->ol_flags & PKT_TX_L4_MASK) {
> + mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char *)dp_packet_eth(pkt);
> + mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char *)dp_packet_l3(pkt);
> + mbuf->outer_l2_len = 0;
> + mbuf->outer_l3_len = 0;
> + }
> +
> + if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
> + struct tcp_header *th = dp_packet_l4(pkt);
> +
> + if (!th) {
> + VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
> + " pkt len: %"PRIu32"", dev->up.name, mbuf->pkt_len);
> + return false;
> + }
> +
> + mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
> + mbuf->ol_flags |= PKT_TX_TCP_CKSUM;
> + mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
> +
> + if (mbuf->ol_flags & PKT_TX_IPV4) {
> + mbuf->ol_flags |= PKT_TX_IP_CKSUM;
> + }
> + }
> + return true;
> +}
> +
> +/* Prepare a batch for HWOL.
> + * Return the number of good packets in the batch. */
> +static int
> +netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> + int pkt_cnt)
> +{
> + int i = 0;
> + int cnt = 0;
> + struct rte_mbuf *pkt;
> +
> + /* Prepare and filter bad HWOL packets. */
> + for (i = 0; i < pkt_cnt; i++) {
> + pkt = pkts[i];
> + if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) {
> + rte_pktmbuf_free(pkt);
> + continue;
> + }
> +
> + if (OVS_UNLIKELY(i != cnt)) {
> + pkts[cnt] = pkt;
> + }
> + cnt++;
> + }
> +
> + return cnt;
> +}
> +
> /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'. Takes ownership of
> * 'pkts', even in case of failure.
> *
> @@ -2147,11 +2236,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
> struct rte_mbuf **pkts, int cnt)
> {
> uint32_t nb_tx = 0;
> + uint16_t nb_tx_prep = cnt;
> +
> + if (userspace_tso_enabled()) {
> + nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> + if (nb_tx_prep != cnt) {
> + VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
> + "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
> + cnt, rte_strerror(rte_errno));
> + }
> + }
>
> - while (nb_tx != cnt) {
> + while (nb_tx != nb_tx_prep) {
> uint32_t ret;
>
> - ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
> + ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
> + nb_tx_prep - nb_tx);
> if (!ret) {
> break;
> }
> @@ -2437,11 +2537,14 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> int cnt = 0;
> struct rte_mbuf *pkt;
>
> + /* Filter oversized packets, unless they are marked for TSO. */
> for (i = 0; i < pkt_cnt; i++) {
> pkt = pkts[i];
> - if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
> - VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len %d",
> - dev->up.name, pkt->pkt_len, dev->max_packet_len);
> + if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
> + && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
> + VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
> + "max_packet_len %d", dev->up.name, pkt->pkt_len,
> + dev->max_packet_len);
> rte_pktmbuf_free(pkt);
> continue;
> }
> @@ -2463,7 +2566,8 @@ netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk *dev,
> {
> int dropped = sw_stats_add->tx_mtu_exceeded_drops +
> sw_stats_add->tx_qos_drops +
> - sw_stats_add->tx_failure_drops;
> + sw_stats_add->tx_failure_drops +
> + sw_stats_add->tx_invalid_hwol_drops;
> struct netdev_stats *stats = &dev->stats;
> int sent = attempted - dropped;
> int i;
> @@ -2482,6 +2586,7 @@ netdev_dpdk_vhost_update_tx_counters(struct netdev_dpdk *dev,
> sw_stats->tx_failure_drops += sw_stats_add->tx_failure_drops;
> sw_stats->tx_mtu_exceeded_drops += sw_stats_add->tx_mtu_exceeded_drops;
> sw_stats->tx_qos_drops += sw_stats_add->tx_qos_drops;
> + sw_stats->tx_invalid_hwol_drops += sw_stats_add->tx_invalid_hwol_drops;
> }
> }
>
> @@ -2513,8 +2618,15 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
> }
>
> + sw_stats_add.tx_invalid_hwol_drops = cnt;
> + if (userspace_tso_enabled()) {
> + cnt = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
> + }
> +
> + sw_stats_add.tx_invalid_hwol_drops -= cnt;
> + sw_stats_add.tx_mtu_exceeded_drops = cnt;
> cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
> - sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
> + sw_stats_add.tx_mtu_exceeded_drops -= cnt;
>
> /* Check whether QoS has been configured for the netdev. */
> sw_stats_add.tx_qos_drops = cnt;
> @@ -2562,6 +2674,121 @@ out:
> }
> }
>
> +static void
> +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
> +{
> + rte_free(opaque);
> +}
> +
> +static struct rte_mbuf *
> +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
> +{
> + uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
> + struct rte_mbuf_ext_shared_info *shinfo = NULL;
> + uint16_t buf_len;
> + void *buf;
> +
> + if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo)) {
> + shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> + } else {
> + total_len += sizeof(*shinfo) + sizeof(uintptr_t);
> + total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> + }
> +
> + if (unlikely(total_len > UINT16_MAX)) {
> + VLOG_ERR("Can't copy packet: too big %u", total_len);
> + return NULL;
> + }
> +
> + buf_len = total_len;
> + buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> + if (unlikely(buf == NULL)) {
> + VLOG_ERR("Failed to allocate memory using rte_malloc: %u", buf_len);
> + return NULL;
> + }
> +
> + /* Initialize shinfo. */
> + if (shinfo) {
> + shinfo->free_cb = netdev_dpdk_extbuf_free;
> + shinfo->fcb_opaque = buf;
> + rte_mbuf_ext_refcnt_set(shinfo, 1);
> + } else {
> + shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> + netdev_dpdk_extbuf_free,
> + buf);
> + if (unlikely(shinfo == NULL)) {
> + rte_free(buf);
> + VLOG_ERR("Failed to initialize shared info for mbuf while "
> + "attempting to attach an external buffer.");
> + return NULL;
> + }
> + }
> +
> + rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
> + shinfo);
> + rte_pktmbuf_reset_headroom(pkt);
> +
> + return pkt;
> +}
> +
> +static struct rte_mbuf *
> +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
> +{
> + struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> +
> + if (OVS_UNLIKELY(!pkt)) {
> + return NULL;
> + }
> +
> + dp_packet_init_specific((struct dp_packet *)pkt);
> + if (rte_pktmbuf_tailroom(pkt) >= data_len) {
> + return pkt;
> + }
> +
> + if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
> + return pkt;
> + }
> +
> + rte_pktmbuf_free(pkt);
> +
> + return NULL;
> +}
> +
> +static struct dp_packet *
> +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
> +{
> + struct rte_mbuf *mbuf_dest;
> + struct dp_packet *pkt_dest;
> + uint32_t pkt_len;
> +
> + pkt_len = dp_packet_size(pkt_orig);
> + mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
> + if (OVS_UNLIKELY(mbuf_dest == NULL)) {
> + return NULL;
> + }
> +
> + pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
> + memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
> + dp_packet_set_size(pkt_dest, pkt_len);
> +
> + mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
> + mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
> + mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
> + ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
> +
> + memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
> + sizeof(struct dp_packet) - offsetof(struct dp_packet, l2_pad_size));
> +
> + if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
> + mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
> + - (char *)dp_packet_eth(pkt_dest);
> + mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
> + - (char *) dp_packet_l3(pkt_dest);
> + }
> +
> + return pkt_dest;
> +}
> +
> /* Tx function. Transmit packets indefinitely */
> static void
> dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> @@ -2575,7 +2802,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
> #endif
> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> - struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
> + struct dp_packet *pkts[PKT_ARRAY_SIZE];
> struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
> uint32_t cnt = batch_cnt;
> uint32_t dropped = 0;
> @@ -2596,34 +2823,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> struct dp_packet *packet = batch->packets[i];
> uint32_t size = dp_packet_size(packet);
>
> - if (OVS_UNLIKELY(size > dev->max_packet_len)) {
> - VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
> - size, dev->max_packet_len);
> -
> + if (size > dev->max_packet_len
> + && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
> + VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
> + dev->max_packet_len);
> mtu_drops++;
> continue;
> }
>
> - pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> + pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
> if (OVS_UNLIKELY(!pkts[txcnt])) {
> dropped = cnt - i;
> break;
> }
>
> - /* We have to do a copy for now */
> - memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
> - dp_packet_data(packet), size);
> - dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
> -
> txcnt++;
> }
>
> if (OVS_LIKELY(txcnt)) {
> if (dev->type == DPDK_DEV_VHOST) {
> - __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts,
> - txcnt);
> + __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
> } else {
> - tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt);
> + tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
> + (struct rte_mbuf **)pkts,
> + txcnt);
> }
> }
>
> @@ -2676,26 +2899,33 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
> dp_packet_delete_batch(batch, true);
> } else {
> struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
> - int tx_cnt, dropped;
> - int tx_failure, mtu_drops, qos_drops;
> + int dropped;
> + int tx_failure, mtu_drops, qos_drops, hwol_drops;
> int batch_cnt = dp_packet_batch_size(batch);
> struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
>
> - tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
> - mtu_drops = batch_cnt - tx_cnt;
> - qos_drops = tx_cnt;
> - tx_cnt = netdev_dpdk_qos_run(dev, pkts, tx_cnt, true);
> - qos_drops -= tx_cnt;
> + hwol_drops = batch_cnt;
> + if (userspace_tso_enabled()) {
> + batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
> + }
> + hwol_drops -= batch_cnt;
> + mtu_drops = batch_cnt;
> + batch_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
> + mtu_drops -= batch_cnt;
> + qos_drops = batch_cnt;
> + batch_cnt = netdev_dpdk_qos_run(dev, pkts, batch_cnt, true);
> + qos_drops -= batch_cnt;
>
> - tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, tx_cnt);
> + tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, batch_cnt);
>
> - dropped = tx_failure + mtu_drops + qos_drops;
> + dropped = tx_failure + mtu_drops + qos_drops + hwol_drops;
> if (OVS_UNLIKELY(dropped)) {
> rte_spinlock_lock(&dev->stats_lock);
> dev->stats.tx_dropped += dropped;
> sw_stats->tx_failure_drops += tx_failure;
> sw_stats->tx_mtu_exceeded_drops += mtu_drops;
> sw_stats->tx_qos_drops += qos_drops;
> + sw_stats->tx_invalid_hwol_drops += hwol_drops;
> rte_spinlock_unlock(&dev->stats_lock);
> }
> }
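The staged drop accounting in the hunk above (hwol → MTU → QoS, each counter charged with what its stage removed) can be sketched in isolation. The stage "filters" below are stand-ins that just cap the running count; they are illustrative only, not the real netdev_dpdk checks.

```c
#include <assert.h>

/* Each stage consumes the running packet count; its drop counter is
 * initialized to the input count and decremented by the stage output,
 * so it ends up holding exactly what that stage removed. */
struct drop_stats {
    int hwol_drops, mtu_drops, qos_drops;
};

static int
run_tx_stages(int cnt, int hwol_pass, int mtu_pass, int qos_pass,
              struct drop_stats *s)
{
    s->hwol_drops = cnt;
    cnt = hwol_pass < cnt ? hwol_pass : cnt;  /* hypothetical hwol filter */
    s->hwol_drops -= cnt;

    s->mtu_drops = cnt;
    cnt = mtu_pass < cnt ? mtu_pass : cnt;    /* hypothetical MTU filter */
    s->mtu_drops -= cnt;

    s->qos_drops = cnt;
    cnt = qos_pass < cnt ? qos_pass : cnt;    /* hypothetical QoS filter */
    s->qos_drops -= cnt;

    return cnt;
}
```

The total drops always equal the difference between the batch size and the transmitted count, which is what keeps the sw_stats counters consistent.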
> @@ -3011,7 +3241,8 @@ netdev_dpdk_get_sw_custom_stats(const struct netdev *netdev,
> SW_CSTAT(tx_failure_drops) \
> SW_CSTAT(tx_mtu_exceeded_drops) \
> SW_CSTAT(tx_qos_drops) \
> - SW_CSTAT(rx_qos_drops)
> + SW_CSTAT(rx_qos_drops) \
> + SW_CSTAT(tx_invalid_hwol_drops)
>
> #define SW_CSTAT(NAME) + 1
> custom_stats->size = SW_CSTATS;
> @@ -4874,6 +5105,12 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
>
> rte_free(dev->tx_q);
> err = dpdk_eth_dev_init(dev);
> + if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> + }
> +
> dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
> if (!dev->tx_q) {
> err = ENOMEM;
> @@ -4903,6 +5140,11 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
> dev->tx_q[0].map = 0;
> }
>
> + if (userspace_tso_enabled()) {
> + dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> + VLOG_DBG("%s: TSO enabled on vhost port", netdev_get_name(&dev->up));
> + }
> +
> netdev_dpdk_remap_txqs(dev);
>
> err = netdev_dpdk_mempool_configure(dev);
> @@ -4975,6 +5217,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
> vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> }
>
> + /* Enable External Buffers if TCP Segmentation Offload is enabled. */
> + if (userspace_tso_enabled()) {
> + vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
> + }
> +
> err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
> if (err) {
> VLOG_ERR("vhost-user device setup failure for device %s\n",
> @@ -4999,14 +5246,20 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
> goto unlock;
> }
>
> - err = rte_vhost_driver_disable_features(dev->vhost_id,
> - 1ULL << VIRTIO_NET_F_HOST_TSO4
> - | 1ULL << VIRTIO_NET_F_HOST_TSO6
> - | 1ULL << VIRTIO_NET_F_CSUM);
> - if (err) {
> - VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> - "client port: %s\n", dev->up.name);
> - goto unlock;
> + if (userspace_tso_enabled()) {
> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> + netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> + } else {
> + err = rte_vhost_driver_disable_features(dev->vhost_id,
> + 1ULL << VIRTIO_NET_F_HOST_TSO4
> + | 1ULL << VIRTIO_NET_F_HOST_TSO6
> + | 1ULL << VIRTIO_NET_F_CSUM);
> + if (err) {
> + VLOG_ERR("rte_vhost_driver_disable_features failed for "
> + "vhost user client port: %s\n", dev->up.name);
> + goto unlock;
> + }
> }
>
> err = rte_vhost_driver_start(dev->vhost_id);
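When TSO is off, the hunk above masks out the TSO/checksum virtio-net features before starting the driver. A small sketch of how that mask is composed, using the standard feature bit numbers from the virtio specification (redefined here for illustration rather than taken from a header):

```c
#include <assert.h>
#include <stdint.h>

/* Virtio-net feature bit numbers, per the virtio specification. */
#define VIRTIO_NET_F_CSUM       0
#define VIRTIO_NET_F_HOST_TSO4  11
#define VIRTIO_NET_F_HOST_TSO6  12

/* Features to withhold from the guest when userspace TSO is disabled:
 * host TSO for IPv4/IPv6 and checksum offload. */
static uint64_t
tso_disable_mask(void)
{
    return (1ULL << VIRTIO_NET_F_HOST_TSO4)
           | (1ULL << VIRTIO_NET_F_HOST_TSO6)
           | (1ULL << VIRTIO_NET_F_CSUM);
}
```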
> diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> index f08159aa7..9dbc67658 100644
> --- a/lib/netdev-linux-private.h
> +++ b/lib/netdev-linux-private.h
> @@ -27,6 +27,7 @@
> #include <stdint.h>
> #include <stdbool.h>
>
> +#include "dp-packet.h"
> #include "netdev-afxdp.h"
> #include "netdev-afxdp-pool.h"
> #include "netdev-provider.h"
> @@ -37,10 +38,13 @@
>
> struct netdev;
>
> +#define LINUX_RXQ_TSO_MAX_LEN 65536
> +
> struct netdev_rxq_linux {
> struct netdev_rxq up;
> bool is_tap;
> int fd;
> + char *aux_bufs[NETDEV_MAX_BURST]; /* Batch of preallocated TSO buffers. */
> };
>
> int netdev_linux_construct(struct netdev *);
> @@ -92,6 +96,7 @@ struct netdev_linux {
> int tap_fd;
> bool present; /* If the device is present in the namespace */
> uint64_t tx_dropped; /* tap device can drop if the iface is down */
> + uint64_t rx_dropped; /* Packets dropped while recv from kernel. */
>
> /* LAG information. */
> bool is_lag_master; /* True if the netdev is a LAG master. */
> diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> index 41d1e9273..c308abf54 100644
> --- a/lib/netdev-linux.c
> +++ b/lib/netdev-linux.c
> @@ -29,16 +29,18 @@
> #include <linux/filter.h>
> #include <linux/gen_stats.h>
> #include <linux/if_ether.h>
> +#include <linux/if_packet.h>
> #include <linux/if_tun.h>
> #include <linux/types.h>
> #include <linux/ethtool.h>
> #include <linux/mii.h>
> #include <linux/rtnetlink.h>
> #include <linux/sockios.h>
> +#include <linux/virtio_net.h>
> #include <sys/ioctl.h>
> #include <sys/socket.h>
> +#include <sys/uio.h>
> #include <sys/utsname.h>
> -#include <netpacket/packet.h>
> #include <net/if.h>
> #include <net/if_arp.h>
> #include <net/route.h>
> @@ -72,6 +74,7 @@
> #include "socket-util.h"
> #include "sset.h"
> #include "tc.h"
> +#include "userspace-tso.h"
> #include "timer.h"
> #include "unaligned.h"
> #include "openvswitch/vlog.h"
> @@ -237,6 +240,16 @@ enum {
> VALID_DRVINFO = 1 << 6,
> VALID_FEATURES = 1 << 7,
> };
> +
> +/* Use one for the packet buffer and another for the aux buffer to receive
> + * TSO packets. */
> +#define IOV_STD_SIZE 1
> +#define IOV_TSO_SIZE 2
> +
> +enum {
> + IOV_PACKET = 0,
> + IOV_AUXBUF = 1,
> +};
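The two-element iovec layout introduced above (packet buffer first, aux buffer for TSO spill-over second) is plain POSIX scatter input. A minimal sketch with `readv`; the buffer sizes are illustrative, not the real MTU-derived or `LINUX_RXQ_TSO_MAX_LEN` values:

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Illustrative sizes only. */
enum { STD_LEN = 8, AUX_LEN = 32 };

/* The kernel fills the packet buffer first; only a message longer than
 * STD_LEN spills its tail into the auxiliary buffer. */
static ssize_t
scatter_read(int fd, char *pkt_buf, char *aux_buf)
{
    struct iovec iov[2] = {
        { .iov_base = pkt_buf, .iov_len = STD_LEN },  /* IOV_PACKET */
        { .iov_base = aux_buf, .iov_len = AUX_LEN },  /* IOV_AUXBUF */
    };

    return readv(fd, iov, 2);
}
```

A short message lands entirely in the packet buffer and the aux buffer stays untouched, which is why the common (non-TSO) receive path pays no extra copy.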
>
> struct linux_lag_slave {
> uint32_t block_id;
> @@ -501,6 +514,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> * changes in the device miimon status, so we can use atomic_count. */
> static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
>
> +static int netdev_linux_parse_vnet_hdr(struct dp_packet *b);
> +static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu);
> static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
> int cmd, const char *cmd_name);
> static int get_flags(const struct netdev *, unsigned int *flags);
> @@ -902,6 +917,13 @@ netdev_linux_common_construct(struct netdev *netdev_)
> /* The device could be in the same network namespace or in another one. */
> netnsid_unset(&netdev->netnsid);
> ovs_mutex_init(&netdev->mutex);
> +
> + if (userspace_tso_enabled()) {
> + netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> + netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> + netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> + }
> +
> return 0;
> }
>
> @@ -961,6 +983,10 @@ netdev_linux_construct_tap(struct netdev *netdev_)
> /* Create tap device. */
> get_flags(&netdev->up, &netdev->ifi_flags);
> ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
> + if (userspace_tso_enabled()) {
> + ifr.ifr_flags |= IFF_VNET_HDR;
> + }
> +
> ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name);
> if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) {
> VLOG_WARN("%s: creating tap device failed: %s", name,
> @@ -1024,6 +1050,15 @@ static struct netdev_rxq *
> netdev_linux_rxq_alloc(void)
> {
> struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> + if (userspace_tso_enabled()) {
> + int i;
> +
> + /* Allocate auxiliary buffers to receive TSO packets. */
> + for (i = 0; i < NETDEV_MAX_BURST; i++) {
> + rx->aux_bufs[i] = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> + }
> + }
> +
> return &rx->up;
> }
>
> @@ -1069,6 +1104,15 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
> goto error;
> }
>
> + if (userspace_tso_enabled()
> + && setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
> + sizeof val)) {
> + error = errno;
> + VLOG_ERR("%s: failed to enable vnet hdr in rxq raw socket: %s",
> + netdev_get_name(netdev_), ovs_strerror(errno));
> + goto error;
> + }
> +
> /* Set non-blocking mode. */
> error = set_nonblocking(rx->fd);
> if (error) {
> @@ -1119,10 +1163,15 @@ static void
> netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
> {
> struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> + int i;
>
> if (!rx->is_tap) {
> close(rx->fd);
> }
> +
> + for (i = 0; i < NETDEV_MAX_BURST; i++) {
> + free(rx->aux_bufs[i]);
> + }
> }
>
> static void
> @@ -1159,12 +1208,14 @@ auxdata_has_vlan_tci(const struct tpacket_auxdata *aux)
> * It also used recvmmsg to reduce multiple syscalls overhead;
> */
> static int
> -netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
> +netdev_linux_batch_rxq_recv_sock(struct netdev_rxq_linux *rx, int mtu,
> struct dp_packet_batch *batch)
> {
> - size_t size;
> + int iovlen;
> + size_t std_len;
> ssize_t retval;
> - struct iovec iovs[NETDEV_MAX_BURST];
> + int virtio_net_hdr_size;
> + struct iovec iovs[NETDEV_MAX_BURST][IOV_TSO_SIZE];
> struct cmsghdr *cmsg;
> union {
> struct cmsghdr cmsg;
> @@ -1174,41 +1225,87 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
> struct dp_packet *buffers[NETDEV_MAX_BURST];
> int i;
>
> + if (userspace_tso_enabled()) {
> + /* Use the buffer from the allocated packet below to receive MTU
> + * sized packets and an aux_buf for extra TSO data. */
> + iovlen = IOV_TSO_SIZE;
> + virtio_net_hdr_size = sizeof(struct virtio_net_hdr);
> + } else {
> + /* Use only the buffer from the allocated packet. */
> + iovlen = IOV_STD_SIZE;
> + virtio_net_hdr_size = 0;
> + }
> +
> + std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
> for (i = 0; i < NETDEV_MAX_BURST; i++) {
> - buffers[i] = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> - DP_NETDEV_HEADROOM);
> - /* Reserve headroom for a single VLAN tag */
> - dp_packet_reserve(buffers[i], VLAN_HEADER_LEN);
> - size = dp_packet_tailroom(buffers[i]);
> - iovs[i].iov_base = dp_packet_data(buffers[i]);
> - iovs[i].iov_len = size;
> + buffers[i] = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
> + iovs[i][IOV_PACKET].iov_base = dp_packet_data(buffers[i]);
> + iovs[i][IOV_PACKET].iov_len = std_len;
> + iovs[i][IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> + iovs[i][IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
> mmsgs[i].msg_hdr.msg_name = NULL;
> mmsgs[i].msg_hdr.msg_namelen = 0;
> - mmsgs[i].msg_hdr.msg_iov = &iovs[i];
> - mmsgs[i].msg_hdr.msg_iovlen = 1;
> + mmsgs[i].msg_hdr.msg_iov = iovs[i];
> + mmsgs[i].msg_hdr.msg_iovlen = iovlen;
> mmsgs[i].msg_hdr.msg_control = &cmsg_buffers[i];
> mmsgs[i].msg_hdr.msg_controllen = sizeof cmsg_buffers[i];
> mmsgs[i].msg_hdr.msg_flags = 0;
> }
>
> do {
> - retval = recvmmsg(fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, NULL);
> + retval = recvmmsg(rx->fd, mmsgs, NETDEV_MAX_BURST, MSG_TRUNC, NULL);
> } while (retval < 0 && errno == EINTR);
>
> if (retval < 0) {
> - /* Save -errno to retval temporarily */
> - retval = -errno;
> - i = 0;
> - goto free_buffers;
> + retval = errno;
> + for (i = 0; i < NETDEV_MAX_BURST; i++) {
> + dp_packet_delete(buffers[i]);
> + }
> +
> + return retval;
> }
>
> for (i = 0; i < retval; i++) {
> if (mmsgs[i].msg_len < ETH_HEADER_LEN) {
> - break;
> + struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> + struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> +
> + dp_packet_delete(buffers[i]);
> + netdev->rx_dropped += 1;
> + VLOG_WARN_RL(&rl, "%s: Dropped packet: less than ether hdr size",
> + netdev_get_name(netdev_));
> + continue;
> + }
> +
> + if (mmsgs[i].msg_len > std_len) {
> + /* Build a single linear TSO packet by expanding the current packet
> + * to append the data received in the aux_buf. */
> + size_t extra_len = mmsgs[i].msg_len - std_len;
> +
> + dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> + + std_len);
> + dp_packet_prealloc_tailroom(buffers[i], extra_len);
> + memcpy(dp_packet_tail(buffers[i]), rx->aux_bufs[i], extra_len);
> + dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> + + extra_len);
> + } else {
> + dp_packet_set_size(buffers[i], dp_packet_size(buffers[i])
> + + mmsgs[i].msg_len);
> }
>
> - dp_packet_set_size(buffers[i],
> - dp_packet_size(buffers[i]) + mmsgs[i].msg_len);
> + if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffers[i])) {

> + struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> + struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> +
> + /* Unexpected error situation: the virtio header is not present
> + * or corrupted. Drop the packet but continue in case the
> + * following ones are correct. */
> + dp_packet_delete(buffers[i]);
> + netdev->rx_dropped += 1;
> + VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
> + netdev_get_name(netdev_));
> + continue;
> + }
>
> for (cmsg = CMSG_FIRSTHDR(&mmsgs[i].msg_hdr); cmsg;
> cmsg = CMSG_NXTHDR(&mmsgs[i].msg_hdr, cmsg)) {
> @@ -1238,22 +1335,11 @@ netdev_linux_batch_rxq_recv_sock(int fd, int mtu,
> dp_packet_batch_add(batch, buffers[i]);
> }
>
> -free_buffers:
> - /* Free unused buffers, including buffers whose size is less than
> - * ETH_HEADER_LEN.
> - *
> - * Note: i has been set correctly by the above for loop, so don't
> - * try to re-initialize it.
> - */
> + /* Delete unused buffers. */
> for (; i < NETDEV_MAX_BURST; i++) {
> dp_packet_delete(buffers[i]);
> }
>
> - /* netdev_linux_rxq_recv needs it to return 0 or positive errno */
> - if (retval < 0) {
> - return -retval;
> - }
> -
> return 0;
> }
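The linearization step above (copy the MTU-sized head, then append the TSO tail from the aux buffer) reduces to two `memcpy` calls into one contiguous buffer. A simplified sketch, with made-up sizes and a plain `malloc` standing in for the dp_packet tailroom handling:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build one contiguous packet of msg_len bytes.  If the kernel reported
 * more bytes than fit in the packet buffer (msg_len > std_len), the
 * excess lives in aux_buf and is appended after the head. */
static char *
linearize(const char *pkt_buf, size_t std_len,
          const char *aux_buf, size_t msg_len)
{
    char *out = malloc(msg_len);

    if (!out) {
        return NULL;
    }
    if (msg_len > std_len) {
        memcpy(out, pkt_buf, std_len);                    /* head */
        memcpy(out + std_len, aux_buf, msg_len - std_len); /* TSO tail */
    } else {
        memcpy(out, pkt_buf, msg_len);  /* ordinary MTU-sized packet */
    }
    return out;
}
```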
>
> @@ -1263,20 +1349,40 @@ free_buffers:
> * packets are added into *batch. The return value is 0 or errno.
> */
> static int
> -netdev_linux_batch_rxq_recv_tap(int fd, int mtu, struct dp_packet_batch *batch)
> +netdev_linux_batch_rxq_recv_tap(struct netdev_rxq_linux *rx, int mtu,
> + struct dp_packet_batch *batch)
> {
> struct dp_packet *buffer;
> + int virtio_net_hdr_size;
> ssize_t retval;
> - size_t size;
> + size_t std_len;
> + int iovlen;
> int i;
>
> + if (userspace_tso_enabled()) {
> + /* Use the buffer from the allocated packet below to receive MTU
> + * sized packets and an aux_buf for extra TSO data. */
> + iovlen = IOV_TSO_SIZE;
> + virtio_net_hdr_size = sizeof(struct virtio_net_hdr);
> + } else {
> + /* Use only the buffer from the allocated packet. */
> + iovlen = IOV_STD_SIZE;
> + virtio_net_hdr_size = 0;
> + }
> +
> + std_len = VLAN_ETH_HEADER_LEN + mtu + virtio_net_hdr_size;
> for (i = 0; i < NETDEV_MAX_BURST; i++) {
> + struct iovec iov[IOV_TSO_SIZE];
> +
> /* Assume Ethernet port. No need to set packet_type. */
> - buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> - DP_NETDEV_HEADROOM);
> - size = dp_packet_tailroom(buffer);
> + buffer = dp_packet_new_with_headroom(std_len, DP_NETDEV_HEADROOM);
> + iov[IOV_PACKET].iov_base = dp_packet_data(buffer);
> + iov[IOV_PACKET].iov_len = std_len;
> + iov[IOV_AUXBUF].iov_base = rx->aux_bufs[i];
> + iov[IOV_AUXBUF].iov_len = LINUX_RXQ_TSO_MAX_LEN;
> +
> do {
> - retval = read(fd, dp_packet_data(buffer), size);
> + retval = readv(rx->fd, iov, iovlen);
> } while (retval < 0 && errno == EINTR);
>
> if (retval < 0) {
> @@ -1284,7 +1390,33 @@ netdev_linux_batch_rxq_recv_tap(int fd, int mtu, struct dp_packet_batch *batch)
> break;
> }
>
> - dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> + if (retval > std_len) {
> + /* Build a single linear TSO packet by expanding the current packet
> + * to append the data received in the aux_buf. */
> + size_t extra_len = retval - std_len;
> +
> + dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> + dp_packet_prealloc_tailroom(buffer, extra_len);
> + memcpy(dp_packet_tail(buffer), rx->aux_bufs[i], extra_len);
> + dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> + } else {
> + dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> + }
> +
> + if (virtio_net_hdr_size && netdev_linux_parse_vnet_hdr(buffer)) {
> + struct netdev *netdev_ = netdev_rxq_get_netdev(&rx->up);
> + struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> +
> + /* Unexpected error situation: the virtio header is not present
> + * or corrupted. Drop the packet but continue in case the
> + * following ones are correct. */
> + dp_packet_delete(buffer);
> + netdev->rx_dropped += 1;
> + VLOG_WARN_RL(&rl, "%s: Dropped packet: Invalid virtio net header",
> + netdev_get_name(netdev_));
> + continue;
> + }
> +
> dp_packet_batch_add(batch, buffer);
> }
>
> @@ -1310,8 +1442,8 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
>
> dp_packet_batch_init(batch);
> retval = (rx->is_tap
> - ? netdev_linux_batch_rxq_recv_tap(rx->fd, mtu, batch)
> - : netdev_linux_batch_rxq_recv_sock(rx->fd, mtu, batch));
> + ? netdev_linux_batch_rxq_recv_tap(rx, mtu, batch)
> + : netdev_linux_batch_rxq_recv_sock(rx, mtu, batch));
>
> if (retval) {
> if (retval != EAGAIN && retval != EMSGSIZE) {
> @@ -1353,7 +1485,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
> }
>
> static int
> -netdev_linux_sock_batch_send(int sock, int ifindex,
> +netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
> struct dp_packet_batch *batch)
> {
> const size_t size = dp_packet_batch_size(batch);
> @@ -1367,6 +1499,10 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
>
> struct dp_packet *packet;
> DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> + if (tso) {
> + netdev_linux_prepend_vnet_hdr(packet, mtu);
> + }
> +
> iov[i].iov_base = dp_packet_data(packet);
> iov[i].iov_len = dp_packet_size(packet);
> mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
> @@ -1399,7 +1535,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
> * on other interface types because we attach a socket filter to the rx
> * socket. */
> static int
> -netdev_linux_tap_batch_send(struct netdev *netdev_,
> +netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu,
> struct dp_packet_batch *batch)
> {
> struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> @@ -1416,10 +1552,15 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
> }
>
> DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> - size_t size = dp_packet_size(packet);
> + size_t size;
> ssize_t retval;
> int error;
>
> + if (tso) {
> + netdev_linux_prepend_vnet_hdr(packet, mtu);
> + }
> +
> + size = dp_packet_size(packet);
> do {
> retval = write(netdev->tap_fd, dp_packet_data(packet), size);
> error = retval < 0 ? errno : 0;
> @@ -1454,9 +1595,15 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
> struct dp_packet_batch *batch,
> bool concurrent_txq OVS_UNUSED)
> {
> + bool tso = userspace_tso_enabled();
> + int mtu = ETH_PAYLOAD_MAX;
> int error = 0;
> int sock = 0;
>
> + if (tso) {
> + netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu);
> + }
> +
> if (!is_tap_netdev(netdev_)) {
> if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
> error = EOPNOTSUPP;
> @@ -1475,9 +1622,9 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
> goto free_batch;
> }
>
> - error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> + error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
> } else {
> - error = netdev_linux_tap_batch_send(netdev_, batch);
> + error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
> }
> if (error) {
> if (error == ENOBUFS) {
> @@ -2045,6 +2192,7 @@ netdev_tap_get_stats(const struct netdev *netdev_, struct netdev_stats *stats)
> stats->collisions += dev_stats.collisions;
> }
> stats->tx_dropped += netdev->tx_dropped;
> + stats->rx_dropped += netdev->rx_dropped;
> ovs_mutex_unlock(&netdev->mutex);
>
> return error;
> @@ -6223,6 +6371,17 @@ af_packet_sock(void)
> if (error) {
> close(sock);
> sock = -error;
> + } else if (userspace_tso_enabled()) {
> + int val = 1;
> + error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
> + sizeof val);
> + if (error) {
> + error = errno;
> + VLOG_ERR("failed to enable vnet hdr in raw socket: %s",
> + ovs_strerror(errno));
> + close(sock);
> + sock = -error;
> + }
> }
> } else {
> sock = -errno;
> @@ -6234,3 +6393,136 @@ af_packet_sock(void)
>
> return sock;
> }
> +
> +static int
> +netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
> +{
> + struct eth_header *eth_hdr;
> + ovs_be16 eth_type;
> + int l2_len;
> +
> + eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
> + if (!eth_hdr) {
> + return -EINVAL;
> + }
> +
> + l2_len = ETH_HEADER_LEN;
> + eth_type = eth_hdr->eth_type;
> + if (eth_type_vlan(eth_type)) {
> + struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN);
> +
> + if (!vlan) {
> + return -EINVAL;
> + }
> +
> + eth_type = vlan->vlan_next_type;
> + l2_len += VLAN_HEADER_LEN;
> + }
> +
> + if (eth_type == htons(ETH_TYPE_IP)) {
> + struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN);
> +
> + if (!ip_hdr) {
> + return -EINVAL;
> + }
> +
> + *l4proto = ip_hdr->ip_proto;
> + dp_packet_hwol_set_tx_ipv4(b);
> + } else if (eth_type == htons(ETH_TYPE_IPV6)) {
> + struct ovs_16aligned_ip6_hdr *nh6;
> +
> + nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
> + if (!nh6) {
> + return -EINVAL;
> + }
> +
> + *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
> + dp_packet_hwol_set_tx_ipv6(b);
> + }
> +
> + return 0;
> +}
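The L2 walk in netdev_linux_parse_l2() above (skip the Ethernet header plus one optional 802.1Q tag, then read the network-layer EtherType) can be sketched over a raw byte buffer. The constants are standard wire values; the byte offsets assume an untagged or singly-tagged frame, matching what the function handles:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HLEN    14       /* destination + source MAC + EtherType */
#define VLAN_HLEN   4        /* 802.1Q TPID + TCI */
#define ETH_P_8021Q 0x8100   /* VLAN tag protocol identifier */

/* On success, *eth_type is the network-layer EtherType and *l2_len the
 * number of bytes to skip to reach the L3 header. */
static int
parse_l2(const uint8_t *frame, size_t len, uint16_t *eth_type, size_t *l2_len)
{
    if (len < ETH_HLEN) {
        return -1;
    }
    *l2_len = ETH_HLEN;
    *eth_type = (uint16_t) (frame[12] << 8 | frame[13]);

    if (*eth_type == ETH_P_8021Q) {
        if (len < ETH_HLEN + VLAN_HLEN) {
            return -1;
        }
        /* EtherType of the encapsulated payload follows the TCI. */
        *eth_type = (uint16_t) (frame[16] << 8 | frame[17]);
        *l2_len += VLAN_HLEN;
    }
    return 0;
}
```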
> +
> +static int
> +netdev_linux_parse_vnet_hdr(struct dp_packet *b)
> +{
> + struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
> + uint16_t l4proto = 0;
> +
> + if (OVS_UNLIKELY(!vnet)) {
> + return -EINVAL;
> + }
> +
> + if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
> + return 0;
> + }
> +
> + if (netdev_linux_parse_l2(b, &l4proto)) {
> + return -EINVAL;
> + }
> +
> + if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> + if (l4proto == IPPROTO_TCP) {
> + dp_packet_hwol_set_csum_tcp(b);
> + } else if (l4proto == IPPROTO_UDP) {
> + dp_packet_hwol_set_csum_udp(b);
> + } else if (l4proto == IPPROTO_SCTP) {
> + dp_packet_hwol_set_csum_sctp(b);
> + }
> + }
> +
> + if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> + uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
> + | VIRTIO_NET_HDR_GSO_TCPV6
> + | VIRTIO_NET_HDR_GSO_UDP;
> + uint8_t type = vnet->gso_type & allowed_mask;
> +
> + if (type == VIRTIO_NET_HDR_GSO_TCPV4
> + || type == VIRTIO_NET_HDR_GSO_TCPV6) {
> + dp_packet_hwol_set_tcp_seg(b);
> + }
> + }
> +
> + return 0;
> +}
> +
> +static void
> +netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
> +{
> + struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
> +
> + if (dp_packet_hwol_is_tso(b)) {
> + uint16_t hdr_len = ((char *)dp_packet_l4(b) - (char *)dp_packet_eth(b))
> + + TCP_HEADER_LEN;
> +
> + vnet->hdr_len = (OVS_FORCE __virtio16)hdr_len;
> + vnet->gso_size = (OVS_FORCE __virtio16)(mtu - hdr_len);
> + if (dp_packet_hwol_is_ipv4(b)) {
> + vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> + } else {
> + vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> + }
> +
> + } else {
> + vnet->flags = VIRTIO_NET_HDR_GSO_NONE;
> + }
> +
> + if (dp_packet_hwol_l4_mask(b)) {
> + vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
> + vnet->csum_start = (OVS_FORCE __virtio16)((char *)dp_packet_l4(b)
> + - (char *)dp_packet_eth(b));
> +
> + if (dp_packet_hwol_l4_is_tcp(b)) {
> + vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> + struct tcp_header, tcp_csum);
> + } else if (dp_packet_hwol_l4_is_udp(b)) {
> + vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> + struct udp_header, udp_csum);
> + } else if (dp_packet_hwol_l4_is_sctp(b)) {
> + vnet->csum_offset = (OVS_FORCE __virtio16) __builtin_offsetof(
> + struct sctp_header, sctp_csum);
> + } else {
> + VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
> + }
> + }
> +}
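The csum_start/csum_offset pair written above tells the consumer where the partial checksum lives: csum_start is the frame offset of the L4 header, csum_offset the offset of the checksum field inside that header. The struct below is a minimal stand-in for a TCP header, not the OVS `tcp_header` definition, but its checksum field sits at the same wire offset (16):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified TCP header layout for illustration; field sizes match the
 * wire format, so offsetof() gives the on-wire checksum offset. */
struct mini_tcp_header {
    uint16_t src, dst;
    uint32_t seq, ack;
    uint16_t flags_off;   /* data offset, flags */
    uint16_t window;
    uint16_t tcp_csum;    /* what csum_offset points at */
    uint16_t urg;
};

static size_t
tcp_csum_offset(void)
{
    return offsetof(struct mini_tcp_header, tcp_csum);
}
```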
> diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> index f109c4e66..22f4cde33 100644
> --- a/lib/netdev-provider.h
> +++ b/lib/netdev-provider.h
> @@ -37,6 +37,12 @@ extern "C" {
> struct netdev_tnl_build_header_params;
> #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
>
> +enum netdev_ol_flags {
> + NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
> + NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
> + NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
> +};
> +
> /* A network device (e.g. an Ethernet device).
> *
> * Network device implementations may read these members but should not modify
> @@ -51,6 +57,9 @@ struct netdev {
> * opening this device, and therefore got assigned to the "system" class */
> bool auto_classified;
>
> + /* Bitmask of the offloading features enabled by the netdev. */
> + uint64_t ol_flags;
> +
> /* If this is 'true', the user explicitly specified an MTU for this
> * netdev. Otherwise, Open vSwitch is allowed to override it. */
> bool mtu_user_config;
> diff --git a/lib/netdev.c b/lib/netdev.c
> index 405c98c68..f95b19af4 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> @@ -66,6 +66,8 @@ COVERAGE_DEFINE(netdev_received);
> COVERAGE_DEFINE(netdev_sent);
> COVERAGE_DEFINE(netdev_add_router);
> COVERAGE_DEFINE(netdev_get_stats);
> +COVERAGE_DEFINE(netdev_send_prepare_drops);
> +COVERAGE_DEFINE(netdev_push_header_drops);
>
> struct netdev_saved_flags {
> struct netdev *netdev;
> @@ -782,6 +784,54 @@ netdev_get_pt_mode(const struct netdev *netdev)
> : NETDEV_PT_LEGACY_L2);
> }
>
> +/* Check if a 'packet' is compatible with 'netdev_flags'.
> + * If the packet is incompatible, return 'false' and set 'errormsg'
> + * to point at the reason. */
> +static bool
> +netdev_send_prepare_packet(const uint64_t netdev_flags,
> + struct dp_packet *packet, char **errormsg)
> +{
> + if (dp_packet_hwol_is_tso(packet)
> + && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
> + /* Fall back to GSO in software. */
> + VLOG_ERR_BUF(errormsg, "No TSO support");
> + return false;
> + }
> +
> + if (dp_packet_hwol_l4_mask(packet)
> + && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
> + /* Fall back to L4 csum in software. */
> + VLOG_ERR_BUF(errormsg, "No L4 checksum support");
> + return false;
> + }
> +
> + return true;
> +}
> +
> +/* Check if each packet in 'batch' is compatible with 'netdev' features,
> + * otherwise either fall back to software implementation or drop it. */
> +static void
> +netdev_send_prepare_batch(const struct netdev *netdev,
> + struct dp_packet_batch *batch)
> +{
> + struct dp_packet *packet;
> + size_t i, size = dp_packet_batch_size(batch);
> +
> + DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> + char *errormsg = NULL;
> +
> + if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
> + dp_packet_batch_refill(batch, packet, i);
> + } else {
> + dp_packet_delete(packet);
> + COVERAGE_INC(netdev_send_prepare_drops);
> + VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
> + netdev_get_name(netdev), errormsg);
> + free(errormsg);
> + }
> + }
> +}
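The refill loop in netdev_send_prepare_batch() above is a single-pass, in-place, order-preserving filter: packets that pass the check are written back from the front of the batch and the count shrinks. The same pattern over plain ints, with a hypothetical predicate standing in for the offload-compatibility check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Compact the batch in place, keeping only elements for which keep()
 * returns true; returns the new count.  Relative order is preserved,
 * mirroring DP_PACKET_BATCH_REFILL_FOR_EACH + dp_packet_batch_refill. */
static size_t
filter_batch(int *batch, size_t count, bool (*keep)(int))
{
    size_t n = 0;

    for (size_t i = 0; i < count; i++) {
        if (keep(batch[i])) {
            batch[n++] = batch[i];  /* "refill" the kept element */
        }
        /* else: element is dropped (dp_packet_delete in the real code). */
    }
    return n;
}

static bool
is_even(int v)
{
    return v % 2 == 0;
}
```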
> +
> /* Sends 'batch' on 'netdev'. Returns 0 if successful (for every packet),
> * otherwise a positive errno value. Returns EAGAIN without blocking if
> * at least one the packets cannot be queued immediately. Returns EMSGSIZE
> @@ -811,8 +861,14 @@ int
> netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
> bool concurrent_txq)
> {
> - int error = netdev->netdev_class->send(netdev, qid, batch,
> - concurrent_txq);
> + int error;
> +
> + netdev_send_prepare_batch(netdev, batch);
> + if (OVS_UNLIKELY(dp_packet_batch_is_empty(batch))) {
> + return 0;
> + }
> +
> + error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
> if (!error) {
> COVERAGE_INC(netdev_sent);
> }
> @@ -878,9 +934,21 @@ netdev_push_header(const struct netdev *netdev,
> const struct ovs_action_push_tnl *data)
> {
> struct dp_packet *packet;
> - DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> - netdev->netdev_class->push_header(netdev, packet, data);
> - pkt_metadata_init(&packet->md, data->out_port);
> + size_t i, size = dp_packet_batch_size(batch);
> +
> + DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> + if (OVS_UNLIKELY(dp_packet_hwol_is_tso(packet)
> + || dp_packet_hwol_l4_mask(packet))) {
> + COVERAGE_INC(netdev_push_header_drops);
> + dp_packet_delete(packet);
> + VLOG_WARN_RL(&rl, "%s: Tunneling packets with HW offload flags is "
> + "not supported: packet dropped",
> + netdev_get_name(netdev));
> + } else {
> + netdev->netdev_class->push_header(netdev, packet, data);
> + pkt_metadata_init(&packet->md, data->out_port);
> + dp_packet_batch_refill(batch, packet, i);
> + }
> }
>
> return 0;
> diff --git a/lib/userspace-tso.c b/lib/userspace-tso.c
> new file mode 100644
> index 000000000..f843c2a76
> --- /dev/null
> +++ b/lib/userspace-tso.c
> @@ -0,0 +1,48 @@
> +/*
> + * Copyright (c) 2020 Red Hat, Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#include <config.h>
> +
> +#include "smap.h"
> +#include "ovs-thread.h"
> +#include "openvswitch/vlog.h"
> +#include "dpdk.h"
> +#include "userspace-tso.h"
> +#include "vswitch-idl.h"
> +
> +VLOG_DEFINE_THIS_MODULE(userspace_tso);
> +
> +static bool userspace_tso = false;
> +
> +void
> +userspace_tso_init(const struct smap *ovs_other_config)
> +{
> + if (smap_get_bool(ovs_other_config, "userspace-tso-enable", false)) {
> + static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> +
> + if (ovsthread_once_start(&once)) {
> + VLOG_INFO("Userspace TCP Segmentation Offloading support enabled");
> + userspace_tso = true;
> + ovsthread_once_done(&once);
> + }
> + }
> +}
> +
> +bool
> +userspace_tso_enabled(void)
> +{
> + return userspace_tso;
> +}
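One consequence of the ovsthread_once guard above is that the flag latches: once the knob has been seen as true, later config reads cannot clear it (matching the "restart required" semantics), and the INFO message is logged exactly once. A standalone C sketch of that latch, using the standard pthread_once as a stand-in for OVS's ovsthread_once; the `tso_*` names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <pthread.h>

/* Sketch of the one-shot enable pattern in userspace_tso_init():
 * the flag can only transition false -> true, and only once, so a
 * later config read with the knob removed cannot toggle it off. */

static bool tso_enabled = false;
static pthread_once_t tso_once = PTHREAD_ONCE_INIT;

static void
tso_do_enable(void)
{
    tso_enabled = true;         /* the one-time VLOG_INFO would go here */
}

/* Called on every config read; 'requested' mirrors smap_get_bool(). */
static void
tso_init(bool requested)
{
    if (requested) {
        pthread_once(&tso_once, tso_do_enable);
    }
}

static bool
tso_is_enabled(void)
{
    return tso_enabled;
}
```

Note the guard only wraps the "enable" branch: if the knob is false on the first read and true on a later one, the latch still fires then, just as in the patch.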
> diff --git a/lib/userspace-tso.h b/lib/userspace-tso.h
> new file mode 100644
> index 000000000..0758274c0
> --- /dev/null
> +++ b/lib/userspace-tso.h
> @@ -0,0 +1,23 @@
> +/*
> + * Copyright (c) 2020 Red Hat Inc.
> + *
> + * Licensed under the Apache License, Version 2.0 (the "License");
> + * you may not use this file except in compliance with the License.
> + * You may obtain a copy of the License at:
> + *
> + * http://www.apache.org/licenses/LICENSE-2.0
> + *
> + * Unless required by applicable law or agreed to in writing, software
> + * distributed under the License is distributed on an "AS IS" BASIS,
> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> + * See the License for the specific language governing permissions and
> + * limitations under the License.
> + */
> +
> +#ifndef USERSPACE_TSO_H
> +#define USERSPACE_TSO_H 1
> +
> +void userspace_tso_init(const struct smap *ovs_other_config);
> +bool userspace_tso_enabled(void);
> +
> +#endif /* userspace-tso.h */
> diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> index 86c7b10a9..e591c26a6 100644
> --- a/vswitchd/bridge.c
> +++ b/vswitchd/bridge.c
> @@ -65,6 +65,7 @@
> #include "system-stats.h"
> #include "timeval.h"
> #include "tnl-ports.h"
> +#include "userspace-tso.h"
> #include "util.h"
> #include "unixctl.h"
> #include "lib/vswitch-idl.h"
> @@ -3285,6 +3286,7 @@ bridge_run(void)
> if (cfg) {
> netdev_set_flow_api_enabled(&cfg->other_config);
> dpdk_init(&cfg->other_config);
> + userspace_tso_init(&cfg->other_config);
> }
>
> /* Initialize the ofproto library. This only needs to run once, but
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index c43cb1aa4..a9efe71a5 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -690,6 +690,23 @@
> once in few hours or a day or a week.
> </p>
> </column>
> + <column name="other_config" key="userspace-tso-enable"
> + type='{"type": "boolean"}'>
> + <p>
> + Set this value to <code>true</code> to enable userspace support for
> + TCP Segmentation Offloading (TSO). When enabled, interfaces may pass
> + oversized TCP segments to the datapath, and the datapath will offload
> + the TCP segmentation and checksum calculation to the interfaces when
> + necessary.
> + </p>
> + <p>
> + The default value is <code>false</code>. Changing this value requires
> + restarting the daemon.
> + </p>
> + <p>
> + The feature is considered experimental.
> + </p>
> + </column>
> </group>
> <group title="Status">
> <column name="next_cfg">
>
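For anyone else testing the series: the new knob documented above is set through the usual other_config path and only takes effect on a daemon restart, since the flag latches at init time. A configuration sketch (the systemd unit name varies by distribution, so that line is an assumption):

```shell
# Enable the experimental userspace TSO support; the feature
# defaults to off and requires a vswitchd restart to take effect.
ovs-vsctl set Open_vSwitch . other_config:userspace-tso-enable=true

# Restart the daemon to pick up the change, e.g. on a Debian-style host:
systemctl restart openvswitch-switch
```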