[ovs-dev] [PATCH dpdk-latest v2 4/4] netdev-dpdk: Add TCP Segmentation Offload support

Flavio Leitner fbl at sysclose.org
Tue Jan 7 18:09:02 UTC 2020


Hi Ian,

Thanks for the reviews. I agree with your comments on the other
patches. For this one, I will answer inline.


On Mon, Jan 06, 2020 at 08:24:48PM +0000, Stokes, Ian wrote:
> 
> 
> On 12/31/2019 8:14 PM, Flavio Leitner wrote:
> > Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
> > the network stack to delegate the TCP segmentation to the NIC reducing
> > the per packet CPU overhead.
> > 
> > A guest using vhostuser interface with TSO enabled can send TCP packets
> > much bigger than the MTU, which saves CPU cycles normally used to break
> > the packets down to MTU size and to calculate checksums.
> > 
> > It also saves CPU cycles used to parse multiple packets/headers during
> > the packet processing inside virtual switch.
> > 
> > If the destination of the packet is another guest in the same host, then
> > the same big packet can be sent through a vhostuser interface skipping
> > the segmentation completely. However, if the destination is not local,
> > the NIC hardware is instructed to do the TCP segmentation and checksum
> > calculation.
> > 
> > It is recommended to check if NIC hardware supports TSO before enabling
> > the feature, which is off by default.
> 
> Might be useful to add a link to the DPDK Networking Device feature support
> as it lists whether TSO is supported for a given device currently
> 
> https://doc.dpdk.org/guides/nics/overview.html

Yeah, will add to the documentation.


> > 
> > Signed-off-by: Flavio Leitner <fbl at sysclose.org>
> > ---
> >   Documentation/automake.mk           |   1 +
> >   Documentation/topics/dpdk/index.rst |   1 +
> >   Documentation/topics/dpdk/tso.rst   |  89 ++++++++
> >   NEWS                                |   1 +
> >   lib/automake.mk                     |   2 +
> >   lib/conntrack.c                     |  29 ++-
> >   lib/dp-packet.h                     | 152 +++++++++++++-
> >   lib/ipf.c                           |  32 +--
> >   lib/netdev-dpdk.c                   | 311 ++++++++++++++++++++++++----
> >   lib/netdev-linux-private.h          |   4 +
> >   lib/netdev-linux.c                  | 295 +++++++++++++++++++++++---
> >   lib/netdev-provider.h               |  10 +
> >   lib/netdev.c                        |  52 ++++-
> >   lib/tso.c                           |  54 +++++
> >   lib/tso.h                           |  23 ++
> >   vswitchd/bridge.c                   |   2 +
> >   vswitchd/vswitch.xml                |  12 ++
> >   17 files changed, 982 insertions(+), 88 deletions(-)
> >   create mode 100644 Documentation/topics/dpdk/tso.rst
> >   create mode 100644 lib/tso.c
> >   create mode 100644 lib/tso.h
> > 
> > diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> > index 5f7c3e07b..abe5aaed1 100644
> > --- a/Documentation/automake.mk
> > +++ b/Documentation/automake.mk
> > @@ -35,6 +35,7 @@ DOC_SOURCE = \
> >   	Documentation/topics/dpdk/index.rst \
> >   	Documentation/topics/dpdk/bridge.rst \
> >   	Documentation/topics/dpdk/jumbo-frames.rst \
> > +	Documentation/topics/dpdk/tso.rst \
> >   	Documentation/topics/dpdk/memory.rst \
> >   	Documentation/topics/dpdk/pdump.rst \
> >   	Documentation/topics/dpdk/phy.rst \
> > diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> > index f2862ea70..400d56051 100644
> > --- a/Documentation/topics/dpdk/index.rst
> > +++ b/Documentation/topics/dpdk/index.rst
> > @@ -40,4 +40,5 @@ DPDK Support
> >      /topics/dpdk/qos
> >      /topics/dpdk/pdump
> >      /topics/dpdk/jumbo-frames
> > +   /topics/dpdk/tso
> >      /topics/dpdk/memory
> > diff --git a/Documentation/topics/dpdk/tso.rst b/Documentation/topics/dpdk/tso.rst
> > new file mode 100644
> > index 000000000..0724513bd
> > --- /dev/null
> > +++ b/Documentation/topics/dpdk/tso.rst
> > @@ -0,0 +1,89 @@
> > +..
> > +      Copyright 2019, Red Hat, Inc.
> Minor but probably 2020 above now for the next revision, goes for the other
> new files added also.

Oh, right, going to update that.


> > +
> > +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> > +      not use this file except in compliance with the License. You may obtain
> > +      a copy of the License at
> > +
> > +          http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +      Unless required by applicable law or agreed to in writing, software
> > +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> > +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> > +      License for the specific language governing permissions and limitations
> > +      under the License.
> > +
> > +      Convention for heading levels in Open vSwitch documentation:
> > +
> > +      =======  Heading 0 (reserved for the title in a document)
> > +      -------  Heading 1
> > +      ~~~~~~~  Heading 2
> > +      +++++++  Heading 3
> > +      '''''''  Heading 4
> > +
> > +      Avoid deeper levels because they do not render well.
> > +
> > +========================
> > +Userspace Datapath - TSO
> > +========================
> > +
> > +**Note:** This feature is considered experimental.
> > +
> > +TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
> > +of an oversized TCP segment to the underlying physical NIC. Offload of frame
> > +segmentation achieves computational savings in the core, freeing up CPU cycles
> > +for more useful work.
> > +
> > +A common use case for TSO is when using virtualization, where traffic that's
> > +coming in from a VM can offload the TCP segmentation, thus avoiding the
> > +fragmentation in software. Additionally, if the traffic is headed to a VM
> > +within the same host further optimization can be expected. As the traffic never
> > +leaves the machine, no MTU needs to be accounted for, and thus no segmentation
> > +and checksum calculations are required, which saves yet more cycles. Only when
> > +the traffic actually leaves the host the segmentation needs to happen, in which
> > +case it will be performed by the egress NIC. Consult your controller's
> > +datasheet for compatibility. Secondly, the NIC must have an associated DPDK
> > +Poll Mode Driver (PMD) which supports `TSO`.
> > +
> 
> Possibly a better place for the link I posted earlier would be here.

Yes, yes. I will do that.


> > +Enabling TSO
> > +~~~~~~~~~~~~
> > +
> > +The TSO support may be enabled via a global config value ``tso-support``.
> > +Setting this to ``true`` enables TSO support for all ports.
> > +
> > +    $ ovs-vsctl set Open_vSwitch . other_config:tso-support=true
> > +
> > +The default value is ``false``.
> > +
> > +When using :doc:`vHost User ports <vhost-user>`, TSO may be enabled as follows.
> > +
> > +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
> > +connection is established, `TSO` is thus advertised to the guest as an
> > +available feature:
> > +
> > +QEMU Command Line Parameter::
> > +
> > +    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
> > +    ...
> > +    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
> > +    csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
> > +    ...
> > +
> > +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be
> > +used to enable same::
> > +
> > +    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
> > +    $ ethtool -K eth0 tso on
> > +    $ ethtool -k eth0
> > +
> > +~~~~~~~~~~~
> > +Limitations
> > +~~~~~~~~~~~
> Minor: white space needed here to separate heading from text?

It is not needed for the doc to render on GitHub, but it definitely
looks better.

> > +The current OvS userspace `TSO` implementation supports flat and VLAN networks
> > +only (i.e. no support for `TSO` over tunneled connection [VxLAN, GRE, IPinIP,
> > +etc.]).
> > +
> > +There is no software implementation of TSO, so all ports attached to the
> > +datapath must support TSO or packets using that feature will be dropped.
> > +That also means guests using vhost-user in client mode will receive TSO
> > +packet regardless of TSO being enabled or disabled within the guest.
> > diff --git a/NEWS b/NEWS
> > index 0d65d5a7f..df32930bf 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -14,6 +14,7 @@ Post-v2.12.0
> >        * DPDK pdump packet capture support disabled by default. New configure
> >          option '--enable-dpdk-pdump' to enable it.
> >        * DPDK pdump support is deprecated and will be removed in next releases.
> > +     * Add experimental support for TSO in vhost-user ports.
> Is it just for vhost-user ports though? If supported by the NIC, TSO will be
> handled there as well right and this work enabled that case also?

Right. I will just say "Add experimental support for TSO" here and
leave the details in the documentation.


> 
> >   v2.12.0 - 03 Sep 2019
> >   ---------------------
> > diff --git a/lib/automake.mk b/lib/automake.mk
> > index 17b36b43d..01c54d7f3 100644
> > --- a/lib/automake.mk
> > +++ b/lib/automake.mk
> > @@ -302,6 +302,8 @@ lib_libopenvswitch_la_SOURCES = \
> >   	lib/tnl-neigh-cache.h \
> >   	lib/tnl-ports.c \
> >   	lib/tnl-ports.h \
> > +	lib/tso.c \
> > +	lib/tso.h \
> >   	lib/netdev-native-tnl.c \
> >   	lib/netdev-native-tnl.h \
> >   	lib/token-bucket.c \
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index df7b9fa7a..188f58dd8 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -1885,7 +1885,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
> >           if (hwol_bad_l3_csum) {
> >               ok = false;
> >           } else {
> > -            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
> > +            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
> > +                                     || dp_packet_hwol_tx_ip_checksum(pkt);
> >               /* Validate the checksum only when hwol is not supported. */
> >               ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL,
> >                                    !hwol_good_l3_csum);
> > @@ -1899,7 +1900,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
> >       if (ok) {
> >           bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
> >           if (!hwol_bad_l4_csum) {
> > -            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
> > +            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
> > +                                      || dp_packet_hwol_tx_l4_checksum(pkt);
> >               /* Validate the checksum only when hwol is not supported. */
> >               if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
> >                              &ctx->icmp_related, l3, !hwol_good_l4_csum,
> > @@ -3100,8 +3102,11 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> >                   }
> >                   if (seq_skew) {
> >                       ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
> > -                    l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> > -                                          l3_hdr->ip_tot_len, htons(ip_len));
> > +                    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
> > +                        l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> > +                                                        l3_hdr->ip_tot_len,
> > +                                                        htons(ip_len));
> > +                    }
> >                       l3_hdr->ip_tot_len = htons(ip_len);
> >                   }
> >               }
> > @@ -3119,13 +3124,15 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> >       }
> >       th->tcp_csum = 0;
> > -    if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> > -        th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> > -                           dp_packet_l4_size(pkt));
> > -    } else {
> > -        uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> > -        th->tcp_csum = csum_finish(
> > -             csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> > +    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
> > +        if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> > +            th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> > +                               dp_packet_l4_size(pkt));
> > +        } else {
> > +            uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> > +            th->tcp_csum = csum_finish(
> > +                 csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> > +        }
> >       }
> >       if (seq_skew) {
> > diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> > index 325924eaa..5cd6fcc68 100644
> > --- a/lib/dp-packet.h
> > +++ b/lib/dp-packet.h
> > @@ -109,6 +109,8 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t);
> >   static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
> >   static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
> > +void dp_packet_prepend_vnet_hdr(struct dp_packet *, int mtu);
> > +
> >   void *dp_packet_resize_l2(struct dp_packet *, int increment);
> >   void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
> >   static inline void *dp_packet_eth(const struct dp_packet *);
> > @@ -451,7 +453,7 @@ dp_packet_init_specific(struct dp_packet *p)
> >   {
> >       /* This initialization is needed for packets that do not come from DPDK
> >        * interfaces, when vswitchd is built with --with-dpdk. */
> > -    p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> > +    p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> >       p->mbuf.nb_segs = 1;
> >       p->mbuf.next = NULL;
> >   }
> > @@ -514,6 +516,80 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
> >       b->mbuf.buf_len = s;
> >   }
> > +static inline bool
> > +dp_packet_hwol_is_tso(const struct dp_packet *b)
> > +{
> > +    return (b->mbuf.ol_flags & (PKT_TX_TCP_SEG | PKT_TX_L4_MASK))
> > +           ? true
> > +           : false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_is_ipv4(const struct dp_packet *b)
> > +{
> > +    return b->mbuf.ol_flags & PKT_TX_IPV4 ? true : false;
> > +}
> > +
> > +static inline uint64_t
> > +dp_packet_hwol_l4_mask(const struct dp_packet *b)
> > +{
> > +    return b->mbuf.ol_flags & PKT_TX_L4_MASK;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b)
> > +{
> > +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_TCP_CKSUM
> > +           ? true
> > +           : false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_udp(struct dp_packet *b)
> > +{
> > +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_UDP_CKSUM
> > +           ? true
> > +           : false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_sctp(struct dp_packet *b)
> > +{
> > +    return (b->mbuf.ol_flags & PKT_TX_L4_MASK) == PKT_TX_SCTP_CKSUM
> > +           ? true
> > +           : false;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_IPV4;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_IPV6;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_tcp(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_TCP_CKSUM;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_udp(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_UDP_CKSUM;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_sctp(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_SCTP_CKSUM;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tcp_seg(struct dp_packet *b) {
> > +    b->mbuf.ol_flags |= PKT_TX_TCP_SEG;
> > +}
> > +
> >   /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
> >    * correct only if 'dp_packet_rss_valid(p)' returns true */
> >   static inline uint32_t
> > @@ -643,6 +719,66 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
> >       b->allocated_ = s;
> >   }
> > +static inline bool
> > +dp_packet_hwol_is_tso(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_is_ipv4(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return false;
> > +}
> > +
> > +static inline uint64_t
> > +dp_packet_hwol_l4_mask(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return 0;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_tcp(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_udp(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_l4_is_sctp(const struct dp_packet *b OVS_UNUSED)
> > +{
> > +    return false;
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tx_ipv6(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_tcp(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_udp(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_csum_sctp(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> > +static inline void
> > +dp_packet_hwol_set_tcp_seg(struct dp_packet *b OVS_UNUSED) {
> > +}
> > +
> >   /* Returns the RSS hash of the packet 'p'.  Note that the returned value is
> >    * correct only if 'dp_packet_rss_valid(p)' returns true */
> >   static inline uint32_t
> > @@ -934,6 +1070,20 @@ dp_packet_batch_reset_cutlen(struct dp_packet_batch *batch)
> >       }
> >   }
> > +static inline bool
> > +dp_packet_hwol_tx_ip_checksum(const struct dp_packet *p)
> > +{
> > +
> > +    return dp_packet_hwol_l4_mask(p) ? true : false;
> > +}
> > +
> > +static inline bool
> > +dp_packet_hwol_tx_l4_checksum(const struct dp_packet *p)
> > +{
> > +
> > +    return dp_packet_hwol_l4_mask(p) ? true : false;
> > +}
> > +
> >   #ifdef  __cplusplus
> >   }
> >   #endif
> > diff --git a/lib/ipf.c b/lib/ipf.c
> > index 4cc0f2df6..052867d90 100644
> > --- a/lib/ipf.c
> > +++ b/lib/ipf.c
> > @@ -433,9 +433,11 @@ ipf_reassemble_v4_frags(struct ipf_list *ipf_list)
> >       len += rest_len;
> >       l3 = dp_packet_l3(pkt);
> >       ovs_be16 new_ip_frag_off = l3->ip_frag_off & ~htons(IP_MORE_FRAGMENTS);
> > -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> > -                                new_ip_frag_off);
> > -    l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> > +    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
> > +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_frag_off,
> > +                                    new_ip_frag_off);
> > +        l3->ip_csum = recalc_csum16(l3->ip_csum, l3->ip_tot_len, htons(len));
> > +    }
> >       l3->ip_tot_len = htons(len);
> >       l3->ip_frag_off = new_ip_frag_off;
> >       dp_packet_set_l2_pad_size(pkt, 0);
> > @@ -606,6 +608,7 @@ ipf_is_valid_v4_frag(struct ipf *ipf, struct dp_packet *pkt)
> >       }
> >       if (OVS_UNLIKELY(!dp_packet_ip_checksum_valid(pkt)
> > +                     && !dp_packet_hwol_tx_ip_checksum(pkt)
> >                        && csum(l3, ip_hdr_len) != 0)) {
> >           goto invalid_pkt;
> >       }
> > @@ -1180,16 +1183,21 @@ ipf_post_execute_reass_pkts(struct ipf *ipf,
> >                   } else {
> >                       struct ip_header *l3_frag = dp_packet_l3(frag_0->pkt);
> >                       struct ip_header *l3_reass = dp_packet_l3(pkt);
> > -                    ovs_be32 reass_ip = get_16aligned_be32(&l3_reass->ip_src);
> > -                    ovs_be32 frag_ip = get_16aligned_be32(&l3_frag->ip_src);
> > -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> > -                                                     frag_ip, reass_ip);
> > -                    l3_frag->ip_src = l3_reass->ip_src;
> > +                    if (!dp_packet_hwol_tx_ip_checksum(frag_0->pkt)) {
> > +                        ovs_be32 reass_ip =
> > +                            get_16aligned_be32(&l3_reass->ip_src);
> > +                        ovs_be32 frag_ip =
> > +                            get_16aligned_be32(&l3_frag->ip_src);
> > +
> > +                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> > +                                                         frag_ip, reass_ip);
> > +                        reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> > +                        frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> > +                        l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> > +                                                         frag_ip, reass_ip);
> > +                    }
> > -                    reass_ip = get_16aligned_be32(&l3_reass->ip_dst);
> > -                    frag_ip = get_16aligned_be32(&l3_frag->ip_dst);
> > -                    l3_frag->ip_csum = recalc_csum32(l3_frag->ip_csum,
> > -                                                     frag_ip, reass_ip);
> > +                    l3_frag->ip_src = l3_reass->ip_src;
> >                       l3_frag->ip_dst = l3_reass->ip_dst;
> >                   }
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index 57bff5e58..9f4360ae9 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -63,6 +63,7 @@
> >   #include "smap.h"
> >   #include "sset.h"
> >   #include "timeval.h"
> > +#include "tso.h"
> >   #include "unaligned.h"
> >   #include "unixctl.h"
> >   #include "util.h"
> > @@ -355,7 +356,8 @@ struct ingress_policer {
> >   enum dpdk_hw_ol_features {
> >       NETDEV_RX_CHECKSUM_OFFLOAD = 1 << 0,
> >       NETDEV_RX_HW_CRC_STRIP = 1 << 1,
> > -    NETDEV_RX_HW_SCATTER = 1 << 2
> > +    NETDEV_RX_HW_SCATTER = 1 << 2,
> > +    NETDEV_TX_TSO_OFFLOAD = 1 << 3,
> >   };
> >   /*
> > @@ -935,6 +937,12 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
> >           conf.rxmode.offloads |= DEV_RX_OFFLOAD_KEEP_CRC;
> >       }
> > +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> > +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_TSO;
> > +        conf.txmode.offloads |= DEV_TX_OFFLOAD_TCP_CKSUM;
> > +        conf.txmode.offloads |= DEV_TX_OFFLOAD_IPV4_CKSUM;
> > +    }
> > +
> >       /* Limit configured rss hash functions to only those supported
> >        * by the eth device. */
> >       conf.rx_adv_conf.rss_conf.rss_hf &= info.flow_type_rss_offloads;
> > @@ -1036,6 +1044,9 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> >       uint32_t rx_chksm_offload_capa = DEV_RX_OFFLOAD_UDP_CKSUM |
> >                                        DEV_RX_OFFLOAD_TCP_CKSUM |
> >                                        DEV_RX_OFFLOAD_IPV4_CKSUM;
> > +    uint32_t tx_tso_offload_capa = DEV_TX_OFFLOAD_TCP_TSO |
> > +                                   DEV_TX_OFFLOAD_TCP_CKSUM |
> > +                                   DEV_TX_OFFLOAD_IPV4_CKSUM;
> >       rte_eth_dev_info_get(dev->port_id, &info);
> > @@ -1062,6 +1073,14 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> >           dev->hw_ol_features &= ~NETDEV_RX_HW_SCATTER;
> >       }
> > +    if (info.tx_offload_capa & tx_tso_offload_capa) {
> > +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> > +    } else {
> > +        dev->hw_ol_features &= ~NETDEV_TX_TSO_OFFLOAD;
> > +        VLOG_WARN("Tx TSO offload is not supported on port "
> > +                  DPDK_PORT_ID_FMT, dev->port_id);
> 
> To make this warning more verbose we could include the interface name as
> well as the port ID?

Sounds good.
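
As a rough sketch of what the more verbose message could look like
(format_tso_warning() is a hypothetical stand-alone helper used only
for illustration; in the patch itself this would just be extra
arguments passed to VLOG_WARN, and the exact wording may differ):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, for illustration only: builds the warning text with
 * both the interface name and the DPDK port id. */
static int
format_tso_warning(char *buf, size_t size, const char *ifname,
                   uint16_t port_id)
{
    assert(buf != NULL && ifname != NULL);

    /* snprintf() returns the length the full message would need. */
    return snprintf(buf, size,
                    "%s: Tx TSO offload is not supported on port %"PRIu16,
                    ifname, port_id);
}
```

For an interface named "dpdk0" on port 3 that would read
"dpdk0: Tx TSO offload is not supported on port 3".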


> 
> > +    }
> > +
> >       n_rxq = MIN(info.max_rx_queues, dev->up.n_rxq);
> >       n_txq = MIN(info.max_tx_queues, dev->up.n_txq);
> > @@ -1310,14 +1329,16 @@ netdev_dpdk_vhost_construct(struct netdev *netdev)
> >           goto out;
> >       }
> > -    err = rte_vhost_driver_disable_features(dev->vhost_id,
> > -                                1ULL << VIRTIO_NET_F_HOST_TSO4
> > -                                | 1ULL << VIRTIO_NET_F_HOST_TSO6
> > -                                | 1ULL << VIRTIO_NET_F_CSUM);
> > -    if (err) {
> > -        VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> > -                 "port: %s\n", name);
> > -        goto out;
> > +    if (!tso_enabled()) {
> > +        err = rte_vhost_driver_disable_features(dev->vhost_id,
> > +                                    1ULL << VIRTIO_NET_F_HOST_TSO4
> > +                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
> > +                                    | 1ULL << VIRTIO_NET_F_CSUM);
> > +        if (err) {
> > +            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> > +                     "port: %s\n", name);
> > +            goto out;
> > +        }
> >       }
> >       err = rte_vhost_driver_start(dev->vhost_id);
> > @@ -1652,6 +1673,11 @@ netdev_dpdk_get_config(const struct netdev *netdev, struct smap *args)
> >           } else {
> >               smap_add(args, "rx_csum_offload", "false");
> >           }
> > +        if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> > +            smap_add(args, "tx_tso_offload", "true");
> > +        } else {
> > +            smap_add(args, "tx_tso_offload", "false");
> > +        }
> >           smap_add(args, "lsc_interrupt_mode",
> >                    dev->lsc_interrupt_mode ? "true" : "false");
> >       }
> > @@ -2051,6 +2077,64 @@ netdev_dpdk_rxq_dealloc(struct netdev_rxq *rxq)
> >       rte_free(rx);
> >   }
> > +/* Prepare the packet for HWOL.
> > + * Return True if the packet is OK to continue. */
> > +static bool
> > +netdev_dpdk_prep_hwol_packet(struct netdev_dpdk *dev, struct rte_mbuf *mbuf)
> > +{
> > +    struct dp_packet *pkt = CONTAINER_OF(mbuf, struct dp_packet, mbuf);
> > +
> > +    if (mbuf->ol_flags & PKT_TX_L4_MASK) {
> > +        mbuf->l2_len = (char *)dp_packet_l3(pkt) - (char *)dp_packet_eth(pkt);
> > +        mbuf->l3_len = (char *)dp_packet_l4(pkt) - (char *)dp_packet_l3(pkt);
> > +        mbuf->outer_l2_len = 0;
> > +        mbuf->outer_l3_len = 0;
> > +    }
> > +
> > +    if (mbuf->ol_flags & PKT_TX_TCP_SEG) {
> > +        struct tcp_header *th = dp_packet_l4(pkt);
> > +
> > +        if (!th) {
> > +            VLOG_WARN_RL(&rl, "%s: TCP Segmentation without L4 header"
> > +                         " pkt len: %"PRIu32"", dev->up.name, mbuf->pkt_len);
> > +            return false;
> > +        }
> > +
> > +        mbuf->l4_len = TCP_OFFSET(th->tcp_ctl) * 4;
> Can we replace the use of magic number 4 above?

Probably, but OvS uses exactly the same expression in other places.
Perhaps a separate patch can change all of them at once?
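
For reference, a minimal sketch of what such a cleanup could look
like. The names below (TCP_WORD_BYTES, tcp_hdr_len_bytes() and
tso_segment_size()) are made up for this example and do not exist in
OvS; the only facts relied on are that the TCP data offset field
counts 32-bit words and that the segment size in this patch is the
MTU minus the L3 and L4 header lengths:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name: the TCP data offset field counts 32-bit words. */
#define TCP_WORD_BYTES 4

/* Returns the TCP header length in bytes, given the data offset field
 * in 32-bit words.  A valid TCP header has an offset between 5 and 15. */
static inline uint16_t
tcp_hdr_len_bytes(uint8_t data_offset_words)
{
    assert(data_offset_words >= 5 && data_offset_words <= 15);
    return data_offset_words * TCP_WORD_BYTES;
}

/* Same arithmetic as in the patch: payload bytes per segment that fit in
 * the MTU once the L3 and L4 headers are accounted for. */
static inline uint16_t
tso_segment_size(uint16_t mtu, uint16_t l3_len, uint16_t l4_len)
{
    return mtu - l3_len - l4_len;
}
```

With a 1500 byte MTU, a 20 byte IPv4 header and a 20 byte TCP header
(data offset 5), this gives 1460 bytes of payload per segment.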


> > +        mbuf->ol_flags |= PKT_TX_TCP_CKSUM | PKT_TX_IP_CKSUM;
> > +        mbuf->tso_segsz = dev->mtu - mbuf->l3_len - mbuf->l4_len;
> > +    }
> > +
> > +    return true;
> > +}
> > +
> > +/* Prepare a batch for HWOL.
> > + * Return the number of good packets in the batch. */
> > +static int
> > +netdev_dpdk_prep_hwol_batch(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> > +                            int pkt_cnt)
> > +{
> > +    int i = 0;
> > +    int cnt = 0;
> > +    struct rte_mbuf *pkt;
> > +
> > +    /* Prepare and filter bad HWOL packets */
> 
> Minor, missing period for comment above.

Ok

> > +    for (i = 0; i < pkt_cnt; i++) {
> > +        pkt = pkts[i];
> > +        if (!netdev_dpdk_prep_hwol_packet(dev, pkt)) {
> > +            rte_pktmbuf_free(pkt);
> > +            continue;
> > +        }
> > +
> > +        if (OVS_UNLIKELY(i != cnt)) {
> > +            pkts[cnt] = pkt;
> > +        }
> > +        cnt++;
> > +    }
> > +
> > +    return cnt;
> > +}
> > +
> >   /* Tries to transmit 'pkts' to txq 'qid' of device 'dev'.  Takes ownership of
> >    * 'pkts', even in case of failure.
> >    *
> > @@ -2060,11 +2144,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
> >                            struct rte_mbuf **pkts, int cnt)
> >   {
> >       uint32_t nb_tx = 0;
> > +    uint16_t nb_tx_prep = cnt;
> > +
> > +    if (tso_enabled()) {
> > +        nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> > +        if (nb_tx_prep != cnt) {
> > +            VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
> > +                         "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
> > +                         cnt, rte_strerror(rte_errno));
> > +        }
> > +    }
> > -    while (nb_tx != cnt) {
> > +    while (nb_tx != nb_tx_prep) {
> >           uint32_t ret;
> > -        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
> > +        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
> > +                               nb_tx_prep - nb_tx);
> >           if (!ret) {
> >               break;
> >           }
> > @@ -2348,11 +2443,14 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> >       int cnt = 0;
> >       struct rte_mbuf *pkt;
> > +    /* Filter oversized packets, unless are marked for TSO. */
> >       for (i = 0; i < pkt_cnt; i++) {
> >           pkt = pkts[i];
> > -        if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
> > -            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len %d",
> > -                         dev->up.name, pkt->pkt_len, dev->max_packet_len);
> > +        if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
> > +            && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
> > +            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
> > +                         "max_packet_len %d", dev->up.name, pkt->pkt_len,
> > +                         dev->max_packet_len);
> >               rte_pktmbuf_free(pkt);
> >               continue;
> >           }
> > @@ -2401,7 +2499,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> >       struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
> >       struct netdev_dpdk_sw_stats sw_stats_add;
> >       unsigned int n_packets_to_free = cnt;
> > -    unsigned int total_packets = cnt;
> > +    unsigned int total_packets;
> >       int i, retries = 0;
> >       int max_retries = VHOST_ENQ_RETRY_MIN;
> >       int vid = netdev_dpdk_get_vid(dev);
> > @@ -2421,7 +2519,8 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> >           rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
> >       }
> > -    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
> > +    total_packets = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
> > +    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, total_packets);
> >       sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
> >       /* Check has QoS has been configured for the netdev */
> > @@ -2470,6 +2569,123 @@ out:
> >       }
> >   }
> > +static void
> > +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
> > +{
> > +    rte_free(opaque);
> > +}
> > +
> > +static struct rte_mbuf *
> > +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
> > +{
> > +    uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
> > +    struct rte_mbuf_ext_shared_info *shinfo = NULL;
> > +    uint16_t buf_len;
> > +    void *buf;
> > +
> > +    if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo)) {
> > +        shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> > +    } else {
> > +        total_len += sizeof(*shinfo) + sizeof(uintptr_t);
> > +        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> > +    }
> > +
> > +    if (unlikely(total_len > UINT16_MAX)) {
> > +        VLOG_ERR("Can't copy packet: too big %u", total_len);
> > +        return NULL;
> > +    }
> > +
> > +    buf_len = total_len;
> > +    buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> > +    if (unlikely(buf == NULL)) {
> > +        VLOG_ERR("Failed to allocate mem using rte_malloc: %u", buf_len);
> > +        return NULL;
> > +    }
> > +
> > +    /* Initialize shinfo */
> > +    if (shinfo) {
> > +        shinfo->free_cb = netdev_dpdk_extbuf_free;
> > +        shinfo->fcb_opaque = buf;
> > +        rte_mbuf_ext_refcnt_set(shinfo, 1);
> > +    } else {
> > +        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> > +                                                    netdev_dpdk_extbuf_free,
> > +                                                    buf);
> > +        if (unlikely(shinfo == NULL)) {
> > +            rte_free(buf);
> > +            VLOG_ERR("Failed to initialize shinfo");
> 
> I'm not sure this error message will be clear to an end user, particularly
> the meaning and context of 'shinfo'. Could we make it more verbose, e.g.
> "Failed to initialize shared info for mbuf while attempting to attach
> external mbuf".

I liked your suggestion.


> 
> > +            return NULL;
> > +        }
> > +    }
> > +
> > +    rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
> > +                              shinfo);
> > +    rte_pktmbuf_reset_headroom(pkt);
> > +
> > +    return pkt;
> > +}
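
For readers following the thread, the buffer sizing done at the top of
dpdk_pktmbuf_attach_extbuf() can be sketched in isolation. This is only an
illustration, not the patch code: sk_extbuf_len(), SK_HEADROOM and
SK_ALIGN_CEIL are hypothetical names, and the 128-byte headroom is an
assumed value standing in for RTE_PKTMBUF_HEADROOM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value for illustration; the real headroom is a DPDK build-time
 * setting. */
#define SK_HEADROOM 128u

/* Round 'v' up to the next multiple of 'a'. */
#define SK_ALIGN_CEIL(v, a) ((((v) + (a) - 1) / (a)) * (a))

/* Hypothetical helper: compute the external buffer length needed for
 * 'data_len' bytes of payload.  'shinfo_size' is the size of the shared-info
 * block and 'shinfo_in_mbuf' says whether it already fits in the mbuf's own
 * tailroom.  Returns false if the result does not fit in 16 bits. */
static bool
sk_extbuf_len(uint32_t data_len, size_t shinfo_size, bool shinfo_in_mbuf,
              uint16_t *buf_len)
{
    uint64_t total = (uint64_t) SK_HEADROOM + data_len;

    if (!shinfo_in_mbuf) {
        /* Reserve space in the new buffer for the shared info, then align. */
        total += shinfo_size + sizeof(uintptr_t);
        total = SK_ALIGN_CEIL(total, sizeof(uintptr_t));
    }

    if (total > UINT16_MAX) {
        return false;
    }

    *buf_len = (uint16_t) total;
    return true;
}
```

The sketch accumulates in 64 bits so the sum cannot wrap before the
UINT16_MAX check; that is a choice made for the example only.
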
> > +
> > +static struct rte_mbuf *
> > +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
> > +{
> > +    struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> > +
> > +    if (OVS_UNLIKELY(!pkt)) {
> > +        return NULL;
> > +    }
> > +
> > +    dp_packet_init_specific((struct dp_packet *)pkt);
> > +    if (rte_pktmbuf_tailroom(pkt) >= data_len) {
> > +        return pkt;
> > +    }
> > +
> > +    if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
> > +        return pkt;
> > +    }
> > +
> > +    rte_pktmbuf_free(pkt);
> > +
> > +    return NULL;
> > +}
> > +
> 
> I assume the function below is expected to be used when a packet originates
> outside of a DPDK interface?

Yes. For example internal ports or veth pairs.

> 
> > +static struct dp_packet *
> > +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
> > +{
> > +    struct rte_mbuf *mbuf_dest;
> > +    struct dp_packet *pkt_dest;
> > +    uint32_t size;
> > +    uint32_t headroom;
> > +
> > +    size = dp_packet_size(pkt_orig);
> > +    mbuf_dest = dpdk_pktmbuf_alloc(mp, size);
> > +    if (OVS_UNLIKELY(mbuf_dest == NULL)) {
> > +            return NULL;
> > +    }
> > +
> > +    pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
> > +    headroom = dp_packet_headroom(pkt_orig);
> > +    dp_packet_set_data(pkt_dest, (char *)dp_packet_data(pkt_dest) + headroom);
> > +    memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), size);
> > +    dp_packet_set_size(pkt_dest, size);
> > +
> > +    mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
> > +    mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
> > +    mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
> > +                            ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
> > +
> > +    memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
> > +           sizeof(struct dp_packet) - offsetof(struct dp_packet, l2_pad_size));
> > +
> > +    if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
> > +        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
> > +                                - (char *)dp_packet_eth(pkt_dest);
> > +        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
> > +                                - (char *) dp_packet_l3(pkt_dest);
> > +    }
> > +
> > +    return pkt_dest;
> > +}
> > +
> >   /* Tx function. Transmit packets indefinitely */
> >   static void
> >   dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> > @@ -2483,7 +2699,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> >       enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
> >   #endif
> >       struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> > -    struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
> > +    struct dp_packet *pkts[PKT_ARRAY_SIZE];
> >       struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
> >       uint32_t cnt = batch_cnt;
> >       uint32_t dropped = 0;
> > @@ -2504,34 +2720,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> >           struct dp_packet *packet = batch->packets[i];
> >           uint32_t size = dp_packet_size(packet);
> > -        if (OVS_UNLIKELY(size > dev->max_packet_len)) {
> > -            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
> > -                         size, dev->max_packet_len);
> > -
> > +        if (size > dev->max_packet_len
> > +            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
> > +            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
> > +                         dev->max_packet_len);
> >               mtu_drops++;
> >               continue;
> >           }
> > -        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> > +        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
> 
> Ah, ignore the previous questions, it's clearer now.

:-)

 
> >           if (OVS_UNLIKELY(!pkts[txcnt])) {
> >               dropped = cnt - i;
> >               break;
> >           }
> > -        /* We have to do a copy for now */
> > -        memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
> > -               dp_packet_data(packet), size);
> > -        dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
> > -
> >           txcnt++;
> >       }
> >       if (OVS_LIKELY(txcnt)) {
> >           if (dev->type == DPDK_DEV_VHOST) {
> > -            __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts,
> > -                                     txcnt);
> > +            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
> >           } else {
> > -            tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt);
> > +            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
> > +                                                   (struct rte_mbuf **)pkts,
> > +                                                   txcnt);
> >           }
> >       }
> > @@ -2589,6 +2801,7 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
> >           int batch_cnt = dp_packet_batch_size(batch);
> >           struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
> > +        batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
> >           tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
> >           mtu_drops = batch_cnt - tx_cnt;
> >           qos_drops = tx_cnt;
> > @@ -4277,6 +4490,12 @@ netdev_dpdk_reconfigure(struct netdev *netdev)
> >       rte_free(dev->tx_q);
> >       err = dpdk_eth_dev_init(dev);
> > +    if (dev->hw_ol_features & NETDEV_TX_TSO_OFFLOAD) {
> > +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> > +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> > +        netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> > +    }
> > +
> >       dev->tx_q = netdev_dpdk_alloc_txq(netdev->n_txq);
> >       if (!dev->tx_q) {
> >           err = ENOMEM;
> > @@ -4306,6 +4525,11 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
> >           dev->tx_q[0].map = 0;
> >       }
> > +    if (tso_enabled()) {
> > +        dev->hw_ol_features |= NETDEV_TX_TSO_OFFLOAD;
> > +        VLOG_DBG("%s: TSO enabled on vhost port", dev->up.name);
> 
> Can we return the device name via netdev_get_name(&dev->up)? I know there is
> an inconsistency in how the name is returned in the logs, but I think the
> preference is to use this function when possible.

Fixed!


> > +    }
> > +
> >       netdev_dpdk_remap_txqs(dev);
> >       err = netdev_dpdk_mempool_configure(dev);
> > @@ -4378,6 +4602,11 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
> >               vhost_flags |= RTE_VHOST_USER_DEQUEUE_ZERO_COPY;
> >           }
> > +        /* Enable External Buffers if TCP Segmentation Offload is enabled */
> Minor, missing period.

OK


> > +        if (tso_enabled()) {
> > +            vhost_flags |= RTE_VHOST_USER_EXTBUF_SUPPORT;
> > +        }
> > +
> >           err = rte_vhost_driver_register(dev->vhost_id, vhost_flags);
> >           if (err) {
> >               VLOG_ERR("vhost-user device setup failure for device %s\n",
> > @@ -4402,14 +4631,20 @@ netdev_dpdk_vhost_client_reconfigure(struct netdev *netdev)
> >               goto unlock;
> >           }
> > -        err = rte_vhost_driver_disable_features(dev->vhost_id,
> > -                                    1ULL << VIRTIO_NET_F_HOST_TSO4
> > -                                    | 1ULL << VIRTIO_NET_F_HOST_TSO6
> > -                                    | 1ULL << VIRTIO_NET_F_CSUM);
> > -        if (err) {
> > -            VLOG_ERR("rte_vhost_driver_disable_features failed for vhost user "
> > -                     "client port: %s\n", dev->up.name);
> > -            goto unlock;
> > +        if (tso_enabled()) {
> > +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> > +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> > +            netdev->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> > +        } else {
> > +            err = rte_vhost_driver_disable_features(dev->vhost_id,
> > +                                        1ULL << VIRTIO_NET_F_HOST_TSO4
> > +                                        | 1ULL << VIRTIO_NET_F_HOST_TSO6
> > +                                        | 1ULL << VIRTIO_NET_F_CSUM);
> > +            if (err) {
> > +                VLOG_ERR("rte_vhost_driver_disable_features failed for "
> > +                         "vhost user client port: %s\n", dev->up.name);
> > +                goto unlock;
> > +            }
> >           }
> >           err = rte_vhost_driver_start(dev->vhost_id);
> > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> > index c14f2fb81..2eb8badd9 100644
> > --- a/lib/netdev-linux-private.h
> > +++ b/lib/netdev-linux-private.h
> > @@ -37,10 +37,14 @@
> >   struct netdev;
> > +#define LINUX_RXQ_TSO_MAX_LEN 65536
> > +
> >   struct netdev_rxq_linux {
> >       struct netdev_rxq up;
> >       bool is_tap;
> >       int fd;
> > +    char *bufaux;          /* Extra buffer to recv TSO pkt */
> > +    int bufaux_len;        /* Extra buffer length */
> >   };
> >   int netdev_linux_construct(struct netdev *);
> > diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
> > index 0a32cf9bc..187206fc5 100644
> > --- a/lib/netdev-linux.c
> > +++ b/lib/netdev-linux.c
> > @@ -29,16 +29,18 @@
> >   #include <linux/filter.h>
> >   #include <linux/gen_stats.h>
> >   #include <linux/if_ether.h>
> > +#include <linux/if_packet.h>
> >   #include <linux/if_tun.h>
> >   #include <linux/types.h>
> >   #include <linux/ethtool.h>
> >   #include <linux/mii.h>
> >   #include <linux/rtnetlink.h>
> >   #include <linux/sockios.h>
> > +#include <linux/virtio_net.h>
> >   #include <sys/ioctl.h>
> >   #include <sys/socket.h>
> > +#include <sys/uio.h>
> >   #include <sys/utsname.h>
> > -#include <netpacket/packet.h>
> >   #include <net/if.h>
> >   #include <net/if_arp.h>
> >   #include <net/route.h>
> > @@ -72,6 +74,7 @@
> >   #include "socket-util.h"
> >   #include "sset.h"
> >   #include "tc.h"
> > +#include "tso.h"
> >   #include "timer.h"
> >   #include "unaligned.h"
> >   #include "openvswitch/vlog.h"
> > @@ -501,6 +504,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> >    * changes in the device miimon status, so we can use atomic_count. */
> >   static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0);
> > +static int netdev_linux_parse_vnet_hdr(struct dp_packet *b);
> > +static void netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu);
> >   static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *,
> >                                      int cmd, const char *cmd_name);
> >   static int get_flags(const struct netdev *, unsigned int *flags);
> > @@ -902,6 +907,13 @@ netdev_linux_common_construct(struct netdev *netdev_)
> >       /* The device could be in the same network namespace or in another one. */
> >       netnsid_unset(&netdev->netnsid);
> >       ovs_mutex_init(&netdev->mutex);
> > +
> > +    if (tso_enabled()) {
> > +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_TSO;
> > +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_TCP_CKSUM;
> > +        netdev_->ol_flags |= NETDEV_TX_OFFLOAD_IPV4_CKSUM;
> > +    }
> > +
> >       return 0;
> >   }
> > @@ -961,6 +973,10 @@ netdev_linux_construct_tap(struct netdev *netdev_)
> >       /* Create tap device. */
> >       get_flags(&netdev->up, &netdev->ifi_flags);
> >       ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
> > +    if (tso_enabled()) {
> > +        ifr.ifr_flags |= IFF_VNET_HDR;
> > +    }
> > +
> >       ovs_strzcpy(ifr.ifr_name, name, sizeof ifr.ifr_name);
> >       if (ioctl(netdev->tap_fd, TUNSETIFF, &ifr) == -1) {
> >           VLOG_WARN("%s: creating tap device failed: %s", name,
> > @@ -1024,6 +1040,13 @@ static struct netdev_rxq *
> >   netdev_linux_rxq_alloc(void)
> >   {
> >       struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> > +    if (tso_enabled()) {
> > +        rx->bufaux = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> > +        if (rx->bufaux) {
> > +            rx->bufaux_len = LINUX_RXQ_TSO_MAX_LEN;
> > +        }
> > +    }
> > +
> >       return &rx->up;
> >   }
> > @@ -1069,6 +1092,17 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
> >               goto error;
> >           }
> > +        if (tso_enabled()) {
> > +            error = setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
> > +                               sizeof val);
> > +            if (error) {
> > +                error = errno;
> > +                VLOG_ERR("%s: failed to enable vnet hdr in txq raw socket: %s",
> > +                         netdev_get_name(netdev_), ovs_strerror(errno));
> > +                goto error;
> > +            }
> > +        }
> > +
> >           /* Set non-blocking mode. */
> >           error = set_nonblocking(rx->fd);
> >           if (error) {
> > @@ -1123,6 +1157,8 @@ netdev_linux_rxq_destruct(struct netdev_rxq *rxq_)
> >       if (!rx->is_tap) {
> >           close(rx->fd);
> >       }
> > +
> > +    free(rx->bufaux);
> >   }
> >   static void
> > @@ -1151,12 +1187,15 @@ auxdata_has_vlan_tci(const struct tpacket_auxdata *aux)
> >       return aux->tp_vlan_tci || aux->tp_status & TP_STATUS_VLAN_VALID;
> >   }
> > +
> Is the extra white space needed here?

Removed.

> >   static int
> > -netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
> > +netdev_linux_rxq_recv_sock(int fd, char *bufaux, int bufaux_len,
> > +                           struct dp_packet *buffer)
> >   {
> > -    size_t size;
> > +    size_t std_len;
> > +    size_t total_len;
> >       ssize_t retval;
> > -    struct iovec iov;
> > +    struct iovec iov[2];
> >       struct cmsghdr *cmsg;
> >       union {
> >           struct cmsghdr cmsg;
> > @@ -1166,14 +1205,17 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
> >       /* Reserve headroom for a single VLAN tag */
> >       dp_packet_reserve(buffer, VLAN_HEADER_LEN);
> > -    size = dp_packet_tailroom(buffer);
> > +    std_len = dp_packet_tailroom(buffer);
> > +    total_len = std_len + bufaux_len;
> > -    iov.iov_base = dp_packet_data(buffer);
> > -    iov.iov_len = size;
> > +    iov[0].iov_base = dp_packet_data(buffer);
> > +    iov[0].iov_len = std_len;
> > +    iov[1].iov_base = bufaux;
> > +    iov[1].iov_len = bufaux_len;
> >       msgh.msg_name = NULL;
> >       msgh.msg_namelen = 0;
> > -    msgh.msg_iov = &iov;
> > -    msgh.msg_iovlen = 1;
> > +    msgh.msg_iov = iov;
> > +    msgh.msg_iovlen = 2;
> >       msgh.msg_control = &cmsg_buffer;
> >       msgh.msg_controllen = sizeof cmsg_buffer;
> >       msgh.msg_flags = 0;
> > @@ -1184,11 +1226,26 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
> >       if (retval < 0) {
> >           return errno;
> > -    } else if (retval > size) {
> > +    } else if (retval > total_len) {
> >           return EMSGSIZE;
> >       }
> > -    dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> > +    if (retval > std_len) {
> > +        /* Build a single linear TSO packet */
> Minor, missing period.

Fixed.


> > +        size_t extra_len = retval - std_len;
> > +
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> > +        dp_packet_prealloc_tailroom(buffer, extra_len);
> > +        memcpy(dp_packet_tail(buffer), bufaux, extra_len);
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> > +    } else {
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> > +    }
> > +
> > +    if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
> > +        VLOG_WARN_RL(&rl, "Invalid virtio net header");
> > +        return EINVAL;
> > +    }
> >       for (cmsg = CMSG_FIRSTHDR(&msgh); cmsg; cmsg = CMSG_NXTHDR(&msgh, cmsg)) {
> >           const struct tpacket_auxdata *aux;
> > @@ -1221,20 +1278,44 @@ netdev_linux_rxq_recv_sock(int fd, struct dp_packet *buffer)
> >   }
> >   static int
> > -netdev_linux_rxq_recv_tap(int fd, struct dp_packet *buffer)
> > +netdev_linux_rxq_recv_tap(int fd, char *bufaux, int bufaux_len,
> > +                          struct dp_packet *buffer)
> >   {
> >       ssize_t retval;
> > -    size_t size = dp_packet_tailroom(buffer);
> > +    size_t std_len;
> > +    struct iovec iov[2];
> > +
> > +    std_len = dp_packet_tailroom(buffer);
> > +    iov[0].iov_base = dp_packet_data(buffer);
> > +    iov[0].iov_len = std_len;
> > +    iov[1].iov_base = bufaux;
> > +    iov[1].iov_len = bufaux_len;
> >       do {
> > -        retval = read(fd, dp_packet_data(buffer), size);
> > +        retval = readv(fd, iov, 2);
> >       } while (retval < 0 && errno == EINTR);
> >       if (retval < 0) {
> >           return errno;
> >       }
> > -    dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> > +    if (retval > std_len) {
> > +        /* Build a single linear TSO packet */
> Minor, missing period.

Fixed.

> 
> > +        size_t extra_len = retval - std_len;
> > +
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + std_len);
> > +        dp_packet_prealloc_tailroom(buffer, extra_len);
> > +        memcpy(dp_packet_tail(buffer), bufaux, extra_len);
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + extra_len);
> > +    } else {
> > +        dp_packet_set_size(buffer, dp_packet_size(buffer) + retval);
> > +    }
> > +
> > +    if (tso_enabled() && netdev_linux_parse_vnet_hdr(buffer)) {
> > +        VLOG_WARN_RL(&rl, "Invalid virtio net header");
> > +        return EINVAL;
> > +    }
> > +
> >       return 0;
> >   }
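
As an aside, the receive path above reads into two buffers at once (the
packet's own tailroom plus the auxiliary TSO buffer) and then builds one
linear packet. A minimal standalone sketch of that idea follows. It is not
the patch code: read_linear() is a hypothetical helper, it copies into a
separate caller-provided 'out' buffer instead of growing a dp_packet, and it
omits the EINTR retry loop.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: read one message from 'fd' into 'std' ('std_len'
 * bytes), letting any overflow land in 'aux' ('aux_len' bytes), then copy
 * both parts back to back into 'out', which must hold at least
 * 'std_len + aux_len' bytes.  Returns the number of bytes read, or -1 on
 * error. */
static ssize_t
read_linear(int fd, char *std, size_t std_len, char *aux, size_t aux_len,
            char *out)
{
    struct iovec iov[2];
    ssize_t n;

    iov[0].iov_base = std;
    iov[0].iov_len = std_len;
    iov[1].iov_base = aux;
    iov[1].iov_len = aux_len;

    n = readv(fd, iov, 2);
    if (n < 0) {
        return -1;
    }

    if ((size_t) n > std_len) {
        /* The data spilled into the auxiliary buffer: join both parts. */
        memcpy(out, std, std_len);
        memcpy(out + std_len, aux, (size_t) n - std_len);
    } else {
        memcpy(out, std, (size_t) n);
    }

    return n;
}
```

The common case (packet fits in the first buffer) pays for no extra copy of
the overflow; only oversized TSO packets take the second memcpy.
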
> > @@ -1245,6 +1326,7 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >       struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_);
> >       struct netdev *netdev = rx->up.netdev;
> >       struct dp_packet *buffer;
> > +    size_t buffer_len;
> >       ssize_t retval;
> >       int mtu;
> > @@ -1252,12 +1334,18 @@ netdev_linux_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch,
> >           mtu = ETH_PAYLOAD_MAX;
> >       }
> > +    buffer_len = VLAN_ETH_HEADER_LEN + mtu;
> > +    if (tso_enabled()) {
> > +            buffer_len += sizeof(struct virtio_net_hdr);
> > +    }
> > +
> >       /* Assume Ethernet port. No need to set packet_type. */
> > -    buffer = dp_packet_new_with_headroom(VLAN_ETH_HEADER_LEN + mtu,
> > -                                           DP_NETDEV_HEADROOM);
> > +    buffer = dp_packet_new_with_headroom(buffer_len, DP_NETDEV_HEADROOM);
> >       retval = (rx->is_tap
> > -              ? netdev_linux_rxq_recv_tap(rx->fd, buffer)
> > -              : netdev_linux_rxq_recv_sock(rx->fd, buffer));
> > +              ? netdev_linux_rxq_recv_tap(rx->fd, rx->bufaux, rx->bufaux_len,
> > +                                          buffer)
> > +              : netdev_linux_rxq_recv_sock(rx->fd, rx->bufaux, rx->bufaux_len,
> > +                                           buffer));
> >       if (retval) {
> >           if (retval != EAGAIN && retval != EMSGSIZE) {
> > @@ -1302,7 +1390,7 @@ netdev_linux_rxq_drain(struct netdev_rxq *rxq_)
> >   }
> >   static int
> > -netdev_linux_sock_batch_send(int sock, int ifindex,
> > +netdev_linux_sock_batch_send(int sock, int ifindex, bool tso, int mtu,
> >                                struct dp_packet_batch *batch)
> >   {
> >       const size_t size = dp_packet_batch_size(batch);
> > @@ -1316,6 +1404,10 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
> >       struct dp_packet *packet;
> >       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > +        if (tso) {
> > +            netdev_linux_prepend_vnet_hdr(packet, mtu);
> > +        }
> > +
> >           iov[i].iov_base = dp_packet_data(packet);
> >           iov[i].iov_len = dp_packet_size(packet);
> >           mmsg[i].msg_hdr = (struct msghdr) { .msg_name = &sll,
> > @@ -1348,7 +1440,7 @@ netdev_linux_sock_batch_send(int sock, int ifindex,
> >    * on other interface types because we attach a socket filter to the rx
> >    * socket. */
> >   static int
> > -netdev_linux_tap_batch_send(struct netdev *netdev_,
> > +netdev_linux_tap_batch_send(struct netdev *netdev_, bool tso, int mtu,
> >                               struct dp_packet_batch *batch)
> >   {
> >       struct netdev_linux *netdev = netdev_linux_cast(netdev_);
> > @@ -1365,10 +1457,15 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
> >       }
> >       DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > -        size_t size = dp_packet_size(packet);
> > +        size_t size;
> >           ssize_t retval;
> >           int error;
> > +        if (tso) {
> > +            netdev_linux_prepend_vnet_hdr(packet, mtu);
> > +        }
> > +
> > +        size = dp_packet_size(packet);
> >           do {
> >               retval = write(netdev->tap_fd, dp_packet_data(packet), size);
> >               error = retval < 0 ? errno : 0;
> > @@ -1403,9 +1500,15 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
> >                     struct dp_packet_batch *batch,
> >                     bool concurrent_txq OVS_UNUSED)
> >   {
> > +    bool tso = tso_enabled();
> > +    int mtu = ETH_PAYLOAD_MAX;
> >       int error = 0;
> >       int sock = 0;
> > +    if (tso) {
> > +        netdev_linux_get_mtu__(netdev_linux_cast(netdev_), &mtu);
> > +    }
> > +
> >       if (!is_tap_netdev(netdev_)) {
> >           if (netdev_linux_netnsid_is_remote(netdev_linux_cast(netdev_))) {
> >               error = EOPNOTSUPP;
> > @@ -1424,9 +1527,9 @@ netdev_linux_send(struct netdev *netdev_, int qid OVS_UNUSED,
> >               goto free_batch;
> >           }
> > -        error = netdev_linux_sock_batch_send(sock, ifindex, batch);
> > +        error = netdev_linux_sock_batch_send(sock, ifindex, tso, mtu, batch);
> >       } else {
> > -        error = netdev_linux_tap_batch_send(netdev_, batch);
> > +        error = netdev_linux_tap_batch_send(netdev_, tso, mtu, batch);
> >       }
> >       if (error) {
> >           if (error == ENOBUFS) {
> > @@ -6170,6 +6273,19 @@ af_packet_sock(void)
> >                   close(sock);
> >                   sock = -error;
> >               }
> > +
> > +            if (tso_enabled()) {
> > +                int val = 1;
> > +                error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
> > +                                   sizeof val);
> > +                if (error) {
> > +                    error = errno;
> > +                    VLOG_ERR("failed to enable vnet hdr in raw socket: %s",
> > +                             ovs_strerror(errno));
> > +                    close(sock);
> > +                    sock = -error;
> > +                }
> > +            }
> >           } else {
> >               sock = -errno;
> >               VLOG_ERR("failed to create packet socket: %s",
> > @@ -6180,3 +6296,134 @@ af_packet_sock(void)
> >       return sock;
> >   }
> > +
> > +static int
> > +netdev_linux_parse_l2(struct dp_packet *b, uint16_t *l4proto)
> > +{
> > +    struct eth_header *eth_hdr;
> > +    ovs_be16 eth_type;
> > +    int l2_len;
> > +
> > +    eth_hdr = dp_packet_at(b, 0, ETH_HEADER_LEN);
> > +    if (!eth_hdr) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    l2_len = ETH_HEADER_LEN;
> > +    eth_type = eth_hdr->eth_type;
> > +    if (eth_type_vlan(eth_type)) {
> > +        struct vlan_header *vlan = dp_packet_at(b, l2_len, VLAN_HEADER_LEN);
> > +
> > +        if (!vlan) {
> > +            return -EINVAL;
> > +        }
> > +
> > +        eth_type = vlan->vlan_next_type;
> > +        l2_len += VLAN_HEADER_LEN;
> > +    }
> > +
> > +    if (eth_type == htons(ETH_TYPE_IP)) {
> > +        struct ip_header *ip_hdr = dp_packet_at(b, l2_len, IP_HEADER_LEN);
> > +
> > +        if (!ip_hdr) {
> > +            return -EINVAL;
> > +        }
> > +
> > +        *l4proto = ip_hdr->ip_proto;
> > +        dp_packet_hwol_set_tx_ipv4(b);
> > +    } else if (eth_type == htons(ETH_TYPE_IPV6)) {
> > +        struct ovs_16aligned_ip6_hdr *nh6;
> > +
> > +        nh6 = dp_packet_at(b, l2_len, IPV6_HEADER_LEN);
> > +        if (!nh6) {
> > +            return -EINVAL;
> > +        }
> > +
> > +        *l4proto = nh6->ip6_ctlun.ip6_un1.ip6_un1_nxt;
> > +        dp_packet_hwol_set_tx_ipv6(b);
> > +    }
> > +
> > +    return 0;
> > +}
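
To make the L2 walk in netdev_linux_parse_l2() easier to follow, here is a
self-contained sketch that works on a raw byte buffer instead of a
dp_packet. All names (sk_parse_l2, sk_get_be16, the SK_* constants) are
hypothetical and exist only for this example; like the function above, it
skips at most one VLAN tag.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_ETH_HDR_LEN 14
#define SK_VLAN_HDR_LEN 4
#define SK_ETH_TYPE_VLAN 0x8100
#define SK_ETH_TYPE_VLAN_8021AD 0x88a8

/* Read a big-endian 16-bit value. */
static uint16_t
sk_get_be16(const uint8_t *p)
{
    return (uint16_t) ((p[0] << 8) | p[1]);
}

/* Hypothetical helper: find the L3 EtherType and the L2 header length of
 * the frame in 'data', skipping at most one VLAN tag.  Returns 0 on success
 * or -1 if the frame is too short. */
static int
sk_parse_l2(const uint8_t *data, size_t len, uint16_t *eth_type,
            size_t *l2_len)
{
    size_t ofs = SK_ETH_HDR_LEN;
    uint16_t type;

    if (len < ofs) {
        return -1;
    }
    /* The EtherType follows the two 6-byte MAC addresses. */
    type = sk_get_be16(data + 12);

    if (type == SK_ETH_TYPE_VLAN || type == SK_ETH_TYPE_VLAN_8021AD) {
        if (len < ofs + SK_VLAN_HDR_LEN) {
            return -1;
        }
        /* The encapsulated EtherType follows the 2-byte TCI. */
        type = sk_get_be16(data + ofs + 2);
        ofs += SK_VLAN_HDR_LEN;
    }

    *eth_type = type;
    *l2_len = ofs;
    return 0;
}
```
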
> > +
> > +static int
> > +netdev_linux_parse_vnet_hdr(struct dp_packet *b)
> > +{
> > +    struct virtio_net_hdr *vnet = dp_packet_pull(b, sizeof *vnet);
> > +    uint16_t l4proto = 0;
> > +
> > +    if (OVS_UNLIKELY(!vnet)) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (vnet->flags == 0 && vnet->gso_type == VIRTIO_NET_HDR_GSO_NONE) {
> > +        return 0;
> > +    }
> > +
> > +    if (netdev_linux_parse_l2(b, &l4proto)) {
> > +        return -EINVAL;
> > +    }
> > +
> > +    if (vnet->flags == VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> > +        if (l4proto == IPPROTO_TCP) {
> > +            dp_packet_hwol_set_csum_tcp(b);
> > +        } else if (l4proto == IPPROTO_UDP) {
> > +            dp_packet_hwol_set_csum_udp(b);
> > +        } else if (l4proto == IPPROTO_SCTP) {
> > +            dp_packet_hwol_set_csum_sctp(b);
> > +        }
> > +    }
> > +
> > +    if (l4proto && vnet->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> > +        uint8_t allowed_mask = VIRTIO_NET_HDR_GSO_TCPV4
> > +                                | VIRTIO_NET_HDR_GSO_TCPV6
> > +                                | VIRTIO_NET_HDR_GSO_UDP;
> > +        uint8_t type = vnet->gso_type & allowed_mask;
> > +
> > +        if (type == VIRTIO_NET_HDR_GSO_TCPV4
> > +            || type == VIRTIO_NET_HDR_GSO_TCPV6) {
> > +            dp_packet_hwol_set_tcp_seg(b);
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static void
> > +netdev_linux_prepend_vnet_hdr(struct dp_packet *b, int mtu)
> > +{
> > +    struct virtio_net_hdr *vnet = dp_packet_push_zeros(b, sizeof *vnet);
> > +
> > +    if ((dp_packet_size(b) > mtu) && dp_packet_hwol_is_tso(b)) {
> > +
> > +        vnet->hdr_len = (char *)dp_packet_l4(b) - (char *)dp_packet_eth(b);
> > +        vnet->hdr_len += TCP_HEADER_LEN;
> > +        vnet->gso_size = mtu - vnet->hdr_len;
> > +        if (dp_packet_hwol_is_ipv4(b)) {
> > +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> > +        } else {
> > +            vnet->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> > +        }
> > +
> > +    } else {
> > +        vnet->flags = VIRTIO_NET_HDR_GSO_NONE;
> > +    }
> > +
> > +    if (dp_packet_hwol_l4_mask(b)) {
> > +        vnet->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
> > +        vnet->csum_start = (char *)dp_packet_l4(b) - (char *)dp_packet_eth(b);
> > +
> > +        if (dp_packet_hwol_l4_is_tcp(b)) {
> > +            vnet->csum_offset = __builtin_offsetof(struct tcp_header,
> > +                                                   tcp_csum);
> > +        } else if (dp_packet_hwol_l4_is_udp(b)) {
> > +            vnet->csum_offset = __builtin_offsetof(struct udp_header,
> > +                                                   udp_csum);
> > +        } else if (dp_packet_hwol_l4_is_sctp(b)) {
> > +            vnet->csum_offset = __builtin_offsetof(struct sctp_header,
> > +                                                   sctp_csum);
> > +        } else {
> > +            VLOG_WARN_RL(&rl, "Unsupported L4 protocol");
> > +        }
> > +    }
> > +}
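
The header arithmetic in netdev_linux_prepend_vnet_hdr() may be easier to
check with concrete numbers. The sketch below is standalone and only
illustrative: struct sk_vnet_hdr and the SK_* constants are local stand-ins
that mirror the layout and values in <linux/virtio_net.h>,
sk_fill_vnet_hdr() is a hypothetical helper, only the TCP case is covered,
and fields are kept in host byte order. Unlike the patch, the non-TSO branch
here writes gso_type rather than flags.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins mirroring <linux/virtio_net.h>. */
enum {
    SK_F_NEEDS_CSUM = 1,
    SK_GSO_NONE = 0,
    SK_GSO_TCPV4 = 1,
    SK_GSO_TCPV6 = 4,
};

#define SK_TCP_HDR_LEN 20   /* TCP header without options. */
#define SK_TCP_CSUM_OFS 16  /* Offset of the checksum in the TCP header. */

struct sk_vnet_hdr {
    uint8_t flags;
    uint8_t gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
};

/* Hypothetical helper: fill 'h' for a TCP packet of 'pkt_len' bytes whose
 * L4 header starts 'l4_ofs' bytes into the frame. */
static void
sk_fill_vnet_hdr(struct sk_vnet_hdr *h, uint32_t pkt_len, uint16_t mtu,
                 uint16_t l4_ofs, bool tso, bool ipv4, bool tcp_csum)
{
    memset(h, 0, sizeof *h);

    if (tso && pkt_len > mtu) {
        /* Headers up to and including TCP are replicated per segment. */
        h->hdr_len = (uint16_t) (l4_ofs + SK_TCP_HDR_LEN);
        h->gso_size = (uint16_t) (mtu - h->hdr_len);
        h->gso_type = ipv4 ? SK_GSO_TCPV4 : SK_GSO_TCPV6;
    } else {
        h->gso_type = SK_GSO_NONE;
    }

    if (tcp_csum) {
        h->flags = SK_F_NEEDS_CSUM;
        h->csum_start = l4_ofs;
        h->csum_offset = SK_TCP_CSUM_OFS;
    }
}
```

For an untagged IPv4 frame (L4 at offset 34) with a 1500-byte MTU, this
gives hdr_len = 54 and gso_size = 1446.
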
> > diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h
> > index 1e5a40c89..be1db62f8 100644
> > --- a/lib/netdev-provider.h
> > +++ b/lib/netdev-provider.h
> > @@ -37,6 +37,12 @@ extern "C" {
> >   struct netdev_tnl_build_header_params;
> >   #define NETDEV_NUMA_UNSPEC OVS_NUMA_UNSPEC
> > +enum netdev_ol_flags {
> > +    NETDEV_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
> > +    NETDEV_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
> > +    NETDEV_TX_OFFLOAD_TCP_TSO = 1 << 2,
> > +};
> > +
> >   /* A network device (e.g. an Ethernet device).
> >    *
> >    * Network device implementations may read these members but should not modify
> > @@ -51,6 +57,10 @@ struct netdev {
> >        * opening this device, and therefore got assigned to the "system" class */
> >       bool auto_classified;
> > +    /* This bitmask of the offloading features enabled/supported by the
> > +     * supported by the netdev. */
> > +    uint64_t ol_flags;
> > +
> >       /* If this is 'true', the user explicitly specified an MTU for this
> >        * netdev.  Otherwise, Open vSwitch is allowed to override it. */
> >       bool mtu_user_config;
> > diff --git a/lib/netdev.c b/lib/netdev.c
> > index af8f8560d..c33378803 100644
> > --- a/lib/netdev.c
> > +++ b/lib/netdev.c
> > @@ -782,6 +782,52 @@ netdev_get_pt_mode(const struct netdev *netdev)
> >               : NETDEV_PT_LEGACY_L2);
> >   }
> > +/* Check if a 'packet' is compatible with 'netdev_flags'.
> > + * If a packet is incompatible, return 'false' with the 'errormsg'
> > + * pointing to a reason. */
> > +static bool
> > +netdev_send_prepare_packet(const uint64_t netdev_flags,
> > +                           struct dp_packet *packet, char **errormsg)
> > +{
> > +    if (dp_packet_hwol_is_tso(packet)
> > +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_TSO)) {
> > +            /* fall back to GSO in software */
> Minor formatting, capitalize start of comment, add missing period.

Fixed here and right below as well.

> 
> > +            *errormsg = "No TSO support";
> > +            return false;
> > +    }
> > +
> > +    if (dp_packet_hwol_l4_mask(packet)
> > +        && !(netdev_flags & NETDEV_TX_OFFLOAD_TCP_CKSUM)) {
> > +            /* fall back to L4 csum in software */

here.

> > +            *errormsg = "No L4 checksum support";
> > +            return false;
> > +    }
> > +
> > +    return true;
> > +}
> > +
> > +/* Check if each packet in 'batch' is compatible with 'netdev' features,
> > + * otherwise either fall back to software implementation or drop it. */
> > +static void
> > +netdev_send_prepare_batch(const struct netdev *netdev,
> > +                          struct dp_packet_batch *batch)
> > +{
> > +    struct dp_packet *packet;
> > +    size_t i, size = dp_packet_batch_size(batch);
> > +
> > +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> > +        char *errormsg = NULL;
> > +
> > +        if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
> > +            dp_packet_batch_refill(batch, packet, i);
> > +        } else {
> > +            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
> > +                         errormsg ? errormsg : "Unsupported feature",
> > +                         netdev_get_name(netdev));
> > +        }
> > +    }
> > +}
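
The capability check and batch refill above boil down to a bitmask test plus
an in-place compaction of the packet array. A minimal sketch under stated
assumptions: struct sk_packet, sk_packet_ok() and sk_filter_batch() are
hypothetical, the enum only mirrors the flag layout proposed in
netdev-provider.h, and dropped packets are simply skipped rather than freed
or logged.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the flag layout proposed for netdev-provider.h. */
enum sk_ol_flags {
    SK_TX_OFFLOAD_IPV4_CKSUM = 1 << 0,
    SK_TX_OFFLOAD_TCP_CKSUM = 1 << 1,
    SK_TX_OFFLOAD_TCP_TSO = 1 << 2,
};

/* Hypothetical stand-in for the offload state carried by a dp_packet. */
struct sk_packet {
    bool wants_tso;
    bool wants_l4_csum;
};

/* Returns true if the device flags cover everything 'p' asks for;
 * otherwise sets '*why' to a short reason and returns false. */
static bool
sk_packet_ok(uint64_t dev_flags, const struct sk_packet *p, const char **why)
{
    if (p->wants_tso && !(dev_flags & SK_TX_OFFLOAD_TCP_TSO)) {
        *why = "No TSO support";
        return false;
    }
    if (p->wants_l4_csum && !(dev_flags & SK_TX_OFFLOAD_TCP_CKSUM)) {
        *why = "No L4 checksum support";
        return false;
    }
    return true;
}

/* Keeps only the compatible packets at the front of 'pkts', preserving
 * their order, and returns how many were kept. */
static size_t
sk_filter_batch(uint64_t dev_flags, struct sk_packet **pkts, size_t n)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        const char *why = NULL;

        if (sk_packet_ok(dev_flags, pkts[i], &why)) {
            pkts[kept++] = pkts[i];
        }
    }
    return kept;
}
```
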
> > +
> >   /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every packet),
> >    * otherwise a positive errno value.  Returns EAGAIN without blocking if
> >    * at least one the packets cannot be queued immediately.  Returns EMSGSIZE
> > @@ -811,8 +857,10 @@ int
> >   netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
> >               bool concurrent_txq)
> >   {
> > -    int error = netdev->netdev_class->send(netdev, qid, batch,
> > -                                           concurrent_txq);
> > +    int error;
> > +
> > +    netdev_send_prepare_batch(netdev, batch);
> > +    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
> >       if (!error) {
> >           COVERAGE_INC(netdev_sent);
> >       }
> > diff --git a/lib/tso.c b/lib/tso.c
> > new file mode 100644
> > index 000000000..2b062d14a
> > --- /dev/null
> > +++ b/lib/tso.c
> > @@ -0,0 +1,54 @@
> > +/*
> > + * Copyright (c) 2019 Red Hat, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#include <config.h>
> > +
> > +#include "smap.h"
> > +#include "ovs-thread.h"
> > +#include "openvswitch/vlog.h"
> > +#include "dpdk.h"
> > +#include "tso.h"
> > +#include "vswitch-idl.h"
> > +
> > +VLOG_DEFINE_THIS_MODULE(tso);
> > +
> > +static bool tso_support_enabled = false;
> > +
> > +void
> > +tso_init(const struct smap *ovs_other_config)
> > +{
> > +    if (smap_get_bool(ovs_other_config, "tso-support", false)) {
> > +        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> > +
> > +        if (ovsthread_once_start(&once)) {
> > +            if (dpdk_available()) {
> > +                VLOG_INFO("TCP Segmentation Offloading (TSO) support enabled");
> > +                tso_support_enabled = true;
> > +            } else {
> > +                VLOG_ERR("TCP Segmentation Offloading (TSO) is unsupported "
> > +                         "without enabling DPDK");
> > +                tso_support_enabled = false;
> > +            }
> > +            ovsthread_once_done(&once);
> > +        }
> > +    }
> > +}
> > +
> > +bool
> > +tso_enabled(void)
> > +{
> > +    return tso_support_enabled;
> > +}
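
As I read tso_init() above, the setting is latched the first time it is
seen as true, so later changes are ignored until the daemon restarts. A
minimal sketch of that behaviour follows. It is single-threaded and uses
a plain bool where the patch uses ovsthread_once, and every sketch_* name
is hypothetical.

```c
#include <stdbool.h>

static bool sketch_tso_enabled = false;
static bool sketch_tso_latched = false;

/* 'requested' models the "tso-support" key and 'dpdk_ok' models the
 * result of dpdk_available().  The decision is taken once, the first
 * time 'requested' is true.  Not thread-safe, unlike ovsthread_once. */
static void
sketch_tso_init(bool requested, bool dpdk_ok)
{
    if (requested && !sketch_tso_latched) {
        sketch_tso_latched = true;
        sketch_tso_enabled = dpdk_ok;
    }
}

static bool
sketch_tso_is_enabled(void)
{
    return sketch_tso_enabled;
}
```

In this sketch, a call with 'requested' set to false after the latch has
been taken leaves the state untouched.
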
> > diff --git a/lib/tso.h b/lib/tso.h
> > new file mode 100644
> > index 000000000..5cc6993f3
> > --- /dev/null
> > +++ b/lib/tso.h
> > @@ -0,0 +1,23 @@
> > +/*
> > + * Copyright (c) 2019 Red Hat Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef TSO_H
> > +#define TSO_H 1
> > +
> > +void tso_init(const struct smap *ovs_other_config);
> > +bool tso_enabled(void);
> > +
> > +#endif /* tso.h */
> > diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> > index 9095ebf5d..a1cd1c541 100644
> > --- a/vswitchd/bridge.c
> > +++ b/vswitchd/bridge.c
> > @@ -65,6 +65,7 @@
> >   #include "system-stats.h"
> >   #include "timeval.h"
> >   #include "tnl-ports.h"
> > +#include "tso.h"
> >   #include "util.h"
> >   #include "unixctl.h"
> >   #include "lib/vswitch-idl.h"
> > @@ -3234,6 +3235,7 @@ bridge_run(void)
> >       if (cfg) {
> >           netdev_set_flow_api_enabled(&cfg->other_config);
> >           dpdk_init(&cfg->other_config);
> > +        tso_init(&cfg->other_config);
> >       }
> >       /* Initialize the ofproto library.  This only needs to run once, but
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index efdfb83bb..ae7f4f265 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -690,6 +690,18 @@
> >            once in few hours or a day or a week.
> >           </p>
> >         </column>
> > +      <column name="other_config" key="tso-support"
> > +              type='{"type": "boolean"}'>
> > +        <p>
> > +          Set this value to <code>true</code> to enable support for TSO (TCP
> > +          Segmentation Offloading). When TSO is enabled, vhost-user client
> > +          interfaces can transmit packets up to 64KB.
> > +        </p>
> > +        <p>
> > +          The default value is <code>false</code>. Changing this value requires
> > +          restarting the daemon
> Minor missing period at end of sentence.

Fixed.

> Also would it be worth flagging this as part of the TSO doc you've
> introduced?

Good idea.

Thanks Ian!


> 
> > +        </p>
> > +      </column>
> >       </group>
> >       <group title="Status">
> >         <column name="next_cfg">
> > 

-- 
fbl

