[ovs-dev] [PATCH v3 3/3] netdev-dpdk: Add TCP Segmentation Offload support

Flavio Leitner fbl at sysclose.org
Tue Jan 14 22:17:11 UTC 2020


On Tue, Jan 14, 2020 at 06:48:14PM +0100, Ilya Maximets wrote:
> On 09.01.2020 15:44, Flavio Leitner wrote:
> > Abbreviated as TSO, TCP Segmentation Offload is a feature which enables
> > the network stack to delegate the TCP segmentation to the NIC reducing
> > the per packet CPU overhead.
> > 
> > A guest using vhostuser interface with TSO enabled can send TCP packets
> > much bigger than the MTU, which saves CPU cycles normally used to break
> > the packets down to MTU size and to calculate checksums.
> > 
> > It also saves CPU cycles used to parse multiple packets/headers during
> > the packet processing inside virtual switch.
> > 
> > If the destination of the packet is another guest in the same host, then
> > the same big packet can be sent through a vhostuser interface skipping
> > the segmentation completely. However, if the destination is not local,
> > the NIC hardware is instructed to do the TCP segmentation and checksum
> > calculation.
> > 
> > It is recommended to check if NIC hardware supports TSO before enabling
> > the feature, which is off by default. For additional information please
> > check the tso.rst document.
> > 
> > Signed-off-by: Flavio Leitner <fbl at sysclose.org>
> > ---
> 
> It seems this patch needs a rebase due to recvmmsg related changes in
> netdev-linux.

Yeah, Ian requested that too. I am working on it.

 
> I didn't check the sizes and offsets inside the code and didn't look
> close to the features enabling on devices. Some comments inline.

OK, thanks for this review anyways.

> >  Documentation/automake.mk           |   1 +
> >  Documentation/topics/dpdk/index.rst |   1 +
> >  Documentation/topics/dpdk/tso.rst   |  96 +++++++++
> >  NEWS                                |   1 +
> >  lib/automake.mk                     |   2 +
> >  lib/conntrack.c                     |  29 ++-
> >  lib/dp-packet.h                     | 152 +++++++++++++-
> >  lib/ipf.c                           |  32 +--
> >  lib/netdev-dpdk.c                   | 312 ++++++++++++++++++++++++----
> >  lib/netdev-linux-private.h          |   4 +
> >  lib/netdev-linux.c                  | 296 +++++++++++++++++++++++---
> >  lib/netdev-provider.h               |  10 +
> >  lib/netdev.c                        |  66 +++++-
> >  lib/tso.c                           |  54 +++++
> >  lib/tso.h                           |  23 ++
> >  vswitchd/bridge.c                   |   2 +
> >  vswitchd/vswitch.xml                |  12 ++
> >  17 files changed, 1002 insertions(+), 91 deletions(-)
> >  create mode 100644 Documentation/topics/dpdk/tso.rst
> >  create mode 100644 lib/tso.c
> >  create mode 100644 lib/tso.h
> > 
> > Changelog:
> > - v3
> >  * Improved the documentation.
> >  * Updated copyright year to 2020.
> >  * TSO offloaded msg now includes the netdev's name.
> >  * Added period at the end of all code comments.
> >  * Warn and drop encapsulation of TSO packets.
> >  * Fixed travis issue with restricted virtio types.
> >  * Fixed double headroom allocation in dpdk_copy_dp_packet_to_mbuf()
> >    which caused packet corruption.
> >  * Fixed netdev_dpdk_prep_hwol_packet() to unconditionally set
> >    PKT_TX_IP_CKSUM only for IPv4 packets.
> > 
> > diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> > index f2ca17bad..284327edd 100644
> > --- a/Documentation/automake.mk
> > +++ b/Documentation/automake.mk
> > @@ -35,6 +35,7 @@ DOC_SOURCE = \
> >  	Documentation/topics/dpdk/index.rst \
> >  	Documentation/topics/dpdk/bridge.rst \
> >  	Documentation/topics/dpdk/jumbo-frames.rst \
> > +	Documentation/topics/dpdk/tso.rst \
> >  	Documentation/topics/dpdk/memory.rst \
> >  	Documentation/topics/dpdk/pdump.rst \
> >  	Documentation/topics/dpdk/phy.rst \
> > diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> > index f2862ea70..400d56051 100644
> > --- a/Documentation/topics/dpdk/index.rst
> > +++ b/Documentation/topics/dpdk/index.rst
> > @@ -40,4 +40,5 @@ DPDK Support
> >     /topics/dpdk/qos
> >     /topics/dpdk/pdump
> >     /topics/dpdk/jumbo-frames
> > +   /topics/dpdk/tso
> >     /topics/dpdk/memory
> > diff --git a/Documentation/topics/dpdk/tso.rst b/Documentation/topics/dpdk/tso.rst
> > new file mode 100644
> > index 000000000..189c86480
> > --- /dev/null
> > +++ b/Documentation/topics/dpdk/tso.rst
> > @@ -0,0 +1,96 @@
> > +..
> > +      Copyright 2020, Red Hat, Inc.
> > +
> > +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> > +      not use this file except in compliance with the License. You may obtain
> > +      a copy of the License at
> > +
> > +          http://www.apache.org/licenses/LICENSE-2.0
> > +
> > +      Unless required by applicable law or agreed to in writing, software
> > +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> > +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> > +      License for the specific language governing permissions and limitations
> > +      under the License.
> > +
> > +      Convention for heading levels in Open vSwitch documentation:
> > +
> > +      =======  Heading 0 (reserved for the title in a document)
> > +      -------  Heading 1
> > +      ~~~~~~~  Heading 2
> > +      +++++++  Heading 3
> > +      '''''''  Heading 4
> > +
> > +      Avoid deeper levels because they do not render well.
> > +
> > +========================
> > +Userspace Datapath - TSO
> > +========================
> > +
> > +**Note:** This feature is considered experimental.
> > +
> > +TCP Segmentation Offload (TSO) enables a network stack to delegate segmentation
> > +of an oversized TCP segment to the underlying physical NIC. Offload of frame
> > +segmentation achieves computational savings in the core, freeing up CPU cycles
> > +for more useful work.
> > +
> > +A common use case for TSO is when using virtualization, where traffic that's
> > +coming in from a VM can offload the TCP segmentation, thus avoiding the
> > +fragmentation in software. Additionally, if the traffic is headed to a VM
> > +within the same host further optimization can be expected. As the traffic never
> > +leaves the machine, no MTU needs to be accounted for, and thus no segmentation
> > +and checksum calculations are required, which saves yet more cycles. Only when
> > +the traffic actually leaves the host the segmentation needs to happen, in which
> > +case it will be performed by the egress NIC. Consult your controller's
> > +datasheet for compatibility. Secondly, the NIC must have an associated DPDK
> > +Poll Mode Driver (PMD) which supports `TSO`. For a list of features per PMD,
> > +refer to the `DPDK documentation`__.
> > +
> > +__ https://doc.dpdk.org/guides/nics/overview.html
> 
> This should point to 19.11 version of a guide instead of latest one:
> https://doc.dpdk.org/guides-19.11/nics/overview.html

Sounds fair, ok.

> 
> > +
> > +Enabling TSO
> > +~~~~~~~~~~~~
> > +
> > +The TSO support may be enabled via a global config value ``tso-support``.
> > +Setting this to ``true`` enables TSO support for all ports.
> > +
> > +    $ ovs-vsctl set Open_vSwitch . other_config:tso-support=true
> 
> 
> I'd suggest to rename this to 'userspace-tso-support' to avoid misunderstanding.

ok, I will rename.


> > +
> > +The default value is ``false``.
> > +
> > +Changing ``tso-support`` requires restarting the daemon.
> > +
> > +When using :doc:`vHost User ports <vhost-user>`, TSO may be enabled as follows.
> > +
> > +`TSO` is enabled in OvS by the DPDK vHost User backend; when a new guest
> > +connection is established, `TSO` is thus advertised to the guest as an
> > +available feature:
> > +
> > +QEMU Command Line Parameter::
> > +
> > +    $ sudo $QEMU_DIR/x86_64-softmmu/qemu-system-x86_64 \
> > +    ...
> > +    -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,\
> > +    csum=on,guest_csum=on,guest_tso4=on,guest_tso6=on\
> > +    ...
> > +
> > +2. Ethtool. Assuming that the guest's OS also supports `TSO`, ethtool can be
> > +used to enable same::
> > +
> > +    $ ethtool -K eth0 sg on     # scatter-gather is a prerequisite for TSO
> > +    $ ethtool -K eth0 tso on
> > +    $ ethtool -k eth0
> > +
> > +~~~~~~~~~~~
> > +Limitations
> > +~~~~~~~~~~~
> > +
> > +The current OvS userspace `TSO` implementation supports flat and VLAN networks
> > +only (i.e. no support for `TSO` over tunneled connection [VxLAN, GRE, IPinIP,
> > +etc.]).
> > +
> > +There is no software implementation of TSO, so all ports attached to the
> > +datapath must support TSO or packets using that feature will be dropped
> > +on ports without TSO support.  That also means guests using vhost-user
> > +in client mode will receive TSO packet regardless of TSO being enabled
> > +or disabled within the guest.
> > diff --git a/NEWS b/NEWS
> > index 965facaf8..306c0493d 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -26,6 +26,7 @@ Post-v2.12.0
> >       * DPDK ring ports (dpdkr) are deprecated and will be removed in next
> >         releases.
> >       * Add support for DPDK 19.11.
> > +     * Add experimental support for TSO.
> >     - RSTP:
> >       * The rstp_statistics column in Port table will only be updated every
> >         stats-update-interval configured in Open_vSwtich table.
> > diff --git a/lib/automake.mk b/lib/automake.mk
> > index ebf714501..94a1b4459 100644
> > --- a/lib/automake.mk
> > +++ b/lib/automake.mk
> > @@ -304,6 +304,8 @@ lib_libopenvswitch_la_SOURCES = \
> >  	lib/tnl-neigh-cache.h \
> >  	lib/tnl-ports.c \
> >  	lib/tnl-ports.h \
> > +	lib/tso.c \
> > +	lib/tso.h \
> 
> s/tso/userspace-tso/

Yup

> 
> >  	lib/netdev-native-tnl.c \
> >  	lib/netdev-native-tnl.h \
> >  	lib/token-bucket.c \
> > diff --git a/lib/conntrack.c b/lib/conntrack.c
> > index b80080e72..679054b98 100644
> > --- a/lib/conntrack.c
> > +++ b/lib/conntrack.c
> > @@ -2022,7 +2022,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
> >          if (hwol_bad_l3_csum) {
> >              ok = false;
> >          } else {
> > -            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt);
> > +            bool hwol_good_l3_csum = dp_packet_ip_checksum_valid(pkt)
> > +                                     || dp_packet_hwol_tx_ip_checksum(pkt);
> >              /* Validate the checksum only when hwol is not supported. */
> >              ok = extract_l3_ipv4(&ctx->key, l3, dp_packet_l3_size(pkt), NULL,
> >                                   !hwol_good_l3_csum);
> > @@ -2036,7 +2037,8 @@ conn_key_extract(struct conntrack *ct, struct dp_packet *pkt, ovs_be16 dl_type,
> >      if (ok) {
> >          bool hwol_bad_l4_csum = dp_packet_l4_checksum_bad(pkt);
> >          if (!hwol_bad_l4_csum) {
> > -            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt);
> > +            bool  hwol_good_l4_csum = dp_packet_l4_checksum_valid(pkt)
> > +                                      || dp_packet_hwol_tx_l4_checksum(pkt);
> >              /* Validate the checksum only when hwol is not supported. */
> >              if (extract_l4(&ctx->key, l4, dp_packet_l4_size(pkt),
> >                             &ctx->icmp_related, l3, !hwol_good_l4_csum,
> > @@ -3237,8 +3239,11 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> >                  }
> >                  if (seq_skew) {
> >                      ip_len = ntohs(l3_hdr->ip_tot_len) + seq_skew;
> > -                    l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> > -                                          l3_hdr->ip_tot_len, htons(ip_len));
> > +                    if (!dp_packet_hwol_tx_ip_checksum(pkt)) {
> > +                        l3_hdr->ip_csum = recalc_csum16(l3_hdr->ip_csum,
> > +                                                        l3_hdr->ip_tot_len,
> > +                                                        htons(ip_len));
> > +                    }
> >                      l3_hdr->ip_tot_len = htons(ip_len);
> >                  }
> >              }
> > @@ -3256,13 +3261,15 @@ handle_ftp_ctl(struct conntrack *ct, const struct conn_lookup_ctx *ctx,
> >      }
> >  
> >      th->tcp_csum = 0;
> > -    if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> > -        th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> > -                           dp_packet_l4_size(pkt));
> > -    } else {
> > -        uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> > -        th->tcp_csum = csum_finish(
> > -             csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> > +    if (!dp_packet_hwol_tx_l4_checksum(pkt)) {
> > +        if (ctx->key.dl_type == htons(ETH_TYPE_IPV6)) {
> > +            th->tcp_csum = packet_csum_upperlayer6(nh6, th, ctx->key.nw_proto,
> > +                               dp_packet_l4_size(pkt));
> > +        } else {
> > +            uint32_t tcp_csum = packet_csum_pseudoheader(l3_hdr);
> > +            th->tcp_csum = csum_finish(
> > +                 csum_continue(tcp_csum, th, dp_packet_l4_size(pkt)));
> > +        }
> >      }
> >  
> >      if (seq_skew) {
> > diff --git a/lib/dp-packet.h b/lib/dp-packet.h
> > index 133942155..d10a0416e 100644
> > --- a/lib/dp-packet.h
> > +++ b/lib/dp-packet.h
> > @@ -114,6 +114,8 @@ static inline void dp_packet_set_size(struct dp_packet *, uint32_t);
> >  static inline uint16_t dp_packet_get_allocated(const struct dp_packet *);
> >  static inline void dp_packet_set_allocated(struct dp_packet *, uint16_t);
> >  
> > +void dp_packet_prepend_vnet_hdr(struct dp_packet *, int mtu);
> 
> No such function.

Good catch, it got moved to netdev_linux_prepend_vnet_hdr().

> > +
> >  void *dp_packet_resize_l2(struct dp_packet *, int increment);
> >  void *dp_packet_resize_l2_5(struct dp_packet *, int increment);
> >  static inline void *dp_packet_eth(const struct dp_packet *);
> > @@ -456,7 +458,7 @@ dp_packet_init_specific(struct dp_packet *p)
> >  {
> >      /* This initialization is needed for packets that do not come from DPDK
> >       * interfaces, when vswitchd is built with --with-dpdk. */
> > -    p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> > +    p->mbuf.ol_flags = p->mbuf.tx_offload = p->mbuf.packet_type = 0;
> >      p->mbuf.nb_segs = 1;
> >      p->mbuf.next = NULL;
> >  }
> > @@ -519,6 +521,80 @@ dp_packet_set_allocated(struct dp_packet *b, uint16_t s)
> >      b->mbuf.buf_len = s;
> >  }
> >  
> > +static inline bool
> > +dp_packet_hwol_is_tso(const struct dp_packet *b)
> > +{
> > +    return (b->mbuf.ol_flags & (PKT_TX_TCP_SEG | PKT_TX_L4_MASK))
> > +           ? true
> > +           : false;
> 
> Usual way for converting to bool is to use '!!'.  This will save some space.

Ok will change that and the other similar cases.

> > +static inline void
> > +dp_packet_hwol_set_tx_ipv4(struct dp_packet *b) {
> 
> '{' should be on the next line.  Same for all the functions below.
> 
> And some comments to these functions would be nice.  At least a single
> comment for a group of functions.

Ok.

> Some comments for below functions too.

Ok.
> 
> > +static inline bool
> > +dp_packet_hwol_tx_ip_checksum(const struct dp_packet *p)
> > +{
> > +
> > +    return dp_packet_hwol_l4_mask(p) ? true : false;
> 
> '!!'?
> Also, it seems strange to check l4 offloading mask to check if
> we have ip checksum.  Shouldn't we check for PKT_TX_IPV4/6
> instead?  This might not work for pure IP packets (without L4).

I remember there were cases where the flag was not set. I don't have
my notes handy now, but I will double check later.

[...]
> > @@ -2097,11 +2184,22 @@ netdev_dpdk_eth_tx_burst(struct netdev_dpdk *dev, int qid,
> >                           struct rte_mbuf **pkts, int cnt)
> >  {
> >      uint32_t nb_tx = 0;
> > +    uint16_t nb_tx_prep = cnt;
> > +
> > +    if (tso_enabled()) {
> > +        nb_tx_prep = rte_eth_tx_prepare(dev->port_id, qid, pkts, cnt);
> 
> Packets dropped and not counted.


It returns cnt (total) - nb_tx (transmitted), which the caller will
account as tx failures.
 

> 
> > +        if (nb_tx_prep != cnt) {
> > +            VLOG_WARN_RL(&rl, "%s: Output batch contains invalid packets. "
> > +                         "Only %u/%u are valid: %s", dev->up.name, nb_tx_prep,
> > +                         cnt, rte_strerror(rte_errno));
> > +        }
> > +    }
> >  
> > -    while (nb_tx != cnt) {
> > +    while (nb_tx != nb_tx_prep) {
> >          uint32_t ret;
> >  
> > -        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx, cnt - nb_tx);
> > +        ret = rte_eth_tx_burst(dev->port_id, qid, pkts + nb_tx,
> > +                               nb_tx_prep - nb_tx);
> >          if (!ret) {
> >              break;
> >          }
> > @@ -2386,11 +2484,14 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk *dev, struct rte_mbuf **pkts,
> >      int cnt = 0;
> >      struct rte_mbuf *pkt;
> >  
> > +    /* Filter oversized packets, unless are marked for TSO. */
> >      for (i = 0; i < pkt_cnt; i++) {
> >          pkt = pkts[i];
> > -        if (OVS_UNLIKELY(pkt->pkt_len > dev->max_packet_len)) {
> > -            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " max_packet_len %d",
> > -                         dev->up.name, pkt->pkt_len, dev->max_packet_len);
> > +        if (OVS_UNLIKELY((pkt->pkt_len > dev->max_packet_len)
> > +            && !(pkt->ol_flags & PKT_TX_TCP_SEG))) {
> > +            VLOG_WARN_RL(&rl, "%s: Too big size %" PRIu32 " "
> > +                         "max_packet_len %d", dev->up.name, pkt->pkt_len,
> > +                         dev->max_packet_len);
> >              rte_pktmbuf_free(pkt);
> >              continue;
> >          }
> > @@ -2442,7 +2543,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> >      struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts;
> >      struct netdev_dpdk_sw_stats sw_stats_add;
> >      unsigned int n_packets_to_free = cnt;
> > -    unsigned int total_packets = cnt;
> > +    unsigned int total_packets;
> >      int i, retries = 0;
> >      int max_retries = VHOST_ENQ_RETRY_MIN;
> >      int vid = netdev_dpdk_get_vid(dev);
> > @@ -2462,7 +2563,8 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int qid,
> >          rte_spinlock_lock(&dev->tx_q[qid].tx_lock);
> >      }
> >  
> > -    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, cnt);
> > +    total_packets = netdev_dpdk_prep_hwol_batch(dev, cur_pkts, cnt);
> 
> lib/daemon-unix.c Have you checked the performance
> impact for non-TSO setup?

I did a light test and noticed no difference. It only checks for two flags
in the same field for the non-TSO case.


> 
> > +    cnt = netdev_dpdk_filter_packet_len(dev, cur_pkts, total_packets);
> >      sw_stats_add.tx_mtu_exceeded_drops = total_packets - cnt;
> >  
> >      /* Check has QoS has been configured for the netdev */
> > @@ -2511,6 +2613,121 @@ out:
> >      }
> >  }
> >  
> > +static void
> > +netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
> > +{
> > +    rte_free(opaque);
> > +}
> > +
> > +static struct rte_mbuf *
> > +dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
> > +{
> > +    uint32_t total_len = RTE_PKTMBUF_HEADROOM + data_len;
> > +    struct rte_mbuf_ext_shared_info *shinfo = NULL;
> > +    uint16_t buf_len;
> > +    void *buf;
> > +
> > +    if (rte_pktmbuf_tailroom(pkt) >= sizeof(*shinfo)) {
> > +        shinfo = rte_pktmbuf_mtod(pkt, struct rte_mbuf_ext_shared_info *);
> > +    } else {
> > +        total_len += sizeof(*shinfo) + sizeof(uintptr_t);
> > +        total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> > +    }
> > +
> > +    if (unlikely(total_len > UINT16_MAX)) {
> > +        VLOG_ERR("Can't copy packet: too big %u", total_len);
> > +        return NULL;
> > +    }
> > +
> > +    buf_len = total_len;
> > +    buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> > +    if (unlikely(buf == NULL)) {
> > +        VLOG_ERR("Failed to allocate memory using rte_malloc: %u", buf_len);
> > +        return NULL;
> > +    }
> > +
> > +    /* Initialize shinfo. */
> > +    if (shinfo) {
> > +        shinfo->free_cb = netdev_dpdk_extbuf_free;
> > +        shinfo->fcb_opaque = buf;
> > +        rte_mbuf_ext_refcnt_set(shinfo, 1);
> > +    } else {
> > +        shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> > +                                                    netdev_dpdk_extbuf_free,
> > +                                                    buf);
> > +        if (unlikely(shinfo == NULL)) {
> > +            rte_free(buf);
> > +            VLOG_ERR("Failed to initialize shared info for mbuf while "
> > +                     "attempting to attach an external buffer.");
> > +            return NULL;
> > +        }
> > +    }
> > +
> > +    rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
> > +                              shinfo);
> > +    rte_pktmbuf_reset_headroom(pkt);
> > +
> > +    return pkt;
> > +}
> > +
> > +static struct rte_mbuf *
> > +dpdk_pktmbuf_alloc(struct rte_mempool *mp, uint32_t data_len)
> > +{
> > +    struct rte_mbuf *pkt = rte_pktmbuf_alloc(mp);
> > +
> > +    if (OVS_UNLIKELY(!pkt)) {
> > +        return NULL;
> > +    }
> > +
> > +    dp_packet_init_specific((struct dp_packet *)pkt);
> > +    if (rte_pktmbuf_tailroom(pkt) >= data_len) {
> > +        return pkt;
> > +    }
> > +
> > +    if (dpdk_pktmbuf_attach_extbuf(pkt, data_len)) {
> > +        return pkt;
> > +    }
> > +
> > +    rte_pktmbuf_free(pkt);
> > +
> > +    return NULL;
> > +}
> > +
> > +static struct dp_packet *
> > +dpdk_copy_dp_packet_to_mbuf(struct rte_mempool *mp, struct dp_packet *pkt_orig)
> > +{
> > +    struct rte_mbuf *mbuf_dest;
> > +    struct dp_packet *pkt_dest;
> > +    uint32_t pkt_len;
> > +
> > +    pkt_len = dp_packet_size(pkt_orig);
> > +    mbuf_dest = dpdk_pktmbuf_alloc(mp, pkt_len);
> > +    if (OVS_UNLIKELY(mbuf_dest == NULL)) {
> > +            return NULL;
> > +    }
> > +
> > +    pkt_dest = CONTAINER_OF(mbuf_dest, struct dp_packet, mbuf);
> > +    memcpy(dp_packet_data(pkt_dest), dp_packet_data(pkt_orig), pkt_len);
> > +    dp_packet_set_size(pkt_dest, pkt_len);
> > +
> > +    mbuf_dest->tx_offload = pkt_orig->mbuf.tx_offload;
> > +    mbuf_dest->packet_type = pkt_orig->mbuf.packet_type;
> > +    mbuf_dest->ol_flags |= (pkt_orig->mbuf.ol_flags &
> > +                            ~(EXT_ATTACHED_MBUF | IND_ATTACHED_MBUF));
> > +
> > +    memcpy(&pkt_dest->l2_pad_size, &pkt_orig->l2_pad_size,
> > +           sizeof(struct dp_packet) - offsetof(struct dp_packet, l2_pad_size));
> > +
> > +    if (mbuf_dest->ol_flags & PKT_TX_L4_MASK) {
> > +        mbuf_dest->l2_len = (char *)dp_packet_l3(pkt_dest)
> > +                                - (char *)dp_packet_eth(pkt_dest);
> > +        mbuf_dest->l3_len = (char *)dp_packet_l4(pkt_dest)
> > +                                - (char *) dp_packet_l3(pkt_dest);
> > +    }
> > +
> > +    return pkt_dest;
> > +}
> > +
> >  /* Tx function. Transmit packets indefinitely */
> >  static void
> >  dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> > @@ -2524,7 +2741,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> >      enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
> >  #endif
> >      struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> > -    struct rte_mbuf *pkts[PKT_ARRAY_SIZE];
> > +    struct dp_packet *pkts[PKT_ARRAY_SIZE];
> >      struct netdev_dpdk_sw_stats *sw_stats = dev->sw_stats;
> >      uint32_t cnt = batch_cnt;
> >      uint32_t dropped = 0;
> > @@ -2545,34 +2762,30 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> >          struct dp_packet *packet = batch->packets[i];
> >          uint32_t size = dp_packet_size(packet);
> >  
> > -        if (OVS_UNLIKELY(size > dev->max_packet_len)) {
> > -            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d",
> > -                         size, dev->max_packet_len);
> > -
> > +        if (size > dev->max_packet_len
> > +            && !(packet->mbuf.ol_flags & PKT_TX_TCP_SEG)) {
> > +            VLOG_WARN_RL(&rl, "Too big size %u max_packet_len %d", size,
> > +                         dev->max_packet_len);
> >              mtu_drops++;
> >              continue;
> >          }
> >  
> > -        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> > +        pkts[txcnt] = dpdk_copy_dp_packet_to_mbuf(dev->dpdk_mp->mp, packet);
> >          if (OVS_UNLIKELY(!pkts[txcnt])) {
> >              dropped = cnt - i;
> >              break;
> >          }
> >  
> > -        /* We have to do a copy for now */
> > -        memcpy(rte_pktmbuf_mtod(pkts[txcnt], void *),
> > -               dp_packet_data(packet), size);
> > -        dp_packet_set_size((struct dp_packet *)pkts[txcnt], size);
> > -
> >          txcnt++;
> >      }
> >  
> >      if (OVS_LIKELY(txcnt)) {
> >          if (dev->type == DPDK_DEV_VHOST) {
> > -            __netdev_dpdk_vhost_send(netdev, qid, (struct dp_packet **) pkts,
> > -                                     txcnt);
> > +            __netdev_dpdk_vhost_send(netdev, qid, pkts, txcnt);
> >          } else {
> > -            tx_failure = netdev_dpdk_eth_tx_burst(dev, qid, pkts, txcnt);
> > +            tx_failure += netdev_dpdk_eth_tx_burst(dev, qid,
> > +                                                   (struct rte_mbuf **)pkts,
> > +                                                   txcnt);
> >          }
> >      }
> >  
> > @@ -2630,6 +2843,7 @@ netdev_dpdk_send__(struct netdev_dpdk *dev, int qid,
> >          int batch_cnt = dp_packet_batch_size(batch);
> >          struct rte_mbuf **pkts = (struct rte_mbuf **) batch->packets;
> >  
> > +        batch_cnt = netdev_dpdk_prep_hwol_batch(dev, pkts, batch_cnt);
> 
> Packets dropped and not counted.

Right.


> Also, this function called unconditionally. Perfomance impact on non-TSO case?

It's the same as above.


> >          tx_cnt = netdev_dpdk_filter_packet_len(dev, pkts, batch_cnt);
> >          mtu_drops = batch_cnt - tx_cnt;
> >          qos_drops = tx_cnt;
[...]
> > diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h
> > index f08159aa7..102548db7 100644
> > --- a/lib/netdev-linux-private.h
> > +++ b/lib/netdev-linux-private.h
> > @@ -37,10 +37,14 @@
> >  
> >  struct netdev;
> >  
> > +#define LINUX_RXQ_TSO_MAX_LEN 65536
> > +
> >  struct netdev_rxq_linux {
> >      struct netdev_rxq up;
> >      bool is_tap;
> >      int fd;
> > +    char *bufaux;          /* Extra buffer to recv TSO pkt. */
> > +    int bufaux_len;        /* Extra buffer length. */
> 
> Length never changes.  Why we need 'bufaux_len' ?

I have no strong opinion, so either way is fine by me.

[...]
> > @@ -1024,6 +1040,13 @@ static struct netdev_rxq *
> >  netdev_linux_rxq_alloc(void)
> >  {
> >      struct netdev_rxq_linux *rx = xzalloc(sizeof *rx);
> > +    if (tso_enabled()) {
> > +        rx->bufaux = xmalloc(LINUX_RXQ_TSO_MAX_LEN);
> > +        if (rx->bufaux) {
> 
> xmalloc can not fail.

Old habits, will fix that.


> > +            rx->bufaux_len = LINUX_RXQ_TSO_MAX_LEN;
> > +        }
> > +    }
> > +
> >      return &rx->up;
> >  }
> >  
> > @@ -1069,6 +1092,17 @@ netdev_linux_rxq_construct(struct netdev_rxq *rxq_)
> >              goto error;
> >          }
> >  
> > +        if (tso_enabled()) {
> > +            error = setsockopt(rx->fd, SOL_PACKET, PACKET_VNET_HDR, &val,
> > +                               sizeof val);
> > +            if (error) {
> 
> You're not using the 'error'.  Make it just "if (setsockopt()) {".

I agree.

> 
> > +                error = errno;
> > +                VLOG_ERR("%s: failed to enable vnet hdr in txq raw socket: %s",
> > +                         netdev_get_name(netdev_), ovs_strerror(errno));
> > +                goto error;
> > +            }
> > +        }
> > +
> >          /* Set non-blocking mode. */
> >          error = set_nonblocking(rx->fd);
> >          if (error) {
> > @@ -6173,6 +6275,19 @@ af_packet_sock(void)
> >                  close(sock);
> >                  sock = -error;
> >              }
> > +
> > +            if (tso_enabled()) {
> > +                int val = 1;
> > +                error = setsockopt(sock, SOL_PACKET, PACKET_VNET_HDR, &val,
> > +                                   sizeof val);
> 
> socket might be already closed here and there will be double close.

Good catch, thanks.

[...]
> > @@ -51,6 +57,10 @@ struct netdev {
> >       * opening this device, and therefore got assigned to the "system" class */
> >      bool auto_classified;
> >  
> > +    /* This bitmask of the offloading features enabled/supported by the
> > +     * supported by the netdev. */
> 
> So, enabled or supported?  Please, choose one.

I see, will fix that.


> 
> > +    uint64_t ol_flags;
> > +
> >      /* If this is 'true', the user explicitly specified an MTU for this
> >       * netdev.  Otherwise, Open vSwitch is allowed to override it. */
> >      bool mtu_user_config;
> > diff --git a/lib/netdev.c b/lib/netdev.c
> > index 405c98c68..998525875 100644
> > --- a/lib/netdev.c
> > +++ b/lib/netdev.c
> > @@ -782,6 +782,52 @@ netdev_get_pt_mode(const struct netdev *netdev)
> >              : NETDEV_PT_LEGACY_L2);
> >  }
> >  
> > +/* Check if a 'packet' is compatible with 'netdev_flags'.
> > + * If a packet is incompatible, return 'false' with the 'errormsg'
> > + * pointing to a reason. */
> > +static bool
> > +netdev_send_prepare_packet(const uint64_t netdev_flags,
> > +                           struct dp_packet *packet, char **errormsg)
> 
> It's better to use VLOG_ERR_BUF instead of passing char **.
> Not an OVS style.

Sounds good to me.


> > +netdev_send_prepare_batch(const struct netdev *netdev,
> > +                          struct dp_packet_batch *batch)
> > +{
> > +    struct dp_packet *packet;
> > +    size_t i, size = dp_packet_batch_size(batch);
> > +
> > +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> > +        char *errormsg = NULL;
> > +
> > +        if (netdev_send_prepare_packet(netdev->ol_flags, packet, &errormsg)) {
> > +            dp_packet_batch_refill(batch, packet, i);
> > +        } else {
> > +            VLOG_WARN_RL(&rl, "%s: Packet dropped: %s",
> > +                         errormsg ? errormsg : "Unsupported feature",
> > +                         netdev_get_name(netdev));
> 
> Hmm, it seems that packet will be leaked here.
> 
> Also, you're dropping packet without accounting it anyhow.  We merged few
> patches about counting dropped packets recently, so, now we should add new
> counters for each dropping case in order to keep things consistent.

Right, will fix that too.


> > +        }
> > +    }
> > +}
> > +
> >  /* Sends 'batch' on 'netdev'.  Returns 0 if successful (for every packet),
> >   * otherwise a positive errno value.  Returns EAGAIN without blocking if
> >   * at least one the packets cannot be queued immediately.  Returns EMSGSIZE
> > @@ -811,8 +857,10 @@ int
> >  netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
> >              bool concurrent_txq)
> >  {
> > -    int error = netdev->netdev_class->send(netdev, qid, batch,
> > -                                           concurrent_txq);
> > +    int error;
> > +
> > +    netdev_send_prepare_batch(netdev, batch);
> 
> send() doesn't expect empty batches.  You need to check before calling it.

OK.


> > +    error = netdev->netdev_class->send(netdev, qid, batch, concurrent_txq);
> >      if (!error) {
> >          COVERAGE_INC(netdev_sent);
> >      }
> > @@ -878,9 +926,17 @@ netdev_push_header(const struct netdev *netdev,
> >                     const struct ovs_action_push_tnl *data)
> >  {
> >      struct dp_packet *packet;
> > -    DP_PACKET_BATCH_FOR_EACH (i, packet, batch) {
> > -        netdev->netdev_class->push_header(netdev, packet, data);
> > -        pkt_metadata_init(&packet->md, data->out_port);
> > +    size_t i, size = dp_packet_batch_size(batch);
> > +
> > +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, size, packet, batch) {
> > +        if (!dp_packet_hwol_is_tso(packet)) {
> > +            netdev->netdev_class->push_header(netdev, packet, data);
> > +            pkt_metadata_init(&packet->md, data->out_port);
> > +            dp_packet_batch_refill(batch, packet, i);
> > +        } else {
> > +            VLOG_WARN_RL(&rl, "%s: Tunneling of TSO packet is not supported: "
> > +                         "packet dropped", netdev_get_name(netdev));
> 
> Packet leaked.  Drop not counted.

Yeah, same as above.


> > +        }
> >      }
> >  
> >      return 0;
> > diff --git a/lib/tso.c b/lib/tso.c
> > new file mode 100644
> > index 000000000..9dc15e146
> > --- /dev/null
> > +++ b/lib/tso.c
> > @@ -0,0 +1,54 @@
> > +/*
> > + * Copyright (c) 2020 Red Hat, Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#include <config.h>
> > +
> > +#include "smap.h"
> > +#include "ovs-thread.h"
> > +#include "openvswitch/vlog.h"
> > +#include "dpdk.h"
> > +#include "tso.h"
> > +#include "vswitch-idl.h"
> > +
> > +VLOG_DEFINE_THIS_MODULE(tso);
> 
> userspace_tso
> 
> > +
> > +static bool tso_support_enabled = false;
> > +
> > +void
> > +tso_init(const struct smap *ovs_other_config)
> 
> userspace_tso_init
> 
> > +{
> > +    if (smap_get_bool(ovs_other_config, "tso-support", false)) {
> > +        static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER;
> > +
> > +        if (ovsthread_once_start(&once)) {
> > +            if (dpdk_available()) {
> > +                VLOG_INFO("TCP Segmentation Offloading (TSO) support enabled");
> > +                tso_support_enabled = true;
> > +            } else {
> > +                VLOG_ERR("TCP Segmentation Offloading (TSO) is unsupported "
> > +                         "without enabling DPDK");
> 
> This is an artificial restriction.  But we may remove it later.
> 
> > +                tso_support_enabled = false;
> > +            }
> > +            ovsthread_once_done(&once);
> > +        }
> > +    }
> > +}
> > +
> > +bool
> > +tso_enabled(void)
> 
> userspace_tso_enabled()
> 
> > +{
> > +    return tso_support_enabled;
> > +}
> > diff --git a/lib/tso.h b/lib/tso.h
> > new file mode 100644
> > index 000000000..6594496ac
> > --- /dev/null
> > +++ b/lib/tso.h
> > @@ -0,0 +1,23 @@
> > +/*
> > + * Copyright (c) 2020 Red Hat Inc.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef TSO_H
> > +#define TSO_H 1
> > +
> > +void tso_init(const struct smap *ovs_other_config);
> > +bool tso_enabled(void);
> > +
> > +#endif /* tso.h */
> > diff --git a/vswitchd/bridge.c b/vswitchd/bridge.c
> > index 86c7b10a9..6d73922f6 100644
> > --- a/vswitchd/bridge.c
> > +++ b/vswitchd/bridge.c
> > @@ -65,6 +65,7 @@
> >  #include "system-stats.h"
> >  #include "timeval.h"
> >  #include "tnl-ports.h"
> > +#include "tso.h"
> >  #include "util.h"
> >  #include "unixctl.h"
> >  #include "lib/vswitch-idl.h"
> > @@ -3285,6 +3286,7 @@ bridge_run(void)
> >      if (cfg) {
> >          netdev_set_flow_api_enabled(&cfg->other_config);
> >          dpdk_init(&cfg->other_config);
> > +        tso_init(&cfg->other_config);
> >      }
> >  
> >      /* Initialize the ofproto library.  This only needs to run once, but
> > diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> > index 0ec726c39..354dcabfa 100644
> > --- a/vswitchd/vswitch.xml
> > +++ b/vswitchd/vswitch.xml
> > @@ -690,6 +690,18 @@
> >           once in few hours or a day or a week.
> >          </p>
> >        </column>
> > +      <column name="other_config" key="tso-support"
> > +              type='{"type": "boolean"}'>
> > +        <p>
> > +          Set this value to <code>true</code> to enable support for TSO (TCP
> > +          Segmentation Offloading). When TSO is enabled, vhost-user client
> 
> Why we're talking about vhost-user interfaces here?  Physical DPDK ports and
> netdev-linux will be able too.  Sounds strange.

Yeah, that comment can be improved now.

Thanks Ilya!
-- 
fbl


More information about the dev mailing list