[ovs-dev] [PATCH v3 11/17] netdev: Add Large packet segmentation support.

Wed May 18 22:01:50 UTC 2016

On Tue, May 10, 2016 at 10:30 AM, Pravin B Shelar <pshelar at ovn.org> wrote:
> diff --git a/lib/dp-packet-lso.c b/lib/dp-packet-lso.c
> new file mode 100644
> index 0000000..14a5ed8
> --- /dev/null
> +++ b/lib/dp-packet-lso.c
> +void
> +fixup_packet_cheksum(struct dp_packet *pkt, int l4_offset, int csum_offset,
> +                     int new_ip_tot_len, int old_ip_tot_len)
> +{
> +    ovs_be16 *data_ptr, *csum;
> +    uint32_t l4_csum;
> +
> +    data_ptr = (ovs_be16 *) ((uint16_t *) dp_packet_data(pkt) + (l4_offset >> 1));
> +    csum = data_ptr + (csum_offset >> 1);

All of the shifts and multiples of two arithmatic make the code hard
to read. It would be nice if it just did what it wanted to do directly
rather than trying to be too clever.

> +    l4_csum = csum_continue(0, data_ptr, dp_packet_size(pkt) - l4_offset);
> +    *csum = csum_finish(l4_csum);
> +    if (new_ip_tot_len != old_ip_tot_len) {
> +        *csum = recalc_csum16(*csum, htons(old_ip_tot_len), htons(new_ip_tot_len));
> +    }

We definitely need to document what the expectations are as far as
seeding checksums with pseudoheaders, etc. I know that it is the same
as Linux but that's not going to be obvious to most people.

> +static struct dp_packet *
> +segment_tcp_packet(struct dp_packet *orig)
> +{
[...]
> +        if (mss) {
> +            struct tcp_header *tcph = dp_packet_l4(seg);
> +
> +            put_16aligned_be32(&tcph->tcp_seq, htonl(tcp_seq));
> +            tcp_seq += mss;
> +            tcph->tcp_ctl = htons(ntohs(tcph->tcp_ctl) &
> +                                  ~(TCP_FIN | TCP_PSH));

These flags should be left alone on the last segment and cleared from
the earlier ones. I also don't think this handles ECN.

> +static struct dp_packet *
> +segment_l4_packet(struct dp_packet *orig)
> +{
> +    if (orig->lso.type & (DPBUF_LSO_TCPv4 | DPBUF_LSO_TCPv6)) {
> +        return segment_tcp_packet(orig);
> +    } else if (orig->lso.type & (DPBUF_LSO_UDPv4 | DPBUF_LSO_UDPv6)) {
> +        return segment_udp_packet(orig);
> +    }
> +    OVS_NOT_REACHED();

I'm really nervous about asserting that we won't have any other
protocol type here. I think the best thing to do would be log an error
and drop the packet.

> +static struct dp_packet *
> +segment_ipv4_packet(struct dp_packet *orig)
> +{
[...]
> +    orig->l4_ofs = orig->l3_ofs + IP_HEADER_LEN;

What about IP options?

> +static void
> +update_ipv6_frag_hdr(struct dp_packet *pkt, int *ipv6_frag_offset)
> +{
[...]

I don't really understand the idea behind this function. The
implication seems to be that there should always be a preexisting
fragment header but I'm not sure why that would be the case in most
situations.

In any case, it would also be nice to somehow consolidate the code
with the function below it.

> +static int
> +ipv6_set_l4_offset(struct dp_packet *pkt)
> +{
[...]
> +        } else if (nw_proto == IPPROTO_AH) {
> +            const struct ip6_ext *ext_hdr = data;
> +            nw_proto = ext_hdr->ip6e_nxt;
> +
> +            offset += (ext_hdr->ip6e_len + 2) * 4;
> +            if (offset > size) {
> +                goto out;
> +            }

I guess, realistically speaking, nothing good is going to come of
trying to checksum/segment a packet that already has an AH header on
it.

> +static struct dp_packet *
> +segment_eth_packet(struct dp_packet *orig, int offset)
> +{

This patch series has a bunch of things which are added ahead of when
they are actually used. 'offset' is one example of this where it is
not necessary in the current usage at this point but there were others
in previous patches as well. This makes it difficult to review since
you have to flip between different patches to understand why something
is being done.

> +    dp_packet_reset_packet(orig, offset);

What are the semantics around modifying the original packet?

> +    eth = dp_packet_data(orig);
> +    eth_type = eth->eth_type;
> +    if (eth_type_vlan(eth->eth_type)) {
> +        struct vlan_eth_header *vethh = (struct vlan_eth_header *) dp_packet_data(orig);
> +
> +        eth_type = vethh->veth_next_type;
> +        header_len += VLAN_HEADER_LEN;
> +    }

Is a single VLAN tag reasonable limit? I guess in this series the only
possible source of offloading is STT and that has a maximum of one but
what about other sources?

> +    if (eth_type == htons(ETH_TYPE_IP)) {
> +        seg_list = segment_ipv4_packet(orig);
> +    } else if (eth_type == htons(ETH_TYPE_IPV6)) {
> +        seg_list = segment_ipv6_packet(orig);
> +    } else {
> +        return NULL;
> +    }

If we return NULL when a protocol is not supported, it causes
FOR_EACH_LSO_SEG to just be a no-op rather than signaling any kind of
error. I think in at least some places we should check for this as it
indicates a bug, either here or in the caller.

> diff --git a/lib/dp-packet-lso.h b/lib/dp-packet-lso.h
> new file mode 100644
> index 0000000..09815e8
> --- /dev/null
> +++ b/lib/dp-packet-lso.h
> +#define DPBUF_LSO_TCPv4 (1 << 0)
> +#define DPBUF_LSO_TCPv6 (1 << 1)
> +#define DPBUF_LSO_UDPv4 (1 << 2)
> +#define DPBUF_LSO_UDPv6 (1 << 3)

I find myself wondering if this (meaning the overall approach to
offloading) is going to scale well in the future. It seems like it
would be better off it was somehow more modular. An obvious example is
protocols - it would be nice if they were pluggable rather than
needing to be added to a big chain of if statements. (Also, it would
seem to be good to break apart L3 vs. L4 types in the above #defines.)
Another part is the combining of checksum offload with LSO - it works
for now but will it make sense in the future? Can it integrate with
hardware checksum offloading state later on or will that be split
between OVS and DPDK data structures?

> +struct dp_packet_lso_ctx {
> +    struct dp_packet *next;       /* Used to list lso segments. */
> +};
> +
> +BUILD_ASSERT_DECL(DP_PACKET_CONTEXT_SIZE >= sizeof(struct dp_packet_lso_ctx));

Are there any places where dp_packet_lso_ctx is used at the same time
as pkt_metadata? For example, when a large packet is decapsulated by
STT and has some metadata? I guess that in the existing cases where we
have upcalls, the flow has already been extracted and we can get the
metadata from there. However, it seems somewhat risky and the
semantics of what is safe are not clearly defined.

> +#define PACKET_LSO_CTX(packet) ((struct dp_packet_lso_ctx *)(packet)->data)

I know that 'data' is defined in an earlier patch but looking at it
here, it seems like it would be good to make the name more descriptive
- it looks like it is referring to packet payload data.

> +void
> +fixup_packet_cheksum(struct dp_packet *pkt, int l4_offset, int csum_offset,
> +                     int new_ip_tot_len, int old_ip_tot_len);
> +

"checksum" - I think there are a few instances of this typo in other
places as well. In addition, I don't think that it is used outside of
dp-packet-lso.c, so it probably could be made static.

> @@ -567,12 +571,14 @@ enum { NETDEV_MAX_BURST = 32 }; /* Maximum number packets in a batch. */
>
>  struct dp_packet_batch {
>      int count;
> +    uint8_t  lso_type;
>      struct dp_packet *packets[NETDEV_MAX_BURST];
>  };

I think this could probably at least use a comment - it's not
immediately clear that this is the combination of all LSO types in the
batch.

> diff --git a/lib/netdev.c b/lib/netdev.c
> index a83e53e..c3fdb44 100644
> --- a/lib/netdev.c
> +++ b/lib/netdev.c
> +static int
> +send_packet__(struct netdev *netdev, int qid, struct dp_packet_batch *b,
> +             struct dp_packet *pkt, bool may_steal)
> +{

It looks like the only two callers of this function both set
'may_steal' to true unconditionally, so we might just be able to
assume that and remove the parameter.

> +static int
> +netdev_send_lso(struct netdev *netdev, int qid, struct dp_packet_batch *s,
> +                bool may_steal)
[...]
> +            error = 0;
> +            FOR_EACH_LSO_SEG_SAFE(seg_list, seg, next) {
> +                if (OVS_UNLIKELY(error)) {
> +                    dp_packet_delete(seg);
> +                    continue;
> +                }
> +                error = send_packet__(netdev, qid, &b, seg, true);
> +                if (OVS_UNLIKELY(error)) {
> +                    dp_packet_delete(seg);
> +                }
> +            }

Is the goal of the error checks in this loop to free all remaining
segments once there is an error in one? I'm not sure that's really
necessary - if we sent the remaining non-error segments then
potentially something like SACK in the receiver would be able to pick
up just the missing one later on.

> +    if (b.count) {
> +        error = netdev->netdev_class->send(netdev, qid,
> +                                           b.packets, b.count, true);
> +        if (!error) {
> +            dp_packet_batch_init(&b);

I don't know if we really need to clear the batch here after we've
successfully sent it since we return from the function and it goes out
of scope just a few lines later.

> +err:
> +    if (may_steal) {
> +        for (i = i + 1; i < s->count; i++) {
> +            dp_packet_delete(s->packets[i]);
> +        }
> +    }
> +
> +    dp_packet_delete_batch(&b, true);

According to the documentation for netdev send functions, ownership of
the batch is taken regardless of whether there is an error as long as
'may_steal' is true, so I think we'll have a double free as a result
of deleting this batch.

>  netdev_send(struct netdev *netdev, int qid, struct dp_packet_batch *batch,
>              bool may_steal)
>  {
> -    if (!netdev->netdev_class->send) {
> -        dp_packet_delete_batch(batch, may_steal);
> -        return EOPNOTSUPP;
> -    }
> +    int error;
>
> -    int error = netdev->netdev_class->send(netdev, qid,
> +    if (!netdev->netdev_class->send) {
> +        error = EOPNOTSUPP;
> +    } else if (batch->lso_type & ~netdev->supported_lso_types) {
> +        return netdev_send_lso(netdev, qid, batch, may_steal);
> +    } else {
> +        error = netdev->netdev_class->send(netdev, qid,
>                                             batch->packets, batch->count,
>                                             may_steal);
> +    }
> +
>      if (!error) {
>          COVERAGE_INC(netdev_sent);
> +    } else {
> +        goto err;
>      }
> +    return 0;
> +
> +err:
> +    dp_packet_delete_batch(batch, may_steal);
>      return error;
>  }

I think we have a similar issue with double free here (and the
original code didn't free the packets on error).

> @@ -773,6 +867,10 @@ netdev_push_header(const struct netdev *netdev,
>          return -EINVAL;
>      }
>
> +    if (batch->lso_type & ~netdev->supported_lso_types) {
> +        return -EINVAL;
> +    }

Shouldn't this be treated basically the same as sending and we should
do segmentation if the type isn't supported? Otherwise, we basically
require tunnels to support all types of LSO even just for correct
functionality.