[ovs-dev] [PATCH 8/8] native tunnel: Add support for STT

Tue Jan 26 19:17:22 UTC 2016

On Sun, Jan 10, 2016 at 11:18 PM, Pravin B Shelar <pshelar at nicira.com> wrote:
> This patch used userpsace tunneling mechanism for implementing
> STT tunneling protocol.
>
> Signed-off-by: Pravin B Shelar <pshelar at nicira.com>

It looks like this doesn't support segmentation either on transmit
(which would only happen here if the incoming packet is already at the
max MTU) or on receive (which could occur any time we receive a packet
from a sender that does know about TSO). I guess we're going to need
to find a way to emulate support even once we have hardware offloading
to cover the situations where it is not available. However, at the
moment, I think the consequence of this is that we won't be able to
pass traffic from existing senders in many cases.

> diff --git a/lib/odp-util.c b/lib/odp-util.c
> index f16e113..bdc7391 100644
> --- a/lib/odp-util.c
> +++ b/lib/odp-util.c
> +static const void *
> +format_tcp_tnl_push_header(struct ds *ds, const struct tcp_header *tcp)
> +{
> +    ds_put_format(ds, "tcp(src=%"PRIu16",dst=%"PRIu16",seq=0x%"PRIx16","
> +                  "ack=0x%"PRIx16",", ntohs(tcp->tcp_src), ntohs(tcp->tcp_dst),
> +                  ntohl(get_16aligned_be32(&tcp->tcp_seq)),
> +                  ntohl(get_16aligned_be32(&tcp->tcp_ack)));
> +
> +    format_flags_masked(ds, "flags", packet_tcp_flag_to_string,
> +                        ntohs(tcp->tcp_ctl), TCP_FLAGS(OVS_BE16_MAX),
> +                        TCP_FLAGS(OVS_BE16_MAX));
> +
> +    ds_put_format(ds, ",csum=0x%"PRIx16",urg=0x%"PRIx16")",
> +                  ntohs(tcp->tcp_csum), ntohs(tcp->tcp_urg));
> +    return tcp + 1;
> +}

I think I would try to print more of this in the context of STT rather
than TCP - for example, sequence number is really two separate fields
and the ACK field is not really an ack. It would also be good to
shorten it by avoiding printing fields that are normally zero or have
a fixed value; right now it is pretty verbose. Maybe the easiest thing
is just look at how tcpdump formats STT and match that since it should
be basically the same information that is interesting.

>  static void
>  format_odp_tnl_push_header(struct ds *ds, struct ovs_action_push_tnl *data)

Do we also need to have logic for parsing STT tunnels?

> diff --git a/lib/tnl-push-pop.c b/lib/tnl-push-pop.c
> index 86023c2..9440033 100644
> --- a/lib/tnl-push-pop.c
> +++ b/lib/tnl-push-pop.c
> +#define FRAG_EXP_TIME  frag_exp_time

I'm not sure that we still need this #define in its current form.
However, I think what it should really be defining is the amount of
time, which we can be multiplied by OVS_HZ later on.

> +/* The length and offset of a fragment are encoded in the sequence number.
> + * STT_SEQ_LEN_SHIFT is the left shift needed to store the length.
> + * STT_SEQ_OFFSET_MASK is the mask to extract the offset.
> + */
> +#define STT_SEQ_LEN_SHIFT      16
> +#define STT_SEQ_OFFSET_MASK    ((1 << STT_SEQ_LEN_SHIFT) - 1)

These two are really STT protocol fields. I guess that we might want
to either move the to packets.h with the rest or define them in a
different way somehow.

> +struct frag_packet_data {
> +        uint16_t offset;
> +       uint16_t pkt_size;
> +        /* Only valid for the first packet in the chain. */
> +        struct first_frag first;
> +        struct dp_packet *next;

I think the comment above is supposed to apply only to 'first' and not
'next', right?

> +static struct stt_reassemble *
> +get_reasm()
> +{
> +    struct stt_reassemble *reasm;
> +    uint32_t i;
> +    bool res;
> +
> +    reasm = ovsthread_getspecific(per_thread_reasm_data);
> +    if (OVS_UNLIKELY(reasm)) {
> +        return reasm;
> +    }

Isn't this LIKELY?

> +    reasm = xmalloc_cacheline(sizeof(*reasm));
> +    list_init(&reasm->frag_lru);
> +    reasm->counter = 0;

It seems like it would be nice to initialize the counter to a random value.

> +    ovs_mutex_lock(&thread_is_lock);
> +
> +    for (i = 0; i < USHRT_MAX; i++) {
> +        res = id_pool_alloc_id(id_ppol, &i);

I think &i is a return value, not an input, so I don't believe that it
is necessary to have a loop.

> +static void
> +reasm_destructor(void *_reasm)
> +{
> +    struct stt_reassemble *reasm = _reasm;
> +
> +    evict_frags(reasm, 0);
> +
> +    ovs_mutex_lock(&thread_is_lock);
> +    id_pool_free_id(id_ppol, reasm->id);
> +    ovs_mutex_unlock(&thread_is_lock);
> +}

Shouldn't we free the memory that we allocated for reasm?

Do we want to try to free unreassembled data that is sitting around
more aggressively? Right now, we can hold a decent amount of data that
won't get freed until thread exit.

> +static bool pkt_key_match(const struct pkt_key *a, const struct pkt_key *b)
> +{
> +    return !memcmp(a, b, sizeof (*a));
> +}

I think there are padding bytes in these structures which are
uninitialized when they come from the stack.

> +static uint32_t pkt_key_hash(const struct pkt_key *key)
> +{
> +        return hash_3words(hash_bytes(&key, offsetof(struct pkt_key, pkt_seq), 0),
> +                            (uint32_t)(key->pkt_seq),  frag_hash_seed);

Is this double hashing trying to avoid hashing padding bytes at the
end? It seems like we should be able to do this in one pass at least.

> +static void evict_frags(struct stt_reassemble *reasm, int mem_limit)
> +{
[...]
> +    /* Update Fragment cache expiration time. */
> +    frag_exp_time = 30 * OVS_HZ;

In what circumstances do we expect this to change? My guess it is for
non-constant TSC but in that case, this seems like it is too coarse of
a refresh to be useful.

> +static struct dp_packet *
> +packet_merge(struct dp_packet *pkt)
> +{
[...]

This merge strategy looks safe to me but it obviously will be pretty
slow. Do you plan to introduce some kind of zero copy mechanism?

> +    pkt = next;
> +    while (pkt) {
> +        void *data;
> +
> +        data = dp_packet_put_uninit(m, FRAG_DATA(pkt)->pkt_size);
> +        memcpy(data, STT_PACKET_DATA(pkt), FRAG_DATA(pkt)->pkt_size);

If you like, the above two lines could be merged into a single
dp_packet_put() call.

> +static struct dp_packet *
> +reassemble(struct dp_packet *packet)
> +{
[...]
> +    if (reasm->frag_mem_used + dp_packet_get_allocated(packet) > REASM_HI_THRESH) {
> +        evict_frags(reasm, REASM_LO_THRESH);
> +    }

It seems like it would be better to do the lookup before eviction. In
the worst case, it's possible that we might evict a fragment that we
could use now.

> +    frag = lookup_frag(reasm, &key, hash);
> +    if (!frag->pkts) {
> +        frag->pkts = packet;
> +        frag->key = key;
> +        frag->timestamp = cycles_counter();
> +        FRAG_DATA(packet)->first.last_pkt = packet;
> +        FRAG_DATA(packet)->first.mem_used = dp_packet_get_allocated(packet);
> +        FRAG_DATA(packet)->first.tot_len = tot_len;
> +        FRAG_DATA(packet)->first.rcvd_len = pkt_size;
> +        FRAG_DATA(packet)->first.set_ecn_ce = false;
> +        list_push_back(&reasm->frag_lru, &frag->lru_node);

We push new elements on the back and evict by popping from the back.
Isn't this a most recently used list instead?

> +
> +    if (is_ipv6) {
> +        struct ovs_16aligned_ip6_hdr *ip6 = dp_packet_l3(packet);
> +
> +        tos = ntohl(get_16aligned_be32(&ip6->ip6_flow)) >> 20;

Doesn't this pull in the version too?

> +unlock_free:
> +    dp_packet_delete(packet);
> +    packet = NULL;
> +out:
> +    return packet;
> +out_free:
> +    dp_packet_delete(packet);
> +    return NULL;
> +}

Aren't unlock_free and out_free the same? I don't think that we are
holding a lock here.

> +static int
> +stt_extract_tnl_md(struct dp_packet *packet)
> +{

We need to restore any VLAN tags in the header somewhere around here.

> +static void *
> +tcp_build_header(struct netdev_tunnel_config *tnl_cfg,
> +                 struct ovs_action_push_tnl *data,
> +                 unsigned int *hlen)
> +{
> +    struct ovs_16aligned_ip6_hdr *ip6;
> +    struct tcp_header *tcp;
> +    struct ip_header *ip;
> +    bool is_ipv6;
> +
> +    *hlen = sizeof(struct eth_header);
> +
> +    is_ipv6 = is_header_ipv6(data->header);
> +
> +    if (is_ipv6) {
> +        ip6 = ipv6_hdr(data->header);
> +        ip6->ip6_nxt = IPPROTO_TCP;
> +        tcp = (struct tcp_header *) (ip6 + 1);
> +        *hlen += IPV6_HEADER_LEN;
> +    } else {
> +        ip = ip_hdr(data->header);
> +        ip->ip_proto = IPPROTO_TCP;
> +        tcp = (struct tcp_header *) (ip + 1);
> +        *hlen += IP_HEADER_LEN;
> +    }

We could probably pull the above out into an ip_build_header()
function - this is the third time it is duplicated.

> +int
> +netdev_stt_build_header(const struct netdev *netdev,
> +                        struct ovs_action_push_tnl *data,
> +                        const struct flow *tnl_flow)
[...]
> +    stth->flags = STT_CSUM_VERIFIED;

Is this really right? I'm not sure that we have validated the incoming
checksums for all packets.

> +    stth->vlan_tci = 0;

The memory should already be zeroed so I think we can just omit the
above line since we're not really doing anything with the VLAN here.

> +static void
> +push_tcp_header(struct dp_packet *packet,
> +                const struct ovs_action_push_tnl *data)
> +{
[...]
> +    stt_len = (ip_tot_size - sizeof(struct tcp_header));

Extra set of parentheses in the above line.

There isn't any extraction of VLAN tags here. I guess it's not
strictly necessary at this point since we're not doing any offloading
but it seems like a step in the right direction.

> +void
> +netdev_stt_push_header(struct dp_packet *packet,
> +                       const struct ovs_action_push_tnl *data)
> +{
> +    push_tcp_header(packet, data);
> +}

I think the layering of TCP/STT is fine but I'm not sure that we
necessarily need to break them apart. With UDP we are actually sharing
it across multiple tunnel types but that seems fairly unlikely with
STT.

> diff --git a/tests/tunnel-push-pop.at b/tests/tunnel-push-pop.at
> index b04f4a6..b0363f2 100644
> --- a/tests/tunnel-push-pop.at
> +++ b/tests/tunnel-push-pop.at
> +dnl Check STT tunnel push
> +AT_CHECK([ovs-ofctl add-flow int-br action=6])
> +AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 'in_port(2),eth_type(0x0800),ipv4(src=1.1.3.88,dst=1.1.3.112,proto=47,tos=0,ttl=64,frag=no)'], [0], [stdout])
> +AT_CHECK([tail -1 stdout], [0],
> +  [Datapath actions: tnl_push(tnl_port(7471),header(size=72,type=106,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=6,tos=0,ttl=64,frag=0x40),tcp(src=0,dst=7471,seq=0x0,ack=0x0,flags=psh|ack|0x5000,csum=0x0,urg=0x0),stt(tun_id=0x315,ver=0x0,flags=0x1,l4_offset=0x0,res=0x0,mss=0x0,vid=0,pcp=0,cfi=0)),out_port(100))
> +])

The TCP flags decoding seems odd - what is the extra 0x5000?

> +dnl Check STT only accepts encapsulated Ethernet frames
> +AT_CHECK([ovs-ofctl del-flows int-br])
> +AT_CHECK([ovs-appctl netdev-dummy/receive p0 'aa55aa550000001b213cab6408004500009c00004000400633a70101025c01010258204e1d2f007400000000001c5018ffff7cb8000000010000000000000000000000000315000066b2591a7347427c73ecf94a080045000054b15e400040017e950101045c010104580800a59658cb00018a7b8f560000000016f80a0000000000101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
> +ovs-appctl time/warp 1000
> +
> +AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port  6'], [0], [dnl
> +  port  6: rx pkts=1, bytes=98, drop=0, errs=0, frame=0, over=0, crc=0
> +])

I guess the comment over this test isn't right. The original version
(from GRE) is actually a negative test to make sure that we don't
accept something that we shouldn't but I think this test is really
checking to make sure that we can decapsulate a STT frame. Is that
right?