[ovs-dev] [PATCH v2 04/15] conntrack: New userspace connection tracker.

Wed Apr 27 06:36:07 UTC 2016

Thanks for the detailed review Joe, replies inline

On 19/04/2016 14:36, "Joe Stringer" <joe at ovn.org> wrote:

>On 15 April 2016 at 17:02, Daniele Di Proietto <diproiettod at vmware.com> wrote:
>> This commit adds the conntrack module.
>>
>> It is a connection tracker that resides entirely in userspace.  Its
>> primary user will be the dpif-netdev datapath.
>>
>> The module main goal is to provide conntrack_execute(), which offers a
>> convenient interface to implement the datapath ct() action.
>>
>> The conntrack module uses two submodules to deal with the l4 protocol
>> details (conntrack-other for UDP and ICMP, conntrack-tcp for TCP).
>>
>> The conntrack-tcp submodule implementation is adapted from FreeBSD's pf
>> subsystem, therefore it's BSD licensed.  It has been slightly altered to
>> match the OVS coding style and to allow the pickup of already
>> established connections.
>>
>> Signed-off-by: Daniele Di Proietto <diproiettod at vmware.com>
>
>Thanks for submitting this, I know there's a few people excited about
>this feature.
>
>I notice that there is no NEWS item about this, were you intending to
>hold off on announcing it until there is feature parity with the
>kernel, eg support for IPv[46] fragments; ALGs; NAT?

Actually I didn't think about this :-)

I'll add a line in NEWS

>
>
>> diff --git a/lib/conntrack-other.c b/lib/conntrack-other.c
>> new file mode 100644
>> index 0000000..65d02a9
>> --- /dev/null
>> +++ b/lib/conntrack-other.c
>> @@ -0,0 +1,91 @@
>> +/*
>> + * Copyright (c) 2015, 2016 Nicira, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + *     http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#include <config.h>
>> +
>> +#include "conntrack-private.h"
>> +#include "dp-packet.h"
>> +
>> +enum other_state {
>> +    OTHERS_FIRST,
>> +    OTHERS_MULTIPLE,
>> +    OTHERS_BIDIR,
>> +};
>> +
>> +struct conn_other {
>> +    struct conn up;
>> +    enum other_state state;
>> +};
>> +
>> +static const long long other_timeouts[] = {
>> +    [OTHERS_FIRST] = 60 * 1000,
>> +    [OTHERS_MULTIPLE] = 60 * 1000,
>> +    [OTHERS_BIDIR] = 30 * 1000,
>> +};
>
>Are these timeouts in milliseconds? (a comment or #define might help with that)

Sure, I'll add a comment.

>
><snip>
>
>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>> new file mode 100644
>> index 0000000..840335b
>> --- /dev/null
>> +++ b/lib/conntrack.c
>...
>> +static struct ct_l4_proto *l4_protos[] = {
>> +    [IPPROTO_TCP] = &ct_proto_tcp,
>> +    [IPPROTO_UDP] = &ct_proto_other,
>> +    [IPPROTO_ICMP] = &ct_proto_other,
>> +};
>
>I see that conntrack-other is shared by UDP and ICMP. In the Linux
>datapath, the UDP handler also checks UDP length and UDP checksum - I
>wonder if we can share most of the code here for these protocols but
>add these additional checks for the UDP case?

Thanks for pointing this out!

I think it's a good idea to check the UDP length, I've added a check later.

This didn't handle at all checksum verification.  I thought it was out of
scope for the connection tracker, but since the kernel connection tracker
does this, I've added the validation.

None of this need (yet) a separate conntrack-icmp module, I'm implementing
it when the key is being extracted (as you point out later).

We would need a separate conntrack-icmp module if we wanted to track icmp
type and code.

>
>> +static struct conn *
>> +conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
>> +               struct conn_lookup_ctx *ctx, uint8_t *state, bool commit,
>> +               long long now)
>> +{
>> +    unsigned bucket = hash_to_bucket(ctx->hash);
>> +    struct conn *nc = NULL;
>> +
>> +    if (!valid_new(pkt, &ctx->key)) {
>> +        *state |= CS_INVALID;
>> +        return nc;
>> +    }
>> +
>> +    *state |= CS_NEW;
>> +
>> +    if (commit) {
>> +        nc = new_conn(pkt, &ctx->key, now);
>> +
>> +        memcpy(&nc->rev_key, &ctx->key, sizeof nc->rev_key);
>> +
>> +        conn_key_reverse(&nc->rev_key);
>> +        hmap_insert(&ct->connections[bucket], &nc->node, ctx->hash);
>> +    }
>> +
>> +    return nc;
>> +}
>
>This function can return NULL....
>
>> +static struct conn *
>> +process_one(struct conntrack *ct, struct dp_packet *pkt,
>> +            struct conn_lookup_ctx *ctx, uint16_t zone,
>> +            bool commit, long long now)
>> +{
>> +    unsigned bucket = hash_to_bucket(ctx->hash);
>> +    struct conn *conn = ctx->conn;
>> +    uint8_t state = 0;
>> +
>> +    if (conn) {
>> +        if (ctx->related) {
>> +            state |= CS_RELATED;
>> +            if (ctx->reply) {
>> +                state |= CS_REPLY_DIR;
>> +            }
>> +        } else {
>> +            enum ct_update_res res;
>> +
>> +            res = conn_update(conn, pkt, ctx->reply, now);
>> +
>> +            switch (res) {
>> +            case CT_UPDATE_VALID:
>> +                state |= CS_ESTABLISHED;
>> +                if (ctx->reply) {
>> +                    state |= CS_REPLY_DIR;
>> +                }
>> +                break;
>> +            case CT_UPDATE_INVALID:
>> +                state |= CS_INVALID;
>> +                break;
>> +            case CT_UPDATE_NEW:
>> +                hmap_remove(&ct->connections[bucket], &conn->node);
>> +                delete_conn(conn);
>> +                conn = conn_not_found(ct, pkt, ctx, &state, commit, now);
>
>...which is then stored here...
>
>> +                break;
>> +            }
>> +        }
>> +
>> +        pkt->md.ct_label = conn->label;
>> +        pkt->md.ct_mark = conn->mark;
>> +        write_ct_md(pkt, state, zone, conn->mark, conn->label);
>
>...and dereferenced here. Is it possible to send invalid packets for
>an existing connection and cause a NULL pointer dereference?

Yes, thanks for spotting this. I've added a NULL pointer check and
moved this out of the if branch (it's needed also in the else branch)

>
>
>> +    } else {
>> +        conn = conn_not_found(ct, pkt, ctx, &state, commit, now);
>> +        write_ct_md(pkt, state, zone, 0, (ovs_u128) {{0}});
>
>Maybe we can define OVS_U128_MIN (ovs_u128) {{0}} in a header somewhere?

Ok.  I've put OVS_U128_MIN in util.h (like OVS_U128_MAX) and defined an
OVS_U128_ZERO macro as an alias.

>
>
>> +/* Cleans up old connection entries.  Should be called periodically. */
>> +void
>> +conntrack_run(struct conntrack *ct)
>> +{
>> +    unsigned bucket = hash_to_bucket(ct->purge_bucket);
>> +    uint32_t inner_bucket = ct->purge_inner_bucket,
>> +             inner_offset = ct->purge_inner_offset;
>
>Style seems a little unusual for the above 'inner_offset'.

Right, I put the variable in two separate statements.

>
>
>> +static inline bool
>> +extract_l4_udp(struct conn_key *key, const void *data, size_t size)
>> +{
>> +    const struct udp_header *udp = data;
>> +
>> +    if (OVS_UNLIKELY(size < UDP_HEADER_LEN)) {
>> +        return false;
>> +    }
>
>Maybe my comment earlier about UDP len and checksum could be addressed here?

Yes, thank you!

>
>...
>
>
>> +static bool
>> +conn_key_extract(struct conntrack *ct, struct dp_packet *pkt,
>> +                 struct conn_lookup_ctx *ctx, uint16_t zone)
>> +{
>> +    const struct eth_header *l2 = dp_packet_l2(pkt);
>> +    const struct ip_header *l3 = dp_packet_l3(pkt);
>> +    const char *l4 = dp_packet_l4(pkt);
>> +    const char *tail = dp_packet_tail(pkt);
>> +    bool ok;
>> +
>> +    memset(ctx, 0, sizeof *ctx);
>> +
>> +    if (!l2 || !l3 || !l4) {
>> +        return false;
>> +    }
>> +
>> +    ctx->key.zone = zone;
>> +
>> +    /* XXX In this function we parse the packet (again, it has already
>> +     * gone through miniflow_extract()) for two reasons:
>> +     *
>> +     * 1) To extract the l3 addresses and l4 ports.
>> +     *    We already have the l3 and l4 headers' pointers.  Extracting
>> +     *    the l3 addresses and the l4 ports is really cheap, since they
>> +     *    can be found at fixed locations.
>> +     * 2) To extract the l3 and l4 types.
>> +     *    Extracting the l3 and l4 types (especially the l3[1]) on the
>> +     *    other hand is quite expensive, because they're not at a
>> +     *    fixed location.
>> +     *
>> +     * Here's a way to avoid (2) with the help of the datapath.
>> +     * The datapath doesn't keep the packet's extracted flow[2], so
>> +     * using that is not an option.  We could use the packet's matching
>> +     * megaflow, but we have to make sure that the l3 and l4 types
>> +     * are unwildcarded.  This means either:
>> +     *
>> +     * a) dpif-netdev unwildcards the l3 (and l4) types when a new flow
>> +     *    is installed if the actions contains ct().  This is what the
>> +     *    kernel datapath does.  It is not so straightforward, though.
>> +     *
>> +     * b) ofproto-dpif-xlate unwildcards the l3 (and l4) types when
>> +     *    translating a ct() action.  This is already done in different
>> +     *    actions and since both the userspace and the kernel datapath
>> +     *    would benefit from it, it seems an appropriate place to do
>> +     *    it.
>> +     *
>> +     * ---
>> +     * [1] A simple benchmark (running only the connection tracker
>> +     *     over and over on the same packets) shows that if the
>> +     *     l3 type is already provided we are 15% faster (running the
>> +     *     connection tracker over a couple of DPDK devices with a
>> +     *     stream of UDP 64-bytes packets shows that we are 4% faster).
>> +     *
>> +     * [2] The reasons for this are that keeping the flow increases
>> +     *     (slightly) the cache footprint and increases computation
>> +     *     time as we move the packet around. Most importantly, the flow
>> +     *     should be updated by the actions and this can be slow, as
>> +     *     we use a sparse representation (miniflow).
>> +     *
>> +     */
>
>I discussed the above offline with Jarno; we will always unwildcard
>the l3 -- see xlate_wc_init().

Oh, you're right, I forgot about that, I guess I need at least to update
the comment.

>
>Do we have a benchmark on how costly the options are for figuring out
>the l4 type?

It's a problem only for IPv6, as we need to parse the extension headers.

None of the solution I thought about seem particularly clean.  I'd be inclined
to keep this here and revisit it later.

>
>
>> +static void
>> +conn_key_lookup(struct conntrack *ct,
>> +                struct conn_lookup_ctx *ctx,
>> +                unsigned bucket,
>> +                long long now)
>> +{
>> +    struct conn *conn, *found = NULL;
>> +    uint32_t hash = ctx->hash;
>> +    bool reply;
>> +
>> +    HMAP_FOR_EACH_WITH_HASH (conn, node, hash, &ct->connections[bucket]) {
>> +        if (!memcmp(&conn->key, &ctx->key, sizeof(conn->key))) {
>> +            found = conn;
>> +            reply = false;
>> +            break;
>> +        }
>> +        if (!memcmp(&conn->rev_key, &ctx->key, sizeof(conn->rev_key))) {
>> +            found = conn;
>> +            reply = true;
>> +            break;
>> +        }
>> +    }
>> +
>> +    if (found) {
>> +        if (conn_expired(found, now)) {
>> +            found = NULL;
>> +        } else {
>> +            ctx->reply = reply;
>> +        }
>> +    }
>
>Should this check be within the HMAP_FOR_EACH_WITH_HASH?

Yes, that also simplify the code a little bit, thank you.

>
>
>Separate question, and maybe something to be followed up on separately
>from this series, today with the current userspace + kernel
>implementation we can install a flow like:
>
>arp,actions=ct(table=1)
>
>At OpenFlow layer we'll accept this (although maybe we shouldn't)...
>When processing an upcall, we'll translate this (though we have enough
>information to at least complain if not more)
>Then when installing/executing into the kernel datapath, we'll reject
>the command while parsing the action.
>
>I'm considering that maybe we should either a) Restrict this all the
>way up at OpenFlow, or b) loosen the restriction in kernel datapath
>and allow such flows (but mark the traffic as invalid early in
>conntrack processing and skip to the next action).
>
>So, to my question: How does the userspace handle a case like this?

The userspace datapath does what you're suggesting as b): in
conntrack_execute() if the packet is not ipv4 or ipv6 the packet is marked
as invalid (the key extraction fails).

Restricting this at the OpenFlow level means that it wouldn't be possible
to a have _single_ flow to send both ipv4 and ipv6 traffic to the connection
tracker.  Do we care about that?