[ovs-dev] [PATCH v2 04/15] conntrack: New userspace connection tracker.
Daniele Di Proietto
diproiettod at vmware.com
Wed Apr 27 06:36:07 UTC 2016
Thanks for the detailed review Joe, replies inline
On 19/04/2016 14:36, "Joe Stringer" <joe at ovn.org> wrote:
>On 15 April 2016 at 17:02, Daniele Di Proietto <diproiettod at vmware.com> wrote:
>> This commit adds the conntrack module.
>>
>> It is a connection tracker that resides entirely in userspace. Its
>> primary user will be the dpif-netdev datapath.
>>
>> The module main goal is to provide conntrack_execute(), which offers a
>> convenient interface to implement the datapath ct() action.
>>
>> The conntrack module uses two submodules to deal with the l4 protocol
>> details (conntrack-other for UDP and ICMP, conntrack-tcp for TCP).
>>
>> The conntrack-tcp submodule implementation is adapted from FreeBSD's pf
>> subsystem, therefore it's BSD licensed. It has been slightly altered to
>> match the OVS coding style and to allow the pickup of already
>> established connections.
>>
>> Signed-off-by: Daniele Di Proietto <diproiettod at vmware.com>
>
>Thanks for submitting this, I know there's a few people excited about
>this feature.
>
>I notice that there is no NEWS item about this, were you intending to
>hold off on announcing it until there is feature parity with the
>kernel, eg support for IPv[46] fragments; ALGs; NAT?
Actually I didn't think about this :-)
I'll add a line in NEWS
>
>
>> diff --git a/lib/conntrack-other.c b/lib/conntrack-other.c
>> new file mode 100644
>> index 0000000..65d02a9
>> --- /dev/null
>> +++ b/lib/conntrack-other.c
>> @@ -0,0 +1,91 @@
>> +/*
>> + * Copyright (c) 2015, 2016 Nicira, Inc.
>> + *
>> + * Licensed under the Apache License, Version 2.0 (the "License");
>> + * you may not use this file except in compliance with the License.
>> + * You may obtain a copy of the License at:
>> + *
>> + * http://www.apache.org/licenses/LICENSE-2.0
>> + *
>> + * Unless required by applicable law or agreed to in writing, software
>> + * distributed under the License is distributed on an "AS IS" BASIS,
>> + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>> + * See the License for the specific language governing permissions and
>> + * limitations under the License.
>> + */
>> +
>> +#include <config.h>
>> +
>> +#include "conntrack-private.h"
>> +#include "dp-packet.h"
>> +
>> +enum other_state {
>> + OTHERS_FIRST,
>> + OTHERS_MULTIPLE,
>> + OTHERS_BIDIR,
>> +};
>> +
>> +struct conn_other {
>> + struct conn up;
>> + enum other_state state;
>> +};
>> +
>> +static const long long other_timeouts[] = {
>> + [OTHERS_FIRST] = 60 * 1000,
>> + [OTHERS_MULTIPLE] = 60 * 1000,
>> + [OTHERS_BIDIR] = 30 * 1000,
>> +};
>
>Are these timeouts in milliseconds? (a comment or #define might help with that)
Sure, I'll add a comment.
>
><snip>
>
>> diff --git a/lib/conntrack.c b/lib/conntrack.c
>> new file mode 100644
>> index 0000000..840335b
>> --- /dev/null
>> +++ b/lib/conntrack.c
>...
>> +static struct ct_l4_proto *l4_protos[] = {
>> + [IPPROTO_TCP] = &ct_proto_tcp,
>> + [IPPROTO_UDP] = &ct_proto_other,
>> + [IPPROTO_ICMP] = &ct_proto_other,
>> +};
>
>I see that conntrack-other is shared by UDP and ICMP. In the Linux
>datapath, the UDP handler also checks UDP length and UDP checksum - I
>wonder if we can share most of the code here for these protocols but
>add these additional checks for the UDP case?
Thanks for pointing this out!
I think it's a good idea to check the UDP length, I've added a check later.
This didn't handle at all checksum verification. I thought it was out of
scope for the connection tracker, but since the kernel connection tracker
does this, I've added the validation.
None of this need (yet) a separate conntrack-icmp module, I'm implementing
it when the key is being extracted (as you point out later).
We would need a separate conntrack-icmp module if we wanted to track icmp
type and code.
>
>> +static struct conn *
>> +conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
>> + struct conn_lookup_ctx *ctx, uint8_t *state, bool commit,
>> + long long now)
>> +{
>> + unsigned bucket = hash_to_bucket(ctx->hash);
>> + struct conn *nc = NULL;
>> +
>> + if (!valid_new(pkt, &ctx->key)) {
>> + *state |= CS_INVALID;
>> + return nc;
>> + }
>> +
>> + *state |= CS_NEW;
>> +
>> + if (commit) {
>> + nc = new_conn(pkt, &ctx->key, now);
>> +
>> + memcpy(&nc->rev_key, &ctx->key, sizeof nc->rev_key);
>> +
>> + conn_key_reverse(&nc->rev_key);
>> + hmap_insert(&ct->connections[bucket], &nc->node, ctx->hash);
>> + }
>> +
>> + return nc;
>> +}
>
>This function can return NULL....
>
>> +static struct conn *
>> +process_one(struct conntrack *ct, struct dp_packet *pkt,
>> + struct conn_lookup_ctx *ctx, uint16_t zone,
>> + bool commit, long long now)
>> +{
>> + unsigned bucket = hash_to_bucket(ctx->hash);
>> + struct conn *conn = ctx->conn;
>> + uint8_t state = 0;
>> +
>> + if (conn) {
>> + if (ctx->related) {
>> + state |= CS_RELATED;
>> + if (ctx->reply) {
>> + state |= CS_REPLY_DIR;
>> + }
>> + } else {
>> + enum ct_update_res res;
>> +
>> + res = conn_update(conn, pkt, ctx->reply, now);
>> +
>> + switch (res) {
>> + case CT_UPDATE_VALID:
>> + state |= CS_ESTABLISHED;
>> + if (ctx->reply) {
>> + state |= CS_REPLY_DIR;
>> + }
>> + break;
>> + case CT_UPDATE_INVALID:
>> + state |= CS_INVALID;
>> + break;
>> + case CT_UPDATE_NEW:
>> + hmap_remove(&ct->connections[bucket], &conn->node);
>> + delete_conn(conn);
>> + conn = conn_not_found(ct, pkt, ctx, &state, commit, now);
>
>...which is then stored here...
>
>> + break;
>> + }
>> + }
>> +
>> + pkt->md.ct_label = conn->label;
>> + pkt->md.ct_mark = conn->mark;
>> + write_ct_md(pkt, state, zone, conn->mark, conn->label);
>
>...and dereferenced here. Is it possible to send invalid packets for
>an existing connection and cause a NULL pointer dereference?
Yes, thanks for spotting this. I've added a NULL pointer check and
moved this out of the if branch (it's needed also in the else branch)
>
>
>> + } else {
>> + conn = conn_not_found(ct, pkt, ctx, &state, commit, now);
>> + write_ct_md(pkt, state, zone, 0, (ovs_u128) {{0}});
>
>Maybe we can define OVS_U128_MIN (ovs_u128) {{0}} in a header somewhere?
Ok. I've put OVS_U128_MIN in util.h (like OVS_U128_MAX) and defined an
OVS_U128_ZERO macro as an alias.
>
>
>> +/* Cleans up old connection entries. Should be called periodically. */
>> +void
>> +conntrack_run(struct conntrack *ct)
>> +{
>> + unsigned bucket = hash_to_bucket(ct->purge_bucket);
>> + uint32_t inner_bucket = ct->purge_inner_bucket,
>> + inner_offset = ct->purge_inner_offset;
>
>Style seems a little unusual for the above 'inner_offset'.
Right, I put the variable in two separate statements.
>
>
>> +static inline bool
>> +extract_l4_udp(struct conn_key *key, const void *data, size_t size)
>> +{
>> + const struct udp_header *udp = data;
>> +
>> + if (OVS_UNLIKELY(size < UDP_HEADER_LEN)) {
>> + return false;
>> + }
>
>Maybe my comment earlier about UDP len and checksum could be addressed here?
Yes, thank you!
>
>...
>
>
>> +static bool
>> +conn_key_extract(struct conntrack *ct, struct dp_packet *pkt,
>> + struct conn_lookup_ctx *ctx, uint16_t zone)
>> +{
>> + const struct eth_header *l2 = dp_packet_l2(pkt);
>> + const struct ip_header *l3 = dp_packet_l3(pkt);
>> + const char *l4 = dp_packet_l4(pkt);
>> + const char *tail = dp_packet_tail(pkt);
>> + bool ok;
>> +
>> + memset(ctx, 0, sizeof *ctx);
>> +
>> + if (!l2 || !l3 || !l4) {
>> + return false;
>> + }
>> +
>> + ctx->key.zone = zone;
>> +
>> + /* XXX In this function we parse the packet (again, it has already
>> + * gone through miniflow_extract()) for two reasons:
>> + *
>> + * 1) To extract the l3 addresses and l4 ports.
>> + * We already have the l3 and l4 headers' pointers. Extracting
>> + * the l3 addresses and the l4 ports is really cheap, since they
>> + * can be found at fixed locations.
>> + * 2) To extract the l3 and l4 types.
>> + * Extracting the l3 and l4 types (especially the l3[1]) on the
>> + * other hand is quite expensive, because they're not at a
>> + * fixed location.
>> + *
>> + * Here's a way to avoid (2) with the help of the datapath.
>> + * The datapath doesn't keep the packet's extracted flow[2], so
>> + * using that is not an option. We could use the packet's matching
>> + * megaflow, but we have to make sure that the l3 and l4 types
>> + * are unwildcarded. This means either:
>> + *
>> + * a) dpif-netdev unwildcards the l3 (and l4) types when a new flow
>> + * is installed if the actions contains ct(). This is what the
>> + * kernel datapath does. It is not so straightforward, though.
>> + *
>> + * b) ofproto-dpif-xlate unwildcards the l3 (and l4) types when
>> + * translating a ct() action. This is already done in different
>> + * actions and since both the userspace and the kernel datapath
>> + * would benefit from it, it seems an appropriate place to do
>> + * it.
>> + *
>> + * ---
>> + * [1] A simple benchmark (running only the connection tracker
>> + * over and over on the same packets) shows that if the
>> + * l3 type is already provided we are 15% faster (running the
>> + * connection tracker over a couple of DPDK devices with a
>> + * stream of UDP 64-bytes packets shows that we are 4% faster).
>> + *
>> + * [2] The reasons for this are that keeping the flow increases
>> + * (slightly) the cache footprint and increases computation
>> + * time as we move the packet around. Most importantly, the flow
>> + * should be updated by the actions and this can be slow, as
>> + * we use a sparse representation (miniflow).
>> + *
>> + */
>
>I discussed the above offline with Jarno; we will always unwildcard
>the l3 -- see xlate_wc_init().
Oh, you're right, I forgot about that, I guess I need at least to update
the comment.
>
>Do we have a benchmark on how costly the options are for figuring out
>the l4 type?
It's a problem only for IPv6, as we need to parse the extension headers.
None of the solution I thought about seem particularly clean. I'd be inclined
to keep this here and revisit it later.
>
>
>> +static void
>> +conn_key_lookup(struct conntrack *ct,
>> + struct conn_lookup_ctx *ctx,
>> + unsigned bucket,
>> + long long now)
>> +{
>> + struct conn *conn, *found = NULL;
>> + uint32_t hash = ctx->hash;
>> + bool reply;
>> +
>> + HMAP_FOR_EACH_WITH_HASH (conn, node, hash, &ct->connections[bucket]) {
>> + if (!memcmp(&conn->key, &ctx->key, sizeof(conn->key))) {
>> + found = conn;
>> + reply = false;
>> + break;
>> + }
>> + if (!memcmp(&conn->rev_key, &ctx->key, sizeof(conn->rev_key))) {
>> + found = conn;
>> + reply = true;
>> + break;
>> + }
>> + }
>> +
>> + if (found) {
>> + if (conn_expired(found, now)) {
>> + found = NULL;
>> + } else {
>> + ctx->reply = reply;
>> + }
>> + }
>
>Should this check be within the HMAP_FOR_EACH_WITH_HASH?
Yes, that also simplify the code a little bit, thank you.
>
>
>Separate question, and maybe something to be followed up on separately
>from this series, today with the current userspace + kernel
>implementation we can install a flow like:
>
>arp,actions=ct(table=1)
>
>At OpenFlow layer we'll accept this (although maybe we shouldn't)...
>When processing an upcall, we'll translate this (though we have enough
>information to at least complain if not more)
>Then when installing/executing into the kernel datapath, we'll reject
>the command while parsing the action.
>
>I'm considering that maybe we should either a) Restrict this all the
>way up at OpenFlow, or b) loosen the restriction in kernel datapath
>and allow such flows (but mark the traffic as invalid early in
>conntrack processing and skip to the next action).
>
>So, to my question: How does the userspace handle a case like this?
The userspace datapath does what you're suggesting as b): in
conntrack_execute() if the packet is not ipv4 or ipv6 the packet is marked
as invalid (the key extraction fails).
Restricting this at the OpenFlow level means that it wouldn't be possible
to a have _single_ flow to send both ipv4 and ipv6 traffic to the connection
tracker. Do we care about that?
More information about the dev
mailing list