[ovs-dev] [PATCH 7/7] datapath: Report kernel's flow key when passing packets up to userspace.

Tue Dec 28 20:29:09 UTC 2010

On Thu, Dec 23, 2010 at 8:15 PM, Ben Pfaff <blp at nicira.com> wrote:
> One of the goals for Open vSwitch is to decouple kernel and userspace
> software, so that either one can be upgraded or rolled back independent of
> the other.  To do this in full generality, it must be possible to change
> the kernel's idea of the flow key separately from the userspace version.
>
> This commit takes one step in that direction by making the kernel report
> its idea of the flow that a packet belongs to whenever it passes a packet
> up to userspace.  This means that userspace can intelligently figure out
> what to do:
>
>   - If userspace's notion of the flow for the packet matches the kernel's,
>     then nothing special is necessary.
>
>   - If the kernel has a more specific notion for the flow than userspace,
>     for example if the kernel decoded IPv6 headers but userspace stopped
>     at the Ethernet type (because it does not understand IPv6), then again
>     nothing special is necessary: userspace can still set up the flow in
>     the usual way.
>
>   - If userspace has a more specific notion for the flow than the kernel,
>     for example if userspace decoded an IPv6 header but the kernel
>     stopped at the Ethernet type, then userspace can forward the packet
>     manually, without setting up a flow in the kernel.  (This case is
>     bad from a performance point of view, but at least it is correct.)

This is very clever.  I like this a lot.

>
> This commit does not actually make userspace flexible enough to handle
> changes in the kernel flow key structure, although userspace does now
> have enough information to do that intelligently.  This will have to wait
> for later commits.
>
> This commit is bigger than it would otherwise be because it is rolled
> together with changing "struct odp_msg" to a sequence of Netlink
> attributes.  The alternative, to do each of those changes in a separate
> patch, seemed like overkill because it meant that either we would have to
> introduce and then kill off Netlink attributes for in_port and tun_id, if
> Netlink conversion went first, or shove yet another variable-length header
> into the stuff already after odp_msg, if adding the flow key to odp_msg
> went first.
>
> This commit will slow down performance of checksumming packets sent up to
> userspace.  I'm not entirely pleased with how I did it.  I considered a
> couple of alternatives, but none of them seemed that much better.
> Suggestions welcome.  Not changing anything wasn't an option,
> unfortunately.  At any rate some slowdown will become unavoidable when OVS
> actually starts using Netlink instead of just Netlink framing.
>
> (Actually, I thought of one option where we could avoid that: make
> userspace do the checksum instead, by passing csum_start and csum_offset as
> part of what goes to userspace.  But that's not perfect either.)

I don't think that the copy and checksum optimization is particularly
important.  The original reason why I did it is because there was a
bug in XenServer that was putting bad data in skbs.  Since we only
looked at the full packet when sending it to userspace, this moved the
accessing of the data from softirq context to process context, which
is somewhat less traumatic.  The optimization was somewhat secondary
and checksumming is pretty cheap if the data is already in the cache
(or more realistically, you pay for the cache misses when checksumming
but save them when copying).

However, I'm not sure why this change impacts checksumming.  It looks
like we still do the checksum and copy optimization when generating
the skb.  The difference is that we add a second copy but that's true
regardless of whether we are doing checksum offloading or not.

If you really wanted to avoid the copy, you could build a frag list instead.

> diff --git a/datapath/actions.c b/datapath/actions.c
> index d8dfeb0..ba86266 100644
> --- a/datapath/actions.c
> +++ b/datapath/actions.c
> -/* Send a copy of this packet up to the sFlow agent, along with extra
> - * information about what happened to it. */
>  static void sflow_sample(struct datapath *dp, struct sk_buff *skb,
> -                        const struct nlattr *a, u32 actions_len,
> -                        struct vport *vport)
> +                        const struct sw_flow_key *key,
> +                        const struct nlattr *a, u32 actions_len)
>  {
> -       struct odp_sflow_sample_header *hdr;
> -       unsigned int hdrlen = sizeof(struct odp_sflow_sample_header);
>        struct sk_buff *nskb;
> +       struct vport *p = OVS_CB(skb)->vport;
>
> -       nskb = skb_copy_expand(skb, actions_len + hdrlen, 0, GFP_ATOMIC);
> +       if (!p)
> +               return;

How about an unlikely() here?

> +
> +       atomic_inc(&p->sflow_pool);

Somewhat unrelated but: why do we need to separately track an sflow
pool?  Isn't it simply the number of packets received on this port?
Also, since it is only 32-bits, it's at risk of overflow.

> +       if (net_random() >= dp->sflow_probability)
> +               return;
> +
> +       nskb = skb_clone(skb, GFP_ATOMIC);
>        if (!nskb)
>                return;

Another unlikely() here?

> @@ -495,13 +497,7 @@ int execute_actions(struct datapath *dp, struct sk_buff *skb,
>                    const struct nlattr *actions, u32 actions_len)
>  {
>        if (dp->sflow_probability) {
> -               struct vport *p = OVS_CB(skb)->vport;
> -               if (p) {
> -                       atomic_inc(&p->sflow_pool);
> -                       if (dp->sflow_probability == UINT_MAX ||
> -                           net_random() < dp->sflow_probability)
> -                               sflow_sample(dp, skb, actions, actions_len, p);
> -               }
> +               sflow_sample(dp, skb, key, actions, actions_len);
>        }

The braces are unnecessary now.

> diff --git a/datapath/datapath.c b/datapath/datapath.c
> index 9096c44..273f1bd 100644
> --- a/datapath/datapath.c
> +++ b/datapath/datapath.c
> +static void copy_and_csum_skb(struct sk_buff *skb, void *to)
> +{
> +       u16 csum_start, csum_offset;
> +       __wsum csum;
> +
> +       get_skb_csum_pointers(skb, &csum_start, &csum_offset);
> +       csum_start -= skb_headroom(skb);
> +
> +       skb_copy_bits(skb, 0, to, csum_start);
> +
> +       BUG_ON(csum_start >= skb_headlen(skb));

It seems like the BUG_ON should be before we actually use the thing we
are checking.

>  /* Append each packet in 'skb' list to 'queue'.  There will be only one packet
>  * unless we broke up a GSO packet. */
> -static int queue_control_packets(struct sk_buff *skb, struct sk_buff_head *queue,
> -                                int queue_no, u64 arg)
> +static int queue_control_packets(struct datapath *dp, struct sk_buff *skb,
> +                                u32 type, u64 userdata, u32 sample_pool,
> +                                const struct sw_flow_key *key,
> +                                const struct nlattr *actions, u32 actions_len)

There are a lot of arguments here.  Should we group them into a struct?

> +               skb2 = alloc_skb(len, GFP_ATOMIC);

Can we find a more unique name than skb vs skb2?

> +               upcall = (struct odp_upcall *)__skb_put(skb2, sizeof *upcall);

There should be parenthesis with the sizeof call.

> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index c7403f2..278749c 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
>  static int
>  dp_netdev_output_control(struct dp_netdev *dp, const struct ofpbuf *packet,
> -                         int queue_no, int port_no, uint64_t arg)
> +                         int queue_no, const struct flow *flow, uint64_t arg)
[...]
> +    buf = ofpbuf_new((flow ? ODPUTIL_FLOW_KEY_BYTES : 0) + 2 + packet->size);

Shouldn't flow always be non-NULL?

> +    odp_flow_key_from_flow(buf, flow);
> +    key_len = buf->size;
> +    ofpbuf_pull(buf, key_len);
> +    ofpbuf_reserve(buf, 2);
> +    ofpbuf_put(buf, packet->data, packet->size);
> +
> +    upcall = xzalloc(sizeof *upcall);
> +    upcall->type = queue_no;
> +    upcall->packet = buf;
> +    upcall->key = buf->base;
> +    upcall->key_len = key_len;
> +    upcall->userdata = arg;

Is this actually correct?  odp_flow_key_from_flow() puts the flow at
the end of the buffer.  ofpbuf_pull() removes that data and then it
seems like ofpbuf_put() overwrites it.  In any case, it looks you are
trying to put the key at the front and the packet at the back.  Isn't
it simpler to just build it in one direction?

> @@ -1250,6 +1272,7 @@ dp_netdev_execute_actions(struct dp_netdev *dp,
>                           const struct nlattr *actions,
>                           size_t actions_len)
>  {
> +    struct flow orig_key = *key;

Do we actually change the key anywhere, such that it would require
making a copy?

> diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
> index dd79b8a..e7bbf7a 100644
> --- a/lib/dpif-provider.h
> +++ b/lib/dpif-provider.h
> @@ -300,16 +300,22 @@ struct dpif_class {
>     int (*queue_to_priority)(const struct dpif *dpif, uint32_t queue_id,
>                              uint32_t *priority);
>
> -    /* Attempts to receive a message from 'dpif'.  If successful, stores the
> -     * message into '*packetp'.  The message, if one is received, must begin
> -     * with 'struct odp_msg' as a header, and must have at least
> -     * DPIF_RECV_MSG_PADDING bytes of headroom (allocated using
> -     * e.g. ofpbuf_reserve()).  Only messages of the types selected with the
> -     * set_listen_mask member function should be received.
> +    /* Polls for an upcall from 'dpif'.  If successful, stores the upcall into
> +     * '*upcall'.  Only upcalls of the types selected with the set_listen_mask
> +     * member function should be received.
>      *
> -     * This function must not block.  If no message is ready to be received
> -     * when it is called, it should return EAGAIN without blocking. */
> -    int (*recv)(struct dpif *dpif, struct ofpbuf **packetp);
> +     * The caller takes ownership of the data that 'upcall' points to.  If
> +     * 'upcall->key' or 'upcall->actions' is nonnull, then it points into data
> +     * owned by 'upcall->packet', so their memory canno be freed separately.

"cannot"

> diff --git a/lib/dpif.h b/lib/dpif.h
> index 483fcde..6526d31 100644
> --- a/lib/dpif.h
> +++ b/lib/dpif.h
> @@ -89,19 +89,35 @@ int dpif_dump_next(const struct dpif *, struct dpif_dump *, struct odp_flow *);
>  int dpif_execute(struct dpif *, const struct nlattr *actions,
>                  size_t actions_len, const struct ofpbuf *);
>
> -/* Minimum number of bytes of headroom for a packet returned by dpif_recv()
> - * member function.  This headroom allows "struct odp_msg" to be replaced by
> - * "struct ofp_packet_in" without copying the buffer. */
> -#define DPIF_RECV_MSG_PADDING \
> -        ROUND_UP(sizeof(struct ofp_packet_in) - sizeof(struct odp_msg), 8)
> -BUILD_ASSERT_DECL(sizeof(struct ofp_packet_in) > sizeof(struct odp_msg));
> -BUILD_ASSERT_DECL(DPIF_RECV_MSG_PADDING % 8 == 0);
> +/* A packet passed up from the datapath to userspace.
> + *
> + * If 'key' or 'actions' is nonnull, then it points into data owned by
> + * 'packet', so their memory cannot be freed separately.  (This is hardly a
> + * great way to do things but it works out OK for the dpif providers and
> + * clients that exist so far.)
> + */

Shouldn't key always be non-NULL?

> diff --git a/ofproto/ofproto.c b/ofproto/ofproto.c
> index 95800c3..6bd95ec 100644
> --- a/ofproto/ofproto.c
> +++ b/ofproto/ofproto.c
> -schedule_packet_in(struct ofconn *ofconn, struct ofpbuf *packet, int max_len,
> -                   bool clone)
> +schedule_packet_in(struct ofconn *ofconn, struct dpif_upcall *upcall,
> +                   const struct flow *flow, bool clone)

The comment above this function still refers to max_len.