[ovs-dev] [PATCH 7/7] datapath: Report kernel's flow key when passing packets up to userspace.

Ben Pfaff blp at nicira.com
Tue Dec 28 21:47:36 UTC 2010


On Tue, Dec 28, 2010 at 03:29:09PM -0500, Jesse Gross wrote:
> On Thu, Dec 23, 2010 at 8:15 PM, Ben Pfaff <blp at nicira.com> wrote:
> > This commit will slow down performance of checksumming packets sent up to
> > userspace.  I'm not entirely pleased with how I did it.  I considered a
> > couple of alternatives, but none of them seemed that much better.
> > Suggestions welcome.  Not changing anything wasn't an option,
> > unfortunately.  At any rate some slowdown will become unavoidable when OVS
> > actually starts using Netlink instead of just Netlink framing.
> >
> > (Actually, I thought of one option where we could avoid that: make
> > userspace do the checksum instead, by passing csum_start and csum_offset as
> > part of what goes to userspace.  But that's not perfect either.)
> 
> I don't think that the copy and checksum optimization is particularly
> important.  The original reason why I did it is because there was a
> bug in XenServer that was putting bad data in skbs.  Since we only
> looked at the full packet when sending it to userspace, this moved the
> accessing of the data from softirq context to process context, which
> is somewhat less traumatic.  The optimization was somewhat secondary
> and checksumming is pretty cheap if the data is already in the cache
> (or more realistically, you pay for the cache misses when checksumming
> but save them when copying).
> 
> However, I'm not sure why this change impacts checksumming.  It looks
> like we still do the checksum and copy optimization when generating
> the skb.  The difference is that we add a second copy but that's true
> regardless of whether we are doing checksum offloading or not.

The main reason that I was concerned about checksum optimization was
that I thought you would be.  So I'm less concerned about it now ;-)

I was initially hoping when I wrote this code that it would be easy to
avoid adding a second copy.  It wasn't.

Do you want any changes here or does it look OK?

> If you really wanted to avoid the copy, you could build a frag list instead.

That would work here, but it would have to be taken out when we actually
start using Netlink, because Netlink uses frag lists for a completely
different and incompatible purpose (see netlink_recvmsg() in
net/netlink/af_netlink.c).

> > diff --git a/datapath/actions.c b/datapath/actions.c
> > index d8dfeb0..ba86266 100644
> > --- a/datapath/actions.c
> > +++ b/datapath/actions.c
> > -/* Send a copy of this packet up to the sFlow agent, along with extra
> > - * information about what happened to it. */
> >  static void sflow_sample(struct datapath *dp, struct sk_buff *skb,
> > -                        const struct nlattr *a, u32 actions_len,
> > -                        struct vport *vport)
> > +                        const struct sw_flow_key *key,
> > +                        const struct nlattr *a, u32 actions_len)
> >  {
> > -       struct odp_sflow_sample_header *hdr;
> > -       unsigned int hdrlen = sizeof(struct odp_sflow_sample_header);
> >        struct sk_buff *nskb;
> > +       struct vport *p = OVS_CB(skb)->vport;
> >
> > -       nskb = skb_copy_expand(skb, actions_len + hdrlen, 0, GFP_ATOMIC);
> > +       if (!p)
> > +               return;
> 
> How about an unlikely() here?

OK, done.

> > +
> > +       atomic_inc(&p->sflow_pool);
> 
> Somewhat unrelated but: why do we need to separately track an sflow
> pool?  Isn't it simply the number of packets received on this port?
> Also, since it is only 32-bits, it's at risk of overflow.

It's the number of packets that were received on the port, that had a
matching flow, that were not dropped by loop detection, since the time
that sflow sampling was enabled on the port.

We can probably avoid the atomic increment somehow, but this seems like
something to do elsewhere.

I wanted to make sflow_pool a 64-bit value originally, but the InMon
folks told me that wraparound didn't matter so many times that I
eventually believed them.

I also want to get rid of the need to send down the actions to userspace
here.  Ugh.

> > +       if (net_random() >= dp->sflow_probability)
> > +               return;
> > +
> > +       nskb = skb_clone(skb, GFP_ATOMIC);
> >        if (!nskb)
> >                return;
> 
> Another unlikely() here?

Done, thanks.

> > @@ -495,13 +497,7 @@ int execute_actions(struct datapath *dp, struct sk_buff *skb,
> >                    const struct nlattr *actions, u32 actions_len)
> >  {
> >        if (dp->sflow_probability) {
> > -               struct vport *p = OVS_CB(skb)->vport;
> > -               if (p) {
> > -                       atomic_inc(&p->sflow_pool);
> > -                       if (dp->sflow_probability == UINT_MAX ||
> > -                           net_random() < dp->sflow_probability)
> > -                               sflow_sample(dp, skb, actions, actions_len, p);
> > -               }
> > +               sflow_sample(dp, skb, key, actions, actions_len);
> >        }
> 
> The braces are unnecessary now.

Removed, thanks.

> > diff --git a/datapath/datapath.c b/datapath/datapath.c
> > index 9096c44..273f1bd 100644
> > --- a/datapath/datapath.c
> > +++ b/datapath/datapath.c
> > +static void copy_and_csum_skb(struct sk_buff *skb, void *to)
> > +{
> > +       u16 csum_start, csum_offset;
> > +       __wsum csum;
> > +
> > +       get_skb_csum_pointers(skb, &csum_start, &csum_offset);
> > +       csum_start -= skb_headroom(skb);
> > +
> > +       skb_copy_bits(skb, 0, to, csum_start);
> > +
> > +       BUG_ON(csum_start >= skb_headlen(skb));
> 
> It seems like the BUG_ON should be before we actually use the thing we
> are checking.

Good point, I moved this above the skb_copy_bits() call now.

> >  /* Append each packet in 'skb' list to 'queue'.  There will be only one packet
> >  * unless we broke up a GSO packet. */
> > -static int queue_control_packets(struct sk_buff *skb, struct sk_buff_head *queue,
> > -                                int queue_no, u64 arg)
> > +static int queue_control_packets(struct datapath *dp, struct sk_buff *skb,
> > +                                u32 type, u64 userdata, u32 sample_pool,
> > +                                const struct sw_flow_key *key,
> > +                                const struct nlattr *actions, u32 actions_len)
> 
> There are a lot of arguments here.  Should we group them into a struct?

OK, done.

> > +               skb2 = alloc_skb(len, GFP_ATOMIC);
> 
> Can we find a more unique name than skb vs skb2?

I renamed skb2 as user_skb, since it will be presented directly to
userspace.

> > +               upcall = (struct odp_upcall *)__skb_put(skb2, sizeof *upcall);
> 
> There should be parenthesis with the sizeof call.

OK, fixed.

> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index c7403f2..278749c 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> >  static int
> >  dp_netdev_output_control(struct dp_netdev *dp, const struct ofpbuf *packet,
> > -                         int queue_no, int port_no, uint64_t arg)
> > +                         int queue_no, const struct flow *flow, uint64_t arg)
> [...]
> > +    buf = ofpbuf_new((flow ? ODPUTIL_FLOW_KEY_BYTES : 0) + 2 + packet->size);
> 
> Shouldn't flow always be non-NULL?

Oops, you're right, I changed that.

Fixed.

> > +    odp_flow_key_from_flow(buf, flow);
> > +    key_len = buf->size;
> > +    ofpbuf_pull(buf, key_len);
> > +    ofpbuf_reserve(buf, 2);
> > +    ofpbuf_put(buf, packet->data, packet->size);
> > +
> > +    upcall = xzalloc(sizeof *upcall);
> > +    upcall->type = queue_no;
> > +    upcall->packet = buf;
> > +    upcall->key = buf->base;
> > +    upcall->key_len = key_len;
> > +    upcall->userdata = arg;
> 
> Is this actually correct?  odp_flow_key_from_flow() puts the flow at
> the end of the buffer.  ofpbuf_pull() removes that data and then it
> seems like ofpbuf_put() overwrites it.  In any case, it looks you are
> trying to put the key at the front and the packet at the back.  Isn't
> it simpler to just build it in one direction?

odp_flow_key_from_flow() puts the flow at the tail of the buffer.
Initially that's the start of the allocated region, because
ofpbuf_new(n) returns an ofpbuf with no headroom and 'n' bytes of
tailroom.  Then ofpbuf_pull() advances past the flow, and then
ofpbuf_put() puts the packet data at the end.

I have not tested this but it still looks correct to me.

> > @@ -1250,6 +1272,7 @@ dp_netdev_execute_actions(struct dp_netdev *dp,
> >                           const struct nlattr *actions,
> >                           size_t actions_len)
> >  {
> > +    struct flow orig_key = *key;
> 
> Do we actually change the key anywhere, such that it would require
> making a copy?

Hmm, I thought we did, but you're right that we don't.  I was misled by
missing 'const' on various functions called by
dp_netdev_execute_actions().  I'll add a patch to add those consts, and
remove the copy.

> > diff --git a/lib/dpif-provider.h b/lib/dpif-provider.h
> > index dd79b8a..e7bbf7a 100644
> > --- a/lib/dpif-provider.h
> > +++ b/lib/dpif-provider.h
> > @@ -300,16 +300,22 @@ struct dpif_class {
> >     int (*queue_to_priority)(const struct dpif *dpif, uint32_t queue_id,
> >                              uint32_t *priority);
> >
> > -    /* Attempts to receive a message from 'dpif'.  If successful, stores the
> > -     * message into '*packetp'.  The message, if one is received, must begin
> > -     * with 'struct odp_msg' as a header, and must have at least
> > -     * DPIF_RECV_MSG_PADDING bytes of headroom (allocated using
> > -     * e.g. ofpbuf_reserve()).  Only messages of the types selected with the
> > -     * set_listen_mask member function should be received.
> > +    /* Polls for an upcall from 'dpif'.  If successful, stores the upcall into
> > +     * '*upcall'.  Only upcalls of the types selected with the set_listen_mask
> > +     * member function should be received.
> >      *
> > -     * This function must not block.  If no message is ready to be received
> > -     * when it is called, it should return EAGAIN without blocking. */
> > -    int (*recv)(struct dpif *dpif, struct ofpbuf **packetp);
> > +     * The caller takes ownership of the data that 'upcall' points to.  If
> > +     * 'upcall->key' or 'upcall->actions' is nonnull, then it points into data
> > +     * owned by 'upcall->packet', so their memory canno be freed separately.
> 
> "cannot"

Thanks, fixed.

> > diff --git a/lib/dpif.h b/lib/dpif.h
> > index 483fcde..6526d31 100644
> > --- a/lib/dpif.h
> > +++ b/lib/dpif.h
> > @@ -89,19 +89,35 @@ int dpif_dump_next(const struct dpif *, struct dpif_dump *, struct odp_flow *);
> >  int dpif_execute(struct dpif *, const struct nlattr *actions,
> >                  size_t actions_len, const struct ofpbuf *);
> >
> > -/* Minimum number of bytes of headroom for a packet returned by dpif_recv()
> > - * member function.  This headroom allows "struct odp_msg" to be replaced by
> > - * "struct ofp_packet_in" without copying the buffer. */
> > -#define DPIF_RECV_MSG_PADDING \
> > -        ROUND_UP(sizeof(struct ofp_packet_in) - sizeof(struct odp_msg), 8)
> > -BUILD_ASSERT_DECL(sizeof(struct ofp_packet_in) > sizeof(struct odp_msg));
> > -BUILD_ASSERT_DECL(DPIF_RECV_MSG_PADDING % 8 == 0);
> > +/* A packet passed up from the datapath to userspace.
> > + *
> > + * If 'key' or 'actions' is nonnull, then it points into data owned by
> > + * 'packet', so their memory cannot be freed separately.  (This is hardly a
> > + * great way to do things but it works out OK for the dpif providers and
> > + * clients that exist so far.)
> > + */
> 
> Shouldn't key always be non-NULL?

You're right, I changed the code to always supply a key before I sent it
out for review.

Fixed, thanks.

> > diff --git a/ofproto/ofproto.c b/ofproto/ofproto.c
> > index 95800c3..6bd95ec 100644
> > --- a/ofproto/ofproto.c
> > +++ b/ofproto/ofproto.c
> > -schedule_packet_in(struct ofconn *ofconn, struct ofpbuf *packet, int max_len,
> > -                   bool clone)
> > +schedule_packet_in(struct ofconn *ofconn, struct dpif_upcall *upcall,
> > +                   const struct flow *flow, bool clone)
> 
> The comment above this function still refers to max_len.

And the rest of that comment was wrong too.  Fixed, thanks.

By the way, do you actually like the term "upcall"?  I'm open to other
suggestions.  It seemed better than "packet" since that term is way too
overloaded.  "event" is another possibility I guess.

I'll send out a revised series in a while.

Thanks for the review,

Ben.




More information about the dev mailing list