[ovs-dev] [PATCH 7/7] datapath: Report kernel's flow key when passing packets up to userspace.

Jesse Gross jesse at nicira.com
Tue Dec 28 23:01:31 UTC 2010


On Tue, Dec 28, 2010 at 4:47 PM, Ben Pfaff <blp at nicira.com> wrote:
> On Tue, Dec 28, 2010 at 03:29:09PM -0500, Jesse Gross wrote:
>> On Thu, Dec 23, 2010 at 8:15 PM, Ben Pfaff <blp at nicira.com> wrote:
>> > This commit will slow down performance of checksumming packets sent up to
>> > userspace.  I'm not entirely pleased with how I did it.  I considered a
>> > couple of alternatives, but none of them seemed that much better.
>> > Suggestions welcome.  Not changing anything wasn't an option,
>> > unfortunately.  At any rate some slowdown will become unavoidable when OVS
>> > actually starts using Netlink instead of just Netlink framing.
>> >
>> > (Actually, I thought of one option where we could avoid that: make
>> > userspace do the checksum instead, by passing csum_start and csum_offset as
>> > part of what goes to userspace.  But that's not perfect either.)
>>
>> I don't think that the copy and checksum optimization is particularly
>> important.  The original reason why I did it is because there was a
>> bug in XenServer that was putting bad data in skbs.  Since we only
>> looked at the full packet when sending it to userspace, this moved the
>> accessing of the data from softirq context to process context, which
>> is somewhat less traumatic.  The optimization was somewhat secondary
>> and checksumming is pretty cheap if the data is already in the cache
>> (or more realistically, you pay for the cache misses when checksumming
>> but save them when copying).
>>
>> However, I'm not sure why this change impacts checksumming.  It looks
>> like we still do the checksum and copy optimization when generating
>> the skb.  The difference is that we add a second copy but that's true
>> regardless of whether we are doing checksum offloading or not.
>
> The main reason that I was concerned about checksum optimization was
> that I thought you would be.  So I'm less concerned about it now ;-)
>
> I was initially hoping when I wrote this code that it would be easy to
> avoid adding a second copy.  It wasn't.
>
> Do you want any changes here or does it look OK?

No, it looks OK.
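
For anyone following along, the copy-and-checksum optimization we're
talking about is the usual single-pass trick.  An untested sketch
(assuming the 4-argument csum_partial_copy_nocheck() signature, and a
linear skb; real code would use skb_copy_and_csum_bits() to walk
fragments too):

    /* Copy 'len' bytes of skb data into 'dst' and checksum them in
     * the same pass, so the cache misses are paid once instead of
     * twice. */
    static __wsum copy_and_csum(const struct sk_buff *skb, void *dst,
                                int len)
    {
            return csum_partial_copy_nocheck(skb->data, dst, len, 0);
    }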

>
>> If you really wanted to avoid the copy, you could build a frag list instead.
>
> That would work here, but it would have to be taken out when we actually
> start using Netlink, because Netlink uses frag lists for a completely
> different and incompatible purpose (see netlink_recvmsg() in
> net/netlink/af_netlink.c).

Hmm, well, we could use frags instead, although it won't help in quite
as many circumstances.
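The difference being that frags attach page-sized chunks to the same
skb, while the frag_list chains whole skbs, so frags only help when
the data already lives in pages.  Roughly (untested sketch; page
refcounting and error handling omitted):

    /* Append 'page' as paged data instead of chaining a second skb
     * on the frag_list. */
    skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
                       page, offset, size);
    skb->len += size;
    skb->data_len += size;
    skb->truesize += size;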

>> > +
>> > +       atomic_inc(&p->sflow_pool);
>>
>> Somewhat unrelated but: why do we need to separately track an sflow
>> pool?  Isn't it simply the number of packets received on this port?
>> Also, since it is only 32-bits, it's at risk of overflow.
>
> It's the number of packets that were received on the port, that had a
> matching flow, that were not dropped by loop detection, since the time
> that sflow sampling was enabled on the port.
>

The matching flow part doesn't seem quite right because it will miss
packets that are handled by userspace.  I don't think an external
collector should care where the packets are handled.  Userspace should
know when sflow was enabled on a port, so it should be able to
compensate manually.  Hopefully, loop detection should be a pretty
rare case...

It seems to me that this is something that could be done in userspace.
The kernel piece could then become a more generic sample action.
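
On the userspace side, the sampling decision itself is cheap.  A
hypothetical sketch using the existing random_uint32() helper (the
should_sample() name and the probability encoding are made up for
illustration):

    /* Return true if this packet should be sampled.  'probability'
     * is the sampling probability scaled so that UINT32_MAX means
     * sample every packet. */
    static bool
    should_sample(uint32_t probability)
    {
        return random_uint32() <= probability;
    }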

> We can probably avoid the atomic increment somehow, but this seems like
> something to do elsewhere.

Yes, if this stays in the kernel it should become percpu data.
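Something along these lines (untested sketch; the struct and field
names are illustrative, and on 32-bit a real version would want
u64_stats_sync to avoid torn reads):

    /* At port setup: */
    p->sflow_pool = alloc_percpu(u64);

    /* Hot path: no atomic op, no cross-CPU cacheline bouncing. */
    this_cpu_inc(*p->sflow_pool);

    /* Reader sums over all CPUs; the result may be slightly stale,
     * which is fine for a sampling pool count. */
    static u64 read_sflow_pool(struct vport *p)
    {
            u64 total = 0;
            int cpu;

            for_each_possible_cpu(cpu)
                    total += *per_cpu_ptr(p->sflow_pool, cpu);
            return total;
    }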

> I wanted to make sflow_pool a 64-bit value originally, but the InMon
> folks told me that wraparound didn't matter so many times that I
> eventually believed them.

It can be compensated for, if necessary.  Maybe the sflow protocol
only has a 32-bit field.
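If it does, wraparound is still harmless as long as the collector
computes deltas in modular arithmetic, e.g.:

    /* Correct even if 'cur' has wrapped past 'prev'. */
    uint32_t delta = cur - prev;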

>
> I also want to get rid of the need to send down the actions to userspace
> here.  Ugh.

Yes, that would be nice as well.

>> > +    odp_flow_key_from_flow(buf, flow);
>> > +    key_len = buf->size;
>> > +    ofpbuf_pull(buf, key_len);
>> > +    ofpbuf_reserve(buf, 2);
>> > +    ofpbuf_put(buf, packet->data, packet->size);
>> > +
>> > +    upcall = xzalloc(sizeof *upcall);
>> > +    upcall->type = queue_no;
>> > +    upcall->packet = buf;
>> > +    upcall->key = buf->base;
>> > +    upcall->key_len = key_len;
>> > +    upcall->userdata = arg;
>>
>> Is this actually correct?  odp_flow_key_from_flow() puts the flow at
>> the end of the buffer.  ofpbuf_pull() removes that data and then it
>> seems like ofpbuf_put() overwrites it.  In any case, it looks you are
>> trying to put the key at the front and the packet at the back.  Isn't
>> it simpler to just build it in one direction?
>
> odp_flow_key_from_flow() puts the flow at the tail of the buffer.
> Initially that's the start of the allocated region, because
> ofpbuf_new(n) returns an ofpbuf with no headroom and 'n' bytes of
> tailroom.  Then ofpbuf_pull() advances past the flow, and then
> ofpbuf_put() puts the packet data at the end.
>
> I have not tested this but it still looks correct to me.

OK, it sounds like it does the right thing.  The usage is a little
non-standard, but I can see why you did it that way.
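
For anyone else reading this, the buffer states work out like so (my
annotation of the code quoted above, per the ofpbuf semantics Ben
describes):

    buf = ofpbuf_new(n);                /* no headroom, n bytes tailroom */
    odp_flow_key_from_flow(buf, flow);  /* [key|.......tailroom.......]  */
    key_len = buf->size;
    ofpbuf_pull(buf, key_len);          /* data advances past the key;
                                         * the key stays at buf->base    */
    ofpbuf_reserve(buf, 2);             /* 2 bytes headroom, so the
                                         * packet payload ends up aligned */
    ofpbuf_put(buf, packet->data,
               packet->size);           /* [key|2|packet...|.tailroom.]  */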

> By the way, do you actually like the term "upcall"?  I'm open to other
> suggestions.  It seemed better than "packet" since that term is way too
> overloaded.  "event" is another possibility I guess.

It seems good enough to me.  It doesn't bother me and I don't have any
better idea.



