[ovs-dev] [PATCH 7/7] datapath: Report kernel's flow key when passing packets up to userspace.
Jesse Gross
jesse at nicira.com
Tue Dec 28 23:01:31 UTC 2010
On Tue, Dec 28, 2010 at 4:47 PM, Ben Pfaff <blp at nicira.com> wrote:
> On Tue, Dec 28, 2010 at 03:29:09PM -0500, Jesse Gross wrote:
>> On Thu, Dec 23, 2010 at 8:15 PM, Ben Pfaff <blp at nicira.com> wrote:
>> > This commit will slow down performance of checksumming packets sent up to
>> > userspace. I'm not entirely pleased with how I did it. I considered a
>> > couple of alternatives, but none of them seemed that much better.
>> > Suggestions welcome. Not changing anything wasn't an option,
>> > unfortunately. At any rate, some slowdown will become unavoidable when OVS
>> > actually starts using Netlink instead of just Netlink framing.
>> >
>> > (Actually, I thought of one option where we could avoid that: make
>> > userspace do the checksum instead, by passing csum_start and csum_offset as
>> > part of what goes to userspace. But that's not perfect either.)
>>
>> I don't think that the copy and checksum optimization is particularly
>> important. The original reason why I did it is because there was a
>> bug in XenServer that was putting bad data in skbs. Since we only
>> looked at the full packet when sending it to userspace, this moved the
>> accessing of the data from softirq context to process context, which
>> is somewhat less traumatic. The optimization was somewhat secondary
>> and checksumming is pretty cheap if the data is already in the cache
>> (or more realistically, you pay for the cache misses when checksumming
>> but save them when copying).
>>
>> However, I'm not sure why this change impacts checksumming. It looks
>> like we still do the checksum and copy optimization when generating
>> the skb. The difference is that we add a second copy but that's true
>> regardless of whether we are doing checksum offloading or not.
>
> The main reason that I was concerned about checksum optimization was
> that I thought you would be. So I'm less concerned about it now ;-)
>
> I was initially hoping when I wrote this code that it would be easy to
> avoid adding a second copy. It wasn't.
>
> Do you want any changes here or does it look OK?
No, it looks OK.
>
>> If you really wanted to avoid the copy, you could build a frag list instead.
>
> That would work here, but it would have to be taken out when we actually
> start using Netlink, because Netlink uses frag lists for a completely
> different and incompatible purpose (see netlink_recvmsg() in
> net/netlink/af_netlink.c).
Hmm, well, we could use frags instead, although it won't help in quite
as many circumstances.
>> > +
>> > + atomic_inc(&p->sflow_pool);
>>
>> Somewhat unrelated but: why do we need to separately track an sflow
>> pool? Isn't it simply the number of packets received on this port?
>> Also, since it is only 32-bits, it's at risk of overflow.
>
> It's the number of packets that were received on the port, that had a
> matching flow, that were not dropped by loop detection, since the time
> that sflow sampling was enabled on the port.
>
The matching flow part doesn't seem quite right because it will miss
packets that are handled by userspace. I don't think an external
collector should care where the packets are handled. Userspace should
know when sflow was enabled on a port, so it should be able to
compensate manually. Hopefully, loop detection should be a pretty
rare case...
It seems to me that this is something that could be done in userspace.
The kernel piece could then become a more generic sample action.
> We can probably avoid the atomic increment somehow, but this seems like
> something to do elsewhere.
Yes, if this stays in the kernel it should become percpu data.
> I wanted to make sflow_pool a 64-bit value originally, but the InMon
> folks told me that wraparound didn't matter so many times that I
> eventually believed them.
It can be compensated for, if necessary. Maybe the sflow protocol
only has a 32-bit field.
>
> I also want to get rid of the need to send down the actions to userspace
> here. Ugh.
Yes, that would be nice as well.
>> > + odp_flow_key_from_flow(buf, flow);
>> > + key_len = buf->size;
>> > + ofpbuf_pull(buf, key_len);
>> > + ofpbuf_reserve(buf, 2);
>> > + ofpbuf_put(buf, packet->data, packet->size);
>> > +
>> > + upcall = xzalloc(sizeof *upcall);
>> > + upcall->type = queue_no;
>> > + upcall->packet = buf;
>> > + upcall->key = buf->base;
>> > + upcall->key_len = key_len;
>> > + upcall->userdata = arg;
>>
>> Is this actually correct? odp_flow_key_from_flow() puts the flow at
>> the end of the buffer. ofpbuf_pull() removes that data and then it
>> seems like ofpbuf_put() overwrites it. In any case, it looks you are
>> trying to put the key at the front and the packet at the back. Isn't
>> it simpler to just build it in one direction?
>
> odp_flow_key_from_flow() puts the flow at the tail of the buffer.
> Initially that's the start of the allocated region, because
> ofpbuf_new(n) returns an ofpbuf with no headroom and 'n' bytes of
> tailroom. Then ofpbuf_pull() advances past the flow, and then
> ofpbuf_put() puts the packet data at the end.
>
> I have not tested this but it still looks correct to me.
OK, it sounds like it does the right thing. The usage is a little
non-standard, but I can see why you did it that way.
> By the way, do you actually like the term "upcall"? I'm open to other
> suggestions. It seemed better than "packet" since that term is way too
> overloaded. "event" is another possibility I guess.
It seems good enough to me. It doesn't bother me and I don't have any
better idea.