[ovs-dev] [PATCH] RFC: Pass more packet and flow key info to userspace.

Jesse Gross jesse at nicira.com
Wed Jan 30 00:59:59 UTC 2013


On Tue, Jan 29, 2013 at 7:10 AM, Rajahalme, Jarno (NSN - FI/Espoo)
<jarno.rajahalme at nsn.com> wrote:
>
> On Jan 24, 2013, at 19:41 , ext Jesse Gross wrote:
>
>> On Thu, Jan 24, 2013 at 7:34 AM, Jarno Rajahalme
>> <jarno.rajahalme at nsn.com> wrote:
>>>
>>> On Jan 23, 2013, at 19:30 , ext Jesse Gross wrote:
>>>
>>>> On Tue, Jan 22, 2013 at 9:48 PM, Jarno Rajahalme
>>>> <jarno.rajahalme at nsn.com> wrote:
>>>>> Add OVS_PACKET_ATTR_KEY_INFO to relieve userspace from re-computing
>>>>> data already computed within the kernel datapath.  In the typical
>>>>> case of an upcall with perfect key fitness between kernel and
>>>>> userspace this eliminates flow_extract() and flow_hash() calls in
>>>>> handle_miss_upcalls().
>>>>>
>>>>> Additional bookkeeping within the kernel datapath is minimal.
>>>>> Kernel flow insertion also saves one flow key hash computation.
>>>>>
>>>>> Removed setting the packet's l7 pointer for ICMP packets, as this was
>>>>> never used.
>>>>>
>>>>> Signed-off-by: Jarno Rajahalme <jarno.rajahalme at nsn.com>
>>>>> ---
>>>>>
>>>>> This likely requires some discussion, but it took a while for me to
>>>>> understand why each packet miss upcall would require flow_extract()
>>>>> right after the flow key has been obtained from odp attributes.
>>>>
>>>> Do you have any performance numbers to share?  Since this is an
>>>> optimization it's important to understand if the benefit is worth the
>>>> extra complexity.
>>>
>>> Not yet, but I would be happy to. Any hints on the best way of obtaining
>>> meaningful numbers for something like this?
>>
>> This is a flow setup optimization, so usually something like netperf
>> TCP_CRR would be a good way to stress that.
>>
>> However, Ben mentioned to me that he had tried eliminating the
>> flow_extract() call from userspace in the past and the results were
>> disappointing.
>
> I ran a simple test with only one flow entry, "in_port=LOCAL actions=drop", and only the local port configured. One process sends UDP packets with different source/destination port combinations in a loop (a sketch of such a generator appears below), and OVS tries to cope with the load. During the test both processes run at close to 100% CPU utilization in a virtual machine on a dual-core laptop. On each round 10100000 packets were generated:
>
> OFPST_PORT reply (xid=0x2): 1 ports
>   port LOCAL: rx pkts=10100006, bytes=464600468, drop=0, errs=0, frame=0, over=0, crc=0
>            tx pkts=0, bytes=0, drop=0, errs=0, coll=0
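
For reference, a minimal sketch in C of the kind of generator described above; the destination address, port ranges, and payload are arbitrary placeholders and are not meant to reproduce the exact packet count of the test:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
    /* Send small UDP packets with rotating source/destination ports so
     * that (almost) every packet misses the kernel flow table and
     * triggers an upcall. */
    const char payload[] = "x";
    struct sockaddr_in dst;

    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);  /* local port only */

    for (int sport = 10000; sport < 11000; sport++) {
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in src;

        memset(&src, 0, sizeof src);
        src.sin_family = AF_INET;
        src.sin_addr.s_addr = htonl(INADDR_ANY);
        src.sin_port = htons(sport);
        bind(sock, (struct sockaddr *) &src, sizeof src);

        for (int dport = 20000; dport < 21000; dport++) {
            dst.sin_port = htons(dport);
            sendto(sock, payload, sizeof payload, 0,
                   (struct sockaddr *) &dst, sizeof dst);
        }
        close(sock);
    }
    return 0;
}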
>
> With current master, 19.35% of the packets on average get processed by the flow:
>
> Round 1:
> NXST_FLOW reply (xid=0x4):
>  cookie=0x0, duration=29.124s, table=0, n_packets=1959794, n_bytes=90150548, idle_age=4, in_port=LOCAL actions=drop
>
> Round 2:
> NXST_FLOW reply (xid=0x4):
>  cookie=0x0, duration=63.534s, table=0, n_packets=1932785, n_bytes=88908158, idle_age=37, in_port=LOCAL actions=drop
>
> Round 3:
> NXST_FLOW reply (xid=0x4):
>  cookie=0x0, duration=33.394s, table=0, n_packets=1972389, n_bytes=90729894, idle_age=8, in_port=LOCAL actions=drop
>
>
> With the proposed change, 20.2% of the packets on average get processed by the flow:
>
> Round 4:
> NXST_FLOW reply (xid=0x4):
>  cookie=0x0, duration=31.96s, table=0, n_packets=2042759, n_bytes=93966914, idle_age=4, in_port=LOCAL actions=drop
>
> Round 5:
> NXST_FLOW reply (xid=0x4):
>  cookie=0x0, duration=38.6s, table=0, n_packets=2040224, n_bytes=93850372, idle_age=8, in_port=LOCAL actions=drop
>
> Round 6:
> NXST_FLOW reply (xid=0x4):
>  cookie=0x0, duration=35.661s, table=0, n_packets=2039595, n_bytes=93821418, idle_age=3, in_port=LOCAL actions=drop
>
>
> So there is a consistent benefit, but it is not very large. It seems that flow_extract() and flow_hash() account for only a small portion of the CPU time OVS spends on flow setup.

Thanks for testing this out to get some concrete numbers.

One thing that comes to mind is whether we actually use the layer
pointers in the packet all that often.  It seems to me that in cases
where we are actually able to set up a kernel flow, userspace should
be able to do all of its work by looking only at that flow.  The other
cases, such as userspace directly consuming the packet or an imperfect
key fit, should be rare.  If that's the case, could we get a similar
benefit without touching the userspace/kernel interface by initializing
the pointers only on demand?
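
To make that alternative concrete, here is a minimal sketch of on-demand initialization, assuming a simplified packet struct; the names (struct pkt, parse_layers, pkt_l4) stand in for the real ofpbuf and flow_extract() machinery and are not actual OVS code:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the userspace packet buffer. */
struct pkt {
    uint8_t *data;
    size_t size;
    void *l3;
    void *l4;
    bool layers_valid;   /* true once the layer pointers have been computed */
};

/* Placeholder for the header parsing done by flow_extract(); the real
 * code would walk the Ethernet/IP/transport headers here. */
static void
parse_layers(struct pkt *p)
{
    p->l3 = p->data;
    p->l4 = p->data;
    p->layers_valid = true;
}

/* Accessor that defers parsing until a caller actually needs the
 * transport header, so the common flow-setup path never pays for it. */
static inline void *
pkt_l4(struct pkt *p)
{
    if (!p->layers_valid) {
        parse_layers(p);
    }
    return p->l4;
}

If the layer pointers really are needed only in the rare cases mentioned above, the deferred parse would be paid only on those paths, with no change to the kernel/userspace interface.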


