[ovs-dev] Tunable flow eviction threshold

Jesse Gross jesse at nicira.com
Wed Jul 27 23:13:38 UTC 2011


On Wed, Jul 27, 2011 at 2:24 PM, Pravin Shelar <pshelar at nicira.com> wrote:
> On Wed, Jul 27, 2011 at 12:17 PM, Jesse Gross <jesse at nicira.com> wrote:
>> On Wed, Jul 27, 2011 at 11:14 AM, Ethan Jackson <ethan at nicira.com> wrote:
>>>> One strategy that I have considered is to be able to ask only for flows
>>>> that have a non-zero packet count.  That would help with the common case
>>>> where, when there is a large number of flows, they are caused by a port
>>>> scan or some other activity with 1-packet flows.  It wouldn't help at
>>>> all in your case.
>>>
>>> You could also have the kernel pass down to userspace what logically
>>> amounts to a list of the flows which have had their statistics change
>>> in the past 10 seconds.  A Bloom filter would be a sensible approach.
>>> Again, probably won't help at all in Simon's case, and may or may not
>>> be a useful optimization above simply not pushing down statistics for
>>> flows which have a zero packet count.
>>
>> I don't think that you could implement a Bloom filter like this in a
>> manner that wouldn't cause cache contention.  Probably you would still
>> need to iterate over every flow in the kernel, you would just be
>> comparing last used time to current time - 10 instead of packet count
>> not equal to zero.
>>
> CPU cache contention can be fixed by partitioning all flows by
> something (e.g. port number) and assigning cache replacement
> processing to a CPU.  The replacement algorithm could be as simple as
> active and inactive LRU lists; this is how kernel page cache
> replacement looks from a high level.

This isn't really a cache replacement problem though.  Maybe that's
the high level goal that's being solved but I wouldn't want to make
that assumption in the kernel as it would likely impose too many
restrictions on what userspace can do if it wants to implement
something completely different in the future.  Anything the kernel
provides should just be a simple primitive, potentially analogous to
the referenced bit that you would find in a page table.

You also can't impose a CPU partitioning scheme on flows because we
don't control the CPU that packets are being processed on.  That's
determined by the originator of the packet (such as RSS on the NIC)
and then we just handle it on the same CPU.  However, you can use a
per-CPU data structure to store information regardless of flow and
then merge them later.  This actually works well enough for something
like a Bloom filter because you can superimpose the results on top of
each other without a problem.

The issue is what happens when you want to clear it to start the next
iteration.  I can think of a couple of different ways, none of which
are all that appealing:
 * If you do a destructive read, you can just zero it out as you read
from the different CPUs.  This is fast but potentially results in
losing some flows if one sets a value in between the read and clear.
 * Also for destructive read: put a lock in the per-CPU data that gets
taken both by the CPUs processing packets and by the reader CPU.  This
works but I don't want to introduce an additional spinlock on the
packet processing path even if it isn't contended and is used by the
same CPU the vast majority of the time.
 * If you truly do want time-based expiration then it gets even messier
since what you want is stats for the trailing X seconds.  To do that,
you would probably need a ring buffer of Bloom filters that bucket by
the second on each CPU.  By also keeping track of a per-CPU last used
time, you could know which buckets were valid and when to clear them
without doing a write from a foreign CPU.  As long as the number of
buckets that you are tracking exceeds the time interval that you care
about and the interval is greater than the interval between reads, you
should never miss a flow.  So I guess it is possible, but this seems
like a disproportionately complicated solution to the problem that is
being solved.
