[ovs-dev] [PATCH 2/2] miniflow: Use 64-bit data.

Ben Pfaff blp at nicira.com
Mon Jan 5 22:15:13 UTC 2015


On Mon, Jan 05, 2015 at 02:08:41PM -0800, Jarno Rajahalme wrote:
> 
> On Dec 29, 2014, at 2:27 PM, Ben Pfaff <blp at nicira.com> wrote:
> 
> > On Wed, Dec 17, 2014 at 10:30:42AM -0800, Jarno Rajahalme wrote:
> >> So far the compressed flow data in struct miniflow has been in 32-bit
> >> words with a 63-bit map, allowing for a maximum size of struct flow of
> >> 252 bytes.  With the forthcoming Geneve options this is not sufficient
> >> any more.
> >> 
> >> This patch solves the problem by changing the miniflow data to 64-bit
> >> words, doubling the flow max size to 504 bytes.  Since the word size
> >> is doubled, there is some loss in compression efficiency.  To counter
> >> this some of the flow fields have been reordered to keep related
> >> fields together (e.g., the source and destination IP addresses share
> >> the same 64-bit word).
> >> 
> >> This change should speed up flow data processing on 64-bit CPUs, which
> >> may help counterbalance the impact of making the struct flow bigger in
> >> the future.
> >> 
> >> Classifier lookup stage boundaries are also changed to 64-bit
> >> alignment, as the current algorithm depends on each miniflow word to
> >> not be split between ranges.  This has resulted in new padding (part
> >> of the 'mpls_lse' field).
> >> 
> >> The 'dp_hash' field is also moved to packet metadata to eliminate
> >> otherwise needed padding there.  This allows the L4 to fit into one
> >> 64-bit word, and also makes matches on 'dp_hash' more efficient as
> >> misses can be found already on stage 1.
> >> 
> >> Signed-off-by: Jarno Rajahalme <jrajahalme at nicira.com>
> 
> > This seems mostly straightforward.  Are there particular parts you'd
> > like me to look over carefully?
> > 
> 
> Maybe the changes to the miniflow push macros, which get a bit more
> complicated...

OK, I'll do that in a bit.

> > I have a suggestion for miniflow_extract().  It is getting more
> > complex over time, and becoming more difficult to understand.  I
> > wonder whether a simpler approach would be just as fast and easier to
> > understand.  Suppose that, instead of constructing a miniflow
> > initially, we instead construct a regular "struct flow" on the stack.
> > All the zeroing and then later checking for nonzero values is what
> > drove us earlier to move to building a miniflow directly, so we'd want
> > to avoid that.  But we can do that by not initializing the flow at
> > all, and just keeping track in a map of the u32s (or u64s) we've
> > initialized, and then copying those fields into a miniflow based on
> > the map we assembled.  (I made the same suggestion in November, but I
> > didn't get a reply, so I think that you must have missed it in all the
> > email that you get.)
> 
> This is definitely worth experimenting with, but if may be that the
> cost of scatter writing to most (?) of the cachelines is almost as
> costly as memsetting them in the beginning. That said, parsing
> geneve options will need support for out-of-order processing, so
> what you propose here might be a good way of doing that, too.

Where in memory was the "struct flow" that we were previously zeroing?
I would guess that if we put our temporary struct flow on the stack,
then there wouldn't be cache contention for it since stacks are
(essentially) thread-local and accessed frequently enough to be
cache-hot.  When we copy that into a heap-based miniflow, that should
access a minimal number of cache lines.  I don't know whether, when we
previously zeroed an entire struct flow, that struct flow was on the
stack or the heap; presumably initializing a heap-based struct flow is
more expensive.



More information about the dev mailing list