[ovs-dev] [PATCH 5/5] lib/cmap: cmap_find_batch().

Ben Pfaff blp at nicira.com
Mon Oct 6 18:52:18 UTC 2014


On Wed, Sep 24, 2014 at 11:31:47AM -0700, Jarno Rajahalme wrote:
> Batching the cmap find improves the memory behavior with large cmaps
> and can make searches twice as fast:
> 
> $ tests/ovstest test-cmap benchmark 2000000 8 0.1 16
> Benchmarking with n=2000000, 8 threads, 0.10% mutations, batch size 16:
> cmap insert:    533 ms
> cmap iterate:    57 ms
> batch search:   146 ms
> cmap destroy:   233 ms
> 
> cmap insert:    552 ms
> cmap iterate:    56 ms
> cmap search:    299 ms
> cmap destroy:   229 ms
> 
> hmap insert:    222 ms
> hmap iterate:   198 ms
> hmap search:   2061 ms
> hmap destroy:   209 ms
> 
> Batch size 1 has a small performance penalty, but all other batch
> sizes are faster than the non-batched cmap_find().  A batch size of
> 16 was experimentally found to be better than 8 or 32, so
> classifier_lookup_miniflow_batch() now performs the cmap find
> operations in batches of 16.
> 
> Signed-off-by: Jarno Rajahalme <jrajahalme at nicira.com>

Here, I think that it would be possible to use __builtin_constant_p()
to inline operations on "long"-sized bitmaps, which would let users
get the benefit automatically without having to be aware of the size
of the map.  I don't know whether it's important enough, though:
> +/* More efficient access to a map of single ulong. */
> +#define ULONG_FOR_EACH_1(IDX, MAP)                  \
> +    for (unsigned long map__ = (MAP);               \
> +         map__ && (((IDX) = raw_ctz(map__)), true); \
> +         map__ = zero_rightmost_1bit(map__))
> +
> +#define ULONG_SET0(MAP, OFFSET) ((MAP) &= ~(1UL << (OFFSET)))
> +#define ULONG_SET1(MAP, OFFSET) ((MAP) |= 1UL << (OFFSET))
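
For instance, something like this untested sketch could dispatch at
compile time (MAP_SET1() is just a placeholder name; the fallback is
the generic bitmap_set1() from lib/bitmap.h):

    /* Sketch only: 'MAP' is an "unsigned long *" here.  If the
     * caller's map size is a compile-time constant that fits in one
     * long, expand to the inline single-ulong code; otherwise fall
     * back to the generic bitmap_set1().  GCC/Clang only, untested. */
    #define MAP_SET1(MAP, OFFSET, N_BITS)                           \
        ((__builtin_constant_p(N_BITS)                              \
          && (N_BITS) <= CHAR_BIT * sizeof(unsigned long))          \
         ? (void) ULONG_SET1((MAP)[0], OFFSET)                      \
         : bitmap_set1(MAP, OFFSET))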

It's nice, when possible, to allow callers to use any batch size they
want and then loop internally as necessary to handle very large
batches.  I don't know whether you considered that approach for
classifier_lookup_miniflow_batch(), instead of defining
CLASSIFIER_MAX_BATCH.  In this case, the restriction actually applies
only to the Windows build, so we risk bugs that show up only on
Windows.
(I guess that the easy way to avoid that risk is to include the
assertion
    ovs_assert(n_maps <= CLASSIFIER_MAX_BATCH);
on all platforms instead of just Windows.)
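
If you did go the any-batch-size route, the internal loop could look
roughly like this (untested sketch; the parameter names and the
classifier_lookup_miniflow_chunk() helper are made up for
illustration):

    /* Sketch only: accept any 'cnt' and chunk it internally so that
     * each pass fits the fixed-size maps[] used by the lookup code.
     * classifier_lookup_miniflow_chunk() is a hypothetical helper
     * that handles at most CLASSIFIER_MAX_BATCH flows per call. */
    while (cnt > 0) {
        size_t chunk = MIN(cnt, CLASSIFIER_MAX_BATCH);

        classifier_lookup_miniflow_chunk(cls, flows, rules, chunk);
        flows += chunk;
        rules += chunk;
        cnt -= chunk;
    }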

On the other hand, limiting the batch size to the size of an unsigned
long makes sense to me in the classifier_lookup_miniflow_batch()
interface.

map_type and MAP_BITS are declared in an IMO awkward spot between the
function explanatory comment and the function declaration.  I would
move them inside the function code block.

Here, I think that the Windows version should declare maps[] to have
DIV_ROUND_UP(CLASSIFIER_MAX_BATCH, MAP_BITS) elements, rather than
CLASSIFIER_MAX_BATCH elements:

        const int n_maps = DIV_ROUND_UP(cnt, MAP_BITS);

    #if !defined(__CHECKER__) && !defined(_WIN32)
        map_type maps[n_maps];
    #else
        map_type maps[CLASSIFIER_MAX_BATCH];
        ovs_assert(n_maps <= CLASSIFIER_MAX_BATCH);
    #endif
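
That is, something like this instead (untested; using ARRAY_SIZE() in
the assertion is just a suggestion):

    #if !defined(__CHECKER__) && !defined(_WIN32)
        map_type maps[n_maps];
    #else
        map_type maps[DIV_ROUND_UP(CLASSIFIER_MAX_BATCH, MAP_BITS)];
        ovs_assert(n_maps <= ARRAY_SIZE(maps));
    #endif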

Why does a batch have 16 lookups instead of CHAR_BIT * sizeof(unsigned
long)?  It's not obvious and the comment doesn't say.  Oh, I see, it's
explained in the commit message; will you please copy that explanation
into the comment on the batch size?

This is intricate code.  I spent some time reading it carefully.  I
did not find any bugs.

Thank you!

Acked-by: Ben Pfaff <blp at nicira.com>


