[ovs-dev] [PATCH v10 4/5] dpif-netdev: refactor generic implementation

Ian Stokes ian.stokes at intel.com
Fri Jul 12 14:19:15 UTC 2019


On 7/9/2019 1:34 PM, Harry van Haaren wrote:
> This commit refactors the generic implementation. The
> goal of this refactor is to simply the code to enable
'simply' -> 'simplify'?

> "specialization" of the functions at compile time.
> 
> Given compile-time optimizations, the compiler is able
> to unroll loops, and create optimized code sequences due
> to compile time knowledge of loop-trip counts.
> 
> In order to enable these compiler optimizations, we must
> refactor the code to pass the loop-trip counts to functions
> as compile time constants.
> 
> This patch allows the number of miniflow-bits set per "unit"
> in the miniflow to be passed around as a function argument.
> 
> Note that this patch does NOT yet take advantage of doing so,
> this is only a refactor to enable it in the next patches.
> 
> Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> Tested-by: Malvika Gupta <malvika.gupta at arm.com>
> 

Thanks for the v10, Harry. Some comments below.

> ---
> 
> v10:
> - Rebase updates from previous patches
> - Fix whitespace indentation of func params
> - Removed restrict keyword, Windows CI failing when it is used (Ian)
> - Fix integer 0 used to set NULL pointer (Ilya)
> - Postpone free() call on cls->blocks_scratch (Ilya)
> - Fix indentation of a function
> 
> v9:
> - Use count_1bits in favour of __builtin_popcount (Ilya)
> - Use ALWAYS_INLINE instead of __attribute__ synatx (Ilya)
> 
> v8:
> - Rework block_cache and mf_masks to avoid variable-lenght array
>    due to compiler issues. Provisioning for worst case is not a
>    good solution due to magnitue of over-provisioning required.
> - Rework netdev_flatten function removing unused parameter
> ---
>   lib/dpif-netdev-lookup-generic.c | 239 ++++++++++++++++++++++++-------
>   lib/dpif-netdev.c                |  77 +++++++++-
>   lib/dpif-netdev.h                |  18 +++
>   3 files changed, 281 insertions(+), 53 deletions(-)
> 
> diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
> index d49d4b570..432d8782e 100644
> --- a/lib/dpif-netdev-lookup-generic.c
> +++ b/lib/dpif-netdev-lookup-generic.c
> @@ -29,67 +29,204 @@
>   #include "packets.h"
>   #include "pvector.h"
>   
> -/* Returns a hash value for the bits of 'key' where there are 1-bits in
> - * 'mask'. */
> -static inline uint32_t
> -netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key,
> -                             const struct netdev_flow_key *mask)
> +VLOG_DEFINE_THIS_MODULE(dpif_lookup_generic);
> +
> +/* netdev_flow_key_flatten_unit:

No need to mention the function name in comments. This applies to the other
function comments in the patch as well.

For function comments in OVS you can follow:

http://docs.openvswitch.org/en/latest/internals/contributing/coding-style/#functions
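
For instance, the comment above could drop the name and describe the
behaviour directly, something like (wording is just a sketch):

    /* Given a packet, subtable blocks and mf_masks, iterates over each bit
     * set in the subtable and calculates the appropriate metadata to store
     * in blocks_scratch[]. */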

> + * Given a packet, table and mf_masks, this function iterates over each bit
> + * set in the subtable, and calculates the appropriate metadata to store in the
> + * blocks_scratch[].
> + *
> + * The results of the blocks_scratch[] can be used for hashing, and later for
> + * verification of if a rule matches the given packet.
> + */
> +static inline void
> +netdev_flow_key_flatten_unit(const uint64_t *pkt_blocks,
> +                             const uint64_t *tbl_blocks,
> +                             const uint64_t *mf_masks,
> +                             uint64_t *blocks_scratch,
> +                             const uint64_t pkt_mf_bits,
> +                             const uint32_t count)
>   {
> -    const uint64_t *p = miniflow_get_values(&mask->mf);
> -    uint32_t hash = 0;
> -    uint64_t value;
> +    uint32_t i;
A new line here would be good, just to separate the variable declaration from
the function body.

> +    for (i = 0; i < count; i++) {
> +        uint64_t mf_mask = mf_masks[i];
> +        /* Calculate the block index for the packet metadata */
> +        uint64_t idx_bits = mf_mask & pkt_mf_bits;
> +        const uint32_t pkt_idx = count_1bits(idx_bits);
>   
> -    NETDEV_FLOW_KEY_FOR_EACH_IN_FLOWMAP (value, key, mask->mf.map) {
> -        hash = hash_add64(hash, value & *p);
> -        p++;
> +        /* check if the packet has the subtable miniflow bit set. If yes, the

For the comment above and for other comments in the patch: start with a
capital letter and finish with a period.

> +         * block at the above pkt_idx will be stored, otherwise it is masked
> +         * out to be zero.
> +         */
> +        uint64_t pkt_has_mf_bit = (mf_mask + 1) & pkt_mf_bits;
> +        uint64_t no_bit = ((!pkt_has_mf_bit) > 0) - 1;
> +
> +        /* mask packet block by table block, and mask to zero if packet
> +         * doesn't actually contain this block of metadata
> +         */
> +        blocks_scratch[i] = pkt_blocks[pkt_idx] & tbl_blocks[i] & no_bit;
>       }
> +}
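
Not a blocker: the branchless no_bit computation above took me a second read.
A short worked example in the comment might help, along the lines of:

    /* E.g. if mf_mask is 0x7, mf_mask + 1 (0x8) isolates the subtable's
     * bit; no_bit is then all-ones if the packet map has that bit set,
     * and zero otherwise. */
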
> +
> +/* netdev_flow_key_flatten:
> + * This function takes a packet, and subtable and writes an array of uint64_t
> + * blocks. The blocks contain the metadata that the subtable matches on, in
> + * the same order as the subtable, allowing linear iteration over the blocks.
> + *
> + * To calculate the blocks contents, the netdev_flow_key_flatten_unit function
> + * is called twice, once for each "unit" of the miniflow. This call can be
> + * inlined by the compiler for performance.
> + *
> + * Note that the u0_count and u1_count variables can be compile-time constants,
> + * allowing the loop in the inlined flatten_unit() function to be compile-time
> + * unrolled, or possibly removed totally by unrolling by the loop iterations.
> + * The compile time optimizations enabled by this design improves performance.
> + */
> +static inline void
> +netdev_flow_key_flatten(const struct netdev_flow_key *key,
> +                        const struct netdev_flow_key *mask,
> +                        const uint64_t *mf_masks,
> +                        uint64_t *blocks_scratch,
> +                        const uint32_t u0_count,
> +                        const uint32_t u1_count)
> +{
> +    /* load mask from subtable, mask with packet mf, popcount to get idx */
> +    const uint64_t *pkt_blocks = miniflow_get_values(&key->mf);
> +    const uint64_t *tbl_blocks = miniflow_get_values(&mask->mf);
> +
> +    /* packet miniflow bits to be masked by pre-calculated mf_masks */
> +    const uint64_t pkt_bits_u0 = key->mf.map.bits[0];
> +    const uint32_t pkt_bits_u0_pop = count_1bits(pkt_bits_u0);
> +    const uint64_t pkt_bits_u1 = key->mf.map.bits[1];
>   
> -    return hash_finish(hash, (p - miniflow_get_values(&mask->mf)) * 8);
> +    /* Unit 0 flattening */
> +    netdev_flow_key_flatten_unit(&pkt_blocks[0],
> +                                 &tbl_blocks[0],
> +                                 &mf_masks[0],
> +                                 &blocks_scratch[0],
> +                                 pkt_bits_u0,
> +                                 u0_count);
> +
> +    /* Unit 1 flattening:
> +     * Move the pointers forward in the arrays based on u0 offsets, NOTE:
> +     * 1) pkt blocks indexed by actual popcount of u0, which is NOT always
> +     *    the same as the amount of bits set in the subtable.
> +     * 2) mf_masks, tbl_block and blocks_scratch are all "flat" arrays, so
> +     *    the index is always u0_count.
> +     */
> +    netdev_flow_key_flatten_unit(&pkt_blocks[pkt_bits_u0_pop],
> +                                 &tbl_blocks[u0_count],
> +                                 &mf_masks[u0_count],
> +                                 &blocks_scratch[u0_count],
> +                                 pkt_bits_u1,
> +                                 u1_count);
> +}
> +
> +static inline uint64_t
> +netdev_rule_matches_key(const struct dpcls_rule *rule,
> +                        const uint32_t mf_bits_total,
> +                        const uint64_t *blocks_scratch)
> +{
> +    const uint64_t *keyp = miniflow_get_values(&rule->flow.mf);
> +    const uint64_t *maskp = miniflow_get_values(&rule->mask->mf);
> +
> +    uint64_t not_match = 0;
> +    for (int i = 0; i < mf_bits_total; i++) {
> +        not_match |= (blocks_scratch[i] & maskp[i]) != keyp[i];
> +    }
> +
> +    /* invert result to show match as 1 */
> +    return !not_match;
>   }
>   
> +/* const prop version of the function: note that mf bits total and u0 are
> + * explicitly passed in here, while they're also available at runtime from the
> + * subtable pointer. By making them compile time, we enable the compiler to
> + * unroll loops and flatten out code-sequences based on the knowledge of the
> + * mf_bits_* compile time values. This results in improved performance.
> + */
> +static inline uint32_t ALWAYS_INLINE
> +lookup_generic_impl(struct dpcls_subtable *subtable,
> +                    uint64_t *blocks_scratch,
> +                    uint32_t keys_map,
> +                    const struct netdev_flow_key *keys[],
> +                    struct dpcls_rule **rules,
> +                    const uint32_t bit_count_u0,
> +                    const uint32_t bit_count_u1)
> +{
> +    const uint32_t n_pkts = count_1bits(keys_map);
> +    ovs_assert(NETDEV_MAX_BURST >= n_pkts);
> +    uint32_t hashes[NETDEV_MAX_BURST];
> +
> +    const uint32_t bit_count_total = bit_count_u0 + bit_count_u1;
> +    uint64_t *mf_masks = subtable->mf_masks;
> +    int i;
> +
> +    /* Flatten the packet metadata into the blocks_scratch[] using subtable */
> +    ULLONG_FOR_EACH_1(i, keys_map) {
> +            netdev_flow_key_flatten(keys[i],
> +                                    &subtable->mask,
> +                                    mf_masks,
> +                                    &blocks_scratch[i * bit_count_total],
> +                                    bit_count_u0,
> +                                    bit_count_u1);
> +    }
> +
> +    /* Hash the now linearized blocks of packet metadata */
> +    ULLONG_FOR_EACH_1(i, keys_map) {
> +         uint32_t hash = 0;
> +         uint32_t i_off = i * bit_count_total;
> +         for (int h = 0; h < bit_count_total; h++) {
> +             hash = hash_add64(hash, blocks_scratch[i_off + h]);
> +         }
> +         hashes[i] = hash_finish(hash, bit_count_total * 8);

Can we replace the magic number 8 above?
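
For example, assuming the 8 is meant to be the byte width of each block,
something like this would self-document it:

    hashes[i] = hash_finish(hash, bit_count_total * sizeof blocks_scratch[0]);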

> +    }
> +
> +    /* Lookup: this returns a bitmask of packets where the hash table had
> +     * an entry for the given hash key. Presence of a hash key does not
> +     * guarantee matching the key, as there can be hash collisions.
> +     */
> +    uint32_t found_map;
> +    const struct cmap_node *nodes[NETDEV_MAX_BURST];
> +    found_map = cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
> +
> +    /* Verify that packet actually matched rule. If not found, a hash
> +     * collision has taken place, so continue searching with the next node.
> +     */
> +    ULLONG_FOR_EACH_1(i, found_map) {
> +        struct dpcls_rule *rule;
> +
> +        CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
> +            const uint32_t cidx = i * bit_count_total;
> +            uint32_t match = netdev_rule_matches_key(rule, bit_count_total,
> +                                                     &blocks_scratch[cidx]);
> +
> +            if (OVS_LIKELY(match)) {
> +                rules[i] = rule;
> +                subtable->hit_cnt++;
> +                goto next;
> +            }
> +        }
> +
> +        /* None of the found rules was a match.  Clear the i-th bit to
> +         * search for this key in the next subtable. */
> +        ULLONG_SET0(found_map, i);
> +    next:
> +        ;                     /* Keep Sparse happy. */
> +    }
> +
> +    return found_map;
> +}
> +
> +/* Generic - use runtime provided mf bits */
>   uint32_t
>   dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
> +                              uint64_t *blocks_scratch,
>                                 uint32_t keys_map,
>                                 const struct netdev_flow_key *keys[],
>                                 struct dpcls_rule **rules)
>   {
> -        int i;
> -        /* Compute hashes for the remaining keys.  Each search-key is
> -         * masked with the subtable's mask to avoid hashing the wildcarded
> -         * bits. */
> -        uint32_t hashes[NETDEV_MAX_BURST];
> -        ULLONG_FOR_EACH_1(i, keys_map) {
> -            hashes[i] = netdev_flow_key_hash_in_mask(keys[i],
> -                                                     &subtable->mask);
> -        }
> -
> -        /* Lookup. */
> -        const struct cmap_node *nodes[NETDEV_MAX_BURST];
> -        uint32_t found_map =
> -                cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
> -        /* Check results.  When the i-th bit of found_map is set, it means
> -         * that a set of nodes with a matching hash value was found for the
> -         * i-th search-key.  Due to possible hash collisions we need to check
> -         * which of the found rules, if any, really matches our masked
> -         * search-key. */
> -        ULLONG_FOR_EACH_1(i, found_map) {
> -            struct dpcls_rule *rule;
> -
> -            CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
> -                if (OVS_LIKELY(dpcls_rule_matches_key(rule, keys[i]))) {
> -                    rules[i] = rule;
> -                    /* Even at 20 Mpps the 32-bit hit_cnt cannot wrap
> -                     * within one second optimization interval. */
> -                    subtable->hit_cnt++;
> -                    goto next;
> -                }
> -            }
> -            /* None of the found rules was a match.  Reset the i-th bit to
> -             * keep searching this key in the next subtable. */
> -            ULLONG_SET0(found_map, i);  /* Did not match. */
> -        next:
> -            ;                     /* Keep Sparse happy. */
> -        }
> -
> -        return found_map;
> +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> +                               subtable->mf_bits_set_unit0,
> +                               subtable->mf_bits_set_unit1);
>   }
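
As an aside for anyone following the series: the point of threading the bit
counts through as arguments is that a later patch can add specialized
wrappers where they become compile-time constants, roughly along these lines
(the name and counts here are illustrative only):

    static uint32_t
    dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable *subtable,
                                       uint64_t *blocks_scratch,
                                       uint32_t keys_map,
                                       const struct netdev_flow_key *keys[],
                                       struct dpcls_rule **rules)
    {
        /* Constant 5 and 1 let the compiler unroll the inner loops. */
        return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys,
                                   rules, 5, 1);
    }
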
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 190cc8918..03ab5267a 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -233,6 +233,15 @@ struct dpcls {
>       odp_port_t in_port;
>       struct cmap subtables_map;
>       struct pvector subtables;
> +
> +    /* Region of memory for this DPCLS instance to use as scratch.
> +     * Size is garaunteed to be large enough to hold all blocks required for
'garaunteed' -> 'guaranteed'

> +     * the subtable's to match on. This allows each dpcls lookup to flatten
> +     * the packet miniflows into this blocks_scratch area, without using
> +     * variable lenght arrays. This region is allocated on subtable create, and
> +     * will be resized as required if a larger subtable is added. */
> +    uint64_t *blocks_scratch;
> +    uint32_t blocks_scratch_size;
>   };
>   
>   /* Data structure to keep packet order till fastpath processing. */
> @@ -7567,6 +7576,7 @@ static void
>   dpcls_subtable_destroy_cb(struct dpcls_subtable *subtable)
>   {
>       cmap_destroy(&subtable->rules);
> +    ovsrcu_postpone(free, subtable->mf_masks);
>       ovsrcu_postpone(free, subtable);
>   }
>   
> @@ -7577,6 +7587,8 @@ dpcls_init(struct dpcls *cls)
>   {
>       cmap_init(&cls->subtables_map);
>       pvector_init(&cls->subtables);
> +    cls->blocks_scratch = NULL;
> +    cls->blocks_scratch_size = 0;
>   }
>   
>   static void
> @@ -7604,6 +7616,7 @@ dpcls_destroy(struct dpcls *cls)
>           }
>           cmap_destroy(&cls->subtables_map);
>           pvector_destroy(&cls->subtables);
> +        ovsrcu_postpone(free, cls->blocks_scratch);
>       }
>   }
>   
> @@ -7619,7 +7632,28 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
>       subtable->hit_cnt = 0;
>       netdev_flow_key_clone(&subtable->mask, mask);
>   
> -    /* Decide which hash/lookup/verify function to use. */
> +    /* The count of bits in the mask defines the space required for masks.
> +     * Then call gen_masks() to create the appropriate masks, avoiding the cost
> +     * of doing runtime calculations. */
> +    uint32_t unit0 = count_1bits(mask->mf.map.bits[0]);
> +    uint32_t unit1 = count_1bits(mask->mf.map.bits[1]);
> +    subtable->mf_bits_set_unit0 = unit0;
> +    subtable->mf_bits_set_unit1 = unit1;
> +
> +    subtable->mf_masks = xmalloc(sizeof(uint64_t) * (unit0 + unit1));
> +    netdev_flow_key_gen_masks(mask, subtable->mf_masks, unit0, unit1);
> +
> +    /* Allocate blocks scratch space only if subtable requires more size than
> +     * is currently allocated. */
> +    const uint32_t blocks_required_per_pkt = unit0 + unit1;
> +    if (cls->blocks_scratch_size < blocks_required_per_pkt) {
> +        free(cls->blocks_scratch);
> +        cls->blocks_scratch = xmalloc(sizeof(uint64_t) * NETDEV_MAX_BURST *
> +                                      blocks_required_per_pkt);

For the sizeof above I don't think you need the parentheses, as the operand
is an expression. See the OVS coding standard below. This may apply to the
xmalloc for mf_masks above as well.

The sizeof operator is unique among C operators in that it accepts two 
very different kinds of operands: an expression or a type. In general, 
prefer to specify an expression, e.g. int *x = xmalloc(sizeof *x);. When 
the operand of sizeof is an expression, there is no need to parenthesize 
that operand, and please don't.
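
Applied here, that would be something like (just a sketch of the style point,
keeping the same sizes):

    cls->blocks_scratch = xmalloc(sizeof *cls->blocks_scratch *
                                  NETDEV_MAX_BURST * blocks_required_per_pkt);

and similarly sizeof *subtable->mf_masks for the mf_masks allocation.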

> +        cls->blocks_scratch_size = blocks_required_per_pkt;
> +    }
> +
> +    /* Assign the generic lookup - this works with any miniflow fingerprint. */
>       subtable->lookup_func = dpcls_subtable_lookup_generic;
>   
>       cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
> @@ -7764,6 +7798,43 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
>       }
>   }
>   

<snip>

> @@ -95,8 +97,18 @@ struct dpcls_subtable {
>        * subtable matches on. The miniflow "bits" are used to select the actual
>        * dpcls lookup implementation at subtable creation time.
>        */
Although you can't see it here, the comment above seems a little misleading;
maybe a copy-and-paste typo?

"The lookup function to use for this subtable. From a technical point of
view, this is just an implementation of mutliple dispatch. The attribute
used to identify the ideal function is the miniflow fingerprint that the
subtable matches on. The miniflow "bits" are used to select the actual
dpcls lookup implementation at subtable creation time."

The mf_bits_set_unit fields are not the lookup function itself. You could
probably re-word it as something like:

"Miniflow fingerprint that the subtable matches on. The miniflow "bits" 
are used to select the actual dpcls lookup implementation at subtable 
creation time"

> +    uint8_t mf_bits_set_unit0;
> +    uint8_t mf_bits_set_unit1;
> +
> +    /* the lookup function to use for this subtable. If there is a known
> +     * property of the subtable (eg: only 3 bits of miniflow metadata is
> +     * used for the lookup) then this can point at an optimized version of
> +     * the lookup function for this particular subtable. */
>       dpcls_subtable_lookup_func lookup_func;
>   
> +    /* caches the masks to match a packet to, reducing runtime calculations */
> +    uint64_t *mf_masks;
> +
>       struct netdev_flow_key mask; /* Wildcards for fields (const). */
>       /* 'mask' must be the last field, additional space is allocated here. */
>   };
> @@ -105,6 +117,12 @@ struct dpcls_subtable {
>   #define NETDEV_FLOW_KEY_FOR_EACH_IN_FLOWMAP(VALUE, KEY, FLOWMAP)   \
>       MINIFLOW_FOR_EACH_IN_FLOWMAP (VALUE, &(KEY)->mf, FLOWMAP)
>   

A one-line comment explaining the prototype would be nice here.
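
e.g. something like (wording is just a suggestion):

    /* Generates a mask for each block in the miniflow, storing the results
     * in 'mf_masks'. */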

Thanks
Ian
> +void
> +netdev_flow_key_gen_masks(const struct netdev_flow_key *tbl,
> +                          uint64_t *mf_masks,
> +                          const uint32_t mf_bits_u0,
> +                          const uint32_t mf_bits_u1);
> +
>   bool dpcls_rule_matches_key(const struct dpcls_rule *rule,
>                               const struct netdev_flow_key *target);
>   
> 


