[ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic scalar functions

Malvika Gupta Malvika.Gupta at arm.com
Wed Jun 19 16:17:20 UTC 2019


Hi Harry,

I would like to test your patch on ARM platforms and report any performance improvement. Could you please share the test vectors you used to hit the subtables that select the specialized lookup functions, i.e., dpcls_subtable_lookup_mf_u0w5_u1w1, dpcls_subtable_lookup_mf_u0w4_u1w1, and dpcls_subtable_lookup_mf_u0w4_u1w0? In other words, how did you distinguish them by traffic pattern, i.e., which traffic pattern matches which specialized subtable lookup?

Thank you,
Malvika Gupta

> -----Original Message-----
> From: ovs-dev-bounces at openvswitch.org <ovs-dev-
> bounces at openvswitch.org> On Behalf Of Harry van Haaren
> Sent: Wednesday, May 8, 2019 11:13 PM
> To: ovs-dev at openvswitch.org
> Cc: i.maximets at samsung.com
> Subject: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic scalar
> functions
>
> This commit adds a number of specialized functions that handle common
> miniflow fingerprints. This enables compiler optimization, resulting in
> higher performance. Below is a quick description of how this optimization
> actually works:
>
> "Specialized functions" are "instances" of the generic implementation, but
> the compiler is given extra context when compiling. In the case of iterating
> miniflow datastructures, the most interesting value to enable compile time
> optimizations is the loop trip count per unit.
>
> To create a specialized function, we start from a generic implementation,
> which uses a for() loop whose trip count the compiler does not know at
> compile time. The loop trip count is passed in as an argument to the
> function:
>
> uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count)
> {
>     for (uint32_t i = 0; i < loop_count; i++) {
>         // do work
>     }
> }
>
> In order to "specialize" the function, we call the generic implementation with
> hard-coded numbers - these are compile time constants!
>
> uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count)
> {
>     // use a hard-coded constant for compile-time constant propagation
>     return miniflow_impl_generic(mf, 5);
> }
>
> Given the compiler is aware of the loop trip count at compile time, it can
> perform an optimization known as "constant propagation". Combined with
> inlining of the miniflow_impl_generic() function, the compiler can now
> unroll the loop 5x at *compile time* and produce "flat" code.
>
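> As an illustrative sketch (not actual compiler output; do_work() is a
> hypothetical stand-in for the per-iteration miniflow work), the specialized
> instance effectively becomes the following straight-line code:
>
> uint32_t do_work(struct miniflow *mf, uint32_t i); /* placeholder */
>
> uint32_t miniflow_impl_loop5_unrolled(struct miniflow *mf)
> {
>     uint32_t result = 0;
>     /* The loop compare, increment and branch are gone entirely. */
>     result += do_work(mf, 0);
>     result += do_work(mf, 1);
>     result += do_work(mf, 2);
>     result += do_work(mf, 3);
>     result += do_work(mf, 4);
>     return result;
> }
>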
> The last step in using the specialized functions is to select the
> specialized (or generic) implementation through a function pointer.
> The function pointer is chosen at subtable creation time, when the miniflow
> fingerprint of the subtable is known. This technique is known as "multiple
> dispatch" in some literature, as it uses multiple items of information (the
> miniflow bit counts) to select the dispatch function.
>
> By pointing the function pointer at the optimized implementation, OvS
> benefits from the compile time optimizations at runtime.
>
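> As a minimal self-contained sketch of this dispatch step (names simplified
> from the actual symbols; the real probe keys on both the unit 0 and unit 1
> bit counts):
>
> typedef uint32_t (*miniflow_lookup_func)(struct miniflow *mf,
>                                          uint32_t loop_count);
>
> miniflow_lookup_func miniflow_impl_probe(uint32_t loop_count)
> {
>     /* Return a specialized instance if one matches the trip count. */
>     if (loop_count == 5) {
>         return miniflow_impl_loop5;
>     }
>     /* NULL tells the caller to fall back to the generic version. */
>     return NULL;
> }
>
> /* At subtable creation time (subtable and count as in the code below): */
> subtable->lookup_func = miniflow_impl_probe(count);
> if (!subtable->lookup_func) {
>     subtable->lookup_func = miniflow_impl_generic;
> }
>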
> Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
>
> ---
>
> v8:
> - Rework to use blocks_cache from the dpcls instance, to avoid variable
>   length arrays in the data-path.
> ---
>  lib/dpif-netdev-lookup-generic.c | 83 +++++++++++++++++++++++++++++---
>  lib/dpif-netdev.c                |  9 +++-
>  lib/dpif-netdev.h                |  8 +++
>  3 files changed, 90 insertions(+), 10 deletions(-)
>
> diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
> index 901d28ff0..ba2d024cc 100644
> --- a/lib/dpif-netdev-lookup-generic.c
> +++ b/lib/dpif-netdev-lookup-generic.c
> @@ -100,11 +100,11 @@ netdev_flow_key_flatten(const struct netdev_flow_key * restrict key,
>
>      /* Unit 0 flattening */
>      netdev_flow_key_flatten_unit(&pkt_blocks[0],
> -                            &tbl_blocks[0],
> -                            &mf_masks[0],
> -                            &blocks_scratch[0],
> -                            pkt_bits_u0,
> -                            u0_count);
> +                                 &tbl_blocks[0],
> +                                 &mf_masks[0],
> +                                 &blocks_scratch[0],
> +                                 pkt_bits_u0,
> +                                 u0_count);
>
>      /* Unit 1 flattening:
>       * Move the pointers forward in the arrays based on u0 offsets, NOTE:
> @@ -225,7 +225,74 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
>                                const struct netdev_flow_key *keys[],
>                                struct dpcls_rule **rules)
>  {
> -        return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys,
> -                                   rules, subtable->mf_bits_set_unit0,
> -                                   subtable->mf_bits_set_unit1);
> +    /* Here the runtime subtable->mf_bits counts are used, which forces the
> +     * compiler to iterate normal for() loops. Due to this limitation in the
> +     * compiler's available optimizations, this function has lower performance
> +     * than the specialized functions below.
> +     */
> +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> +                               subtable->mf_bits_set_unit0,
> +                               subtable->mf_bits_set_unit1);
> +}
> +
> +static uint32_t
> +dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable *subtable,
> +                                   uint64_t *blocks_scratch,
> +                                   uint32_t keys_map,
> +                                   const struct netdev_flow_key *keys[],
> +                                   struct dpcls_rule **rules)
> +{
> +    /* Hard-coded bit counts enable compile-time loop unrolling, and the
> +     * generation of optimized code sequences from the unrolled loop.
> +     */
> +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> +                               5, 1);
> +}
> +
> +static uint32_t
> +dpcls_subtable_lookup_mf_u0w4_u1w1(struct dpcls_subtable *subtable,
> +                                   uint64_t *blocks_scratch,
> +                                   uint32_t keys_map,
> +                                   const struct netdev_flow_key *keys[],
> +                                   struct dpcls_rule **rules)
> +{
> +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> +                               4, 1);
> +}
> +
> +static uint32_t
> +dpcls_subtable_lookup_mf_u0w4_u1w0(struct dpcls_subtable *subtable,
> +                                   uint64_t *blocks_scratch,
> +                                   uint32_t keys_map,
> +                                   const struct netdev_flow_key *keys[],
> +                                   struct dpcls_rule **rules)
> +{
> +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> +                               4, 0);
> +}
> +
> +/* Probe function to lookup an available specialized function.
> + * If capable to run the requested miniflow fingerprint, this function
> + * returns the most optimal implementation for that miniflow fingerprint.
> + * @retval FunctionAddress A valid function to handle the miniflow bit pattern
> + * @retval 0 The requested miniflow is not supported here, NULL is returned
> + */
> +dpcls_subtable_lookup_func
> +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
> +{
> +    dpcls_subtable_lookup_func f = NULL;
> +
> +    if (u0_bits == 5 && u1_bits == 1) {
> +        f = dpcls_subtable_lookup_mf_u0w5_u1w1;
> +    } else if (u0_bits == 4 && u1_bits == 1) {
> +        f = dpcls_subtable_lookup_mf_u0w4_u1w1;
> +    } else if (u0_bits == 4 && u1_bits == 0) {
> +        f = dpcls_subtable_lookup_mf_u0w4_u1w0;
> +    }
> +
> +    if (f) {
> +        VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
> +                  u0_bits, u1_bits);
> +    }
> +    return f;
>  }
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 33b93cfdf..23fc5b7a6 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -7623,8 +7623,13 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
>          cls->blocks_scratch_size = blocks_required_per_pkt;
>      }
>
> -    /* Assign the generic lookup - this works with any miniflow fingerprint */
> -    subtable->lookup_func = dpcls_subtable_lookup_generic;
> +    /* Probe for an optimized variant */
> +    subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1);
> +
> +    /* If not set, assign generic lookup - this works with any miniflow */
> +    if (!subtable->lookup_func) {
> +        subtable->lookup_func = dpcls_subtable_lookup_generic;
> +    }
>
>      cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
>      /* Add the new subtable at the end of the pvector (with no hits yet) */
> diff --git a/lib/dpif-netdev.h b/lib/dpif-netdev.h
> index 9263256a9..123eabad7 100644
> --- a/lib/dpif-netdev.h
> +++ b/lib/dpif-netdev.h
> @@ -70,6 +70,14 @@ typedef uint32_t (*dpcls_subtable_lookup_func)(struct dpcls_subtable *subtable,
>                  const struct netdev_flow_key *keys[],
>                  struct dpcls_rule **rules);
>
> +/* Probe function to select a specialized version of the generic lookup
> + * implementation. This provides performance benefit due to compile-time
> + * optimizations such as loop-unrolling. These are enabled by the
> + * compile-time constants in the specific function implementations.
> + */
> +dpcls_subtable_lookup_func
> +dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t u1_bit_count);
> +
>  /* Prototype for generic lookup func, using same code path as before */
>  uint32_t
>  dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
> --
> 2.17.1
>
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev