[ovs-dev] [PATCH v11 5/5] dpif-netdev: Add specialized generic scalar functions

Ian Stokes ian.stokes at intel.com
Tue Jul 16 21:07:09 UTC 2019


On 7/15/2019 5:36 PM, Harry van Haaren wrote:
> This commit adds a number of specialized functions that handle
> common miniflow fingerprints. This enables compiler optimization,
> resulting in higher performance. Below is a quick description of
> how this optimization actually works:
> 
> "Specialized functions" are "instances" of the generic implementation,
> but the compiler is given extra context when compiling. In the case of
> iterating miniflow data structures, the most valuable piece of context
> for compile-time optimization is the loop trip count per unit.
> 
> In order to create a specialized function, there is a generic implementation,
> which uses a for() loop without the compiler knowing the loop trip count at
> compile time. The loop trip count is passed in as an argument to the function:
> 
> uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count)
> {
>      for(uint32_t i = 0; i < loop_count; i++)
>          // do work
> }
> 
> In order to "specialize" the function, we call the generic implementation
> with hard-coded numbers - these are compile time constants!
> 
> uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count)
> {
>      // use hard-coded constant for compile-time constant propagation
>      return miniflow_impl_generic(mf, 5);
> }
> 
> Given the compiler is aware of the loop trip count at compile time,
> it can perform an optimization known as "constant propagation". Combined
> with inlining of the miniflow_impl_generic() function, the compiler can
> now unroll the loop 5x at *compile time* and produce "flat" code.
> 
> The last step to using the specialized functions is to utilize a
> function-pointer to choose the specialized (or generic) implementation.
> The selection of the function pointer is performed at subtable creation
> time, when the miniflow fingerprint of the subtable is known. This technique
> is known as "multiple dispatch" in some literature, as it uses multiple
> items of information (miniflow bit counts) to select the dispatch function.
> 
> By pointing the function pointer at the optimized implementation, OvS
> benefits from the compile time optimizations at runtime.
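As an aside for readers following along: the specialization-plus-dispatch technique described above can be sketched standalone as below. This is a minimal illustrative example, not the actual dpcls code — the names sum_blocks_generic(), sum_blocks_5(), and probe() are hypothetical stand-ins for lookup_generic_impl(), the DECLARE_OPTIMIZED_LOOKUP_FUNCTION() instances, and dpcls_subtable_generic_probe().

```c
#include <stdint.h>

/* Generic implementation: the trip count is a runtime value, so the
 * compiler must emit a normal loop. */
static uint32_t
sum_blocks_generic(const uint32_t *blocks, uint32_t count)
{
    uint32_t total = 0;
    for (uint32_t i = 0; i < count; i++) {
        total += blocks[i];
    }
    return total;
}

/* Specialized instance: the count is a compile-time constant, so the
 * compiler can inline the generic function and fully unroll the loop. */
static uint32_t
sum_blocks_5(const uint32_t *blocks, uint32_t count)
{
    (void) count;  /* Count is fixed by the specialization. */
    return sum_blocks_generic(blocks, 5);
}

typedef uint32_t (*sum_func)(const uint32_t *, uint32_t);

/* Dispatch: at "subtable creation" time, pick the specialized variant
 * when the fingerprint matches, else fall back to the generic one. */
static sum_func
probe(uint32_t count)
{
    return count == 5 ? sum_blocks_5 : sum_blocks_generic;
}
```

The caller stores the returned pointer once and invokes it on the hot path, so the branch on the fingerprint is paid at setup time rather than per packet.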
> 
> Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> Tested-by: Malvika Gupta <malvika.gupta at arm.com>
> 

Thanks for the v11 Harry, some minor comments inline below.

> ---
> 
> v11:
> - Use MACROs to declare and check optimized functions (Ilya)
> - Use capital letter for commit message (Ilya)
> - Rebase onto latest patchset changes
> - Added NEWS entry for data-path subtable specialization (Ian/Harry)
> - Checkpatch notes an "incorrect bracketing" in the MACROs, however I
>    didn't find a solution that it likes.
> 
> v10:
> - Rebase changes from previous patches.
> - Remove "restrict" keyword as windows CI failed, see here for details:
>    https://ci.appveyor.com/project/istokes/ovs-q8fvv/builds/24398228
> 
> v8:
> - Rework to use blocks_cache from the dpcls instance, to avoid variable
>    length arrays in the data-path.
> ---
>   NEWS                             |  4 +++
>   lib/dpif-netdev-lookup-generic.c | 51 ++++++++++++++++++++++++++++++++
>   lib/dpif-netdev-private.h        |  8 +++++
>   lib/dpif-netdev.c                |  9 ++++--
>   4 files changed, 70 insertions(+), 2 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index 81130e667..4cfffb1bc 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -34,6 +34,10 @@ Post-v2.11.0
>        * 'ovs-appctl exit' now implies cleanup of non-internal ports in userspace
>          datapath regardless of '--cleanup' option. Use '--cleanup' to remove
>          internal ports too.
> +     * Datapath classifier code refactored to enable function pointers to select
> +       the lookup implementation at runtime. This enables specialization of
> +       specific subtables based on the miniflow attributes, enhancing the
> +       performance of the subtable search.
>      - OVSDB:
>        * OVSDB clients can now resynchronize with clustered servers much more
>          quickly after a brief disconnection, saving bandwidth and CPU time.
> diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
> index abd166fc3..259c36645 100644
> --- a/lib/dpif-netdev-lookup-generic.c
> +++ b/lib/dpif-netdev-lookup-generic.c
> @@ -233,7 +233,58 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
>                                 const struct netdev_flow_key *keys[],
>                                 struct dpcls_rule **rules)
>   {
> +    /* Here the runtime subtable->mf_bits counts are used, which forces the
> +     * compiler to iterate normal for() loops. Due to this limitation in the
> +     * compiler's available optimizations, this function has lower performance
> +     * than the specialized functions below.
> +     */
>       return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
>                                  subtable->mf_bits_set_unit0,
>                                  subtable->mf_bits_set_unit1);
>   }
> +
> +/* Expand out specialized functions with U0 and U1 bit attributes */

Minor, missing period at end of comment (can fix this on commit if there 
are no other revisions).

> +#define DECLARE_OPTIMIZED_LOOKUP_FUNCTION(U0, U1)                             \
> +    static uint32_t                                                           \
> +    dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1(                               \
> +                                         struct dpcls_subtable *subtable,     \
> +                                         uint64_t *blocks_scratch,            \
> +                                         uint32_t keys_map,                   \
> +                                         const struct netdev_flow_key *keys[],\
> +                                         struct dpcls_rule **rules)           \
> +    {                                                                         \
> +        return lookup_generic_impl(subtable, blocks_scratch, keys_map,        \
> +                                   keys, rules, U0, U1);                      \
> +    }                                                                         \
> +
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(5,1)
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,1)
> +DECLARE_OPTIMIZED_LOOKUP_FUNCTION(4,0)
> +
> +/* Check if a speicalized function is valid for the required subtable. */
Minor, speicalized -> specialized, again can be fixed on commit otherwise.

> +#define CHECK_LOOKUP_FUNCTION(U0,U1)                                          \
> +    if (!f && u0_bits == U0 && u1_bits == U1) {                               \
> +        f = dpcls_subtable_lookup_mf_u0w##U0##_u1w##U1;                       \
> +    }
> +
> +/* Probe function to look up an available specialized function.
> + * If able to handle the requested miniflow fingerprint, this function
> + * returns the most optimal implementation for that fingerprint.
> + * @retval FunctionAddress A valid function to handle the miniflow bit pattern
> + * @retval 0 The requested miniflow is not supported here, NULL is returned
> + */
> +dpcls_subtable_lookup_func
> +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
> +{
> +    dpcls_subtable_lookup_func f = NULL;

In the comments you return FunctionAddress but below you return f; should
this not be FunctionAddress, or maybe another variable name if
'FunctionAddress' is a bit unwieldy?

> +
> +    CHECK_LOOKUP_FUNCTION(5, 1);
> +    CHECK_LOOKUP_FUNCTION(4, 1);
> +    CHECK_LOOKUP_FUNCTION(4, 0);
> +
> +    if (f) {
> +        VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
> +                  u0_bits, u1_bits);
> +    }
> +    return f;
> +}
> diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h
> index e1ceaa641..f541343f4 100644
> --- a/lib/dpif-netdev-private.h
> +++ b/lib/dpif-netdev-private.h
> @@ -69,6 +69,14 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
>                                 const struct netdev_flow_key *keys[],
>                                 struct dpcls_rule **rules);
>   
> +/* Probe function to select a specialized version of the generic lookup
> + * implementation. This provides performance benefit due to compile-time
> + * optimizations such as loop-unrolling. These are enabled by the compile-time
> + * constants in the specific function implementations.
> + */
> +dpcls_subtable_lookup_func
> +dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t u1_bit_count);
> +
>   /* A set of rules that all have the same fields wildcarded. */
>   struct dpcls_subtable {
>       /* The fields are only used by writers. */
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 8acc1445a..996dec35e 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -7735,8 +7735,13 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
>           cls->blocks_scratch_size = blocks_required_per_pkt;
>       }
>   
> -    /* Assign the generic lookup - this works with any miniflow fingerprint. */
> -    subtable->lookup_func = dpcls_subtable_lookup_generic;
> +    /* Probe for a specialized generic lookup function. */
> +    subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1);
> +
> +    /* If not set, assign generic lookup. Generic works for any miniflow. */
> +    if (!subtable->lookup_func) {
> +        subtable->lookup_func = dpcls_subtable_lookup_generic;
> +    }
>   
>       cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
>       /* Add the new subtable at the end of the pvector (with no hits yet) */
Missing period above.
> 

Regards
Ian


