[ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic scalar functions

Van Haaren, Harry harry.van.haaren at intel.com
Wed Jun 19 16:22:29 UTC 2019


> -----Original Message-----
> From: Malvika Gupta [mailto:Malvika.Gupta at arm.com]
> Sent: Wednesday, June 19, 2019 5:17 PM
> To: Van Haaren, Harry <harry.van.haaren at intel.com>
> Cc: Yanqin Wei (Arm Technology China) <Yanqin.Wei at arm.com>; nd <nd at arm.com>;
> dev at openvswitch.org
> Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic
> scalar functions
> 
> Hi Harry,

Hi Malvika,


> I would like to test your patch on ARM platforms and report any performance
> improvement. Can you please tell me the test vectors you used to match the
> subtables that call the specialized lookup functions, i.e.,
> dpcls_subtable_lookup_mf_u0w5_u1w1, dpcls_subtable_lookup_mf_u0w4_u1w1, and
> dpcls_subtable_lookup_mf_u0w4_u1w0? And how did you make that distinction
> with traffic patterns, i.e., which traffic pattern would match with which
> subtable lookup?

I was just running Eth/IPv4/UDP data, and using OvS to do Phy to Phy forwarding.

./utilities/ovs-ofctl add-flow br0 in_port=1,action=output:2
./utilities/ovs-ofctl add-flow br0 in_port=2,action=output:1

This is an extremely simple setup; however, it is enough to showcase the value
of the patches here.

By changing the traffic to Eth/IPv4 and plain Eth, you can change the number of
miniflow items, and hit the different u0_wX_u1_wY cases :)
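
To confirm which case a given traffic pattern hits, you can watch for the INFO
log that the probe function prints when a subtable selects a specialized
variant, e.g. (assuming the default log location - adjust the path for your
setup):

grep "Subtable using Generic Optimized" /var/log/openvswitch/ovs-vswitchd.log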

If you have suggestions for other subtables to special case, please post the
u0_wX_u1_wY values that you'd like to see, and I'll add them to the patchset.
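
For reference, a new specialization is just a thin wrapper plus a probe branch.
A minimal sketch below, following the pattern in the patch; the u0w5_u1w0 bit
counts here are only an illustration, not a measured workload:

static uint32_t
dpcls_subtable_lookup_mf_u0w5_u1w0(struct dpcls_subtable *subtable,
                                   uint64_t *blocks_scratch,
                                   uint32_t keys_map,
                                   const struct netdev_flow_key *keys[],
                                   struct dpcls_rule **rules)
{
    /* Hard-coded bit counts: 5 blocks in unit 0, 0 blocks in unit 1. */
    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
                               5, 0);
}

and the matching branch in dpcls_subtable_generic_probe():

    } else if (u0_bits == 5 && u1_bits == 0) {
        f = dpcls_subtable_lookup_mf_u0w5_u1w0;
    }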


> Thank you,
> Malvika Gupta

Testing welcomed, thanks! -Harry

> > -----Original Message-----
> > From: ovs-dev-bounces at openvswitch.org <ovs-dev-
> > bounces at openvswitch.org> On Behalf Of Harry van Haaren
> > Sent: Wednesday, May 8, 2019 11:13 PM
> > To: ovs-dev at openvswitch.org
> > Cc: i.maximets at samsung.com
> > Subject: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic
> > scalar functions
> >
> > This commit adds a number of specialized functions that handle common
> > miniflow fingerprints. This enables compiler optimization, resulting in
> > higher performance. Below is a quick description of how this optimization
> > actually works:
> >
> > "Specialized functions" are "instances" of the generic implementation, but
> > the compiler is given extra context when compiling. In the case of
> iterating
> > miniflow datastructures, the most interesting value to enable compile time
> > optimizations is the loop trip count per unit.
> >
> > In order to create a specialized function, there is a generic
> implementation,
> > which uses a for() loop without the compiler knowing the loop trip count
> at
> > compile time. The loop trip count is passed in as an argument to the
> function:
> >
> > uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count) {
> >     for(uint32_t i = 0; i < loop_count; i++)
> >         // do work
> > }
> >
> > In order to "specialize" the function, we call the generic implementation
> > with hard-coded numbers - these are compile time constants!
> >
> > uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count) {
> >     // use hard coded constant for compile-time constant-propagation
> >     return miniflow_impl_generic(mf, 5);
> > }
> >
> > Given the compiler is aware of the loop trip count at compile time, it can
> > perform an optimization known as "constant propagation". Combined with
> > inlining of the miniflow_impl_generic() function, the compiler is now
> > enabled to unroll the loop 5x at *compile time*, and produce "flat" code.
> >
> > The last step to using the specialized functions is to utilize a function
> > pointer to choose the specialized (or generic) implementation.
> > The selection of the function pointer is performed at subtable creation
> > time, when the miniflow fingerprint of the subtable is known. This
> > technique is known as "multiple dispatch" in some literature, as it uses
> > multiple items of information (miniflow bit counts) to select the dispatch
> > function.
> >
> > By pointing the function pointer at the optimized implementation, OvS
> > benefits from the compile time optimizations at runtime.
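> >
> > A minimal sketch of that selection step (simplified from the
> > dpcls_create_subtable() hunk in the diff below):
> >
> >     subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1);
> >     if (!subtable->lookup_func) {
> >         /* No specialized version available - fall back to the generic
> >          * lookup, which handles any miniflow fingerprint. */
> >         subtable->lookup_func = dpcls_subtable_lookup_generic;
> >     }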
> >
> > Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> >
> > ---
> >
> > v8:
> > - Rework to use blocks_cache from the dpcls instance, to avoid variable
> >   length arrays in the data-path.
> > ---
> >  lib/dpif-netdev-lookup-generic.c | 83 +++++++++++++++++++++++++++++---
> >  lib/dpif-netdev.c                |  9 +++-
> >  lib/dpif-netdev.h                |  8 +++
> >  3 files changed, 90 insertions(+), 10 deletions(-)
> >
> > diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
> > index 901d28ff0..ba2d024cc 100644
> > --- a/lib/dpif-netdev-lookup-generic.c
> > +++ b/lib/dpif-netdev-lookup-generic.c
> > @@ -100,11 +100,11 @@ netdev_flow_key_flatten(const struct netdev_flow_key * restrict key,
> >
> >      /* Unit 0 flattening */
> >      netdev_flow_key_flatten_unit(&pkt_blocks[0],
> > -                            &tbl_blocks[0],
> > -                            &mf_masks[0],
> > -                            &blocks_scratch[0],
> > -                            pkt_bits_u0,
> > -                            u0_count);
> > +                                 &tbl_blocks[0],
> > +                                 &mf_masks[0],
> > +                                 &blocks_scratch[0],
> > +                                 pkt_bits_u0,
> > +                                 u0_count);
> >
> >      /* Unit 1 flattening:
> >       * Move the pointers forward in the arrays based on u0 offsets, NOTE:
> > @@ -225,7 +225,74 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
> >                                const struct netdev_flow_key *keys[],
> >                                struct dpcls_rule **rules)
> >  {
> > -        return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys,
> > -                                   rules, subtable->mf_bits_set_unit0,
> > -                                   subtable->mf_bits_set_unit1);
> > +    /* Here the runtime subtable->mf_bits counts are used, which forces the
> > +     * compiler to iterate normal for() loops. Due to this limitation in the
> > +     * compiler's available optimizations, this function has lower performance
> > +     * than the below specialized functions.
> > +     */
> > +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> > +                               subtable->mf_bits_set_unit0,
> > +                               subtable->mf_bits_set_unit1);
> > +}
> > +
> > +static uint32_t
> > +dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable *subtable,
> > +                                   uint64_t *blocks_scratch,
> > +                                   uint32_t keys_map,
> > +                                   const struct netdev_flow_key *keys[],
> > +                                   struct dpcls_rule **rules)
> > +{
> > +    /* Hard coded bit counts - enables compile time loop unrolling, and
> > +     * generation of optimized code-sequences due to loop unrolled code.
> > +     */
> > +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> > +                               5, 1);
> > +}
> > +
> > +static uint32_t
> > +dpcls_subtable_lookup_mf_u0w4_u1w1(struct dpcls_subtable *subtable,
> > +                                   uint64_t *blocks_scratch,
> > +                                   uint32_t keys_map,
> > +                                   const struct netdev_flow_key *keys[],
> > +                                   struct dpcls_rule **rules)
> > +{
> > +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> > +                               4, 1);
> > +}
> > +
> > +static uint32_t
> > +dpcls_subtable_lookup_mf_u0w4_u1w0(struct dpcls_subtable *subtable,
> > +                                   uint64_t *blocks_scratch,
> > +                                   uint32_t keys_map,
> > +                                   const struct netdev_flow_key *keys[],
> > +                                   struct dpcls_rule **rules)
> > +{
> > +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> > +                               4, 0);
> > +}
> > +
> > +/* Probe function to lookup an available specialized function.
> > + * If capable to run the requested miniflow fingerprint, this function
> > + * returns the most optimal implementation for that miniflow fingerprint.
> > + * @retval FunctionAddress A valid function to handle the miniflow bit
> > + * pattern
> > + * @retval 0 The requested miniflow is not supported here, NULL is returned
> > + */
> > +dpcls_subtable_lookup_func
> > +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
> > +{
> > +    dpcls_subtable_lookup_func f = NULL;
> > +
> > +    if (u0_bits == 5 && u1_bits == 1) {
> > +        f = dpcls_subtable_lookup_mf_u0w5_u1w1;
> > +    } else if (u0_bits == 4 && u1_bits == 1) {
> > +        f = dpcls_subtable_lookup_mf_u0w4_u1w1;
> > +    } else if (u0_bits == 4 && u1_bits == 0) {
> > +        f = dpcls_subtable_lookup_mf_u0w4_u1w0;
> > +    }
> > +
> > +    if (f) {
> > +        VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
> > +                  u0_bits, u1_bits);
> > +    }
> > +    return f;
> >  }
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 33b93cfdf..23fc5b7a6 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -7623,8 +7623,13 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
> >          cls->blocks_scratch_size = blocks_required_per_pkt;
> >      }
> >
> > -    /* Assign the generic lookup - this works with any miniflow fingerprint */
> > -    subtable->lookup_func = dpcls_subtable_lookup_generic;
> > +    /* Probe for an optimized variant */
> > +    subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1);
> > +
> > +    /* If not set, assign generic lookup - this works with any miniflow */
> > +    if (!subtable->lookup_func) {
> > +        subtable->lookup_func = dpcls_subtable_lookup_generic;
> > +    }
> >
> >      cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
> >      /* Add the new subtable at the end of the pvector (with no hits yet) */
> > diff --git a/lib/dpif-netdev.h b/lib/dpif-netdev.h
> > index 9263256a9..123eabad7 100644
> > --- a/lib/dpif-netdev.h
> > +++ b/lib/dpif-netdev.h
> > @@ -70,6 +70,14 @@ typedef uint32_t (*dpcls_subtable_lookup_func)(struct dpcls_subtable *subtable,
> >                  const struct netdev_flow_key *keys[],
> >                  struct dpcls_rule **rules);
> >
> > +/* Probe function to select a specialized version of the generic lookup
> > + * implementation. This provides performance benefit due to compile-time
> > + * optimizations such as loop-unrolling. These are enabled by the
> > + * compile-time constants in the specific function implementations.
> > + */
> > +dpcls_subtable_lookup_func
> > +dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t u1_bit_count);
> > +
> >  /* Prototype for generic lookup func, using same code path as before */
> >  uint32_t
> >  dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
> > --
> > 2.17.1
> >

