[ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic scalar functions

Van Haaren, Harry harry.van.haaren at intel.com
Thu Jun 27 12:06:42 UTC 2019


> -----Original Message-----
> From: Malvika Gupta [mailto:Malvika.Gupta at arm.com]
> Sent: Tuesday, June 25, 2019 8:19 PM
> To: Van Haaren, Harry <harry.van.haaren at intel.com>
> Cc: nd <nd at arm.com>; Yanqin Wei (Arm Technology China) <Yanqin.Wei at arm.com>;
> dev at openvswitch.org
> Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic
> scalar functions
> 
> Hi Harry,

Hi Malvika,


> I tested your patch on two ARM machines, namely ThunderX2 and Octeon-Tx.
> 
> On Octeon-Tx (running gcc version 7.4.0), a performance improvement of 9-10%
> was seen. On the ThunderX2 (running gcc version 8.3.0), a performance
> improvement of 14-15% was observed. For both machines, I disabled the EMC
> and tested with two traffic patterns of 64B packet size, specifically
> UDP/IPv4/Ethernet and IPv4/Ethernet (Raw IP).

Great - thanks for testing and reporting back performance, appreciated!


> Please let me know if you require any more details.
> Thanks,
> Malvika

It would be good to know which miniflow fingerprints commonly occur in OVS deployments.

If we (the OVS community) had more input on the benchmarks that OVS deployments
use, then we could optimize against those more (as was mentioned at OVS Conf '18 too :)
https://youtu.be/5-MDlpUIOBE?list=PLaJlRa-xItwCzuAL3mP6n02vmXab4Bwu-&t=1471

I haven't had much success with getting benchmarks yet - but if you have something
like a use-case or benchmark to test against, sharing it with the OVS community would be great!

-Harry

 
> > -----Original Message-----
> > From: Malvika Gupta
> > Sent: Wednesday, June 19, 2019 1:34 PM
> > To: Van Haaren, Harry <harry.van.haaren at intel.com>
> > Cc: Yanqin Wei (Arm Technology China) <Yanqin.Wei at arm.com>; nd
> > <nd at arm.com>; dev at openvswitch.org
> > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic
> > scalar functions
> >
> > Hi Harry
> >
> > Thanks for your reply. I have a few last things I wanted to clarify.
> > Please see my inline comments below.
> >
> > > -----Original Message-----
> > > From: Van Haaren, Harry <harry.van.haaren at intel.com>
> > > Sent: Wednesday, June 19, 2019 11:22 AM
> > > To: Malvika Gupta <Malvika.Gupta at arm.com>
> > > Cc: Yanqin Wei (Arm Technology China) <Yanqin.Wei at arm.com>; nd
> > > <nd at arm.com>; dev at openvswitch.org
> > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized
> > > generic scalar functions
> > >
> > > > -----Original Message-----
> > > > From: Malvika Gupta [mailto:Malvika.Gupta at arm.com]
> > > > Sent: Wednesday, June 19, 2019 5:17 PM
> > > > To: Van Haaren, Harry <harry.van.haaren at intel.com>
> > > > Cc: Yanqin Wei (Arm Technology China) <Yanqin.Wei at arm.com>; nd
> > > > <nd at arm.com>; dev at openvswitch.org
> > > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized
> > > > generic scalar functions
> > > >
> > > > Hi Harry,
> > >
> > > Hi Malvika,
> > >
> > >
> > > > I would like to test your patch on ARM platforms and report any
> > > > performance improvement. Can you please tell me the test vectors you
> > > > used to match the subtables that call the specialized lookup
> > > > functions, i.e., dpcls_subtable_lookup_mf_u0w5_u1w1,
> > > > dpcls_subtable_lookup_mf_u0w4_u1w1, and
> > > > dpcls_subtable_lookup_mf_u0w4_u1w0? And how did you make that
> > > > distinction with traffic patterns, i.e., which traffic pattern would
> > > > match with which subtable lookup?
> > >
> > > I was just running Eth/IPv4/UDP data, and using OvS to do Phy to Phy
> > > forwarding.
> > >
> > > ./utilities/ovs-ofctl add-flow br0 in_port=1,action=output:2
> > > ./utilities/ovs-ofctl add-flow br0 in_port=2,action=output:1
> > >
> > > This is an extremely simple setup; however, it is enough to showcase the
> > > value of the patches here.
> > >
> > > By changing traffic to Eth/IPv4 and Eth alone, you can change the # of
> > > miniflow items, and hit the different u0wX_u1wY cases :)
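> > >
> > > For reference, the u0/u1 "word counts" in those function names are just the
> > > number of set bits in each 64-bit unit of the subtable mask's miniflow map.
> > > A tiny standalone sketch of that idea (simplified, not the actual OVS code):
> > >
> > > #include <stdint.h>
> > > #include <stdio.h>
> > >
> > > /* Portable stand-in for a popcount helper such as OVS's count_1bits(). */
> > > static uint32_t popcount64(uint64_t x)
> > > {
> > >     uint32_t n = 0;
> > >     for (; x; x &= x - 1) {   /* clears the lowest set bit each pass */
> > >         n++;
> > >     }
> > >     return n;
> > > }
> > >
> > > int main(void)
> > > {
> > >     /* Hypothetical miniflow map units for one subtable mask. */
> > >     uint64_t unit0_map = 0x1f;   /* 5 bits set -> u0w5 */
> > >     uint64_t unit1_map = 0x01;   /* 1 bit set  -> u1w1 */
> > >
> > >     printf("fingerprint: u0w%u_u1w%u\n",
> > >            popcount64(unit0_map), popcount64(unit1_map));
> > >     return 0;
> > > }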
> > >
> >
> > 1. Are you also sending combinations of traffic flows at the same time?
> > For example, UDP/IPv4/Eth is one traffic flow, raw IP could be another, and
> > Eth-only packets a third. Did you also test by, say, sending these 3 flows
> > together, which would all hit different subtables, or just one kind of
> > traffic flow at a time?
> > 2. Just to clarify, packet size is 64B, right?
> >
> > I am sorry if these seem trivial but I just wanted to have a clear
> > understanding of the testing environment.
> > Thanks!
> >
> > > If you have suggestions for other subtables to special case, please
> > > post the u0_wX_u1_wY values that you'd like to see, and I'll add them
> > > to the patchset.
> > >
> > >
> > > > Thank you,
> > > > Malvika Gupta
> > >
> > > Testing welcomed, thanks! -Harry
> > >
> > > > > -----Original Message-----
> > > > > From: ovs-dev-bounces at openvswitch.org <ovs-dev-
> > > > > bounces at openvswitch.org> On Behalf Of Harry van Haaren
> > > > > Sent: Wednesday, May 8, 2019 11:13 PM
> > > > > To: ovs-dev at openvswitch.org
> > > > > Cc: i.maximets at samsung.com
> > > > > Subject: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic
> > > > > scalar functions
> > > > >
> > > > > This commit adds a number of specialized functions that handle
> > > > > common miniflow fingerprints. This enables compiler optimization,
> > > > > resulting in higher performance. Below is a quick description of
> > > > > how this optimization actually works:
> > > > >
> > > > > "Specialized functions" are "instances" of the generic
> > > > > implementation, but the compiler is given extra context when
> > > > > compiling. In the case of
> > > > iterating
> > > > > miniflow datastructures, the most interesting value to enable
> > > > > compile time optimizations is the loop trip count per unit.
> > > > >
> > > > > In order to create a specialized function, there is a generic
> > > > > implementation, which uses a for() loop without the compiler knowing
> > > > > the loop trip count at compile time. The loop trip count is passed in
> > > > > as an argument to the function:
> > > > >
> > > > > uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t loop_count)
> > > > > {
> > > > >     for (uint32_t i = 0; i < loop_count; i++)
> > > > >         // do work
> > > > > }
> > > > >
> > > > > In order to "specialize" the function, we call the generic
> > > > > implementation
> > > > with
> > > > > hard-coded numbers - these are compile time constants!
> > > > >
> > > > > uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t
> > > > > loop_count)
> > > {
> > > > >     // use hard coded constant for compile-time constant-propogation
> > > > >     return miniflow_impl_generic(mf, 5); }
> > > > >
> > > > > Given the compiler is aware of the loop trip count at compile time,
> > > > > it can perform an optimization known as "constant propagation".
> > > > > Combined with inlining of the miniflow_impl_generic() function, the
> > > > > compiler is now able to unroll the loop 5x *at compile time* and
> > > > > produce "flat" code.
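> > > > >
> > > > > (Illustration only - roughly what the compiler "sees" after inlining and
> > > > > constant propagation, with the loop already unrolled:)
> > > > >
> > > > > uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t loop_count)
> > > > > {
> > > > >     // do work (i == 0)
> > > > >     // do work (i == 1)
> > > > >     // do work (i == 2)
> > > > >     // do work (i == 3)
> > > > >     // do work (i == 4)
> > > > > }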
> > > > >
> > > > > The last step to using the specialized functions is to utilize a
> > > > > function pointer to choose the specialized (or generic)
> > > > > implementation. The selection of the function pointer is performed
> > > > > at subtable creation time, when the miniflow fingerprint of the
> > > > > subtable is known. This technique is known as "multiple dispatch"
> > > > > in some literature, as it uses multiple items of information
> > > > > (miniflow bit counts) to select the dispatch function.
> > > > >
> > > > > By pointing the function pointer at the optimized implementation,
> > > > > OvS benefits from the compile-time optimizations at runtime.
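> > > > >
> > > > > For anyone who wants to experiment with the pattern in isolation, below
> > > > > is a minimal, self-contained sketch of the specialize-then-probe dispatch
> > > > > described above (simplified types and names, not the actual OVS code):
> > > > >
> > > > > #include <stdint.h>
> > > > > #include <stdio.h>
> > > > >
> > > > > typedef uint32_t (*lookup_func)(const uint64_t *blocks);
> > > > >
> > > > > /* Generic implementation: the trip counts are runtime values, so the
> > > > >  * compiler emits ordinary loops. */
> > > > > static inline uint32_t
> > > > > lookup_generic_impl(const uint64_t *blocks, uint32_t u0_count,
> > > > >                     uint32_t u1_count)
> > > > > {
> > > > >     uint64_t acc = 0;
> > > > >     for (uint32_t i = 0; i < u0_count + u1_count; i++) {
> > > > >         acc ^= blocks[i];    /* stand-in for the real mask/compare work */
> > > > >     }
> > > > >     return (uint32_t) acc;
> > > > > }
> > > > >
> > > > > /* "Specialized" instances: identical code, but the counts are compile-time
> > > > >  * constants, so inlining + constant propagation lets the compiler unroll. */
> > > > > static uint32_t lookup_u0w5_u1w1(const uint64_t *blocks)
> > > > > {
> > > > >     return lookup_generic_impl(blocks, 5, 1);
> > > > > }
> > > > >
> > > > > static uint32_t lookup_u0w4_u1w0(const uint64_t *blocks)
> > > > > {
> > > > >     return lookup_generic_impl(blocks, 4, 0);
> > > > > }
> > > > >
> > > > > /* Probe: return the best implementation for a fingerprint, or NULL. */
> > > > > static lookup_func probe(uint32_t u0_bits, uint32_t u1_bits)
> > > > > {
> > > > >     if (u0_bits == 5 && u1_bits == 1) {
> > > > >         return lookup_u0w5_u1w1;
> > > > >     } else if (u0_bits == 4 && u1_bits == 0) {
> > > > >         return lookup_u0w4_u1w0;
> > > > >     }
> > > > >     return NULL;    /* caller falls back to the generic version */
> > > > > }
> > > > >
> > > > > int main(void)
> > > > > {
> > > > >     uint64_t blocks[6] = { 1, 2, 3, 4, 5, 6 };
> > > > >     lookup_func f = probe(5, 1);
> > > > >
> > > > >     uint32_t result = f ? f(blocks) : lookup_generic_impl(blocks, 5, 1);
> > > > >     printf("lookup result: %u\n", result);
> > > > >     return 0;
> > > > > }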
> > > > >
> > > > > Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> > > > >
> > > > > ---
> > > > >
> > > > > v8:
> > > > > - Rework to use blocks_cache from the dpcls instance, to avoid
> > > > >   variable length arrays in the data-path.
> > > > > ---
> > > > >  lib/dpif-netdev-lookup-generic.c | 83 +++++++++++++++++++++++++++++---
> > > > >  lib/dpif-netdev.c                |  9 +++-
> > > > >  lib/dpif-netdev.h                |  8 +++
> > > > >  3 files changed, 90 insertions(+), 10 deletions(-)
> > > > >
> > > > > diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
> > > > > index 901d28ff0..ba2d024cc 100644
> > > > > --- a/lib/dpif-netdev-lookup-generic.c
> > > > > +++ b/lib/dpif-netdev-lookup-generic.c
> > > > > @@ -100,11 +100,11 @@ netdev_flow_key_flatten(const struct netdev_flow_key * restrict key,
> > > > >
> > > > >      /* Unit 0 flattening */
> > > > >      netdev_flow_key_flatten_unit(&pkt_blocks[0],
> > > > > -                            &tbl_blocks[0],
> > > > > -                            &mf_masks[0],
> > > > > -                            &blocks_scratch[0],
> > > > > -                            pkt_bits_u0,
> > > > > -                            u0_count);
> > > > > +                                 &tbl_blocks[0],
> > > > > +                                 &mf_masks[0],
> > > > > +                                 &blocks_scratch[0],
> > > > > +                                 pkt_bits_u0,
> > > > > +                                 u0_count);
> > > > >
> > > > >      /* Unit 1 flattening:
> > > > >       * Move the pointers forward in the arrays based on u0 offsets, NOTE:
> > > > > @@ -225,7 +225,74 @@ dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
> > > > >                                const struct netdev_flow_key *keys[],
> > > > >                                struct dpcls_rule **rules)
> > > > >  {
> > > > > -        return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys,
> > > > > -                                   rules, subtable->mf_bits_set_unit0,
> > > > > -                                   subtable->mf_bits_set_unit1);
> > > > > +    /* Here the runtime subtable->mf_bits counts are used, which forces the
> > > > > +     * compiler to iterate normal for() loops. Due to this limitation in the
> > > > > +     * compilers available optimizations, this function has lower performance
> > > > > +     * than the below specialized functions.
> > > > > +     */
> > > > > +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> > > > > +                               subtable->mf_bits_set_unit0,
> > > > > +                               subtable->mf_bits_set_unit1);
> > > > > +}
> > > > > +
> > > > > +static uint32_t
> > > > > +dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable *subtable,
> > > > > +                                   uint64_t *blocks_scratch,
> > > > > +                                   uint32_t keys_map,
> > > > > +                                   const struct netdev_flow_key *keys[],
> > > > > +                                   struct dpcls_rule **rules)
> > > > > +{
> > > > > +    /* hard coded bit counts - enables compile time loop unrolling, and
> > > > > +     * generating of optimized code-sequences due to loop unrolled code.
> > > > > +     */
> > > > > +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> > > > > +                               5, 1);
> > > > > +}
> > > > > +
> > > > > +static uint32_t
> > > > > +dpcls_subtable_lookup_mf_u0w4_u1w1(struct dpcls_subtable *subtable,
> > > > > +                                   uint64_t *blocks_scratch,
> > > > > +                                   uint32_t keys_map,
> > > > > +                                   const struct netdev_flow_key *keys[],
> > > > > +                                   struct dpcls_rule **rules)
> > > > > +{
> > > > > +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> > > > > +                               4, 1);
> > > > > +}
> > > > > +
> > > > > +static uint32_t
> > > > > +dpcls_subtable_lookup_mf_u0w4_u1w0(struct dpcls_subtable *subtable,
> > > > > +                                   uint64_t *blocks_scratch,
> > > > > +                                   uint32_t keys_map,
> > > > > +                                   const struct netdev_flow_key *keys[],
> > > > > +                                   struct dpcls_rule **rules)
> > > > > +{
> > > > > +    return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, rules,
> > > > > +                               4, 0);
> > > > > +}
> > > > > +
> > > > > +/* Probe function to lookup an available specialized function.
> > > > > + * If capable to run the requested miniflow fingerprint, this function returns
> > > > > + * the most optimal implementation for that miniflow fingerprint.
> > > > > + * @retval FunctionAddress A valid function to handle the miniflow bit pattern
> > > > > + * @retval 0 The requested miniflow is not supported here, NULL is returned
> > > > > + */
> > > > > +dpcls_subtable_lookup_func
> > > > > +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
> > > > > +{
> > > > > +    dpcls_subtable_lookup_func f = NULL;
> > > > > +    dpcls_subtable_lookup_func f = NULL;
> > > > > +
> > > > > +    if (u0_bits == 5 && u1_bits == 1) {
> > > > > +        f = dpcls_subtable_lookup_mf_u0w5_u1w1;
> > > > > +    } else if (u0_bits == 4 && u1_bits == 1) {
> > > > > +        f = dpcls_subtable_lookup_mf_u0w4_u1w1;
> > > > > +    } else if (u0_bits == 4 && u1_bits == 0) {
> > > > > +        f = dpcls_subtable_lookup_mf_u0w4_u1w0;
> > > > > +    }
> > > > > +
> > > > > +    if (f) {
> > > > > +        VLOG_INFO("Subtable using Generic Optimized for u0 %d, u1 %d\n",
> > > > > +                  u0_bits, u1_bits);
> > > > > +    }
> > > > > +    return f;
> > > > >  }
> > > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > > > > index 33b93cfdf..23fc5b7a6 100644
> > > > > --- a/lib/dpif-netdev.c
> > > > > +++ b/lib/dpif-netdev.c
> > > > > @@ -7623,8 +7623,13 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
> > > > >          cls->blocks_scratch_size = blocks_required_per_pkt;
> > > > >      }
> > > > >
> > > > > -    /* Assign the generic lookup - this works with any miniflow fingerprint */
> > > > > -    subtable->lookup_func = dpcls_subtable_lookup_generic;
> > > > > +    /* Probe for an optimized variant */
> > > > > +    subtable->lookup_func = dpcls_subtable_generic_probe(unit0, unit1);
> > > > > +
> > > > > +    /* If not set, assign generic lookup - this works with any miniflow */
> > > > > +    if (!subtable->lookup_func) {
> > > > > +        subtable->lookup_func = dpcls_subtable_lookup_generic;
> > > > > +    }
> > > > >
> > > > >      cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
> > > > >      /* Add the new subtable at the end of the pvector (with no hits yet) */
> > > > > diff --git a/lib/dpif-netdev.h b/lib/dpif-netdev.h
> > > > > index 9263256a9..123eabad7 100644
> > > > > --- a/lib/dpif-netdev.h
> > > > > +++ b/lib/dpif-netdev.h
> > > > > @@ -70,6 +70,14 @@ typedef uint32_t (*dpcls_subtable_lookup_func)(struct dpcls_subtable *subtable,
> > > > >                  const struct netdev_flow_key *keys[],
> > > > >                  struct dpcls_rule **rules);
> > > > >
> > > > > +/* Probe function to select a specialized version of the generic lookup
> > > > > + * implementation. This provides performance benefit due to compile-time
> > > > > + * optimizations such as loop-unrolling. These are enabled by the compile-time
> > > > > + * constants in the specific function implementations.
> > > > > + */
> > > > > +dpcls_subtable_lookup_func
> > > > > +dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t u1_bit_count);
> > > > > +
> > > > >  /* Prototype for generic lookup func, using same code path as before */
> > > > >  uint32_t
> > > > >  dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable,
> > > > > --
> > > > > 2.17.1
> > > > >
> > > > > _______________________________________________
> > > > > dev mailing list
> > > > > dev at openvswitch.org
> > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev

