[ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic scalar functions

Thu Jun 27 12:25:47 UTC 2019

Sorry for double post - if you're happy with the changes please add a Tested-by: tag?

> -----Original Message-----
> From: Van Haaren, Harry
> Sent: Thursday, June 27, 2019 1:07 PM
> To: 'Malvika Gupta' <Malvika.Gupta at arm.com>
> Cc: nd <nd at arm.com>; Yanqin Wei (Arm Technology China) <Yanqin.Wei at arm.com>;
> dev at openvswitch.org
> Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic
> scalar functions
> 
> > -----Original Message-----
> > From: Malvika Gupta [mailto:Malvika.Gupta at arm.com]
> > Sent: Tuesday, June 25, 2019 8:19 PM
> > To: Van Haaren, Harry <harry.van.haaren at intel.com>
> > Cc: nd <nd at arm.com>; Yanqin Wei (Arm Technology China)
> <Yanqin.Wei at arm.com>;
> > dev at openvswitch.org
> > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic
> > scalar functions
> >
> > Hi Harry,
> 
> Hi Malvika,
> 
> 
> > I tested your patch on two ARM machines, namely ThunderX2 and Octeon-Tx.
> >
> > On Octeon-Tx (running gcc version 7.4.0), a performance improvement of 9-
> 10%
> > was seen. On the ThunderX2 (running gcc version 8.3.0), a performance
> > improvement of 14-15% was observed. For both machines, I disabled the EMC
> > and tested with two traffic patterns of 64B packet size, specifically
> > UDP/IPv4/Ethernet and IPv4/Ethernet (Raw IP).
> 
> Great - thanks for testing and reporting back performance, appreciated!
> 
> 
> > Please let me know if you require any more details.
> > Thanks,
> > Malvika
> 
> It would be good to know which miniflow fingerprints are commonly deployed
> with OVS.
> 
> If we (the OVS community) had more input in terms of benchmarks that OVS
> deployments
> use, then we can optimize against those more (as was mentioned at OVS Conf
> '18 too :)
> https://youtu.be/5-MDlpUIOBE?list=PLaJlRa-xItwCzuAL3mP6n02vmXab4Bwu-&t=1471
> 
> I haven't had much success with getting benchmarks yet - but if you have
> something
> like a use-case or benchmark to test against, sharing it with OVS Community
> would be great!
> 
> -Harry
> 
> 
> > > -----Original Message-----
> > > From: Malvika Gupta
> > > Sent: Wednesday, June 19, 2019 1:34 PM
> > > To: Van Haaren, Harry <harry.van.haaren at intel.com>
> > > Cc: Yanqin Wei (Arm Technology China) <Yanqin.Wei at arm.com>; nd
> > > <nd at arm.com>; dev at openvswitch.org
> > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized
> generic
> > > scalar functions
> > >
> > > Hi Harry
> > >
> > > Thanks for your reply. I have some last few things I wanted to clarify.
> > Please
> > > see my inline comments below.
> > >
> > > > -----Original Message-----
> > > > From: Van Haaren, Harry <harry.van.haaren at intel.com>
> > > > Sent: Wednesday, June 19, 2019 11:22 AM
> > > > To: Malvika Gupta <Malvika.Gupta at arm.com>
> > > > Cc: Yanqin Wei (Arm Technology China) <Yanqin.Wei at arm.com>; nd
> > > > <nd at arm.com>; dev at openvswitch.org
> > > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized
> > > > generic scalar functions
> > > >
> > > > > -----Original Message-----
> > > > > From: Malvika Gupta [mailto:Malvika.Gupta at arm.com]
> > > > > Sent: Wednesday, June 19, 2019 5:17 PM
> > > > > To: Van Haaren, Harry <harry.van.haaren at intel.com>
> > > > > Cc: Yanqin Wei (Arm Technology China) <Yanqin.Wei at arm.com>; nd
> > > > > <nd at arm.com>; dev at openvswitch.org
> > > > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized
> > > > > generic scalar functions
> > > > >
> > > > > Hi Harry,
> > > >
> > > > Hi Malvika,
> > > >
> > > >
> > > > > I would like to test your patch on ARM platforms and report any
> > > > > performance improvement. Can you please tell me the test vectors you
> > > > > used to match the subtables that call the specialized lookup
> > > > > functions, i.e., dpcls_subtable_lookup_mf_u0w5_u1w1,
> > > > > dpcls_subtable_lookup_mf_u0w4_u1w1, and
> > > > > dpcls_subtable_lookup_mf_u0w4_u1w0? And how did you make that
> > > > > distinction with traffic patterns, i.e., which traffic pattern would
> > > > > match with
> > > > which subtable lookup?
> > > >
> > > > I was just running Eth/IPv4/UDP data, and using OvS to do Phy to Phy
> > > > forwarding.
> > > >
> > > > ./utilities/ovs-ofctl add-flow br0 in_port=1,action=output:2
> > > > ./utilities/ovs- ofctl add-flow br0 in_port=2,action=output:1
> > > >
> > > > This is an extremely simple setup, however it is enough to showcase
> > > > value of the patches here.
> > > >
> > > > By changing traffic to Eth/IPv4 and Eth alone, you can change the # of
> > > > miniflow items, and hit the different u0_w5_u1_w0 cases :)
> > > >
> > >
> > > 1. Are you also sending combinations of traffic flows at the same time?
> > Like
> > > UDP/IPv4/Eth is one traffic flow, raw IP could be another and Eth
> packets
> > > only is a third flow. Did you also test by say, sending these 3 flows
> > together
> > > which would all hit different subtables or just one kind of traffic flow
> > at a
> > > time?
> > > 2. Just to clarify, packet size is 64B, right?
> > >
> > > I am sorry if these seem trivial but I just wanted to have a clear
> > > understanding of the testing environment.
> > > Thanks!
> > >
> > > > If you have suggestions for other subtables to special case, please
> > > > post the u0_wX_u1_wY values that you'd like to see, and I'll add them
> to
> > > the patchset.
> > > >
> > > >
> > > > > Thank you,
> > > > > Malvika Gupta
> > > >
> > > > Testing welcomed, thanks! -Harry
> > > >
> > > > > > -----Original Message-----
> > > > > > From: ovs-dev-bounces at openvswitch.org <ovs-dev-
> > > > > > bounces at openvswitch.org> On Behalf Of Harry van Haaren
> > > > > > Sent: Wednesday, May 8, 2019 11:13 PM
> > > > > > To: ovs-dev at openvswitch.org
> > > > > > Cc: i.maximets at samsung.com
> > > > > > Subject: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized
> > > > > > generic
> > > > > scalar
> > > > > > functions
> > > > > >
> > > > > > This commit adds a number of specialized functions, that handle
> > > > > > common miniflow fingerprints. This enables compiler optimization,
> > > > > > resulting in
> > > > > higher
> > > > > > performance. Below a quick description of how this optimization
> > > > > > actually works;
> > > > > >
> > > > > > "Specialized functions" are "instances" of the generic
> > > > > > implementation, but the compiler is given extra context when
> > > > > > compiling. In the case of
> > > > > iterating
> > > > > > miniflow datastructures, the most interesting value to enable
> > > > > > compile time optimizations is the loop trip count per unit.
> > > > > >
> > > > > > In order to create a specialized function, there is a generic
> > > > > implementation,
> > > > > > which uses a for() loop without the compiler knowing the loop trip
> > > > > > count
> > > > > at
> > > > > > compile time. The loop trip count is passed in as an argument to
> > > > > > the
> > > > > function:
> > > > > >
> > > > > > uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t
> > > > > > loop_count)
> > > > {
> > > > > >     for(uint32_t i = 0; i < loop_count; i++)
> > > > > >         // do work
> > > > > > }
> > > > > >
> > > > > > In order to "specialize" the function, we call the generic
> > > > > > implementation
> > > > > with
> > > > > > hard-coded numbers - these are compile time constants!
> > > > > >
> > > > > > uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t
> > > > > > loop_count)
> > > > {
> > > > > >     // use hard coded constant for compile-time constant-
> propogation
> > > > > >     return miniflow_impl_generic(mf, 5); }
> > > > > >
> > > > > > Given the compiler is aware of the loop trip count at compile
> > > > > > time, it can perform an optimization known as "constant
> > propogation".
> > > > > > Combined with inlining of the miniflow_impl_generic() function,
> > > > > > the compiler is now
> > > > > enabled
> > > > > > to *compile time* unroll the loop 5x, and produce "flat" code.
> > > > > >
> > > > > > The last step to using the specialized functions is to utilize a
> > > > > > function-
> > > > > pointer
> > > > > > to choose the specialized (or generic) implementation.
> > > > > > The selection of the function pointer is performed at subtable
> > > > > > creation
> > > > > time,
> > > > > > when miniflow fingerprint of the subtable is known. This technique
> > > > > > is
> > > > > known
> > > > > > as "multiple dispatch" in some literature, as it uses multiple
> > > > > > items of information (miniflow bit counts) to select the dispatch
> > > function.
> > > > > >
> > > > > > By pointing the function pointer at the optimized implementation,
> > > > > > OvS benefits from the compile time optimizations at runtime.
> > > > > >
> > > > > > Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> > > > > >
> > > > > > ---
> > > > > >
> > > > > > v8:
> > > > > > - Rework to use blocks_cache from the dpcls instance, to avoid
> > variable
> > > > > >   lenght arrays in the data-path.
> > > > > > ---
> > > > > >  lib/dpif-netdev-lookup-generic.c | 83
> > > > > > +++++++++++++++++++++++++++++-
> > > > --
> > > > > >  lib/dpif-netdev.c                |  9 +++-
> > > > > >  lib/dpif-netdev.h                |  8 +++
> > > > > >  3 files changed, 90 insertions(+), 10 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/dpif-netdev-lookup-generic.c
> > > > > > b/lib/dpif-netdev-lookup- generic.c index 901d28ff0..ba2d024cc
> > > > > > 100644
> > > > > > --- a/lib/dpif-netdev-lookup-generic.c
> > > > > > +++ b/lib/dpif-netdev-lookup-generic.c
> > > > > > @@ -100,11 +100,11 @@ netdev_flow_key_flatten(const struct
> > > > > > netdev_flow_key * restrict key,
> > > > > >
> > > > > >      /* Unit 0 flattening */
> > > > > >      netdev_flow_key_flatten_unit(&pkt_blocks[0],
> > > > > > -                            &tbl_blocks[0],
> > > > > > -                            &mf_masks[0],
> > > > > > -                            &blocks_scratch[0],
> > > > > > -                            pkt_bits_u0,
> > > > > > -                            u0_count);
> > > > > > +                                 &tbl_blocks[0],
> > > > > > +                                 &mf_masks[0],
> > > > > > +                                 &blocks_scratch[0],
> > > > > > +                                 pkt_bits_u0,
> > > > > > +                                 u0_count);
> > > > > >
> > > > > >      /* Unit 1 flattening:
> > > > > >       * Move the pointers forward in the arrays based on u0
> offsets,
> > > NOTE:
> > > > > > @@ -225,7 +225,74 @@ dpcls_subtable_lookup_generic(struct
> > > > > > dpcls_subtable *subtable,
> > > > > >                                const struct netdev_flow_key
> *keys[],
> > > > > >                                struct dpcls_rule **rules)  {
> > > > > > -        return lookup_generic_impl(subtable, blocks_scratch,
> > keys_map,
> > > > > keys,
> > > > > > -                                   rules, subtable-
> > >mf_bits_set_unit0,
> > > > > > -                                   subtable->mf_bits_set_unit1);
> > > > > > +    /* Here the runtime subtable->mf_bits counts are used, which
> > > > > > + forces
> > > > > the
> > > > > > +     * compiler to iterate normal for() loops. Due to this
> > > > > > + limitation in
> > > > > the
> > > > > > +     * compilers available optimizations, this function has lower
> > > > > performance
> > > > > > +     * than the below specialized functions.
> > > > > > +     */
> > > > > > +    return lookup_generic_impl(subtable, blocks_scratch,
> > > > > > + keys_map, keys,
> > > > > > rules,
> > > > > > +                               subtable->mf_bits_set_unit0,
> > > > > > +                               subtable->mf_bits_set_unit1); }
> > > > > > +
> > > > > > +static uint32_t
> > > > > > +dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable
> > > > *subtable,
> > > > > > +                                   uint64_t *blocks_scratch,
> > > > > > +                                   uint32_t keys_map,
> > > > > > +                                   const struct netdev_flow_key
> > *keys[],
> > > > > > +                                   struct dpcls_rule **rules) {
> > > > > > +    /* hard coded bit counts - enables compile time loop
> unrolling,
> > and
> > > > > > +     * generating of optimized code-sequences due to loop
> unrolled
> > > code.
> > > > > > +     */
> > > > > > +    return lookup_generic_impl(subtable, blocks_scratch,
> > > > > > +keys_map, keys,
> > > > > > rules,
> > > > > > +                               5, 1); }
> > > > > > +
> > > > > > +static uint32_t
> > > > > > +dpcls_subtable_lookup_mf_u0w4_u1w1(struct dpcls_subtable
> > > > *subtable,
> > > > > > +                                   uint64_t *blocks_scratch,
> > > > > > +                                   uint32_t keys_map,
> > > > > > +                                   const struct netdev_flow_key
> > *keys[],
> > > > > > +                                   struct dpcls_rule **rules) {
> > > > > > +    return lookup_generic_impl(subtable, blocks_scratch,
> > > > > > +keys_map, keys,
> > > > > > rules,
> > > > > > +                               4, 1); }
> > > > > > +
> > > > > > +static uint32_t
> > > > > > +dpcls_subtable_lookup_mf_u0w4_u1w0(struct dpcls_subtable
> > > > *subtable,
> > > > > > +                                   uint64_t *blocks_scratch,
> > > > > > +                                   uint32_t keys_map,
> > > > > > +                                   const struct netdev_flow_key
> > *keys[],
> > > > > > +                                   struct dpcls_rule **rules) {
> > > > > > +    return lookup_generic_impl(subtable, blocks_scratch,
> > > > > > +keys_map, keys,
> > > > > > rules,
> > > > > > +                               4, 0); }
> > > > > > +
> > > > > > +/* Probe function to lookup an available specialized function.
> > > > > > + * If capable to run the requested miniflow fingerprint, this
> > > > > > +function returns
> > > > > > + * the most optimal implementation for that miniflow fingerprint.
> > > > > > + * @retval FunctionAddress A valid function to handle the
> > > > > > +miniflow bit pattern
> > > > > > + * @retval 0 The requested miniflow is not supported here, NULL
> > > > > > +is returned  */ dpcls_subtable_lookup_func
> > > > > > +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits)
> {
> > > > > > +    dpcls_subtable_lookup_func f = NULL;
> > > > > > +
> > > > > > +    if (u0_bits == 5 && u1_bits == 1) {
> > > > > > +        f = dpcls_subtable_lookup_mf_u0w5_u1w1;
> > > > > > +    } else if (u0_bits == 4 && u1_bits == 1) {
> > > > > > +        f = dpcls_subtable_lookup_mf_u0w4_u1w1;
> > > > > > +    } else if (u0_bits == 4 && u1_bits == 0) {
> > > > > > +        f = dpcls_subtable_lookup_mf_u0w4_u1w0;
> > > > > > +    }
> > > > > > +
> > > > > > +    if (f) {
> > > > > > +        VLOG_INFO("Subtable using Generic Optimized for u0 %d,
> > > u1 %d\n",
> > > > > > +                  u0_bits, u1_bits);
> > > > > > +    }
> > > > > > +    return f;
> > > > > >  }
> > > > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index
> > > > > 33b93cfdf..23fc5b7a6
> > > > > > 100644
> > > > > > --- a/lib/dpif-netdev.c
> > > > > > +++ b/lib/dpif-netdev.c
> > > > > > @@ -7623,8 +7623,13 @@ dpcls_create_subtable(struct dpcls *cls,
> > > > > > const struct netdev_flow_key *mask)
> > > > > >          cls->blocks_scratch_size = blocks_required_per_pkt;
> > > > > >      }
> > > > > >
> > > > > > -    /* Assign the generic lookup - this works with any miniflow
> > > > > fingerprint */
> > > > > > -    subtable->lookup_func = dpcls_subtable_lookup_generic;
> > > > > > +    /* Probe for an optimmized variant */
> > > > > > +    subtable->lookup_func = dpcls_subtable_generic_probe(unit0,
> > > > > > + unit1);
> > > > > > +
> > > > > > +    /* If not set, assign generic lookup - this works with any
> > > > > > + miniflow
> > > > > */
> > > > > > +    if (!subtable->lookup_func) {
> > > > > > +        subtable->lookup_func = dpcls_subtable_lookup_generic;
> > > > > > +    }
> > > > > >
> > > > > >      cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask-
> > > > >hash);
> > > > > >      /* Add the new subtable at the end of the pvector (with no
> > > > > > hits
> > > > > > yet)
> > > > > */ diff
> > > > > > --git a/lib/dpif-netdev.h b/lib/dpif-netdev.h index
> > > > > > 9263256a9..123eabad7
> > > > > > 100644
> > > > > > --- a/lib/dpif-netdev.h
> > > > > > +++ b/lib/dpif-netdev.h
> > > > > > @@ -70,6 +70,14 @@ typedef uint32_t
> > > > > > (*dpcls_subtable_lookup_func)(struct dpcls_subtable *subtable,
> > > > > >                  const struct netdev_flow_key *keys[],
> > > > > >                  struct dpcls_rule **rules);
> > > > > >
> > > > > > +/* Probe function to select a specialized version of the generic
> > > > > > +lookup
> > > > > > + * implementation. This provides performance benefit due to
> > > > > > +compile-time
> > > > > > + * optimizations such as loop-unrolling. These are enabled by the
> > > > > > +compile-time
> > > > > > + * constants in the specific function implementations.
> > > > > > + */
> > > > > > +dpcls_subtable_lookup_func
> > > > > > +dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t
> > > > > > +u1_bit_count);
> > > > > > +
> > > > > >  /* Prototype for generic lookup func, using same code path as
> > > > > > before */ uint32_t  dpcls_subtable_lookup_generic(struct
> > > > > > dpcls_subtable *subtable,
> > > > > > --
> > > > > > 2.17.1
> > > > > >
> > > > > > _______________________________________________
> > > > > > dev mailing list
> > > > > > dev at openvswitch.org
> > > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > > > > IMPORTANT NOTICE: The contents of this email and any attachments are
> > > > > confidential and may also be privileged. If you are not the intended
> > > > > recipient, please notify the sender immediately and do not disclose
> > > > > the contents to any other person, use it for any purpose, or store
> > > > > or copy the information in any medium. Thank you.
> > IMPORTANT NOTICE: The contents of this email and any attachments are
> > confidential and may also be privileged. If you are not the intended
> > recipient, please notify the sender immediately and do not disclose the
> > contents to any other person, use it for any purpose, or store or copy the
> > information in any medium. Thank you.