[ovs-dev] [v4 03/12] dpif-netdev: Add study function to select the best mfex function
Van Haaren, Harry
harry.van.haaren at intel.com
Wed Jun 30 11:21:54 UTC 2021
> -----Original Message-----
> From: Eelco Chaudron <echaudro at redhat.com>
> Sent: Wednesday, June 30, 2021 10:52 AM
> To: Van Haaren, Harry <harry.van.haaren at intel.com>
> Cc: Amber, Kumar <kumar.amber at intel.com>; dev at openvswitch.org;
> i.maximets at ovn.org
> Subject: Re: [ovs-dev] [v4 03/12] dpif-netdev: Add study function to select the best
> mfex function
> On 30 Jun 2021, at 11:32, Van Haaren, Harry wrote:
> >> -----Original Message-----
> >> From: Eelco Chaudron <echaudro at redhat.com>
> >> Sent: Wednesday, June 30, 2021 10:18 AM
> >> To: Van Haaren, Harry <harry.van.haaren at intel.com>
> >> Cc: Amber, Kumar <kumar.amber at intel.com>; dev at openvswitch.org;
> >> i.maximets at ovn.org
> >> Subject: Re: [ovs-dev] [v4 03/12] dpif-netdev: Add study function to select the
> >> mfex function
> >> On 29 Jun 2021, at 18:32, Van Haaren, Harry wrote:
> >>>> -----Original Message-----
> >>>> From: dev <ovs-dev-bounces at openvswitch.org> On Behalf Of Eelco Chaudron
<snip away outdated context>
> >>>> Maybe we should report the numbers/hits for the other methods, as they might
> >> be
> >>>> equal, and some might be faster in execution time?
> >>> As above, the implementations are sorted in performance order. Performance
> >>> here can be known by micro-benchmarks, and developers of such SIMD
> >>> code can be expected to know which impl is fastest.
> >> Don’t think we can, as it’s not documented in the code, and someone can just
> >> add his own, and have no clue about the existing ones.
> > Yes, in theory somebody could add his own, and get this wrong. There are many
> > things that could go wrong when making code changes. We cannot document
> I meant that the code currently does not document that the implementation table,
> mfex_impls, is in order of preference. So I think this should be added.
Sure, we can document that the impl list is iterated & searched in order; a
code comment would help there. Will add this to the code.
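
As a rough illustration of what such a code comment could state, here is a minimal sketch of an ordered implementation table. Only the name mfex_impls and the avx512_vbmi_*/avx512_* ordering come from this thread; the struct, the hit counter and pick_best_impl() are hypothetical and are not the code in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of the implementation table. */
struct mfex_impl_sketch {
    const char *name;
    uint32_t hits;      /* Packets matched during the study phase. */
};

/* The table is searched in order, and entries are expected to be listed
 * fastest first (e.g. avx512_vbmi_* before avx512_*).  When two entries
 * have equal hit counts the earlier one wins, so a new implementation
 * must be inserted at the position that matches its expected speed. */
static size_t
pick_best_impl(const struct mfex_impl_sketch *impls, size_t n)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        /* Strict comparison: a tie keeps the earlier (preferred) entry. */
        if (impls[i].hits > impls[best].hits) {
            best = i;
        }
    }
    return best;
}
```

With equal hit counts for a VBMI entry at index 0 and a plain AVX512 entry at index 1, this sketch returns index 0, which is the behaviour the ordering comment is meant to document.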
> >>> In our current code, the avx512_vbmi_* impls are always before the avx512_*
> >>> impls, as the VBMI instruction set allows a faster runtime.
> >> Guess we need some documentation in the developer's section on how to add
> >> processor optimized functions, and how to benchmark them (and maybe some
> >> benchmark data for the current implementations).
> >> Also, someone can write a sloppy avx512_vbmi* function that might be slower
> >> than an avx512_*, right?
> > What are we trying to achieve here? What is the root problem that is being
> > Yes, somebody "could" write sloppy (complex, advanced, ISA specific, SIMD) avx512 code,
> > and have it be slower. Who is realistically going to do that?
> > I'm fine with documenting a few things if they make sense to document, but
> > trying to "hand hold" at every level just doesn't work. Adding sections on how
> > to benchmark code, and how function pointers work and how to add them?
> > These things are documented in various places across the internet.
> > If there's really an interest to learn AVX512 SIMD optimization, reach out to the
> > OVS community, put me on CC, and I'll be willing to help. Adding documentation
> > ad nauseam is not the solution, as each optimization is likely to have subtle
> I think the problem is that, apart from you and perhaps a small group at Intel,
> hardly anyone knows AVX512; for most of the OVS community this is moving back to
> handwritten assembler.
Nitpick but worth mentioning: optimizing with intrinsics is much easier, and much
less mental overhead than actual assembler (e.g. register allocation handled by compiler).
I agree lots of developers don't see this on a daily basis, but it's really not that "crazy".
Once over the 1st level of "reading intrinsics", scalar becomes looped scalar becomes vector:
    uint64_t x = y & z;                     /* scalar */

    for (int i = 0; i < 8; i++)             /* looped scalar */
        x[i] = y[i] & z[i];

    __m512i x = _mm512_and_si512(y, z);     /* vector */
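
For readers who want to compile the progression above, here is a small self-contained sketch. The function names and8_scalar(), and8() and and8_selfcheck() are made up for illustration. The vector path is only built when the compiler defines __AVX512F__ (for example with -mavx512f), otherwise it falls back to the loop, so the file builds on any machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Looped scalar: AND eight 64-bit lanes one at a time. */
static void
and8_scalar(uint64_t dst[8], const uint64_t a[8], const uint64_t b[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i] = a[i] & b[i];
    }
}

/* Vector: the same eight lanes in one 512-bit AND when AVX512F is
 * enabled at compile time, otherwise the scalar loop above. */
static void
and8(uint64_t dst[8], const uint64_t a[8], const uint64_t b[8])
{
#if defined(__AVX512F__)
    __m512i va = _mm512_loadu_si512(a);     /* Unaligned 64-byte load. */
    __m512i vb = _mm512_loadu_si512(b);

    _mm512_storeu_si512(dst, _mm512_and_si512(va, vb));
#else
    and8_scalar(dst, a, b);
#endif
}

/* Checks that both paths produce identical results on sample data. */
static void
and8_selfcheck(void)
{
    uint64_t a[8], b[8], x[8], y[8];

    for (int i = 0; i < 8; i++) {
        a[i] = 0xF0F0F0F0F0F0F0F0ULL + (uint64_t) i;
        b[i] = 0x00FF00FF00FF00FFULL ^ ((uint64_t) i << 32);
    }
    and8_scalar(x, a, b);
    and8(y, a, b);
    assert(memcmp(x, y, sizeof x) == 0);
}
```

The compiler handles register allocation for va and vb, which is the point made above about intrinsics versus handwritten assembler.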
Anyway, this is getting off topic, so I'll stop adding detail here.
> So at least some guidelines on what you should do when
> adding a custom function would help. Like order them in priority, maybe some
> simple example on how to benchmark the runtime of the mfex function. Don't think
> this has to be part of this patch, but a follow-up would be nice.
Honestly I'm still not convinced. Just running the normal OVS benchmarks is enough.
If the cycle-counts/packet-rate reported by OVS are better, you're going faster. These
things are already documented:
If you're a developer writing SIMD code, I think it's fair to assume some level of knowledge
on profiling. If not, the OVS documentation is IMO still _not_ the place to document how
to profile optimized code. There's nothing special about benchmarking these AVX512 MFEX
implementations compared to any other datapath (or otherwise) function.
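
For completeness, the kind of generic timing harness alluded to here might look like the sketch below. It is plain C with nothing OVS-specific: the extract_fn signature and the two toy candidates are hypothetical and do not match the real MFEX function prototypes. The cycle counts and packet rate reported by OVS itself remain the numbers that matter.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature, not the real MFEX prototype. */
typedef uint32_t (*extract_fn)(const uint8_t *pkt, uint32_t len);

/* Toy candidate 1: sum the packet bytes one at a time. */
static uint32_t
sum_bytewise(const uint8_t *pkt, uint32_t len)
{
    uint32_t sum = 0;

    for (uint32_t i = 0; i < len; i++) {
        sum += pkt[i];
    }
    return sum;
}

/* Toy candidate 2: same result, four bytes per loop iteration. */
static uint32_t
sum_unrolled(const uint8_t *pkt, uint32_t len)
{
    uint32_t sum = 0;
    uint32_t i = 0;

    for (; i + 4 <= len; i += 4) {
        sum += pkt[i] + pkt[i + 1] + pkt[i + 2] + pkt[i + 3];
    }
    for (; i < len; i++) {
        sum += pkt[i];
    }
    return sum;
}

/* Average CPU time per call in nanoseconds, measured with clock(). */
static double
ns_per_call(extract_fn fn, const uint8_t *pkt, uint32_t len, uint32_t iters)
{
    volatile uint32_t sink = 0;     /* Keeps results from being discarded. */
    clock_t start, end;

    assert(iters > 0);
    start = clock();
    for (uint32_t i = 0; i < iters; i++) {
        sink += fn(pkt, len);
    }
    end = clock();
    return (double) (end - start) * 1e9 / CLOCKS_PER_SEC / iters;
}
```

Calling ns_per_call() once per candidate on the same buffer gives two numbers to compare. An optimizing compiler may still hoist work out of such a loop, so treat the result as a rough indication only.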
> >>> <snip code changes till end of patch>
> > <snip snip away irrelevant old discussions>