[ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount instruction.

Thu Jun 24 12:07:26 UTC 2021

> -----Original Message-----
> From: Ilya Maximets <i.maximets at ovn.org>
> Sent: Thursday, June 24, 2021 12:42 PM
> To: Van Haaren, Harry <harry.van.haaren at intel.com>; Flavio Leitner
> <fbl at sysclose.org>; Ferriter, Cian <cian.ferriter at intel.com>
> Cc: ovs-dev at openvswitch.org; i.maximets at ovn.org; Amber, Kumar
> <kumar.amber at intel.com>
> Subject: Re: [ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount
> instruction.
> 
> On 6/24/21 1:07 PM, Van Haaren, Harry wrote:

<snip lots of ISA discussion & commit message>

> > I'll stop promoting ISA here, but am happy to continue detailed discussions, or break out
> > conversations about specific areas of compute in OVS if there's appetite for that! Feel free
> > to email to OVS Mailing list (with me on CC please :) or email directly OK too.
> >
> > Regards, -Harry
> >
> 
> Speaking of "magic" compiler optimizations, I'm wondering what
> kind of performance improvement we can have by just compiling
> "generic" implementations of DPCLS and other stuff with the same
> flags with which we're compiling hand-crafted avx512 code.

That's pretty easy to do: CFLAGS="-march=skylake-avx512" on a Skylake
or newer CPU will achieve that. Or "-march=native" will enable all ISA
available on whatever CPU you're compiling on.
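
One way to confirm that such flags took effect is to look at the ISA macros
the compiler predefines. The sketch below is illustrative only and not part of
OVS; the function name is hypothetical. It relies on the __AVX512F__, __AVX2__
and __SSE2__ macros that GCC and Clang define when the matching ISA is enabled.

```c
/* Hypothetical helper: widest SIMD register width, in bytes, that this
 * translation unit was allowed to use at compile time.  Returns 0 when
 * none of the checked x86 ISA macros is defined. */
static inline int
compiled_vector_bytes(void)
{
#if defined(__AVX512F__)
    return 64;      /* e.g. built with -march=skylake-avx512 */
#elif defined(__AVX2__)
    return 32;
#elif defined(__SSE2__)
    return 16;      /* baseline on x86-64 */
#else
    return 0;
#endif
}
```

This only reports what the compiler was permitted to emit, not how much of the
code it actually managed to vectorize.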

Note that subtable search specialization is actually a *huge* help to the
compiler in this case, as it can (at compile time) know how many times
a specific loop can be unrolled... and loop unrolling into SIMD code is
often the easiest of transforms to do & validate as correct for the compiler.
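
A minimal sketch of the idea, with hypothetical names that are not taken from
the OVS sources: one inlined worker takes the block count as a parameter, and
thin wrappers pass it as a constant, so the compiler sees a fixed trip count
for each wrapper and is free to unroll it fully.

```c
#include <stddef.h>
#include <stdint.h>

/* Generic worker: AND each block with its mask and XOR-fold the results.
 * Marked inline so a constant 'n' from a caller can propagate into the loop. */
static inline uint64_t
masked_fold(const uint64_t *blocks, const uint64_t *mask, size_t n)
{
    uint64_t acc = 0;

    for (size_t i = 0; i < n; i++) {
        acc ^= blocks[i] & mask[i];
    }
    return acc;
}

/* Runtime-count entry point: the trip count is unknown at compile time. */
static uint64_t
masked_fold_generic(const uint64_t *blocks, const uint64_t *mask, size_t n)
{
    return masked_fold(blocks, mask, n);
}

/* Specialized entry points: the trip count is a compile-time constant. */
#define DECLARE_SPECIALIZED_FOLD(N)                                  \
    static uint64_t                                                  \
    masked_fold_##N(const uint64_t *blocks, const uint64_t *mask)    \
    {                                                                \
        return masked_fold(blocks, mask, N);                         \
    }

DECLARE_SPECIALIZED_FOLD(4)
DECLARE_SPECIALIZED_FOLD(8)
```

At runtime the caller would pick masked_fold_4() or masked_fold_8() when the
count matches and fall back to masked_fold_generic() otherwise.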

Look up CPU SIMD vector optimization, and 99.999% of the time the
example given is a float matrix multiply. Why? It has a nice property of loop-
unrolling into a SIMD register, and this optimization is inside a hot loop.
It’s the "home run" of compiler auto-vectorization. Packet processing
is much more complex in nature, and I've never seen a complex scalar
function be neatly vectorized by a compiler yet...

> I mean, if we'll have a separate .c file that would include
> lib/dpif-netdev-lookup-generic.c (With some MACRO tricks to
> generate a different name for the classifier callback) and will
> compile it as part of libopenvswitchavx512 and have a separate
> implementation switch for it in runtime.  Did you consider this
> kind of solution?

Not really, because there's no actual benefit. Try compiling all
of OVS as above with CFLAGS="-march=skylake-avx512" and see
how much the compiler manages to actually vectorize into SIMD code...

Unfortunately, transformations such as scalar -> vector code are complex
for the compiler to perform, and many "hazards" exist that disallow
vectorization.

Something as simple as loading from and storing to two pointers of the same type is
already going to stop the compiler, as those two pointers may overlap. This can be solved
with the C "restrict" keyword, which informs the compiler that the memory region pointed to
by that pointer cannot be accessed in any other way... this would require large changes
to the OVS codebase to indicate that one struct flow* or struct miniflow* cannot overlap
with another.
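
To illustrate the aliasing hazard, here is a small sketch with hypothetical
function names, using plain uint64_t arrays rather than the real OVS structs.
Both functions compute the same result; only the second promises the compiler
that the three arrays never overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* No restrict: 'dst' might alias 'src' or 'mask', so a store to dst[i] could
 * change a later load.  The compiler has to keep scalar order or add a
 * runtime overlap check before taking a vectorized path. */
static void
mask_blocks(uint64_t *dst, const uint64_t *src, const uint64_t *mask,
            size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] & mask[i];
    }
}

/* With restrict: the caller guarantees the regions do not overlap, so loads
 * and stores from different iterations can be freely combined. */
static void
mask_blocks_restrict(uint64_t *restrict dst, const uint64_t *restrict src,
                     const uint64_t *restrict mask, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] & mask[i];
    }
}
```

Calling the restrict version with overlapping arrays is undefined behaviour,
which is why every call site would need to be audited first.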

Once that is done, we must rely on the compiler to actually understand the data
movement taking place, and be able to *guarantee correctness* for any input data.
As humans, we can reason about specific things, and rule them out. The compiler is not
allowed to do this, hence auto-vectorization often just doesn't work.

Lastly, any if() conditions that have stores in them must be made "branch free".
As x86-64 has Total Store Ordering, this makes it difficult for the compiler to take liberties
in terms of re-ordering stores from program order. The result is that if the order of stores in
your program would change due to the compiler having auto-vectorized the code, the transformation
is not valid, so the compiler will not emit it.
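
As a rough sketch of what "branch free" means here (hypothetical names, not
OVS code): the first loop stores only under a condition, the second always
stores and selects the value instead. The programmer can make that change
knowing out[] is private and writable; the compiler generally may not
introduce a store that the original program would not have performed.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy form: out[i] is written only when in[i] exceeds the limit. */
static void
cap_branchy(uint32_t *out, const uint32_t *in, uint32_t limit, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (in[i] > limit) {
            out[i] = limit;
        }
    }
}

/* Branch-free form: every out[i] is written, either with the limit or with
 * its previous value.  Assumes out[] is writable for all n elements and is
 * not concurrently accessed by another thread. */
static void
cap_branch_free(uint32_t *out, const uint32_t *in, uint32_t limit, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t keep = out[i];

        out[i] = in[i] > limit ? limit : keep;
    }
}
```

The second form maps directly onto a vector compare plus blend, at the cost of
always writing the whole output array.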

> It would be interesting to compare manual optimizations with
> automatic.  I'm pretty sure that manual will be faster, but
> it would be great to know the difference.
> Maybe you have numbers for comparison where the whole OVS
> just built with the same instruction set available?

Note that glibc functions such as memcmp() etc. are already optimized for
the ISA that is available. glibc selects the implementation at runtime, when the
dynamic linker resolves the symbol (its IFUNC mechanism), based on the CPU it is
running on, so it is free to use SIMD registers/CPU ISA as it likes. As a result,
functions like memcmp() and memcpy() use SIMD registers "under the hood". This may
help e.g. EMC for compares, if it's not memory bound on loading the data to be compared.

In my experience, the compiler is not able to automatically vectorize to use
AVX512 SIMD instructions in any meaningful way to accelerate the datapath.
As a result, the performance is pretty similar to the scalar code, within a few %.

Feel free to test this; the CFLAGS string is above. I would be a bit surprised
if there were > 5% differences in Phy-Phy end-to-end performance.

> Best regards, Ilya Maximets.

Regards, -Harry