[ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount instruction.

Thu Jun 24 11:07:59 UTC 2021

> -----Original Message-----
> From: dev <ovs-dev-bounces at openvswitch.org> On Behalf Of Flavio Leitner
> Sent: Thursday, June 24, 2021 4:57 AM
> To: Ferriter, Cian <cian.ferriter at intel.com>
> Cc: ovs-dev at openvswitch.org; i.maximets at ovn.org
> Subject: Re: [ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount
> instruction.
> 
> On Thu, Jun 17, 2021 at 05:18:25PM +0100, Cian Ferriter wrote:
> > From: Harry van Haaren <harry.van.haaren at intel.com>
> >
> > This commit enables the AVX512-VPOPCNTDQ Vector Popcount
> > instruction. This instruction is not available on every CPU
> > that supports the AVX512-F Foundation ISA, hence it is enabled
> > only when the additional VPOPCNTDQ ISA check is passed.
> >
> > The vector popcount instruction is used instead of the AVX512
> > popcount emulation code present in the avx512 optimized DPCLS today.
> > It provides higher performance in the SIMD miniflow processing
> > as that requires the popcount to calculate the miniflow block indexes.
> >
> > Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> 
> Acked-by: Flavio Leitner <fbl at sysclose.org>

Thanks for reviewing!

> This patch series implements low level optimizations by manually
> coding instructions. I wonder if gcc couldn't get some relevant
> level of vectorized optimizations refactoring and enabling
> compiling flags. I assume the answer is no, but I would appreciate
> some enlightenment on the matter.

Unfortunately no... there is no magic solution here to have the toolchain
provide fallbacks if the latest ISA is not available. You're 100% right, these
are manually implemented versions of new ISA, implemented in "older"
ISA, to allow usage of the functionality. In this case, Skylake grade "AVX512-F"
is used to implement the Icelake grade "AVX512-VPOPCNTDQ" instruction:
(https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_popcnt_epi64%2520&expand=4368,4368)

I do like the idea of toolchain supporting ISA options a bit more, there is
so much compute performance available that is not widely used today.
Such an effort industry wide would be very beneficial to all for improving
performance, but would be a pretty large undertaking too... outside the
scope of this patchset! :)

I'll admit to being a bit of an ISA fan, but there's some magical instructions
that can do stuff in 1x instruction that otherwise take large amounts of
shifts & loops. Did I hear somebody ask for examples..??

Miniflow Bits processing with "BMI" (Bit Manipulation Instructions)
Introduced in Haswell era, https://software.intel.com/sites/landingpage/IntrinsicsGuide/#othertechs=BMI1,BMI2
- Favorite instructions are pdep and pext (parallel bit deposit, and parallel bit extract)
- Very useful for dense bitfield unpacking, instead of "load - shift - AND" per field, can
   unpack up to 8 bitfields in a u64 and align them to byte-boundaries
- Its "opposite" "pext" also exists, extracting sparse bits from an integer into a packed layout
(pext is used in DPCLS, to pull sparse bits from the packet's miniflow into linear packed layout,
allowing it to be processed in a single packed AVX512 register)

Note that we're all benefitting from novel usage of the scalar "popcount" instruction too, since merging
commit: a0b36b392 (introduced in SSE4.2, with CPUID flag POPCNT) It uses a bitmask & popcount approach
to index into the miniflow, improving on the previous "count and shifts bits" to iterate miniflows approach.

There are likely multiple other places in OVS where we spend significant cycles
on processing data in ways that can be accelerated significantly by using all available ISA.
There is ongoing work in miniflow extract (MFEX) with AVX512 SIMD ISA, allowing parsing
of multiple packet protocols at the same time (see here https://patchwork.ozlabs.org/project/openvswitch/list/?series=249470)

I'll stop promoting ISA here, but am happy to continue detailed discussions, or break out
conversations about specific areas of compute in OVS if there's appetite for that! Feel free
to email to OVS Mailing list (with me on CC please :) or email directly OK too.

Regards, -Harry