[ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount instruction.

Flavio Leitner fbl at sysclose.org
Thu Jun 24 12:17:52 UTC 2021


Hi Harry,

On Thu, Jun 24, 2021 at 11:07:59AM +0000, Van Haaren, Harry wrote:
> > -----Original Message-----
> > From: dev <ovs-dev-bounces at openvswitch.org> On Behalf Of Flavio Leitner
> > Sent: Thursday, June 24, 2021 4:57 AM
> > To: Ferriter, Cian <cian.ferriter at intel.com>
> > Cc: ovs-dev at openvswitch.org; i.maximets at ovn.org
> > Subject: Re: [ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount
> > instruction.
> > 
> > On Thu, Jun 17, 2021 at 05:18:25PM +0100, Cian Ferriter wrote:
> > > From: Harry van Haaren <harry.van.haaren at intel.com>
> > >
> > > This commit enables the AVX512-VPOPCNTDQ Vector Popcount
> > > instruction. This instruction is not available on every CPU
> > > that supports the AVX512-F Foundation ISA, hence it is enabled
> > > only when the additional VPOPCNTDQ ISA check is passed.
> > >
> > > The vector popcount instruction is used instead of the AVX512
> > > popcount emulation code present in the avx512 optimized DPCLS today.
> > > It provides higher performance in the SIMD miniflow processing
> > > as that requires the popcount to calculate the miniflow block indexes.
> > >
> > > Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> > 
> > Acked-by: Flavio Leitner <fbl at sysclose.org>
> 
> Thanks for reviewing!
> 
> > This patch series implements low level optimizations by manually
> > coding instructions. I wonder whether gcc could deliver a relevant
> > level of vectorization if we refactored the code and enabled the
> > right compiler flags. I assume the answer is no, but I would appreciate
> > some enlightenment on the matter.
> 
> Unfortunately no... there is no magic solution here to have the toolchain
> provide fallbacks if the latest ISA is not available. You're 100% right, these
> are manually implemented versions of new ISA, implemented in "older"
> ISA, to allow usage of the functionality. In this case, Skylake grade "AVX512-F"
> is used to implement the Icelake grade "AVX512-VPOPCNTDQ" instruction:
> (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_popcnt_epi64%2520&expand=4368,4368)
> 
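
For readers unfamiliar with what "emulating" a popcount means, here is a
minimal scalar sketch of the general idea: counting set bits using only
shifts, masks and a multiply when no dedicated instruction is available.
This is not the AVX512 code from the patch; the function name
popcount64_swar is hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper, not part of OVS: bit-parallel ("SWAR") popcount.
 * Each step sums neighbouring groups of bits in place. */
static inline unsigned int
popcount64_swar(uint64_t x)
{
    /* Sum adjacent bits into 2-bit counters. */
    x = x - ((x >> 1) & UINT64_C(0x5555555555555555));
    /* Sum adjacent 2-bit counters into 4-bit counters. */
    x = (x & UINT64_C(0x3333333333333333))
        + ((x >> 2) & UINT64_C(0x3333333333333333));
    /* Sum adjacent 4-bit counters into per-byte counters. */
    x = (x + (x >> 4)) & UINT64_C(0x0f0f0f0f0f0f0f0f);
    /* Add all eight byte counters; the total lands in the top byte. */
    return (unsigned int) ((x * UINT64_C(0x0101010101010101)) >> 56);
}
```

A native popcount instruction replaces all of the steps above with one
operation, which is where the gain described in the commit message
would come from.
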
> I do like the idea of toolchain supporting ISA options a bit more, there is
> so much compute performance available that is not widely used today.
> Such an effort industry wide would be very beneficial to all for improving
> performance, but would be a pretty large undertaking too... outside the
> scope of this patchset! :)

Yeah, it is. I mean, if the toolchain is not ready yet and we think
the benefits are worth it, considering that most probably fewer people
will be able to contribute to or maintain this code, then I see no
other way to solve the issue.

Do you think improving the toolchain is a larger commitment than
manually improving applications? A quick look at gcc gave me the
impression that it does support at least some basic vector
optimization capabilities.
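
As a minimal sketch of the kind of code a compiler has a chance of
vectorizing by itself, consider a plain loop with independent
iterations. The function name popcount_array is hypothetical, and
whether gcc actually emits vector popcount instructions for it is an
assumption that depends on the gcc version and on flags such as
-O3 -mavx512f -mavx512vpopcntdq; -fopt-info-vec can be used to check
what the vectorizer did.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of OVS: per-element popcount.
 * 'restrict' tells the compiler that 'in' and 'out' do not alias,
 * which removes one common obstacle to auto-vectorization. */
void
popcount_array(const uint64_t *restrict in, uint64_t *restrict out, size_t n)
{
    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        /* gcc builtin; lowered to whatever the target ISA offers. */
        out[i] = (uint64_t) __builtin_popcountll(in[i]);
    }
}
```

Note that this only covers building for one known target. It does not
by itself provide the runtime ISA check and fallback discussed in this
thread.
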


> I'll admit to being a bit of an ISA fan, but there are some magical instructions
> that can do in a single instruction what otherwise takes large numbers of
> shifts & loops. Did I hear somebody ask for examples..??

Out of curiosity, which tool are you using (if any) to measure
the improvements at the cycle level? VTune?


> Miniflow Bits processing with "BMI" (Bit Manipulation Instructions)
> Introduced in Haswell era, https://software.intel.com/sites/landingpage/IntrinsicsGuide/#othertechs=BMI1,BMI2
> - Favorite instructions are pdep and pext (parallel bit deposit, and parallel bit extract)
> - Very useful for dense bitfield unpacking, instead of "load - shift - AND" per field, can
>    unpack up to 8 bitfields in a u64 and align them to byte-boundaries
> - Its "opposite" "pext" also exists, extracting sparse bits from an integer into a packed layout
> (pext is used in DPCLS, to pull sparse bits from the packet's miniflow into linear packed layout,
> allowing it to be processed in a single packed AVX512 register)
> 
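
To make the pdep/pext semantics described above concrete, here is a
portable reference sketch in plain C. The names pext_u64_ref and
pdep_u64_ref are hypothetical. The loops only model what the BMI2
instructions compute and are far slower than the single-instruction
hardware versions.

```c
#include <stdint.h>

/* Hypothetical reference model of "parallel bit extract": gather the bits
 * of 'src' selected by 'mask' into the low bits of the result. */
static uint64_t
pext_u64_ref(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    uint64_t out_bit = 1;            /* Next packed output position. */

    while (mask) {
        uint64_t lowest = mask & -mask;   /* Lowest set bit of the mask. */
        if (src & lowest) {
            result |= out_bit;
        }
        out_bit <<= 1;
        mask &= mask - 1;                 /* Clear that mask bit. */
    }
    return result;
}

/* Hypothetical reference model of "parallel bit deposit": scatter the low
 * bits of 'src' to the positions selected by 'mask'. */
static uint64_t
pdep_u64_ref(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    uint64_t in_bit = 1;             /* Next packed input position. */

    while (mask) {
        uint64_t lowest = mask & -mask;
        if (src & in_bit) {
            result |= lowest;
        }
        in_bit <<= 1;
        mask &= mask - 1;
    }
    return result;
}
```

For example, extracting with mask 0x0F0F from 0xABCD packs the two
selected nibbles into 0xBD, and depositing 0xBD with the same mask
spreads them back out to 0x0B0D.
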
> Note that we're all benefitting from novel usage of the scalar "popcount" instruction too (introduced
> alongside SSE4.2, with CPUID flag POPCNT), since merging commit a0b36b392. It uses a bitmask & popcount approach
> to index into the miniflow, improving on the previous "count and shift bits" approach to iterating miniflows.
> 
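
The bitmask & popcount indexing mentioned above can be sketched as
follows. This is a simplified illustration, not the OVS miniflow code:
the struct mini_map and the function mini_map_get are hypothetical,
and a single 64-bit map is assumed. The idea is that values are stored
packed, one per set bit, so the index of a field is the number of set
bits below it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a miniflow: 'bits' marks which
 * fields are present, 'values' holds one packed entry per set bit. */
struct mini_map {
    uint64_t bits;
    const uint64_t *values;
};

/* Returns the value stored for field 'bit', or 0 if it is not present. */
static inline uint64_t
mini_map_get(const struct mini_map *m, unsigned int bit)
{
    assert(bit < 64);

    uint64_t flag = UINT64_C(1) << bit;
    if (!(m->bits & flag)) {
        return 0;
    }
    /* Mask off 'bit' and everything above it, then count what is left:
     * that count is the position of the field in the packed array. */
    unsigned int idx = (unsigned int) __builtin_popcountll(m->bits & (flag - 1));
    return m->values[idx];
}
```

With bits 3, 10 and 40 set, a lookup of bit 10 masks down to bit 3
only, so the popcount is 1 and the second packed value is returned,
with no loop over the lower bits.
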
> There are likely multiple other places in OVS where we spend significant cycles
> on processing data in ways that can be accelerated significantly by using all available ISA.
> There is ongoing work in miniflow extract (MFEX) with AVX512 SIMD ISA, allowing parsing
> of multiple packet protocols at the same time (see here https://patchwork.ozlabs.org/project/openvswitch/list/?series=249470)
> 
> I'll stop promoting ISA here, but am happy to continue detailed discussions, or break out
> conversations about specific areas of compute in OVS if there's appetite for that! Feel free
> to email the OVS mailing list (with me on CC please :) or email me directly, that's OK too.

I am definitely learning more about it and I appreciated your
longer reply.

Thanks,
-- 
fbl

