[ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount instruction.
i.maximets at ovn.org
Thu Jun 24 11:41:37 UTC 2021
On 6/24/21 1:07 PM, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: dev <ovs-dev-bounces at openvswitch.org> On Behalf Of Flavio Leitner
>> Sent: Thursday, June 24, 2021 4:57 AM
>> To: Ferriter, Cian <cian.ferriter at intel.com>
>> Cc: ovs-dev at openvswitch.org; i.maximets at ovn.org
>> Subject: Re: [ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount
>> On Thu, Jun 17, 2021 at 05:18:25PM +0100, Cian Ferriter wrote:
>>> From: Harry van Haaren <harry.van.haaren at intel.com>
>>> This commit enables the AVX512-VPOPCNTDQ Vector Popcount
>>> instruction. This instruction is not available on every CPU
>>> that supports the AVX512-F Foundation ISA, hence it is enabled
>>> only when the additional VPOPCNTDQ ISA check is passed.
>>> The vector popcount instruction is used instead of the AVX512
>>> popcount emulation code present in the avx512 optimized DPCLS today.
>>> It provides higher performance in the SIMD miniflow processing
>>> as that requires the popcount to calculate the miniflow block indexes.
>>> Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
>> Acked-by: Flavio Leitner <fbl at sysclose.org>
> Thanks for reviewing!
>> This patch series implements low level optimizations by manually
>> coding instructions. I wonder if gcc couldn't get some relevant
>> level of vectorized optimizations refactoring and enabling
>> compiling flags. I assume the answer is no, but I would appreciate
>> some enlightenment on the matter.
> Unfortunately no... there is no magic solution here to have the toolchain
> provide fallbacks if the latest ISA is not available. You're 100% right, these
> are manually implemented versions of new ISA, implemented in "older"
> ISA, to allow usage of the functionality. In this case, Skylake grade "AVX512-F"
> is used to implement the Icelake grade "AVX512-VPOPCNTDQ" instruction:
> I do like the idea of toolchain supporting ISA options a bit more, there is
> so much compute performance available that is not widely used today.
> Such an effort industry wide would be very beneficial to all for improving
> performance, but would be a pretty large undertaking too... outside the
> scope of this patchset! :)
> I'll admit to being a bit of an ISA fan, but there's some magical instructions
> that can do stuff in 1x instruction that otherwise take large amounts of
> shifts & loops. Did I hear somebody ask for examples..??
> Miniflow Bits processing with "BMI" (Bit Manipulation Instructions)
> Introduced in Haswell era, https://software.intel.com/sites/landingpage/IntrinsicsGuide/#othertechs=BMI1,BMI2
> - Favorite instructions are pdep and pext (parallel bit deposit, and parallel bit extract)
> - Very useful for dense bitfield unpacking, instead of "load - shift - AND" per field, can
> unpack up to 8 bitfields in a u64 and align them to byte-boundaries
> - Its "opposite" "pext" also exists, extracting sparse bits from an integer into a packed layout
> (pext is used in DPCLS, to pull sparse bits from the packet's miniflow into linear packed layout,
> allowing it to be processed in a single packed AVX512 register)
> Note that we're all benefitting from novel usage of the scalar "popcount" instruction too, since merging
> commit: a0b36b392 (introduced in SSE4.2, with CPUID flag POPCNT) It uses a bitmask & popcount approach
> to index into the miniflow, improving on the previous "count and shifts bits" to iterate miniflows approach.
> There are likely multiple other places in OVS where we spend significant cycles
> on processing data in ways that can be accelerated significantly by using all available ISA.
> There is ongoing work in miniflow extract (MFEX) with AVX512 SIMD ISA, allowing parsing
> of multiple packet protocols at the same time (see here https://patchwork.ozlabs.org/project/openvswitch/list/?series=249470)
> I'll stop promoting ISA here, but am happy to continue detailed discussions, or break out
> conversations about specific areas of compute in OVS if there's appetite for that! Feel free
> to email to OVS Mailing list (with me on CC please :) or email directly OK too.
> Regards, -Harry
Speaking of "magic" compiler optimizations, I'm wondering what
kind of performance improvement we can have by just compiling
"generic" implementations of DPCLS and other stuff with the same
flags with which we're compiling hand-crafted avx512 code.
I mean, if we'll have a separate .c file that would include
lib/dpif-netdev-lookup-generic.c (With some MACRO tricks to
generate a different name for the classifier callback) and will
compile it as part of libopenvswitchavx512 and have a separate
implementation switch for it in runtime. Did you consider this
kind of solution?
It would be interesting to compare manual optimizations with
automatic. I'm pretty sure that manual will be faster, but
it would be great to know the difference.
Maybe you have numbers for comparison where the whole OVS
just built with the same instruction set available?
Best regards, Ilya Maximets.
More information about the dev