[ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather implementation

Van Haaren, Harry harry.van.haaren at intel.com
Thu May 21 17:09:59 UTC 2020


Hey All,

[OT: Apologies for a missing indent, some HTML mixup occurred somewhere, now plain-text email again.]

>From: Federico Iezzi <fiezzi at redhat.com>
>Sent: Wednesday, May 20, 2020 5:13 PM
>To: William Tu <u9012063 at gmail.com>
>Cc: Van Haaren, Harry <harry.van.haaren at intel.com>; ovs-dev at openvswitch.org; i.maximets at ovn.org
>Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather implementation
>
>On Wed, 20 May 2020 at 15:32, William Tu <u9012063 at gmail.com> wrote:
>On Wed, May 20, 2020 at 3:35 AM Federico Iezzi <fiezzi at redhat.com> wrote:
>> On Wed, 20 May 2020 at 12:20, Van Haaren, Harry <harry.van.haaren at intel.com> wrote:
>>>
>>> > -----Original Message-----
>>> > From: William Tu <u9012063 at gmail.com>
>>> > Sent: Wednesday, May 20, 2020 1:12 AM
>>> > To: Van Haaren, Harry <harry.van.haaren at intel.com>
>>> > Cc: ovs-dev at openvswitch.org; i.maximets at ovn.org
>>> > Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather
>>> > implementation
>>> >
>>> > On Mon, May 18, 2020 at 9:12 AM Van Haaren, Harry
>>> > <harry.van.haaren at intel.com> wrote:
>>> > >
>>> > > > -----Original Message-----
>>> > > > From: William Tu <u9012063 at gmail.com>
>>> > > > Sent: Monday, May 18, 2020 3:58 PM
>>> > > > To: Van Haaren, Harry <harry.van.haaren at intel.com>
>>> > > > Cc: ovs-dev at openvswitch.org; i.maximets at ovn.org
>>> > > > Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather
>>> > > > implementation
>>> > > >
>>> > > > On Wed, May 06, 2020 at 02:06:09PM +0100, Harry van Haaren wrote:
>>> > > > > This commit adds an AVX-512 dpcls lookup implementation.
>>> > > > > It uses the AVX-512 SIMD ISA to perform multiple miniflow
>>> > > > > operations in parallel.
>>>
>>> <snip lots of code/patch contents for readability>
>>>
>>> > Hi Harry,
>>> >
>>> > I managed to find a machine with avx512 in google cloud and did some
>>> > performance testing. I saw lower performance when enabling avx512,
>>
>>
>> AVX512 instruction path lowers the clock speed well below the base frequency [1].
>> Aren't you killing the PMD performance while improving the lookup ones?
>>
>> [1] https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/2nd-gen-xeon-scalable-spec-update.pdf (see page 20)

Thanks for raising your question; likely there are others with similar questions. It will be good to
discuss here and to be able to present the logic and design taken in these OVS patches for enabling AVX512.

From a frequency perspective, there is a misconception that AVX512 will always cause the worst-case degradation.
For example, the frequency differs based on which instructions are executing. This does make it more
complicated, however there are rules here, and those rules provide us SW developers with best practices. I've added
my colleague Edwin on CC, who is much more familiar with the AVX512 frequency topic, and can provide more detail.


From an OVS Software Developer perspective, these were the design decisions that made AVX512 enabling work:
AVX512 provides a very powerful compute ISA, so to optimize with it we must use that compute efficiently. This patchset
"flattens" a packet's miniflow data-structure, based on the miniflow of the subtable to match on. In short,
it implements in SIMD the tuple-space-search required for DPCLS wildcarded lookup. The instruction count reduction
is large, and that's what ultimately leads to the performance improvements.
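
To make the "flatten, then match" idea concrete, here is a minimal scalar sketch. It is not the code from the
patch: blocks_match() and its parameters are hypothetical names, and it assumes the packet's miniflow has already
been flattened into 64-bit blocks laid out in the same order as the subtable's mask. A SIMD version would apply
the same AND/compare to several blocks per instruction instead of one block per loop iteration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an OVS function.
 *
 * pkt_blocks: packet miniflow values, already flattened into the
 *             subtable's block order (assumption of this sketch).
 * mask:       the subtable's wildcard mask, one u64 per block.
 * key:        the rule's values, already ANDed with 'mask'.
 *
 * Returns true when every masked packet block equals the rule block. */
static bool
blocks_match(const uint64_t *pkt_blocks, const uint64_t *mask,
             const uint64_t *key, size_t n_blocks)
{
    uint64_t diff = 0;

    assert(pkt_blocks && mask && key);

    /* Branch-free accumulate: any differing bit in any block sets 'diff'. */
    for (size_t i = 0; i < n_blocks; i++) {
        diff |= (pkt_blocks[i] & mask[i]) ^ key[i];
    }
    return diff == 0;
}
```

The loop body has no data-dependent branches, which is the property that lets the same work map onto wide
vector lanes.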

Given a DPCLS implementation with AVX512, we must consider the other work done on that thread; you correctly
point out that other work (e.g. DPDK PMDs) also executes on that core. My experience has been that performance goes
up, including DPDK PMD rx and tx: the overall rate of work done increases. Given OVS can spend significant amounts of
time in DPCLS itself, the net result is very likely still a performance improvement even if the PMD code slows down somewhat.
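
The trade-off can be written down as a small back-of-envelope model. This is only a sketch under stated
assumptions, not a measurement: net_speedup() and its three parameters are hypothetical, and it assumes the
frequency reduction applies uniformly to all work on the core while only the DPCLS share gets the cycle reduction.

```c
#include <assert.h>

/* Hypothetical model, not part of OVS.
 *
 * dpcls_share:     fraction of per-packet cycles spent in DPCLS
 *                  before the change (0.0 .. 1.0).
 * cycle_reduction: factor by which DPCLS cycles shrink with the SIMD
 *                  path (2.0 means half the cycles).
 * freq_ratio:      core frequency with the SIMD path divided by the
 *                  frequency without it (1.0 means no down-clocking).
 *
 * Returns the relative packet rate; values above 1.0 are a net gain. */
static double
net_speedup(double dpcls_share, double cycle_reduction, double freq_ratio)
{
    assert(dpcls_share >= 0.0 && dpcls_share <= 1.0);
    assert(cycle_reduction > 0.0 && freq_ratio > 0.0);

    /* New cycle count per packet, relative to the old count of 1.0. */
    double new_cycles = dpcls_share / cycle_reduction + (1.0 - dpcls_share);

    return freq_ratio / new_cycles;
}
```

The larger the DPCLS share, the more down-clocking the core can absorb before the result drops below 1.0; the
real inputs have to come from profiling each deployment.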

Finally, the design itself is very flexible: each deployment of OVS can test if/how-much the AVX512
code-path improves real-world performance, and enable it based on that.


>Thanks for sharing the link.
>Does that mean if an OVS PMD uses avx512 on one core, then all the other cores'
>frequency will be lower?
>
>Only where avx512 instructions are executed is the clock reduced, to cope with the thermals.
>I'm not sure if there is a situation where avx512 code is executed only on specific PMDs; if that happens it is bad, as some PMDs may be faster/slower (see below).
>Kinda like when dynamic turbo boost is enabled and some PMDs go faster because of the higher clock.
>
>
>There is some discussion here:
>https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-use-these-new-instructions/
>
>Wow, quite interesting. Thanks!
>
>
>My take is that overall down clocking will happen, but application
>will get better performance.
>
>Indeed the part of the code written for avx512 goes much faster; the rest stays on the normal path and will go slower due to the reduced clock.
>Those are different use-cases and programs, but see the Cannon Lake Anandtech review regarding what AVX512 can deliver:
>
>###
>When we crank on the AVX2 and AVX512, there is no stopping the Cannon Lake chip here. At a score of 4519, it beats a full 18-core Core i9-7980XE processor running in non-AVX.
>https://www.anandtech.com/show/13405/intel-10nm-cannon-lake-and-core-i3-8121u-deep-dive-review/9
>###
>
>Indeed you have to expect much-improved performance from it; the question is how much non-avx512 code will slow down.
>See also this one -> https://travisdowns.github.io/blog/2020/01/17/avxfreq1.html

There's a lot of (and some very detailed) information out there, and it's useful to read it.
Ultimately it is very unlikely somebody has tested your exact configuration or deployment, particularly since this
OVS patchset has only been on the mailing-list for the past few weeks. I welcome $ perf top output like in William's email,
showing CPU %'s spent in DPCLS; the more real-world data the better for showing the value of AVX512 in DPCLS.

Regards, -Harry

