[ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector popcount instruction.
fbl at sysclose.org
Thu Jun 24 18:04:50 UTC 2021
On Thu, Jun 24, 2021 at 12:52:49PM +0000, Van Haaren, Harry wrote:
> > On Thu, Jun 24, 2021 at 11:07:59AM +0000, Van Haaren, Harry wrote:
> > > > On Thu, Jun 17, 2021 at 05:18:25PM +0100, Cian Ferriter wrote:
> > > > > From: Harry van Haaren <harry.van.haaren at intel.com>
> > > I do like the idea of toolchain supporting ISA options a bit more, there is
> > > so much compute performance available that is not widely used today.
> > > Such an effort industry wide would be very beneficial to all for improving
> > > performance, but would be a pretty large undertaking too... outside the
> > > scope of this patchset! :)
> > Yeah, it is. I mean, if the toolchain is not ready yet and we think it is
> > worth the benefits considering that most probably fewer people will
> > be able to contribute or maintain, then I see no other way to solve
> > the issue.
> So the toolchain is "ready" in that we have a path to enable CPU ISA, and
> see the benefits. We can dream about future toolchains, and how those might
> improve our workflow in future, but pragmatically the approach here is the
> best-known-method based on available tools today. DPDK uses the same
> techniques (Function pointer, CPUID based ISA check, and plug in ISA if available).
> Improving the toolchain would only solve the problem to allow the compiler to use the
> CPU ISA. This does not solve the problem of the compiler not being able to understand
> the data-movement & processing to be able to reason about it and auto-vectorize.
Yeah, the examples I found are straightforward uses of ISA as you said,
so I wasn't sure how much a compiler is able to help nowadays.
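
To make the quoted pattern concrete (function pointer, CPUID based ISA
check, plug in the ISA version if available), here is a minimal sketch in C.
It is not the OVS or DPDK code: the names popcount_fn, popcount_generic,
popcount_isa and popcount_probe are made up for illustration, and it assumes
GCC or clang on x86-64, where __builtin_cpu_supports() performs the CPUID
check. It uses the plain POPCNT instruction rather than AVX512 so that it
builds without extra compiler flags.

```c
#include <stdint.h>

/* Hypothetical function-pointer type for a popcount implementation. */
typedef uint32_t (*popcount_fn)(uint64_t);

/* Portable fallback: clear the lowest set bit until none remain. */
static uint32_t
popcount_generic(uint64_t x)
{
    uint32_t n = 0;

    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
/* ISA version: POPCNT is enabled for this one function only, so the rest
 * of the file still builds for a baseline CPU. */
static __attribute__((target("popcnt"))) uint32_t
popcount_isa(uint64_t x)
{
    return (uint32_t) __builtin_popcountll(x);
}
#endif

/* Runtime probe: return the ISA version only if the CPU reports support,
 * otherwise return the generic one. */
static popcount_fn
popcount_probe(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("popcnt")) {
        return popcount_isa;
    }
#endif
    return popcount_generic;
}
```

A caller would run popcount_probe() once at startup, store the returned
pointer, and call through it on the fast path. The loop in popcount_generic
is also a small instance of work that a single instruction replaces.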
> > Do you think improving the toolchain is a larger commitment than
> > manually improving applications? A quick look on gcc gave me the
> > impression that it does support at least some basic vector
> > optimization capabilities.
> Yes - you raise a good point, "basic vector optimization capabilities" are present
> in various compilers (gcc and clang/llvm are what I test with). For the matrix-multiply
> problem that is often used to showcase compiler auto-vectorization, it is an extremely
> well-bounded and simple task in terms of understanding the work to be done.
> Our emails crossed paths, there's more detail here about matrix multiply & basic vectorization.
Exactly :) For sure we want OVS to run faster, but there needs to be a
line on how low level we can go because it's always a trade-off with
complexity. In this case the line was blurry, at least to me, because
I wasn't aware of how far the toolchain can help us.
Do you think these optimizations will be a problem with Windows or
BSDs? I haven't found an alternative to Cirrus which I used before
to build on BSD.
> > > I'll admit to being a bit of an ISA fan, but there's some magical instructions
> > > that can do stuff in 1x instruction that otherwise take large amounts of
> > > shifts & loops. Did I hear somebody ask for examples..??
> > Out of curiosity, which tool are you using (if you are) to measure
> > the improvements at cycles level? vtune?
> I use the Linux Perf tooling for performance measurements, along with OVS's
> own per-packet cycle count reporting. Hardware performance measuring (as Linux
> Perf and VTune use) provides all the info that's required.
> For those not measuring performance at the function/ASM level, run the following
> command and view the performance in your terminal: perf top -C <pmd_core> -b
> Based on that, focus on the areas where lots of cycles are spent, and investigate
> alternative SIMD based implementations for that same functionality, making use
> of the CPU ISA. That's the general workflow :)
Yup, I am familiar with most of those except VTune, so I wondered
if it provided more insight into the impact of AVX512 optimizations.
> For those particularly interested, I gave a "Measure Software Performance of Data Plane Applications"
> talk at DPDK Userspace in 2019 talking about workflow/method: https://www.youtube.com/watch?v=ZmwOKR5JyPk
Great, thanks for sharing it.
> <snip lots of ISA details>
> > > I'll stop promoting ISA here, but am happy to continue detailed discussions, or break out
> > > conversations about specific areas of compute in OVS if there's appetite for that! Feel free
> > > to email to OVS Mailing list (with me on CC please :) or email directly OK too.
> > I am definitely learning more about it and I appreciated your
> > longer reply.
> As you may notice, this is an area I'm passionate about. If there's specific interest,
> I can volunteer to try to cover a "measuring OVS's SW datapath performance" talk at a
> future OVS conference..
I'd say that interesting talks are always welcome! :)
One thing that may interest you is increasing datapath
visibility with regard to performance. Today there are some
statistics, but maybe there could be more to potentially help
to monitor or pinpoint permanent (or transient) bottlenecks,
CPU cache misses, and so on. Given that the datapath deals
with traffic and flow tables, and that both can be unpredictable,
the more visibility we have on how efficiently it is running, the better.
One idea that comes to mind after reviewing these patches as
an example, is that it seems cheap now to build a histogram
of how many different flows were used in a single batch. Say
that OVS received 32 packets in a batch, 30 of them matched
a single flow while the remaining 2 matched another flow. It
could build a histogram per port on how many flows were used per batch.
Again, that is just an example of stats at batching and
flow processing level that would be helpful to understand
workloads and apparently could leverage AVX512.
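
A rough scalar sketch of that histogram idea follows. Everything in it is
hypothetical: struct port_flow_hist, BATCH_MAX and the two functions are
names invented here, not existing OVS code, and it assumes the per-packet
flow lookup results are available as an array of pointers. It only shows
the bookkeeping, not an AVX512 implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 32    /* Assumed maximum number of packets per batch. */

/* Hypothetical per-port statistic: bins[n] counts the batches whose
 * packets matched exactly n distinct flows. */
struct port_flow_hist {
    uint64_t bins[BATCH_MAX + 1];
};

/* Count distinct flow pointers in a batch. Quadratic in the batch size,
 * which is bounded by BATCH_MAX. */
static size_t
count_distinct_flows(const void *const flows[], size_t n_pkts)
{
    size_t distinct = 0;

    for (size_t i = 0; i < n_pkts; i++) {
        size_t j;

        for (j = 0; j < i; j++) {
            if (flows[j] == flows[i]) {
                break;          /* Flow already seen earlier in the batch. */
            }
        }
        if (j == i) {
            distinct++;         /* First packet in the batch for this flow. */
        }
    }
    return distinct;
}

/* Record one batch in the port's histogram. Empty or oversized batches
 * are ignored. */
static void
port_flow_hist_update(struct port_flow_hist *hist,
                      const void *const flows[], size_t n_pkts)
{
    if (n_pkts == 0 || n_pkts > BATCH_MAX) {
        return;
    }
    hist->bins[count_distinct_flows(flows, n_pkts)]++;
}
```

With the batch from the example above (32 packets, 30 matching one flow
and 2 matching another), one call to port_flow_hist_update() increments
bins[2].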