[ovs-dev] [PATCH v2 0/5] DPCLS Subtable ISA Optimization

Mon May 18 11:34:09 UTC 2020

> -----Original Message-----
> From: William Tu <u9012063 at gmail.com>
> Sent: Saturday, May 16, 2020 5:01 AM
> To: Van Haaren, Harry <harry.van.haaren at intel.com>
> Cc: ovs-dev <ovs-dev at openvswitch.org>; Ilya Maximets <i.maximets at ovn.org>
> Subject: Re: [ovs-dev] [PATCH v2 0/5] DPCLS Subtable ISA Optimization
> 
> Hi Harry,

Hey William,

> Thanks for the patch, I learn a lot from them.

Cool, yeah it's been fun for me learning about the OVS datapath at this level.

> On Wed, May 6, 2020 at 6:05 AM Harry van Haaren
> <harry.van.haaren at intel.com> wrote:
> >
> > This patchset implements the changes as proposed during the
> > OVS Conf '19, in the talk "Next steps for SW Datapath".
> > Youtube link: https://youtu.be/x0bOpojnpmU
<snip>
> > Patch 5/5:
> > Actual AVX-512 implementation for DPCLS subtable search. This is the
> > actual SIMD vector code, which performs DPCLS miniflow iteration in
> > parallel.
> >
> From your previous slides and patch5, I roughly understand the avx code logic.

If you have any questions, feel free to ask! The SIMD design & implementation can be
difficult to understand, and I'd be happy to help if you're curious about specific aspects.

> I'm also thinking about a very rough idea.
> I wonder if it is possible to use avx scatter function to implement miniflow_expand.

Does miniflow expand consume a significant number of cycles in your use-case? I know it's used
to decompress a miniflow as required for OF updates etc., but on the datapath it shouldn't matter.
If there's a benchmark that shows mf expand to be a hotspot, that would be very interesting!

You're right that AVX scatter could be used to perform the writes from a single AVX register.
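
For reference, here is a rough scalar sketch of what an expand does. It is simplified
to a single 64-bit map, and the names (sketch_miniflow, sketch_miniflow_expand) are
hypothetical, not the actual OVS structures or API. It also assumes GCC/Clang for
__builtin_ctzll. A scatter-based version would replace the per-block stores in the
loop with one scatter instruction writing up to 8 blocks at computed indices.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: number of 64-bit blocks in the expanded layout.
 * Only one 64-bit map unit is modelled in this sketch. */
#define SKETCH_FLOW_U64S 64

/* Hypothetical, simplified stand-in for a miniflow: a bitmap saying which
 * 64-bit blocks are present, plus those blocks packed contiguously. */
struct sketch_miniflow {
    uint64_t map;
    uint64_t values[SKETCH_FLOW_U64S];
};

/* Expands 'mf' into the fixed-offset array 'dst', zeroing absent blocks.
 * Returns the number of blocks written. */
static inline unsigned
sketch_miniflow_expand(const struct sketch_miniflow *mf,
                       uint64_t dst[SKETCH_FLOW_U64S])
{
    uint64_t map = mf->map;
    unsigned n = 0;

    memset(dst, 0, SKETCH_FLOW_U64S * sizeof dst[0]);
    while (map) {
        /* Index of the lowest set bit is the block's fixed position. */
        unsigned idx = (unsigned) __builtin_ctzll(map);

        assert(idx < SKETCH_FLOW_U64S);
        dst[idx] = mf->values[n++];   /* One "scattered" store. */
        map &= map - 1;               /* Clear the lowest set bit. */
    }
    return n;
}
```

Each set bit in the map turns into one store at a fixed offset, which is the access
pattern a scatter is designed for.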

> And for lookup a subtable, we can expand to the origin "struct flow" memory
> layouts for both packets and subtable->mf.
> So each field for each packet is at a fixed offset from the mf values.
> This wastes some memory due to expand but makes rule match keys easier?

My concern here is that "miniflow" has this very nice attribute that it is compressed, and
hence requires fewer cache lines than the full "struct flow". In particular, the miniflow
is contiguous, meaning utilization of the cache lines is 100%. Typical miniflow sizes for
outer packets are ~6 miniflow blocks, so ~6*8 bytes (uint64_t) + 2 bytes for "bits".
That means simple packets are resident in a single cache line, and many tunneled packets
can be represented by 2 cache lines.
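
To make the arithmetic concrete, a small sketch follows. The function name is
hypothetical, a 64-byte cache line is assumed, and the miniflow is assumed to start
on a cache-line boundary. The "bits" size is passed in as a parameter rather than
hard-coded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE_BYTES 64   /* Assumption: 64-byte cache lines. */

/* Cache lines occupied by a contiguous miniflow of 'n_blocks' 64-bit
 * blocks plus 'map_bytes' of "bits", assuming cache-line alignment. */
static inline size_t
sketch_miniflow_cache_lines(size_t n_blocks, size_t map_bytes)
{
    size_t bytes = map_bytes + n_blocks * sizeof(uint64_t);

    assert(bytes > 0);
    /* Round up to whole cache lines. */
    return (bytes + SKETCH_CACHE_LINE_BYTES - 1) / SKETCH_CACHE_LINE_BYTES;
}
```

With the figures above, sketch_miniflow_cache_lines(6, 2) covers 50 bytes, so one
cache line, while a 12-block miniflow (an illustrative size for a tunneled packet)
covers 98 bytes, so two.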

Matching on "struct flow" would imply a sparsely populated region of 672 bytes, and depending
on the exact contents being matched on, could touch anywhere from 2 to X cache lines. Generally
compute is more performant than memory accesses that aren't cache-local, so I'm not sure this
is really going to give performance benefits in the bigger picture.
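
As a rough way to estimate that spread, the sketch below counts the distinct cache
lines touched by a set of 8-byte fields inside a 672-byte region. The function name
and the offsets used with it are hypothetical, not real "struct flow" field offsets,
and the region is assumed to start on a 64-byte boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE_BYTES 64   /* Assumption: 64-byte cache lines. */
#define SKETCH_FLOW_BYTES 672        /* Region size quoted above. */

/* Counts distinct cache lines touched by 'n' 8-byte blocks located at the
 * given byte offsets within the region. */
static inline unsigned
sketch_lines_touched(const size_t *offsets, size_t n)
{
    uint32_t seen = 0;   /* One bit per cache line; 11 lines at most. */
    unsigned count = 0;

    for (size_t i = 0; i < n; i++) {
        assert(offsets[i] % 8 == 0);
        assert(offsets[i] + 8 <= SKETCH_FLOW_BYTES);

        uint32_t bit = UINT32_C(1) << (offsets[i] / SKETCH_CACHE_LINE_BYTES);
        if (!(seen & bit)) {
            seen |= bit;
            count++;
        }
    }
    return count;
}
```

For example, six blocks at offsets 0, 8, 200, 208, 400 and 664 land on four
different cache lines, whereas the same six blocks packed contiguously fit in one.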

> Regards,
> William

Cheers for having a look at the patchset! -Harry