[ovs-dev] [PATCH v2 0/5] DPCLS Subtable ISA Optimization

Mon May 18 13:10:17 UTC 2020

On Mon, May 18, 2020 at 4:34 AM Van Haaren, Harry
<harry.van.haaren at intel.com> wrote:
>
> > -----Original Message-----
> > From: William Tu <u9012063 at gmail.com>
> > Sent: Saturday, May 16, 2020 5:01 AM
> > To: Van Haaren, Harry <harry.van.haaren at intel.com>
> > Cc: ovs-dev <ovs-dev at openvswitch.org>; Ilya Maximets <i.maximets at ovn.org>
> > Subject: Re: [ovs-dev] [PATCH v2 0/5] DPCLS Subtable ISA Optimization
> >
> > Hi Harry,
>
> Hey William,
>
> > Thanks for the patches, I learned a lot from them.
>
> Cool, yeah it's been fun for me learning about the OVS datapath at this level.
>
> > On Wed, May 6, 2020 at 6:05 AM Harry van Haaren
> > <harry.van.haaren at intel.com> wrote:
> > >
> > > This patchset implements the changes as proposed during the
> > > OVS Conf '19, in the talk "Next steps for SW Datapath".
> > > Youtube link: https://youtu.be/x0bOpojnpmU
> <snip>
> > > Patch 5/5:
> > > Actual AVX-512 implementation for DPCLS subtable search. This is the
> > > actual SIMD vector code, which performs DPCLS miniflow iteration in
> > > parallel.
> > >
> > From your previous slides and patch5, I roughly understand the avx code logic.
>
> Any questions feel free to ask! The SIMD design & implementation can be difficult
> to understand, I'd be happy to help if you're curious about specific aspects.
>
> > I'm also thinking about a very rough idea.
> > I wonder if it is possible to use the AVX scatter instructions to implement miniflow_expand.
>
> Does miniflow expand take a significant number of cycles in your use-case? I know it's used to decompress
> a miniflow as required for OF updates etc, but on the datapath it shouldn't matter? If there's a
> benchmark to run that shows mf expand to be a hotspot that would be very interesting!
>
> You're right that AVX scatter could be used to perform the writes from a single AVX register.
>
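For readers following along, here is a rough scalar sketch of what "expand" means: walk the bitmap and write each packed value to its fixed slot in a sparse array. This is not the OVS code; `toy_miniflow`, `TOY_FLOW_U64S` and `toy_miniflow_expand` are hypothetical names, and a single 64-bit map stands in for the real flowmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a miniflow (NOT the OVS struct):
 * one 64-bit map, followed by the packed 64-bit blocks it describes. */
#define TOY_FLOW_U64S 64

struct toy_miniflow {
    uint64_t map;                    /* Bit i set => block i is present. */
    uint64_t values[TOY_FLOW_U64S];  /* Present blocks, ascending bit order. */
};

/* Expands 'mf' into the sparse array 'flow': absent blocks become zero and
 * each present block lands at the index given by its bit position.
 * Returns the number of blocks written. */
size_t
toy_miniflow_expand(const struct toy_miniflow *mf, uint64_t *flow)
{
    size_t n = 0;

    memset(flow, 0, TOY_FLOW_U64S * sizeof flow[0]);
    for (unsigned int idx = 0; idx < TOY_FLOW_U64S; idx++) {
        if ((mf->map >> idx) & 1) {
            assert(n < TOY_FLOW_U64S);
            flow[idx] = mf->values[n++];  /* The "scatter" write. */
        }
    }
    return n;
}
```

In a SIMD variant, the per-block writes in the loop are the part a scatter intrinsic such as `_mm512_i64scatter_epi64` could in principle replace, with up to eight block indices held in one vector; that is only a sketch of the idea discussed here, not something measured.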
> > And to look up a subtable, we can expand both the packets and
> > subtable->mf to the original "struct flow" memory layout.
> > So each field for each packet is at a fixed offset from the mf values.
> > This wastes some memory due to the expansion, but makes matching on rule keys easier?
>
> My concern here is that "miniflow" has this very nice attribute that it is compressed, and
> hence requires fewer cache lines than the full "struct flow". Particularly, the miniflow
> is contiguous, meaning utilization of the cache lines is 100%. Typical miniflow sizes for
> outer packets are ~6 or so miniflow blocks, so ~6*8bytes (uint64_t) + 2 bytes for "bits".
> That means simple packets are resident in a single cache-line, and many tunneled packets
> can be represented by 2 cache-lines.
>
> Matching on "struct flow" would imply a sparsely populated region of 672 bytes, and depending
> on the exact contents being matched on, could be anywhere from 2-X cache lines? Generally
> compute is more performant than memory accesses that aren't cache local, so I'm not sure this is really
> going to give performance benefits in the bigger picture.
>
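The cache-line arithmetic above can be checked with a small sketch. It assumes 64-byte cache lines and cache-line-aligned data, and takes the block count, the "bits" size and the 672-byte figure straight from the numbers quoted in this thread; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines touched by 'bytes' contiguous bytes, assuming the
 * region starts on a cache-line boundary. */
size_t
cache_lines_spanned(size_t bytes, size_t line_bytes)
{
    assert(line_bytes > 0);
    return (bytes + line_bytes - 1) / line_bytes;
}

/* Size of a packed miniflow with 'n_blocks' 64-bit blocks plus 'bits_bytes'
 * of bitmap, using the figures quoted in the mail (an approximation). */
size_t
toy_miniflow_bytes(size_t n_blocks, size_t bits_bytes)
{
    return n_blocks * sizeof(uint64_t) + bits_bytes;
}
```

With those assumptions, `toy_miniflow_bytes(6, 2)` is 50 bytes, which fits in one cache line, while a fully touched 672-byte region would span 11 lines. A sparse match would touch fewer than that, but the lines it does touch would be only partially used.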
Hi Harry,
Thanks for your explanation! And yes, the cache-line miss overhead is definitely
more important. Now I understand the design.
William