[ovs-dev] criteria/benchmarks for an improved userspace datapath classifier?
Van Haaren, Harry
harry.van.haaren at intel.com
Thu Jul 4 12:53:46 UTC 2019
> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets at samsung.com]
> Sent: Thursday, July 4, 2019 1:43 PM
> To: Michal Orsák <michal.orsak at gmail.com>; Stokes, Ian <ian.stokes at intel.com>;
> Ben Pfaff <blp at ovn.org>; Gianni Antichi <g.antichi at qmul.ac.uk>; William Tu
> <u9012063 at gmail.com>
> Cc: dev at openvswitch.org; Van Haaren, Harry <harry.van.haaren at intel.com>;
> Flavio Leitner <fbl at sysclose.org>
> Subject: Re: criteria/benchmarks for an improved userspace datapath
> On 03.07.2019 0:03, Michal Orsák wrote:
> > Hello everyone,
> Hi Michal.
> > On 02/07/2019 19:43, Ian Stokes wrote:
> >> On 7/2/2019 6:00 PM, Ben Pfaff wrote:
> >>> Hi Ilya and Ian. Please allow me to introduce Michal Orsak, a grad
> >>> student currently looking at packet classifiers. He's implemented a
> >>> novel classifier that is faster than the one already in OVS in the
> >>> benchmarks that he's run. His classifier is tree-based, like most
> >>> high-performance classifiers, but also incremental so that flows can be
> >>> inserted and deleted individually without undue delay. Ultimately, it
> >>> might make sense to replace the OVS userspace datapath classifier by one
> >>> based on the concepts that he's come up with.
> >> Thanks for the introduction Ben. I wasn't aware of Michal's work. The timing
> is quite apt, as I've currently been looking at Harry's (CC'd) approach to
> improving performance for the current DPCLS.
> >> https://mail.openvswitch.org/pipermail/ovs-dev/2019-May/358790.html
> >> I was hoping to have this approach in place for OVS 2.12, as there's been
> ongoing testing/review across the community, with general improvements of
> ~15% or more on both x86 and ARM architectures, which looks promising.
> >> I'd be interested in seeing Michal's proposals; if there are patches
> available, that would be great.
> > I am now implementing the DPCLS API so that we can test it. There are
> also some differences and problems, such as:
> > * support for priority of the rules
> > * only single matching rule
> > * C++
> > * my limited knowledge of DPCLS
> One important thing to consider about datapath flow rules is that
> datapath flows, unlike OF rules, have no priorities, i.e. only one
> flow in dpcls can match a given packet. So the order in which you
> check flows doesn't matter for correctness, but it does matter for
> performance: the sooner you find the right flow, the faster your
> classifier will be.
> The current implementation of DPCLS in dpif-netdev looks like this:
> * Each thread has its own set of dpcls classifiers (pmd->classifiers),
> one classifier per in_port.
> * Each dpcls classifier is a set of subtables (cls->subtables),
> one subtable per key mask (a bitmask of the fields that the flow
> matches on).
> * Each subtable is a set of rules (subtable->rules). All rules in a
> subtable match different values of the same fields (from the key mask).
> Lookup method:
> 1. Get the dpcls from pmd->classifiers by in_port.
> 2. For each subtable in dpcls:
> 3. Try to find the matching rule in subtable->rules.
> 4. If found --> break.
> Important optimizations:
> * pmd->classifiers is a hash map that provides a fast lookup.
> * subtable->rules is a hash map too.
> * Lookup in subtable->rules is batched, i.e. we try to find
> rules for up to 32 packets at the same time.
> * dpcls collects hit counts for subtables and sorts them, so
> the 'hottest' subtables are always checked first. This speeds
> up the lookup process.
Great explanation Ilya, we should try to incorporate this level of detail in the docs somewhere.
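For anyone following along, the structure Ilya describes can be sketched roughly like this (illustrative Python, not the actual OVS C code; all names here are made up, and the real implementation uses cmap hash maps and batches lookups for up to 32 packets):

```python
# Simplified sketch of the dpcls lookup described above.  Names are
# illustrative only -- the real OVS code is C, uses cmap hash maps,
# and batches lookups for up to 32 packets at a time.

class Subtable:
    """One subtable per key mask; all rules share the same mask."""
    def __init__(self, mask):
        self.mask = mask
        self.rules = {}      # masked key -> action (a hash map, as in OVS)
        self.hit_cnt = 0     # used to keep the 'hottest' subtables first

    def insert(self, key, action):
        self.rules[key & self.mask] = action

class Dpcls:
    """One classifier instance per in_port."""
    def __init__(self):
        self.subtables = []  # kept sorted by hit count, hottest first

    def subtable_for_mask(self, mask):
        for st in self.subtables:
            if st.mask == mask:
                return st
        st = Subtable(mask)
        self.subtables.append(st)
        return st

    def lookup(self, key):
        # Datapath flows have no priorities, so at most one rule can
        # match: stop at the first hit.
        for st in self.subtables:
            action = st.rules.get(key & st.mask)
            if action is not None:
                st.hit_cnt += 1
                return action
        return None  # miss -> upcall to the slow path

    def resort(self):
        # Periodically re-sort so the hottest subtables are checked first.
        self.subtables.sort(key=lambda st: st.hit_cnt, reverse=True)

cls = Dpcls()
cls.subtable_for_mask(0xFF00).insert(0x1234, "output:1")  # matches high byte
cls.subtable_for_mask(0xFFFF).insert(0x3456, "output:2")  # exact match
print(cls.lookup(0x12AB))  # -> output:1 (0x12AB & 0xFF00 == 0x1200)
print(cls.lookup(0x3456))  # -> output:2
print(cls.lookup(0x9999))  # -> None (miss)
cls.resort()               # keep the hottest subtable first
```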
> >>> A difficulty with classifiers, however, is coming up with an appropriate
> >>> set of benchmarks to compare them fairly. The userspace datapath
> >>> focuses on performance, so do you have a set of benchmarks that you
> >>> recommend for comparison? Are there other criteria that would be
> >>> important (besides correctness)?
> >> Agreed, that's something we've been looking for feedback on from the
> community to date. Most testing we have seen has been performance-focused.
> > There are so many corner cases... Any test is appreciated.
> >> The areas I think would be good to measure are flow addition/deletion
> overhead in CPU cycles. Memory and time complexity comparisons between the
> existing and proposed techniques would also be nice to see.
> > Currently I have comparisons with OVS TSS, DPDK librte_acl, PartitionSort,
> HyperSplit, HyperCuts and a few others (C-based benchmarks, some of them
> unoptimized implementations, originally written for the development of a
> hardware-based classifier). All of the previously mentioned algorithms except
> TSS have worst-case complexities from the same class. However, it is more
> complicated than that, as it depends on the size of the matching field and
> other factors which have different meanings between algorithms. I do not
> have short, understandable definitions yet.
> Scenarios for performance testing are always a hard topic, as we
> still haven't found a common set of cases that works for everyone.
> Similar to Flavio, I'm mostly working with OVS in OpenStack setups,
> so the cases are similar: not a very big number of flows, mostly
> NORMAL actions. What I'm trying to test while evaluating
> performance-related changes (besides the typical PHY-PHY and PVP
> forwarding) is PHY-VM-PHY forwarding where PHY is a balanced-tcp
> bonding port. In this case, all the packets that go from VM to PHY
> pass through recirculation and match twice in the datapath. Here the
> speed of the classifier matters a lot, even if we don't have many
> rules installed.
> I'm usually testing the matrix of packet sizes (64 - 1518) and number
> of IPv4 traffic flows (1 - 1M).
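For reference, that size/flow-count matrix can be generated like so (a sketch; the specific intermediate step values are my assumption, since the mail only gives the endpoints of the ranges):

```python
# Illustrative test matrix: every combination of packet size and IPv4
# flow count.  Only the endpoints (64-1518 bytes, 1-1M flows) come from
# the discussion above; the intermediate steps are assumed for the sketch.
import itertools

packet_sizes = [64, 128, 256, 512, 1024, 1518]                 # bytes
flow_counts = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000]  # IPv4 flows

matrix = list(itertools.product(packet_sizes, flow_counts))
print(len(matrix))  # 6 sizes x 7 flow counts = 42 test points
for size, flows in matrix[:2]:
    print(f"pkt={size}B flows={flows}")
```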
> However, OF rules differ from the datapath rules in dpcls. The latter
> are usually simpler due to the flow translation algorithms. When NORMAL
> actions prevail, the datapath flow table looks more like the learning
> table of an L2 switch, where you mostly just need to match on source
> and destination MACs. But there are still cases where dpcls rules can
> be complex anyway.
I'm very happy to see this discussion on the list - it's great to get
visibility into what users are actually doing with their DPCLS instances!
A few notes: if people can share how many bits are set in the miniflow
that their hottest subtables match on, then we can optimize around that.
(Details available in the commit messages of the DPCLS Func Ptrs patchset:
Cover letter: https://patchwork.ozlabs.org/cover/1097104/
Actual detail: https://patchwork.ozlabs.org/patch/1097111/ )
I would love to see the benchmarks the community is testing with integrated
into the OVS git repository - a number that the SW optimization folks can
work on increasing, with confidence that real-world usage will benefit!
Open Questions re performance not yet answered above:
- What is the signature of your hottest subtable's miniflow?
- Approx how many rules are active in your DPCLS instance?
> Best regards, Ilya Maximets.
Again, it's great to see this level of collaboration on performance - let's
keep this going and get (any implementation of) DPCLS racing even faster :)
> >> Thanks
> >> Ian
> >>> (I'd take answers from anyone, not just Ian and Ilya!)
> >>> Thanks,
> >>> Ben.
> > Thanks,
> > Michal