[ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation

Jan Scheurich jan.scheurich at ericsson.com
Wed Nov 29 18:08:05 UTC 2017


Hi Yipeng,

The benchmark I used is the same we applied last year when developing the first series of DPDK datapath optimizations. The test setup is described in fair detail in the linked PDF:
https://drive.google.com/open?id=0ByBuumQUR_NYMTlmZFFuQmJUYmM.

The OVS bridge setup and the OpenFlow pipelines are modelled after the Netvirt service in OpenDaylight, acting as the SDN backend for OpenStack Neutron.

We are working on a more up-to-date variant of the OpenFlow test pipeline that includes newer features, such as stateful Security Groups using conntrack, to get even closer to realistic deployments. This should give us a slightly increased average number of subtable lookups; the current tests consistently require only 1 subtable lookup per EMC/DFC miss.

This is due to two factors:
1. An exact match on a UDP port in a security rule flow entry is translated into a family of up to 16 datapath flow entries with prefix matches on the UDP port, for prefix lengths 1 through 16, with a decreasing share of hits: the "/1" port mask catches 50% of packets, "/2" 25%, "/3" 12.5%, and so on. Due to the sorting of subtables, this requires 2 subtable lookups per packet on average (see the sketch after this list).
2. The conntrack recirculation itself, which matches on megaflows with their own mask.
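
As a back-of-envelope check of factor 1 (my own illustrative sketch, not code from either patch): with the subtables sorted by hit frequency, the "/k" port-mask subtable is probed k-th and hit with probability 2^-k, so the expected number of lookups is the sum of k * 2^-k, which converges to 2:

    /* Expected subtable lookups for the geometric hit distribution above. */
    #include <stdio.h>

    int main(void)
    {
        double expected = 0.0;

        for (int k = 1; k <= 16; k++) {
            expected += k / (double) (1u << k);   /* k * 2^-k */
        }
        printf("expected lookups: %.4f\n", expected);   /* ~1.9997 */
        return 0;
    }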

Still, this would be far from your tests with ~20 equally visited subtables.

BR, Jan

> -----Original Message-----
> From: Wang, Yipeng1 [mailto:yipeng1.wang at intel.com]
> Sent: Tuesday, 21 November, 2017 23:17
> To: Jan Scheurich <jan.scheurich at ericsson.com>; dev at openvswitch.org
> Cc: Gobriel, Sameh <sameh.gobriel at intel.com>; Fischetti, Antonio <antonio.fischetti at intel.com>; Ben Pfaff (blp at ovn.org) <blp at ovn.org>
> Subject: RE: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
> 
> Hi, Jan,
> 
> Thanks for sharing the patch; it was nice talking with you at the conference. Could you tell us more about the test case you used, such as
> the rule set and traffic pattern? Since we have some other work at hand that needs to be done soon, we will definitely test the patch later
> and get back to you in a week or two.
> 
> Thanks,
> Yipeng
> 
> > -----Original Message-----
> > From: Jan Scheurich [mailto:jan.scheurich at ericsson.com]
> > Sent: Monday, November 20, 2017 9:39 AM
> > To: Wang, Yipeng1 <yipeng1.wang at intel.com>; dev at openvswitch.org
> > Cc: Gobriel, Sameh <sameh.gobriel at intel.com>; Fischetti, Antonio
> > <antonio.fischetti at intel.com>; dball at vmware.com; Ben Pfaff (blp at ovn.org)
> > <blp at ovn.org>
> > Subject: RE: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
> >
> > Hi Yipeng and Sameh,
> >
> > As discussed on the OVS conference, I have just sent my Datapath Flow
> > Cache patch to the ML:
> > https://mail.openvswitch.org/pipermail/ovs-dev/2017-November/341066.html
> > http://patchwork.ozlabs.org/patch/839645/
> >
> > I would be happy if you could test it in your setup and compare the
> > performance with your CD implementation. I don't have a test bed at hand
> > where I could easily test the speedup for many visited subtables.
> >
> > Thanks, Jan
> >
> > > -----Original Message-----
> > > From: Yipeng Wang [mailto:yipeng1.wang at intel.com]
> > > Sent: Wednesday, 01 November, 2017 00:40
> > > To: dev at openvswitch.org
> > > Cc: yipeng1.wang at intel.com; sameh.gobriel at intel.com;
> > antonio.fischetti at intel.com; dball at vmware.com; Jan Scheurich
> > > <jan.scheurich at ericsson.com>
> > > Subject: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
> > >
> > > The Datapath Classifier uses tuple space search for flow classification.
> > > The rules are arranged into a set of tuples/subtables (each with a
> > > distinct mask).  Each subtable is implemented as a hash table and lookup
> > > is done with flow keys formed by selecting the bits from the packet header
> > > based on each subtable's mask. Tuple space search will sequentially search
> > > each subtable until a match is found. With a large number of subtables, a
> > > sequential search of the subtables could consume a lot of CPU cycles. In
> > > a testbench with a uniform traffic pattern equally distributed across 20
> > > subtables, we measured that up to 65% of total execution time is
> > > attributed to the megaflow cache lookup.
> > >
> > > This patch presents the idea of a two-layer hierarchical lookup: a
> > > low-overhead first level of indirection is consulted first; we call this
> > > level the cuckoo distributor (CD). If a flow key has been inserted in the
> > > flow table, the first level indicates with high probability which
> > > subtable to look into. A lookup is then performed on the second level (the
> > > target subtable) to retrieve the result. If the key doesn't have a match,
> > > we fall back to the sequential search of subtables. The patch is
> > > partially inspired by earlier concepts proposed in "simTable" [1] and
> > > "Cuckoo Filter" [2], and by DPDK's Cuckoo Hash implementation.
> > >
> > > This patch can improve on the existing Subtable Ranking when the traffic
> > > has high entropy. Subtable Ranking helps minimize the number of
> > > traversed subtables when most of the traffic hits the same subtable.
> > > However, with high-entropy traffic, such as traffic coming from
> > > a physical port, multiple subtables can be hit with similar frequency,
> > > and the average number of subtable lookups per hit becomes much greater
> > > than 1. In addition, CD adaptively turns itself off when it finds that
> > > the traffic mostly hits one subtable, so CD adds no overhead when
> > > Subtable Ranking works well.
> > >
> > > Scheme:
> > > CD sits in front of the subtables. On a CD hit, a packet is directed to the
> > > corresponding subtable instead of searching each subtable sequentially.
> > >  -------
> > > |  CD   |
> > >  -------
> > >        \
> > >         \
> > >  -----  -----     -----
> > > |sub  ||sub  |...|sub  |
> > > |table||table|   |table|
> > >  -----  -----     -----
> > >
> > >  Evaluation:
> > >  ----------
> > > We create a set of rules with varying src IP prefixes. We feed traffic
> > > containing various numbers of flows with varying src IP and dst IP. All the
> > > flows hit the 10/20/30 rules, creating 10/20/30 subtables. We explain the
> > > rule/traffic setup in detail below.
> > >
> > > The tables below show the preliminary continuous testing results (full
> > > line-speed test) we collected with a uni-directional phy-to-phy setup. OvS
> > > runs with 1 PMD. We use Spirent as the hardware traffic generator.
> > >
> > >  Before v2 rebase:
> > >  ----
> > > AVX2 data:
> > > 20k flows:
> > > no.subtable: 10          20          30
> > > cd-ovs       4267332     3478251     3126763
> > > orig-ovs     3260883     2174551     1689981
> > > speedup      1.31x       1.60x       1.85x
> > >
> > > 100k flows:
> > > no.subtable: 10          20          30
> > > cd-ovs       4015783     3276100     2970645
> > > orig-ovs     2692882     1711955     1302321
> > > speedup      1.49x       1.91x       2.28x
> > >
> > > 1M flows:
> > > no.subtable: 10          20          30
> > > cd-ovs       3895961     3170530     2968555
> > > orig-ovs     2683455     1646227     1240501
> > > speedup      1.45x       1.92x       2.39x
> > >
> > > Scalar data:
> > > 1M flows:
> > > no.subtable: 10          20          30
> > > cd-ovs       3658328     3028111     2863329
> > > orig-ovs     2683455     1646227     1240501
> > > speedup      1.36x       1.84x       2.31x
> > >
> > >  After v2 rebase:
> > >  ----
> > > After the rebase, we retested the 1M flows, 20 subtables case; the results still hold.
> > > 1M flows:
> > > no.subtable:   20
> > > cd-ovs         3066483
> > > orig-ovs       1588049
> > > speedup        1.93x
> > >
> > >
> > >  Test rules/traffic setup:
> > >  ----
> > > To set up a test case with 20 subtables, we use a rule set like the one below:
> > > tcp,nw_src=1.0.0.0/8, actions=output:1
> > > udp,nw_src=2.0.0.0/9, actions=output:1
> > > udp,nw_src=3.0.0.0/10,actions=output:1
> > > udp,nw_src=4.0.0.0/11,actions=output:1
> > > ...
> > > udp,nw_src=18.0.0.0/25,actions=output:1
> > > udp,nw_src=19.0.0.0/26,actions=output:1
> > > udp,nw_src=20.0.0.0/27,actions=output:1
> > >
> > > Then, with the traffic generator, we generate corresponding traffic with
> > > src_ip varying from 1.0.0.0 to 20.0.0.0. For each src_ip, we vary
> > > dst_ip over 50000 different values. This effectively generates 1M
> > > different flows hitting the 20 rules we created. Because of the different
> > > wildcarded bits in nw_src, the 20 rules fall into 20 distinct subtables.
> > > We use 64-byte packets across all tests.
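> > >
> > > For example, assuming a bridge named br0 (the bridge name is
> > > illustrative), the rule set above can be installed with ovs-ofctl:
> > >
> > >     ovs-ofctl add-flow br0 "tcp,nw_src=1.0.0.0/8,actions=output:1"
> > >     for i in $(seq 2 20); do
> > >         ovs-ofctl add-flow br0 "udp,nw_src=$i.0.0.0/$((i+7)),actions=output:1"
> > >     done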
> > >
> > > How to check if CD works or not for your use case:
> > >  ----
> > > CD cannot improve throughput for all use cases. It targets use cases where
> > > multiple subtables exist and the top-ranked subtable is not hit by the
> > > vast majority of the traffic.
> > >
> > > One can use the $OVS_DIR/utilities/ovs-appctl dpif-netdev/pmd-stats-show
> > > command to check the CD statistics: hit/miss.
> > > Another statistic shown there is "avg. subtable lookups per hit".
> > > In our test case, the original OvS shows an average subtable lookups
> > > value of 10: there are 20 subtables in total, and on average a hit
> > > happens after iterating over half of them. Iterating over 10 subtables
> > > is very expensive.
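> > >
> > > (Worked out: with N = 20 equally hit subtables, the expected probe count
> > > is (1 + 2 + ... + 20) / 20 = 10.5, hence the ~10 reported above.)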
> > >
> > > With CD, this value is close to 1, meaning that on average only 1
> > > subtable needs to be probed to hit the rule, which removes a lot of
> > > overhead.
> > >
> > > Other statistics worth noting are "megaflow hits" and "emc hits".
> > > If most packets hit the EMC, CD does not improve throughput much, since
> > > CD optimizes the megaflow search rather than the EMC lookup. If your
> > > test case has fewer than 8k flows, all of them may hit the EMC.
> > >
> > > Note that CD is adaptively turned on/off according to the number of
> > > subtables and their hit pattern. If it finds there is not much benefit,
> > > CD turns itself off automatically (see the illustrative sketch below).
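> > >
> > > One plausible shape for such a policy (purely illustrative; not
> > > necessarily the heuristic this patch implements) is to periodically
> > > sample the recent average subtable lookups per megaflow hit and gate
> > > CD on it:
> > >
> > >     /* Illustrative only: re-evaluate once per sampling period. */
> > >     void
> > >     cd_maybe_toggle(struct cd *cd, uint64_t lookups, uint64_t hits)
> > >     {
> > >         double avg = hits ? (double) lookups / hits : 0.0;
> > >
> > >         /* If Subtable Ranking alone already finds rules in ~1 probe,
> > >          * CD adds no benefit, so switch it off; re-enable once the
> > >          * traffic spreads across subtables again. */
> > >         cd->enabled = avg > 1.2;    /* threshold is illustrative */
> > >     }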
> > >
> > >
> > >  References:
> > >  ----------
> > > [1] H. Lee and B. Lee, "Approaches for Improving Tuple Space Search-Based
> > >     Table Lookup," ICTC '15.
> > > [2] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher,
> > >     "Cuckoo Filter: Practically Better Than Bloom," CoNEXT '14.
> > >
> > > The previous RFC on the mailing list is at:
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/330570.html
> > >
> > > v2: Rebase to master head.
> > >     Add more testing details in cover letter.
> > >     Change commit messages.
> > >     Minor style changes to code.
> > >     Fix build errors that happen without the AVX and DPDK libraries.
> > >
> > > Yipeng Wang (5):
> > >   dpif-netdev: Basic CD feature with scalar lookup.
> > >   dpif-netdev: Add AVX2 implementation for CD lookup.
> > >   dpif-netdev: Add CD statistics
> > >   dpif-netdev: Add adaptive CD mechanism
> > >   unit-test: Add a delay for CD initialization.
> > >
> > >  lib/dpif-netdev.c     | 567 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  tests/ofproto-dpif.at |   3 +
> > >  2 files changed, 560 insertions(+), 10 deletions(-)
> > >
> > > --
> > > 2.7.4


