[ovs-dev] [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation

Wang, Yipeng1 yipeng1.wang at intel.com
Tue Nov 7 17:17:02 UTC 2017


We are waiting for more comments before reworking this patchset. Please feel free to post any comments.

Thanks!

> -----Original Message-----
> From: Wang, Yipeng1
> Sent: Tuesday, October 31, 2017 4:40 PM
> To: dev at openvswitch.org
> Cc: Wang, Yipeng1 <yipeng1.wang at intel.com>; Gobriel, Sameh
> <sameh.gobriel at intel.com>; Fischetti, Antonio
> <antonio.fischetti at intel.com>; dball at vmware.com;
> jan.scheurich at ericsson.com
> Subject: [PATCH v2 0/5] dpif-netdev: Cuckoo-Distributor implementation
> 
> The Datapath Classifier uses tuple space search for flow classification.
> The rules are arranged into a set of tuples/subtables (each with a
> distinct mask).  Each subtable is implemented as a hash table and lookup
> is done with flow keys formed by selecting the bits from the packet header
> based on each subtable's mask. Tuple space search will sequentially search
> each subtable until a match is found. With a large number of subtables, a
> sequential search of the subtables could consume a lot of CPU cycles. In
> a testbench with a uniform traffic pattern equally distributed across 20
> subtables, we measured that up to 65% of total execution time is attributed
> to the megaflow cache lookup.
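>
> As a rough, simplified sketch of that sequential search (pseudo-C; the
> struct layout and the flow_hash()/subtable_find() helpers are assumed for
> illustration only and are not the actual dpif-netdev code):
>
>     #include <stddef.h>
>     #include <stdint.h>
>
>     struct flow { uint32_t words[16]; };      /* simplified flow key           */
>     struct rule;                              /* opaque matched megaflow       */
>     struct subtable {
>         struct flow mask;                     /* this subtable's wildcard mask */
>         /* ... hash table of rules keyed by (flow & mask) ... */
>     };
>
>     uint32_t flow_hash(const struct flow *);
>     struct rule *subtable_find(struct subtable *, const struct flow *masked_key,
>                                uint32_t hash);
>
>     /* Tuple space search: mask the packet's flow with each subtable's mask
>      * and probe that subtable's hash table until a rule matches. */
>     static struct rule *
>     tss_lookup(struct subtable **tables, int n, const struct flow *pkt)
>     {
>         for (int i = 0; i < n; i++) {
>             struct flow key;
>
>             for (int w = 0; w < 16; w++) {
>                 key.words[w] = pkt->words[w] & tables[i]->mask.words[w];
>             }
>             struct rule *r = subtable_find(tables[i], &key, flow_hash(&key));
>             if (r) {
>                 return r;                     /* first matching subtable wins  */
>             }
>         }
>         return NULL;                          /* megaflow miss: slow path      */
>     }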
> 
> This patch presents the idea of a two-layer hierarchical lookup, where a
> low-overhead first level of indirection is accessed first; we call this
> level the cuckoo distributor (CD). If a flow key has been inserted in the
> flow table, the first level will indicate with high probability which
> subtable to look into. A lookup is then performed on the second level (the
> target subtable) to retrieve the result. If the key does not have a match,
> we revert to the sequential search of subtables. The patch is partially
> inspired by earlier concepts proposed in "simTable" [1] and "Cuckoo
> Filter" [2], as well as by DPDK's Cuckoo Hash implementation.
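>
> Continuing the sketch above, the two-level lookup could look roughly like
> the following (struct cd and cd_find() are placeholders for the cuckoo
> distributor, not the patch's actual interface):
>
>     #include <stdbool.h>
>
>     struct cd;                                 /* first-level distributor      */
>     bool cd_find(const struct cd *, uint32_t hash, int *subtable_idx);
>
>     static struct rule *
>     cd_lookup(struct cd *cd, struct subtable **tables, int n,
>               const struct flow *pkt)
>     {
>         uint32_t hash = flow_hash(pkt);
>         int idx;
>
>         if (cd_find(cd, hash, &idx) && idx < n) {
>             /* First-level hit: probe only the hinted subtable. */
>             struct flow key;
>
>             for (int w = 0; w < 16; w++) {
>                 key.words[w] = pkt->words[w] & tables[idx]->mask.words[w];
>             }
>             struct rule *r = subtable_find(tables[idx], &key, flow_hash(&key));
>             if (r) {
>                 return r;
>             }
>             /* The hint was a false positive (hash collision): fall back. */
>         }
>         return tss_lookup(tables, n, pkt);     /* revert to sequential search  */
>     }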
> 
> This patch can improve on the existing Subtable Ranking when the traffic
> has high entropy. Subtable Ranking helps minimize the number of traversed
> subtables when most of the traffic hits the same subtable. However, with
> high-entropy traffic, such as traffic coming from a physical port,
> multiple subtables can be hit with similar frequency, and the average
> number of subtable lookups per hit becomes much greater than 1. In
> addition, CD adaptively turns itself off when it finds that the traffic
> mostly hits one subtable, so CD does not add overhead when Subtable
> Ranking already works well.
> 
> Scheme:
> CD sits in front of the subtables. On a CD hit, a packet is directed to
> the corresponding subtable instead of searching each subtable
> sequentially.
>  -------
> |  CD   |
>  -------
>        \
>         \
>  -----  -----     -----
> |sub  ||sub  |...|sub  |
> |table||table|   |table|
>  -----  -----     -----
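>
> As one possible concrete layout for the first level (left opaque in the
> sketch above), an entry mapping a short hash signature to a subtable index
> could be inserted cuckoo-style when a megaflow is added to a subtable,
> following the general cuckoo-hashing idea referenced in [2]. Bucket count,
> bucket size, signature width and all names below are illustrative only,
> not the patch's actual parameters:
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     #define CD_BUCKETS   1024                 /* power of two, illustrative */
>     #define CD_BUCKET_SZ 4                    /* entries per bucket         */
>     #define CD_MAX_KICKS 8                    /* displacement attempts      */
>
>     struct cd_entry { uint16_t sig; uint8_t subtable_idx; bool used; };
>     struct cd { struct cd_entry b[CD_BUCKETS][CD_BUCKET_SZ]; };
>
>     static bool
>     cd_bucket_put(struct cd *cd, uint32_t bkt, uint16_t sig, uint8_t idx)
>     {
>         for (int i = 0; i < CD_BUCKET_SZ; i++) {
>             if (!cd->b[bkt][i].used) {
>                 cd->b[bkt][i] = (struct cd_entry) { sig, idx, true };
>                 return true;
>             }
>         }
>         return false;
>     }
>
>     static bool
>     cd_insert(struct cd *cd, uint32_t hash, uint8_t subtable_idx)
>     {
>         uint16_t sig = hash >> 16;                    /* short signature   */
>         uint32_t b1 = hash & (CD_BUCKETS - 1);
>         uint32_t b2 = (b1 ^ sig) & (CD_BUCKETS - 1);  /* alternate bucket  */
>
>         if (cd_bucket_put(cd, b1, sig, subtable_idx) ||
>             cd_bucket_put(cd, b2, sig, subtable_idx)) {
>             return true;
>         }
>         /* Both buckets full: evict a victim and relocate it, cuckoo-style. */
>         uint32_t bkt = b1;
>         for (int kick = 0; kick < CD_MAX_KICKS; kick++) {
>             struct cd_entry victim = cd->b[bkt][0];
>
>             cd->b[bkt][0] = (struct cd_entry) { sig, subtable_idx, true };
>             sig = victim.sig;
>             subtable_idx = victim.subtable_idx;
>             bkt = (bkt ^ sig) & (CD_BUCKETS - 1);
>             if (cd_bucket_put(cd, bkt, sig, subtable_idx)) {
>                 return true;
>             }
>         }
>         return false;                /* give up; the CD is best-effort only */
>     }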
> 
>  Evaluation:
>  ----------
> We create a set of rules with varying src IPs and feed traffic containing
> various numbers of flows with varying src and dst IPs. All the flows hit
> 10/20/30 rules, creating 10/20/30 subtables. The rule/traffic setup is
> explained in detail below.
> 
> The tables below show the preliminary continuous-traffic test results
> (full line rate) we collected with a uni-directional phy-to-phy setup.
> OvS runs with 1 PMD. We use Spirent as the hardware traffic generator.
> 
>  Before v2 rebase:
>  ----
> AVX2 data:
> 20k flows:
> no.subtable: 10          20          30
> cd-ovs       4267332     3478251     3126763
> orig-ovs     3260883     2174551     1689981
> speedup      1.31x       1.60x       1.85x
> 
> 100k flows:
> no.subtable: 10          20          30
> cd-ovs       4015783     3276100     2970645
> orig-ovs     2692882     1711955     1302321
> speedup      1.49x       1.91x       2.28x
> 
> 1M flows:
> no.subtable: 10          20          30
> cd-ovs       3895961     3170530     2968555
> orig-ovs     2683455     1646227     1240501
> speedup      1.45x       1.92x       2.39x
> 
> Scalar data:
> 1M flows:
> no.subtable: 10          20          30
> cd-ovs       3658328     3028111     2863329
> orig-ovs     2683455     1646227     1240501
> speedup      1.36x       1.84x       2.31x
> 
>  After v2 rebase:
>  ----
> After the v2 rebase, we retested the 1M-flow, 20-subtable case; the results still hold.
> 1M flows:
> no.subtable:   20
> cd-ovs         3066483
> orig-ovs       1588049
> speedup        1.93x
> 
> 
>  Test rules/traffic setup:
>  ----
> To set up a test case with 20 subtables, we use a rule set like the following:
> tcp,nw_src=1.0.0.0/8, actions=output:1
> udp,nw_src=2.0.0.0/9, actions=output:1
> udp,nw_src=3.0.0.0/10,actions=output:1
> udp,nw_src=4.0.0.0/11,actions=output:1
> ...
> udp,nw_src=18.0.0.0/25,actions=output:1
> udp,nw_src=19.0.0.0/26,actions=output:1
> udp,nw_src=20.0.0.0/27,actions=output:1
> 
> Then, with the traffic generator, we generate corresponding traffic with
> src_ip varying from 1.0.0.0 to 20.0.0.0. For each src_ip, we vary dst_ip
> over 50000 different values. This effectively generates 1M different flows
> hitting the 20 rules we created, and because of the different wildcarded
> prefix lengths on nw_src, the 20 rules fall into 20 different subtables.
> We use 64-byte packets across all tests.
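>
> For illustration, the resulting flow space can be enumerated as below; the
> dst_ip base (10.1.x.y) is a placeholder, not the actual generator
> configuration:
>
>     #include <stdio.h>
>
>     /* 20 src IPs (1.0.0.0 .. 20.0.0.0) x 50000 dst IPs = 1,000,000 flows. */
>     int main(void)
>     {
>         unsigned long n = 0;
>
>         for (unsigned s = 1; s <= 20; s++) {
>             for (unsigned d = 0; d < 50000; d++) {
>                 printf("src=%u.0.0.0 dst=10.1.%u.%u\n", s, d >> 8, d & 0xff);
>                 n++;
>             }
>         }
>         printf("total flows: %lu\n", n);      /* prints 1000000 */
>         return 0;
>     }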
> 
> How to check whether CD works for your use case:
>  ----
> CD cannot improve throughput for all use cases. It targets use cases where
> multiple subtables exist and the top-ranked subtable is not hit by the
> vast majority of the traffic.
> 
> One can use the $OVS_DIR/utilities/ovs-appctl dpif-netdev/pmd-stats-show
> command to check the CD hit/miss statistics. Another statistic shown there
> is "avg. subtable lookups per hit". In our test case, the original OvS
> reports an average of about 10 subtable lookups per hit: there are 20
> subtables in total, and on average a hit happens after iterating half of
> them. Iterating 10 subtables per packet is very expensive.
> 
> With CD, this value is close to 1, which means that on average only 1
> subtable needs to be searched to hit the rule, removing a lot of overhead.
> 
> Other statistics worth checking are "megaflow hits" and "emc hits". If
> most packets hit the EMC, CD does not improve throughput much, since CD
> optimizes the megaflow search rather than the EMC lookup. If your test
> case has fewer than 8k flows, all of them may be EMC hits.
> 
> Note that CD is adaptively turned on/off according to the number of
> subtables and the pattern in which they are iterated. If it finds there is
> not much benefit, CD turns itself off automatically.
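>
> A minimal sketch of what such a heuristic could look like (the threshold,
> the sampling interval and all names are illustrative, not the values used
> by this patchset):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     struct cd_state { bool enabled; };
>
>     /* Called periodically with counters sampled since the last check:
>      * if megaflow hits are mostly served by the first subtable tried
>      * (average close to 1), the CD is disabled; otherwise it stays on. */
>     static void
>     cd_maybe_toggle(struct cd_state *cd, uint64_t subtable_lookups,
>                     uint64_t megaflow_hits)
>     {
>         if (!megaflow_hits) {
>             return;                           /* nothing to decide on yet */
>         }
>         double avg = (double) subtable_lookups / (double) megaflow_hits;
>
>         cd->enabled = avg > 1.2;              /* illustrative threshold   */
>     }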
> 
> 
>  References:
>  ----------
> [1] H. Lee and B. Lee, "Approaches for Improving Tuple Space Search-Based
> Table Lookup", ICTC '15.
> [2] B. Fan, D. G. Andersen, M. Kaminsky, and M. D. Mitzenmacher,
> "Cuckoo Filter: Practically Better Than Bloom", CoNEXT '14.
> 
> The previous RFC on the mailing list is at:
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/330570.html
> 
> v2: Rebase to master head.
>     Add more testing details to the cover letter.
>     Update commit messages.
>     Minor style changes to the code.
>     Fix build errors that occur without AVX and the DPDK library.
> 
> Yipeng Wang (5):
>   dpif-netdev: Basic CD feature with scalar lookup.
>   dpif-netdev: Add AVX2 implementation for CD lookup.
>   dpif-netdev: Add CD statistics
>   dpif-netdev: Add adaptive CD mechanism
>   unit-test: Add a delay for CD initialization.
> 
>  lib/dpif-netdev.c     | 567 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  tests/ofproto-dpif.at |   3 +
>  2 files changed, 560 insertions(+), 10 deletions(-)
> 
> --
> 2.7.4


