[ovs-dev] [PATCH v2] dpif-netdev: dpcls per in_port with sorted subtables

Jan Scheurich jan.scheurich at ericsson.com
Tue Aug 9 14:59:18 UTC 2016


I just submitted a v3 version of the patch. No need to review this one.

Jan

> -----Original Message-----
> From: dev [mailto:dev-bounces at openvswitch.org] On Behalf Of Jan Scheurich
> Sent: Friday, 15 July, 2016 18:35
> To: dev at openvswitch.org
> Subject: [ovs-dev] [PATCH v2] dpif-netdev: dpcls per in_port with sorted
> subtables
> 
> This turns the previous RFC patch "dpif-netdev: dpcls per in_port with sorted
> subtables" into a non-RFC patch v2.
> 
> The user-space datapath (dpif-netdev) consists of a first-level "exact match
> cache" (EMC) matching on 5-tuples and the normal megaflow classifier. With
> many parallel packet flows (e.g. TCP connections) the EMC becomes inefficient
> and the OVS forwarding performance is determined by the megaflow classifier.
> 
> The megaflow classifier (dpcls) consists of a variable number of hash tables
> (aka subtables), each containing megaflow entries with the same mask of
> packet header and metadata fields to match upon. A dpcls lookup matches a
> given packet against all subtables in sequence until it hits a match. As
> megaflow cache entries are by construction non-overlapping, the first match is
> the only match.
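> 
> For illustration only, here is a minimal sketch of that sequential lookup
> (simplified placeholder code, not the actual dpif-netdev implementation; the
> struct declarations and subtable_hash_lookup() are made-up names):
> 
>     #include <stddef.h>
> 
>     struct dpcls_subtable;   /* one hash table per distinct wildcard mask */
>     struct dpcls_rule;
>     struct flow;
> 
>     /* Placeholder: exact-match lookup within one subtable's hash table. */
>     struct dpcls_rule *subtable_hash_lookup(struct dpcls_subtable *,
>                                             const struct flow *);
> 
>     /* Probe the subtables in sequence until the first hit.  Because megaflow
>      * entries never overlap, the first match is the only match; a miss costs
>      * a probe of all N subtables. */
>     static struct dpcls_rule *
>     dpcls_lookup_sketch(struct dpcls_subtable **subtables, size_t n,
>                         const struct flow *flow)
>     {
>         for (size_t i = 0; i < n; i++) {
>             struct dpcls_rule *rule = subtable_hash_lookup(subtables[i], flow);
>             if (rule) {
>                 return rule;
>             }
>         }
>         return NULL;
>     }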
> 
> Today the order of the subtables in the dpcls is essentially random, so that on
> average a dpcls lookup has to visit N/2 subtables for a hit, where N is the total
> number of subtables. Even though every single hash-table lookup is fast, the
> performance of the current dpcls degrades when there are many subtables.
> 
> How does the patch address this issue:
> 
> In reality there is often a strong correlation between the ingress port and a
> small subset of subtables that have hits. The entire megaflow cache typically
> decomposes nicely into partitions that are hit only by packets entering from a
> range of similar ports (e.g. traffic from Phy -> VM vs. traffic from VM -> Phy).
> 
> Therefore, maintaining a separate dpcls instance per ingress port with its
> subtable vector sorted by frequency of hits reduces the average number of
> subtable lookups in the dpcls to a minimum, even if the total number of
> subtables gets large. This is possible because megaflows always have an exact
> match on in_port, so every megaflow belongs to a unique dpcls instance.
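> 
> As a rough sketch of this idea (placeholder code under assumed names, not the
> patch itself): each ingress port gets its own dpcls whose subtable vector is
> periodically re-sorted by hit count, so the hottest subtables are probed first:
> 
>     #include <stddef.h>
>     #include <stdint.h>
>     #include <stdlib.h>
> 
>     struct dpcls_subtable;
> 
>     struct subtable_entry {
>         struct dpcls_subtable *subtable;
>         uint64_t hit_cnt;            /* hits since the last re-sort */
>     };
> 
>     struct dpcls {
>         uint32_t in_port;            /* megaflows match exactly on in_port */
>         struct subtable_entry *vec;  /* kept sorted by descending hit_cnt */
>         size_t n;
>     };
> 
>     static int
>     cmp_hits_desc(const void *a_, const void *b_)
>     {
>         const struct subtable_entry *a = a_, *b = b_;
>         return (a->hit_cnt < b->hit_cnt) - (a->hit_cnt > b->hit_cnt);
>     }
> 
>     /* Move the most frequently hit subtables to the front of the vector and
>      * reset the counters for the next measurement interval. */
>     static void
>     dpcls_sort_subtables(struct dpcls *cls)
>     {
>         qsort(cls->vec, cls->n, sizeof cls->vec[0], cmp_hits_desc);
>         for (size_t i = 0; i < cls->n; i++) {
>             cls->vec[i].hit_cnt = 0;
>         }
>     }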
> 
> For thread safety, the PMD thread needs to block out revalidators during the
> periodic optimization. We use ovs_mutex_trylock() to avoid blocking the PMD.
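> 
> Roughly like the following sketch (pmd->flow_mutex and ovs_mutex_trylock() are
> from the existing code; the surrounding helper is a placeholder name):
> 
>     #include "ovs-thread.h"     /* ovs_mutex_trylock(), ovs_mutex_unlock() */
> 
>     /* Called periodically from the PMD main loop (sketch).  If a revalidator
>      * currently holds flow_mutex, skip this optimization round instead of
>      * blocking the PMD thread; ovs_mutex_trylock() returns 0 on success. */
>     static void
>     pmd_try_optimize_sketch(struct dp_netdev_pmd_thread *pmd, struct dpcls *cls)
>     {
>         if (!ovs_mutex_trylock(&pmd->flow_mutex)) {
>             dpcls_sort_subtables(cls);       /* see the sketch above */
>             ovs_mutex_unlock(&pmd->flow_mutex);
>         }
>     }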
> 
> To monitor the effectiveness of the patch we have enhanced the ovs-appctl
> dpif-netdev/pmd-stats-show command with an extra line "avg. subtable lookups
> per hit" to report the average number of subtable lookups needed for a
> megaflow match. Ideally, this should be close to 1 and in almost all cases much
> smaller than N/2.
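> 
> For example, on a running system:
> 
>     $ ovs-appctl dpif-netdev/pmd-stats-show
> 
> and each PMD thread's statistics then include the new "avg. subtable lookups
> per hit" line alongside the existing hit and miss counters.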
> 
> I have benchmarked a cloud L3 overlay pipeline with a VXLAN overlay mesh.
> With pure L3 tenant traffic between VMs on different nodes the resulting
> netdev dpcls contains N=4 subtables.
> 
> Disabling the EMC, I have measured a baseline performance (in+out) of ~1.32
> Mpps (64 bytes, 1000 L4 flows). The average number of subtable lookups per
> dpcls match is 2.5.
> 
> With the patch the average number of subtable lookups per dpcls match is
> reduced to 1, and the forwarding performance grows by ~30% to 1.72 Mpps.
> 
> As the number of subtables will often be higher in practice, we can assume
> that this is at the lower end of the speed-up one can expect from this
> optimization. Just running a parallel ping between the VXLAN tunnel endpoints
> increases the number of subtables, and hence the average number of subtable
> lookups, from 2.5 to 3.5, with a corresponding decrease of throughput to 1.14
> Mpps. With the patch the parallel ping has no impact on the average number of
> subtable lookups or on performance. The performance gain is then ~50%.
> 
> The main change compared to the previous patch is that instead of having a
> subtable vector per in_port within a single dpcls instance, we now have a
> separate dpcls instance per ingress port, each with its own subtable vector.
> This is better aligned with the design of the base code and also reduces the
> number of subtable lookups in the miss case.
> 
> The PMD tests have been adjusted to account for the additional line in
> pmd-stats-show.
> 
> Signed-off-by: Jan Scheurich <jan.scheurich at ericsson.com>
> 
> 
> Changes in v2:
> - Rebased to master (commit 3041e1fc9638)
> - Take the pmd->flow_mutex during optimization to block out revalidators
>   Use trylock in order to not block the PMD thread
> - Made in_port an explicit input parameter to fast_path_processing()
> - Fixed coding style issues
> 
> 

