[ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations

Jean Hsiao jhsiao at redhat.com
Thu May 13 14:27:19 UTC 2021


On 5/11/21 7:35 AM, Van Haaren, Harry wrote:
>> -----Original Message-----
>> From: Timothy Redaelli <tredaell at redhat.com>
>> Sent: Monday, May 10, 2021 6:43 PM
>> To: Amber, Kumar <kumar.amber at intel.com>; dev at openvswitch.org
>> Cc: i.maximets at ovn.org; jhsiao at redhat.com; fbl at redhat.com; Van Haaren, Harry
>> <harry.van.haaren at intel.com>
>> Subject: Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations
> <snip patchset details for brevity>
>
>> Hi,
>> we (as Red Hat) did some tests with a "special" build created on top of
>> master (a019868a6268 at that time) with the 2 series ("DPIF
>> Framework + Optimizations" and "MFEX Infrastructure + Optimizations")
>> cherry-picked.
>> The spec file was also modified in order to add "-msse4.2 -mpopcnt"
>> to OVS CFLAGS.
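
Side note: for anyone reproducing the build, passing the extra ISA flags
through the usual autotools flow would look roughly like this sketch (the
"-g -O2" base flags are assumed; the actual spec-file change may differ):

    ./configure CFLAGS="-g -O2 -msse4.2 -mpopcnt"
    make -j
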
> Hi Timothy,
>
> Thanks for testing and reporting back your findings! Most of the configuration is clear to me, but I have a few open questions inline below for context.
>
> The performance numbers reported in the email below do not show benefit when enabling AVX512, which contradicts our
> recent whitepaper on benchmarking an Optimized Deployment of OVS, which includes the AVX512 patches you've benchmarked too.
> Specifically Table 8. for DPIF/MFEX patches, and Table 9. for the overall optimizations at a platform level are relevant:
> https://networkbuilders.intel.com/solutionslibrary/open-vswitch-optimized-deployment-benchmark-technology-guide
>
> Based on the differences between these performance reports, there must be some discrepancy in our testing/measurements.
> I hope that the questions below help us understand any differences so we can all measure the benefits from these optimizations.
>
> Regards, -Harry
>
>
>> RPM=openvswitch2.15-2.15.0-37.avx512.1.el8fdp (the "special" build with
>> the patches backported)
>>
>>     * Master --- 15.2 Mpps
>>     * Plus "avx512_gather 3" Only --- 15.2 Mpps
>>     * Plus "dpif-set dpif_avx512" Only --- 10.1 Mpps
>>     * Plus "miniflow-parser-set study" --- Failed to converge
>>     * Plus all three --- 13.5 Mpps
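
For clarity, the three settings above correspond to runtime commands along
these lines (a sketch only; the dpif-set and miniflow-parser-set commands
come from the two patch series under test, so the exact appctl names and
syntax may differ from what lands upstream):

    ovs-appctl dpif-netdev/subtable-lookup-prio-set avx512_gather 3
    ovs-appctl dpif-netdev/dpif-set dpif_avx512
    ovs-appctl dpif-netdev/miniflow-parser-set study
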
> Open questions:
> 1) Is CPU frequency turbo enabled in any scenario, or always pinned to the 2.6 GHz base frequency?
>     - A "perf top -C x,y"   (where x,y are datapath hyperthread ids) would be interesting to compare with 3) below.
See the attached screenshots for two samples --- master-0 and master-1.
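
For the turbo question, one way to confirm the per-core frequency on the
datapath threads during a run would be something like this sketch (core ids
taken from the pmd-perf-show output below):

    cpupower -c 24,52 frequency-info | grep "current CPU frequency"
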
>
> 2) "plus Avx512 gather 3" (aka, DPCLS in AVX512), we see same performance. Is DPCLS in use, or is EMC doing all the work?
>     - The output of " ovs-appctl dpif-netdev/pmd-perf-show" would be interesting to understand where packets are classified.

EMC is doing all the work --- see the log below. This could explain why 
enabling avx512 is not helping.

NOTE: Our initial study showed that disabling EMC didn't help avx512 win 
the case.
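
For reference, the usual way to take EMC out of the picture is to disable
EMC insertion, along the lines of:

    ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0
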

[root at netqe29 jhsiao]# ovs-appctl dpif-netdev/subtable-lookup-prio-get
Available lookup functions (priority : name)
   0 : autovalidator
  *1 : generic*
   0 : avx512_gather
[root at netqe29 jhsiao]#

sleep 60; ovs-appctl dpif-netdev/pmd-perf-show


Time: 13:54:40.213
Measurement duration: 2242.679 s

pmd thread numa_id 0 core_id 24:

   Iterations:         17531214131  (0.13 us/it)
   - Used TSC cycles: 5816810246080  (100.1 % of total cycles)
   - idle iterations:  17446464548  ( 84.1 % of used cycles)
   - busy iterations:     84749583  ( 15.9 % of used cycles)
   Rx packets:          2711982944  (1209 Kpps, 340 cycles/pkt)
   Datapath passes:     2711982944  (1.00 passes/pkt)
   - EMC hits:          2711677677  (100.0 %)
   - SMC hits:                   0  (  0.0 %)
   - Megaflow hits:         305261  (  0.0 %, 1.00 subtbl lookups/hit)
   - Upcalls:                    6  (  0.0 %, 0.0 us/upcall)
   - Lost upcalls:               0  (  0.0 %)
   Tx packets:          2711982944  (1209 Kpps)
   Tx batches:            84749583  (32.00 pkts/batch)

Time: 13:54:40.213
Measurement duration: 2242.675 s

pmd thread numa_id 0 core_id 52:

   Iterations:         17529480287  (0.13 us/it)
   - Used TSC cycles: 5816709563052  (100.1 % of total cycles)
   - idle iterations:  17444555421  ( 84.1 % of used cycles)
   - busy iterations:     84924866  ( 15.9 % of used cycles)
   Rx packets:          2717592640  (1212 Kpps, 340 cycles/pkt)
   Datapath passes:     2717592640  (1.00 passes/pkt)
   - EMC hits:          2717280240  (100.0 %)
   - SMC hits:                   0  (  0.0 %)
   - Megaflow hits:         312362  (  0.0 %, 1.00 subtbl lookups/hit)
   - Upcalls:                    6  (  0.0 %, 0.0 us/upcall)
   - Lost upcalls:               0  (  0.0 %)
   Tx packets:          2717592608  (1212 Kpps)
   Tx batches:            84924866  (32.00 pkts/batch)
[root at netqe29 jhsiao]#

>
> 3) "dpif-set dpif_avx512" only. The performance here is very strange, with ~30% reduction, while our testing shows performance improvement.
>     - A "perf top" here (compared vs step 1) would be helpful to see what is going on
See avx512-0 and avx512-1 attachments.
>
> 4) "miniflow parser set study", I don't understand what is meant by "Failed to converge"?
This is a 64-byte, 0-loss run. So, "Failed to converge" means the binary 
search failed to settle on a meaningful Mpps value. That can happen when 
drops are still occurring --- possibly as few as 1 out of a million packets.
>     - Is the traffic running in your benchmark Ether()/IP()/UDP() ?
>     - Note that the only traffic pattern accelerated today is Ether()/IP()/UDP() (see patch https://patchwork.ozlabs.org/project/openvswitch/patch/20210428091931.2090062-5-kumar.amber@intel.com/ for details). The next revision of the patchset will include other traffic patterns, for example Ether()/Dot1Q()/IP()/UDP() and Ether()/IP()/TCP().
>
>
>> RPM=openvswitch2.15-2.15.0-15.el8fdp (w/o "-msse4.2 -mpopcnt")
>>     * 15.2 Mpps
> 5) What CFLAGS "-march=" CPU ISA and "-O" optimization options are being used for the package?
>     - It is likely that "-msse4.2 -mpopcnt" is already implied if -march=corei7 or Nehalem for example.
>
Tim, can you answer this question?
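
For what it's worth, whether a given -march already implies those two flags
can be checked from the compiler's predefined macros, e.g.:

    gcc -march=nehalem -dM -E - </dev/null | grep -E '__SSE4_2__|__POPCNT__'
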
>
>> P2P benchmark
>>     * ovs-dpdk/25 Gb i40e <-> trex/i40e
>>     * single queue, two PMDs --- two HTs of one CPU core.
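
For reference, that PMD layout corresponds to a configuration roughly like
the sketch below. The cpu mask assumes core ids 24 and 52 from the
pmd-perf-show output above, and the interface name is just an example
(n_rxq=1 is also the default):

    ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10000001000000
    ovs-vsctl set Interface dpdk0 options:n_rxq=1
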
>>
>> Host CPU
>> Model name:          Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> Thanks for detailing the configuration, and looking forward to understanding the configuration/performance better.
>

