[ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations

Jean Hsiao jhsiao at redhat.com
Fri May 14 15:45:15 UTC 2021


On 5/14/21 11:33 AM, Van Haaren, Harry wrote:
>
> Hi Jean,
>
> Apologies for top post – just a quick note here today. Thanks for all 
> the info, good amount of detail.
>
> 1 & 3)
>
> Unfortunately the "perf top" output seems to be of a binary without 
> debug symbols, so it is not possible to see what is what. (Apologies, 
> I should have specified to include debug symbols & then we can see 
> function names like "dpcls_lookup" and "miniflow_extract" instead of 
> 0x00001234 :) I would be interested in the output with function names, 
> if that's possible?
>
Not sure if the debug symbols for Tim's master build are still around. Will check.
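
For the next capture, something along these lines should give function
names instead of raw addresses. Just a sketch, assuming a matching
debuginfo package exists for the "special" build and that 24/52 are the
PMD cores as in the stats below:

  # pull in debug symbols for the running OVS build (package name may differ)
  dnf debuginfo-install openvswitch2.15
  # sample only the PMD cores, with call graphs
  perf top -C 24,52 -g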
>
> 2) Is it normal that the vswitch datapath core is >= 80% idle? This 
> seems a little strange – and might hint that the bottleneck is not on 
> the OVS vswitch datapath cores? (from your pmd-perf-stats below):
>
>   - idle iterations: 17444555421  ( 84.1 % of used cycles)
>   - busy iterations:     84924866  ( 15.9 % of used cycles)
>
Should be almost 100% busy. Not sure if pmd-perf-stats has a clear 
option like "ovs-appctl dpif-netdev/pmd-stats-clear".
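
If pmd-stats-clear does reset the pmd-perf-show counters as well (to be
confirmed), a cleaner steady-state sample would be something like:

  ovs-appctl dpif-netdev/pmd-stats-clear
  sleep 60
  ovs-appctl dpif-netdev/pmd-perf-show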
>
> 4) Ah yes, no-drop testing, "failed to converge" suddenly makes a lot 
> of sense, thanks!
>
> Regards, -Harry
>
> *From:* Jean Hsiao <jhsiao at redhat.com>
> *Sent:* Thursday, May 13, 2021 3:27 PM
> *To:* Van Haaren, Harry <harry.van.haaren at intel.com>; Timothy Redaelli 
> <tredaell at redhat.com>; Amber, Kumar <kumar.amber at intel.com>; 
> dev at openvswitch.org; Jean Hsiao <jhsiao at redhat.com>
> *Cc:* i.maximets at ovn.org; fbl at redhat.com; Stokes, Ian 
> <ian.stokes at intel.com>; Christian Trautman <ctrautma at redhat.com>
> *Subject:* Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations
>
> On 5/11/21 7:35 AM, Van Haaren, Harry wrote:
>
>         -----Original Message-----
>
>         From: Timothy Redaelli <tredaell at redhat.com>
>
>         Sent: Monday, May 10, 2021 6:43 PM
>
>         To: Amber, Kumar <kumar.amber at intel.com>; dev at openvswitch.org
>
>         Cc: i.maximets at ovn.org; jhsiao at redhat.com; fbl at redhat.com;
>         Van Haaren, Harry <harry.van.haaren at intel.com>
>
>         Subject: Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure +
>         Optimizations
>
>     <snip patchset details for brevity>
>
>         Hi,
>
>         we (as Red Hat) did some tests with a "special" build created
>         on top of
>
>         master (a019868a6268 at that time) with the 2 series ("DPIF
>
>         Framework + Optimizations" and "MFEX Infrastructure +
>         Optimizations")
>
>         cherry-picked.
>
>         The spec file was also modified in order to add "-msse4.2
>         -mpopcnt"
>
>         to OVS CFLAGS.
>
>     Hi Timothy,
>
>     Thanks for testing and reporting back your findings! Most of the
>     configuration is clear to me, but I have a few open questions
>     inline below for context.
>
>     The performance numbers reported in the email below do not show a
>     benefit when enabling AVX512, which contradicts our recent
>     whitepaper on benchmarking an Optimized Deployment of OVS, which
>     includes the AVX512 patches you've benchmarked too.
>
>     Specifically Table 8. for DPIF/MFEX patches, and Table 9. for the
>     overall optimizations at a platform level are relevant:
>
>     https://networkbuilders.intel.com/solutionslibrary/open-vswitch-optimized-deployment-benchmark-technology-guide
>
>     Based on the differences between these performance reports, there
>     must be some discrepancy in our testing/measurements.
>
>     I hope that the questions below help us understand any differences
>     so we can all measure the benefits from these optimizations.
>
>     Regards, -Harry
>
>         RPM=openvswitch2.15-2.15.0-37.avx512.1.el8fdp (the "special"
>         build with
>
>         the patches backported)
>
>            * Master --- 15.2 Mpps
>
>            * Plus "avx512_gather 3" Only --- 15.2 Mpps
>
>            * Plus "dpif-set dpif_avx512" Only --- 10.1 Mpps
>
>            * Plus "miniflow-parser-set study" --- Failed to converge
>
>            * Plus all three --- 13.5 Mpps
>
>     Open questions:
>
>     1) Is CPU frequency turbo enabled in any scenario, or always
>     pinned to the 2.6 GHz base frequency?
>
>        - A "perf top -C x,y"   (where x,y are datapath hyperthread
>     ids) would be interesting to compare with 3) below.
>
> See attached screenshots for two samples --- master-0 and master-1
>
>     2) "plus Avx512 gather 3" (aka, DPCLS in AVX512), we see same
>     performance. Is DPCLS in use, or is EMC doing all the work?
>
>        - The output of " ovs-appctl dpif-netdev/pmd-perf-show" would
>     be interesting to understand where packets are classified.
>
> EMC is doing all the work --- see the log below. This could explain why 
> enabling avx512 is not helping.
>
> NOTE: Our initial study showed that disabling EMC didn't help avx512 
> win the case either.
>
> [root at netqe29 jhsiao]# ovs-appctl dpif-netdev/subtable-lookup-prio-get
> Available lookup functions (priority : name)
>   0 : autovalidator
>   1 : generic
>   0 : avx512_gather
> [root at netqe29 jhsiao]#
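
Side note for a follow-up run: to actually exercise the AVX512 DPCLS we
would need to raise its priority and stop EMC insertion so packets reach
dpcls, roughly as sketched below (assuming emc-insert-inv-prob=0 is the
right knob to disable EMC insertion on this build):

  ovs-appctl dpif-netdev/subtable-lookup-prio-set avx512_gather 3
  ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0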
>
> sleep 60; ovs-appctl dpif-netdev/pmd-perf-show
>
>
> Time: 13:54:40.213
> Measurement duration: 2242.679 s
>
> pmd thread numa_id 0 core_id 24:
>
>   Iterations:         17531214131  (0.13 us/it)
>   - Used TSC cycles: 5816810246080  (100.1 % of total cycles)
>   - idle iterations:  17446464548  ( 84.1 % of used cycles)
>   - busy iterations:     84749583  ( 15.9 % of used cycles)
>   Rx packets:          2711982944  (1209 Kpps, 340 cycles/pkt)
>   Datapath passes:     2711982944  (1.00 passes/pkt)
>   - EMC hits:          2711677677  (100.0 %)
>   - SMC hits:                   0  (  0.0 %)
>   - Megaflow hits:         305261  (  0.0 %, 1.00 subtbl lookups/hit)
>   - Upcalls:                    6  (  0.0 %, 0.0 us/upcall)
>   - Lost upcalls:               0  (  0.0 %)
>   Tx packets:          2711982944  (1209 Kpps)
>   Tx batches:            84749583  (32.00 pkts/batch)
>
> Time: 13:54:40.213
> Measurement duration: 2242.675 s
>
> pmd thread numa_id 0 core_id 52:
>
>   Iterations:         17529480287  (0.13 us/it)
>   - Used TSC cycles: 5816709563052  (100.1 % of total cycles)
>   - idle iterations:  17444555421  ( 84.1 % of used cycles)
>   - busy iterations:     84924866  ( 15.9 % of used cycles)
>   Rx packets:          2717592640  (1212 Kpps, 340 cycles/pkt)
>   Datapath passes:     2717592640  (1.00 passes/pkt)
>   - EMC hits:          2717280240  (100.0 %)
>   - SMC hits:                   0  (  0.0 %)
>   - Megaflow hits:         312362  (  0.0 %, 1.00 subtbl lookups/hit)
>   - Upcalls:                    6  (  0.0 %, 0.0 us/upcall)
>   - Lost upcalls:               0  (  0.0 %)
>   Tx packets:          2717592608  (1212 Kpps)
>   Tx batches:            84924866  (32.00 pkts/batch)
> [root at netqe29 jhsiao]#
>
>
>     3) "dpif-set dpif_avx512" only. The performance here is very
>     strange, with ~30% reduction, while our testing shows performance
>     improvement.
>
>        - A "perf top" here (compared vs step 1) would be helpful to
>     see what is going on
>
> See avx512-0 and avx512-1 attachments.
>
>     4) "miniflow parser set study", I don't understand what is meant
>     by "Failed to converge"?
>
> This is a 64-byte 0-loss run. So "failed to converge" means the 
> binary search failed to settle on a meaningful Mpps value. This can 
> happen when drops occur --- even 1 out of a million packets.
>
>        - Is the traffic running in your benchmark Ether()/IP()/UDP() ?
>
>        - Note that the only traffic pattern accelerated today is
>     Ether()/IP()/UDP() (see patch
>     https://patchwork.ozlabs.org/project/openvswitch/patch/20210428091931.2090062-5-kumar.amber@intel.com/
>     for details). The next revision of the patchset will include other
>     traffic patterns, for example Ether()/Dot1Q()/IP()/UDP() and
>     Ether()/IP()/TCP().
>
>         RPM=openvswitch2.15-2.15.0-15.el8fdp (w/o "-msse4.2 -mpopcnt")
>
>            * 15.2 Mpps
>
>     5) What CFLAGS "-march=" CPU ISA and "-O" optimization options are
>     being used for the package?
>
>        - It is likely that "-msse4.2 -mpopcnt" is already implied if,
>     for example, -march=corei7 or -march=nehalem is used.
>
> Tim, can you answer this question?
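
In the meantime, a quick way to check whether a given -march already
implies those flags -- just a sketch, since the actual flags used for
the package build are still the open question:

  # __SSE4_2__ / __POPCNT__ are defined when the ISA is enabled
  echo | gcc -march=corei7 -dM -E - | grep -E '__SSE4_2__|__POPCNT__'
  # distro default CFLAGS picked up by rpmbuild
  rpm --eval '%{optflags}'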
>
>         P2P benchmark
>
>            * ovs-dpdk/25 Gb i40e <-> trex/i40e
>
>            * single queue, two PMDs --- two hyperthreads out of one CPU core.
>
>         Host CPU
>
>         Model name:          Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>
>     Thanks for detailing the configuration, and looking forward to
>     understanding the configuration/performance better.
>
