[ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations
Jean Hsiao
jhsiao at redhat.com
Fri May 14 16:28:53 UTC 2021
Hi Harry,
Please take a look. Let me know if you need more info.
Thanks!
Jean
1 & 3)
Master run: See attachments master-1-0 and master-1-1.
Plus avx512: See attachments avx512-1-0 and avx512-1-1.
NOTE: On a quick look I don't see any avx512 function(s). Is that because
EMC is doing all the work?
2) NOTE: I am using different commands this time --- pmd-stats-clear and
pmd-stats-show; now you can see 100% processing cycles.
[root at netqe29 jhsiao]# ovs-appctl dpif-netdev/pmd-stats-clear
[root at netqe29 jhsiao]# ovs-appctl dpif-netdev/pmd-stats-show
pmd thread numa_id 0 core_id 24:
packets received: 79625792
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 79625760
smc hits: 0
megaflow hits: 0
avg. subtable lookups per megaflow hit: 0.00
miss with success upcall: 0
miss with failed upcall: 0
avg. packets per output batch: 32.00
idle cycles: 0 (0.00%)
processing cycles: 27430462544 (100.00%)
avg cycles per packet: 344.49 (27430462544/79625792)
avg processing cycles per packet: 344.49 (27430462544/79625792)
pmd thread numa_id 0 core_id 52:
packets received: 79771872
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 79771872
smc hits: 0
megaflow hits: 0
avg. subtable lookups per megaflow hit: 0.00
miss with success upcall: 0
miss with failed upcall: 0
avg. packets per output batch: 32.00
idle cycles: 0 (0.00%)
processing cycles: 27430498048 (100.00%)
avg cycles per packet: 343.86 (27430498048/79771872)
avg processing cycles per packet: 343.86 (27430498048/79771872)
main thread:
packets received: 0
packet recirculations: 0
avg. datapath passes per packet: 0.00
emc hits: 0
smc hits: 0
megaflow hits: 0
avg. subtable lookups per megaflow hit: 0.00
miss with success upcall: 0
miss with failed upcall: 0
avg. packets per output batch: 0.00
[root at netqe29 jhsiao]#
On 5/14/21 11:33 AM, Van Haaren, Harry wrote:
>
> Hi Jean,
>
> Apologies for top post – just a quick note here today. Thanks for all
> the info, good amount of detail.
>
> 1 & 3)
>
> Unfortunately the "perf top" output seems to be of a binary without
> debug symbols, so it is not possible to see what is what. (Apologies,
> I should have specified to include debug symbols & then we can see
> function names like "dpcls_lookup" and "miniflow_extract" instead of
> 0x00001234 :) I would be interested in the output with function-names,
> if that's possible?
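>
> (A quick sketch of how the symbols could be pulled in on the RHEL side,
> assuming the debuginfo package follows the openvswitch2.15 RPM name used
> in this test:
>
>     # hypothetical package name; adjust to the installed RPM
>     dnf debuginfo-install openvswitch2.15
>     # profile only the two PMD hyperthreads, with call graphs
>     perf top -C 24,52 -g
>
> where 24 and 52 are the PMD core ids from pmd-stats-show.)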
>
> 2) Is it normal that the vswitch datapath core is >= 80% idle? This
> seems a little strange – and might hint that the bottleneck is not on
> the OVS vswitch datapath cores? (from your pmd-perf-stats below):
>
> - idle iterations: 17444555421 ( 84.1 % of used cycles)
> - busy iterations: 84924866 ( 15.9 % of used cycles)
>
> 4) Ah yes, no-drop testing, "failed to converge" suddenly makes a lot
> of sense, thanks!
>
> Regards, -Harry
>
> *From:* Jean Hsiao <jhsiao at redhat.com>
> *Sent:* Thursday, May 13, 2021 3:27 PM
> *To:* Van Haaren, Harry <harry.van.haaren at intel.com>; Timothy Redaelli
> <tredaell at redhat.com>; Amber, Kumar <kumar.amber at intel.com>;
> dev at openvswitch.org; Jean Hsiao <jhsiao at redhat.com>
> *Cc:* i.maximets at ovn.org; fbl at redhat.com; Stokes, Ian
> <ian.stokes at intel.com>; Christian Trautman <ctrautma at redhat.com>
> *Subject:* Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations
>
> On 5/11/21 7:35 AM, Van Haaren, Harry wrote:
>
> -----Original Message-----
>
> From: Timothy Redaelli <tredaell at redhat.com>
>
> Sent: Monday, May 10, 2021 6:43 PM
>
> To: Amber, Kumar <kumar.amber at intel.com>; dev at openvswitch.org
>
> Cc: i.maximets at ovn.org; jhsiao at redhat.com; fbl at redhat.com;
> Van Haaren, Harry <harry.van.haaren at intel.com>
>
> Subject: Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations
>
> <snip patchset details for brevity>
>
> Hi,
>
> we (as Red Hat) did some tests with a "special" build created on top of
> master (a019868a6268 at that time) with the 2 series ("DPIF
> Framework + Optimizations" and "MFEX Infrastructure + Optimizations")
> cherry-picked.
>
> The spec file was also modified in order to add "-msse4.2 -mpopcnt"
> to OVS CFLAGS.
>
> Hi Timothy,
>
> Thanks for testing and reporting back your findings! Most of the
> configuration is clear to me, but I have a few open questions
> inline below for context.
>
> The performance numbers reported in the email below do not show benefit
> when enabling AVX512, which contradicts our recent whitepaper on
> benchmarking an Optimized Deployment of OVS, which includes the AVX512
> patches you've benchmarked too.
>
> Specifically Table 8. for DPIF/MFEX patches, and Table 9. for the
> overall optimizations at a platform level are relevant:
>
> https://networkbuilders.intel.com/solutionslibrary/open-vswitch-optimized-deployment-benchmark-technology-guide
>
> Based on the differences between these performance reports, there
> must be some discrepancy in our testing/measurements.
>
> I hope that the questions below help us understand any differences
> so we can all measure the benefits from these optimizations.
>
> Regards, -Harry
>
> RPM=openvswitch2.15-2.15.0-37.avx512.1.el8fdp (the "special" build with
> the patches backported)
>
> * Master --- 15.2 Mpps
>
> * Plus "avx512_gather 3" Only --- 15.2 Mpps
>
> * Plus "dpif-set dpif_avx512" Only --- 10.1 Mpps
>
> * Plus "miniflow-parser-set study" --- Failed to converge
>
> * Plus all three --- 13.5 Mpps
>
> Open questions:
>
> 1) Is CPU frequency turbo enabled in any scenario, or always
> pinned to the 2.6 GHz base frequency?
>
> - A "perf top -C x,y" (where x,y are datapath hyperthread
> ids) would be interesting to compare with 3) below.
>
> See attached screenshots for two samples --- master-0 and master-1.
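>
> For the frequency half of the question, a quick check on the two PMD cores
> could be something like the sketch below (assuming the cpupower utility is
> installed):
>
>     # report current frequency and boost/turbo state for the PMD cores
>     cpupower -c 24,52 frequency-info | grep -E "current CPU frequency|boost"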
>
> 2) "plus Avx512 gather 3" (aka, DPCLS in AVX512), we see same
> performance. Is DPCLS in use, or is EMC doing all the work?
>
> - The output of " ovs-appctl dpif-netdev/pmd-perf-show" would
> be interesting to understand where packets are classified.
>
> EMC is doing all the work --- see the log below. This could explain why
> setting avx512 is not helping.
>
> NOTE: Our initial study showed that disabling EMC didn't help avx512 win
> the case either.
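>
> If we want to force packets past the EMC in a later run, one way is to
> disable EMC insertion so that lookups fall through to SMC/DPCLS (sketch):
>
>     # 0 disables EMC insertion; the default inverse probability is 100
>     ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0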
>
> [root at netqe29 jhsiao]# ovs-appctl dpif-netdev/subtable-lookup-prio-get
> Available lookup functions (priority : name)
> 0 : autovalidator
> 1 : generic
> 0 : avx512_gather
> [root at netqe29 jhsiao]#
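>
> Note the priorities above: generic is at 1 and avx512_gather is still at 0,
> so the generic DPCLS implementation is the one selected in this output. The
> AVX512 variant is raised in priority with the same command used for the
> "avx512_gather 3" case, e.g.:
>
>     ovs-appctl dpif-netdev/subtable-lookup-prio-set avx512_gather 3
>     # re-check which implementation now has the highest priority
>     ovs-appctl dpif-netdev/subtable-lookup-prio-get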
>
> sleep 60; ovs-appctl dpif-netdev/pmd-perf-show
>
>
> Time: 13:54:40.213
> Measurement duration: 2242.679 s
>
> pmd thread numa_id 0 core_id 24:
>
> Iterations: 17531214131 (0.13 us/it)
> - Used TSC cycles: 5816810246080 (100.1 % of total cycles)
> - idle iterations: 17446464548 ( 84.1 % of used cycles)
> - busy iterations: 84749583 ( 15.9 % of used cycles)
> Rx packets: 2711982944 (1209 Kpps, 340 cycles/pkt)
> Datapath passes: 2711982944 (1.00 passes/pkt)
> - EMC hits: 2711677677 (100.0 %)
> - SMC hits: 0 ( 0.0 %)
> - Megaflow hits: 305261 ( 0.0 %, 1.00 subtbl lookups/hit)
> - Upcalls: 6 ( 0.0 %, 0.0 us/upcall)
> - Lost upcalls: 0 ( 0.0 %)
> Tx packets: 2711982944 (1209 Kpps)
> Tx batches: 84749583 (32.00 pkts/batch)
>
> Time: 13:54:40.213
> Measurement duration: 2242.675 s
>
> pmd thread numa_id 0 core_id 52:
>
> Iterations: 17529480287 (0.13 us/it)
> - Used TSC cycles: 5816709563052 (100.1 % of total cycles)
> - idle iterations: 17444555421 ( 84.1 % of used cycles)
> - busy iterations: 84924866 ( 15.9 % of used cycles)
> Rx packets: 2717592640 (1212 Kpps, 340 cycles/pkt)
> Datapath passes: 2717592640 (1.00 passes/pkt)
> - EMC hits: 2717280240 (100.0 %)
> - SMC hits: 0 ( 0.0 %)
> - Megaflow hits: 312362 ( 0.0 %, 1.00 subtbl lookups/hit)
> - Upcalls: 6 ( 0.0 %, 0.0 us/upcall)
> - Lost upcalls: 0 ( 0.0 %)
> Tx packets: 2717592608 (1212 Kpps)
> Tx batches: 84924866 (32.00 pkts/batch)
> [root at netqe29 jhsiao]#
>
>
> 3) "dpif-set dpif_avx512" only. The performance here is very
> strange, with ~30% reduction, while our testing shows performance
> improvement.
>
> - A "perf top" here (compared vs step 1) would be helpful to
> see what is going on
>
> See avx512-0 and avx512-1 attachments.
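>
> For reference, the toggle under test here is the DPIF implementation
> switch, which in this patch series is set roughly as follows (sketch; the
> exact command name may differ between revisions of the series):
>
>     ovs-appctl dpif-netdev/dpif-set dpif_avx512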
>
> 4) "miniflow parser set study", I don't understand what is meant
> by "Failed to converge"?
>
> This is a 64-byte 0-loss run, so "Failed to converge" means the binary
> search failed to converge on a meaningful Mpps value. This could be because
> drops are happening --- possibly 1 out of a million packets.
>
> - Is the traffic running in your benchmark Ether()/IP()/UDP() ?
>
> - Note that the only traffic pattern accelerated today is
> Ether()/IP()/UDP() (see patch
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210428091931.2090062-5-kumar.amber@intel.com/
> <https://patchwork.ozlabs.org/project/openvswitch/patch/20210428091931.2090062-5-kumar.amber@intel.com/>
> for details). The next revision of the patchset will include other
> traffic patterns, for example Ether()/Dot1Q()/IP()/UDP() and
> Ether()/IP()/TCP().
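>
> For completeness, the MFEX toggle referenced above is set along these lines
> (sketch; command naming follows this patch series and may change in later
> revisions):
>
>     # "study" samples traffic and then selects the best miniflow extract impl
>     ovs-appctl dpif-netdev/miniflow-parser-set study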
>
> RPM=openvswitch2.15-2.15.0-15.el8fdp (w/o "-msse4.2 -mpopcnt")
>
> * 15.2 Mpps
>
> 5) What CFLAGS "-march=" CPU ISA and "-O" optimization options are
> being used for the package?
>
> - It is likely that "-msse4.2 -mpopcnt" is already implied if
> -march=corei7 or Nehalem for example.
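>
> (One way to check what a given -march already implies is to look at the
> compiler's predefined macros, e.g. this sketch:
>
>     gcc -march=corei7 -dM -E -x c /dev/null | grep -E "__SSE4_2__|__POPCNT__"
>
> which prints both macros when those ISA extensions are enabled.)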
>
> Tim, can you answer this question?
>
> P2P benchmark
>
> * ovs-dpdk/25 Gb i40e <-> trex/i40e
>
> * single queue, two PMDs --- two HTs out of one CPU core.
>
> Host CPU
>
> Model name: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>
> Thanks for detailing the configuration, and looking forward to
> understanding the configuration/performance better.
>