[ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations

Timothy Redaelli tredaelli at redhat.com
Thu May 13 15:54:49 UTC 2021


On Thu, 13 May 2021 10:27:19 -0400
Jean Hsiao <jhsiao at redhat.com> wrote:

> 
> On 5/11/21 7:35 AM, Van Haaren, Harry wrote:
> >> -----Original Message-----
> >> From: Timothy Redaelli <tredaelli at redhat.com>
> >> Sent: Monday, May 10, 2021 6:43 PM
> >> To: Amber, Kumar <kumar.amber at intel.com>; dev at openvswitch.org
> >> Cc: i.maximets at ovn.org; jhsiao at redhat.com; fbl at redhat.com; Van Haaren, Harry
> >> <harry.van.haaren at intel.com>
> >> Subject: Re: [ovs-dev] [v2 v2 0/6] MFEX Infrastructure + Optimizations
> > <snip patchset details for brevity>
> >
> >> Hi,
> >> we (as Red Hat) ran some tests with a "special" build created on top of
> >> master (a019868a6268 at that time) with the two series ("DPIF
> >> Framework + Optimizations" and "MFEX Infrastructure + Optimizations")
> >> cherry-picked.
> >> The spec file was also modified to add "-msse4.2 -mpopcnt" to the OVS
> >> CFLAGS.
> > Hi Timothy,
> >
> > Thanks for testing and reporting back your findings! Most of the configuration is clear to me, but I have a few open questions inline below for context.
> >
> > The performance numbers reported in the email below do not show benefit when enabling AVX512, which contradicts our
> > recent whitepaper on benchmarking an Optimized Deployment of OVS, which includes the AVX512 patches you've benchmarked too.
> > Specifically, Table 8 (DPIF/MFEX patches) and Table 9 (overall optimizations at the platform level) are relevant:
> > https://networkbuilders.intel.com/solutionslibrary/open-vswitch-optimized-deployment-benchmark-technology-guide
> >
> > Based on the differences between these performance reports, there must be some discrepancy in our testing/measurements.
> > I hope that the questions below help us understand any differences so we can all measure the benefits from these optimizations.
> >
> > Regards, -Harry
> >
> >
> >> RPM=openvswitch2.15-2.15.0-37.avx512.1.el8fdp (the "special" build with
> >> the patches backported)
> >>
> >>     * Master --- 15.2 Mpps
> >>     * Plus "avx512_gather 3" Only --- 15.2 Mpps
> >>     * Plus "dpif-set dpif_avx512" Only --- 10.1 Mpps
> >>     * Plus "miniflow-parser-set study" --- Failed to converge
> >>     * Plus all three --- 13.5 Mpps
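
For anyone reproducing the three configurations above, these are the
runtime toggles added by the two series (command names as used in these
revisions of the patches; they may change in later versions):

  ovs-appctl dpif-netdev/subtable-lookup-prio-set avx512_gather 3
  ovs-appctl dpif-netdev/dpif-set dpif_avx512
  ovs-appctl dpif-netdev/miniflow-parser-set study
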
> > Open questions:
> > 1) Is CPU frequency turbo enabled in any scenario, or always pinned to the 2.6 GHz base frequency?
> >     - A "perf top -C x,y"   (where x,y are datapath hyperthread ids) would be interesting to compare with 3) below.
> See attached screenshots for two samples --- master-0 and master-1.
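
For the turbo question: assuming the intel_pstate driver is in use, a
quick check is

  cat /sys/devices/system/cpu/intel_pstate/no_turbo

where 1 means turbo is disabled and the cores stay at the 2.6 GHz base
frequency.
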
> >
> > 2) "plus Avx512 gather 3" (aka, DPCLS in AVX512), we see same performance. Is DPCLS in use, or is EMC doing all the work?
> >     - The output of " ovs-appctl dpif-netdev/pmd-perf-show" would be interesting to understand where packets are classified.
> 
> EMC is doing all the work --- see the log below. This could explain why
> enabling avx512 is not helping.
> 
> NOTE: Our initial study showed that even with EMC disabled, avx512 still
> didn't win the case.
> 
> [root at netqe29 jhsiao]# ovs-appctl dpif-netdev/subtable-lookup-prio-get
> Available lookup functions (priority : name)
>    0 : autovalidator
> *1 : generic*
>    0 : avx512_gather
> [root at netqe29 jhsiao]#
> 
> sleep 60; ovs-appctl dpif-netdev/pmd-perf-show
> 
> 
> Time: 13:54:40.213
> Measurement duration: 2242.679 s
> 
> pmd thread numa_id 0 core_id 24:
> 
>    Iterations:         17531214131  (0.13 us/it)
>    - Used TSC cycles: 5816810246080  (100.1 % of total cycles)
>    - idle iterations:  17446464548  ( 84.1 % of used cycles)
>    - busy iterations:     84749583  ( 15.9 % of used cycles)
>    Rx packets:          2711982944  (1209 Kpps, 340 cycles/pkt)
>    Datapath passes:     2711982944  (1.00 passes/pkt)
>    - EMC hits:          2711677677  (100.0 %)
>    - SMC hits:                   0  (  0.0 %)
>    - Megaflow hits:         305261  (  0.0 %, 1.00 subtbl lookups/hit)
>    - Upcalls:                    6  (  0.0 %, 0.0 us/upcall)
>    - Lost upcalls:               0  (  0.0 %)
>    Tx packets:          2711982944  (1209 Kpps)
>    Tx batches:            84749583  (32.00 pkts/batch)
> 
> Time: 13:54:40.213
> Measurement duration: 2242.675 s
> 
> pmd thread numa_id 0 core_id 52:
> 
>    Iterations:         17529480287  (0.13 us/it)
>    - Used TSC cycles: 5816709563052  (100.1 % of total cycles)
>    - idle iterations:  17444555421  ( 84.1 % of used cycles)
>    - busy iterations:     84924866  ( 15.9 % of used cycles)
>    Rx packets:          2717592640  (1212 Kpps, 340 cycles/pkt)
>    Datapath passes:     2717592640  (1.00 passes/pkt)
>    - EMC hits:          2717280240  (100.0 %)
>    - SMC hits:                   0  (  0.0 %)
>    - Megaflow hits:         312362  (  0.0 %, 1.00 subtbl lookups/hit)
>    - Upcalls:                    6  (  0.0 %, 0.0 us/upcall)
>    - Lost upcalls:               0  (  0.0 %)
>    Tx packets:          2717592608  (1212 Kpps)
>    Tx batches:            84924866  (32.00 pkts/batch)
> [root at netqe29 jhsiao]#
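
Two side notes on the stats above. First, a sanity check: at 2.6 GHz and
340 cycles/pkt a fully busy PMD would forward about 2.6e9 / 340 ~= 7.6
Mpps, so two PMDs ~= 15.3 Mpps, in line with the 15.2 Mpps above; the low
average rate reported here (~1.2 Mpps at 15.9 % busy) presumably reflects
that the 2242 s window also covers idle time between runs. Second, for
the EMC-disabled runs mentioned earlier, EMC insertion can be switched
off at runtime with

  ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0

(0 disables insertion, the default inverse probability is 100), which at
steady state pushes classification down to SMC/dpcls.
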
> 
> >
> > 3) "dpif-set dpif_avx512" only. The performance here is very strange, with ~30% reduction, while our testing shows performance improvement.
> >     - A "perf top" here (compared vs step 1) would be helpful to see what is going on
> See avx512-0 and avx512-1 attachments.
> >
> > 4) "miniflow parser set study", I don't understand what is meant by "Failed to converge"?
> This is a 64-byte, 0-loss run. So "Failed to converge" means the binary
> search failed to get a meaningful Mpps value. This could be because drops
> are happening --- possibly 1 out of a million packets.
> >     - Is the traffic running in your benchmark Ether()/IP()/UDP() ?
> >     - Note that the only traffic pattern accelerated today is Ether()/IP()/UDP() (see patch https://patchwork.ozlabs.org/project/openvswitch/patch/20210428091931.2090062-5-kumar.amber@intel.com/ for details). The next revision of the patchset will include other traffic patterns, for example Ether()/Dot1Q()/IP()/UDP() and Ether()/IP()/TCP().
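
On the "study" point: if this build includes the getter from the series
(upstream it is exposed as "ovs-appctl dpif-netdev/miniflow-parser-get"),
it is worth confirming which MFEX implementation study actually selected
per PMD:

  ovs-appctl dpif-netdev/miniflow-parser-get

If the generated traffic is not plain Ether()/IP()/UDP(), study will keep
the scalar parser and the run measures the scalar path anyway.
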
> >
> >
> >> RPM=openvswitch2.15-2.15.0-15.el8fdp (w/o "-msse4.2 -mpopcnt")
> >>     * 15.2 Mpps
> > 5) What CFLAGS "-march=" CPU ISA and "-O" optimization options are being used for the package?
> >     - It is likely that "-msse4.2 -mpopcnt" are already implied if, for example, -march=corei7 or -march=nehalem is used.
> >
> Tim, can you answer this question?

Sure,
I'm sure "-msse4.2 -mpopcnt" are not used, since avx512_gather is not
present in subtable-lookup-prio-get and our LD is patched for AVX512
(hence -DHAVE_LD_AVX512_GOOD in the build line below).

This is a line from the buildlog so you can see which flags are used:

libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I.. -I ../include -I ./include -I ../lib -I ./lib -mavx512f -mavx512bw -mavx512dq -mbmi2 -fPIC -Wstrict-prototypes -Wall -Wextra -Wno-sign-compare -Wpointer-arith -Wformat -Wformat-security -Wswitch-enum -Wunused-parameter -Wbad-function-cast -Wcast-align -Wstrict-prototypes -Wold-style-definition -Wmissing-prototypes -Wmissing-field-initializers -fno-strict-aliasing -Wswitch-bool -Wlogical-not-parentheses -Wsizeof-array-argument -Wbool-compare -Wshift-negative-value -Wduplicated-cond -Wshadow -Wmultistatement-macros -Wcast-align=strict -mssse3 -I/builddir/build/BUILD/dpdk-build/include -include rte_config.h -I/usr/usr/include -D_FILE_OFFSET_BITS=64 -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -DHAVE_AVX512F -DHAVE_LD_AVX512_GOOD -c ../lib/dpif-netdev-lookup-avx512-gather.c -o lib/libopenvswitchavx512_la-dpif-netdev-lookup-avx512-gather.o

where the first -specs= is only there to enable PIC/PIE and the second one
enables the annobin plugin (used by the annocheck program, which reads the
notes generated by annobin to check that the specified files were compiled
with the correct security hardening).

As you can see, no -march or -mcpu options are used.
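
A quick way to double-check what a given -march implies, e.g. for
-march=corei7 (Nehalem-class), is to dump the predefined macros:

  echo | gcc -march=corei7 -dM -E - | grep -iE 'sse4_2|popcnt'

This prints __SSE4_2__ and __POPCNT__, so with only -mtune=generic (as in
the build line above) SSE4.2/POPCNT are indeed not enabled for the
generic objects.
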

> >> P2P benchmark
> >>     * ovs-dpdk/25 Gb i40e <-> trex/i40e
> >>     * single queue, two PMDs --- two HTs out of one CPU core.
> >>
> >> Host CPU
> >> Model name:          Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
> > Thanks for detailing the configuration, and looking forward to understanding the configuration/performance better.
> >


