[ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather implementation
William Tu
u9012063 at gmail.com
Fri May 29 18:49:04 UTC 2020
On Fri, May 29, 2020 at 4:47 AM Van Haaren, Harry
<harry.van.haaren at intel.com> wrote:
>
> > -----Original Message-----
> > From: William Tu <u9012063 at gmail.com>
> > Sent: Friday, May 29, 2020 2:19 AM
> > To: Van Haaren, Harry <harry.van.haaren at intel.com>
> > Cc: ovs-dev at openvswitch.org; i.maximets at ovn.org
> > Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather
> > implementation
> >
> > On Wed, May 27, 2020 at 12:21:43PM +0000, Van Haaren, Harry wrote:
> <snip hashing details>
> > > As a result, hashing identical data in different .c files produces
> > > different hash values.
> > >
> > > From OVS docs (http://docs.openvswitch.org/en/latest/intro/install/general/)
> > the following
> > > enables native ISA for your build, or else just enable SSE4.2 and popcount:
> > > ./configure CFLAGS="-g -O2 -march=native"
> > > ./configure CFLAGS="-g -O2 -march=nehalem"
> >
> > Hi Harry,
> > Thanks for the info!
> > I can make it work now, with
> > ./configure CFLAGS="-g -O2 -msse4.2 -march=native"
>
> OK - that's good - the root cause of the bug/hash-mismatch is confirmed!
>
>
> > using similar setup
> > ovs-ofctl add-flow br0 'actions=drop'
> > ovs-appctl dpif-netdev/subtable-lookup-set avx512_gather 5
> > ovs-vsctl add-port br0 tg0 -- set int tg0 type=dpdk \
> >   options:dpdk-devargs=vdev:net_pcap0,rx_pcap=/root/ovs/p0.pcap,infinite_rx=1
> >
> > The performance seems a little worse (9.7Mpps -> 8.7Mpps).
> > I wonder whether it's due to running in a VM (I don't have a
> > physical machine, however).
>
> Performance degradations are not expected; let me try to understand
> the performance data posted below and work through it.
>
> Agreed that isolating the hardware and being able to verify the
> environment would help remove potential noise... but
> let's work with the setup you have. Do you know what CPU
> you're running on?
Thanks! I think it's Skylake:
root at instance-3:~/ovs# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) CPU @ 2.00GHz
Stepping: 3
CPU MHz: 2000.176
BogoMIPS: 4000.35
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 39424K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx
pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc
cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2
x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase
tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f
avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl
xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 03)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
>
> It seems you have EMC enabled (as per OVS defaults). The stats posted show
> an approx 10:1 ratio on hits in EMC and DPCLS. This likely adds noise to the
> measurements - as only 10% of the packets hit the changes in DPCLS.
>
> Also, in the perf top profile, dp_netdev_input__ takes more cycles than
> miniflow_extract, and memcmp() is present, indicating the EMC is consuming
> CPU cycles to perform its duties.
>
> I guess our simple test case is failing to show what we're trying to measure;
> as you know, the EMC likes low flow counts, which explains why DPCLS is
> only ~2% of CPU time.
>
> <snip>
> Details of the CPU profiles & PMD stats for AVX512 and generic DPCLS
> removed to trim the conversation. Very helpful to see into your system, and I'm
> a big fan of perf top and friends - so this was useful to see, thanks!
> (Future readers: check the mailing list "thread" view for the previous post's details.)
>
>
> > Is there anything I should double-check?
>
> Would you mind re-testing with EMC disabled? Likely DPCLS will show up as a
> much larger % in the CPU profile, and this might provide some new insights.
>
OK, with EMC disabled, the performance gap is a little smaller,
and memcmp no longer shows up in the profile.
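(For reference, EMC insertion can be turned off via the `emc-insert-inv-prob`
knob; this is a config sketch of the usual way to do it, not necessarily the
exact commands used here:)

```shell
# Disable EMC insertion entirely (inverse probability 0 means "never insert"),
# so all flows fall through to the SMC/DPCLS classifiers.
ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=0

# Restore the default afterwards (insert roughly 1 in 100 flows).
ovs-vsctl set Open_vSwitch . other_config:emc-insert-inv-prob=100
```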
=== generic ===
drop rate: 8.65Mpps
pmd thread numa_id 0 core_id 1:
packets received: 223168512
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 0
smc hits: 0
megaflow hits: 223167820
avg. subtable lookups per megaflow hit: 1.00
miss with success upcall: 1
miss with failed upcall: 659
avg. packets per output batch: 0.00
idle cycles: 0 (0.00%)
processing cycles: 51969566520 (100.00%)
avg cycles per packet: 232.87 (51969566520/223168512)
avg processing cycles per packet: 232.87 (51969566520/223168512)
19.17% pmd-c01/id:9 ovs-vswitchd [.] dpcls_subtable_lookup_mf_u0w4_u1w1
18.93% pmd-c01/id:9 ovs-vswitchd [.] miniflow_extract
16.15% pmd-c01/id:9 ovs-vswitchd [.] eth_pcap_rx_infinite
11.34% pmd-c01/id:9 ovs-vswitchd [.] dp_netdev_input__
10.51% pmd-c01/id:9 ovs-vswitchd [.] miniflow_hash_5tuple
6.88% pmd-c01/id:9 ovs-vswitchd [.] free_dpdk_buf
5.63% pmd-c01/id:9 ovs-vswitchd [.] fast_path_processing
4.95% pmd-c01/id:9 ovs-vswitchd [.] cmap_find_batch
=== AVX512 ===
drop rate: 8.28Mpps
pmd thread numa_id 0 core_id 1:
packets received: 138495296
packet recirculations: 0
avg. datapath passes per packet: 1.00
emc hits: 0
smc hits: 0
megaflow hits: 138494847
avg. subtable lookups per megaflow hit: 1.00
miss with success upcall: 1
miss with failed upcall: 416
avg. packets per output batch: 0.00
idle cycles: 0 (0.00%)
processing cycles: 33452482260 (100.00%)
avg cycles per packet: 241.54 (33452482260/138495296)
avg processing cycles per packet: 241.54 (33452482260/138495296)
19.78% pmd-c01/id:9 ovs-vswitchd [.] miniflow_extract
17.73% pmd-c01/id:9 ovs-vswitchd [.] eth_pcap_rx_infinite
13.53% pmd-c01/id:9 ovs-vswitchd [.] dpcls_avx512_gather_skx_mf_4_1
12.00% pmd-c01/id:9 ovs-vswitchd [.] dp_netdev_input__
10.94% pmd-c01/id:9 ovs-vswitchd [.] miniflow_hash_5tuple
7.80% pmd-c01/id:9 ovs-vswitchd [.] free_dpdk_buf
5.97% pmd-c01/id:9 ovs-vswitchd [.] fast_path_processing
5.23% pmd-c01/id:9 ovs-vswitchd [.] cmap_find_batch
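Sanity-checking the per-packet cycle numbers with a quick bit of awk
arithmetic reproduces the ratios in the pmd-stats output above (in this VM
the AVX512 run spends roughly 9 more cycles per packet than generic):

```shell
# Recompute avg processing cycles per packet from the raw PMD counters above.
awk 'BEGIN {
    printf "generic: %.2f cycles/pkt\n", 51969566520 / 223168512;
    printf "avx512:  %.2f cycles/pkt\n", 33452482260 / 138495296;
}'
```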
I'm not able to get the current CPU frequency, probably because it's running in a VM?
root at instance-3:~/ovs# modprobe acpi-cpufreq
root at instance-3:~/ovs# cpufreq-info
cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to cpufreq at vger.kernel.org, please.
analyzing CPU 0:
no or unknown cpufreq driver is active on this CPU
maximum transition latency: 4294.55 ms.
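Maybe I can at least read the value the kernel reports in procfs instead (a
sketch; in KVM guests this is usually a fixed TSC-derived number, not the
live frequency):

```shell
# No cpufreq driver in the guest, so fall back to what /proc/cpuinfo reports.
grep -m1 "cpu MHz" /proc/cpuinfo 2>/dev/null || echo "cpu MHz not exposed"
```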
Regards,
William