[ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather implementation

Van Haaren, Harry harry.van.haaren at intel.com
Wed Jun 3 17:36:16 UTC 2020


> -----Original Message-----
> From: William Tu <u9012063 at gmail.com>
> Sent: Friday, May 29, 2020 7:49 PM
> To: Van Haaren, Harry <harry.van.haaren at intel.com>
> Cc: ovs-dev at openvswitch.org; i.maximets at ovn.org
> Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather
> implementation
> 
> On Fri, May 29, 2020 at 4:47 AM Van Haaren, Harry
> <harry.van.haaren at intel.com> wrote:
<snip old discussion>
> > Agree that isolating the hardware and being able to verify
> > environment would help in removing potential noise.. but
> > let us work with the setup you have. Do you know what CPU
> > it is you're running on?
> 
> Thanks! I think it's skylake
> root at instance-3:~/ovs# lscpu
> Architecture:        x86_64
<snip>

Yep, that looks like Skylake, and it has AVX512, so the requirements are met.
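
For what it's worth, here is a quick standalone sketch (not part of OVS, and the
exact feature set the patch requires should be double-checked against the code;
avx512f is just the baseline) for verifying AVX512 support at runtime via the
GCC/Clang builtin:

/* Standalone sanity check, not OVS code: report whether the CPU
 * advertises the AVX512 foundation and byte/word feature bits.
 * __builtin_cpu_supports() is a GCC/Clang builtin backed by CPUID. */
#include <stdio.h>

int main(void)
{
    int f  = __builtin_cpu_supports("avx512f");   /* AVX512 foundation */
    int bw = __builtin_cpu_supports("avx512bw");  /* AVX512 byte/word */

    printf("avx512f:  %s\n", f  ? "yes" : "no");
    printf("avx512bw: %s\n", bw ? "yes" : "no");
    return (f && bw) ? 0 : 1;
}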

<snip>
> > Would you mind re-testing with EMC disabled? Likely DPCLS will show up as a
> > much larger % in the CPU profile, and this might provide some new insights.
> >
> OK, with EMC disabled, the performance gap is a little better.
> Now we don't see memcmp.
> 
> === generic ===
> drop rate: 8.65Mpps
> pmd thread numa_id 0 core_id 1:
>   packets received: 223168512
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 0
>   smc hits: 0
>   megaflow hits: 223167820
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 659
>   avg. packets per output batch: 0.00
>   idle cycles: 0 (0.00%)
>   processing cycles: 51969566520 (100.00%)
>   avg cycles per packet: 232.87 (51969566520/223168512)
>   avg processing cycles per packet: 232.87 (51969566520/223168512)
> 
>   19.17%  pmd-c01/id:9  ovs-vswitchd        [.] dpcls_subtable_lookup_mf_u0w4_u1w1
>   18.93%  pmd-c01/id:9  ovs-vswitchd        [.] miniflow_extract
>   16.15%  pmd-c01/id:9  ovs-vswitchd        [.] eth_pcap_rx_infinite
>   11.34%  pmd-c01/id:9  ovs-vswitchd        [.] dp_netdev_input__
>   10.51%  pmd-c01/id:9  ovs-vswitchd        [.] miniflow_hash_5tuple
>    6.88%  pmd-c01/id:9  ovs-vswitchd        [.] free_dpdk_buf
>    5.63%  pmd-c01/id:9  ovs-vswitchd        [.] fast_path_processing
>    4.95%  pmd-c01/id:9  ovs-vswitchd        [.] cmap_find_batch
> 
> === AVX512 ===
> drop rate: 8.28Mpps
> pmd thread numa_id 0 core_id 1:
>   packets received: 138495296
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 0
>   smc hits: 0
>   megaflow hits: 138494847
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 416
>   avg. packets per output batch: 0.00
>   idle cycles: 0 (0.00%)
>   processing cycles: 33452482260 (100.00%)
>   avg cycles per packet: 241.54 (33452482260/138495296)
>   avg processing cycles per packet: 241.54 (33452482260/138495296)
> 
>   19.78%  pmd-c01/id:9  ovs-vswitchd        [.] miniflow_extract
>   17.73%  pmd-c01/id:9  ovs-vswitchd        [.] eth_pcap_rx_infinite
>   13.53%  pmd-c01/id:9  ovs-vswitchd        [.] dpcls_avx512_gather_skx_mf_4_1
>   12.00%  pmd-c01/id:9  ovs-vswitchd        [.] dp_netdev_input__
>   10.94%  pmd-c01/id:9  ovs-vswitchd        [.] miniflow_hash_5tuple
>    7.80%  pmd-c01/id:9  ovs-vswitchd        [.] free_dpdk_buf
>    5.97%  pmd-c01/id:9  ovs-vswitchd        [.] fast_path_processing
>    5.23%  pmd-c01/id:9  ovs-vswitchd        [.] cmap_find_batch

Looking at the details posted above, we do see a cycle reduction in DPCLS:
Scalar (232 cyc/pkt, ~19% in dpcls) ~= 45 cyc/pkt spent in DPCLS
AVX512 (241 cyc/pkt, ~13.5% in dpcls) ~= 33 cyc/pkt spent in DPCLS
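
(The per-packet DPCLS figures above are just the reported avg cycles per packet
multiplied by the perf share of the respective lookup symbol; a trivial sketch
of that back-of-envelope arithmetic, using the numbers from the two runs:)

/* Back-of-envelope only: estimate cycles spent in DPCLS per packet as
 * (avg cycles per packet) * (perf share of the dpcls lookup symbol). */
#include <stdio.h>

int main(void)
{
    double scalar = 232.87 * 0.1917;  /* dpcls_subtable_lookup_mf_u0w4_u1w1 */
    double avx512 = 241.54 * 0.1353;  /* dpcls_avx512_gather_skx_mf_4_1 */

    printf("scalar DPCLS: ~%.0f cyc/pkt\n", scalar);  /* prints ~45 */
    printf("avx512 DPCLS: ~%.0f cyc/pkt\n", avx512);  /* prints ~33 */
    return 0;
}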

Re-stating the obvious strangeness above: the overall performance decreases.
This seems to show that, despite DPCLS running faster, the overall rate of
work is somehow reduced. That has not been my experience; testing the AVX512
DPCLS code in a bare-metal (not VM) environment with HW NICs has shown a good
performance uplift here.


> I'm not able to get current cpu frequency, probably due to running in VM?
> root at instance-3:~/ovs# modprobe acpi-cpufreq
> root at instance-3:~/ovs# cpufreq-info
> cpufrequtils 008: cpufreq-info (C) Dominik Brodowski 2004-2009
> Report errors and bugs to cpufreq at vger.kernel.org, please.
> analyzing CPU 0:
>   no or unknown cpufreq driver is active on this CPU
>   maximum transition latency: 4294.55 ms.

Yes, not being able to read CPU frequencies etc. is likely a consequence of
running in a VM. The logical next step would be to remove noise and other
environmental issues so we can identify exactly what the root cause of the
slowdown is - unfortunately that may not be possible in this environment.
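
If it helps as a fallback: even when no cpufreq driver is active in the guest,
/proc/cpuinfo usually still carries a nominal "cpu MHz" field. A small
illustrative snippet to dump it (in a VM this is likely a static nominal value
rather than the live frequency, so treat it as a rough indication only):

/* Illustrative only: print the "cpu MHz" lines from /proc/cpuinfo.
 * In a VM this may be a static nominal value, not the live frequency. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof line, f)) {
        if (!strncmp(line, "cpu MHz", 7)) {
            fputs(line, stdout);
        }
    }
    fclose(f);
    return 0;
}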

I'm preparing a v3 of the patchset, with a number of usability and general
improvements: it fixes issues present in the v2, such as the "subtable reprobe"
running at one-second intervals, and adds a command to print the available
lookup functions and their current priorities. I'm hoping to get the v3 up
early next week.

Regards, -Harry

