[ovs-dev] [PATCH v1 0/6] Memory access optimization for flow scalability of userspace datapath.

Yanqin Wei Yanqin.Wei at arm.com
Mon Jul 6 10:22:22 UTC 2020


Hi William,

Many thanks for taking the time to test these patches. The numbers were measured on an Arm server, but x86 shows a similar improvement.
CPU cache size also slightly affects the performance data: the larger the cache, the lower the probability of cache line refills and evictions.
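As a rough, illustrative calculation (exact sizes depend on the build and version): the EMC has 8192 entries of a few hundred bytes each, so the table alone occupies on the order of a few MB. When the last-level cache is much larger than that, EMC probes mostly stay cached; with a smaller cache, 10K-100K concurrent flows keep evicting and refilling those lines, which is where these patches help most.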

Best Regards,
Wei Yanqin 
> 
> On Tue, Jun 30, 2020 at 2:26 AM Yanqin Wei <Yanqin.Wei at arm.com> wrote:
> >
> > Hi everyone,
> >
> > These patches can significantly improve the multi-flow throughput of the
> > userspace datapath. If reviewing all of the patches would take too much
> > time, I suggest looking at the 2nd and 3rd first, which contain the major
> > improvements in the series (a simplified sketch of the 2/6 idea follows
> > below):
> > [ovs-dev][PATCH v1 2/6] dpif-netdev: add tunnel_valid flag to skip
> >   ip/ipv6 address comparison
> > [ovs-dev][PATCH v1 3/6] dpif-netdev: improve emc lookup performance by
> >   contiguous storage of hash value.
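> >
> > The idea of 2/6, as a simplified sketch (the field and function names
> > here are illustrative, not the exact patch code): a flag records whether
> > tunnel metadata has been initialized, so per-packet code can skip loading
> > and comparing the wide tunnel ip/ipv6 addresses for non-tunnel traffic:
> >
> > #include <netinet/in.h>
> > #include <stdbool.h>
> > #include <stdint.h>
> > #include <string.h>
> >
> > struct pkt_metadata_sketch {
> >     bool tunnel_valid;            /* False for non-tunnel packets. */
> >     struct {
> >         uint32_t ip_dst;          /* IPv4 tunnel endpoint. */
> >         struct in6_addr ipv6_dst; /* IPv6 tunnel endpoint. */
> >     } tunnel;
> > };
> >
> > static inline bool
> > tunnel_md_equal(const struct pkt_metadata_sketch *a,
> >                 const struct pkt_metadata_sketch *b)
> > {
> >     if (!a->tunnel_valid && !b->tunnel_valid) {
> >         /* Fast path: neither packet is tunneled, so the address
> >          * comparison (and the extra cache lines it touches) is
> >          * skipped entirely. */
> >         return true;
> >     }
> >     return a->tunnel_valid == b->tunnel_valid
> >            && !memcmp(&a->tunnel, &b->tunnel, sizeof a->tunnel);
> > }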
> >
> > Any comments are appreciated.
> >
> > Best Regards,
> > Wei Yanqin
> >
> > > -----Original Message-----
> > > From: Yanqin Wei <Yanqin.Wei at arm.com>
> > > Sent: Tuesday, June 2, 2020 3:10 PM
> > > To: dev at openvswitch.org
> > > Cc: nd <nd at arm.com>; i.maximets at ovn.org; u9012063 at gmail.com;
> > > Malvika Gupta <Malvika.Gupta at arm.com>; Lijian Zhang
> > > <Lijian.Zhang at arm.com>; Ruifeng Wang <Ruifeng.Wang at arm.com>;
> > > Lance Yang <Lance.Yang at arm.com>; Yanqin Wei <Yanqin.Wei at arm.com>
> > > Subject: [ovs-dev][PATCH v1 0/6] Memory access optimization for flow
> > > scalability of userspace datapath.
> > >
> > > The OVS userspace datapath is a memory-access-heavy program. It
> > > loads and stores a large amount of memory, including packet
> > > headers, metadata, EMC/SMC/DPCLS tables, and so on. This causes
> > > many cache line misses and refills, which has a great impact on
> > > flow scalability. In some cases EMC even hurts overall
> > > performance, and it is difficult for users to dynamically manage
> > > whether EMC is enabled.
> > >
> > > This series of patches improves memory access in the userspace
> > > datapath as follows:
> > > 1. Reduce the number of metadata cache lines accessed by
> > > non-tunnel traffic.
> > > 2. Remove unnecessary memory loads/stores per batch/flow.
> > > 3. Change the layout of the EMC data structures so that the hash
> > > values are stored contiguously (see the sketch below).
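> > >
> > > A simplified sketch of item 3 (the names and sizes are illustrative,
> > > not the exact patch code): the hash moves out of the large per-entry
> > > struct into one contiguous array, so a lookup compares hashes from a
> > > single cache line and only dereferences the big entry on a hash match:
> > >
> > > #include <stdint.h>
> > > #include <string.h>
> > >
> > > #define EMC_ENTRIES 8192          /* Power of two, mask-indexable. */
> > >
> > > struct emc_key_sketch {           /* Stand-in for netdev_flow_key. */
> > >     uint8_t buf[192];
> > > };
> > >
> > > struct emc_entry_sketch {
> > >     struct emc_key_sketch key;    /* Large, spans cache lines. */
> > >     void *flow;
> > > };
> > >
> > > struct emc_cache_sketch {
> > >     uint32_t hashes[EMC_ENTRIES]; /* Contiguous hash storage. */
> > >     struct emc_entry_sketch entries[EMC_ENTRIES];
> > > };
> > >
> > > static inline void *
> > > emc_lookup_sketch(const struct emc_cache_sketch *emc, uint32_t hash,
> > >                   const struct emc_key_sketch *key)
> > > {
> > >     uint32_t i = hash & (EMC_ENTRIES - 1);
> > >
> > >     /* Cheap contiguous-hash compare first; the wide key compare
> > >      * touches the entry's cache lines only on a hash hit. */
> > >     if (emc->hashes[i] == hash
> > >         && !memcmp(&emc->entries[i].key, key, sizeof *key)) {
> > >         return emc->entries[i].flow;
> > >     }
> > >     return NULL;
> > > }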
> > >
> > > In NIC2NIC traffic tests, an overall performance improvement is
> > > observed, especially in multi-flow cases:
> > > Flows           Improvement
> > > 1-1K flows      5-10%
> > > 10K flows       20%
> > > 100K flows      40%
> > > EMC disabled    10%
> 
> Thanks for submitting the patch series. I applied the series and I do see
> the performance improvement you describe above.
> BTW, are your numbers from an Arm server or x86?

> Below are my numbers using a single flow and a drop action on an Intel(R)
> Xeon(R) CPU @ 2.00GHz.
> In summary I see around a 7% reduction in cycles per packet using 1 flow
> (computed after the stats below).
> 
> === master ===
> root at instance-3:~/ovs# ovs-appctl dpif-netdev/pmd-stats-show
> pmd thread numa_id 0 core_id 0:
>   packets received: 96269888
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 87513839
>   smc hits: 0
>   megaflow hits: 8755584
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 432
>   avg. packets per output batch: 0.00
>   idle cycles: 0 (0.00%)
>   processing cycles: 20083008856 (100.00%)
>   avg cycles per packet: 208.61 (20083008856/96269888)
>   avg processing cycles per packet: 208.61 (20083008856/96269888)
> 
> === master without EMC ===
> pmd thread numa_id 0 core_id 1:
>   packets received: 90775936
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 0
>   smc hits: 0
>   megaflow hits: 90775424
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 479
>   avg. packets per output batch: 0.00
>   idle cycles: 0 (0.00%)
>   processing cycles: 21239087946 (100.00%)
>   avg cycles per packet: 233.97 (21239087946/90775936)
>   avg processing cycles per packet: 233.97 (21239087946/90775936)
> 
> === yanqin v1: ===
> pmd thread numa_id 0 core_id 1:
>   packets received: 156582112
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 142344109
>   smc hits: 0
>   megaflow hits: 14237554
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 448
>   avg. packets per output batch: 0.00
>   idle cycles: 4320112 (0.01%)
>   processing cycles: 30503055968 (99.99%)
>   avg cycles per packet: 194.83 (30507376080/156582112)
>   avg processing cycles per packet: 194.81 (30503055968/156582112)
> 
> === yanqin v1 without EMC: ===
> pmd thread numa_id 0 core_id 0:
>   packets received: 48441664
>   packet recirculations: 0
>   avg. datapath passes per packet: 1.00
>   emc hits: 0
>   smc hits: 0
>   megaflow hits: 48441182
>   avg. subtable lookups per megaflow hit: 1.00
>   miss with success upcall: 1
>   miss with failed upcall: 449
>   avg. packets per output batch: 0.00
>   idle cycles: 0 (0.00%)
>   processing cycles: 10513468302 (100.00%)
>   avg cycles per packet: 217.03 (10513468302/48441664)
>   avg processing cycles per packet: 217.03 (10513468302/48441664)
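>
> For reference, computing the delta from the cycles-per-packet numbers
> above: with EMC, (208.61 - 194.83) / 208.61 is a ~6.6% reduction in
> cycles per packet (about 7.1% more packets per cycle); without EMC,
> (233.97 - 217.03) / 233.97 is a ~7.2% reduction.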

