[ovs-dev] [PATCH] dpif-netdev: Refactor datapath flow cache

Wang, Yipeng1 yipeng1.wang at intel.com
Wed Jan 10 04:38:14 UTC 2018


Hi Jan, please find my replies inline.

>-----Original Message-----
>>
>> [Wang, Yipeng] In my test, I compared the proposed EMC with the current EMC, both with 16k entries.
>> If I turned off THP, the current EMC caused many TLB misses because of its larger entry size, which I profiled with VTune.
>> Once I turned on THP with no other changes, the current EMC's 
>> throughput increased a lot and became comparable with the newly proposed EMC. In VTune, the EMC lookup TLB misses decreased from 100 million to 0 during the 30-second profiling window.
>> So if THP is enabled, reducing the EMC entry size may not give much benefit compared to the current EMC.
>> It is worth mentioning that both use a similar amount of CPU cache, 
>> since only the miniflow struct is accessed by the CPU, so the TLB should be the major concern.
>
>I understand your point. But I can't seem to reproduce the effect of THP on my system.
>I don't have VTune available, but I guess "perf stat" should also 
>provide TLB miss statistics.
>
>How can you check if ovs-vswitchd is using transparent huge pages for 
>backing e.g. the EMC memory?
>

[Wang, Yipeng]
I used master OVS and changed the EMC to 16k entries. I fed 10k or more flows to stress the EMC. With perf, I tried this command:
sudo perf stat -p PID -e dTLB-load-misses
It shows that the TLB miss count changes a lot with THP on or off on my machine. VTune shows the emc_lookup function's data separately, though.

To check whether THP is used by OVS, I found a Red Hat-suggested command handy:
From: https://access.redhat.com/solutions/46111
grep -e AnonHugePages  /proc/*/smaps | awk  '{ if($2>4) print $0} ' |  awk -F "/"  '{print $0; system("ps -fp " $3)} '
I don't know how to check each individual function, though.
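
In case it is useful, below is a minimal C sketch (not OVS code; all naming is mine) that does the same process-level check programmatically, by summing the AnonHugePages counters in /proc/<pid>/smaps. Per-function attribution would still need a profiler:

/* Sketch only: report how much of a process's anonymous memory is backed
 * by transparent huge pages, by summing AnonHugePages in /proc/<pid>/smaps.
 * Equivalent to the grep/awk one-liner above, just easier to extend. */
#include <stdio.h>

int
main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof path, "/proc/%s/smaps", argv[1]);

    FILE *f = fopen(path, "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    unsigned long long total_kb = 0;
    char line[256];
    while (fgets(line, sizeof line, f)) {
        unsigned long long kb;
        /* Lines look like "AnonHugePages:      2048 kB". */
        if (sscanf(line, "AnonHugePages: %llu kB", &kb) == 1) {
            total_kb += kb;
        }
    }
    fclose(f);

    printf("AnonHugePages total: %llu kB\n", total_kb);
    return 0;
}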

>>
>> [Wang, Yipeng] Yes, there are no systematic collisions. However, 
>> in general, a 1-hash table tends to cause many more misses than a 2-hash table. 
>> For code simplicity, I agree that 1-hash is simpler and much easier 
>> to understand. For performance, if the flows fit in the 1-hash table, 
>> they should also stay in the primary location of the 2-hash table, so 
>> they should have similar lookup speed. For large numbers of 
>> flows in general, traffic will have a higher miss ratio in the 1-hash than in the 
>> 2-hash table. In one of our tests with 10k flows and 3 subtables (test cases described later), with the EMC sized for 16k entries, the 2-hash EMC has about a 14% miss ratio, while the 1-hash EMC has a 47% miss ratio.
>
>I agree that a lower EMC hit rate is a concern with just DPCLS or CD+DPCLS as second stage.
>But with DFC the extra cost for a miss on EMC is low, as the DFC lookup 
>cost is only slightly higher than the EMC itself. An EMC miss is cheap as it will 
>typically already be detected when comparing the full RSS hash.
>
>Furthermore, the EMC is now mainly meant to speed up the biggest 
>elephant flows, so it can be smaller and thrashing is avoided by very low insertion probability.
>Simplistic benchmarks using a large number of "eternal" flows with 
>equidistantly spaced packets are really an unrealistic worst case for any cache-based architecture.
>

[Wang, Yipeng]
If realistic traffic patterns mostly hit the EMC with elephant flows, I agree that the EMC could be simplified.

>>
>> [Wang, Yipeng] We agree that a DFC hit performs better than a CD hit, 
>> but CD usually has a higher hit rate for large numbers of flows, as the data below shows.
>
>That is something I don't yet understand. Is this because of the fact 
>that CD stores up to 16 entries per hash bucket and handles collisions better?

[Wang, Yipeng]
Yes, with two hash functions and 16 entries per bucket, CD has far fewer misses in general.

As a first step toward combining CD and DFC, I incorporated the signature and way-associative structure from CD into DFC. It is only a simple prototype without any performance tuning, but preliminary results show good improvements in miss ratio and throughput. I will post the complete results soon.

Since DFC/CD is much faster than the megaflow lookup, I believe a higher hit rate is preferred, so a CD-like way-associative structure should be helpful. The per-entry signature also helps performance, with an effect similar to the EMC's full-hash check.
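
To make that concrete, here is a rough sketch of the bucket layout I am prototyping. The names and sizes are made up for illustration; this is not the actual patch code:

/* Sketch only: DFC-style entries (megaflow pointers) organized like CD,
 * with per-entry signatures, 16-way buckets and two candidate buckets
 * (primary/secondary hash) per key. */
#include <stdint.h>

#define DFCD_ENTRIES_PER_BUCKET 16      /* way-associativity taken from CD */

struct dp_netdev_flow;                  /* megaflow, as in dpif-netdev.c */

struct dfcd_bucket {
    /* Small signatures derived from the packet RSS hash; compared before
     * touching the flow itself, much like the EMC compares the full hash. */
    uint16_t sig[DFCD_ENTRIES_PER_BUCKET];
    struct dp_netdev_flow *flow[DFCD_ENTRIES_PER_BUCKET];
};

struct dfcd_cache {
    struct dfcd_bucket *buckets;        /* power-of-two number of buckets */
    uint32_t bucket_mask;
    /* Each key maps to two candidate buckets, which is what keeps the
     * miss ratio low once the cache fills up. */
};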

>>
>> [Wang, Yipeng] We use the test/rules we posted with our CD patch. 
>> Basically we vary src_IP to hit different subtables, and then vary 
>> dst_IP to create various numbers of flows. We use Spirent to generate 
>> src_IP from 1.0.0.0 to 20.0.0.0 depending on the subtable count, and dst_IP from 
>> 0.0.0.0 to a certain value depending on the flow count. It is similar to your traffic pattern with varying UDP port numbers.
>> We use your proposed EMC design for both schemes. Here are the performance ratios we collected:
>>
>> Throughput ratio, CD to DFC (both have 1M entries; CD costs 4 MB while DFC costs 8 MB; THP on):
>> flow cnt \ table cnt      1       3       5       10      20
>> 10k                       1.00    1.00    0.85    1.00    1.00
>> 100k                      0.81    1.15    1.17    1.35    1.55
>> 1M                        0.80    1.12    1.31    1.37    1.63
>
>I assume this is 10k/100k/1M flows in total, independent of the number of subtables, right?
>
[Wang, Yipeng] Yes.

>The degradation of DFC for large flow counts and many subtables comes 
>from the increasing cost of linear DPCLS searches after DFC misses. I 
>wonder how CD can avoid a similar number of misses with the same number of CD entries. Is this just because of the 16 entries per bucket?
>

[Wang, Yipeng]
The way-associative design with two hash functions helps a lot with the miss ratio;
more ways usually resolve conflict misses better. For the throughput difference, another reason is that since DFC does not have a signature, a DFC miss is more expensive because the rule key needs to be accessed (see the lookup sketch below).
We also found that the better memory efficiency helped CD too, but with THP on and no VMs competing for CPU cache, it is a second-order factor and not very critical in this test.
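
To illustrate the cost asymmetry, here is a hedged sketch of the per-bucket lookup, building on the hypothetical dfcd_bucket structure above. A signature mismatch rejects an entry after a 2-byte compare, whereas plain DFC would have to fetch and compare the megaflow key:

/* Sketch only, assumes the dfcd_bucket/dp_netdev_flow declarations from
 * the earlier sketch. */
static inline struct dp_netdev_flow *
dfcd_bucket_lookup(const struct dfcd_bucket *b, uint16_t sig)
{
    for (int i = 0; i < DFCD_ENTRIES_PER_BUCKET; i++) {
        if (b->sig[i] == sig) {
            /* Candidate hit: the caller still verifies the full flow key. */
            return b->flow[i];
        }
    }
    /* Miss in this bucket: try the second hash, then fall back to dpcls. */
    return (struct dp_netdev_flow *) 0;
}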

>>
>> [Wang, Yipeng] We have the code and we can send it to you for testing if 
>> you would like. But since we now think it is better to combine the 
>> benefits of both DFC and CD, it would be better to post a more mature patch on the mailing list later.
>>
>> >We would love to hear your opinion on this, and we think the best 
>> >case is that we could find a way to harmonize both patches and 
>> >refactor the datapath in a way that is both scalable and efficient.
>> >
>> >I would be interested to see your ideas how to combine DFC and CD in 
>> >a good way.
>> >
>> [Wang, Yipeng] We are thinking of using the indirect table to store 
>> either the pointer to the megaflow (like DFC) or the pointer to the 
>> subtable (like CD). The heuristic will depend on the number of active 
>> megaflows and the locality of the accesses. This way, we could keep CD's smaller size and higher hit rate via the indirect table, while avoiding the dpcls subtable access, as DFC does.
>
>Yes, I was thinking about that kind of approach. Can you explain the concept of "indirect table"?
>How does that save memory?
>

[Wang, Yipeng]
We designed the indirect table based on the fact that there are far fewer rules/subtables than flows. It stores the pointers to the rules/subtables, so each CD/DFC entry can store a much shorter index into the indirect table rather than a full 8-byte pointer.

With 1M-entry CD/DFC caches and multiple threads, using an indirect table with smaller CD/DFC entries could save a lot of CPU cache.
As you mentioned, if there are VMs competing for the LLC, this will be critical for performance.
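
To make the idea concrete, here is a rough sketch with made-up names and sizes; it is not the actual design, just the memory layout we have in mind, where the small entry would replace the flow pointer from the earlier sketch:

/* Sketch only: the indirect table holds the full 8-byte pointers, so each
 * cache entry only needs a small index into it plus a signature. */
#include <stdint.h>

#define INDIRECT_TABLE_SIZE 4096    /* rules/subtables are few compared to
                                     * flows, so a small table suffices */

struct indirect_table {
    /* Either a megaflow rule pointer (DFC-style target) or a dpcls
     * subtable pointer (CD-style target), chosen by the heuristic. */
    void *target[INDIRECT_TABLE_SIZE];
};

/* 2-byte signature plus 2-byte index: 4 bytes per entry instead of an
 * 8-byte pointer, so a 1M-entry cache takes roughly 4 MB instead of 8 MB,
 * consistent with the CD vs. DFC sizes quoted above. */
struct dfcd_small_entry {
    uint16_t sig;                   /* hash signature for cheap rejection */
    uint16_t index;                 /* index into struct indirect_table */
};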

Please let me know what you think.

Thanks!
Yipeng

