[ovs-dev] [PATCH] dpif-netdev: Refactor datapath flow cache

Jan Scheurich jan.scheurich at ericsson.com
Fri Jan 5 15:12:51 UTC 2018


Hi Yipeng,

Thank you very much for your comments. Please find my answers inline.

Regards, Jan

> -----Original Message-----
> From: Wang, Yipeng1 [mailto:yipeng1.wang at intel.com]
> Sent: Wednesday, 20 December, 2017 00:09
> To: Jan Scheurich <jan.scheurich at ericsson.com>; dev at openvswitch.org
> Cc: Gobriel, Sameh <sameh.gobriel at intel.com>; Tai, Charlie <charlie.tai at intel.com>
> Subject: RE: [PATCH] dpif-netdev: Refactor datapath flow cache
> 
> Hi, Jan,
> 
> Please see my comments inlined.
> 
> Thanks
> Yipeng
> 
> >-----Original Message-----
> >From: Jan Scheurich [mailto:jan.scheurich at ericsson.com]
> >Sent: Monday, December 18, 2017 9:01 AM
> >To: Wang, Yipeng1 <yipeng1.wang at intel.com>; dev at openvswitch.org
> >Cc: Gobriel, Sameh <sameh.gobriel at intel.com>; Tai, Charlie
> ><charlie.tai at intel.com>
> >Subject: RE: [PATCH] dpif-netdev: Refactor datapath flow cache
> >
> >Hi Yipeng,
> >
> >Thanks a lot for your feedback. Please find some responses below.
> >
> >Regards, Jan
> >
> >
> >From: Wang, Yipeng1 [mailto:yipeng1.wang at intel.com]
> >Sent: Sunday, 17 December, 2017 19:49
> >To: Jan Scheurich <jan.scheurich at ericsson.com>; dev at openvswitch.org
> >Cc: Gobriel, Sameh <sameh.gobriel at intel.com>; Tai, Charlie
> ><charlie.tai at intel.com>
> >Subject: RE: [PATCH] dpif-netdev: Refactor datapath flow cache
> >
> >Hi, Jan
> >
> >We went through the code and did some performance comparisons. We
> >notice that the patch contains two parts of optimization: EMC
> >simplification/resizing, and another layer of cache added before the megaflow
> >cache. The new cache idea goes in the same direction as the cuckoo
> >distributor (CD) proposal we posted back in April
> >(https://mail.openvswitch.org/pipermail/ovs-dev/2017-April/330570.html)
> >and presented at the OvS 2017 conference. Comparing the two patches, we see
> >pros and cons for both CD and DFC, and are currently seeking a way to combine
> >the benefits of both. We are also looking for other ways to further simplify
> >the current datapath, to arrive at a solution that is both scalable and
> >simple. Below are some detailed comments:
> >
> >For the EMC part, we wonder if you enabled transparent huge pages (THP)
> >during the test. For our test case, the new EMC only gives a small speedup
> >when THP is enabled, since with huge pages, reducing the EMC entry size does
> >not help much. Also, reducing the EMC from 2-hash to 1-hash actually hurts
> >certain traffic patterns we tested. I guess the benefit of the optimization
> >will largely depend on the traffic pattern. Another question: it seems that
> >when the EMC lookup calls "netdev_flow_key_equal_mf", the key length is not
> >initialized yet, so the key comparison is not done correctly. Could you
> >please double-check?
> >
> >[Jan] Yes, THP is enabled on my test setup, but I have doubts that it
> >significantly boosts the performance of the DFC/EMC by transparently
> >allocating that memory on a hugepage. Do you have a means to check that on
> >a running OVS?
> 
> [Wang, Yipeng] In my test, I compared the proposed EMC with the current EMC, both with 16k entries.
> With THP turned off, the current EMC causes many TLB misses because of its larger entry size, which I profiled with VTune.
> Once I turned THP on with no other changes, the current EMC's throughput increases a lot and becomes comparable to the newly
> proposed EMC. In VTune, the EMC lookup TLB misses decrease from 100 million to 0 during the 30-second profiling window.
> So if THP is enabled, reducing the EMC entry size may not give much benefit compared to the current EMC.
> It is worth mentioning that both use a similar amount of CPU cache, since only the miniflow struct is accessed by the CPU,
> so the TLB should be the major concern.

I understand your point, but I can't seem to reproduce the effect of THP on my system.
I don't have VTune available, but I guess "perf stat" should also provide TLB miss
statistics.

How can you check if ovs-vswitchd is using transparent huge pages for backing 
e.g. the EMC memory?
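The best I have come up with so far is to sum the AnonHugePages counters in
/proc/<pid>/smaps of the running ovs-vswitchd. A minimal standalone sketch of
such a check (assuming a Linux /proc filesystem; this is just a helper for the
discussion, not part of the patch):

    /* Sum the AnonHugePages counters in /proc/<pid>/smaps to see how much of
     * a process's anonymous memory is currently backed by transparent huge
     * pages. */
    #include <stdio.h>

    int
    main(int argc, char *argv[])
    {
        char path[64], line[256];
        unsigned long kb, total_kb = 0;
        FILE *f;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid of ovs-vswitchd>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof path, "/proc/%s/smaps", argv[1]);
        f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }
        while (fgets(line, sizeof line, f)) {
            /* Only lines of the form "AnonHugePages: <n> kB" match. */
            if (sscanf(line, "AnonHugePages: %lu kB", &kb) == 1) {
                total_kb += kb;
            }
        }
        fclose(f);
        printf("AnonHugePages total: %lu kB\n", total_kb);
        return 0;
    }

If there is a more direct way to attribute THP usage specifically to the EMC
allocation, I would be interested to learn it.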

> 
> >My primary goal when I chose to change the EMC implementation from 2-way
> >to 1-way associativity was to simplify the code. In my tests I have not seen any
> >benefit of having two possible locations for an EMC entry. As far as I can see
> >there is no theoretical reason why we should expect systematic collisions of
> >pairs of flows that would justify such a design. There may well be specific
> >traffic patterns that benefit from 2-way EMC, but then there are obviously
> >others for which 1-way performs better. When in doubt, I believe we should
> >choose the simpler design.
> >
> [Wang, Yipeng] Yes, there are no systematic collisions. However, in general,
> a 1-hash table tends to cause many more misses than a 2-hash table. For code simplicity,
> I agree that 1-hash is simpler and much easier to understand. For performance,
> if the flows fit in the 1-hash table, they would also stay in the primary location of the 2-hash table,
> so the two should have similar lookup speed. For large numbers of flows, in general,
> traffic will see a higher miss ratio with the 1-hash table than with the 2-hash table. In one of our tests with
> 10k flows and 3 subtables (test cases described later), with the EMC sized for 16k entries,
> the 2-hash EMC shows about a 14% miss ratio, while the 1-hash EMC shows a 47% miss ratio.

I agree that a lower EMC hit rate is a concern with just DPCLS or CD+DPCLS as the second stage.
But with DFC the extra cost for an EMC miss is low, as the DFC lookup cost is only slightly
higher than that of the EMC itself. The EMC miss itself is cheap, as it is typically already
detected when comparing the full RSS hash.

Furthermore, the EMC is now mainly meant to speed up the biggest elephant
flows, so it can be smaller, and thrashing is avoided by a very low insertion probability.
Simplistic benchmarks using a large number of "eternal" flows with equidistantly spaced 
packets are really an unrealistic worst case for any cache-based architecture.
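
Just to make concrete what I mean by a cheap EMC miss, here is a simplified
sketch of the 1-way lookup. It reuses the netdev_flow_key / dp_netdev_flow
types and the netdev_flow_key_equal_mf() helper from dpif-netdev.c and leaves
out details such as the entry-alive check; it is not the literal patch code:

    /* Simplified 1-way EMC lookup: a miss is normally detected by the cheap
     * 32-bit hash compare and never reaches the more expensive miniflow key
     * compare. */
    struct emc_entry_sketch {
        struct netdev_flow_key key;    /* Holds the full RSS hash + miniflow. */
        struct dp_netdev_flow *flow;
    };

    static inline struct dp_netdev_flow *
    emc_lookup_sketch(const struct emc_entry_sketch *entries, uint32_t mask,
                      const struct netdev_flow_key *pkt_key)
    {
        const struct emc_entry_sketch *e = &entries[pkt_key->hash & mask];

        if (e->key.hash != pkt_key->hash) {
            return NULL;               /* Cheap miss: one load and one compare. */
        }
        if (!netdev_flow_key_equal_mf(&e->key, &pkt_key->mf)) {
            return NULL;               /* Rare case: full 32-bit hash collision. */
        }
        return e->flow;                /* Hit. */
    }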

> 
> >Regarding "netdev_flow_key_equal_mf", there is no difference to the
> >baseline. The key length is taken from the stored EMC entry, not the packet's
> >flow key.
> >
> [Wang, Yipeng] Sorry that I did not explain it correctly. It seems the key length may not be initialized
> when emc_probabilistic_insert is called. If I am correct, the EMC entry does not contain the right
> key length, and the EMC lookup does not use the correct length. This is because the EMC insert
> now happens during dfc_lookup, but originally it happens after fast_path_processing.
> Do I understand this correctly?

You may be right. I will double-check that and fix it if necessary.
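If it turns out to be the case, the fix I have in mind would be along these
lines (sketch only; pmd, key and flow stand for the local variables at the
insertion point inside dfc_lookup(), and the helpers are the existing ones in
dpif-netdev.c):

    /* Compute key->len before the key is copied into the EMC; the DFC lookup
     * itself never needed it, which is why it can still be uninitialized at
     * this point. */
    key->len = netdev_flow_key_size(miniflow_n_values(&key->mf));
    emc_probabilistic_insert(pmd, key, flow);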

> 
> >For the DFC cache part, we compared it with the CD patch we presented at the
> >OvS conference. We saw that CD begins to perform better than DFC with 2 or
> >more subtables, and gives ~60% higher throughput with 20 subtables, especially
> >when the flow count is large (around 100k or more). We found CD to be a more
> >scalable implementation w.r.t. subtable count and flow count. Part of the
> >reason is that the 1-hash table of DFC does not perform consistently across
> >various traffic patterns and easily causes conflict misses. DFC's advantage
> >shows up when all flows hit DFC or only 1 subtable exists. We are currently
> >thinking about combining both approaches, for example in a hybrid model.
> >
> >[Jan] A DFC hit will always be faster than a CD hit because the latter involves
> >a DPCLS subtable lookup. In my tests the DFC miss rate goes up from ~4% at
> >5000 parallel flows to ~25% at 150K flows, so even for a large number of flows
> >most packets still hit the DFC. The cost of the traditional DPCLS lookup (i.e.
> >the subtable search) must be very high to cause a big degradation.
> >
> [Wang, Yipeng] We agree that a DFC hit performs better than a CD hit, but CD usually has a higher
> hit rate for large numbers of flows, as the data below shows.

That is something I don't yet understand. Is this because CD stores
up to 16 entries per hash bucket and therefore handles collisions better?

> 
> >Can you specify your test setup in more detail? What kind of DFC miss rates do
> >you measure depending on the flow rate? Can you publish your
> >measurement results?
> >
> [Wang, Yipeng] We use the tests/rules we posted with our CD patch. Basically we vary src_IP to hit different subtables,
> and then vary dst_IP to create various numbers of flows. We use Spirent to generate src_IP from
> 1.0.0.0 to 20.0.0.0 depending on the subtable count, and dst_IP from 0.0.0.0 up to a certain value depending on the
> flow count. It is similar to your traffic pattern with varying UDP port numbers.
> We use your proposed EMC design for both schemes. Here are the performance ratios we collected:
> 
> throughput ratio of CD to DFC (both have 1M entries; CD uses 4 MB while DFC uses 8 MB; THP on):
> flow cnt \ subtable cnt    1       3       5       10      20
> 10k                        1.00    1.00    0.85    1.00    1.00
> 100k                       0.81    1.15    1.17    1.35    1.55
> 1M                         0.80    1.12    1.31    1.37    1.63

I assume this is 10k/100k/1M flows in total, independent of the number of subtables, right?

The degradation of DFC for large flow counts and many subtables comes from the increasing
cost of the linear DPCLS search after a DFC miss. I wonder how CD can avoid a similar number of
misses with the same number of CD entries. Is this just because of the 16 entries per bucket?
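
Schematically, the fallback path I am referring to looks like this
(dpcls_subtable_lookup_sketch() is a made-up per-packet stand-in for the
batched lookup that the real dpcls_lookup() performs over the pvector of
subtables; cls and key are the classifier and flow key in scope):

    /* On a DFC miss the subtables are probed one after another until a rule
     * matches, so the average miss cost grows roughly linearly with the
     * number of subtables. */
    struct dpcls_subtable *subtable;
    struct dpcls_rule *rule = NULL;

    PVECTOR_FOR_EACH (subtable, &cls->subtables) {
        rule = dpcls_subtable_lookup_sketch(subtable, key);
        if (rule) {
            break;                      /* Hit: stop probing further subtables. */
        }
    }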

> 
> >Do you have an updated version of your CD patch that works with the
> >membership library in DPDK 17.11 now that OVS master builds against 17.11?
> >
> [Wang, Yipeng] We have the code and can send it to you for testing
> if you would like. But since we now think it is better to combine the benefits of DFC and CD,
> it would be better to post a more mature patch on the mailing list later.
> 
> >We would love to hear your opinion on this. We think the best case is that we
> >could find a way to harmonize both patches and find a scalable as well as
> >efficient way to refactor the datapath.
> >
> >I would be interested to see your ideas on how to combine DFC and CD in a
> >good way.
> >
> [Wang, Yipeng] We are thinking of using the indirect table to store either a pointer to the megaflow (like DFC)
> or a pointer to the subtable (like CD). The heuristic will depend on the number of active megaflows and the locality
> of the accesses. This way, we could keep the smaller size and higher hit rate of CD, thanks to the indirect table,
> while avoiding the dpcls subtable access, like DFC does.

Yes, I was thinking about that kind of approach. Can you explain the concept of "indirect table"?
How does that save memory?

> 
> >In principle I think CD acceleration of the DPCLS lookup could complement DFC, but
> >I am a bit concerned about the combined memory and cache footprint of EMC,
> >DFC and CD. Even for EMC+DFC I have some doubts here. This should be
> >evaluated in multi-core setups of OVS and with real VNFs/applications in the
> >guests that exert significant L3 cache contention.
> >
> [Wang, Yipeng] The cache footprint was our concern too when we worked on CD; that is why we store
> a subtable index instead of a pointer to the megaflow. With the hybrid model mentioned
> above, I think we would get a solution that is both memory efficient and scalable.
> 
> >Also, as all three are based on the same RSS hash as key, isn't there a
> >likelihood of hash collisions hitting all three in the same way? I am thinking
> >about packets that have little/no entropy in the outer headers (e.g. GRE
> >tunnels).
> >
> [Wang, Yipeng] If there is no entropy within the range considered by the RSS, they
> will collide. But this is another case where the 2-hash EMC may help compared to
> 1-hash, right? Keys with the same signature have an alternative location to go to.

The problem with too little or no entropy in the RSS hash input is that many or all flows are
mapped to the same RSS hash value, so a two-way EMC won't help at all. Here only
a better-suited RSS hash algorithm on the NIC can be the solution. Many NICs today
offer possibilities to tailor the RSS hashing to specific needs. But the question is still
how to set this up.

As a workaround, one could consider falling back to a SW hash in OVS-DPDK if the
HW RSS hash is empirically found to be inadequate (extreme collision rate).
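
Something roughly along these lines (sketch only; recompute_sw_hash is a
hypothetical knob, and the helpers are those already used by
dpif_netdev_packet_get_rss_hash() in dpif-netdev.c):

    /* Force the 5-tuple software hash when the HW RSS hash has been found to
     * be degenerate; otherwise keep using the hash delivered by the NIC.
     * mf is the packet's extracted miniflow. */
    uint32_t hash;

    if (dp_packet_rss_valid(packet) && !recompute_sw_hash) {
        hash = dp_packet_get_rss_hash(packet);
    } else {
        hash = miniflow_hash_5tuple(mf, 0);
        dp_packet_set_rss_hash(packet, hash);
    }

The open question is of course how to detect "inadequate" reliably and cheaply.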

Regards, Jan

