[ovs-dev] Hitless resynchronisation of forwarding state

Jarno Rajahalme jarno at ovn.org
Wed Jun 29 14:35:20 UTC 2016


> On Jun 28, 2016, at 9:17 AM, Jan Scheurich <jan.scheurich at ericsson.com> wrote:
> 
> Hi,
> 
> We would like to resume our earlier discussion about how to support a simple, generic and efficient procedure for controllers to resync all OF forwarding state with OVS after a reconnect while maintaining non-stop forwarding (see http://openvswitch.org/pipermail/dev/2016-January/064925.html and following).
> 
> To briefly recap the earlier discussion, we have two main approaches:
> 
> A) A new OF experimenter procedure to resync state in three steps:
> 1. Controller marks the current state in OVS as stale
> 2. Controller downloads/refreshes the latest state
> 3. Controller tells switch to cleanup all remaining stale state
> The proposed procedure is described in more detail in 
> https://docs.google.com/document/d/1JBwARjUKDH_r9LK_Zg92WjquAxHrOLcqze1W60rV3j4
> This procedure has been implemented and used between Ericsson's controller and OF switches for some years. A patch for OVS 2.5 is available and could be rebased to master.
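[For readers following along: the three steps above can be sketched against a mock flow table. This is purely illustrative Python, not the actual experimenter extension or any OVS API; all names here are hypothetical. The point it demonstrates is that the refresh happens in place, so only one copy of the table ever exists.]

```python
# Hypothetical sketch of the proposed three-step resync (approach A) against a
# mock flow table. Not OVS code; the real mechanism is an OF experimenter
# extension. The table is refreshed in place, so memory stays flat.

class MockSwitch:
    def __init__(self):
        # flow match -> (actions, stale flag)
        self.flows = {}

    def mark_all_stale(self):
        """Step 1: controller marks all current state in the switch as stale."""
        for match, (actions, _) in self.flows.items():
            self.flows[match] = (actions, True)

    def add_or_refresh(self, match, actions):
        """Step 2: each downloaded/refreshed flow clears its stale flag in place."""
        self.flows[match] = (actions, False)

    def cleanup_stale(self):
        """Step 3: controller tells the switch to remove whatever was not refreshed."""
        removed = [m for m, (_, stale) in self.flows.items() if stale]
        for m in removed:
            del self.flows[m]
        return removed

sw = MockSwitch()
sw.flows = {"ip,nw_dst=10.0.0.1": (["output:1"], False),
            "ip,nw_dst=10.0.0.2": (["output:2"], False)}

sw.mark_all_stale()
sw.add_or_refresh("ip,nw_dst=10.0.0.1", ["output:1"])  # still wanted after reconnect
sw.add_or_refresh("ip,nw_dst=10.0.0.3", ["output:3"])  # new flow
removed = sw.cleanup_stale()
print(sorted(sw.flows))  # refreshed entry plus the new one remain
print(removed)           # the un-refreshed entry is cleaned up
```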
> 
> B) Use the OF1.4 bundle mechanism as follows:
> 1. Controller opens a bundle for resync
> 2. Clear all flows, groups and meters in the bundle
> 3. Download latest state within the bundle
> 4. Commit the bundle to atomically swap the new state into the data path
> The OF 1.4 bundle was implemented in OVS 2.5 but only for flows. Support for the bundle extension to OF 1.3 was added on master later. Groups and meters are not supported yet.
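[The contrast with approach A can be sketched the same way: in the bundle case all messages are buffered until commit, at which point the new state replaces the old atomically, so the switch briefly holds both versions. Again a hypothetical Python mock, not OVS internals.]

```python
# Minimal sketch of the bundle-based resync (approach B) against a mock flow
# table. Illustrative only. Messages accumulate while the bundle is open and
# are applied as one atomic swap at commit, so old and new state coexist.

class MockBundleSwitch:
    def __init__(self):
        self.flows = {}     # active table: match -> actions
        self.bundle = None  # buffered messages while a bundle is open

    def open_bundle(self):
        self.bundle = []    # step 1: open a bundle for resync

    def bundle_msg(self, msg):
        self.bundle.append(msg)  # steps 2-3: clear + download, all buffered

    def commit(self):
        """Step 4: apply all buffered messages atomically."""
        staged = dict(self.flows)
        for kind, payload in self.bundle:
            if kind == "delete-all":
                staged.clear()
            elif kind == "add":
                match, actions = payload
                staged[match] = actions
        peak_entries = len(self.flows) + len(staged)  # both versions held here
        self.flows = staged
        self.bundle = None
        return peak_entries

sw = MockBundleSwitch()
sw.flows = {"ip,nw_dst=10.0.0.1": ["output:1"],
            "ip,nw_dst=10.0.0.2": ["output:2"]}
sw.open_bundle()
sw.bundle_msg(("delete-all", None))
sw.bundle_msg(("add", ("ip,nw_dst=10.0.0.1", ["output:1"])))
sw.bundle_msg(("add", ("ip,nw_dst=10.0.0.3", ["output:3"])))
peak = sw.commit()
print(sorted(sw.flows))  # same final state as the in-place resync
print(peak)              # old 2 entries + new 2 entries held at once
```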
> 
> While we agree in principle that the bundle mechanism (with added support for groups and meters) would be a possible approach to the resync problem, our concern is that it was actually designed for a different use case, namely atomic incremental updates to the OF pipeline, and that the characteristics of the two approaches are very different in the resync scenario when a large volume of OpenFlow state is involved.
> 
> To analyze and quantify the difference in characteristics, we have done some benchmarking comparing the two approaches. Due to limitations of the current bundle implementation we had to limit the tests to flow entries. All tests were run on a VM with 6 cores and 3 GB RAM, without traffic. The tests were run using scripts that execute ovs-ofctl to add flows from a file.
> 
> With the proposed hitless resync procedure we were able to resync 1 million flow entries without any increase in memory usage. Using the bundle procedure, the VM ran out of memory for both 1M and 500K flow entries; only at 250K flow entries were we able to obtain comparable measurements. At 250K flow entries the ovs-vswitchd process occupies 455 MB of virtual memory.
> 
> Measurements for resyncing 250K flow entries:
> Metric                              Resync - OF1.3    Bundle - OF1.4
> Flow update time                    ~40 sec           ~7 sec
> Flow update rate                    ~6.25K/s          ~35K/s
> ovs-vswitchd CPU usage              ~140%             ~100%
> ovs-vswitchd virtual memory peak    457 MB            1905 MB
> 
> Refreshing the 250K flow entries using the proposed resync procedure requires 40 seconds at ~140% CPU usage with stable memory at 457 MB. The download rate is ~6250 flows/s. The scan for stale flow entries at the end of the resync procedure takes the vswitchd process around 200 ms.
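[As a quick cross-check, the rates in the table follow directly from the reported times for 250K flows:]

```python
# Recomputing the derived rates from the reported measurements
# (250K flows; ~40 s in-place resync, ~7 s bundle, of which ~5 s download).
flows = 250_000

resync_rate = flows / 40    # matches the ~6.25K flows/s figure
bundle_rate = flows / 7     # the table's ~35K flows/s overall update rate
download_rate = flows / 5   # bundle download alone: ~50K flows/s

print(round(resync_rate))    # 6250
print(round(bundle_rate))    # 35714
print(round(download_rate))  # 50000
```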
> 
> Refreshing the 250K flow entries using the bundle mechanism increases the vswitchd memory linearly up to 1.9 GB, significantly more than the 910 MB one would expect for accommodating two versions of each rule at the moment of the atomic activation.

The reason for the higher memory use in the bundle case is that the bundle message storage has not been optimized for size, and that the processing itself has not been optimized for memory consumption either. Bundled messages could use the same compressed match format that the rules in the tables use, and the messages could be freed much earlier in the process. Together these two strategies should bring the memory use much closer to the "accommodating the two versions" case. I started on this at one point, but did not have a benchmark to work with, so the work was not finished at the time.
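[The "free the messages earlier" part of the argument can be illustrated with a toy Python measurement, assuming nothing about OVS internals: if each buffered message is converted to its compact staged form and released as it is processed, the peak no longer includes the whole message buffer.]

```python
# Toy comparison: hold every "message" until commit vs. stage and free each
# one immediately. The padded tuples mimic verbose wire-format flow_mods;
# tracemalloc reports the peak allocation of each strategy. Illustrative only.
import tracemalloc

def make_msgs(n):
    # generator of (kind, match, padding) "messages"
    return (("add", "match-%d" % i, "x" * 200) for i in range(n))

def stage_keep_all(n):
    msgs = list(make_msgs(n))            # whole bundle held until commit
    return {m[1]: m[0] for m in msgs}    # then converted to a compact table

def stage_free_early(n):
    staged = {}
    for kind, match, _pad in make_msgs(n):  # each message dropped after staging
        staged[match] = kind
    return staged

for fn in (stage_keep_all, stage_free_early):
    tracemalloc.start()
    fn(20_000)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(fn.__name__, peak // 1024, "KiB")  # free-early peaks much lower
```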

> 
> Somewhat to our surprise the download and activation of the 250K bundled flow entries takes only 7 seconds at 100% CPU load, much faster than the non-bundled download. Instrumenting the code with some additional log entries showed that the download of the bundle takes about 5 seconds, while the activation consumes the remaining 2 seconds. The bundled download rate is ~50K flows/s.
> 
> It appears that installing 250K flow entries individually in ofproto_dpif carries a significant processing overhead compared to the atomic activation of the same 250K entries in a bundle. What is the reason for this? Can this be improved by batching these updates internally?
> 
> Conclusion:
> In their current form the two approaches indeed exhibit radically different characteristics. The bundle mechanism is more than 5 times faster, but it temporarily occupies 4 times the memory. Given that in many cases the delta between the actual and desired flow state in OVS is small after a reconnect, we believe that the speed of the cleanup may not be crucial, and that the ability to do it in place, without requiring a lot of extra memory (reserved huge pages in the case of a DPDK datapath?), speaks in favor of the proposed resync procedure.
> 
> We would therefore like to ask the OVS community to reassess the proposed experimenter resync procedure in the light of the presented empirical data.
> 

Given that the only downside of the bundle mechanism seems to be the memory cost, and on other dimensions the bundle mechanism is actually better, would you be willing to reconsider your proposal if the bundle memory consumption was significantly improved?


  Jarno

> Regards, Jan
> 
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev



