[ovs-discuss] NOX performance improvement by a factor 10
amin at cs.toronto.edu
Wed Dec 15 05:25:01 UTC 2010
[cross-posting to nox-dev, openflow-discuss, ovs-discuss]
I have prepared a patch based on NOX Zaku that improves its
performance by a factor of >10. This implies that a single controller
instance can run a large network with nearly a million flow initiations
per second. I am writing to open up a discussion and get feedback from
Here are some preliminary results:
- Benchmark configuration:
* Benchmark: Throughput test of cbench (controller benchmarker) with
64 switches. Cbench is a part of the OFlops package
(http://www.openflowswitch.org/wk/index.php/Oflops). Under throughput
mode, cbench sends a batch of ofp_packet_in messages to the controller
and counts the number of replies it gets back.
* Benchmarker machine: HP ProLiant DL320 equipped with a 2.13GHz
quad-core Intel Xeon processor (X3210), and 4GB RAM
* Controller machine: Dell PowerEdge 1950 equipped with two 2.00GHz
quad-core Intel Xeon processors (E5405), and 4GB RAM
* Connectivity: 1Gbps
- Benchmark results:
* NOX Zaku: ~60k replies/sec (NOX Zaku only utilizes a single core).
* Patched NOX: ~650k replies/sec (utilizing only 4 cores out of 8
available cores). The sustained controller->benchmarker throughput is
The patch moves the asynchronous harness of NOX onto a standard
library (Boost.Asio, the Boost asynchronous I/O library), which
simplifies the code base. It improves the code in several areas,
including but not limited to:
- Multi-threading: The patch enables having any number of worker
threads running on multiple cores.
- Batching: Serving requests individually and sending replies one by
one is quite inefficient. The patch tries to batch requests together
where possible, as well as replies (which reduces the number of system
calls).
- Memory allocation: The standard C++ memory allocator is not robust
in multi-threaded environments. Google's Thread-Caching Malloc
(TCMalloc) or the Hoard memory allocator performs much better for NOX.
- Fully asynchronous operation: The patched version avoids wasting CPU
cycles on polling sockets or event/timer dispatchers when that is not
necessary.
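To illustrate the batching item above, here is a minimal sketch (not
the code from the patch). It assumes a stand-in socket object that only
counts send calls; CountingSocket, reply_one_by_one, reply_batched and
the 64KB batch limit are all hypothetical names and values chosen for
the illustration.

```python
class CountingSocket:
    """Hypothetical stand-in for a TCP socket; counts send calls."""

    def __init__(self):
        self.send_calls = 0
        self.bytes_sent = 0

    def send(self, data: bytes) -> int:
        # Each call here stands for one system call on a real socket.
        self.send_calls += 1
        self.bytes_sent += len(data)
        return len(data)


def reply_one_by_one(sock, replies):
    """Unbatched path: one send call per reply."""
    for reply in replies:
        sock.send(reply)


def reply_batched(sock, replies, max_batch_bytes=64 * 1024):
    """Batched path: coalesce replies into a buffer, flush when full."""
    buf = bytearray()
    for reply in replies:
        # Flush first if adding this reply would exceed the limit.
        if buf and len(buf) + len(reply) > max_batch_bytes:
            sock.send(bytes(buf))
            buf.clear()
        buf += reply
    if buf:
        sock.send(bytes(buf))  # flush whatever is left


# 1000 replies of 80 bytes each (the size is an arbitrary assumption).
replies = [b"\x01" * 80 for _ in range(1000)]

unbatched = CountingSocket()
reply_one_by_one(unbatched, replies)

batched = CountingSocket()
reply_batched(batched, replies)

print(unbatched.send_calls, batched.send_calls)
```

Both paths deliver the same bytes; only the number of send calls
differs, which is the cost the batching is meant to reduce.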
I would like to add that the patched version should be able to perform
much better than what I reported above (the number reported is from a
run on 4 CPU cores). I expect that a single NOX instance running on a
machine with 8 CPU cores should handle well above 1 million flow
initiation requests per second. A more capable machine should also help
to serve more requests! The code will be made available soon and I will
post updates as well.
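As a rough check of the figures above, the arithmetic can be sketched
as follows. The rates are the approximate numbers reported in the
benchmark results; the 8-core figure assumes perfectly linear scaling
with core count, which is an assumption for illustration and not a
measured result. All variable names are hypothetical.

```python
# Approximate figures from the benchmark results above.
zaku_rate = 60_000        # replies/sec, NOX Zaku on a single core
patched_rate = 650_000    # replies/sec, patched NOX on 4 cores
cores_used = 4
cores_available = 8

# Overall speedup of the patched version over NOX Zaku.
speedup = patched_rate / zaku_rate

# Average per-core throughput of the patched version.
per_core_rate = patched_rate / cores_used

# ASSUMPTION: throughput scales linearly with the number of cores.
# Real scaling may be lower (contention, network link, benchmarker).
linear_8_core_rate = per_core_rate * cores_available

print(f"speedup: {speedup:.1f}x")
print(f"per-core rate: {per_core_rate:,.0f} replies/sec")
print(f"8 cores, linear scaling: {linear_8_core_rate:,.0f} replies/sec")
```

Under that linear-scaling assumption, the estimate is consistent with
the expectation of more than 1 million requests per second on 8 cores.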