[ovs-discuss] NOX performance improvement by a factor 10

Amin Tootoonchian amin at cs.toronto.edu
Wed Dec 15 05:25:01 UTC 2010

[cross-posting to nox-dev, openflow-discuss, ovs-discuss]

I have prepared a patch based on NOX Zaku that improves its
performance by a factor of >10. This implies that a single controller
instance can run a large network with nearly a million flow initiations
per second. I am writing to open up a discussion and get feedback from
the community.

Here are some preliminary results:

- Benchmark configuration:
  * Benchmark: Throughput test of cbench (controller benchmarker) with
64 switches. Cbench is a part of the OFlops package
(http://www.openflowswitch.org/wk/index.php/Oflops). Under throughput
mode, cbench sends a batch of ofp_packet_in messages to the controller
and counts the number of replies it gets back.
  * Benchmarker machine: HP ProLiant DL320 equipped with a 2.13GHz
quad-core Intel Xeon processor (X3210), and 4GB RAM
  * Controller machine: Dell PowerEdge 1950 equipped with two 2.00GHz
quad-core Intel Xeon processors (E5405), and 4GB RAM
  * Connectivity: 1Gbps

- Benchmark results:
  * NOX Zaku: ~60k replies/sec (NOX Zaku only utilizes a single core).
  * Patched NOX: ~650k replies/sec (utilizing only 4 cores out of 8
available cores). The sustained controller->benchmarker throughput is

The patch updates the asynchronous harness of NOX to a standard
library (boost asynchronous I/O library) which simplifies the code
base. It fixes the code in several areas, including but not limited to:

- Multi-threading: The patch enables having any number of worker
threads running on multiple cores.

- Batching: Serving requests individually and sending replies one by
one is quite inefficient. The patch tries to batch requests together
where possible, as well as replies (which reduces the number of system
calls significantly).

- Memory allocation: The standard C++ memory allocator is not robust
in multi-threaded environments. Google's Thread-Caching Malloc
(TCMalloc) or the Hoard memory allocator performs much better for NOX.

- Fully asynchronous operation: The patched version avoids wasting CPU
cycles polling sockets or event/timer dispatchers when not necessary.
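
To illustrate the batching point above, here is a minimal Python sketch
(not code from the patch, which is C++ on top of boost asio). It
compares one send call per reply with a single send of the coalesced
replies over a local socket pair. The function names and the 8-byte
reply size are hypothetical and chosen only for illustration.

```python
import socket


def send_one_by_one(sock, replies):
    """Send each reply with its own call; return the number of send calls."""
    calls = 0
    for reply in replies:
        sock.sendall(reply)  # at least one system call per reply
        calls += 1
    return calls


def send_batched(sock, replies):
    """Coalesce all pending replies into one buffer and send it once."""
    sock.sendall(b"".join(replies))
    return 1


def recv_exactly(sock, n):
    """Read exactly n bytes from a stream socket."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed early")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def demo():
    # 64 small replies (hypothetical 8-byte payloads), small enough to fit
    # in the socket buffer so the sender never blocks.
    replies = [bytes([i]) * 8 for i in range(64)]
    total = sum(len(r) for r in replies)
    a, b = socket.socketpair()
    try:
        calls_individual = send_one_by_one(a, replies)
        got_individual = recv_exactly(b, total)
        calls_batched = send_batched(a, replies)
        got_batched = recv_exactly(b, total)
    finally:
        a.close()
        b.close()
    # The receiver sees the same byte stream either way.
    return calls_individual, calls_batched, got_individual == got_batched
```

The receiver gets an identical byte stream in both cases; only the
number of send calls on the sender side differs (64 versus 1 here),
which is the saving the batching item refers to.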

I would like to add that the patched version should perform much
better than what I reported above (the number reported is with a run
on 4 CPU cores). I guess a single NOX instance running on a machine
with 8 CPU cores should handle well above 1 million flow initiation
requests per second. Also, having a more capable machine should help to
serve more requests! The code will be made available soon and I will
post updates as well.
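
As a rough check on the numbers above, the following Python sketch
works out the speedup and a projection to 8 cores. The projection
assumes perfectly linear scaling with core count, which is my
assumption for illustration and not a measured result; it also ignores
any limit imposed by the 1Gbps link or by the benchmarker machine.

```python
BASELINE_REPLIES_PER_SEC = 60_000   # NOX Zaku, single core (approximate)
PATCHED_REPLIES_PER_SEC = 650_000   # patched NOX on 4 cores (approximate)


def per_core_rate(replies_per_sec, cores):
    """Average replies per second handled by each core."""
    return replies_per_sec / cores


def linear_projection(replies_per_sec, cores_used, cores_target):
    """Project throughput to cores_target, ASSUMING linear scaling."""
    return per_core_rate(replies_per_sec, cores_used) * cores_target


speedup = PATCHED_REPLIES_PER_SEC / BASELINE_REPLIES_PER_SEC
projected_8_cores = linear_projection(PATCHED_REPLIES_PER_SEC, 4, 8)

print(f"speedup: {speedup:.1f}x")
print(f"per-core rate on 4 cores: {per_core_rate(PATCHED_REPLIES_PER_SEC, 4):,.0f}/s")
print(f"linear projection to 8 cores: {projected_8_cores:,.0f}/s")
```

Under that linear-scaling assumption the projection comes to about 1.3
million replies per second, in line with the "well above 1 million"
guess; actual scaling may well be lower.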

