[ovs-dev] [PATCH v5 2/2] datapath: Per NUMA node flow stats.

Jesse Gross jesse at nicira.com
Fri Feb 7 02:36:39 UTC 2014


On Thu, Feb 6, 2014 at 4:09 PM, Pravin Shelar <pshelar at nicira.com> wrote:
> On Thu, Feb 6, 2014 at 3:13 PM, Jarno Rajahalme <jrajahalme at nicira.com> wrote:
>>     Keep kernel flow stats for each NUMA node rather than for each
>>     (logical) CPU.  This avoids using the per-CPU allocator, removes
>>     most of the kernel-side OVS locking overhead that otherwise shows
>>     up at the top of perf reports, and allows OVS to scale better with
>>     a higher number of threads.
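
For readers following along, a minimal sketch of the per-node layout
being described (kernel-style C; struct and field names here are
illustrative, not quoted from the patch):

    struct flow_stats {
            u64 packet_count;       /* Packets matched by this flow. */
            u64 byte_count;         /* Bytes matched by this flow. */
            unsigned long used;     /* Last used time (in jiffies). */
            spinlock_t lock;        /* Shared by all cores of one node. */
            __be16 tcp_flags;       /* Union of TCP flags seen. */
            int last_writer;        /* NUMA id of the last writing node,
                                     * NUMA_NO_NODE until first use; lets
                                     * the update path detect when a
                                     * second node starts writing. */
    };

    struct sw_flow {
            /* ... key, mask, actions, ... */
            struct flow_stats __rcu *stats[];  /* One slot per NUMA node;
                                                * only stats[0] is
                                                * preallocated, the rest
                                                * start out NULL. */
    };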
>>
>>     With 9 handlers and 4 revalidators, the netperf TCP_CRR flow setup
>>     rate doubles on a server with two hyper-threaded physical CPUs (16
>>     logical cores each) compared to the current OVS master.  Tested
>>     with a non-trivial flow table containing a TCP port match rule that
>>     forces all new connections with unique port numbers to OVS
>>     userspace.  The IP addresses are still wildcarded, so the kernel
>>     flows are not considered exact-match 5-tuple flows.  Flows of this
>>     type can be expected to appear in large numbers as a result of the
>>     more effective wildcarding made possible by improvements in the OVS
>>     userspace flow classifier.
>>
>>     Perf results for this test (master):
>>
>>     Events: 305K cycles
>>     +   8.43%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>>     +   5.64%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
>>     +   4.75%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
>>     +   3.32%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
>>     +   2.61%     ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
>>     +   2.19%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
>>     +   2.03%          swapper  [kernel.kallsyms]   [k] intel_idle
>>     +   1.84%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
>>     +   1.64%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
>>     +   1.58%     ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
>>     +   1.07%     ovs-vswitchd  [kernel.kallsyms]   [k] memset
>>     +   1.03%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
>>     +   0.92%          swapper  [kernel.kallsyms]   [k] __ticket_spin_lock
>>     ...
>>
>>     And after this patch:
>>
>>     Events: 356K cycles
>>     +   6.85%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
>>     +   4.63%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
>>     +   3.06%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
>>     +   2.81%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
>>     +   2.51%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
>>     +   2.27%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
>>     +   1.84%     ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
>>     +   1.74%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>>     +   1.47%          swapper  [kernel.kallsyms]   [k] intel_idle
>>     +   1.34%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
>>     +   1.33%     ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
>>     +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
>>     +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
>>     +   1.09%     ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
>>     +   1.01%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
>>     ...
>>
>>     There is a small increase in kernel spinlock overhead due to the
>>     same spinlock being shared between multiple cores of the same
>>     physical CPU, but it is barely visible in netperf TCP_CRR test
>>     performance (maybe a ~1% drop, hard to tell exactly due to variance
>>     in the test results) when testing kernel module throughput alone
>>     (no userspace activity, only a handful of kernel flows).
>>
>>     On flow setup, a single stats instance is allocated (for NUMA node
>>     0).  As CPUs from multiple NUMA nodes start updating stats, new
>>     NUMA-node-specific stats instances are allocated.  This allocation
>>     on the packet processing code path is made to never sleep or look
>>     for emergency memory pools, minimizing the allocation latency.  If
>>     the allocation fails, the existing preallocated stats instance is
>>     used.  Also, if CPUs from only one NUMA node are updating the
>>     preallocated stats instance, no additional stats instances are
>>     allocated.  This eliminates the need to preallocate stats instances
>>     that will not be used, and also relieves the stats reader of the
>>     burden of reading stats that are never updated.  Finally, this
>>     allocation strategy allows the removal of the existing
>>     exact-5-tuple heuristics.
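
To make the allocation strategy concrete, a hedged sketch of what the
update path could look like (simplified: the race where two CPUs of the
same node allocate concurrently is glossed over, and flow_stats_cache is
an assumed kmem cache; this is an illustration, not the patch itself):

    /* Caller holds rcu_read_lock(), as on the packet processing path. */
    static void flow_stats_update(struct sw_flow *flow, unsigned int len,
                                  __be16 tcp_flags)
    {
            int node = numa_node_id();
            struct flow_stats *stats = rcu_dereference(flow->stats[node]);

            if (unlikely(!stats)) {
                    /* No node-local instance yet: fall back to the
                     * preallocated node-0 instance. */
                    stats = rcu_dereference(flow->stats[0]);
                    spin_lock(&stats->lock);

                    /* Allocate a node-local instance only once a second
                     * node starts writing the shared one.  GFP_THISNODE |
                     * __GFP_NOMEMALLOC never sleeps and never touches
                     * emergency pools, keeping allocation latency low. */
                    if (stats->last_writer != node &&
                        stats->last_writer != NUMA_NO_NODE) {
                            struct flow_stats *new_stats;

                            new_stats = kmem_cache_alloc_node(
                                            flow_stats_cache,
                                            GFP_THISNODE | __GFP_NOMEMALLOC,
                                            node);
                            if (new_stats) {
                                    spin_lock_init(&new_stats->lock);
                                    new_stats->used = jiffies;
                                    new_stats->packet_count = 1;
                                    new_stats->byte_count = len;
                                    new_stats->tcp_flags = tcp_flags;
                                    new_stats->last_writer = node;
                                    rcu_assign_pointer(flow->stats[node],
                                                       new_stats);
                                    spin_unlock(&stats->lock);
                                    return;
                            }
                            /* Allocation failed: keep using the
                             * preallocated instance. */
                    }
                    stats->last_writer = node;
            } else {
                    spin_lock(&stats->lock);
            }

            stats->used = jiffies;
            stats->packet_count++;
            stats->byte_count += len;
            stats->tcp_flags |= tcp_flags;
            spin_unlock(&stats->lock);
    }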
>>
>>     Signed-off-by: Jarno Rajahalme <jrajahalme at nicira.com>
> Looks good.
>
> Acked-by: Pravin B Shelar <pshelar at nicira.com>

Jarno, would you mind giving me a chance to look at this again before
you apply it? I'll try to do that tomorrow.


