[ovs-dev] [PATCH v5 2/2] datapath: Per NUMA node flow stats.
Jesse Gross
jesse at nicira.com
Fri Feb 7 02:36:39 UTC 2014
On Thu, Feb 6, 2014 at 4:09 PM, Pravin Shelar <pshelar at nicira.com> wrote:
> On Thu, Feb 6, 2014 at 3:13 PM, Jarno Rajahalme <jrajahalme at nicira.com> wrote:
>> Keep kernel flow stats for each NUMA node rather than for each
>> (logical) CPU. This avoids using the per-CPU allocator, removes most
>> of the kernel-side OVS locking overhead that otherwise sits at the
>> top of perf reports, and allows OVS to scale better with a higher
>> number of threads.
>>
>> With 9 handlers and 4 revalidators, the netperf TCP_CRR flow setup
>> rate doubles on a server with two hyper-threaded physical CPUs (16
>> logical cores each) compared to the current OVS master. Tested with a
>> non-trivial flow table containing a TCP port match rule that forces
>> all new connections with unique port numbers to OVS userspace. The IP
>> addresses are still wildcarded, so the kernel flows are not considered
>> exact-match 5-tuple flows. Flows of this type can be expected to
>> appear in large numbers as a result of the more effective wildcarding
>> made possible by improvements in the OVS userspace flow classifier.
>>
>> Perf results for this test (master):
>>
>> Events: 305K cycles
>> + 8.43% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
>> + 5.64% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
>> + 4.75% ovs-vswitchd ovs-vswitchd [.] find_match_wc
>> + 3.32% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
>> + 2.61% ovs-vswitchd [kernel.kallsyms] [k] pcpu_alloc_area
>> + 2.19% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
>> + 2.03% swapper [kernel.kallsyms] [k] intel_idle
>> + 1.84% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
>> + 1.64% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
>> + 1.58% ovs-vswitchd libc-2.15.so [.] 0x7f4e6
>> + 1.07% ovs-vswitchd [kernel.kallsyms] [k] memset
>> + 1.03% netperf [kernel.kallsyms] [k] __ticket_spin_lock
>> + 0.92% swapper [kernel.kallsyms] [k] __ticket_spin_lock
>> ...
>>
>> And after this patch:
>>
>> Events: 356K cycles
>> + 6.85% ovs-vswitchd ovs-vswitchd [.] find_match_wc
>> + 4.63% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
>> + 3.06% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
>> + 2.81% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
>> + 2.51% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
>> + 2.27% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
>> + 1.84% ovs-vswitchd libc-2.15.so [.] 0x15d30f
>> + 1.74% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
>> + 1.47% swapper [kernel.kallsyms] [k] intel_idle
>> + 1.34% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask
>> + 1.33% ovs-vswitchd ovs-vswitchd [.] rule_actions_unref
>> + 1.16% ovs-vswitchd ovs-vswitchd [.] hindex_node_with_hash
>> + 1.16% ovs-vswitchd ovs-vswitchd [.] do_xlate_actions
>> + 1.09% ovs-vswitchd ovs-vswitchd [.] ofproto_rule_ref
>> + 1.01% netperf [kernel.kallsyms] [k] __ticket_spin_lock
>> ...
>>
>> There is a small increase in kernel spinlock overhead due to the same
>> spinlock now being shared between multiple cores of the same physical
>> CPU, but this is barely visible in netperf TCP_CRR performance when
>> testing kernel module throughput (no userspace activity, only a
>> handful of kernel flows): maybe a ~1% drop, though it is hard to tell
>> exactly due to variance in the test results.
>>
>> On flow setup, a single stats instance is allocated (for NUMA node
>> 0). As CPUs from multiple NUMA nodes start updating stats, new
>> NUMA-node-specific stats instances are allocated. This allocation on
>> the packet processing code path never sleeps and never dips into
>> emergency memory pools, minimizing the allocation latency. If the
>> allocation fails, the existing preallocated stats instance is used
>> instead. Also, if only CPUs from one NUMA node are updating the
>> preallocated stats instance, no additional stats instances are
>> allocated. This eliminates the need to preallocate stats instances
>> that will not be used, and also relieves the stats reader from the
>> burden of reading stats that are never updated. Finally, this
>> allocation strategy allows the removal of the existing exact-5-tuple
>> heuristics.
>>
>> Signed-off-by: Jarno Rajahalme <jrajahalme at nicira.com>
> Looks good.
>
> Acked-by: Pravin B Shelar <pshelar at nicira.com>
Jarno, would you mind giving me a chance to look at this again before
you apply it? I'll try to do that tomorrow.