[ovs-dev] [PATCH v5 2/2] datapath: Per NUMA node flow stats.

Jarno Rajahalme jrajahalme at nicira.com
Fri Feb 7 05:23:45 UTC 2014


> On Feb 6, 2014, at 6:36 PM, Jesse Gross <jesse at nicira.com> wrote:
> 
>> On Thu, Feb 6, 2014 at 4:09 PM, Pravin Shelar <pshelar at nicira.com> wrote:
>>> On Thu, Feb 6, 2014 at 3:13 PM, Jarno Rajahalme <jrajahalme at nicira.com> wrote:
>>>    Keep kernel flow stats for each NUMA node rather than for each
>>>    (logical) CPU.  This avoids using the per-CPU allocator, removes
>>>    most of the kernel-side OVS locking overhead that otherwise tops
>>>    perf reports, and allows OVS to scale better with a higher number
>>>    of threads.
>>> 
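>>>    To illustrate, a minimal sketch of the data layout (struct and
>>>    field names here are illustrative, not necessarily those in the
>>>    patch): one stats instance per NUMA node, each with its own
>>>    spinlock shared by all cores of that node:
>>> 
>>>        struct flow_stats {
>>>            u64 packet_count;     /* Packets matched. */
>>>            u64 byte_count;       /* Bytes matched. */
>>>            unsigned long used;   /* Last used time (jiffies). */
>>>            spinlock_t lock;      /* Shared by cores of one node. */
>>>            __be16 tcp_flags;     /* Union of seen TCP flags. */
>>>        };
>>> 
>>>        struct sw_flow {
>>>            /* ... */
>>>            /* One pointer per possible NUMA node; only stats[0] is
>>>             * preallocated at flow setup. */
>>>            struct flow_stats __rcu *stats[];
>>>        };
>>> 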
>>>    With 9 handlers and 4 revalidators, the netperf TCP_CRR flow setup
>>>    rate doubles on a server with two hyper-threaded physical CPUs (16
>>>    logical cores each) compared to the current OVS master.  Tested
>>>    with a non-trivial flow table containing a TCP port match rule that
>>>    forces all new connections with unique port numbers to OVS
>>>    userspace.  The IP addresses are still wildcarded, so the kernel
>>>    flows are not exact-match 5-tuple flows.  Flows of this type can be
>>>    expected to appear in large numbers as a result of the more
>>>    effective wildcarding made possible by improvements in the OVS
>>>    userspace flow classifier.
>>> 
>>>    Perf results for this test (master):
>>> 
>>>    Events: 305K cycles
>>>    +   8.43%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>>>    +   5.64%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
>>>    +   4.75%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
>>>    +   3.32%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
>>>    +   2.61%     ovs-vswitchd  [kernel.kallsyms]   [k] pcpu_alloc_area
>>>    +   2.19%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
>>>    +   2.03%          swapper  [kernel.kallsyms]   [k] intel_idle
>>>    +   1.84%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
>>>    +   1.64%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
>>>    +   1.58%     ovs-vswitchd  libc-2.15.so        [.] 0x7f4e6
>>>    +   1.07%     ovs-vswitchd  [kernel.kallsyms]   [k] memset
>>>    +   1.03%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
>>>    +   0.92%          swapper  [kernel.kallsyms]   [k] __ticket_spin_lock
>>>    ...
>>> 
>>>    And after this patch:
>>> 
>>>    Events: 356K cycles
>>>    +   6.85%     ovs-vswitchd  ovs-vswitchd        [.] find_match_wc
>>>    +   4.63%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_lock
>>>    +   3.06%     ovs-vswitchd  [kernel.kallsyms]   [k] __ticket_spin_lock
>>>    +   2.81%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask_range
>>>    +   2.51%     ovs-vswitchd  libpthread-2.15.so  [.] pthread_mutex_unlock
>>>    +   2.27%     ovs-vswitchd  ovs-vswitchd        [.] classifier_lookup
>>>    +   1.84%     ovs-vswitchd  libc-2.15.so        [.] 0x15d30f
>>>    +   1.74%     ovs-vswitchd  [kernel.kallsyms]   [k] mutex_spin_on_owner
>>>    +   1.47%          swapper  [kernel.kallsyms]   [k] intel_idle
>>>    +   1.34%     ovs-vswitchd  ovs-vswitchd        [.] flow_hash_in_minimask
>>>    +   1.33%     ovs-vswitchd  ovs-vswitchd        [.] rule_actions_unref
>>>    +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] hindex_node_with_hash
>>>    +   1.16%     ovs-vswitchd  ovs-vswitchd        [.] do_xlate_actions
>>>    +   1.09%     ovs-vswitchd  ovs-vswitchd        [.] ofproto_rule_ref
>>>    +   1.01%          netperf  [kernel.kallsyms]   [k] __ticket_spin_lock
>>>    ...
>>> 
>>>    There is a small increase in kernel spinlock overhead because the
>>>    same spinlock is now shared between multiple cores of the same
>>>    physical CPU, but it is barely visible in netperf TCP_CRR
>>>    performance when testing kernel module throughput alone (no
>>>    userspace activity, a handful of kernel flows): maybe a ~1%
>>>    performance drop, though the exact figure is hard to tell due to
>>>    variance in the test results.
>>> 
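>>>    The common-case update path looks roughly like this (a sketch
>>>    assuming the node-local stats instance already exists; function
>>>    and field names are illustrative):
>>> 
>>>        void flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
>>>        {
>>>            struct flow_stats *stats;
>>>            int node = numa_node_id();
>>> 
>>>            stats = rcu_dereference(flow->stats[node]);
>>>            if (likely(stats)) {
>>>                /* All cores of this NUMA node contend on one lock. */
>>>                spin_lock(&stats->lock);
>>>                stats->used = jiffies;
>>>                stats->packet_count++;
>>>                stats->byte_count += skb->len;
>>>                spin_unlock(&stats->lock);
>>>                return;
>>>            }
>>>            /* No node-local instance yet: fall back to the
>>>             * preallocated stats[0], possibly allocating a
>>>             * node-local instance as sketched below. */
>>>        }
>>> 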
>>>    On flow setup, a single stats instance is allocated (for NUMA node
>>>    0).  As CPUs from other NUMA nodes start updating stats, new
>>>    NUMA-node-specific stats instances are allocated.  This allocation,
>>>    which happens on the packet processing code path, never sleeps and
>>>    never draws on emergency memory pools, minimizing allocation
>>>    latency.  If an allocation fails, the existing preallocated stats
>>>    instance is used instead.  Also, as long as only CPUs from one NUMA
>>>    node are updating the preallocated stats instance, no additional
>>>    stats instances are allocated.  This eliminates the need to
>>>    preallocate stats instances that will never be used and relieves
>>>    the stats reader from reading stats that are never updated.
>>>    Finally, this allocation strategy allows the removal of the
>>>    existing exact-5-tuple heuristics.
>>> 
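>>>    A sketch of that allocation slow path (the kmem cache name and
>>>    exact GFP flags are assumptions based on the description above):
>>> 
>>>        /* Called when a CPU on 'node' finds no node-local stats
>>>         * instance, with the preallocated stats[0]->lock held. */
>>>        static void flow_stats_alloc_node(struct sw_flow *flow, int node)
>>>        {
>>>            struct flow_stats *new_stats;
>>> 
>>>            /* Never sleep, never dip into emergency reserves. */
>>>            new_stats = kmem_cache_alloc_node(flow_stats_cache,
>>>                                              GFP_THISNODE |
>>>                                              __GFP_NOMEMALLOC, node);
>>>            if (!new_stats)
>>>                return;    /* Keep using the preallocated stats[0]. */
>>> 
>>>            spin_lock_init(&new_stats->lock);
>>>            new_stats->used = jiffies;
>>>            new_stats->packet_count = 0;
>>>            new_stats->byte_count = 0;
>>>            new_stats->tcp_flags = 0;
>>>            rcu_assign_pointer(flow->stats[node], new_stats);
>>>        }
>>> 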
>>>    Signed-off-by: Jarno Rajahalme <jrajahalme at nicira.com>
>> Looks good.
>> 
>> Acked-by: Pravin B Shelar <pshelar at nicira.com>
> 
> Jarno, would you mind giving me a chance to look at this again before
> you apply it? I'll try to do that tomorrow.

Sure :-)

  Jarno

