[ovs-dev] [PATCH v7] datapath: Per NUMA node flow stats.
Pravin Shelar
pshelar at nicira.com
Sat Feb 15 00:15:54 UTC 2014
I got the following sparse warning:
/home/pravin/ovs/w6/datapath/linux/flow_table.c:97:9: warning: incorrect type in argument 1 (different address spaces)
/home/pravin/ovs/w6/datapath/linux/flow_table.c:97:9:    expected struct spinlock [usertype] *lock
/home/pravin/ovs/w6/datapath/linux/flow_table.c:97:9:    got struct spinlock [noderef] <asn:4>*<noident>
Otherwise looks good.
Acked-by: Pravin B Shelar <pshelar at nicira.com>
Thanks.
On Thu, Feb 13, 2014 at 10:51 AM, Jarno Rajahalme <jrajahalme at nicira.com> wrote:
> Keep kernel flow stats for each NUMA node rather than each (logical)
> CPU. This avoids using the per-CPU allocator, removes most of the
> kernel-side OVS locking overhead that otherwise sits at the top of
> perf reports, and allows OVS to scale better with a higher number of
> threads.
>
> With 9 handlers and 4 revalidators the netperf TCP_CRR flow setup
> rate doubles on a server with two hyper-threaded physical CPUs (16
> logical cores each) compared to the current OVS master. Tested with a
> non-trivial flow table with a TCP port match rule forcing all new
> connections with unique port numbers to OVS userspace. The IP
> addresses are still wildcarded, so the kernel flows are not considered
> exact-match 5-tuple flows. Flows of this type can be expected to
> appear in large numbers as a result of the more effective wildcarding
> made possible by improvements in the OVS userspace flow classifier.
>
> Perf results for this test (master):
>
> Events: 305K cycles
> + 8.43% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
> + 5.64% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
> + 4.75% ovs-vswitchd ovs-vswitchd [.] find_match_wc
> + 3.32% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
> + 2.61% ovs-vswitchd [kernel.kallsyms] [k] pcpu_alloc_area
> + 2.19% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
> + 2.03% swapper [kernel.kallsyms] [k] intel_idle
> + 1.84% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
> + 1.64% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
> + 1.58% ovs-vswitchd libc-2.15.so [.] 0x7f4e6
> + 1.07% ovs-vswitchd [kernel.kallsyms] [k] memset
> + 1.03% netperf [kernel.kallsyms] [k] __ticket_spin_lock
> + 0.92% swapper [kernel.kallsyms] [k] __ticket_spin_lock
> ...
>
> And after this patch:
>
> Events: 356K cycles
> + 6.85% ovs-vswitchd ovs-vswitchd [.] find_match_wc
> + 4.63% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
> + 3.06% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
> + 2.81% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
> + 2.51% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
> + 2.27% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
> + 1.84% ovs-vswitchd libc-2.15.so [.] 0x15d30f
> + 1.74% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
> + 1.47% swapper [kernel.kallsyms] [k] intel_idle
> + 1.34% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask
> + 1.33% ovs-vswitchd ovs-vswitchd [.] rule_actions_unref
> + 1.16% ovs-vswitchd ovs-vswitchd [.] hindex_node_with_hash
> + 1.16% ovs-vswitchd ovs-vswitchd [.] do_xlate_actions
> + 1.09% ovs-vswitchd ovs-vswitchd [.] ofproto_rule_ref
> + 1.01% netperf [kernel.kallsyms] [k] __ticket_spin_lock
> ...
>
> There is a small increase in kernel spinlock overhead due to the same
> spinlock being shared between multiple cores of the same physical CPU,
> but it is barely visible in netperf TCP_CRR performance when testing
> kernel module throughput (no userspace activity, a handful of kernel
> flows): maybe a ~1% drop, hard to tell exactly due to variance in the
> test results.
>
> On flow setup, a single stats instance is allocated (for NUMA node
> 0). As CPUs from multiple NUMA nodes start updating stats, new
> node-specific stats instances are allocated. This allocation on the
> packet processing code path never blocks and never dips into emergency
> memory pools, minimizing allocation latency. If the allocation fails,
> the existing preallocated stats instance is used. Also, if only CPUs
> from one NUMA node are updating the preallocated stats instance, no
> additional stats instances are allocated. This eliminates the need to
> preallocate stats instances that will not be used, and relieves the
> stats reader from the burden of reading stats that are never updated.
>
> Signed-off-by: Jarno Rajahalme <jrajahalme at nicira.com>
> ---
> v7: Addressed Pravin's comments:
> - Use RCU to protect the stats pointers, as suggested by Jesse.
> - Move 'last_writer' to struct sw_flow.
> - Removed unnecessary compile-time conditional code.
>
> v6: Addressed Jesse's comments, and fixed locking:
> - Assume it is likely we will write to the preallocated stats.
> - Do not add cache line alignment to flow_cache, as it is mostly read.
> - Always block bottom halves (use spin_lock_bh()) as e.g. NODE 1 can
> keep writing on the preallocated (NODE 0) stats, which could
> otherwise lead to a deadlock.
> - Add a write barrier before storing the newly initialized stats
> pointer, as otherwise other CPUs could access the stats (lock)
> uninitialized.
> - Use the NUMA_NO_NODE from /include/linux/numa.h instead of -1
> directly.
> - Make the most critical new per-NUMA code compile-time conditional
> (on MAX_NUMNODES) to avoid any overheads on kernels built for a
> single node only.
>
> v5: Addressed more comments by Pravin:
> - Removed prefetching as it may decrease performance when action execution
> takes more time (as with tunnels).
> - Disable preemption when using spin_lock() to avoid potential deadlocks.
> - Mark the flow stats cache as __read_mostly.
>
> v4: Addressed comments by Pravin:
> - Do not use __GFP_NOTRACK
> - Use spin_lock() instead of spin_lock_bh() when locking stats on remote nodes.
> - Be hotplug compatible by using num_possible_nodes() when allocating.
> - Remove unnecessary struct alignment attributes.
>
> datapath/flow.c | 136 +++++++++++++++++++++++++++----------------------
> datapath/flow.h | 9 +++-
> datapath/flow_table.c | 44 ++++++++++++----
> datapath/flow_table.h | 2 +
> 4 files changed, 116 insertions(+), 75 deletions(-)
>
> diff --git a/datapath/flow.c b/datapath/flow.c
> index f40cddb..3b76b78 100644
> --- a/datapath/flow.c
> +++ b/datapath/flow.c
> @@ -66,8 +66,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
> {
> struct flow_stats *stats;
> __be16 tcp_flags = 0;
> + int node = numa_node_id();
>
> - stats = this_cpu_ptr(flow->stats);
> + stats = rcu_dereference(flow->stats[node]);
>
> if ((flow->key.eth.type == htons(ETH_P_IP) ||
> flow->key.eth.type == htons(ETH_P_IPV6)) &&
> @@ -77,88 +78,99 @@ void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
> tcp_flags = TCP_FLAGS_BE16(tcp_hdr(skb));
> }
>
> - spin_lock(&stats->lock);
> + /* Check if already have node-specific stats. */
> + if (likely(stats)) {
> + spin_lock(&stats->lock);
> + /* Mark if we write on the pre-allocated stats. */
> + if (node == 0 && unlikely(flow->stats_last_writer != node))
> + flow->stats_last_writer = node;
> + } else {
> + stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
> + spin_lock(&stats->lock);
> +
> + /* If the current NUMA-node is the only writer on the
> + * pre-allocated stats keep using them. */
> + if (unlikely(flow->stats_last_writer != node)) {
> + /* A previous locker may have already allocated the
> + * stats, so we need to check again. If node-specific
> + * stats were already allocated, we update the pre-
> + * allocated stats as we have already locked them. */
> + if (likely(flow->stats_last_writer != NUMA_NO_NODE)
> + && likely(!rcu_dereference(flow->stats[node]))) {
> + /* Try to allocate node-specific stats. */
> + struct flow_stats *new_stats;
> +
> + new_stats =
> + kmem_cache_alloc_node(flow_stats_cache,
> + GFP_THISNODE |
> + __GFP_NOMEMALLOC,
> + node);
> + if (likely(new_stats)) {
> + new_stats->used = jiffies;
> + new_stats->packet_count = 1;
> + new_stats->byte_count = skb->len;
> + new_stats->tcp_flags = tcp_flags;
> + spin_lock_init(&new_stats->lock);
> +
> + rcu_assign_pointer(flow->stats[node],
> + new_stats);
> + goto unlock;
> + }
> + }
> + flow->stats_last_writer = node;
> + }
> + }
> +
> stats->used = jiffies;
> stats->packet_count++;
> stats->byte_count += skb->len;
> stats->tcp_flags |= tcp_flags;
> +unlock:
> spin_unlock(&stats->lock);
> }
>
> -static void stats_read(struct flow_stats *stats, bool lock_bh,
> - struct ovs_flow_stats *ovs_stats,
> - unsigned long *used, __be16 *tcp_flags)
> -{
> - if (lock_bh)
> - spin_lock_bh(&stats->lock);
> - else
> - spin_lock(&stats->lock);
> -
> - if (time_after(stats->used, *used))
> - *used = stats->used;
> - *tcp_flags |= stats->tcp_flags;
> - ovs_stats->n_packets += stats->packet_count;
> - ovs_stats->n_bytes += stats->byte_count;
> -
> - if (lock_bh)
> - spin_unlock_bh(&stats->lock);
> - else
> - spin_unlock(&stats->lock);
> -}
> -
> void ovs_flow_stats_get(struct sw_flow *flow, struct ovs_flow_stats *ovs_stats,
> unsigned long *used, __be16 *tcp_flags)
> {
> - int cpu, cur_cpu;
> + int node;
>
> *used = 0;
> *tcp_flags = 0;
> memset(ovs_stats, 0, sizeof(*ovs_stats));
>
> - cur_cpu = get_cpu();
> -
> - for_each_possible_cpu(cpu) {
> - struct flow_stats *stats;
> - bool lock_bh;
> -
> - stats = per_cpu_ptr(flow->stats, cpu);
> - lock_bh = (cpu == cur_cpu);
> - stats_read(stats, lock_bh, ovs_stats, used, tcp_flags);
> + for_each_node(node) {
> + struct flow_stats *stats = rcu_dereference(flow->stats[node]);
> +
> + if (stats) {
> + /* Local CPU may write on non-local stats, so we must
> + * block bottom-halves here. */
> + spin_lock_bh(&stats->lock);
> + if (time_after(stats->used, *used))
> + *used = stats->used;
> + *tcp_flags |= stats->tcp_flags;
> + ovs_stats->n_packets += stats->packet_count;
> + ovs_stats->n_bytes += stats->byte_count;
> + spin_unlock_bh(&stats->lock);
> + }
> }
> - put_cpu();
> -}
> -
> -static void stats_reset(struct flow_stats *stats, bool lock_bh)
> -{
> - if (lock_bh)
> - spin_lock_bh(&stats->lock);
> - else
> - spin_lock(&stats->lock);
> -
> - stats->used = 0;
> - stats->packet_count = 0;
> - stats->byte_count = 0;
> - stats->tcp_flags = 0;
> -
> - if (lock_bh)
> - spin_unlock_bh(&stats->lock);
> - else
> - spin_unlock(&stats->lock);
> }
>
> void ovs_flow_stats_clear(struct sw_flow *flow)
> {
> - int cpu, cur_cpu;
> -
> - cur_cpu = get_cpu();
> -
> - for_each_possible_cpu(cpu) {
> - bool lock_bh;
> -
> - lock_bh = (cpu == cur_cpu);
> - stats_reset(per_cpu_ptr(flow->stats, cpu), lock_bh);
> + int node;
> +
> + for_each_node(node) {
> + struct flow_stats *stats = rcu_dereference(flow->stats[node]);
> +
> + if (stats) {
> + spin_lock_bh(&stats->lock);
> + stats->used = 0;
> + stats->packet_count = 0;
> + stats->byte_count = 0;
> + stats->tcp_flags = 0;
> + spin_unlock_bh(&stats->lock);
> + }
> }
> - put_cpu();
> }
>
> static int check_header(struct sk_buff *skb, int len)
> diff --git a/datapath/flow.h b/datapath/flow.h
> index c4de0e6..f98c716 100644
> --- a/datapath/flow.h
> +++ b/datapath/flow.h
> @@ -161,12 +161,17 @@ struct sw_flow {
> struct rcu_head rcu;
> struct hlist_node hash_node[2];
> u32 hash;
> -
> + int stats_last_writer; /* NUMA-node id of the last writer on
> + * 'stats[0]'. */
> struct sw_flow_key key;
> struct sw_flow_key unmasked_key;
> struct sw_flow_mask *mask;
> struct sw_flow_actions __rcu *sf_acts;
> - struct flow_stats __percpu *stats;
> + struct flow_stats __rcu *stats[]; /* One for each NUMA node. First one
> + * is allocated at flow creation time,
> + * the rest are allocated on demand
> + * while holding the 'stats[0].lock'.
> + */
> };
>
> struct arp_eth_header {
> diff --git a/datapath/flow_table.c b/datapath/flow_table.c
> index b7007ee..464b798 100644
> --- a/datapath/flow_table.c
> +++ b/datapath/flow_table.c
> @@ -50,6 +50,7 @@
> #define REHASH_INTERVAL (10 * 60 * HZ)
>
> static struct kmem_cache *flow_cache;
> +struct kmem_cache *flow_stats_cache __read_mostly;
>
> static u16 range_n_bytes(const struct sw_flow_key_range *range)
> {
> @@ -77,7 +78,7 @@ void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
> struct sw_flow *ovs_flow_alloc(void)
> {
> struct sw_flow *flow;
> - int cpu;
> + int node;
>
> flow = kmem_cache_alloc(flow_cache, GFP_KERNEL);
> if (!flow)
> @@ -86,16 +87,20 @@ struct sw_flow *ovs_flow_alloc(void)
> flow->sf_acts = NULL;
> flow->mask = NULL;
>
> - flow->stats = alloc_percpu(struct flow_stats);
> - if (!flow->stats)
> + /* Initialize the default stat node. */
> + RCU_INIT_POINTER(flow->stats[0],
> + kmem_cache_alloc_node(flow_stats_cache,
> + GFP_KERNEL | __GFP_ZERO, 0));
> + if (!flow->stats[0])
> goto err;
>
> - for_each_possible_cpu(cpu) {
> - struct flow_stats *cpu_stats;
> + spin_lock_init(&flow->stats[0]->lock);
> + flow->stats_last_writer = NUMA_NO_NODE;
> +
> + for_each_node(node)
> + if (node != 0)
> + RCU_INIT_POINTER(flow->stats[node], NULL);
>
> - cpu_stats = per_cpu_ptr(flow->stats, cpu);
> - spin_lock_init(&cpu_stats->lock);
> - }
> return flow;
> err:
> kmem_cache_free(flow_cache, flow);
> @@ -132,8 +137,13 @@ static struct flex_array *alloc_buckets(unsigned int n_buckets)
>
> static void flow_free(struct sw_flow *flow)
> {
> + int node;
> +
> kfree((struct sf_flow_acts __force *)flow->sf_acts);
> - free_percpu(flow->stats);
> + for_each_node(node)
> + if (flow->stats[node])
> + kmem_cache_free(flow_stats_cache,
> + (struct flow_stats __force *)flow->stats[node]);
> kmem_cache_free(flow_cache, flow);
> }
>
> @@ -595,16 +605,28 @@ int ovs_flow_init(void)
> BUILD_BUG_ON(__alignof__(struct sw_flow_key) % __alignof__(long));
> BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
>
> - flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow), 0,
> - 0, NULL);
> + flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> + + (num_possible_nodes()
> + * sizeof(struct flow_stats *)),
> + 0, 0, NULL);
> if (flow_cache == NULL)
> return -ENOMEM;
>
> + flow_stats_cache
> + = kmem_cache_create("sw_flow_stats", sizeof(struct flow_stats),
> + 0, SLAB_HWCACHE_ALIGN, NULL);
> + if (flow_stats_cache == NULL) {
> + kmem_cache_destroy(flow_cache);
> + flow_cache = NULL;
> + return -ENOMEM;
> + }
> +
> return 0;
> }
>
> /* Uninitializes the flow module. */
> void ovs_flow_exit(void)
> {
> + kmem_cache_destroy(flow_stats_cache);
> kmem_cache_destroy(flow_cache);
> }
> diff --git a/datapath/flow_table.h b/datapath/flow_table.h
> index c26c59a..ca8a582 100644
> --- a/datapath/flow_table.h
> +++ b/datapath/flow_table.h
> @@ -52,6 +52,8 @@ struct flow_table {
> unsigned int count;
> };
>
> +extern struct kmem_cache *flow_stats_cache;
> +
> int ovs_flow_init(void);
> void ovs_flow_exit(void);
>
> --
> 1.7.10.4
>
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev