[ovs-dev] [PATCH v7] datapath: Per NUMA node flow stats.
Pravin Shelar
pshelar at nicira.com
Sat Feb 15 00:15:54 UTC 2014
I got the following sparse warning:
/home/pravin/ovs/w6/datapath/linux/flow_table.c:97:9: warning: incorrect type in argument 1 (different address spaces)
/home/pravin/ovs/w6/datapath/linux/flow_table.c:97:9:    expected struct spinlock [usertype] *lock
/home/pravin/ovs/w6/datapath/linux/flow_table.c:97:9:    got struct spinlock [noderef] <asn:4>*<noident>
Otherwise looks good.
Acked-by: Pravin B Shelar <pshelar at nicira.com>
Thanks.
On Thu, Feb 13, 2014 at 10:51 AM, Jarno Rajahalme <jrajahalme at nicira.com> wrote:
> Keep kernel flow stats for each NUMA node rather than each (logical)
> CPU. This avoids using the per-CPU allocator, removes most of the
> kernel-side OVS locking overhead that otherwise sits at the top of
> perf reports, and allows OVS to scale better with a higher number of
> threads.
>
> With 9 handlers and 4 revalidators the netperf TCP_CRR flow setup
> rate doubles on a server with two hyper-threaded physical CPUs (16
> logical cores each) compared to the current OVS master. Tested with a
> non-trivial flow table with a TCP port match rule forcing all new
> connections with unique port numbers to OVS userspace. The IP
> addresses are still wildcarded, so the kernel flows are not considered
> exact-match 5-tuple flows. Flows of this type can be expected to
> appear in large numbers as a result of the more effective wildcarding
> made possible by improvements in the OVS userspace flow classifier.
>
> Perf results for this test (master):
>
> Events: 305K cycles
> + 8.43% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
> + 5.64% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
> + 4.75% ovs-vswitchd ovs-vswitchd [.] find_match_wc
> + 3.32% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
> + 2.61% ovs-vswitchd [kernel.kallsyms] [k] pcpu_alloc_area
> + 2.19% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
> + 2.03% swapper [kernel.kallsyms] [k] intel_idle
> + 1.84% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
> + 1.64% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
> + 1.58% ovs-vswitchd libc-2.15.so [.] 0x7f4e6
> + 1.07% ovs-vswitchd [kernel.kallsyms] [k] memset
> + 1.03% netperf [kernel.kallsyms] [k] __ticket_spin_lock
> + 0.92% swapper [kernel.kallsyms] [k] __ticket_spin_lock
> ...
>
> And after this patch:
>
> Events: 356K cycles
> + 6.85% ovs-vswitchd ovs-vswitchd [.] find_match_wc
> + 4.63% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_lock
> + 3.06% ovs-vswitchd [kernel.kallsyms] [k] __ticket_spin_lock
> + 2.81% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask_range
> + 2.51% ovs-vswitchd libpthread-2.15.so [.] pthread_mutex_unlock
> + 2.27% ovs-vswitchd ovs-vswitchd [.] classifier_lookup
> + 1.84% ovs-vswitchd libc-2.15.so [.] 0x15d30f
> + 1.74% ovs-vswitchd [kernel.kallsyms] [k] mutex_spin_on_owner
> + 1.47% swapper [kernel.kallsyms] [k] intel_idle
> + 1.34% ovs-vswitchd ovs-vswitchd [.] flow_hash_in_minimask
> + 1.33% ovs-vswitchd ovs-vswitchd [.] rule_actions_unref
> + 1.16% ovs-vswitchd ovs-vswitchd [.] hindex_node_with_hash
> + 1.16% ovs-vswitchd ovs-vswitchd [.] do_xlate_actions
> + 1.09% ovs-vswitchd ovs-vswitchd [.] ofproto_rule_ref
> + 1.01% netperf [kernel.kallsyms] [k] __ticket_spin_lock
> ...
>
> There is a small increase in kernel spinlock overhead due to the same
> spinlock being shared between multiple cores of the same physical CPU,
> but it is barely visible in netperf TCP_CRR performance when testing
> kernel module throughput (no userspace activity, a handful of kernel
> flows): maybe a ~1% drop, hard to tell exactly due to variance in the
> test results.
>
> On flow setup, a single stats instance is allocated (for NUMA node
> 0). As CPUs from multiple NUMA nodes start updating stats, new
> node-specific stats instances are allocated. This allocation on the
> packet processing code path never blocks and never dips into emergency
> memory pools, minimizing allocation latency. If the allocation fails,
> the existing preallocated stats instance is used. Also, if only CPUs
> from one NUMA node are updating the preallocated stats instance, no
> additional stats instances are allocated. This eliminates the need to
> preallocate stats instances that will not be used, and relieves the
> stats reader from the burden of reading stats that are never updated.
>
> Signed-off-by: Jarno Rajahalme <jrajahalme at nicira.com>
> ---
> v7: Addressed Pravin's comments:
> - Use RCU to protect the stats pointers, as suggested by Jesse.
> - Move 'last_writer' to struct sw_flow.
> - Removed unnecessary compile-time conditional code.
>
> v6: Addressed Jesse's comments, and fixed locking:
> - Assume it is likely we will write to the preallocated stats.
> - Do not add cache line alignment to flow_cache, as it is mostly read.
> - Always block bottom halves (use spin_lock_bh()) as e.g. NODE 1 can
> keep writing on the preallocated (NODE 0) stats, which could
> otherwise lead to a deadlock.
> - Add a write barrier before storing the newly initialized stats
> pointer, as otherwise other CPUs could access the stats (lock)
> uninitialized.
> - Use the NUMA_NO_NODE from /include/linux/numa.h instead of -1
> directly.
> - Make the most critical new per-NUMA code compile-time conditional
> (on MAX_NUMNODES) to avoid any overheads on kernels built for a
> single node only.
>
> v5: Addressed more comments by Pravin:
> - Removed prefetching as it may decrease performance when action execution
> takes more time (as with tunnels).
> - Disable preemption when using spin_lock() to avoid potential deadlocks.
> - Mark the flow stats cache as __read_mostly.
>
> v4: Addressed comments by Pravin:
> - Do not use __GFP_NOTRACK
> - Use spin_lock() instead of spin_lock_bh() when locking stats on remote nodes.
> - Be hotplug compatible by using num_possible_nodes() when allocating.
> - Remove unnecessary struct alignment attributes.
>
> datapath/flow.c | 136 +++++++++++++++++++++++++++----------------------
> datapath/flow.h | 9 +++-
> datapath/flow_table.c | 44 ++++++++++++----
> datapath/flow_table.h | 2 +
> 4 files changed, 116 insertions(+), 75 deletions(-)
>
> diff --git a/datapath/flow.c b/datapath/flow.c
> index f40cddb..3b76b78 100644
> --- a/datapath/flow.c
> +++ b/datapath/flow.c
> @@ -66,8 +66,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
> {
> struct flow_stats *stats;
> __be16 tcp_flags = 0;
> + int node = numa_node_id();
>
> - stats = this_cpu_ptr(flow->stats);
> + stats = rcu_dereference(flow->stats[node]);
>
> if ((flow->key.eth.type == htons(ETH_P_IP) ||
> flow->key.eth.type == htons(ETH_P_IPV6)) &&
> @@ -77,88 +78,99 @@ void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
> tcp_flags = TCP_FLAGS_BE16(tcp_hdr(skb));
> }
>
> - spin_lock(&stats->lock);
> + /* Check if already have node-specific stats. */
> + if (likely(stats)) {
> + spin_lock(&stats->lock);
> + /* Mark if we write on the pre-allocated stats. */
> + if (node == 0 && unlikely(flow->stats_last_writer != node))
> + flow->stats_last_writer = node;
> + } else {
> + stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
> + spin_lock(&stats->lock);
> +
> + /* If the current NUMA-node is the only writer on the
> + * pre-allocated stats keep using them. */
> + if (unlikely(flow->stats_last_writer != node)) {
> + /* A previous locker may have already allocated the
> + * stats, so we need to check again. If node-specific
> + * stats were already allocated, we update the pre-
> + * allocated stats as we have already locked them. */
> + if (likely(flow->stats_last_writer != NUMA_NO_NODE)
> + && likely(!rcu_dereference(flow->stats[node]))) {
> + /* Try to allocate node-specific stats. */
> + struct flow_stats *new_stats;
> +
> + new_stats =
> + kmem_cache_alloc_node(flow_stats_cache,
> + GFP_THISNODE |
> + __GFP_NOMEMALLOC,
> + node);
> + if (likely(new_stats)) {
> + new_stats->used = jiffies;
> + new_stats->packet_count = 1;
> + new_stats->byte_count = skb->len;
> + new_stats->tcp_flags = tcp_flags;
> + spin_lock_init(&new_stats->lock);
> +
> + rcu_assign_pointer(flow->stats[node],
> + new_stats);
> + goto unlock;
> + }
> + }
> + flow->stats_last_writer = node;
> + }
> + }
> +
> stats->used = jiffies;
> stats->packet_count++;
> stats->byte_count += skb->len;
> stats->tcp_flags |= tcp_flags;
> +unlock:
> spin_unlock(&stats->lock);
> }
>
> -static void stats_read(struct flow_stats *stats, bool lock_bh,
> - struct ovs_flow_stats *ovs_stats,
> - unsigned long *used, __be16 *tcp_flags)
> -{
> - if (lock_bh)
> - spin_lock_bh(&stats->lock);
> - else
> - spin_lock(&stats->lock);
> -
> - if (time_after(stats->used, *used))
> - *used = stats->used;
> - *tcp_flags |= stats->tcp_flags;
> - ovs_stats->n_packets += stats->packet_count;
> - ovs_stats->n_bytes += stats->byte_count;
> -
> - if (lock_bh)
> - spin_unlock_bh(&stats->lock);
> - else
> - spin_unlock(&stats->lock);
> -}
> -
> void ovs_flow_stats_get(struct sw_flow *flow, struct ovs_flow_stats *ovs_stats,
> unsigned long *used, __be16 *tcp_flags)
> {
> - int cpu, cur_cpu;
> + int node;
>
> *used = 0;
> *tcp_flags = 0;
> memset(ovs_stats, 0, sizeof(*ovs_stats));
>
> - cur_cpu = get_cpu();
> -
> - for_each_possible_cpu(cpu) {
> - struct flow_stats *stats;
> - bool lock_bh;
> -
> - stats = per_cpu_ptr(flow->stats, cpu);
> - lock_bh = (cpu == cur_cpu);
> - stats_read(stats, lock_bh, ovs_stats, used, tcp_flags);
> + for_each_node(node) {
> + struct flow_stats *stats = rcu_dereference(flow->stats[node]);
> +
> + if (stats) {
> + /* Local CPU may write on non-local stats, so we must
> + * block bottom-halves here. */
> + spin_lock_bh(&stats->lock);
> + if (time_after(stats->used, *used))
> + *used = stats->used;
> + *tcp_flags |= stats->tcp_flags;
> + ovs_stats->n_packets += stats->packet_count;
> + ovs_stats->n_bytes += stats->byte_count;
> + spin_unlock_bh(&stats->lock);
> + }
> }
> - put_cpu();
> -}
> -
> -static void stats_reset(struct flow_stats *stats, bool lock_bh)
> -{
> - if (lock_bh)
> - spin_lock_bh(&stats->lock);
> - else
> - spin_lock(&stats->lock);
> -
> - stats->used = 0;
> - stats->packet_count = 0;
> - stats->byte_count = 0;
> - stats->tcp_flags = 0;
> -
> - if (lock_bh)
> - spin_unlock_bh(&stats->lock);
> - else
> - spin_unlock(&stats->lock);
> }
>
> void ovs_flow_stats_clear(struct sw_flow *flow)
> {
> - int cpu, cur_cpu;
> -
> - cur_cpu = get_cpu();
> -
> - for_each_possible_cpu(cpu) {
> - bool lock_bh;
> -
> - lock_bh = (cpu == cur_cpu);
> - stats_reset(per_cpu_ptr(flow->stats, cpu), lock_bh);
> + int node;
> +
> + for_each_node(node) {
> + struct flow_stats *stats = rcu_dereference(flow->stats[node]);
> +
> + if (stats) {
> + spin_lock_bh(&stats->lock);
> + stats->used = 0;
> + stats->packet_count = 0;
> + stats->byte_count = 0;
> + stats->tcp_flags = 0;
> + spin_unlock_bh(&stats->lock);
> + }
> }
> - put_cpu();
> }
>
> static int check_header(struct sk_buff *skb, int len)
> diff --git a/datapath/flow.h b/datapath/flow.h
> index c4de0e6..f98c716 100644
> --- a/datapath/flow.h
> +++ b/datapath/flow.h
> @@ -161,12 +161,17 @@ struct sw_flow {
> struct rcu_head rcu;
> struct hlist_node hash_node[2];
> u32 hash;
> -
> + int stats_last_writer; /* NUMA-node id of the last writer on
> + * 'stats[0]'. */
> struct sw_flow_key key;
> struct sw_flow_key unmasked_key;
> struct sw_flow_mask *mask;
> struct sw_flow_actions __rcu *sf_acts;
> - struct flow_stats __percpu *stats;
> + struct flow_stats __rcu *stats[]; /* One for each NUMA node. First one
> + * is allocated at flow creation time,
> + * the rest are allocated on demand
> + * while holding the 'stats[0].lock'.
> + */
> };
>
> struct arp_eth_header {
> diff --git a/datapath/flow_table.c b/datapath/flow_table.c
> index b7007ee..464b798 100644
> --- a/datapath/flow_table.c
> +++ b/datapath/flow_table.c
> @@ -50,6 +50,7 @@
> #define REHASH_INTERVAL (10 * 60 * HZ)
>
> static struct kmem_cache *flow_cache;
> +struct kmem_cache *flow_stats_cache __read_mostly;
>
> static u16 range_n_bytes(const struct sw_flow_key_range *range)
> {
> @@ -77,7 +78,7 @@ void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
> struct sw_flow *ovs_flow_alloc(void)
> {
> struct sw_flow *flow;
> - int cpu;
> + int node;
>
> flow = kmem_cache_alloc(flow_cache, GFP_KERNEL);
> if (!flow)
> @@ -86,16 +87,20 @@ struct sw_flow *ovs_flow_alloc(void)
> flow->sf_acts = NULL;
> flow->mask = NULL;
>
> - flow->stats = alloc_percpu(struct flow_stats);
> - if (!flow->stats)
> + /* Initialize the default stat node. */
> + RCU_INIT_POINTER(flow->stats[0],
> + kmem_cache_alloc_node(flow_stats_cache,
> + GFP_KERNEL | __GFP_ZERO, 0));
> + if (!flow->stats[0])
> goto err;
>
> - for_each_possible_cpu(cpu) {
> - struct flow_stats *cpu_stats;
> + spin_lock_init(&flow->stats[0]->lock);
> + flow->stats_last_writer = NUMA_NO_NODE;
> +
> + for_each_node(node)
> + if (node != 0)
> + RCU_INIT_POINTER(flow->stats[node], NULL);
>
> - cpu_stats = per_cpu_ptr(flow->stats, cpu);
> - spin_lock_init(&cpu_stats->lock);
> - }
> return flow;
> err:
> kmem_cache_free(flow_cache, flow);
> @@ -132,8 +137,13 @@ static struct flex_array *alloc_buckets(unsigned int n_buckets)
>
> static void flow_free(struct sw_flow *flow)
> {
> + int node;
> +
> kfree((struct sf_flow_acts __force *)flow->sf_acts);
> - free_percpu(flow->stats);
> + for_each_node(node)
> + if (flow->stats[node])
> + kmem_cache_free(flow_stats_cache,
> + (struct flow_stats __force *)flow->stats[node]);
> kmem_cache_free(flow_cache, flow);
> }
>
> @@ -595,16 +605,28 @@ int ovs_flow_init(void)
> BUILD_BUG_ON(__alignof__(struct sw_flow_key) % __alignof__(long));
> BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
>
> - flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow), 0,
> - 0, NULL);
> + flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> + + (num_possible_nodes()
> + * sizeof(struct flow_stats *)),
> + 0, 0, NULL);
> if (flow_cache == NULL)
> return -ENOMEM;
>
> + flow_stats_cache
> + = kmem_cache_create("sw_flow_stats", sizeof(struct flow_stats),
> + 0, SLAB_HWCACHE_ALIGN, NULL);
> + if (flow_stats_cache == NULL) {
> + kmem_cache_destroy(flow_cache);
> + flow_cache = NULL;
> + return -ENOMEM;
> + }
> +
> return 0;
> }
>
> /* Uninitializes the flow module. */
> void ovs_flow_exit(void)
> {
> + kmem_cache_destroy(flow_stats_cache);
> kmem_cache_destroy(flow_cache);
> }
> diff --git a/datapath/flow_table.h b/datapath/flow_table.h
> index c26c59a..ca8a582 100644
> --- a/datapath/flow_table.h
> +++ b/datapath/flow_table.h
> @@ -52,6 +52,8 @@ struct flow_table {
> unsigned int count;
> };
>
> +extern struct kmem_cache *flow_stats_cache;
> +
> int ovs_flow_init(void);
> void ovs_flow_exit(void);
>
> --
> 1.7.10.4
>
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev