[ovs-dev] [PATCH ovn v2 3/3] northd.c: Optimize parallel build performance with hash based locks.

Anton Ivanov anton.ivanov at cambridgegreys.com
Mon Oct 4 09:31:48 UTC 2021


On 03/10/2021 23:45, Han Zhou wrote:
> The current implementation of parallel build in northd with dp-groups
> enabled results in bad performance when the below assumption is not
> true:
>
>   * 3. Most RW ops are not flow adds, but changes to the
>   * od groups.
>
> In fact most (if not all) of the ovn port based flows don't share
> dp-groups, so the assumption doesn't hold in reality, and in a scale
> test environment with ovn-k8s topology of 800 nodes, the parallel build
> takes 5 times longer for one northd iteration than without
> parallel build on a test machine with 24 cores (24 northd worker
> threads). This is because of lock contention on the global rw lock
> protecting the lflow hmap.
>
> This patch optimizes it by using an array of hash based locks instead of
> a single global lock. It is similar to the approach prior to the commit
> 8c980ce6, but with two major differences:
>
> 1) It uses a fixed length mutex array instead of the dynamic array of
>     the struct hashrow_locks. It is equally efficient considering the low
>     chance of contention in a large array of locks, but without the burden
>     of resizing every time the hmap size changes. The uniqueness of the
>     lock is guaranteed by combining the masks of both the hmap and the
>     mutex array.
>
> 2) It fixes the corrupted hmap size after each round of parallel flow
>     build. The hash based lock protects the list in each bucket, but
>     doesn't protect the hmap size. The patch uses thread-local counters
>     and aggregates them at the end of each iteration, which is lock free.
>     This approach has lower cost than the alternative of atomically
>     incrementing a global counter.
>
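
[A side note to spell out the mask trick for other readers - my own sketch,
not part of the patch: hmap_insert_fast() places a node into bucket
(hash & hmap->mask), so any two flows that collide in the hmap agree on that
value and therefore also pick the same mutex, e.g.:

    /* Illustrative only; names as in the patch. */
    size_t bucket   = hash & lflow_map->mask;
    size_t lock_idx = bucket & LFLOW_HASH_LOCK_MASK;

    ovs_mutex_lock(&lflow_hash_locks[lock_idx]);
    hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
    ovs_mutex_unlock(&lflow_hash_locks[lock_idx]);

Different buckets may share a lock, which only adds a little false contention;
it can never let two threads modify the same bucket list concurrently.]
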
> This patch results in an 8x speedup over the current parallel build
> with dp-groups enabled for the same scale test (which is 30% faster than
> without parallel).
>
> Test command: ovn-nbctl --print-wait-time --wait=sb sync
>
> Before:
>
> no parallel:
> ovn-northd completion: 7807ms
>
> parallel:
> ovn-northd completion: 41267ms
>
> After:
>
> no parallel: (no change)
>
> parallel:
> ovn-northd completion: 5081ms
> (8x faster than before, 30% faster than no parallel)
>
> Note: all the above tests are with dp-groups enabled.
>
> Signed-off-by: Han Zhou <hzhou at ovn.org>
> ---
> v1 -> v2: Addressed comments from Anton
>      - Fixes the hmap size after each round of parallel flow building.
>      - Refactored lflow_hash_lock and removed unnecessary functions to avoid
>        clang warnings.
>
>   northd/northd.c | 140 ++++++++++++++++++++++++------------------------
>   1 file changed, 70 insertions(+), 70 deletions(-)
>
> diff --git a/northd/northd.c b/northd/northd.c
> index afd812700..88ab0c929 100644
> --- a/northd/northd.c
> +++ b/northd/northd.c
> @@ -4319,41 +4319,44 @@ ovn_dp_group_add_with_reference(struct ovn_lflow *lflow_ref,
>       return true;
>   }
>   
> -/* Adds a row with the specified contents to the Logical_Flow table.
> - * Version to use with dp_groups + parallel - when locking is required.
> - *
> - * Assumptions:
> - *
> - * 1. A large proportion of the operations are lookups (reads).
> - * 2. RW operations are a small proportion of overall adds.
> - * 3. Most RW ops are not flow adds, but changes to the
> - * od groups.
> +/* The lflow_hash_lock is a mutex array that protects updates to the shared
> + * lflow table across threads when parallel lflow build and dp-group are both
> + * enabled. To avoid high contention between threads, a big array of mutexes
> + * is used instead of just one. This is possible because when parallel build
> + * is used we only use hmap_insert_fast() to update the hmap, which would not
> + * touch the bucket array but only the list in a single bucket. We only need to
> + * make sure that when adding lflows to the same hash bucket, the same lock is
> + * used, so that no two threads can add to the bucket at the same time.  It is
> + * ok that the same lock is used to protect multiple buckets, so a fixed-size
> + * mutex array is used instead of a 1-1 mapping to the hash buckets. This
> + * simplifies the implementation while still effectively reducing lock
> + * contention, because the chance that different threads contend for the same
> + * lock among the large number of locks is very low. */
> +#define LFLOW_HASH_LOCK_MASK 0xFFFF
> +static struct ovs_mutex lflow_hash_locks[LFLOW_HASH_LOCK_MASK + 1];
> +
> +static void
> +lflow_hash_lock_init(void)
> +{
> +    for (size_t i = 0; i < LFLOW_HASH_LOCK_MASK + 1; i++) {
> +        ovs_mutex_init(&lflow_hash_locks[i]);
> +    }
> +}
> +
> +/* This thread-local var is used for parallel lflow building when dp-groups is
> + * enabled. It maintains the number of lflows inserted by the current thread to
> + * the shared lflow hmap in the current iteration. It is needed because the
> + * lflow_hash_lock cannot protect concurrent updates of the hmap's size
> + * (hmap->n) by different threads.
>    *
> - * Principles of operation:
> - * 1. All accesses to the flow table are protected by a rwlock.
> - * 2. By default, everyone grabs a rd lock so that multiple threads
> - * can do lookups simultaneously.
> - * 3. If a change to the lflow is needed, the rd lock is released and
> - * a wr lock is acquired instead (the fact that POSIX does not have an
> - * "upgrade" on locks is a major pain, but there is nothing we can do
> - * - it's not available).
> - * 4. WR lock operations in rd/wr locking have LOWER priority than RD.
> - * That is by design and spec. So the code after a request for WR lock
> - * may wait for a considerable amount of time until it is given a
> - * change to run. That means that another thread may get there in the
> - * meantime and change the data. Hence all wr operations MUST be coded
> - * to ensure that they are not vulnerable to "someone pulled this from
> - * under my feet". Re- reads, checks for presense, etc.
> - * 5. Operations on the actual od_group hash map are protected by
> - * per-flow locks. There is no need for these to be rd, mutex is more
> - * appropriate. They are low contention as each protects only its flow
> - * and only during modification which happen while holding a rd lock on
> - * the flow table.
> - */
> -static struct ovs_rwlock flowtable_lock;
> + * When all threads complete the tasks of an iteration, the counters of all the
> + * threads are collected to fix the lflow hmap's size (by the function
> + * fix_flow_map_size()).
> + */
> +static thread_local size_t thread_lflow_counter = 0;
>   
>   /* Adds a row with the specified contents to the Logical_Flow table.
> - * Version to use when locking is NOT required.
> + * Version to use when hash bucket locking is NOT required.
>    */
>   static struct ovn_lflow *
>   do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> @@ -4361,7 +4364,6 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
>                    const char *match, const char *actions, const char *io_port,
>                    const struct ovsdb_idl_row *stage_hint,
>                    const char *where, const char *ctrl_meter)
> -                 OVS_NO_THREAD_SAFETY_ANALYSIS
>   {
>   
>       struct ovn_lflow *old_lflow;
> @@ -4390,14 +4392,14 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
>           hmap_insert(lflow_map, &lflow->hmap_node, hash);
>       } else {
>           hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
> +        thread_lflow_counter++;
>       }
>       return lflow;
>   }
>   
>   /* Adds a row with the specified contents to the Logical_Flow table.
> - * Version to use when locking IS required.
> + * Version to use when hash bucket locking IS required.
>    */
> -
>   static struct ovn_lflow *
>   do_ovn_lflow_add_pd(struct hmap *lflow_map, struct ovn_datapath *od,
>                       uint32_t hash, enum ovn_stage stage, uint16_t priority,
> @@ -4405,44 +4407,16 @@ do_ovn_lflow_add_pd(struct hmap *lflow_map, struct ovn_datapath *od,
>                       const char *io_port,
>                       const struct ovsdb_idl_row *stage_hint,
>                       const char *where, const char *ctrl_meter)
> -                    OVS_NO_THREAD_SAFETY_ANALYSIS
>   {
>   
> -    struct ovn_lflow *old_lflow;
>       struct ovn_lflow *lflow;
> +    struct ovs_mutex *hash_lock =
> +        &lflow_hash_locks[hash & lflow_map->mask & LFLOW_HASH_LOCK_MASK];
>   
> -    /* Fast Path - try to amend an existing flow without asking
> -     * for WR access to the whole flow table. Locking the actual
> -     * hmapx for the particular flow's odg is low overhead as its
> -     * contention is much lower.
> -     */
> -
> -    ovs_rwlock_rdlock(&flowtable_lock);
> -    old_lflow = ovn_lflow_find(lflow_map, NULL, stage, priority, match,
> -                               actions, ctrl_meter, hash);
> -    if (old_lflow) {
> -        ovs_mutex_lock(&old_lflow->odg_lock);
> -        hmapx_add(&old_lflow->od_group, od);
> -        ovs_mutex_unlock(&old_lflow->odg_lock);
> -    }
> -    ovs_rwlock_unlock(&flowtable_lock);
> -
> -    if (old_lflow) {
> -        return old_lflow;
> -    }
> -
> -    ovs_rwlock_wrlock(&flowtable_lock);
> -
> -    /* We need to rerun the "if in flowtable" steps, because someone
> -     * could have inserted it while we were waiting to acquire an
> -     * wr lock. As we are now holding a wr lock on it nobody else is
> -     * in the * "fast" portion of the code which is protected by the
> -     * rwlock.
> -     */
> +    ovs_mutex_lock(hash_lock);
>       lflow = do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
> -                                 actions, io_port, stage_hint, where,
> -                                 ctrl_meter);
> -    ovs_rwlock_unlock(&flowtable_lock);
> +                             actions, io_port, stage_hint, where, ctrl_meter);
> +    ovs_mutex_unlock(hash_lock);
>       return lflow;
>   }
>   
> @@ -12767,6 +12741,7 @@ struct lswitch_flow_build_info {
>       char *svc_check_match;
>       struct ds match;
>       struct ds actions;
> +    size_t thread_lflow_counter;
>   };
>   
>   /* Helper function to combine all lflow generation which is iterated by
> @@ -12892,6 +12867,7 @@ build_lflows_thread(void *arg)
>           if (stop_parallel_processing()) {
>               return NULL;
>           }
> +        thread_lflow_counter = 0;
>           if (lsi && workload) {
>               /* Iterate over bucket ThreadID, ThreadID+size, ... */
>               for (bnum = control->id;
> @@ -12952,6 +12928,7 @@ build_lflows_thread(void *arg)
>                   }
>               }
>           }
> +        lsi->thread_lflow_counter = thread_lflow_counter;
>           post_completed_work(control);
>       }
>       return NULL;
> @@ -12994,6 +12971,26 @@ noop_callback(struct worker_pool *pool OVS_UNUSED,
>       /* Do nothing */
>   }
>   
> +/* Fixes the hmap size (hmap->n) after parallel building the lflow_map when
> + * dp-groups is enabled, because in that case all threads are updating the
> + * global lflow hmap. Although the lflow_hash_lock prevents concurrent inserts
> + * to the same hash bucket, the hmap->n is updated concurrently by all threads
> + * and may not be accurate at the end of each iteration. This function collects
> + * the thread-local lflow counters maintained by each thread and updates the hmap
> + * size with the aggregated value. This function must be called immediately
> + * after the worker threads complete the tasks in each iteration before any
> + * future operations on the lflow map. */
> +static void
> +fix_flow_map_size(struct hmap *lflow_map,
> +                  struct lswitch_flow_build_info *lsiv,
> +                  size_t n_lsiv)
> +{
> +    size_t total = 0;
> +    for (size_t i = 0; i < n_lsiv; i++) {
> +        total += lsiv[i].thread_lflow_counter;
> +    }
> +    lflow_map->n = total;
> +}
>   
>   static void
>   build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> @@ -13046,6 +13043,7 @@ build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
>               lsiv[index].lbs = lbs;
>               lsiv[index].bfd_connections = bfd_connections;
>               lsiv[index].svc_check_match = svc_check_match;
> +            lsiv[index].thread_lflow_counter = 0;
>               ds_init(&lsiv[index].match);
>               ds_init(&lsiv[index].actions);
>   
> @@ -13054,7 +13052,9 @@ build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
>   
>           /* Run thread pool. */
>           if (use_logical_dp_groups) {
> -            run_pool_callback(build_lflows_pool->pool, NULL, NULL, noop_callback);
> +            run_pool_callback(build_lflows_pool->pool, NULL, NULL,
> +                              noop_callback);
> +            fix_flow_map_size(lflows, lsiv, build_lflows_pool->pool->size);
>           } else {
>               run_pool_hash(build_lflows_pool->pool, lflows, lflow_segs);
>           }
> @@ -13221,7 +13221,7 @@ build_lflows(struct northd_context *ctx, struct hmap *datapaths,
>   
>       if (use_parallel_build && use_logical_dp_groups &&
>           needs_parallel_init) {
> -        ovs_rwlock_init(&flowtable_lock);
> +        lflow_hash_lock_init();
>           needs_parallel_init = false;
>           /* Disable parallel build on first run with dp_groups
>            * to determine the correct sizing of hashes. */
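
For readers less familiar with the counter trick above - a stand-alone toy
sketch of the per-thread counting plus one-shot aggregation pattern (my own
illustration, not part of the patch; the real code counts hmap_insert_fast()
calls and fixes lflow_map->n in fix_flow_map_size()):

    #include <pthread.h>
    #include <stdio.h>

    #define N_THREADS 4
    #define N_INSERTS 100000

    static __thread size_t thread_counter;      /* like thread_lflow_counter */
    static size_t per_thread_total[N_THREADS];  /* like lsiv[i].thread_lflow_counter */

    static void *
    worker(void *arg)
    {
        size_t id = (size_t) arg;

        thread_counter = 0;
        for (size_t i = 0; i < N_INSERTS; i++) {
            /* ... the locked hmap_insert_fast() would go here ... */
            thread_counter++;                   /* no shared-counter atomics */
        }
        per_thread_total[id] = thread_counter;  /* published once, at the end */
        return NULL;
    }

    int
    main(void)
    {
        pthread_t threads[N_THREADS];
        size_t total = 0;

        for (size_t i = 0; i < N_THREADS; i++) {
            pthread_create(&threads[i], NULL, worker, (void *) i);
        }
        for (size_t i = 0; i < N_THREADS; i++) {
            pthread_join(threads[i], NULL);
            total += per_thread_total[i];       /* plays the role of hmap->n */
        }
        printf("total inserts: %zu\n", total);
        return 0;
    }

Each worker only touches thread-local state while counting, and the single
aggregation pass is trivially cheap compared to per-insert atomic increments
of a global counter.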


+1

As I said before - I would like to get to the bottom of why we see an improvement on ovn-heater with the rwlock despite these numbers.

However, a fine-grained lock is always better than a global lock (even a rw one). So a definite +1
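
For what it is worth, a rough birthday-style estimate (assuming uniformly
distributed hashes): if all 24 worker threads grabbed a lock at the same
instant, the probability that any two of them pick the same one out of the
65536 is roughly 24*23 / (2 * 65536), i.e. about 0.4%, so the fixed array
should behave almost like true per-bucket locking in practice.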

Brgds,

-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


