[ovs-dev] [PATCH ovn v2 3/3] northd.c: Optimize parallel build performance with hash based locks.

Han Zhou hzhou at ovn.org
Mon Oct 4 15:58:43 UTC 2021


On Mon, Oct 4, 2021 at 2:31 AM Anton Ivanov <anton.ivanov at cambridgegreys.com>
wrote:
>
>
> On 03/10/2021 23:45, Han Zhou wrote:
> > The current implementation of parallel build in northd with dp-groups
> > enabled results in bad performance when the below assumption is not
> > true:
> >
> >   * 3. Most RW ops are not flow adds, but changes to the
> >   * od groups.
> >
> > In fact most (if not all) of the OVN port-based flows don't share
> > dp-groups, so the assumption doesn't hold in reality, and in a scale
> > test environment with an ovn-k8s topology of 800 nodes, the parallel build
> > spends 5 times more time for one northd iteration than without
> > parallel build on a test machine with 24 cores (24 northd worker
> > threads). This is because of lock contention on the global rw lock
> > protecting the lflow hmap.
> >
> > This patch optimizes it by using an array of hash-based locks instead of
> > a single global lock. It is similar to the approach prior to the commit
> > 8c980ce6, but with two major differences:
> >
> > 1) It uses a fixed length mutex array instead of the dynamic array of
> >     the struct hashrow_locks. It is equally efficient considering the low
> >     chance of contention in a large array of locks, but without the burden
> >     of resizing every time the hmap size changes. The uniqueness of the
> >     lock is guaranteed by combining the masks of both the hmap and the
> >     mutex array.
> >
> > 2) It fixes the corrupted hmap size after each round of parallel flow
> >     build. The hash based lock protects the list in each bucket, but
> >     doesn't protect the hmap size. The patch uses thread-local counters
> >     and aggregates them at the end of each iteration, which is lock-free.
> >     This approach has lower cost than the alternative of atomically
> >     incrementing a global counter. (A condensed sketch follows below.)
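> >
> > A condensed sketch of the idea (abridged from the diff below; the lflow
> > lookup/allocation logic itself stays in do_ovn_lflow_add()):
> >
> >     /* Choosing the lock from the hash guarantees that any two inserts
> >      * targeting the same hmap bucket take the same mutex. */
> >     struct ovs_mutex *hash_lock =
> >         &lflow_hash_locks[hash & lflow_map->mask & LFLOW_HASH_LOCK_MASK];
> >
> >     ovs_mutex_lock(hash_lock);
> >     /* hmap_insert_fast() never resizes, so only the bucket list is
> >      * touched while the lock is held. */
> >     hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
> >     thread_lflow_counter++;   /* per-thread count, aggregated later */
> >     ovs_mutex_unlock(hash_lock);
> >
> >     /* After all threads complete an iteration, hmap->n is corrected from
> >      * the per-thread counters by fix_flow_map_size(). */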
> >
> > This patch ends up with an 8x speedup over the current parallel build
> > with dp-groups enabled for the same scale test (and is 30% faster than
> > without parallel).
> >
> > Test command: ovn-nbctl --print-wait-time --wait=sb sync
> >
> > Before:
> >
> > no parallel:
> > ovn-northd completion: 7807ms
> >
> > parallel:
> > ovn-northd completion: 41267ms
> >
> > After:
> >
> > no parallel: (no change)
> >
> > parallel:
> > ovn-northd completion: 5081ms
> > (8x faster than before, 30% faster than no parallel)
> >
> > Note: all the above tests are with dp-groups enabled.
> >
> > Signed-off-by: Han Zhou <hzhou at ovn.org>
> > ---
> > v1 -> v2: Addressed comments from Anton
> >      - Fixed the hmap size after each round of parallel flow building.
> >      - Refactored lflow_hash_lock and removed unnecessary functions to avoid
> >        clang warnings.
> >
> >   northd/northd.c | 140 ++++++++++++++++++++++++------------------------
> >   1 file changed, 70 insertions(+), 70 deletions(-)
> >
> > diff --git a/northd/northd.c b/northd/northd.c
> > index afd812700..88ab0c929 100644
> > --- a/northd/northd.c
> > +++ b/northd/northd.c
> > @@ -4319,41 +4319,44 @@ ovn_dp_group_add_with_reference(struct ovn_lflow *lflow_ref,
> >       return true;
> >   }
> >
> > -/* Adds a row with the specified contents to the Logical_Flow table.
> > - * Version to use with dp_groups + parallel - when locking is required.
> > - *
> > - * Assumptions:
> > - *
> > - * 1. A large proportion of the operations are lookups (reads).
> > - * 2. RW operations are a small proportion of overall adds.
> > - * 3. Most RW ops are not flow adds, but changes to the
> > - * od groups.
> > +/* The lflow_hash_lock is a mutex array that protects updates to the shared
> > + * lflow table across threads when parallel lflow build and dp-group are both
> > + * enabled. To avoid high contention between threads, a big array of mutexes
> > + * are used instead of just one. This is possible because when parallel build
> > + * is used we only use hmap_insert_fast() to update the hmap, which would not
> > + * touch the bucket array but only the list in a single bucket. We only need to
> > + * make sure that when adding lflows to the same hash bucket, the same lock is
> > + * used, so that no two threads can add to the bucket at the same time.  It is
> > + * ok that the same lock is used to protect multiple buckets, so a fixed sized
> > + * mutex array is used instead of 1-1 mapping to the hash buckets. This
> > + * simplifies the implementation while effectively reducing lock contention,
> > + * because the chance that different threads contend for the same lock amongst
> > + * the big number of locks is very low. */
> > +#define LFLOW_HASH_LOCK_MASK 0xFFFF
> > +static struct ovs_mutex lflow_hash_locks[LFLOW_HASH_LOCK_MASK + 1];
> > +
> > +static void
> > +lflow_hash_lock_init(void)
> > +{
> > +    for (size_t i = 0; i < LFLOW_HASH_LOCK_MASK + 1; i++) {
> > +        ovs_mutex_init(&lflow_hash_locks[i]);
> > +    }
> > +}
> > +
> > +/* This thread-local var is used for parallel lflow building when dp-groups is
> > + * enabled. It maintains the number of lflows inserted by the current thread to
> > + * the shared lflow hmap in the current iteration. It is needed because the
> > + * lflow_hash_lock cannot protect concurrent updates of the hmap's size
> > + * (hmap->n) by different threads.
> >    *
> > - * Principles of operation:
> > - * 1. All accesses to the flow table are protected by a rwlock.
> > - * 2. By default, everyone grabs a rd lock so that multiple threads
> > - * can do lookups simultaneously.
> > - * 3. If a change to the lflow is needed, the rd lock is released and
> > - * a wr lock is acquired instead (the fact that POSIX does not have an
> > - * "upgrade" on locks is a major pain, but there is nothing we can do
> > - * - it's not available).
> > - * 4. WR lock operations in rd/wr locking have LOWER priority than RD.
> > - * That is by design and spec. So the code after a request for WR lock
> > - * may wait for a considerable amount of time until it is given a
> > - * change to run. That means that another thread may get there in the
> > - * meantime and change the data. Hence all wr operations MUST be coded
> > - * to ensure that they are not vulnerable to "someone pulled this from
> > - * under my feet". Re- reads, checks for presense, etc.
> > - * 5. Operations on the actual od_group hash map are protected by
> > - * per-flow locks. There is no need for these to be rd, mutex is more
> > - * appropriate. They are low contention as each protects only its flow
> > - * and only during modification which happen while holding a rd lock on
> > - * the flow table.
> > - */
> > -static struct ovs_rwlock flowtable_lock;
> > + * When all threads complete the tasks of an iteration, the counters of all the
> > + * threads are collected to fix the lflow hmap's size (by the function
> > + * fix_flow_map_size()).
> > + * */
> > +static thread_local size_t thread_lflow_counter = 0;
> >
> >   /* Adds a row with the specified contents to the Logical_Flow table.
> > - * Version to use when locking is NOT required.
> > + * Version to use when hash bucket locking is NOT required.
> >    */
> >   static struct ovn_lflow *
> >   do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> > @@ -4361,7 +4364,6 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> >                    const char *match, const char *actions, const char *io_port,
> >                    const struct ovsdb_idl_row *stage_hint,
> >                    const char *where, const char *ctrl_meter)
> > -                 OVS_NO_THREAD_SAFETY_ANALYSIS
> >   {
> >
> >       struct ovn_lflow *old_lflow;
> > @@ -4390,14 +4392,14 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> >           hmap_insert(lflow_map, &lflow->hmap_node, hash);
> >       } else {
> >           hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
> > +        thread_lflow_counter++;
> >       }
> >       return lflow;
> >   }
> >
> >   /* Adds a row with the specified contents to the Logical_Flow table.
> > - * Version to use when locking IS required.
> > + * Version to use when hash bucket locking IS required.
> >    */
> > -
> >   static struct ovn_lflow *
> >   do_ovn_lflow_add_pd(struct hmap *lflow_map, struct ovn_datapath *od,
> >                       uint32_t hash, enum ovn_stage stage, uint16_t priority,
> > @@ -4405,44 +4407,16 @@ do_ovn_lflow_add_pd(struct hmap *lflow_map, struct ovn_datapath *od,
> >                       const char *io_port,
> >                       const struct ovsdb_idl_row *stage_hint,
> >                       const char *where, const char *ctrl_meter)
> > -                    OVS_NO_THREAD_SAFETY_ANALYSIS
> >   {
> >
> > -    struct ovn_lflow *old_lflow;
> >       struct ovn_lflow *lflow;
> > +    struct ovs_mutex *hash_lock =
> > +        &lflow_hash_locks[hash & lflow_map->mask & LFLOW_HASH_LOCK_MASK];
> >
> > -    /* Fast Path - try to amend an existing flow without asking
> > -     * for WR access to the whole flow table. Locking the actual
> > -     * hmapx for the particular flow's odg is low overhead as its
> > -     * contention is much lower.
> > -     */
> > -
> > -    ovs_rwlock_rdlock(&flowtable_lock);
> > -    old_lflow = ovn_lflow_find(lflow_map, NULL, stage, priority, match,
> > -                               actions, ctrl_meter, hash);
> > -    if (old_lflow) {
> > -        ovs_mutex_lock(&old_lflow->odg_lock);
> > -        hmapx_add(&old_lflow->od_group, od);
> > -        ovs_mutex_unlock(&old_lflow->odg_lock);
> > -    }
> > -    ovs_rwlock_unlock(&flowtable_lock);
> > -
> > -    if (old_lflow) {
> > -        return old_lflow;
> > -    }
> > -
> > -    ovs_rwlock_wrlock(&flowtable_lock);
> > -
> > -    /* We need to rerun the "if in flowtable" steps, because someone
> > -     * could have inserted it while we were waiting to acquire an
> > -     * wr lock. As we are now holding a wr lock on it nobody else is
> > -     * in the * "fast" portion of the code which is protected by the
> > -     * rwlock.
> > -     */
> > +    ovs_mutex_lock(hash_lock);
> >       lflow = do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
> > -                                 actions, io_port, stage_hint, where,
> > -                                 ctrl_meter);
> > -    ovs_rwlock_unlock(&flowtable_lock);
> > +                             actions, io_port, stage_hint, where, ctrl_meter);
> > +    ovs_mutex_unlock(hash_lock);
> >       return lflow;
> >   }
> >
> > @@ -12767,6 +12741,7 @@ struct lswitch_flow_build_info {
> >       char *svc_check_match;
> >       struct ds match;
> >       struct ds actions;
> > +    size_t thread_lflow_counter;
> >   };
> >
> >   /* Helper function to combine all lflow generation which is iterated by
> > @@ -12892,6 +12867,7 @@ build_lflows_thread(void *arg)
> >           if (stop_parallel_processing()) {
> >               return NULL;
> >           }
> > +        thread_lflow_counter = 0;
> >           if (lsi && workload) {
> >               /* Iterate over bucket ThreadID, ThreadID+size, ... */
> >               for (bnum = control->id;
> > @@ -12952,6 +12928,7 @@ build_lflows_thread(void *arg)
> >                   }
> >               }
> >           }
> > +        lsi->thread_lflow_counter = thread_lflow_counter;
> >           post_completed_work(control);
> >       }
> >       return NULL;
> > @@ -12994,6 +12971,26 @@ noop_callback(struct worker_pool *pool OVS_UNUSED,
> >       /* Do nothing */
> >   }
> >
> > +/* Fixes the hmap size (hmap->n) after parallel building the lflow_map when
> > + * dp-groups is enabled, because in that case all threads are updating the
> > + * global lflow hmap. Although the lflow_hash_lock prevents concurrently
> > + * inserting to the same hash bucket, the hmap->n is updated concurrently by
> > + * all threads and may not be accurate at the end of each iteration. This
> > + * function collects the thread-local lflow counters maintained by each thread
> > + * and updates the hmap size with the aggregated value. This function must be
> > + * called immediately after the worker threads complete the tasks in each
> > + * iteration, before any future operations on the lflow map. */
> > +static void
> > +fix_flow_map_size(struct hmap *lflow_map,
> > +                  struct lswitch_flow_build_info *lsiv,
> > +                  size_t n_lsiv)
> > +{
> > +    size_t total = 0;
> > +    for (size_t i = 0; i < n_lsiv; i++) {
> > +        total += lsiv[i].thread_lflow_counter;
> > +    }
> > +    lflow_map->n = total;
> > +}
> >
> >   static void
> >   build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> > @@ -13046,6 +13043,7 @@ build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> >               lsiv[index].lbs = lbs;
> >               lsiv[index].bfd_connections = bfd_connections;
> >               lsiv[index].svc_check_match = svc_check_match;
> > +            lsiv[index].thread_lflow_counter = 0;
> >               ds_init(&lsiv[index].match);
> >               ds_init(&lsiv[index].actions);
> >
> > @@ -13054,7 +13052,9 @@ build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> >
> >           /* Run thread pool. */
> >           if (use_logical_dp_groups) {
> > -            run_pool_callback(build_lflows_pool->pool, NULL, NULL, noop_callback);
> > +            run_pool_callback(build_lflows_pool->pool, NULL, NULL,
> > +                              noop_callback);
> > +            fix_flow_map_size(lflows, lsiv, build_lflows_pool->pool->size);
> >           } else {
> >               run_pool_hash(build_lflows_pool->pool, lflows, lflow_segs);
> >           }
> > @@ -13221,7 +13221,7 @@ build_lflows(struct northd_context *ctx, struct hmap *datapaths,
> >
> >       if (use_parallel_build && use_logical_dp_groups &&
> >           needs_parallel_init) {
> > -        ovs_rwlock_init(&flowtable_lock);
> > +        lflow_hash_lock_init();
> >           needs_parallel_init = false;
> >           /* Disable parallel build on first run with dp_groups
> >            * to determine the correct sizing of hashes. */
>
>
> +1
>
> As I said before - I would like to get to the bottom of why we get an
> improvement on ovn-heater with rwlock despite these numbers.
>
Yes, there are still unknowns regarding the difference in our test results.
It is possible that there are some differences in our test topology which
result in different proportions of lflow updates with DP groups vs. with a
single DP. However, this still doesn't explain why we observed such different
results for the same "make check-perf" test. Those test cases have a bigger
number of per-LSP flows, which all have single DPs, so the lock contention
should be high. My result shows that the original parallel build spent 4x
more time than without parallel, while in yours parallel was better than
without. Just to confirm, I was always using TESTSUITEFLAGS="--rebuild".
Were you doing the same?
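
For reference, the invocation I'm referring to is roughly the following
(assuming the standard OVN perf test harness):

    make check-perf TESTSUITEFLAGS="--rebuild"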

> However, a fine grained lock is always better than a global lock (even a
> rw one). So a definite +1
>
Thanks for the review! May I consider your +1 for the series as "Acked-by"?
Han

> Brgds,
>
> --
> Anton R. Ivanov
> Cambridgegreys Limited. Registered in England. Company Number 10273661
> https://www.cambridgegreys.com/
>

