[ovs-dev] [PATCH ovn v2 3/3] northd.c: Optimize parallel build performance with hash based locks.

Numan Siddique numans at ovn.org
Tue Oct 5 16:37:03 UTC 2021


On Mon, Oct 4, 2021 at 1:28 PM Han Zhou <hzhou at ovn.org> wrote:
>
> On Mon, Oct 4, 2021 at 9:57 AM Anton Ivanov <anton.ivanov at cambridgegreys.com> wrote:
> >
> >
> > On 04/10/2021 16:58, Han Zhou wrote:
> >
> >
> >
> > On Mon, Oct 4, 2021 at 2:31 AM Anton Ivanov <anton.ivanov at cambridgegreys.com> wrote:
> > >
> > >
> > > On 03/10/2021 23:45, Han Zhou wrote:
> > > > The current implementation of parallel build in northd with dp-groups
> > > > enabled results in bad performance when the below assumption is not
> > > > true:
> > > >
> > > >   * 3. Most RW ops are not flow adds, but changes to the
> > > >   * od groups.
> > > >
> > > > In fact most (if not all) of the ovn port based flows don't share
> > > > dp-groups, so the assumption doesn't hold in reality, and in a scale
> > > > test environment with ovn-k8s topology of 800 nodes, the parallel build
> > > > shows 5 times more time spent for one northd iteration than without
> > > > parallel build on a test machine with 24 cores (24 northd worker
> > > > threads). This is because of lock contention on the global rw lock
> > > > protecting the lflow hmap.
> > > >
> > > > This patch optimizes it by using an array of hash based locks instead of
> > > > a single global lock. It is similar to the approach prior to the commit
> > > > 8c980ce6, but with two major differences:
> > > >
> > > > 1) It uses a fixed-length mutex array instead of the dynamic array of
> > > >     the struct hashrow_locks. It is equally efficient considering the low
> > > >     chance of contention in a large array of locks, but without the burden
> > > >     of resizing every time the hmap size changes. The uniqueness of the
> > > >     lock is guaranteed by combining the masks of both the hmap and the
> > > >     mutex array.
> > > >
> > > > 2) It fixes the corrupted hmap size after each round of parallel flow
> > > >     build. The hash based lock protects the list in each bucket, but
> > > >     doesn't protect the hmap size. The patch uses thread-local counters
> > > >     and aggregates them at the end of each iteration, which is lock free.
> > > >     This approach has a lower cost than the alternative of atomically
> > > >     incrementing a global counter. (A sketch of both mechanisms follows
> > > >     below.)
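> > > >
> > > > For illustration, here is a minimal self-contained sketch of the two
> > > > mechanisms described above. It uses plain pthreads and hypothetical
> > > > simplified names (toy_hmap, locked_insert, fix_size); the actual
> > > > implementation, based on OVS's ovs_mutex and hmap types, is in the
> > > > diff below.
> > > >
> > > > #include <pthread.h>
> > > > #include <stddef.h>
> > > > #include <stdint.h>
> > > >
> > > > #define LOCK_MASK 0xFFFF   /* 65536 locks, like LFLOW_HASH_LOCK_MASK. */
> > > > static pthread_mutex_t hash_locks[LOCK_MASK + 1];
> > > >
> > > > struct toy_node {
> > > >     struct toy_node *next;
> > > >     uint32_t hash;
> > > > };
> > > >
> > > > struct toy_hmap {
> > > >     struct toy_node **buckets; /* mask + 1 singly linked bucket lists. */
> > > >     uint32_t mask;             /* Bucket count minus one (power of 2). */
> > > >     size_t n;                  /* Unsafe to update concurrently. */
> > > > };
> > > >
> > > > /* Per-thread insert counter, aggregated after the workers finish. */
> > > > static _Thread_local size_t thread_counter;
> > > >
> > > > static void
> > > > hash_locks_init(void)
> > > > {
> > > >     for (size_t i = 0; i <= LOCK_MASK; i++) {
> > > >         pthread_mutex_init(&hash_locks[i], NULL);
> > > >     }
> > > > }
> > > >
> > > > /* Two inserts that collide in the hmap have the same "hash & mask", so
> > > >  * masking by the hmap mask first and the lock mask second guarantees
> > > >  * that threads touching the same bucket always take the same mutex. */
> > > > static void
> > > > locked_insert(struct toy_hmap *hm, struct toy_node *node)
> > > > {
> > > >     pthread_mutex_t *lock =
> > > >         &hash_locks[node->hash & hm->mask & LOCK_MASK];
> > > >
> > > >     pthread_mutex_lock(lock);
> > > >     size_t b = node->hash & hm->mask;
> > > >     node->next = hm->buckets[b];  /* Prepend to the bucket list only; */
> > > >     hm->buckets[b] = node;        /* hm->n is deliberately untouched. */
> > > >     pthread_mutex_unlock(lock);
> > > >
> > > >     thread_counter++;             /* Lock-free size bookkeeping. */
> > > > }
> > > >
> > > > /* Called once after every worker has stored its thread_counter into
> > > >  * per_thread[] and joined; repairs the shared size in one step. */
> > > > static void
> > > > fix_size(struct toy_hmap *hm, const size_t *per_thread, size_t n_threads)
> > > > {
> > > >     size_t total = 0;
> > > >     for (size_t i = 0; i < n_threads; i++) {
> > > >         total += per_thread[i];
> > > >     }
> > > >     hm->n = total;
> > > > }
> > > >
> > > > Here thread_counter plays the role of thread_lflow_counter, and fix_size
> > > > that of fix_flow_map_size() in the patch.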
> > > >
> > > > This patch ends up with an 8x speedup over the current parallel build
> > > > with dp-group enabled for the same scale test (which is 30% faster than
> > > > without parallel).
> > > >
> > > > Test command: ovn-nbctl --print-wait-time --wait=sb sync
> > > >
> > > > Before:
> > > >
> > > > no parallel:
> > > > ovn-northd completion: 7807ms
> > > >
> > > > parallel:
> > > > ovn-northd completion: 41267ms
> > > >
> > > > After:
> > > >
> > > > no parallel: (no change)
> > > >
> > > > parallel:
> > > > ovn-northd completion: 5081ms
> > > > (8x faster than before, 30% faster than no parallel)
> > > >
> > > > Note: all the above tests are with dp-groups enabled.
> > > >
> > > > Signed-off-by: Han Zhou <hzhou at ovn.org>
> > > > ---
> > > > v1 -> v2: Addressed comments from Anton
> > > >      - Fixes the hmap size after each round of parallel flow building.
> > > >      - Refactored lflow_hash_lock and removed unnecessary functions to
> > > >        avoid clang warnings.
> > > >
> > > >   northd/northd.c | 140 ++++++++++++++++++++++++------------------------
> > > >   1 file changed, 70 insertions(+), 70 deletions(-)
> > > >
> > > > diff --git a/northd/northd.c b/northd/northd.c
> > > > index afd812700..88ab0c929 100644
> > > > --- a/northd/northd.c
> > > > +++ b/northd/northd.c
> > > > @@ -4319,41 +4319,44 @@ ovn_dp_group_add_with_reference(struct ovn_lflow *lflow_ref,
> > > >       return true;
> > > >   }
> > > >
> > > > -/* Adds a row with the specified contents to the Logical_Flow table.
> > > > - * Version to use with dp_groups + parallel - when locking is required.
> > > > - *
> > > > - * Assumptions:
> > > > - *
> > > > - * 1. A large proportion of the operations are lookups (reads).
> > > > - * 2. RW operations are a small proportion of overall adds.
> > > > - * 3. Most RW ops are not flow adds, but changes to the
> > > > - * od groups.
> > > > +/* The lflow_hash_lock is a mutex array that protects updates to the shared
> > > > + * lflow table across threads when parallel lflow build and dp-group are both
> > > > + * enabled. To avoid high contention between threads, a big array of mutexes
> > > > + * is used instead of just one. This is possible because when parallel build
> > > > + * is used we only use hmap_insert_fast() to update the hmap, which would not
> > > > + * touch the bucket array but only the list in a single bucket. We only need to
> > > > + * make sure that when adding lflows to the same hash bucket, the same lock is
> > > > + * used, so that no two threads can add to the bucket at the same time.  It is
> > > > + * ok that the same lock is used to protect multiple buckets, so a fixed-size
> > > > + * mutex array is used instead of a 1-1 mapping to the hash buckets. This
> > > > + * simplifies the implementation while effectively reducing lock contention,
> > > > + * because the chance that different threads contend for the same lock amongst
> > > > + * the big number of locks is very low. */
> > > > +#define LFLOW_HASH_LOCK_MASK 0xFFFF
> > > > +static struct ovs_mutex lflow_hash_locks[LFLOW_HASH_LOCK_MASK + 1];
> > > > +
> > > > +static void
> > > > +lflow_hash_lock_init(void)
> > > > +{
> > > > +    for (size_t i = 0; i < LFLOW_HASH_LOCK_MASK + 1; i++) {
> > > > +        ovs_mutex_init(&lflow_hash_locks[i]);
> > > > +    }
> > > > +}
> > > > +
> > > > +/* This thread-local var is used for parallel lflow building when dp-groups is
> > > > + * enabled. It maintains the number of lflows inserted by the current thread to
> > > > + * the shared lflow hmap in the current iteration. It is needed because the
> > > > + * lflow_hash_lock cannot protect concurrent updates of the hmap's size
> > > > + * (hmap->n) by different threads.
> > > >    *
> > > > - * Principles of operation:
> > > > - * 1. All accesses to the flow table are protected by a rwlock.
> > > > - * 2. By default, everyone grabs a rd lock so that multiple threads
> > > > - * can do lookups simultaneously.
> > > > - * 3. If a change to the lflow is needed, the rd lock is released and
> > > > - * a wr lock is acquired instead (the fact that POSIX does not have an
> > > > - * "upgrade" on locks is a major pain, but there is nothing we can do
> > > > - * - it's not available).
> > > > - * 4. WR lock operations in rd/wr locking have LOWER priority than RD.
> > > > - * That is by design and spec. So the code after a request for WR lock
> > > > - * may wait for a considerable amount of time until it is given a
> > > > - * change to run. That means that another thread may get there in the
> > > > - * meantime and change the data. Hence all wr operations MUST be coded
> > > > - * to ensure that they are not vulnerable to "someone pulled this from
> > > > - * under my feet". Re- reads, checks for presense, etc.
> > > > - * 5. Operations on the actual od_group hash map are protected by
> > > > - * per-flow locks. There is no need for these to be rd, mutex is more
> > > > - * appropriate. They are low contention as each protects only its flow
> > > > - * and only during modification which happen while holding a rd lock on
> > > > - * the flow table.
> > > > - */
> > > > -static struct ovs_rwlock flowtable_lock;
> > > > + * When all threads complete the tasks of an iteration, the counters of all the
> > > > + * threads are collected to fix the lflow hmap's size (by the function
> > > > + * fix_flow_map_size()).
> > > > + */
> > > > +static thread_local size_t thread_lflow_counter = 0;
> > > >
> > > >   /* Adds a row with the specified contents to the Logical_Flow table.
> > > > - * Version to use when locking is NOT required.
> > > > + * Version to use when hash bucket locking is NOT required.
> > > >    */
> > > >   static struct ovn_lflow *
> > > >   do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> > > > @@ -4361,7 +4364,6 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> > > >                    const char *match, const char *actions, const char *io_port,
> > > >                    const struct ovsdb_idl_row *stage_hint,
> > > >                    const char *where, const char *ctrl_meter)
> > > > -                 OVS_NO_THREAD_SAFETY_ANALYSIS
> > > >   {
> > > >
> > > >       struct ovn_lflow *old_lflow;
> > > > @@ -4390,14 +4392,14 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> > > >           hmap_insert(lflow_map, &lflow->hmap_node, hash);
> > > >       } else {
> > > >           hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
> > > > +        thread_lflow_counter++;
> > > >       }
> > > >       return lflow;
> > > >   }
> > > >
> > > >   /* Adds a row with the specified contents to the Logical_Flow table.
> > > > - * Version to use when locking IS required.
> > > > + * Version to use when hash bucket locking IS required.
> > > >    */
> > > > -
> > > >   static struct ovn_lflow *
> > > >   do_ovn_lflow_add_pd(struct hmap *lflow_map, struct ovn_datapath *od,
> > > >                       uint32_t hash, enum ovn_stage stage, uint16_t priority,
> > > > @@ -4405,44 +4407,16 @@ do_ovn_lflow_add_pd(struct hmap *lflow_map, struct ovn_datapath *od,
> > > >                       const char *io_port,
> > > >                       const struct ovsdb_idl_row *stage_hint,
> > > >                       const char *where, const char *ctrl_meter)
> > > > -                    OVS_NO_THREAD_SAFETY_ANALYSIS
> > > >   {
> > > >
> > > > -    struct ovn_lflow *old_lflow;
> > > >       struct ovn_lflow *lflow;
> > > > +    struct ovs_mutex *hash_lock =
> > > > +        &lflow_hash_locks[hash & lflow_map->mask & LFLOW_HASH_LOCK_MASK];
> > > >
> > > > -    /* Fast Path - try to amend an existing flow without asking
> > > > -     * for WR access to the whole flow table. Locking the actual
> > > > -     * hmapx for the particular flow's odg is low overhead as its
> > > > -     * contention is much lower.
> > > > -     */
> > > > -
> > > > -    ovs_rwlock_rdlock(&flowtable_lock);
> > > > -    old_lflow = ovn_lflow_find(lflow_map, NULL, stage, priority, match,
> > > > -                               actions, ctrl_meter, hash);
> > > > -    if (old_lflow) {
> > > > -        ovs_mutex_lock(&old_lflow->odg_lock);
> > > > -        hmapx_add(&old_lflow->od_group, od);
> > > > -        ovs_mutex_unlock(&old_lflow->odg_lock);
> > > > -    }
> > > > -    ovs_rwlock_unlock(&flowtable_lock);
> > > > -
> > > > -    if (old_lflow) {
> > > > -        return old_lflow;
> > > > -    }
> > > > -
> > > > -    ovs_rwlock_wrlock(&flowtable_lock);
> > > > -
> > > > -    /* We need to rerun the "if in flowtable" steps, because someone
> > > > -     * could have inserted it while we were waiting to acquire an
> > > > -     * wr lock. As we are now holding a wr lock on it nobody else is
> > > > -     * in the * "fast" portion of the code which is protected by the
> > > > -     * rwlock.
> > > > -     */
> > > > +    ovs_mutex_lock(hash_lock);
> > > >       lflow = do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
> > > > -                                 actions, io_port, stage_hint, where,
> > > > -                                 ctrl_meter);
> > > > -    ovs_rwlock_unlock(&flowtable_lock);
> > > > +                             actions, io_port, stage_hint, where, ctrl_meter);
> > > > +    ovs_mutex_unlock(hash_lock);
> > > >       return lflow;
> > > >   }
> > > >
> > > > @@ -12767,6 +12741,7 @@ struct lswitch_flow_build_info {
> > > >       char *svc_check_match;
> > > >       struct ds match;
> > > >       struct ds actions;
> > > > +    size_t thread_lflow_counter;
> > > >   };
> > > >
> > > >   /* Helper function to combine all lflow generation which is iterated by
> > > > @@ -12892,6 +12867,7 @@ build_lflows_thread(void *arg)
> > > >           if (stop_parallel_processing()) {
> > > >               return NULL;
> > > >           }
> > > > +        thread_lflow_counter = 0;
> > > >           if (lsi && workload) {
> > > >               /* Iterate over bucket ThreadID, ThreadID+size, ... */
> > > >               for (bnum = control->id;
> > > > @@ -12952,6 +12928,7 @@ build_lflows_thread(void *arg)
> > > >                   }
> > > >               }
> > > >           }
> > > > +        lsi->thread_lflow_counter = thread_lflow_counter;
> > > >           post_completed_work(control);
> > > >       }
> > > >       return NULL;
> > > > @@ -12994,6 +12971,26 @@ noop_callback(struct worker_pool *pool OVS_UNUSED,
> > > >       /* Do nothing */
> > > >   }
> > > >
> > > > +/* Fixes the hmap size (hmap->n) after parallel building the lflow_map when
> > > > + * dp-groups is enabled, because in that case all threads are updating the
> > > > + * global lflow hmap. Although the lflow_hash_lock prevents concurrent inserts
> > > > + * to the same hash bucket, the hmap->n is updated concurrently by all threads
> > > > + * and may not be accurate at the end of each iteration. This function collects
> > > > + * the thread-local lflow counters maintained by each thread and updates the
> > > > + * hmap size with the aggregated value. This function must be called
> > > > + * immediately after the worker threads complete the tasks in each iteration,
> > > > + * before any future operations on the lflow map. */
> > > > +static void
> > > > +fix_flow_map_size(struct hmap *lflow_map,
> > > > +                  struct lswitch_flow_build_info *lsiv,
> > > > +                  size_t n_lsiv)
> > > > +{
> > > > +    size_t total = 0;
> > > > +    for (size_t i = 0; i < n_lsiv; i++) {
> > > > +        total += lsiv[i].thread_lflow_counter;
> > > > +    }
> > > > +    lflow_map->n = total;
> > > > +}
> > > >
> > > >   static void
> > > >   build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> > > > @@ -13046,6 +13043,7 @@ build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> > > >               lsiv[index].lbs = lbs;
> > > >               lsiv[index].bfd_connections = bfd_connections;
> > > >               lsiv[index].svc_check_match = svc_check_match;
> > > > +            lsiv[index].thread_lflow_counter = 0;
> > > >               ds_init(&lsiv[index].match);
> > > >               ds_init(&lsiv[index].actions);
> > > >
> > > > @@ -13054,7 +13052,9 @@ build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> > > >
> > > >           /* Run thread pool. */
> > > >           if (use_logical_dp_groups) {
> > > > -            run_pool_callback(build_lflows_pool->pool, NULL, NULL, noop_callback);
> > > > +            run_pool_callback(build_lflows_pool->pool, NULL, NULL,
> > > > +                              noop_callback);
> > > > +            fix_flow_map_size(lflows, lsiv, build_lflows_pool->pool->size);
> > > >           } else {
> > > >               run_pool_hash(build_lflows_pool->pool, lflows, lflow_segs);
> > > >           }
> > > > @@ -13221,7 +13221,7 @@ build_lflows(struct northd_context *ctx, struct hmap *datapaths,
> > > >
> > > >       if (use_parallel_build && use_logical_dp_groups &&
> > > >           needs_parallel_init) {
> > > > -        ovs_rwlock_init(&flowtable_lock);
> > > > +        lflow_hash_lock_init();
> > > >           needs_parallel_init = false;
> > > >           /* Disable parallel build on first run with dp_groups
> > > >            * to determine the correct sizing of hashes. */
> > >
> > >
> > > +1
> > >
> > > As I said before - I would like to get to the bottom of why we get an
> > > improvement on ovn-heater with rwlock despite these numbers.
> > >
> > Yes, there are still unknowns regarding the difference in our test
> > results. It is possible that there are some differences in our test
> > topology which result in different proportions of lflow updates with DP
> > groups vs. with a single DP. However, this still doesn't explain why we
> > observed such different results for the same "make check-perf" test. Those
> > test cases have a bigger number of per-LSP flows, which all have single
> > DPs, so the lock contention should be high. My result shows that the
> > original parallel build spent 4x more time than without parallel, while in
> > yours, parallel was better than without. Just to confirm, I was always
> > using TESTSUITEFLAGS="--rebuild". Were you doing the same?
> >
> > > However, a fine-grained lock is always better than a global lock (even
> > > a rw one). So a definite +1
> > >
> > Thanks for the review! May I consider your +1 for the series as
> > "Acked-by"?
> >
> > I am not a committer, hence using +1 (good for me) instead of an official
> > Acked-by, which is counted by patchwork and AFAIK changes the patch state.
> >
> Ok. I believe there is no rule preventing a non-committer from Acking. In
> this case, since you are the original author of the parallel build, I think
> it is appropriate for you to give an Acked-by. But of course it is also fine
> if you don't want to :) In any case, I will wait for more reviews.

Acked-by: Numan Siddique <numans at ovn.org>

Numan

>
> > Brgds,
> >
> > Han
> >
> > > Brgds,
> > >
> > > --
> > > Anton R. Ivanov
> > > Cambridgegreys Limited. Registered in England. Company Number 10273661
> > > https://www.cambridgegreys.com/
> > >
> >
> > --
> > Anton R. Ivanov
> > Cambridgegreys Limited. Registered in England. Company Number 10273661
> > https://www.cambridgegreys.com/
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
