[ovs-dev] [PATCH ovn v2 3/3] northd.c: Optimize parallel build performance with hash based locks.

Han Zhou hzhou at ovn.org
Tue Oct 5 17:17:00 UTC 2021


On Tue, Oct 5, 2021 at 9:37 AM Numan Siddique <numans at ovn.org> wrote:
>
> On Mon, Oct 4, 2021 at 1:28 PM Han Zhou <hzhou at ovn.org> wrote:
> >
> > > On Mon, Oct 4, 2021 at 9:57 AM Anton Ivanov <anton.ivanov at cambridgegreys.com> wrote:
> > >
> > >
> > > On 04/10/2021 16:58, Han Zhou wrote:
> > >
> > >
> > >
> > > On Mon, Oct 4, 2021 at 2:31 AM Anton Ivanov <anton.ivanov at cambridgegreys.com> wrote:
> > > >
> > > >
> > > > On 03/10/2021 23:45, Han Zhou wrote:
> > > > > The current implementation of parallel build in northd with dp-groups
> > > > > enabled results in bad performance when the below assumption is not
> > > > > true:
> > > > >
> > > > >   * 3. Most RW ops are not flow adds, but changes to the
> > > > >   * od groups.
> > > > >
> > > > > In fact most (if not all) of the ovn port based flows don't share
> > > > > dp-groups, so the assumption doesn't hold in reality, and in a scale
> > > > > test environment with an ovn-k8s topology of 800 nodes, the parallel
> > > > > build shows 5 times more time spent for one northd iteration than
> > > > > without parallel build on a test machine with 24 cores (24 northd
> > > > > worker threads). This is because of lock contention on the global rw
> > > > > lock protecting the lflow hmap.
> > > > >
> > > > > This patch optimizes it by using an array of hash based locks instead
> > > > > of a single global lock. It is similar to the approach prior to the
> > > > > commit 8c980ce6, but with two major differences:
> > > > >
> > > > > 1) It uses a fixed length mutex array instead of the dynamic array of
> > > > >     the struct hashrow_locks. It is equally efficient considering the
> > > > >     low chance of contention in a large array of locks, but without
> > > > >     the burden of resizing every time the hmap size changes. The
> > > > >     uniqueness of the lock is guaranteed by combining the masks of
> > > > >     both the hmap and the mutex array.
> > > > >
> > > > > 2) It fixes the corrupted hmap size after each round of parallel flow
> > > > >     build. The hash based lock protects the list in each bucket, but
> > > > >     doesn't protect the hmap size. The patch uses thread-local
> > > > >     counters and aggregates them at the end of each iteration, which
> > > > >     is lock free. This approach has lower cost than atomically
> > > > >     incrementing a global counter.
> > > > >
> > > > > This patch ends up with an 8x speedup over the current parallel build
> > > > > with dp-groups enabled for the same scale test (which is 30% faster
> > > > > than without parallel).
> > > > >
> > > > > Test command: ovn-nbctl --print-wait-time --wait=sb sync
> > > > >
> > > > > Before:
> > > > >
> > > > > no parallel:
> > > > > ovn-northd completion: 7807ms
> > > > >
> > > > > parallel:
> > > > > ovn-northd completion: 41267ms
> > > > >
> > > > > After:
> > > > >
> > > > > no parallel: (no change)
> > > > >
> > > > > parallel:
> > > > > ovn-northd completion: 5081ms
> > > > > (8x faster than before, 30% faster than no parallel)
> > > > >
> > > > > Note: all the above tests are with dp-groups enabled.
> > > > >
> > > > > Signed-off-by: Han Zhou <hzhou at ovn.org>
> > > > > ---
> > > > > v1 -> v2: Addressed comments from Anton
> > > > >      - Fixed the hmap size after each round of parallel flow building.
> > > > >      - Refactored lflow_hash_lock and removed unnecessary functions
> > > > >        to avoid clang warnings.
> > > > >
> > > > >   northd/northd.c | 140 ++++++++++++++++++++++++------------------------
> > > > >   1 file changed, 70 insertions(+), 70 deletions(-)
> > > > >
> > > > > diff --git a/northd/northd.c b/northd/northd.c
> > > > > index afd812700..88ab0c929 100644
> > > > > --- a/northd/northd.c
> > > > > +++ b/northd/northd.c
> > > > > @@ -4319,41 +4319,44 @@ ovn_dp_group_add_with_reference(struct ovn_lflow *lflow_ref,
> > > > >       return true;
> > > > >   }
> > > > >
> > > > > -/* Adds a row with the specified contents to the Logical_Flow table.
> > > > > - * Version to use with dp_groups + parallel - when locking is required.
> > > > > - *
> > > > > - * Assumptions:
> > > > > - *
> > > > > - * 1. A large proportion of the operations are lookups (reads).
> > > > > - * 2. RW operations are a small proportion of overall adds.
> > > > > - * 3. Most RW ops are not flow adds, but changes to the
> > > > > - * od groups.
> > > > > +/* The lflow_hash_lock is a mutex array that protects updates to the shared
> > > > > + * lflow table across threads when parallel lflow build and dp-group are both
> > > > > + * enabled. To avoid high contention between threads, a big array of mutexes
> > > > > + * is used instead of just one. This is possible because when parallel build
> > > > > + * is used we only use hmap_insert_fast() to update the hmap, which would not
> > > > > + * touch the bucket array but only the list in a single bucket. We only need
> > > > > + * to make sure that when adding lflows to the same hash bucket, the same
> > > > > + * lock is used, so that no two threads can add to the bucket at the same
> > > > > + * time. It is ok for the same lock to protect multiple buckets, so a fixed
> > > > > + * sized mutex array is used instead of a 1-1 mapping to the hash buckets.
> > > > > + * This simplifies the implementation while effectively reducing lock
> > > > > + * contention, because the chance of different threads contending for the
> > > > > + * same lock amongst the big number of locks is very low. */
> > > > > +#define LFLOW_HASH_LOCK_MASK 0xFFFF
> > > > > +static struct ovs_mutex lflow_hash_locks[LFLOW_HASH_LOCK_MASK + 1];
> > > > > +
> > > > > +static void
> > > > > +lflow_hash_lock_init(void)
> > > > > +{
> > > > > +    for (size_t i = 0; i < LFLOW_HASH_LOCK_MASK + 1; i++) {
> > > > > +        ovs_mutex_init(&lflow_hash_locks[i]);
> > > > > +    }
> > > > > +}
> > > > > +
> > > > > +/* This thread-local var is used for parallel lflow building when dp-groups
> > > > > + * is enabled. It maintains the number of lflows inserted by the current
> > > > > + * thread to the shared lflow hmap in the current iteration. It is needed
> > > > > + * because the lflow_hash_lock cannot protect concurrent updates of the
> > > > > + * hmap's size (hmap->n) by different threads.
> > > > >    *
> > > > > - * Principles of operation:
> > > > > - * 1. All accesses to the flow table are protected by a rwlock.
> > > > > - * 2. By default, everyone grabs a rd lock so that multiple threads
> > > > > - * can do lookups simultaneously.
> > > > > - * 3. If a change to the lflow is needed, the rd lock is released and
> > > > > - * a wr lock is acquired instead (the fact that POSIX does not have
> > > > > - * an "upgrade" on locks is a major pain, but there is nothing we can
> > > > > - * do - it's not available).
> > > > > - * 4. WR lock operations in rd/wr locking have LOWER priority than
> > > > > - * RD. That is by design and spec. So the code after a request for WR
> > > > > - * lock may wait for a considerable amount of time until it is given
> > > > > - * a chance to run. That means that another thread may get there in
> > > > > - * the meantime and change the data. Hence all wr operations MUST be
> > > > > - * coded to ensure that they are not vulnerable to "someone pulled
> > > > > - * this from under my feet". Re-reads, checks for presence, etc.
> > > > > - * 5. Operations on the actual od_group hash map are protected by
> > > > > - * per-flow locks. There is no need for these to be rd, mutex is more
> > > > > - * appropriate. They are low contention as each protects only its
> > > > > - * flow and only during modification which happens while holding a rd
> > > > > - * lock on the flow table.
> > > > > - */
> > > > > -static struct ovs_rwlock flowtable_lock;
> > > > > + * When all threads complete the tasks of an iteration, the counters of all
> > > > > + * the threads are collected to fix the lflow hmap's size (by the function
> > > > > + * fix_flow_map_size()).
> > > > > + */
> > > > > +static thread_local size_t thread_lflow_counter = 0;
> > > > >
> > > > >   /* Adds a row with the specified contents to the Logical_Flow table.
> > > > > - * Version to use when locking is NOT required.
> > > > > + * Version to use when hash bucket locking is NOT required.
> > > > >    */
> > > > >   static struct ovn_lflow *
> > > > >   do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> > > > > @@ -4361,7 +4364,6 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> > > > >                    const char *match, const char *actions, const char *io_port,
> > > > >                    const struct ovsdb_idl_row *stage_hint,
> > > > >                    const char *where, const char *ctrl_meter)
> > > > > -                 OVS_NO_THREAD_SAFETY_ANALYSIS
> > > > >   {
> > > > >
> > > > >       struct ovn_lflow *old_lflow;
> > > > > @@ -4390,14 +4392,14 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct ovn_datapath *od,
> > > > >           hmap_insert(lflow_map, &lflow->hmap_node, hash);
> > > > >       } else {
> > > > >           hmap_insert_fast(lflow_map, &lflow->hmap_node, hash);
> > > > > +        thread_lflow_counter++;
> > > > >       }
> > > > >       return lflow;
> > > > >   }
> > > > >
> > > > >   /* Adds a row with the specified contents to the Logical_Flow table.
> > > > > - * Version to use when locking IS required.
> > > > > + * Version to use when hash bucket locking IS required.
> > > > >    */
> > > > > -
> > > > >   static struct ovn_lflow *
> > > > >   do_ovn_lflow_add_pd(struct hmap *lflow_map, struct ovn_datapath *od,
> > > > >                       uint32_t hash, enum ovn_stage stage, uint16_t priority,
> > > > > @@ -4405,44 +4407,16 @@ do_ovn_lflow_add_pd(struct hmap *lflow_map, struct ovn_datapath *od,
> > > > >                       const char *io_port,
> > > > >                       const struct ovsdb_idl_row *stage_hint,
> > > > >                       const char *where, const char *ctrl_meter)
> > > > > -                    OVS_NO_THREAD_SAFETY_ANALYSIS
> > > > >   {
> > > > >
> > > > > -    struct ovn_lflow *old_lflow;
> > > > >       struct ovn_lflow *lflow;
> > > > > +    struct ovs_mutex *hash_lock =
> > > > > +        &lflow_hash_locks[hash & lflow_map->mask & LFLOW_HASH_LOCK_MASK];
> > > > >
> > > > > -    /* Fast Path - try to amend an existing flow without asking
> > > > > -     * for WR access to the whole flow table. Locking the actual
> > > > > -     * hmapx for the particular flow's odg is low overhead as its
> > > > > -     * contention is much lower.
> > > > > -     */
> > > > > -
> > > > > -    ovs_rwlock_rdlock(&flowtable_lock);
> > > > > -    old_lflow = ovn_lflow_find(lflow_map, NULL, stage, priority, match,
> > > > > -                               actions, ctrl_meter, hash);
> > > > > -    if (old_lflow) {
> > > > > -        ovs_mutex_lock(&old_lflow->odg_lock);
> > > > > -        hmapx_add(&old_lflow->od_group, od);
> > > > > -        ovs_mutex_unlock(&old_lflow->odg_lock);
> > > > > -    }
> > > > > -    ovs_rwlock_unlock(&flowtable_lock);
> > > > > -
> > > > > -    if (old_lflow) {
> > > > > -        return old_lflow;
> > > > > -    }
> > > > > -
> > > > > -    ovs_rwlock_wrlock(&flowtable_lock);
> > > > > -
> > > > > -    /* We need to rerun the "if in flowtable" steps, because someone
> > > > > -     * could have inserted it while we were waiting to acquire an
> > > > > -     * wr lock. As we are now holding a wr lock on it nobody else is
> > > > > -     * in the "fast" portion of the code which is protected by the
> > > > > -     * rwlock.
> > > > > -     */
> > > > > +    ovs_mutex_lock(hash_lock);
> > > > >       lflow = do_ovn_lflow_add(lflow_map, od, hash, stage, priority, match,
> > > > > -                                 actions, io_port, stage_hint, where,
> > > > > -                                 ctrl_meter);
> > > > > -    ovs_rwlock_unlock(&flowtable_lock);
> > > > > +                             actions, io_port, stage_hint, where, ctrl_meter);
> > > > > +    ovs_mutex_unlock(hash_lock);
> > > > >       return lflow;
> > > > >   }
> > > > >
> > > > > @@ -12767,6 +12741,7 @@ struct lswitch_flow_build_info {
> > > > >       char *svc_check_match;
> > > > >       struct ds match;
> > > > >       struct ds actions;
> > > > > +    size_t thread_lflow_counter;
> > > > >   };
> > > > >
> > > > >   /* Helper function to combine all lflow generation which is iterated by
> > > > > @@ -12892,6 +12867,7 @@ build_lflows_thread(void *arg)
> > > > >           if (stop_parallel_processing()) {
> > > > >               return NULL;
> > > > >           }
> > > > > +        thread_lflow_counter = 0;
> > > > >           if (lsi && workload) {
> > > > >               /* Iterate over bucket ThreadID, ThreadID+size, ... */
> > > > >               for (bnum = control->id;
> > > > > @@ -12952,6 +12928,7 @@ build_lflows_thread(void *arg)
> > > > >                   }
> > > > >               }
> > > > >           }
> > > > > +        lsi->thread_lflow_counter = thread_lflow_counter;
> > > > >           post_completed_work(control);
> > > > >       }
> > > > >       return NULL;
> > > > > @@ -12994,6 +12971,26 @@ noop_callback(struct worker_pool *pool OVS_UNUSED,
> > > > >       /* Do nothing */
> > > > >   }
> > > > >
> > > > > +/* Fixes the hmap size (hmap->n) after parallel building of the lflow_map
> > > > > + * when dp-groups is enabled, because in that case all threads are updating
> > > > > + * the global lflow hmap. Although the lflow_hash_lock prevents concurrent
> > > > > + * inserts to the same hash bucket, the hmap->n is updated concurrently by
> > > > > + * all threads and may not be accurate at the end of each iteration. This
> > > > > + * function collects the thread-local lflow counters maintained by each
> > > > > + * thread and updates the hmap size with the aggregated value. This function
> > > > > + * must be called immediately after the worker threads complete the tasks in
> > > > > + * each iteration, before any further operations on the lflow map. */
> > > > > +static void
> > > > > +fix_flow_map_size(struct hmap *lflow_map,
> > > > > +                  struct lswitch_flow_build_info *lsiv,
> > > > > +                  size_t n_lsiv)
> > > > > +{
> > > > > +    size_t total = 0;
> > > > > +    for (size_t i = 0; i < n_lsiv; i++) {
> > > > > +        total += lsiv[i].thread_lflow_counter;
> > > > > +    }
> > > > > +    lflow_map->n = total;
> > > > > +}
> > > > >
> > > > >   static void
> > > > >   build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> > > > > @@ -13046,6 +13043,7 @@ build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> > > > >               lsiv[index].lbs = lbs;
> > > > >               lsiv[index].bfd_connections = bfd_connections;
> > > > >               lsiv[index].svc_check_match = svc_check_match;
> > > > > +            lsiv[index].thread_lflow_counter = 0;
> > > > >               ds_init(&lsiv[index].match);
> > > > >               ds_init(&lsiv[index].actions);
> > > > >
> > > > > @@ -13054,7 +13052,9 @@ build_lswitch_and_lrouter_flows(struct hmap *datapaths, struct hmap *ports,
> > > > >
> > > > >           /* Run thread pool. */
> > > > >           if (use_logical_dp_groups) {
> > > > > -            run_pool_callback(build_lflows_pool->pool, NULL, NULL, noop_callback);
> > > > > +            run_pool_callback(build_lflows_pool->pool, NULL, NULL,
> > > > > +                              noop_callback);
> > > > > +            fix_flow_map_size(lflows, lsiv, build_lflows_pool->pool->size);
> > > > >           } else {
> > > > >               run_pool_hash(build_lflows_pool->pool, lflows, lflow_segs);
> > > > >           }
> > > > > @@ -13221,7 +13221,7 @@ build_lflows(struct northd_context *ctx, struct hmap *datapaths,
> > > > >
> > > > >       if (use_parallel_build && use_logical_dp_groups &&
> > > > >           needs_parallel_init) {
> > > > > -        ovs_rwlock_init(&flowtable_lock);
> > > > > +        lflow_hash_lock_init();
> > > > >           needs_parallel_init = false;
> > > > >           /* Disable parallel build on first run with dp_groups
> > > > >            * to determine the correct sizing of hashes. */
> > > >
> > > >
> > > > +1
> > > >
> > > > As I said before - I would like to get to the bottom of why we get an
> > > > improvement on ovn-heater with rwlock despite these numbers.
> > > >
> > > Yes, there are still unknowns regarding the difference in our test
> > > results. It is possible that there are some differences in our test
> > > topology which result in different proportions of lflow updates with DP
> > > groups v.s. with a single DP. However, this still doesn't explain why we
> > > observed such different results for the same "make check-perf" test.
> > > Those test cases have a bigger number of per-LSP flows, which all have
> > > single DPs, so the lock contention should be high. My result shows that
> > > the original parallel build spent 4x more time than without parallel,
> > > while in yours with parallel was better than without. Just to confirm, I
> > > was always using TESTSUITEFLAGS="--rebuild". Were you doing the same?
> > >
> > > > However, a fine grained lock is always better than a global lock (even
> > > > a rw one). So a definite +1
> > > >
> > > Thanks for the review! May I consider your +1 for the series as
> > > "Acked-by"?
> > >
> > > I am not a committer, hence using +1 (good for me) instead of an
> > > official Acked-by, which is counted by patchwork and AFAIK changes the
> > > patch state.
> > >
> > Ok. I believe there is no rule preventing a non-committer from Acking. In
> > this case, since you are the original author of the parallel build, I
> > think it is appropriate for you to give Acked-by. But of course it is
> > also fine if you don't want to :) In any case, I will wait for more
> > reviews.
>
> Acked-by: Numan Siddique <numans at ovn.org>
>

Thanks Anton and Numan. I applied it to main branch.

> Numan
>
> >
> > > Brgds,
> > >
> > > Han
> > >
> > > > Brgds,
> > > >
> > > > --
> > > > Anton R. Ivanov
> > > > Cambridgegreys Limited. Registered in England. Company Number 10273661
> > > > https://www.cambridgegreys.com/
> > > >
> > >
> > > --
> > > Anton R. Ivanov
> > > Cambridgegreys Limited. Registered in England. Company Number 10273661
> > > https://www.cambridgegreys.com/
> > _______________________________________________
> > dev mailing list
> > dev at openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >

