[ovs-dev] [PATCH ovn] northd.c: Optimize parallel build performance with hash based locks.

Han Zhou hzhou at ovn.org
Sun Oct 3 23:06:37 UTC 2021


On Sat, Oct 2, 2021 at 12:26 AM Anton Ivanov <
anton.ivanov at cambridgegreys.com> wrote:
>
> On 01/10/2021 23:29, Han Zhou wrote:
> > The current implementation of parallel build in northd with dp-groups
> > enabled results in bad performance when the below assumption is not
> > true:
> >
> >   * 3. Most RW ops are not flow adds, but changes to the
> >   * od groups.
> Hi Han,
>
> Have a look at the ovn-parallel-hmap.h
>
> It has most of the primitives from my previous attempt to use this
> approach, you do not need to redefine them. It also has the relevant
> pragma to make CLANG shut up because it really dislikes taking a lock in
> a function and not releasing it.
>
> The sole difference is that the lock array is sized to current hash
> size, not to max-mask.
>

Hi Anton, thanks for the review and the pointers!

Yes, I could have used the hashrow_locks defined in the
ovn-parallel-hmap.h. However, as you also noticed that the major difference
of my approach is that it doesn't do 1-1 mapping between the locks and the
hash rows. This way we don't need to dynamically expand and reinitialize
the locks. Even though it is not 1-1 mapping, this will still be effective
even when the hmap size is far bigger than the lock array, because there
are a very small number of threads, and the chance two threads contending
the same lock among the big number of locks would be extremely small. Since
the code required for this is really simple (<20 lines) I kept it as is in
v2 rather than changing to the hashrow_locks.

> Also - when using this approach, using the stock hmap_insert_fast() will
> result in corrupting hmap_size (at the very least - I had a few cases
> where the whole cash-line was corrupt).
>
> As a result the hmap_resize() of lflows on the exit of the build
> procedure will either produce a bogus result (on a good day) or coredump
> (on a really bad day).
>
This is indeed a problem. Thanks for pointing out and I fixed it in v2
using thread-local counters. An alternative is to use a separate counter
that is updated using atomic increment operation, but it would still
introduce cache line lock contention (although better than a global mutex).

Another problem of this approach is that the encapsulation is not good.
Ideally a hmap version supporting parallel insertion should be implemented
in OVS code. Before that is implemented in OVS, I think we could compromise
a little by playing with the hmap size carefully. (We are already playing
with the hmap->mask anyway)

> It needs to be removed and something along the lines of the "rebuild the
> lflows while building dp groups and rehashing the single flows
> correctly" incorporated:
>
https://patchwork.ozlabs.org/project/ovn/patch/20210915162144.28369-1-anton.ivanov@cambridgegreys.com/
>
I am reviewing your patches, and it seems the problems are independent.

In addition, I added another fix in v2 for a risk of race condition when
updating the od_group. Please take a look at v2 here:
https://patchwork.ozlabs.org/project/ovn/list/?series=265223

Thanks,
Han

> Brgds,
>
>
> >
> > In fact most (if not all) of the ovn port based flows don't share
> > dp-groups, so the assumption doesn't hold in the reality, and in a scale
> > test environment with ovn-k8s topology of 800 nodes, the parallel build
> > shows 5 times more time spent for one northd iteration than without
> > parallel build on a test machine with 24 cores (24 northd worker
> > threads). This is because of lock contension on the global rw lock
> > protecting the lflow hmap.
> >
> > This patch optimizes it by using an array of bash based locks instead of
> > a single global lock. It ends up with 8 times speedup than the current
> > parallel build for the same scale test (which is 30% faster than
> > without parallel).
> >
> > Test command: ovn-nbctl --print-wait-time --wait=sb sync
> >
> > Before:
> >
> > no parallel:
> > ovn-northd completion: 7807ms
> >
> > parallel:
> > ovn-northd completion: 41267ms
> >
> > After:
> >
> > no parallel: (no change)
> >
> > parallel:
> > ovn-northd completion: 5081ms
> > (8x faster than before, 30% faster than no parallel)
> >
> > Note: all the above tests are with dp-groups enabled)
> >
> > Signed-off-by: Han Zhou <hzhou at ovn.org>
> > ---
> >   northd/northd.c | 110 ++++++++++++++++++------------------------------
> >   1 file changed, 42 insertions(+), 68 deletions(-)
> >
> > diff --git a/northd/northd.c b/northd/northd.c
> > index cf2467fe1..f6706e0b3 100644
> > --- a/northd/northd.c
> > +++ b/northd/northd.c
> > @@ -4299,39 +4299,6 @@ ovn_lflow_init(struct ovn_lflow *lflow, struct
ovn_datapath *od,
> >       }
> >   }
> >
> > -/* Adds a row with the specified contents to the Logical_Flow table.
> > - * Version to use with dp_groups + parallel - when locking is required.
> > - *
> > - * Assumptions:
> > - *
> > - * 1. A large proportion of the operations are lookups (reads).
> > - * 2. RW operations are a small proportion of overall adds.
> > - * 3. Most RW ops are not flow adds, but changes to the
> > - * od groups.
> > - *
> > - * Principles of operation:
> > - * 1. All accesses to the flow table are protected by a rwlock.
> > - * 2. By default, everyone grabs a rd lock so that multiple threads
> > - * can do lookups simultaneously.
> > - * 3. If a change to the lflow is needed, the rd lock is released and
> > - * a wr lock is acquired instead (the fact that POSIX does not have an
> > - * "upgrade" on locks is a major pain, but there is nothing we can do
> > - * - it's not available).
> > - * 4. WR lock operations in rd/wr locking have LOWER priority than RD.
> > - * That is by design and spec. So the code after a request for WR lock
> > - * may wait for a considerable amount of time until it is given a
> > - * change to run. That means that another thread may get there in the
> > - * meantime and change the data. Hence all wr operations MUST be coded
> > - * to ensure that they are not vulnerable to "someone pulled this from
> > - * under my feet". Re- reads, checks for presense, etc.
> > - * 5. Operations on the actual od_group hash map are protected by
> > - * per-flow locks. There is no need for these to be rd, mutex is more
> > - * appropriate. They are low contention as each protects only its flow
> > - * and only during modification which happen while holding a rd lock on
> > - * the flow table.
> > - */
> > -static struct ovs_rwlock flowtable_lock;
> > -
> >   /* Adds a row with the specified contents to the Logical_Flow table.
> >    * Version to use when locking is NOT required.
> >    */
> > @@ -4374,10 +4341,47 @@ do_ovn_lflow_add(struct hmap *lflow_map, struct
ovn_datapath *od,
> >       return lflow;
> >   }
> >
> > +/* The lflow_hash_lock is a mutex array that protects updates to the
shared
> > + * lflow table across threads when parallel lflow build and dp-group
are both
> > + * enabled. To avoid high contention between threads, a big array of
mutexes
> > + * are used instead of just one. This is possible because when
parallel build
> > + * is used we only use hmap_insert_fast() to update the hmap, which
would not
> > + * touch the bucket array but only the list in a single bucket. We
only need to
> > + * make sure that when adding lflows to the same hash bucket, the same
lock is
> > + * used, so that no two threads can add to the bucket at the same
time.  It is
> > + * ok that the same lock is used to protect multiple buckets, so a
fixed sized
> > + * mutex array is used instead of 1-1 mapping to the hash buckets. This
> > + * simplies the implementation while effectively reduces lock
contention
> > + * because the chance that different threads contending the same lock
amongst
> > + * the big number of locks is very low. */
> > +#define LFLOW_HASH_LOCK_MASK 0xFFFF
> > +static struct ovs_mutex lflow_hash_locks[LFLOW_HASH_LOCK_MASK + 1];
> > +
> > +static void
> > +lflow_hash_lock_init(void)
> > +{
> > +    for (size_t i = 0; i < LFLOW_HASH_LOCK_MASK + 1; i++) {
> > +        ovs_mutex_init(&lflow_hash_locks[i]);
> > +    }
> > +}
> > +
> > +static void
> > +lflow_hash_lock(uint32_t hash, struct hmap *lflow_map)
> > +{
> > +    size_t index = hash & lflow_map->mask & LFLOW_HASH_LOCK_MASK;
> > +    ovs_mutex_lock(&lflow_hash_locks[index]);
> > +}
> > +
> > +static void
> > +lflow_hash_unlock(uint32_t hash, struct hmap *lflow_map)
> > +{
> > +    size_t index = hash & lflow_map->mask & LFLOW_HASH_LOCK_MASK;
> > +    ovs_mutex_unlock(&lflow_hash_locks[index]);
> > +}
> > +
> >   /* Adds a row with the specified contents to the Logical_Flow table.
> >    * Version to use when locking IS required.
> >    */
> > -
> >   static struct ovn_lflow *
> >   do_ovn_lflow_add_pd(struct hmap *lflow_map, struct ovn_datapath *od,
> >                       uint32_t hash, enum ovn_stage stage, uint16_t
priority,
> > @@ -4388,41 +4392,11 @@ do_ovn_lflow_add_pd(struct hmap *lflow_map,
struct ovn_datapath *od,
> >                       OVS_NO_THREAD_SAFETY_ANALYSIS
> >   {
> >
> > -    struct ovn_lflow *old_lflow;
> >       struct ovn_lflow *lflow;
> > -
> > -    /* Fast Path - try to amend an existing flow without asking
> > -     * for WR access to the whole flow table. Locking the actual
> > -     * hmapx for the particular flow's odg is low overhead as its
> > -     * contention is much lower.
> > -     */
> > -
> > -    ovs_rwlock_rdlock(&flowtable_lock);
> > -    old_lflow = ovn_lflow_find(lflow_map, NULL, stage, priority, match,
> > -                               actions, ctrl_meter, hash);
> > -    if (old_lflow) {
> > -        ovs_mutex_lock(&old_lflow->odg_lock);
> > -        hmapx_add(&old_lflow->od_group, od);
> > -        ovs_mutex_unlock(&old_lflow->odg_lock);
> > -    }
> > -    ovs_rwlock_unlock(&flowtable_lock);
> > -
> > -    if (old_lflow) {
> > -        return old_lflow;
> > -    }
> > -
> > -    ovs_rwlock_wrlock(&flowtable_lock);
> > -
> > -    /* We need to rerun the "if in flowtable" steps, because someone
> > -     * could have inserted it while we were waiting to acquire an
> > -     * wr lock. As we are now holding a wr lock on it nobody else is
> > -     * in the * "fast" portion of the code which is protected by the
> > -     * rwlock.
> > -     */
> > +    lflow_hash_lock(hash, lflow_map);
> >       lflow = do_ovn_lflow_add(lflow_map, od, hash, stage, priority,
match,
> > -                                 actions, io_port, stage_hint, where,
> > -                                 ctrl_meter);
> > -    ovs_rwlock_unlock(&flowtable_lock);
> > +                             actions, io_port, stage_hint, where,
ctrl_meter);
> > +    lflow_hash_unlock(hash, lflow_map);
> >       return lflow;
> >   }
> >
> > @@ -13221,7 +13195,7 @@ build_lflows(struct northd_context *ctx, struct
hmap *datapaths,
> >
> >       if (use_parallel_build && use_logical_dp_groups &&
> >           needs_parallel_init) {
> > -        ovs_rwlock_init(&flowtable_lock);
> > +        lflow_hash_lock_init();
> >           needs_parallel_init = false;
> >           /* Disable parallel build on first run with dp_groups
> >            * to determine the correct sizing of hashes. */
>
>
> --
> Anton R. Ivanov
> Cambridgegreys Limited. Registered in England. Company Number 10273661
> https://www.cambridgegreys.com/
>


More information about the dev mailing list