[ovs-dev] [PATCH v3 12/28] mpsc-queue: Module for lock-free message passing

Gaëtan Rivet grive at u256.net
Sat May 1 15:56:21 UTC 2021


On Sat, May 1, 2021, at 15:20, Gaëtan Rivet wrote:
> On Fri, Apr 30, 2021, at 17:24, Ilya Maximets wrote:
> > On 4/25/21 1:55 PM, Gaetan Rivet wrote:
> > > Add a lockless multi-producer/single-consumer (MPSC), linked-list based,
> > > intrusive, unbounded queue that does not require deferred memory
> > > management.
> > > 
> > > The queue is an implementation of the structure described by Dmitry
> > > Vyukov[1].  It adds a slightly more explicit API explaining the proper
> > > use of the queue.
> > > Alternatives were considered such as a Treiber Stack [2] or a
> > > Michael-Scott queue [3], but this one is faster, simpler and scalable.
> > > 
> > > [1]: http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue
> > > 
> > > [2]: R. K. Treiber. Systems programming: Coping with parallelism.
> > >      Technical Report RJ 5118, IBM Almaden Research Center, April 1986.
> > > 
> > > [3]: M. M. Michael, Simple, Fast, and Practical Non-Blocking and Blocking
> > >      Concurrent Queue Algorithms.
> > >      https://www.cs.rochester.edu/research/synchronization/pseudocode/queues.html
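For readers who don't want to follow [1], the core of Vyukov's intrusive MPSC queue fits in a few dozen lines. The sketch below is a condensed illustration using C11 atomics with hypothetical names; it is not the code from this patch, which uses OVS's atomic wrappers and explicit memory ordering:

```c
/* Condensed sketch of Vyukov's intrusive MPSC queue (C11 atomics).
 * Illustrative only; names and API differ from the OVS patch. */
#include <stdatomic.h>
#include <stddef.h>

struct mpscq_node {
    _Atomic(struct mpscq_node *) next;
};

struct mpscq {
    _Atomic(struct mpscq_node *) head; /* Producers swap themselves in here. */
    struct mpscq_node *tail;           /* Touched by the single consumer only. */
    struct mpscq_node stub;            /* Sentinel, never handed back to users. */
};

static void
mpscq_init(struct mpscq *q)
{
    atomic_store(&q->stub.next, NULL);
    atomic_store(&q->head, &q->stub);
    q->tail = &q->stub;
}

/* Multi-producer push: wait-free, a single atomic exchange. */
static void
mpscq_push(struct mpscq *q, struct mpscq_node *node)
{
    struct mpscq_node *prev;

    atomic_store(&node->next, NULL);
    prev = atomic_exchange(&q->head, node);
    /* Brief window here: the list is disconnected until 'next' is set. */
    atomic_store(&prev->next, node);
}

/* Single-consumer pop.  Returns NULL when empty, or when a producer is
 * inside the push window above (the caller should simply retry). */
static struct mpscq_node *
mpscq_pop(struct mpscq *q)
{
    struct mpscq_node *tail = q->tail;
    struct mpscq_node *next = atomic_load(&tail->next);

    if (tail == &q->stub) {
        if (!next) {
            return NULL;        /* Empty. */
        }
        q->tail = next;         /* Skip over the stub. */
        tail = next;
        next = atomic_load(&tail->next);
    }
    if (next) {
        q->tail = next;
        return tail;
    }
    if (tail != atomic_load(&q->head)) {
        return NULL;            /* Producer mid-push: retry later. */
    }
    /* Re-insert the stub so 'tail' never dangles, then try once more. */
    mpscq_push(q, &q->stub);
    next = atomic_load(&tail->next);
    if (next) {
        q->tail = next;
        return tail;
    }
    return NULL;
}
```

Note that pop() can observe a transiently "broken" queue while a producer sits between the exchange and the store of `next`; the structure never blocks, it just asks the consumer to retry, which is why no deferred memory management is needed.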
> > > 
> > > The queue is designed to improve performance in this specific MPSC
> > > setup.  A benchmark accompanies the unit tests to measure the
> > > difference in this configuration.  A single reader thread polls the
> > > queue while N writers enqueue elements as fast as possible.  The
> > > mpsc-queue is compared against the regular ovs-list as well as the
> > > guarded list.  The latter usually offers a slight improvement by
> > > batching the element removal; however, the mpsc-queue is faster.
> > > 
> > > The average shown is of each producer thread's time:
> > > 
> > >    $ ./tests/ovstest test-mpsc-queue benchmark 3000000 1
> > >    Benchmarking n=3000000 on 1 + 1 threads.
> > >     type\thread:  Reader      1    Avg
> > >      mpsc-queue:     161    161    161 ms
> > >            list:     803    803    803 ms
> > >    guarded list:     665    665    665 ms
> > > 
> > >    $ ./tests/ovstest test-mpsc-queue benchmark 3000000 2
> > >    Benchmarking n=3000000 on 1 + 2 threads.
> > >     type\thread:  Reader      1      2    Avg
> > >      mpsc-queue:     102    101     97     99 ms
> > >            list:     246    212    246    229 ms
> > >    guarded list:     264    263    214    238 ms
> > > 
> > >    $ ./tests/ovstest test-mpsc-queue benchmark 3000000 3
> > >    Benchmarking n=3000000 on 1 + 3 threads.
> > >     type\thread:  Reader      1      2      3    Avg
> > >      mpsc-queue:      92     91     92     91     91 ms
> > >            list:     520    517    515    520    517 ms
> > >    guarded list:     405    395    401    404    400 ms
> > > 
> > >    $ ./tests/ovstest test-mpsc-queue benchmark 3000000 4
> > >    Benchmarking n=3000000 on 1 + 4 threads.
> > >     type\thread:  Reader      1      2      3      4    Avg
> > >      mpsc-queue:      77     73     73     77     75     74 ms
> > >            list:     371    359    361    287    370    344 ms
> > >    guarded list:     389    388    359    363    357    366 ms
> > 
> > Hi, Gaetan.
> > 
> > This is an interesting implementation.  However, I spent a few hours
> > benchmarking it on my machine, and I can say that the benchmark itself
> > has a few issues:
> > 
> > 1. The benchmark needs to control which cores the threads run on,
> >    otherwise results are highly inconsistent, e.g. if several threads
> >    are scheduled on the same core.  Results also differ depending on
> >    whether threads are scheduled on sibling cores or on separate
> >    physical cores.
> > 
> > 2. The ovs-list test is not fair compared with the test for mpsc.
> >    In the ovs-list test, the reader blocks the writers for the whole
> >    loop, which also includes some calculations inside.  I see your
> >    point that taking a mutex for every pop() operation would likely
> >    destroy the performance, but you're missing the fact that a
> >    spinlock could be used instead of a mutex.
> > 
> > What I did was replace the mutex with a spinlock, set thread affinity
> > to separate physical cores, and perform a series of tests.  Every test
> > variant was executed 10 times, and I calculated the average, min and
> > max of the 'Avg' column from these runs.
> > Since spinlocks are not fair, they sometimes perform badly under high
> > contention.  For that reason I also included a 90% average that cuts
> > off the one worst result.
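For clarity, the aggregation described above (average, min, max, plus a "90% average" that drops the single worst run) can be sketched as below. This is purely illustrative, not code from the patch or the test suite:

```c
/* Illustrative aggregation of per-run timings (ms): average, min, max,
 * and a "90% average" that excludes the one worst (highest) run. */
#include <stddef.h>

struct run_stats {
    double avg;
    double min;
    double max;
    double avg90;   /* Average with the single worst run cut off. */
};

static struct run_stats
summarize(const double *ms, size_t n)
{
    struct run_stats s = { 0.0, ms[0], ms[0], 0.0 };
    double sum = 0.0;
    size_t i;

    for (i = 0; i < n; i++) {
        sum += ms[i];
        if (ms[i] < s.min) {
            s.min = ms[i];
        }
        if (ms[i] > s.max) {
            s.max = ms[i];
        }
    }
    s.avg = sum / n;
    /* "90% average": with 10 runs, dropping the worst keeps 9 of 10. */
    s.avg90 = (sum - s.max) / (n - 1);
    return s;
}
```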
> > 
> > Results:
> > 
> > ./tests/ovstest test-mpsc-queue benchmark 3000000 1
> > mpsc-queue: avg 183.5  | max 200  | min 169 | 90% avg 181
> >  spin+list: avg 89.2   | max 142  | min 64  | 90% avg 89
> > 
> > ./tests/ovstest test-mpsc-queue benchmark 3000000 2
> > mpsc-queue: avg 83.3   | max 90   | min 79  | 90% avg 83
> >  spin+list: avg 141.8  | max 675  | min 43  | 90% avg 83
> > 
> > ./tests/ovstest test-mpsc-queue benchmark 3000000 3
> > mpsc-queue: avg 75.5   | max 80   | min 70  | 90% avg 75
> >  spin+list: avg 96.2   | max 166  | min 63  | 90% avg 88
> > 
> > We can see that in this setup list+spinlock absolutely crushes the
> > mpsc in the single reader + single writer case, which is the most
> > likely case for a non-synthetic workload.  It's on par with mpsc
> > in the 1+2 case and also very similar in 1+3.  I didn't have 5
> > physical cores to test 1+4.
> > 
> > I also ran a test without contention (all reads after join), and
> > there is no noticeable difference between the implementations in
> > this case:
> > 
> > ./tests/ovstest test-mpsc-queue benchmark 3000000 1 - without contention
> > mpsc-queue: avg 60.9   | max 75   | min 42  | 90% avg 59
> >  spin+list: avg 60.5   | max 73   | min 43  | 90% avg 59
> >    guarded: avg 79.7   | max 95   | min 61  | 90% avg 78
> > 
> > Results are much more flaky when threads are scheduled on sibling
> > cores, but spinlock+list is still not too far from mpsc even then.
> > 
> > So, the question is: do we really need this new data structure?
> > spinlock + ovs-list shows very good results and should work just fine,
> > especially in non-synthetic cases, where contention between more than
> > 2 threads is very unlikely (assuming that critical sections are short).
> > 
> > In OVS these data structures will be used from PMD threads that are
> > guaranteed to run on different cores.  Non-PMD threads are running
> > on a different set of cores too, so there should be no problems with
> > contention between threads running on the same core.
> > 
> > Of course, spinlocks should be used wisely, i.e. we should not hold
> > them longer than necessary.  But if the code is already written for
> > mpsc, it should not be a problem to convert it to spinlock + list
> > without needing to hold the lock for longer than one push() or pop().
> > 
> > Any thoughts?
> 
> Hello Ilya,
> 
> You raise valid points, and performance work is usually nuanced.
> Thanks for taking a thorough look into it.  Using a list + spinlock
> would reduce the amount of code, so it is worth examining.
> 
> Getting the benchmark closer to reality:
> 
> The first application of the queue is the offload management case.
> 
> The benchmark departs from the actual workload by having all producers
> enqueue equally.  I'm not sure how the number of offloads inserted by
> the PMDs compares to the number deleted by revalidators.  Maybe the
> producers in the benchmark should be asymmetrical?
> 
> I don't think that the 1+1 test is the least synthetic.  Multiple PMDs
> can be used, and even if only 1 PMD is used, a number of revalidators
> are configured by default (7 I think?).  Maybe they are not constantly
> working in a normal setup?  But I don't think that multiple PMDs or
> active revalidators are such an edge case that they should be ignored.
> 
> Thread affinity:
> 
> Only PMDs have dedicated cores; the offload thread and revalidators do
> not.  I think this is an important point to keep in mind when
> considering the spinlock, as it is unfair.
> 
> If I run the 1+1 test with all producers and the reader affined to
> their own cores, I get results closer to yours, though with slightly
> better numbers for mpsc-queue (I'm guessing the benchmark is not
> isolated enough on our machines and still depends on the general
> load):
> 
> $ ./mpsc-stats.sh  1000000 1
> 20 times './tests/ovstest test-mpsc-queue benchmark  1000000 1':
>    mpsc-queue reader: avg   70.7 | stdev    3.8 | max    75 | min    61
>   mpsc-queue writers: avg   56.2 | stdev    3.4 | max    61 | min    48
>     spin+list reader: avg   49.0 | stdev    7.5 | max    65 | min    38
>    spin+list writers: avg   30.9 | stdev    8.6 | max    51 | min    21
>  guarded list reader: avg  251.6 | stdev   62.7 | max   369 | min   173
> guarded list writers: avg  237.8 | stdev   62.3 | max   351 | min   161
> 

Mmmh, of course I forgot to go back to 3M elements for a proper comparison.
Here is the same run with all threads affined to isolated cores:

20 times 'taskset -c 4,5,6,7 ./tests/ovstest test-mpsc-queue benchmark  3000000 1':
   mpsc-queue reader: avg  185.4 | stdev    6.3 | max   202 | min   177
  mpsc-queue writers: avg  168.9 | stdev    6.7 | max   182 | min   157
    spin+list reader: avg  109.8 | stdev   16.8 | max   161 | min    94
   spin+list writers: avg   81.5 | stdev   18.4 | max   138 | min    64
 guarded list reader: avg  711.0 | stdev  167.4 | max  1004 | min   504
guarded list writers: avg  693.5 | stdev  167.5 | max   987 | min   484
$ ./mpsc-stats.sh 3000000 2
20 times 'taskset -c 4,5,6,7 ./tests/ovstest test-mpsc-queue benchmark  3000000 2':
   mpsc-queue reader: avg  129.6 | stdev    9.5 | max   153 | min   114
  mpsc-queue writers: avg  112.1 | stdev    8.5 | max   132 | min    99
    spin+list reader: avg  198.6 | stdev   27.6 | max   237 | min   127
   spin+list writers: avg  160.6 | stdev   30.5 | max   208 | min    84
 guarded list reader: avg  247.5 | stdev   13.2 | max   268 | min   223
guarded list writers: avg  222.4 | stdev   12.7 | max   250 | min   203
$ ./mpsc-stats.sh 3000000 3
20 times 'taskset -c 4,5,6,7 ./tests/ovstest test-mpsc-queue benchmark  3000000 3':
   mpsc-queue reader: avg  120.9 | stdev   16.4 | max   155 | min   103
  mpsc-queue writers: avg  102.4 | stdev   17.7 | max   139 | min    87
    spin+list reader: avg  331.5 | stdev   97.2 | max   718 | min   257
   spin+list writers: avg  281.7 | stdev   65.3 | max   485 | min   200
 guarded list reader: avg  544.9 | stdev   54.3 | max   617 | min   459
guarded list writers: avg  523.5 | stdev   53.6 | max   603 | min   433

However, beyond the scale it does not change the picture: the issue
highlighted below with affinity is the same.

> If I set the affinity of a single producer, however (not the reader
> nor the other producers), I get surprising results:
> 
> $ ./mpsc-stats.sh  1000000 1
> 20 times './tests/ovstest test-mpsc-queue benchmark  1000000 1':
>    mpsc-queue reader: avg   85.6 | stdev   29.1 | max   111 | min    24
>   mpsc-queue writers: avg   84.2 | stdev   30.8 | max   108 | min    24
>     spin+list reader: avg 1730.1 | stdev 4403.4 | max 19418 | min    40
>    spin+list writers: avg 1726.2 | stdev 4404.8 | max 19418 | min    35
>  guarded list reader: avg  470.0 | stdev  147.7 | max   710 | min   204
> guarded list writers: avg  469.9 | stdev  147.7 | max   710 | min   204
> 
> If I profile some of the worst runs in this configuration, I can see
> 84% of the time spent waiting on the spinlock.  The affinity of the
> producer on core 0 has an adverse effect.  If I affine it to core 1
> instead, I get results that are still unstable, but better:
> 
> $ ./mpsc-stats.sh  1000000 1
> 20 times './tests/ovstest test-mpsc-queue benchmark  1000000 1':
>    mpsc-queue reader: avg   48.8 | stdev   15.1 | max    60 | min    13
>   mpsc-queue writers: avg   48.8 | stdev   15.1 | max    60 | min    13
>     spin+list reader: avg   62.3 | stdev   81.0 | max   328 | min    23
>    spin+list writers: avg   59.0 | stdev   82.4 | max   328 | min    20
>  guarded list reader: avg  271.5 | stdev   46.9 | max   337 | min   190
> guarded list writers: avg  271.3 | stdev   46.9 | max   337 | min   190
> $ ./mpsc-stats.sh  1000000 1
> 20 times './tests/ovstest test-mpsc-queue benchmark  1000000 1':
>    mpsc-queue reader: avg   47.1 | stdev   15.8 | max    59 | min    13
>   mpsc-queue writers: avg   45.8 | stdev   16.9 | max    59 | min    13
>     spin+list reader: avg  172.4 | stdev  467.8 | max  2100 | min    23
>    spin+list writers: avg  168.9 | stdev  468.8 | max  2100 | min    20
>  guarded list reader: avg  260.0 | stdev   68.5 | max   362 | min   129
> guarded list writers: avg  259.9 | stdev   68.5 | max   362 | min   129
> 
> Once we use more than 1+1 threads it improves:
> 
> $ ./mpsc-stats.sh  1000000 2
> 20 times './tests/ovstest test-mpsc-queue benchmark  1000000 2':
>    mpsc-queue reader: avg   36.7 | stdev    5.3 | max    51 | min    27
>   mpsc-queue writers: avg   31.5 | stdev    4.9 | max    39 | min    23
>     spin+list reader: avg   49.0 | stdev   15.4 | max    73 | min    26
>    spin+list writers: avg   40.2 | stdev   15.3 | max    65 | min    16
>  guarded list reader: avg   77.9 | stdev    6.3 | max    95 | min    69
> guarded list writers: avg   73.1 | stdev    5.6 | max    87 | min    63
> 
> $ ./mpsc-stats.sh  1000000 3
> 20 times './tests/ovstest test-mpsc-queue benchmark  1000000 3':
>    mpsc-queue reader: avg   31.1 | stdev    6.0 | max    42 | min    24
>   mpsc-queue writers: avg   28.4 | stdev    5.9 | max    41 | min    19
>     spin+list reader: avg   82.2 | stdev   18.9 | max   127 | min    46
>    spin+list writers: avg   70.8 | stdev   19.5 | max   119 | min    36
>  guarded list reader: avg  162.2 | stdev   15.7 | max   189 | min   132
> guarded list writers: avg  158.6 | stdev   16.2 | max   185 | min   126
> 
> Avoiding additional code would be nice, but spinlocks can have
> surprising effects in some configurations (or so it seems).
> 
> The spinlock reduces stability, and that could be hard to pinpoint in
> the larger system.  In future work to improve latency in the datapath,
> it will add noise.
> 
> The exponential backoff will work just as well in the offload thread, so
> if we decide to go with list + spinlock it's alright.  My vote still goes
> to the mpsc-queue, as it is more predictable, with honorable results in
> the 1+1 case.
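As a hypothetical illustration of such an exponential backoff (this is not code from this series): a consumer that doubles its idle wait after each empty poll, up to a cap, and resets after a successful dequeue, keeps both reactivity under load and low CPU use when idle:

```c
/* Hypothetical exponential-backoff helper for a polling consumer.
 * After each empty poll the wait doubles up to 'max_us'; a successful
 * poll resets it to 'min_us'.  Illustrative sketch only. */

struct backoff {
    unsigned int cur_us;
    unsigned int min_us;
    unsigned int max_us;
};

static void
backoff_init(struct backoff *b, unsigned int min_us, unsigned int max_us)
{
    b->cur_us = min_us;
    b->min_us = min_us;
    b->max_us = max_us;
}

/* Call after an empty poll; returns how long to sleep before retrying,
 * then doubles the next wait, capped at 'max_us'. */
static unsigned int
backoff_next(struct backoff *b)
{
    unsigned int delay = b->cur_us;

    if (b->cur_us < b->max_us) {
        b->cur_us *= 2;
        if (b->cur_us > b->max_us) {
            b->cur_us = b->max_us;
        }
    }
    return delay;
}

/* Call after a successful poll to restore full reactivity. */
static void
backoff_reset(struct backoff *b)
{
    b->cur_us = b->min_us;
}
```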
> 
> To decide whether we need a new structure, I think we should look at
> the overall architecture.  OVS-DPDK runs a mix of datapath threads on
> affined cores and non-affined helper threads.  Using spinlocks in this
> context might be an issue.  I think it makes sense to have ways to
> avoid them if needed.
> 
> Gaetan
> 
> PS: here is the diff I used to generate the results, including the script
> for the stats:
> 
> ---
> diff --git a/mpsc-stats.sh b/mpsc-stats.sh
> new file mode 100755
> index 000000000..1ee177ede
> --- /dev/null
> +++ b/mpsc-stats.sh
> @@ -0,0 +1,80 @@
> +#!/usr/bin/env sh
> +
> +N_ELEM=${1:-1000000}
> +N_CORE=${2:-1}
> +N_RUN=${3:-20}
> +
> +#BIN="taskset -c 0,1,2,3 ./tests/ovstest test-mpsc-queue benchmark "
> +BIN="./tests/ovstest test-mpsc-queue benchmark "
> +CMD="$BIN $N_ELEM $N_CORE"
> +CMD_fast="$BIN 1 1"
> +
> +xc() { python -c "import math; print(float($*))"; }
> +join_by() {( IFS="$1"; shift; echo "$*"; )}
> +
> +stdev() {(
> +    m=$1; shift;
> +    sum=0
> +    for v in $*; do
> +        sum=$(xc "$sum + pow($v - $m, 2)")
> +    done
> +    echo $(xc "math.sqrt($sum / $#)")
> +)}
> +mean() { echo $(xc "($(join_by + $*)) / $#"); }
> +
> +# $1: name field max width
> +# $2: stat name
> +# $3-: values
> +print_stats() {(
> +    len=$1; shift;
> +    name=$1; shift;
> +    values="$*"
> +    m=$(mean $values)
> +    sd=$(stdev $m $values)
> +    printf "%*s: avg %6.1f | stdev %6.1f | max %5.0f | min %5.0f\n" \
> +        "$len" "$name" "$m" "$sd" \
> +        "$(xc "max($(join_by , $values))")" \
> +        "$(xc "min($(join_by , $values))")"
> +)}
> +
> +tmp=$(mktemp)
> +$CMD_fast | grep -v 'Benchmarking\|Reader' | \
> +while read line; do
> +    name="$(echo $line | cut -d: -f1)"
> +    printf "%s reader\t%s writers\t" "$name" "$name"
> +    #printf "%s writers\t" "$name"
> +done >> $tmp
> +printf "\n" >> $tmp
> +
> +echo "$N_RUN times '$CMD':"
> +
> +for i in $(seq 1 $N_RUN); do
> +$CMD | grep -v 'Benchmarking\|Reader' | \
> +while read line; do
> +    name="$(echo $line    | cut -d: -f1)"
> +    reader="$(echo $line  | cut -d: -f2- | cut -d' ' -f2)"
> +    writers="$(echo $line | cut -d: -f2- | rev | cut -d' ' -f2 | rev)"
> +
> +    printf "%d\t%d\t" $reader $writers
> +    #printf "%d\t" $writers
> +done >> $tmp
> +printf "\n" >> $tmp
> +done
> +
> +#cat $tmp
> +
> +nb_col=$(awk -F$'\t' '{print NF-1}' $tmp | head -1)
> +maxlen=0
> +for i in $(seq 1 $nb_col); do
> +    name=$(head -1 $tmp | cut -d$'\t' -f$i)
> +    len=$(printf "%s" "$name" |wc -m)
> +    [ "$maxlen" -lt "$len" ] && maxlen=$len
> +done
> +
> +for i in $(seq 1 $nb_col); do
> +    name=$(head -1 $tmp | cut -d$'\t' -f$i)
> +    values=$(tail -n +2 $tmp | cut -d$'\t' -f$i)
> +    print_stats $maxlen "$name" $values
> +done
> +
> +rm $tmp
> diff --git a/tests/test-mpsc-queue.c b/tests/test-mpsc-queue.c
> index ebd1226fe..cad7a0ce3 100644
> --- a/tests/test-mpsc-queue.c
> +++ b/tests/test-mpsc-queue.c
> @@ -25,12 +25,22 @@
>  #include "guarded-list.h"
>  #include "mpsc-queue.h"
>  #include "openvswitch/list.h"
> +#include "openvswitch/vlog.h"
>  #include "openvswitch/util.h"
> +#include "ovs-numa.h"
> +#include "ovs-rcu.h"
>  #include "ovs-thread.h"
>  #include "ovstest.h"
>  #include "timeval.h"
>  #include "util.h"
>  
> +static void
> +set_affinity(unsigned int tid OVS_UNUSED)
> +{
> +    //ovs_numa_thread_setaffinity_core(tid * 2);
> +    //ovs_numa_thread_setaffinity_core(tid);
> +}
> +
>  struct element {
>      union {
>          struct mpsc_queue_node mpscq;
> @@ -318,7 +328,13 @@ mpsc_queue_insert_thread(void *aux_)
>      unsigned int id;
>      size_t i;
>  
> +    ovsrcu_quiesce_start();
> +
>      atomic_add(&aux->thread_id, 1u, &id);
> +    if (id == 0) {
> +        ovs_numa_thread_setaffinity_core(1);
> +    }
> +    set_affinity(id + 1);
>      n_elems_per_thread = n_elems / n_threads;
>      th_elements = &elements[id * n_elems_per_thread];
>  
> @@ -358,6 +374,7 @@ benchmark_mpsc_queue(void)
>  
>      aux.queue = &queue;
>      atomic_store(&aux.thread_id, 0);
> +    set_affinity(0);
>  
>      for (i = n_elems - (n_elems % n_threads); i < n_elems; i++) {
>          mpsc_queue_insert(&queue, &elements[i].node.mpscq);
> @@ -424,7 +441,7 @@ benchmark_mpsc_queue(void)
>  
>  struct list_aux {
>      struct ovs_list *list;
> -    struct ovs_mutex *lock;
> +    struct ovs_spin *lock;
>      atomic_uint thread_id;
>  };
>  
> @@ -438,7 +455,13 @@ locked_list_insert_thread(void *aux_)
>      unsigned int id;
>      size_t i;
>  
> +    ovsrcu_quiesce_start();
> +
>      atomic_add(&aux->thread_id, 1u, &id);
> +    if (id == 0) {
> +        ovs_numa_thread_setaffinity_core(1);
> +    }
> +    set_affinity(id + 1);
>      n_elems_per_thread = n_elems / n_threads;
>      th_elements = &elements[id * n_elems_per_thread];
>  
> @@ -446,9 +469,9 @@ locked_list_insert_thread(void *aux_)
>      xgettimeofday(&start);
>  
>      for (i = 0; i < n_elems_per_thread; i++) {
> -        ovs_mutex_lock(aux->lock);
> +        ovs_spin_lock(aux->lock);
>          ovs_list_push_front(aux->list, &th_elements[i].node.list);
> -        ovs_mutex_unlock(aux->lock);
> +        ovs_spin_unlock(aux->lock);
>      }
>  
>      thread_working_ms[id] = elapsed(&start);
> @@ -462,7 +485,7 @@ locked_list_insert_thread(void *aux_)
>  static void
>  benchmark_list(void)
>  {
> -    struct ovs_mutex lock;
> +    struct ovs_spin lock;
>      struct ovs_list list;
>      struct element *elem;
>      struct timeval start;
> @@ -477,18 +500,19 @@ benchmark_list(void)
>      memset(elements, 0, n_elems * sizeof *elements);
>      memset(thread_working_ms, 0, n_threads * sizeof *thread_working_ms);
>  
> -    ovs_mutex_init(&lock);
> +    ovs_spin_init(&lock);
>      ovs_list_init(&list);
>  
>      aux.list = &list;
>      aux.lock = &lock;
>      atomic_store(&aux.thread_id, 0);
> +    set_affinity(0);
>  
> -    ovs_mutex_lock(&lock);
> +    ovs_spin_lock(&lock);
>      for (i = n_elems - (n_elems % n_threads); i < n_elems; i++) {
>          ovs_list_push_front(&list, &elements[i].node.list);
>      }
> -    ovs_mutex_unlock(&lock);
> +    ovs_spin_unlock(&lock);
>  
>      working = true;
>  
> @@ -505,12 +529,21 @@ benchmark_list(void)
>      counter = 0;
>      epoch = 1;
>      do {
> -        ovs_mutex_lock(&lock);
> -        LIST_FOR_EACH_POP (elem, node.list, &list) {
> -            elem->mark = epoch;
> -            counter++;
> +        struct ovs_list *node = NULL;
> +
> +        ovs_spin_lock(&lock);
> +        if (!ovs_list_is_empty(&list)) {
> +            node = ovs_list_pop_front(&list);
> +        }
> +        ovs_spin_unlock(&lock);
> +
> +        if (!node) {
> +            continue;
>          }
> -        ovs_mutex_unlock(&lock);
> +
> +        elem = CONTAINER_OF(node, struct element, node.list);
> +        elem->mark = epoch;
> +        counter++;
>          if (epoch == UINT64_MAX) {
>              epoch = 0;
>          }
> @@ -525,14 +558,14 @@ benchmark_list(void)
>      avg /= n_threads;
>  
>      /* Elements might have been inserted before threads were joined. */
> -    ovs_mutex_lock(&lock);
> +    ovs_spin_lock(&lock);
>      LIST_FOR_EACH_POP (elem, node.list, &list) {
>          elem->mark = epoch;
>          counter++;
>      }
> -    ovs_mutex_unlock(&lock);
> +    ovs_spin_unlock(&lock);
>  
> -    printf("        list:  %6d", elapsed(&start));
> +    printf("   spin+list:  %6d", elapsed(&start));
>      for (i = 0; i < n_threads; i++) {
>          printf(" %6" PRIu64, thread_working_ms[i]);
>      }
> @@ -566,7 +599,13 @@ guarded_list_insert_thread(void *aux_)
>      unsigned int id;
>      size_t i;
>  
> +    ovsrcu_quiesce_start();
> +
>      atomic_add(&aux->thread_id, 1u, &id);
> +    if (id == 0) {
> +        ovs_numa_thread_setaffinity_core(1);
> +    }
> +    set_affinity(id + 1);
>      n_elems_per_thread = n_elems / n_threads;
>      th_elements = &elements[id * n_elems_per_thread];
>  
> @@ -585,6 +624,7 @@ guarded_list_insert_thread(void *aux_)
>      return NULL;
>  }
>  
> +OVS_UNUSED
>  static void
>  benchmark_guarded_list(void)
>  {
> @@ -608,6 +648,7 @@ benchmark_guarded_list(void)
>  
>      aux.glist = &glist;
>      atomic_store(&aux.thread_id, 0);
> +    set_affinity(0);
>  
>      for (i = n_elems - (n_elems % n_threads); i < n_elems; i++) {
>          guarded_list_push_back(&glist, &elements[i].node.list, n_elems);
> @@ -680,6 +721,9 @@ run_benchmarks(struct ovs_cmdl_context *ctx)
>      long int l_elems;
>      size_t i;
>  
> +    vlog_set_levels(NULL, VLF_ANY_DESTINATION, VLL_OFF);
> +    ovs_numa_init();
> +
>      l_elems = strtol(ctx->argv[1], NULL, 10);
>      l_threads = strtol(ctx->argv[2], NULL, 10);
>      ovs_assert(l_elems > 0 && l_threads > 0);
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> 


-- 
Gaetan Rivet

