[ovs-dev] [PATCH v3 14/28] seq-pool: Module for faster ID generation
Ilya Maximets
i.maximets at ovn.org
Fri Apr 30 17:31:37 UTC 2021
On 4/25/21 1:55 PM, Gaetan Rivet wrote:
> The current id-pool module is slow to allocate the
> next valid ID, and can be optimized when restricting
> some properties of the pool.
>
> Those restrictions are:
>
> * No ability to add a random ID to the pool.
>
> * A new ID is no longer the smallest possible ID. It is
> however guaranteed to be in the range of
> [base, next_id]. Multiple users of the pool are
> registered, each with a thread-local cache for
> better scalability, and next_id is one past the
> latest ID added to any user cache.
> The allocation range can be written as:
>
> [base, last_alloc + nb-user * cache-size + 1].
>
> * A user should never free an ID that is not allocated.
> No checks are done, and doing so will duplicate the spurious
> ID. Refcounting or another memory-management scheme should
> be used to ensure an object and its ID are freed only once.
>
> This pool is designed to scale reasonably well in multi-thread
> setup. As it is aimed at being a faster replacement to the
> current id-pool, a benchmark has been implemented alongside
> unit tests.
>
> The benchmark is composed of 4 rounds: 'new', 'del', 'mix', and 'rnd'.
> Respectively:
>
> + 'new': only allocate IDs.
> + 'del': only free IDs.
> + 'mix': allocate, free sequentially, then allocate again.
> + 'rnd': allocate, free randomly, then allocate again.
>
> Randomized freeing is done by swapping the latest allocated ID with any
> ID from the currently allocated range, which is reminiscent of the
> Fisher-Yates shuffle. This evaluates freeing non-sequential IDs,
> which is the more natural use-case.
>
> For this specific round, the id-pool performance is such that a timeout
> of 10 seconds is added to the benchmark:
>
> $ ./tests/ovstest test-seq-pool benchmark 10000 1
> Benchmarking n=10000 on 1 thread.
> type\thread: 1 Avg
> seq-pool new: 1 1 ms
> seq-pool del: 0 0 ms
> seq-pool mix: 1 1 ms
> seq-pool rnd: 1 1 ms
> id-pool new: 0 0 ms
> id-pool del: 1 1 ms
> id-pool mix: 1 1 ms
> id-pool rnd: 1201 1201 ms
>
> $ ./tests/ovstest test-seq-pool benchmark 100000 1
> Benchmarking n=100000 on 1 thread.
> type\thread: 1 Avg
> seq-pool new: 2 2 ms
> seq-pool del: 5 5 ms
> seq-pool mix: 5 5 ms
> seq-pool rnd: 5 5 ms
> id-pool new: 8 8 ms
> id-pool del: 5 5 ms
> id-pool mix: 11 11 ms
> id-pool rnd: 10000+ ****** ms
>
> $ ./tests/ovstest test-seq-pool benchmark 1000000 1
> Benchmarking n=1000000 on 1 thread.
> type\thread: 1 Avg
> seq-pool new: 23 23 ms
> seq-pool del: 49 49 ms
> seq-pool mix: 53 53 ms
> seq-pool rnd: 53 53 ms
> id-pool new: 190 190 ms
> id-pool del: 173 173 ms
> id-pool mix: 273 273 ms
> id-pool rnd: 10042+ ****** ms
>
> $ ./tests/ovstest test-seq-pool benchmark 1000000 2
> Benchmarking n=1000000 on 2 threads.
> type\thread: 1 2 Avg
> seq-pool new: 40 39 39 ms
> seq-pool del: 33 33 33 ms
> seq-pool mix: 89 91 90 ms
> seq-pool rnd: 146 151 148 ms
> id-pool new: 485 485 485 ms
> id-pool del: 541 542 541 ms
> id-pool mix: 550 600 575 ms
> id-pool rnd: 10048+ 10003+ ****** ms
>
> $ ./tests/ovstest test-seq-pool benchmark 1000000 4
> Benchmarking n=1000000 on 4 threads.
> type\thread: 1 2 3 4 Avg
> seq-pool new: 40 39 40 40 39 ms
> seq-pool del: 24 28 28 30 27 ms
> seq-pool mix: 60 63 69 69 65 ms
> seq-pool rnd: 195 197 202 202 199 ms
> id-pool new: 478 471 482 485 479 ms
> id-pool del: 474 469 467 474 471 ms
> id-pool mix: 558 558 611 545 568 ms
> id-pool rnd: 10121+ 10076+ 10030+ 10167+ ****** ms
Exactly the same question here. The new data structure doesn't provide
any guarantees or checks. It doesn't allocate the smallest ID.
Then why not just an array + a spinlock? We're already using this
scheme in lib/netdev-afxdp-pool.c and it works fine.
I crafted a very simple implementation and added it to this benchmark.
Results are following:
$ ./tests/ovstest test-seq-pool benchmark 10000 1
Benchmarking n=10000 on 1 thread.
type\thread: 1 Avg
seq-pool new: 1 1 ms
seq-pool del: 2 2 ms
seq-pool mix: 4 4 ms
seq-pool rnd: 4 4 ms
my-pool new: 1 1 ms
my-pool del: 1 1 ms
my-pool mix: 1 1 ms
my-pool rnd: 2 2 ms
$ ./tests/ovstest test-seq-pool benchmark 100000 1
Benchmarking n=100000 on 1 thread.
type\thread: 1 Avg
seq-pool new: 13 13 ms
seq-pool del: 24 24 ms
seq-pool mix: 11 11 ms
seq-pool rnd: 8 8 ms
my-pool new: 2 2 ms
my-pool del: 1 1 ms
my-pool mix: 4 4 ms
my-pool rnd: 3 3 ms
$ ./tests/ovstest test-seq-pool benchmark 1000000 1
Benchmarking n=1000000 on 1 thread.
type\thread: 1 Avg
seq-pool new: 43 43 ms
seq-pool del: 50 50 ms
seq-pool mix: 54 54 ms
seq-pool rnd: 54 54 ms
my-pool new: 10 10 ms
my-pool del: 10 10 ms
my-pool mix: 28 28 ms
my-pool rnd: 41 41 ms
$ ./tests/ovstest test-seq-pool benchmark 1000000 2
Benchmarking n=1000000 on 2 threads.
type\thread: 1 2 Avg
seq-pool new: 72 71 71 ms
seq-pool del: 31 33 32 ms
seq-pool mix: 83 83 83 ms
seq-pool rnd: 144 147 145 ms
my-pool new: 16 12 14 ms
my-pool del: 16 12 14 ms
my-pool mix: 66 70 68 ms
my-pool rnd: 42 58 50 ms
$ ./tests/ovstest test-seq-pool benchmark 1000000 4
Benchmarking n=1000000 on 4 threads.
type\thread: 1 2 3 4 Avg
seq-pool new: 71 69 69 67 69 ms
seq-pool del: 27 22 27 23 24 ms
seq-pool mix: 57 64 71 67 64 ms
seq-pool rnd: 182 186 189 191 187 ms
my-pool new: 50 42 44 50 46 ms
my-pool del: 46 39 37 45 41 ms
my-pool mix: 79 45 55 81 65 ms
my-pool rnd: 149 130 123 148 137 ms
So, this implementation is faster and doesn't require any new complex
data structures.
Sometimes high contention on a spin lock results in unfair distribution
and gives somewhat worse results, but they are still not too shabby, e.g.:
$ ./tests/ovstest test-seq-pool benchmark 1000000 4
Benchmarking n=1000000 on 4 threads.
type\thread: 1 2 3 4 Avg
seq-pool new: 75 75 75 74 74 ms
seq-pool del: 25 22 23 26 24 ms
seq-pool mix: 56 68 63 69 64 ms
seq-pool rnd: 183 188 191 192 188 ms
my-pool new: 52 52 53 39 49 ms
my-pool del: 52 54 55 42 50 ms
my-pool mix: 150 156 159 126 147 ms
my-pool rnd: 160 175 177 139 162 ms
I didn't add any core affinity management to this benchmark.
And even in a scenario that is bad for a spinlock, 16 threads on 4
physical cores with simultaneous multi-threading, it doesn't look
terrible:
$ ./tests/ovstest test-seq-pool benchmark 1000000 16
Benchmarking n=1000000 on 16 threads.
type\thread: 1 ... 16 Avg
seq-pool new: 38 ... 40 40 ms
seq-pool del: 23 ... 20 20 ms
seq-pool mix: 66 ... 68 65 ms
seq-pool rnd: 200 ... 185 195 ms
my-pool new: 68 ... 91 97 ms
my-pool del: 87 ... 163 145 ms
my-pool mix: 363 ... 359 343 ms
my-pool rnd: 464 ... 432 428 ms
For the scenario with a few offload threads it should be perfectly fine.
What do you think?
Best regards, Ilya Maximets.
P.S. Here is an implementation that I used for tests:
diff --git a/tests/test-seq-pool.c b/tests/test-seq-pool.c
index a1e0d5500..53af7be20 100644
--- a/tests/test-seq-pool.c
+++ b/tests/test-seq-pool.c
@@ -24,6 +24,7 @@
#include "command-line.h"
#include "id-pool.h"
#include "openvswitch/util.h"
+#include "openvswitch/vlog.h"
#include "ovs-thread.h"
#include "ovs-rcu.h"
#include "ovstest.h"
@@ -468,6 +469,180 @@ benchmark_id_pool(void)
free(threads);
}
+struct id_unordered_spin_pool {
+ struct ovs_spin lock;
+ uint32_t *array;
+ size_t count;
+};
+
+static struct id_unordered_spin_pool *
+id_unordered_spin_pool_create(int start, int size)
+{
+ struct id_unordered_spin_pool *pool = xzalloc(sizeof *pool);
+
+ ovs_spin_init(&pool->lock);
+ ovs_spin_lock(&pool->lock);
+ pool->array = xzalloc(size * sizeof pool->array[0]);
+ for (int i = 0; i < size; i++) {
+ pool->array[pool->count++] = start + i;
+ }
+ ovs_spin_unlock(&pool->lock);
+ return pool;
+}
+
+static bool
+id_unordered_spin_pool_alloc_id(struct id_unordered_spin_pool *pool,
+ uint32_t *id)
+{
+ bool res = false;
+
+ ovs_spin_lock(&pool->lock);
+ if (pool->count) {
+ *id = pool->array[--pool->count];
+ res = true;
+ }
+ ovs_spin_unlock(&pool->lock);
+ return res;
+}
+
+static void
+id_unordered_spin_pool_free_id(struct id_unordered_spin_pool *pool,
+ uint32_t id)
+{
+ ovs_spin_lock(&pool->lock);
+ pool->array[pool->count++] = id;
+ ovs_spin_unlock(&pool->lock);
+}
+
+static void
+id_unordered_spin_pool_destroy(struct id_unordered_spin_pool *pool)
+{
+ ovs_spin_lock(&pool->lock);
+ free(pool->array);
+ ovs_spin_unlock(&pool->lock);
+ ovs_spin_destroy(&pool->lock);
+ free(pool);
+}
+
+struct id_unordered_spin_pool_aux {
+ struct id_unordered_spin_pool *pool;
+ atomic_uint thread_id;
+};
+
+static void *
+id_unordered_spin_pool_thread(void *aux_)
+{
+ unsigned int n_ids_per_thread;
+ struct id_unordered_spin_pool_aux *aux = aux_;
+ uint32_t *th_ids;
+ unsigned int tid;
+ int start;
+ size_t i;
+
+ atomic_add(&aux->thread_id, 1u, &tid);
+ n_ids_per_thread = n_ids / n_threads;
+ th_ids = &ids[tid * n_ids_per_thread];
+
+ /* NEW */
+
+ start = running_time_ms;
+ for (i = 0; i < n_ids_per_thread; i++) {
+ ovs_assert(id_unordered_spin_pool_alloc_id(aux->pool, &th_ids[i]));
+ }
+ thread_working_ms[tid] = elapsed(&start);
+
+ ovs_barrier_block(&barrier);
+
+ /* DEL */
+
+ shuffle(th_ids, n_ids_per_thread);
+
+ start = running_time_ms;
+ for (i = 0; i < n_ids_per_thread; i++) {
+ id_unordered_spin_pool_free_id(aux->pool, th_ids[i]);
+ }
+ thread_working_ms[tid] = elapsed(&start);
+
+ ovs_barrier_block(&barrier);
+
+ /* MIX */
+
+ start = running_time_ms;
+ for (i = 0; i < n_ids_per_thread; i++) {
+ ignore(id_unordered_spin_pool_alloc_id(aux->pool, &th_ids[i]));
+ id_unordered_spin_pool_free_id(aux->pool, th_ids[i]);
+ ignore(id_unordered_spin_pool_alloc_id(aux->pool, &th_ids[i]));
+ }
+ thread_working_ms[tid] = elapsed(&start);
+
+ ovs_barrier_block(&barrier);
+
+ /* MIX SHUFFLED */
+
+ start = running_time_ms;
+ for (i = 0; i < n_ids_per_thread; i++) {
+ if (elapsed(&start) >= TIMEOUT_MS) {
+ break;
+ }
+ ignore(id_unordered_spin_pool_alloc_id(aux->pool, &th_ids[i]));
+ swap_u32(&th_ids[i], &th_ids[random_range(i + 1)]);
+ id_unordered_spin_pool_free_id(aux->pool, th_ids[i]);
+ ignore(id_unordered_spin_pool_alloc_id(aux->pool, &th_ids[i]));
+ }
+ thread_working_ms[tid] = elapsed(&start);
+
+ return NULL;
+}
+
+static void
+benchmark_id_unordered_spin_pool(void)
+{
+ pthread_t *threads;
+ struct id_unordered_spin_pool_aux aux;
+ size_t i;
+
+ memset(ids, 0, n_ids * sizeof *ids);
+ memset(thread_working_ms, 0, n_threads * sizeof *thread_working_ms);
+
+ aux.pool = id_unordered_spin_pool_create(0, n_ids);
+ atomic_store(&aux.thread_id, 0);
+
+ for (i = n_ids - (n_ids % n_threads); i < n_ids; i++) {
+ id_unordered_spin_pool_alloc_id(aux.pool, &ids[i]);
+ }
+
+ threads = xmalloc(n_threads * sizeof *threads);
+ ovs_barrier_init(&barrier, n_threads + 1);
+
+ for (i = 0; i < n_threads; i++) {
+ threads[i] = ovs_thread_create("id_unordered_spin_pool_alloc",
+ id_unordered_spin_pool_thread, &aux);
+ }
+
+ ovs_barrier_block(&barrier);
+
+ print_result(" my-pool new");
+
+ ovs_barrier_block(&barrier);
+
+ print_result(" my-pool del");
+
+ ovs_barrier_block(&barrier);
+
+ print_result(" my-pool mix");
+
+ for (i = 0; i < n_threads; i++) {
+ xpthread_join(threads[i], NULL);
+ }
+
+ print_result(" my-pool rnd");
+
+ id_unordered_spin_pool_destroy(aux.pool);
+ ovs_barrier_destroy(&barrier);
+ free(threads);
+}
+
+
static void *
clock_main(void *arg OVS_UNUSED)
{
@@ -492,6 +666,8 @@ run_benchmarks(struct ovs_cmdl_context *ctx)
long int l_ids;
size_t i;
+ vlog_set_levels(NULL, VLF_ANY_DESTINATION, VLL_OFF);
+
l_ids = strtol(ctx->argv[1], NULL, 10);
l_threads = strtol(ctx->argv[2], NULL, 10);
ovs_assert(l_ids > 0 && l_threads > 0);
@@ -514,7 +690,8 @@ run_benchmarks(struct ovs_cmdl_context *ctx)
printf(" Avg\n");
benchmark_seq_pool();
- benchmark_id_pool();
+ /* benchmark_id_pool(); */
+ benchmark_id_unordered_spin_pool();
stop = true;
---