[ovs-dev] [PATCH v3 14/28] seq-pool: Module for faster ID generation
Gaëtan Rivet
grive at u256.net
Wed May 5 14:41:41 UTC 2021
On Mon, May 3, 2021, at 14:48, Ilya Maximets wrote:
> On 5/1/21 4:49 PM, Gaëtan Rivet wrote:
> > On Fri, Apr 30, 2021, at 19:31, Ilya Maximets wrote:
> >> On 4/25/21 1:55 PM, Gaetan Rivet wrote:
> >>> The current id-pool module is slow to allocate the
> >>> next valid ID, and can be optimized when restricting
> >>> some properties of the pool.
> >>>
> >>> Those restrictions are:
> >>>
> >>> * No ability to add a random ID to the pool.
> >>>
> >>> * A new ID is no longer guaranteed to be the smallest
> >>> possible ID. It is however guaranteed to be in the
> >>> range of [base, next_id]. Multiple users of the pool
> >>> are registered, each with a thread-local cache for
> >>> better scalability, and next_id is one past the
> >>> latest ID added to any user cache.
> >>> The allocation range can be written as:
> >>>
> >>> [base, last_alloc + nb-user * cache-size + 1].
> >>>
> >>> * A user should never free an ID that is not allocated.
> >>> No checks are done, and doing so will duplicate the spurious
> >>> ID. Refcounting or another memory management scheme should
> >>> be used to ensure that an object and its ID are freed only once.
> >>>
> >>> This pool is designed to scale reasonably well in a multi-thread
> >>> setup. As it is aimed at being a faster replacement for the
> >>> current id-pool, a benchmark has been implemented alongside
> >>> unit tests.
> >>>
> >>> The benchmark is composed of 4 rounds: 'new', 'del', 'mix', and 'rnd'.
> >>> Respectively:
> >>>
> >>> + 'new': only allocate IDs
> >>> + 'del': only free IDs
> >>> + 'mix': allocate IDs, free them sequentially, then allocate again.
> >>> + 'rnd': allocate IDs, free them in random order, then allocate again.
> >>>
> >>> Randomized freeing is done by swapping the latest allocated ID with any
> >>> ID from the range of currently allocated IDs, which is reminiscent of the
> >>> Fisher-Yates shuffle. This exercises freeing non-sequential IDs,
> >>> which is the more natural use-case.
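
As an aside, the swap-based random free in the benchmark is conceptually along
the following lines (only a sketch, not the actual test code; random_range() is
the helper from lib/random.h and N_IDS stands for the benchmark size):

    enum { N_IDS = 1000000 };      /* Benchmark size, illustrative. */
    static uint32_t ids[N_IDS];    /* ids[0..n_ids-1] are currently allocated. */
    static size_t n_ids;

    static uint32_t
    pick_random_id_to_free(void)
    {
        size_t i = random_range(n_ids);   /* Uniform index among live IDs. */
        uint32_t victim = ids[i];

        /* Move the last allocated ID into the freed slot and shrink,
         * so that live IDs stay packed at the front of the array. */
        ids[i] = ids[n_ids - 1];
        n_ids--;
        return victim;
    }

Each freed ID is thus picked uniformly among the currently allocated ones, which
is what makes this round representative of non-sequential frees.
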
> >>>
> >>> For this specific round, the id-pool performance is such that a timeout
> >>> of 10 seconds is added to the benchmark:
> >>>
> >>> $ ./tests/ovstest test-seq-pool benchmark 10000 1
> >>> Benchmarking n=10000 on 1 thread.
> >>> type\thread: 1 Avg
> >>> seq-pool new: 1 1 ms
> >>> seq-pool del: 0 0 ms
> >>> seq-pool mix: 1 1 ms
> >>> seq-pool rnd: 1 1 ms
> >>> id-pool new: 0 0 ms
> >>> id-pool del: 1 1 ms
> >>> id-pool mix: 1 1 ms
> >>> id-pool rnd: 1201 1201 ms
> >>>
> >>> $ ./tests/ovstest test-seq-pool benchmark 100000 1
> >>> Benchmarking n=100000 on 1 thread.
> >>> type\thread: 1 Avg
> >>> seq-pool new: 2 2 ms
> >>> seq-pool del: 5 5 ms
> >>> seq-pool mix: 5 5 ms
> >>> seq-pool rnd: 5 5 ms
> >>> id-pool new: 8 8 ms
> >>> id-pool del: 5 5 ms
> >>> id-pool mix: 11 11 ms
> >>> id-pool rnd: 10000+ ****** ms
> >>>
> >>> $ ./tests/ovstest test-seq-pool benchmark 1000000 1
> >>> Benchmarking n=1000000 on 1 thread.
> >>> type\thread: 1 Avg
> >>> seq-pool new: 23 23 ms
> >>> seq-pool del: 49 49 ms
> >>> seq-pool mix: 53 53 ms
> >>> seq-pool rnd: 53 53 ms
> >>> id-pool new: 190 190 ms
> >>> id-pool del: 173 173 ms
> >>> id-pool mix: 273 273 ms
> >>> id-pool rnd: 10042+ ****** ms
> >>>
> >>> $ ./tests/ovstest test-seq-pool benchmark 1000000 2
> >>> Benchmarking n=1000000 on 2 threads.
> >>> type\thread: 1 2 Avg
> >>> seq-pool new: 40 39 39 ms
> >>> seq-pool del: 33 33 33 ms
> >>> seq-pool mix: 89 91 90 ms
> >>> seq-pool rnd: 146 151 148 ms
> >>> id-pool new: 485 485 485 ms
> >>> id-pool del: 541 542 541 ms
> >>> id-pool mix: 550 600 575 ms
> >>> id-pool rnd: 10048+ 10003+ ****** ms
> >>>
> >>> $ ./tests/ovstest test-seq-pool benchmark 1000000 4
> >>> Benchmarking n=1000000 on 4 threads.
> >>> type\thread: 1 2 3 4 Avg
> >>> seq-pool new: 40 39 40 40 39 ms
> >>> seq-pool del: 24 28 28 30 27 ms
> >>> seq-pool mix: 60 63 69 69 65 ms
> >>> seq-pool rnd: 195 197 202 202 199 ms
> >>> id-pool new: 478 471 482 485 479 ms
> >>> id-pool del: 474 469 467 474 471 ms
> >>> id-pool mix: 558 558 611 545 568 ms
> >>> id-pool rnd: 10121+ 10076+ 10030+ 10167+ ****** ms
> >>
> >> Exactly the same question here. The new data structure doesn't provide
> >> any guarantees or checks, and it doesn't allocate the smallest ID.
> >> So why not just an array + a spinlock? We're already using this
> >> scheme in lib/netdev-afxdp-pool.c and it works fine.
> >>
> >> I crafted a very simple implementation and added it to this benchmark.
> >> Results are as follows:
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 10000 1
> >> Benchmarking n=10000 on 1 thread.
> >> type\thread: 1 Avg
> >> seq-pool new: 1 1 ms
> >> seq-pool del: 2 2 ms
> >> seq-pool mix: 4 4 ms
> >> seq-pool rnd: 4 4 ms
> >> my-pool new: 1 1 ms
> >> my-pool del: 1 1 ms
> >> my-pool mix: 1 1 ms
> >> my-pool rnd: 2 2 ms
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 100000 1
> >> Benchmarking n=100000 on 1 thread.
> >> type\thread: 1 Avg
> >> seq-pool new: 13 13 ms
> >> seq-pool del: 24 24 ms
> >> seq-pool mix: 11 11 ms
> >> seq-pool rnd: 8 8 ms
> >> my-pool new: 2 2 ms
> >> my-pool del: 1 1 ms
> >> my-pool mix: 4 4 ms
> >> my-pool rnd: 3 3 ms
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 1000000 1
> >> Benchmarking n=1000000 on 1 thread.
> >> type\thread: 1 Avg
> >> seq-pool new: 43 43 ms
> >> seq-pool del: 50 50 ms
> >> seq-pool mix: 54 54 ms
> >> seq-pool rnd: 54 54 ms
> >> my-pool new: 10 10 ms
> >> my-pool del: 10 10 ms
> >> my-pool mix: 28 28 ms
> >> my-pool rnd: 41 41 ms
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 1000000 2
> >> Benchmarking n=1000000 on 2 threads.
> >> type\thread: 1 2 Avg
> >> seq-pool new: 72 71 71 ms
> >> seq-pool del: 31 33 32 ms
> >> seq-pool mix: 83 83 83 ms
> >> seq-pool rnd: 144 147 145 ms
> >> my-pool new: 16 12 14 ms
> >> my-pool del: 16 12 14 ms
> >> my-pool mix: 66 70 68 ms
> >> my-pool rnd: 42 58 50 ms
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 1000000 4
> >> Benchmarking n=1000000 on 4 threads.
> >> type\thread: 1 2 3 4 Avg
> >> seq-pool new: 71 69 69 67 69 ms
> >> seq-pool del: 27 22 27 23 24 ms
> >> seq-pool mix: 57 64 71 67 64 ms
> >> seq-pool rnd: 182 186 189 191 187 ms
> >> my-pool new: 50 42 44 50 46 ms
> >> my-pool del: 46 39 37 45 41 ms
> >> my-pool mix: 79 45 55 81 65 ms
> >> my-pool rnd: 149 130 123 148 137 ms
> >>
> >> So, this implementation is faster and doesn't require any new complex
> >> data structures.
> >>
> >> Sometimes high contention on a spin lock results in unfair distribution
> >> and gives somewhat worse results, but they are still not too shabby, e.g.:
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 1000000 4
> >> Benchmarking n=1000000 on 4 threads.
> >> type\thread: 1 2 3 4 Avg
> >> seq-pool new: 75 75 75 74 74 ms
> >> seq-pool del: 25 22 23 26 24 ms
> >> seq-pool mix: 56 68 63 69 64 ms
> >> seq-pool rnd: 183 188 191 192 188 ms
> >> my-pool new: 52 52 53 39 49 ms
> >> my-pool del: 52 54 55 42 50 ms
> >> my-pool mix: 150 156 159 126 147 ms
> >> my-pool rnd: 160 175 177 139 162 ms
> >>
> >> I didn't add any core affinity management to this benchmark.
> >> And even in a scenario that is bad for a spinlock (16 threads on 4
> >> multi-threaded physical cores), it doesn't look terrible:
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 1000000 16
> >> Benchmarking n=1000000 on 16 threads.
> >> type\thread: 1 ... 16 Avg
> >> seq-pool new: 38 ... 40 40 ms
> >> seq-pool del: 23 ... 20 20 ms
> >> seq-pool mix: 66 ... 68 65 ms
> >> seq-pool rnd: 200 ... 185 195 ms
> >> my-pool new: 68 ... 91 97 ms
> >> my-pool del: 87 ... 163 145 ms
> >> my-pool mix: 363 ... 359 343 ms
> >> my-pool rnd: 464 ... 432 428 ms
> >>
> >> For a scenario with a few offload threads it should be perfectly fine.
> >> What do you think?
> >
> > Thanks, I hadn't seen the afxdp pool, it's interesting.
> >
> > I see two issues, one trivial and one maybe more problematic.
> > The first issue is that the marks allocated using the id-pool implicitly
> > depend on remaining under a limit. I can't speak for other hardware
> > vendors, but we have a limitation in Nvidia NICs, where not all marks
> > in the full range used by offloads are supported.
> >
> > It was a surprise for me; otherwise I would have gone for a fully random ID
> > generator (using malloc-generated addresses), as there are simpler ways to
> > get there. For your pool the change is trivial anyway, only the ID values
> > have to be set accordingly:
> >
> > 487 for (int i = size; i > 0; i--) {
> > 488 pool->array[pool->count++] = start + i - 1;
> > 489 }
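
To be explicit about what I mean (this is only my guess at the surrounding code,
not your actual draft; plain pthread spinlocks stand in for the OVS wrappers and
xmalloc() is from lib/util.h), the init would fill the free-ID stack with
[start, start + size) instead of [0, size):

    struct mark_pool {                /* Illustrative names only. */
        pthread_spinlock_t lock;
        uint32_t *array;              /* Stack of currently free IDs. */
        size_t count;                 /* Number of entries in 'array'. */
    };

    static void
    mark_pool_init(struct mark_pool *pool, uint32_t start, size_t size)
    {
        pthread_spin_init(&pool->lock, PTHREAD_PROCESS_PRIVATE);
        pool->array = xmalloc(size * sizeof *pool->array);
        pool->count = 0;
        /* Push the IDs in reverse so that the smallest one is popped first. */
        for (size_t i = size; i > 0; i--) {
            pool->array[pool->count++] = start + i - 1;
        }
    }
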
> >
> > There is currently no way to query the hardware for rte_flow limits,
> > so I don't have a good way to set this limit dynamically,
> > and I don't see how it would work with port hotplug, downsizing a shared range.
> > Maybe this is another gap in the rte_flow API that should be addressed.
> >
> > The other issue is that the full range of IDs is pre-allocated.
> > With the current mark range it means 16 GB of mostly unused data.
> > We can limit this to the part that we actually use, considering
> > that we should only get flow_limit marks at most.
> >
> > With a flow limit of 10k (20k maybe, to deal with async add + delete), it
> > would be reduced to 80k bytes pre-allocated.
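
Spelling the arithmetic out (assuming one 4-byte entry per possible mark value):

    2^32 marks   * 4 bytes/entry = 16 GiB pre-allocated for the full range
    20,000 marks * 4 bytes/entry = 80,000 bytes pre-allocated
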
> >
> > I very much prefer your implementation. I am not happy with the seq-pool.
> > I tried to make it a drop-in replacement for id-pool, so allowing very large
> > ranges was part of it. If it is okay to restrict the mark range I think it's
> > worth using yours instead.
>
> I think we can avoid pre-allocating the whole array and use a 2n-realloc
> scheme if we run out of pre-allocated values. It's not ideal to
> realloc under the spinlock, but this should not be a very big issue if
> it happens once or twice in the lifetime of a process. We can probably
> pre-allocate 200K values, which would be ~1.5MB of data. 200K is the default
> limit for the number of datapath flows, so we should almost never re-allocate.
> Alternatively, we can pre-allocate 20K and have at most 4 re-allocations
> in the lifetime of the process.
>
> What do you think?
In principle I agree, deferring the allocation should work.
I'm not sure about the 2n-realloc though. At face value, exponential growth seems
dangerous with such a large range. A constant growth step would make the hiccups
more predictable, which seems important given that the spinlock will keep other
threads busy-waiting while the realloc happens.
Also, since memory overcommit won't help here, we could waste a lot of memory with
the last realloc.
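
To make the comparison concrete, here is roughly what I have in mind (a sketch only,
with illustrative names; plain pthread spinlocks stand in for the OVS wrappers, and
xrealloc()/MIN() are from lib/util.h):

    #define POOL_GROW_STEP 20000          /* Constant growth step, in IDs. */

    struct lazy_mark_pool {
        pthread_spinlock_t lock;
        uint32_t *free_ids;    /* Stack of currently free IDs. */
        size_t n_free;         /* Number of entries in 'free_ids'. */
        size_t capacity;       /* Allocated size of 'free_ids'. */
        uint32_t next_unused;  /* First ID of the range not handed out yet. */
        uint32_t limit;        /* One past the last usable ID. */
    };

    static bool
    lazy_mark_pool_new_id(struct lazy_mark_pool *pool, uint32_t *id)
    {
        bool ok = false;

        pthread_spin_lock(&pool->lock);
        if (pool->n_free == 0 && pool->next_unused < pool->limit) {
            /* Out of cached IDs: grow by a constant step rather than
             * doubling, so the realloc pause taken while other threads
             * spin on the lock stays bounded and predictable. */
            uint32_t step = MIN(POOL_GROW_STEP,
                                pool->limit - pool->next_unused);

            pool->capacity += step;
            pool->free_ids = xrealloc(pool->free_ids,
                                      pool->capacity * sizeof *pool->free_ids);
            /* Push the new IDs in reverse so the smallest is popped first. */
            for (uint32_t i = step; i > 0; i--) {
                pool->free_ids[pool->n_free++] = pool->next_unused + i - 1;
            }
            pool->next_unused += step;
        }
        if (pool->n_free > 0) {
            *id = pool->free_ids[--pool->n_free];
            ok = true;
        }
        pthread_spin_unlock(&pool->lock);
        return ok;
    }

    static void
    lazy_mark_pool_free_id(struct lazy_mark_pool *pool, uint32_t id)
    {
        pthread_spin_lock(&pool->lock);
        /* 'capacity' always covers every ID handed out so far, so the
         * free path never needs to grow the array. */
        pool->free_ids[pool->n_free++] = id;
        pthread_spin_unlock(&pool->lock);
    }

With a 20k step and 4-byte entries, each expansion allocates 80 kB more, and the
footprint stays proportional to the number of IDs actually handed out rather than
to the full mark range.
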
I will respin the patch with a new version, probably with a new name instead of
seq-pool (I fear it might confuse readers regarding the struct 'seq' otherwise).
--
Gaetan Rivet