[ovs-dev] [PATCH v3 14/28] seq-pool: Module for faster ID generation
Gaëtan Rivet
grive at u256.net
Wed May 5 14:41:41 UTC 2021
On Mon, May 3, 2021, at 14:48, Ilya Maximets wrote:
> On 5/1/21 4:49 PM, Gaëtan Rivet wrote:
> > On Fri, Apr 30, 2021, at 19:31, Ilya Maximets wrote:
> >> On 4/25/21 1:55 PM, Gaetan Rivet wrote:
> >>> The current id-pool module is slow to allocate the
> >>> next valid ID, and can be optimized when restricting
> >>> some properties of the pool.
> >>>
> >>> Those restrictions are:
> >>>
> >>> * No ability to add a random ID to the pool.
> >>>
> >>> * A new ID is no longer guaranteed to be the smallest
> >>> possible ID. It is however guaranteed to be in the
> >>> range of [base, next_id]. Multiple users of the pool
> >>> are registered, each with a thread-local cache for
> >>> better scalability, and next_id is one past the
> >>> latest ID added to any user cache.
> >>> The allocation range can be written as:
> >>>
> >>> [base, last_alloc + nb-user * cache-size + 1].
> >>>
> >>> * A user should never free an ID that is not allocated.
> >>> No checks are done, and doing so will duplicate the spurious
> >>> ID. Refcounting or another memory management scheme should
> >>> be used to ensure that an object and its ID are freed only once.
> >>>
> >>> This pool is designed to scale reasonably well in a multi-thread
> >>> setup. As it is aimed at being a faster replacement for the
> >>> current id-pool, a benchmark has been implemented alongside
> >>> unit tests.
> >>>
> >>> The benchmark is composed of 4 rounds: 'new', 'del', 'mix', and 'rnd'.
> >>> Respectively:
> >>>
> >>> + 'new': only allocate IDs
> >>> + 'del': only free IDs
> >>> + 'mix': allocate IDs, free them sequentially, then allocate again.
> >>> + 'rnd': allocate IDs, free them in random order, then allocate again.
> >>>
> >>> Randomized freeing is done by swapping the latest allocated ID with any
> >>> ID from the range of currently allocated IDs, which is reminiscent of the
> >>> Fisher-Yates shuffle. This exercises freeing non-sequential IDs,
> >>> which is the more natural use-case.
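
As an aside, the swap-based random free in the benchmark is conceptually along
the following lines (only a sketch, not the actual test code; random_range() is
the helper from lib/random.h and N_IDS stands for the benchmark size):

    enum { N_IDS = 1000000 };      /* Benchmark size, illustrative. */
    static uint32_t ids[N_IDS];    /* ids[0..n_ids-1] are currently allocated. */
    static size_t n_ids;

    static uint32_t
    pick_random_id_to_free(void)
    {
        size_t i = random_range(n_ids);   /* Uniform index among live IDs. */
        uint32_t victim = ids[i];

        /* Move the last allocated ID into the freed slot and shrink,
         * so that live IDs stay packed at the front of the array. */
        ids[i] = ids[n_ids - 1];
        n_ids--;
        return victim;
    }

Each freed ID is thus picked uniformly among the currently allocated ones, which
is what makes this round representative of non-sequential frees.
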
> >>>
> >>> For this specific round, the id-pool performance is such that a timeout
> >>> of 10 seconds is added to the benchmark:
> >>>
> >>> $ ./tests/ovstest test-seq-pool benchmark 10000 1
> >>> Benchmarking n=10000 on 1 thread.
> >>> type\thread: 1 Avg
> >>> seq-pool new: 1 1 ms
> >>> seq-pool del: 0 0 ms
> >>> seq-pool mix: 1 1 ms
> >>> seq-pool rnd: 1 1 ms
> >>> id-pool new: 0 0 ms
> >>> id-pool del: 1 1 ms
> >>> id-pool mix: 1 1 ms
> >>> id-pool rnd: 1201 1201 ms
> >>>
> >>> $ ./tests/ovstest test-seq-pool benchmark 100000 1
> >>> Benchmarking n=100000 on 1 thread.
> >>> type\thread: 1 Avg
> >>> seq-pool new: 2 2 ms
> >>> seq-pool del: 5 5 ms
> >>> seq-pool mix: 5 5 ms
> >>> seq-pool rnd: 5 5 ms
> >>> id-pool new: 8 8 ms
> >>> id-pool del: 5 5 ms
> >>> id-pool mix: 11 11 ms
> >>> id-pool rnd: 10000+ ****** ms
> >>>
> >>> $ ./tests/ovstest test-seq-pool benchmark 1000000 1
> >>> Benchmarking n=1000000 on 1 thread.
> >>> type\thread: 1 Avg
> >>> seq-pool new: 23 23 ms
> >>> seq-pool del: 49 49 ms
> >>> seq-pool mix: 53 53 ms
> >>> seq-pool rnd: 53 53 ms
> >>> id-pool new: 190 190 ms
> >>> id-pool del: 173 173 ms
> >>> id-pool mix: 273 273 ms
> >>> id-pool rnd: 10042+ ****** ms
> >>>
> >>> $ ./tests/ovstest test-seq-pool benchmark 1000000 2
> >>> Benchmarking n=1000000 on 2 threads.
> >>> type\thread: 1 2 Avg
> >>> seq-pool new: 40 39 39 ms
> >>> seq-pool del: 33 33 33 ms
> >>> seq-pool mix: 89 91 90 ms
> >>> seq-pool rnd: 146 151 148 ms
> >>> id-pool new: 485 485 485 ms
> >>> id-pool del: 541 542 541 ms
> >>> id-pool mix: 550 600 575 ms
> >>> id-pool rnd: 10048+ 10003+ ****** ms
> >>>
> >>> $ ./tests/ovstest test-seq-pool benchmark 1000000 4
> >>> Benchmarking n=1000000 on 4 threads.
> >>> type\thread: 1 2 3 4 Avg
> >>> seq-pool new: 40 39 40 40 39 ms
> >>> seq-pool del: 24 28 28 30 27 ms
> >>> seq-pool mix: 60 63 69 69 65 ms
> >>> seq-pool rnd: 195 197 202 202 199 ms
> >>> id-pool new: 478 471 482 485 479 ms
> >>> id-pool del: 474 469 467 474 471 ms
> >>> id-pool mix: 558 558 611 545 568 ms
> >>> id-pool rnd: 10121+ 10076+ 10030+ 10167+ ****** ms
> >>
> >> Exactly the same question here. The new data structure doesn't provide
> >> any guarantees or checks, and it doesn't allocate the smallest ID.
> >> So why not just an array + a spinlock? We're already using this
> >> scheme in lib/netdev-afxdp-pool.c and it works fine.
> >>
> >> I crafted a very simple implementation and added it to this benchmark.
> >> Results are as follows:
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 10000 1
> >> Benchmarking n=10000 on 1 thread.
> >> type\thread: 1 Avg
> >> seq-pool new: 1 1 ms
> >> seq-pool del: 2 2 ms
> >> seq-pool mix: 4 4 ms
> >> seq-pool rnd: 4 4 ms
> >> my-pool new: 1 1 ms
> >> my-pool del: 1 1 ms
> >> my-pool mix: 1 1 ms
> >> my-pool rnd: 2 2 ms
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 100000 1
> >> Benchmarking n=100000 on 1 thread.
> >> type\thread: 1 Avg
> >> seq-pool new: 13 13 ms
> >> seq-pool del: 24 24 ms
> >> seq-pool mix: 11 11 ms
> >> seq-pool rnd: 8 8 ms
> >> my-pool new: 2 2 ms
> >> my-pool del: 1 1 ms
> >> my-pool mix: 4 4 ms
> >> my-pool rnd: 3 3 ms
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 1000000 1
> >> Benchmarking n=1000000 on 1 thread.
> >> type\thread: 1 Avg
> >> seq-pool new: 43 43 ms
> >> seq-pool del: 50 50 ms
> >> seq-pool mix: 54 54 ms
> >> seq-pool rnd: 54 54 ms
> >> my-pool new: 10 10 ms
> >> my-pool del: 10 10 ms
> >> my-pool mix: 28 28 ms
> >> my-pool rnd: 41 41 ms
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 1000000 2
> >> Benchmarking n=1000000 on 2 threads.
> >> type\thread: 1 2 Avg
> >> seq-pool new: 72 71 71 ms
> >> seq-pool del: 31 33 32 ms
> >> seq-pool mix: 83 83 83 ms
> >> seq-pool rnd: 144 147 145 ms
> >> my-pool new: 16 12 14 ms
> >> my-pool del: 16 12 14 ms
> >> my-pool mix: 66 70 68 ms
> >> my-pool rnd: 42 58 50 ms
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 1000000 4
> >> Benchmarking n=1000000 on 4 threads.
> >> type\thread: 1 2 3 4 Avg
> >> seq-pool new: 71 69 69 67 69 ms
> >> seq-pool del: 27 22 27 23 24 ms
> >> seq-pool mix: 57 64 71 67 64 ms
> >> seq-pool rnd: 182 186 189 191 187 ms
> >> my-pool new: 50 42 44 50 46 ms
> >> my-pool del: 46 39 37 45 41 ms
> >> my-pool mix: 79 45 55 81 65 ms
> >> my-pool rnd: 149 130 123 148 137 ms
> >>
> >> So, this implementation is faster and doesn't require any new complex
> >> data structures.
> >>
> >> Sometimes high contention on a spin lock results in unfair distribution
> >> and gives somewhat worse results, but they are still not too shabby, e.g.:
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 1000000 4
> >> Benchmarking n=1000000 on 4 threads.
> >> type\thread: 1 2 3 4 Avg
> >> seq-pool new: 75 75 75 74 74 ms
> >> seq-pool del: 25 22 23 26 24 ms
> >> seq-pool mix: 56 68 63 69 64 ms
> >> seq-pool rnd: 183 188 191 192 188 ms
> >> my-pool new: 52 52 53 39 49 ms
> >> my-pool del: 52 54 55 42 50 ms
> >> my-pool mix: 150 156 159 126 147 ms
> >> my-pool rnd: 160 175 177 139 162 ms
> >>
> >> I didn't add any core affinity management to this benchmark.
> >> And even in a scenario that is bad for a spinlock (16 threads on 4
> >> multi-threaded physical cores), it doesn't look terrible:
> >>
> >> $ ./tests/ovstest test-seq-pool benchmark 1000000 16
> >> Benchmarking n=1000000 on 16 threads.
> >> type\thread: 1 ... 16 Avg
> >> seq-pool new: 38 ... 40 40 ms
> >> seq-pool del: 23 ... 20 20 ms
> >> seq-pool mix: 66 ... 68 65 ms
> >> seq-pool rnd: 200 ... 185 195 ms
> >> my-pool new: 68 ... 91 97 ms
> >> my-pool del: 87 ... 163 145 ms
> >> my-pool mix: 363 ... 359 343 ms
> >> my-pool rnd: 464 ... 432 428 ms
> >>
> >> For a scenario with a few offload threads it should be perfectly fine.
> >> What do you think?
> >
> > Thanks, I hadn't seen the afxdp pool, it's interesting.
> >
> > I see two issues, one trivial and one maybe more problematic.
> > The first issue is that the marks allocated using the id-pool implicitly
> > depend on remaining under a limit. I can't speak for other hardware
> > vendors, but we have a limitation in Nvidia NICs, where not all marks
> > in the full range used by offloads are supported.
> >
> > It was a surprise for me; otherwise I would have gone for a fully random ID
> > generator (using malloc-generated addresses), as there are simpler ways to
> > get there. For your pool the change is trivial anyway, only the ID values
> > have to be set accordingly:
> >
> > 487 for (int i = size; i > 0; i--) {
> > 488 pool->array[pool->count++] = start + i - 1;
> > 489 }
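
To be explicit about what I mean (this is only my guess at the surrounding code,
not your actual draft; plain pthread spinlocks stand in for the OVS wrappers and
xmalloc() is from lib/util.h), the init would fill the free-ID stack with
[start, start + size) instead of [0, size):

    struct mark_pool {                /* Illustrative names only. */
        pthread_spinlock_t lock;
        uint32_t *array;              /* Stack of currently free IDs. */
        size_t count;                 /* Number of entries in 'array'. */
    };

    static void
    mark_pool_init(struct mark_pool *pool, uint32_t start, size_t size)
    {
        pthread_spin_init(&pool->lock, PTHREAD_PROCESS_PRIVATE);
        pool->array = xmalloc(size * sizeof *pool->array);
        pool->count = 0;
        /* Push the IDs in reverse so that the smallest one is popped first. */
        for (size_t i = size; i > 0; i--) {
            pool->array[pool->count++] = start + i - 1;
        }
    }
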
> >
> > There is currently no way to query the hardware for rte_flow limits,
> > so I don't have a good way to set this limit dynamically,
> > and I don't see how it would work with port hotplug, downsizing a shared range.
> > Maybe this is another gap in the rte_flow API that should be addressed.
> >
> > The other issue is that the full range of IDs is pre-allocated.
> > With the current mark range it means 16 GB of mostly unused data.
> > We can limit this to the part that we actually use, considering
> > that we should only get flow_limit marks at most.
> >
> > With a flow limit of 10k (20k maybe, to deal with async add + delete), it
> > would be reduced to 80k bytes pre-allocated.
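
Spelling the arithmetic out (assuming one 4-byte entry per possible mark value):

    2^32 marks   * 4 bytes/entry = 16 GiB pre-allocated for the full range
    20,000 marks * 4 bytes/entry = 80,000 bytes pre-allocated
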
> >
> > I very much prefer your implementation. I am not happy with the seq-pool.
> > I tried to make it a drop-in replacement for id-pool, so allowing very large
> > ranges was part of it. If it is okay to restrict the mark range I think it's
> > worth using yours instead.
>
> I think we can avoid pre-allocating the whole array and use a 2n-realloc
> scheme if we run out of pre-allocated values. It's not ideal to
> realloc under the spinlock, but this should not be a very big issue if
> it happens once or twice in the lifetime of a process. We can probably
> pre-allocate 200K values, which would be ~1.5MB of data. 200K is the default
> limit for the number of datapath flows, so we should almost never re-allocate.
> Alternatively, we can pre-allocate 20K and have at most 4 re-allocations
> in the lifetime of the process.
>
> What do you think?
In principle I agree, deferring the allocation should work.
I'm not sure about the 2n-realloc though. At face value, exponential growth seems
dangerous with such a large range. A constant growth step would make the hiccups
more predictable, which seems important given that the spinlock will keep other
threads busy-waiting while the realloc happens.
Also, since memory overcommit won't help here, we could waste a lot of memory with
the last realloc.
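
To make the comparison concrete, here is roughly what I have in mind (a sketch only,
with illustrative names; plain pthread spinlocks stand in for the OVS wrappers, and
xrealloc()/MIN() are from lib/util.h):

    #define POOL_GROW_STEP 20000          /* Constant growth step, in IDs. */

    struct lazy_mark_pool {
        pthread_spinlock_t lock;
        uint32_t *free_ids;    /* Stack of currently free IDs. */
        size_t n_free;         /* Number of entries in 'free_ids'. */
        size_t capacity;       /* Allocated size of 'free_ids'. */
        uint32_t next_unused;  /* First ID of the range not handed out yet. */
        uint32_t limit;        /* One past the last usable ID. */
    };

    static bool
    lazy_mark_pool_new_id(struct lazy_mark_pool *pool, uint32_t *id)
    {
        bool ok = false;

        pthread_spin_lock(&pool->lock);
        if (pool->n_free == 0 && pool->next_unused < pool->limit) {
            /* Out of cached IDs: grow by a constant step rather than
             * doubling, so the realloc pause taken while other threads
             * spin on the lock stays bounded and predictable. */
            uint32_t step = MIN(POOL_GROW_STEP,
                                pool->limit - pool->next_unused);

            pool->capacity += step;
            pool->free_ids = xrealloc(pool->free_ids,
                                      pool->capacity * sizeof *pool->free_ids);
            /* Push the new IDs in reverse so the smallest is popped first. */
            for (uint32_t i = step; i > 0; i--) {
                pool->free_ids[pool->n_free++] = pool->next_unused + i - 1;
            }
            pool->next_unused += step;
        }
        if (pool->n_free > 0) {
            *id = pool->free_ids[--pool->n_free];
            ok = true;
        }
        pthread_spin_unlock(&pool->lock);
        return ok;
    }

    static void
    lazy_mark_pool_free_id(struct lazy_mark_pool *pool, uint32_t id)
    {
        pthread_spin_lock(&pool->lock);
        /* 'capacity' always covers every ID handed out so far, so the
         * free path never needs to grow the array. */
        pool->free_ids[pool->n_free++] = id;
        pthread_spin_unlock(&pool->lock);
    }

With a 20k step and 4-byte entries, each expansion allocates 80 kB more, and the
footprint stays proportional to the number of IDs actually handed out rather than
to the full mark range.
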
I will respin the patch with a new version, probably with a new name instead of
seq-pool (I fear it might confuse readers regarding the struct 'seq' otherwise).
--
Gaetan Rivet