[ovs-dev] [OVN Patch v8 3/3] northd: Restore parallel build with dp_groups

Anton Ivanov anton.ivanov at cambridgegreys.com
Fri Oct 1 06:04:03 UTC 2021


On 01/10/2021 01:32, Han Zhou wrote:
>
>
> On Thu, Sep 30, 2021 at 2:03 PM Anton Ivanov 
> <anton.ivanov at cambridgegreys.com 
> <mailto:anton.ivanov at cambridgegreys.com>> wrote:
>
>     On 30/09/2021 20:48, Han Zhou wrote:
>>
>>
>>     On Thu, Sep 30, 2021 at 7:34 AM Anton Ivanov
>>     <anton.ivanov at cambridgegreys.com
>>     <mailto:anton.ivanov at cambridgegreys.com>> wrote:
>>
>>         Summary of findings.
>>
>>         1. The numbers from the perf test do not align with heater,
>>         which is much closer to a realistic load. On some tests where
>>         heater gives a 5-10% end-to-end improvement with
>>         parallelization, we get worse results with the perf test. You
>>         spotted this one correctly.
>>
>>         Example of the northd averages, pulled out of the test
>>         report via grep and sed:
>>
>>            127.489353
>>            131.509458
>>            116.088205
>>            94.721911
>>            119.629756
>>            114.896258
>>            124.811069
>>            129.679160
>>            106.699905
>>            134.490338
>>            112.106713
>>            135.957658
>>            132.471111
>>            94.106849
>>            117.431450
>>            115.861592
>>            106.830657
>>            132.396905
>>            107.092542
>>            128.945760
>>            94.298464
>>            120.455510
>>            136.910426
>>            134.311765
>>            115.881292
>>            116.918458
>>
>>         These values are all over the place, so this is not a
>>         reproducible test.
>>
>>         2. In the present state you need to re-run it 30+ times and
>>         take an average. The standard deviation of the values for
>>         the northd loop is > 10%. Compared to that, the
>>         reproducibility of ovn-heater is significantly better: I
>>         usually get less than a 0.5% difference between runs if
>>         there were no iteration failures. I would suggest using
>>         that instead for performance comparisons until we have
>>         figured out what affects the perf-test.
>>
>>         3. It is using the short-term running average value in the
>>         reports, which is probably wrong because it is heavily
>>         skewed by the last several values.
>>
>>         I will look into all of these.
>>
>>     Thanks for the summary! However, I think there is a bigger
>>     problem (probably related to my environment) than the stability
>>     of the test (make check-perf TESTSUITEFLAGS="--rebuild") itself.
>>     As I mentioned in an earlier email, I observed even worse results
>>     with a large-scale topology closer to a real-world deployment of
>>     ovn-k8s, just testing with the command:
>>         ovn-nbctl --print-wait-time --wait=sb sync
>>
>>     This command simply triggers a change in the NB_Global table and
>>     waits for northd to complete the recompute and update the SB. It
>>     doesn't have to be the "sync" command; any change to the NB DB
>>     produces a similar result (e.g.: ovn-nbctl --print-wait-time
>>     --wait=sb ls-add ls1)
>>
>>     Without parallel:
>>     ovn-northd completion: 7807ms
>>
>>     With parallel:
>>     ovn-northd completion: 41267ms
>
>     Is this with current master or prior to these patches?
>
>     1. There was an issue prior to these patches where the hash was
>     not sized correctly on the first iteration, when loading a large
>     existing database for the first time. These numbers sound about
>     right for when that bug was around.
>
> The patches are included. The commit id is 9242f27f63 as mentioned in 
> my first email.
>
>     2. There should be NO DIFFERENCE in a single compute cycle on an
>     existing database between a run with and without parallel when dp
>     groups are enabled at present. This is because the first cycle
>     does not use parallel compute. It is disabled in order to achieve
>     the correct hash sizing for future cycles by auto-scaling the
>     hash.
>
> Yes, I understand this and I did enable dp-group for the above 
> "ovn-nbctl sync" test, so the number I showed above for "with 
> parallel" was for the 2nd run and onwards. For the first round the 
> result is exactly the same as without parallel.
>
> I just tried disabling dp-group for the large-scale "ovn-nbctl sync" 
> test (after some effort squeezing out memory space on my desktop), 
> and the result shows that parallel build performs slightly better 
> (although it is 3x slower than with dp-group & without parallel, 
> which is expected). The results are summarized together below:
>
> Without parallel, with dp-group:
> ovn-northd completion: 7807ms
>
> With parallel, with dp-group:
> ovn-northd completion: 41267ms
>
> Without parallel, without dp-group:
> ovn-northd completion: 27996ms
>
> With parallel, without dp-group:
> ovn-northd completion: 26584ms
>
> Now the interesting part:
> I implemented a POC of a hash-based mutex array that replaces the 
> rwlock in the function do_ovn_lflow_add_pd(), and the performance is 
> greatly improved for the dp-group test:
>
> with parallel, with dp-group (hash based mutex):
> ovn-northd completion: 5081ms
>
> This is 8x faster than the current parallel one and 30% faster than 
> without parallel. This result looks much more reasonable to me. My 
> theory is that when using parallel with dp-group, the rwlock 
> contention is causing the low CPU utilization of the threads and the 
> overall slowness on my machine. I will refine the POC to a formal 
> patch and send it for review, hopefully by tomorrow.

Cool. The older implementation, prior to switching to the rwlock, was 
based on exactly that approach.
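
For reference, the rough shape of what I have in mind (and, I assume, 
roughly what your POC does) is something like the sketch below. The 
names and sizes are purely illustrative, not the actual northd symbols:

    /* Illustrative sketch: a fixed array of mutexes indexed by the lflow
     * hash, so that two threads only contend when their flows map to the
     * same lock slot, instead of serialising on one global write lock. */
    #include "openvswitch/hmap.h"
    #include "ovs-thread.h"

    #define LFLOW_LOCK_SLOTS 1024              /* illustrative size */
    static struct ovs_mutex lflow_locks[LFLOW_LOCK_SLOTS];

    static void
    lflow_locks_init(void)
    {
        for (size_t i = 0; i < LFLOW_LOCK_SLOTS; i++) {
            ovs_mutex_init(&lflow_locks[i]);
        }
    }

    /* Hypothetical stand-in for the critical section in
     * do_ovn_lflow_add_pd(): only the slot covering 'hash' is taken. */
    static void
    lflow_add_locked(struct hmap *lflows, struct hmap_node *node, size_t hash)
    {
        struct ovs_mutex *lock = &lflow_locks[hash & (LFLOW_LOCK_SLOTS - 1)];

        ovs_mutex_lock(lock);
        /* Insert under the per-slot lock (a real implementation would
         * first look up an existing lflow with this hash and merge).
         * Note that hmap_insert_fast() still bumps the shared element
         * count, which is exactly the issue described below. */
        hmap_insert_fast(lflows, node, hash);
        ovs_mutex_unlock(lock);
    }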

I found a couple of issues with it, which is why I switched to an rwlock.

Namely, access to the lflow hash size is not controlled, so the hash 
size ends up corrupted because different threads modify it without a 
lock. In the worst case you end up with a dog's breakfast across that 
entire cache line.

So you need a couple of extra macros to insert fast without touching 
the hash size.
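
Roughly something like this (again illustrative, not the actual 
macros), called under the per-slot lock and assuming the lock array is 
sized so that hashes sharing a bucket also share a lock slot, and that 
the map is never expanded during the parallel phase:

    /* Illustrative: link the node into its bucket without touching the
     * shared element count 'n', so concurrent inserters do not race on
     * it.  Each thread keeps its own insert count instead, and the
     * per-thread counts are folded back in after the threads join. */
    static inline void
    lflow_hmap_insert_no_count(struct hmap *hmap, struct hmap_node *node,
                               size_t hash)
    {
        struct hmap_node **bucket = &hmap->buckets[hash & hmap->mask];

        node->hash = hash;
        node->next = *bucket;
        *bucket = node;
        /* Deliberately no hmap->n++ and no expansion check here. */
    }

    /* Later, in single-threaded code:
     *     lflows->n += sum of the per-thread insert counts;  */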

This in turn leaves you with a hash that is not sized correctly for the 
post-processing lflow searches and reconciliation. You will probably 
need the post-processing optimization patch which I submitted a couple 
of weeks back: instead of using an HMAPX to hold the single flows and 
modifying the lflow hash in place, it rebuilds the hash completely and 
replaces the original one. At that point you have the right element 
count and the hash is resized to the optimum size.
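
In other words, something along these lines (a sketch with illustrative 
names; 'real_count' would come from the summed per-thread counters):

    /* Illustrative sketch: once the real element count is known, rebuild
     * the lflow map into a correctly sized hmap instead of resizing the
     * one whose count could not be maintained during parallel insertion. */
    struct lflow_entry {                    /* hypothetical stand-in */
        struct hmap_node node;
        /* ... flow data ... */
    };

    static void
    rebuild_lflow_map(struct hmap *lflows, size_t real_count)
    {
        struct hmap rebuilt = HMAP_INITIALIZER(&rebuilt);
        struct lflow_entry *e;

        lflows->n = real_count;             /* restore the true count */
        hmap_reserve(&rebuilt, real_count); /* size it once, up front */

        HMAP_FOR_EACH_POP (e, node, lflows) {
            hmap_insert(&rebuilt, &e->node, e->node.hash);
        }

        hmap_swap(lflows, &rebuilt);        /* hand back the right-sized map */
        hmap_destroy(&rebuilt);             /* now holds the old, empty map */
    }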

By the way, there is one more option - you may want to try switching to 
fat-rwlock (lib/fat-rwlock.h), which is supposed to decrease contention 
and make things faster, though that probably will not be enough.
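
The switch itself should be mechanical. A minimal sketch, assuming the 
same read/write split as the current code:

    /* Minimal sketch using OVS's fat_rwlock (lib/fat-rwlock.h), which
     * keeps per-thread state so the read-lock path scales better under
     * contention than a plain ovs_rwlock. */
    #include "fat-rwlock.h"

    static struct fat_rwlock lflow_lock;

    static void
    lflow_lock_init(void)
    {
        fat_rwlock_init(&lflow_lock);
    }

    static void
    lflow_lookup_path(void)
    {
        fat_rwlock_rdlock(&lflow_lock);
        /* ... look up / update an existing lflow ... */
        fat_rwlock_unlock(&lflow_lock);
    }

    static void
    lflow_insert_path(void)
    {
        fat_rwlock_wrlock(&lflow_lock);
        /* ... insert a brand new lflow ... */
        fat_rwlock_unlock(&lflow_lock);
    }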

I still do not get why your results are so different from the 
ovn-heater tests, but that is something I will look into separately.

Brgds,

>
> Thanks,
> Han


-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/


