[ovs-dev] [PATCH v3 0/9] OVSDB Relay Service Model. (Was: OVSDB 2-Tier deployment)

Ilya Maximets i.maximets at ovn.org
Thu Jul 15 22:33:50 UTC 2021


On 7/15/21 3:32 PM, Dumitru Ceara wrote:
> Hi Ilya,
> 
> On 7/14/21 6:52 PM, Ilya Maximets wrote:
>> On 7/14/21 3:50 PM, Ilya Maximets wrote:
>>> Replication can be used to scale out read-only access to the database.
>>> But some clients are not read-only, but read-mostly.
>>> One of the main examples is ovn-controller, which mostly monitors
>>> updates from the Southbound DB, but needs to claim ports by sending
>>> transactions that change some database tables.
>>>
>>> The Southbound database serves lots of connections: all connections
>>> from ovn-controllers and some service connections from the cloud
>>> infrastructure, e.g. OpenStack agents monitoring updates.
>>> At a high scale and with a large database, ovsdb-server spends too
>>> much time processing monitor updates, and this load needs to be moved
>>> somewhere else.  This patch set aims to introduce the functionality
>>> required to scale out read-mostly connections by introducing a new
>>> OVSDB 'relay' service model.
>>>
>>> In this new service model, ovsdb-server connects to an existing OVSDB
>>> server and maintains an in-memory copy of the database.  It serves
>>> read-only transactions and monitor requests on its own, but forwards
>>> write transactions to its relay source.
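>>>
>>> For illustration (the table, column and socket names here are just an
>>> example), both kinds of requests go to the same relay socket:
>>>
>>>   # Monitor requests and read-only operations are served from the
>>>   # relay's own in-memory copy of the database.
>>>   ovsdb-client monitor unix:db.sock OVN_Southbound Chassis
>>>
>>>   # A write transaction is accepted on the same socket, but forwarded
>>>   # to the relay source for the actual commit.
>>>   ovsdb-client transact unix:db.sock \
>>>     '["OVN_Southbound",
>>>       {"op": "update", "table": "SB_Global", "where": [],
>>>        "row": {"external_ids": ["map", [["foo", "bar"]]]}}]'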
>>>
>>> Key differences from the active-backup replication:
>>> - support for "write" transactions.
>>> - no on-disk storage. (probably, faster operation)
>>> - support for multiple remotes (connect to the clustered db).
>>> - doesn't try to keep the connection alive as long as possible, but
>>>   reconnects faster to other remotes to avoid missing updates.
>>> - no need to know the complete database schema beforehand,
>>>   only the schema name.
>>> - can be used along with other standalone and clustered databases
>>>   by the same ovsdb-server process. (doesn't turn the whole
>>>   jsonrpc server into read-only mode)
>>> - supports the modern version of monitors (monitor_cond_since),
>>>   because it is based on ovsdb-cs.
>>> - could be chained, i.e. multiple relays could be connected
>>>   one to another in a row or in a tree-like form (see the sketch
>>>   right after this list).
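>>>
>>> As a rough sketch of the multi-remote and chaining points above (all
>>> addresses and ports below are purely illustrative), a first-tier relay
>>> can connect to every member of a clustered Sb DB, and a second-tier
>>> relay can connect to that first-tier relay:
>>>
>>>   # First tier: relay connected to a 3-node clustered Sb DB.
>>>   ovsdb-server --remote=ptcp:16642 \
>>>     relay:OVN_Southbound:tcp:10.0.0.1:6642,tcp:10.0.0.2:6642,tcp:10.0.0.3:6642
>>>
>>>   # Second tier: relay connected to the first-tier relay above.
>>>   ovsdb-server --remote=ptcp:26642 relay:OVN_Southbound:tcp:10.0.0.4:16642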
>>>
>>> Bringing all of the above functionality to the existing active-backup
>>> replication doesn't look right, as it would make it less reliable for
>>> the actual backup use case.  It would also be much harder from the
>>> implementation point of view, because the current replication code is
>>> not based on ovsdb-cs or idl, so all the required features would
>>> likely be duplicated, or replication would have to be fully re-written
>>> on top of ovsdb-cs with severe modifications of the existing code.
>>>
>>> Relay sits somewhere in the middle between active-backup replication
>>> and the clustered model, taking a lot from both, and is therefore hard
>>> to implement on top of either of them.
>>>
>>> To run ovsdb-server in relay mode, a user simply needs to run:
>>>
>>>   ovsdb-server --remote=punix:db.sock relay:<schema-name>:<remotes>
>>>
>>> e.g.
>>>
>>>   ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:127.0.0.1:6642
>>>
>>> More details and examples are in the documentation added in the last
>>> patch of the series.
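>>>
>>> Clients then simply use the relay's address instead of the main Sb DB
>>> address.  For example, in an OVN deployment ovn-controller could be
>>> pointed at a relay like this (the address is, again, just a
>>> placeholder):
>>>
>>>   ovs-vsctl set open_vswitch . external_ids:ovn-remote=tcp:10.0.0.4:16642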
>>>
>>> I actually tried to implement transaction forwarding on top of
>>> active-backup replication in v1 of this series, but it required
>>> a lot of tricky changes, including schema format changes in order
>>> to bring the required information to the end clients, so I decided
>>> to fully rewrite the functionality in v2 with a different approach.
>>>
>>>
>>>  Testing
>>>  =======
>>>
>>> Some scale tests were performed with OVSDB relays, mimicking OVN
>>> workloads with ovn-kubernetes.
>>> Tests were performed with ovn-heater (https://github.com/dceara/ovn-heater)
>>> on the ocp-120-density-heavy scenario:
>>>  https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml
>>> In short, the test gradually creates a lot of OVN resources and
>>> checks that the network is configured correctly (by pinging different
>>> namespaces).  The test includes 120 chassis (created by
>>> ovn-fake-multinode), 31250 LSPs spread evenly across 120 LSes, and 3
>>> LBs with 15625 VIPs each attached to all node LSes, etc.  The test
>>> was performed with monitor-all=true.
>>>
>>> Note 1:
>>>  - Memory consumption is checked at the end of a test in the following
>>>    way: 1) check RSS, 2) compact the database, 3) check RSS again
>>>    (see the commands right after this note).
>>>    It's observed that ovn-controllers in this test are fairly slow
>>>    and a backlog builds up on monitors, because ovn-controllers are
>>>    not able to receive updates fast enough.  This contributes to the
>>>    RSS of the process, especially in combination with a glibc bug
>>>    (glibc doesn't free fastbins back to the system).  Memory trimming
>>>    on compaction is enabled in the test, so after compaction we can
>>>    see a more or less real value of the RSS at the end of the test
>>>    without the backlog noise.  (Compaction on a relay in this case is
>>>    just a plain malloc_trim()).
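>>>
>>>    For reference, that check amounts to something like the following
>>>    (assuming a single ovsdb-server process per host; otherwise the
>>>    ovs-appctl target needs the full path to the control socket):
>>>
>>>      # RSS before compaction.
>>>      grep VmRSS /proc/$(pidof ovsdb-server)/status
>>>      # Enable returning heap memory to the system on compaction,
>>>      # then compact the database.
>>>      ovs-appctl -t ovsdb-server ovsdb-server/memory-trim-on-compaction on
>>>      ovs-appctl -t ovsdb-server ovsdb-server/compact
>>>      # RSS after compaction + malloc_trim().
>>>      grep VmRSS /proc/$(pidof ovsdb-server)/status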
>>>
>>> Note 2:
>>>  - I didn't collect memory consumption (RSS) after compaction for the
>>>    test with 10 relays, because I got the idea only after the test
>>>    had finished and another one had already started, and a run takes
>>>    a significant amount of time.  So, values marked with a star (*)
>>>    are an approximation based on results from other tests and hence
>>>    might not be fully correct.
>>>
>>> Note 3:
>>>  - 'Max. poll' is the maximum of the 'long poll intervals' logged by
>>>    ovsdb-server during the test (see the commands right after this
>>>    note).  Poll intervals that involved database compaction (huge
>>>    disk writes) are the same in all tests and are excluded from the
>>>    results.  (The Sb DB size in the test is 256MB, fully compacted.)
>>>    'Number of intervals' is simply the number of logged unreasonably
>>>    long poll intervals.
>>>    Also note that ovsdb-server logs only compactions that took > 1s,
>>>    so poll intervals that involved a compaction shorter than 1s cannot
>>>    be reliably excluded from the test results.
>>>    'central' - main Sb DB servers.
>>>    'relay'   - relay servers connected to the central ones.
>>>    'before'/'after' - RSS before and after compaction + malloc_trim().
>>>    'time'    - total time the process spent in the Running state.
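>>>
>>>    For reference, the 'Max. poll' and 'Number of intervals' values can
>>>    be extracted from the ovsdb-server log with something like the
>>>    following (the log file name is just a placeholder):
>>>
>>>      # Number of logged unreasonably long poll intervals.
>>>      grep -c 'Unreasonably long' ovsdb-server-sb.log
>>>      # Maximum logged poll interval, in milliseconds.
>>>      grep -oE 'Unreasonably long [0-9]+ms' ovsdb-server-sb.log \
>>>        | grep -oE '[0-9]+' | sort -n | tail -1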
>>>
>>>
>>> Baseline (3 main servers, 0 relays):
>>> ++++++++++++++++++++++++++++++++++++++++
>>>
>>>                RSS
>>> central  before    after    clients  time     Max. poll   Number of intervals
>>>          7552924   3828848   ~41     109:50   5882        1249
>>>          7342468   4109576   ~43     108:37   5717        1169
>>>          5886260   4109496   ~39      96:31   4990        1233
>>>          ---------------------------------------------------------------------
>>>              20G       12G   126     314:58   5882        3651
>>>
>>> 3x3 (3 main servers, 3 relays):
>>> +++++++++++++++++++++++++++++++
>>>
>>>                 RSS
>>> central  before    after    clients  time     Max. poll   Number of intervals
>>>          6228176   3542164   ~1-5    36:53    2174        358
>>>          5723920   3570616   ~1-5    24:03    2205        382
>>>          5825420   3490840   ~1-5    35:42    2214        309
>>>          ---------------------------------------------------------------------
>>>            17.7G     10.6G      9    96:38    2214        1049
>>>
>>> relay    before    after    clients  time     Max. poll   Number of intervals
>>>          2174328    726576    37     69:44    5216        627
>>>          2122144    729640    32     63:52    4767        625
>>>          2824160    751384    51     89:09    5980        627
>>>          ---------------------------------------------------------------------
>>>               7G      2.2G    120   222:45    5980        1879
>>>
>>> Total:   =====================================================================
>>>            24.7G     12.8G    129    319:23   5980        2928
>>>
>>> 3x10 (3 main servers, 10 relays):
>>> +++++++++++++++++++++++++++++++++
>>>
>>>                RSS
>>> central  before    after    clients  time    Max. poll   Number of intervals
>>>          6190892    ---      ~1-6    42:43   2041         634
>>>          5687576    ---      ~1-5    27:09   2503         405
>>>          5958432    ---      ~1-7    40:44   2193         450
>>>          ---------------------------------------------------------------------
>>>            17.8G   ~10G*       16   110:36   2503         1489
>>>
>>> relay    before    after    clients  time    Max. poll   Number of intervals
>>>          1331256    ---       9      22:58   1327         140
>>>          1218288    ---      13      28:28   1840         621
>>>          1507644    ---      19      41:44   2869         623
>>>          1257692    ---      12      27:40   1532         517
>>>          1125368    ---       9      22:23   1148         105
>>>          1380664    ---      16      35:04   2422         619
>>>          1087248    ---       6      18:18   1038           6
>>>          1277484    ---      14      34:02   2392         616
>>>          1209936    ---      10      25:31   1603         451
>>>          1293092    ---      12      29:03   2071         621
>>>          ---------------------------------------------------------------------
>>>            12.6G    5-7G*    120    285:11   2869         4319
>>>
>>> Total:   =====================================================================
>>>            30.4G    15-17G*  136    395:47   2869         5808
> 
> This is very cool, thanks for taking the time to share all this data!
> 
>>>
>>>
>>>  Conclusions from the test:
>>>  ==========================
>>>
>>> 1. Relays relieve a lot of pressure from the main Sb DB servers.
>>>    In my testing, total CPU time on the main servers goes down from
>>>    314 to 96-110 minutes, which is about 3 times lower.
>>>    During the test, the number of registered 'unreasonably long poll
>>>    interval's on the main servers goes down by 3-4 times.  At the
>>>    same time, the maximum duration of these intervals goes down by a
>>>    factor of 2.5.  This factor should be higher with an increased
>>>    number of clients.
>>>
>>> 2. Since the number of clients is significantly lower, memory
>>>    consumption of the main Sb DB servers also goes down by ~12%.
>>>
>>> 3. For the 3x3 test, total memory consumed by all processes increased
>>>    by only 6%, and total CPU usage increased by 1.2%.  Poll intervals
>>>    on relay servers are comparable to poll intervals on main servers
>>>    with no relays, but poll intervals on the main servers are
>>>    significantly better (see conclusion #1).  In general, it seems
>>>    that for this test running 3 relays next to 3 main Sb DB servers
>>>    significantly increases cluster stability and responsiveness
>>>    without a noticeable increase in memory or CPU usage.
>>>
>>> 4. For the 3x10 test, total memory consumed by all processes increased
>>>    by ~50-70%*, and total CPU usage increased by 26% compared with the
>>
>> ~50-70%* should be ~25-40%*.  I miscalculated because I used 10G from
>> the 3x3 test instead of 12G from the baseline.
>>
>>>    baseline setup.  At the same time, poll intervals on both the main
>>>    and relay servers are lower by a factor of 2-4 (depending on the
>>>    particular server).  In general, the cluster with 10 relays is much
>>>    more stable and responsive, with reasonably low memory consumption
>>>    and CPU time overhead.
>>>
>>>
> 
> Nice!
> 
>>>
>>> Future work:
>>> - Add support for transaction history (it could simply be inherited
>>>   from the transaction ids received from the relay source).  This
>>>   will allow clients to utilize monitor_cond_since while working
>>>   with a relay.
>>> - Possibly try to inherit min_index from the relay source to give
>>>   clients the ability to detect relays with stale data.
>>> - Probably, add support for both of the above to standalone databases,
>>>   so relays will be able to inherit them not only from clustered ones.
> 
> Nit: I don't think this should block the series but I think the above
> should be added to ovsdb/TODO.rst in a follow up patch.

Will do.  TODO.rst also needs some clean-up, as it seems that some bits
there are already implemented.

> 
> I just acked the single patch I hadn't acked in v2 (7/9) and left a
> minor comment on 5/9 (which can be fixed at apply time).
> 
> The series looks good to me.

Thanks, Mark and Dumitru!

I addressed the minor comment on patch 5/9 and applied the series to
master with a small rebase due to a memory leak fix that was accepted
in the meantime.

Best regards, Ilya Maximets.


