[ovs-dev] [PATCH v3 0/9] OVSDB Relay Service Model. (Was: OVSDB 2-Tier deployment)

Dumitru Ceara dceara at redhat.com
Thu Jul 15 13:32:59 UTC 2021


Hi Ilya,

On 7/14/21 6:52 PM, Ilya Maximets wrote:
> On 7/14/21 3:50 PM, Ilya Maximets wrote:
>> Replication can be used to scale out read-only access to the database,
>> but there are also clients that are not read-only but read-mostly.
>> One of the main examples is ovn-controller, which mostly monitors
>> updates from the Southbound DB but needs to claim ports by sending
>> transactions that change some database tables.
>>
>> The Southbound database serves lots of connections: all connections
>> from ovn-controllers and some service connections from the cloud
>> infrastructure, e.g. some OpenStack agents monitoring updates.
>> At a high scale and with a big database, ovsdb-server spends too
>> much time processing monitor updates, and this load needs to be
>> moved somewhere else.  This patch set aims to introduce the
>> functionality required to scale out read-mostly connections by
>> introducing a new OVSDB 'relay' service model.
>>
>> In this new service model, ovsdb-server connects to an existing OVSDB
>> server and maintains an in-memory copy of the database.  It serves
>> read-only transactions and monitor requests on its own, but forwards
>> write transactions to the relay source.
>>
>> Key differences from the active-backup replication:
>> - support for "write" transactions.
>> - no on-disk storage (hence, probably, faster operation).
>> - support for multiple remotes (connect to the clustered db).
>> - doesn't try to keep the connection alive as long as possible, but
>>   instead reconnects to other remotes faster to avoid missing updates.
>> - no need to know the complete database schema beforehand,
>>   only the schema name.
>> - can be used along with other standalone and clustered databases
>>   by the same ovsdb-server process (doesn't turn the whole
>>   jsonrpc server into read-only mode).
>> - supports the modern version of monitors (monitor_cond_since),
>>   because it is based on ovsdb-cs.
>> - can be chained, i.e. multiple relays can be connected
>>   to one another in a row or in a tree-like form.
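
Chaining sounds quite useful.  Just to confirm I read it right: a
second-level relay would point at the first one the same way the first
one points at the central cluster, e.g. something along these lines
(the exact sockets/remotes below are only my own illustration):

  # first-level relay, connected to the clustered Sb DB
  ovsdb-server --remote=punix:relay1.sock \
      relay:OVN_Southbound:tcp:127.0.0.1:6642

  # second-level relay, connected to the first relay
  ovsdb-server --remote=punix:relay2.sock \
      relay:OVN_Southbound:unix:relay1.sock
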
>>
>> Bringing all of the above functionality to the existing active-backup
>> replication doesn't look right, as it would make it less reliable
>> for the actual backup use case, and it would also be much harder
>> from the implementation point of view, because the current
>> replication code is not based on ovsdb-cs or the idl, so all the
>> required features would likely be duplicated, or replication would
>> have to be fully rewritten on top of ovsdb-cs with severe
>> modifications of the former.
>>
>> Relay sits somewhere in the middle between active-backup replication
>> and the clustered model, taking a lot from both, and is therefore
>> hard to implement on top of either of them.
>>
>> To run ovsdb-server in relay mode, a user simply needs to run:
>>
>>   ovsdb-server --remote=punix:db.sock relay:<schema-name>:<remotes>
>>
>> e.g.
>>
>>   ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:127.0.0.1:6642
>>
>> More details and examples in the documentation in the last patch
>> of the series.
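
As a side note for anyone trying this out: clients then simply point at
the relay's socket instead of the central servers.  A quick sanity
check with ovsdb-client could look like this (the table name below is
just an arbitrary example on my side):

  # list the databases served by the relay
  ovsdb-client list-dbs unix:db.sock

  # follow updates through the relay (read-only path)
  ovsdb-client monitor unix:db.sock OVN_Southbound SB_Global
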
>>
>> I actually tried to implement transaction forwarding on top of
>> active-backup replication in v1 of this series, but it required
>> a lot of tricky changes, including schema format changes, in order
>> to bring the required information to the end clients, so I decided
>> to fully rewrite the functionality in v2 with a different approach.
>>
>>
>>  Testing
>>  =======
>>
>> Some scale tests were performed with OVSDB relays, mimicking OVN
>> workloads with ovn-kubernetes.
>> Tests were performed with ovn-heater (https://github.com/dceara/ovn-heater)
>> on the scenario ocp-120-density-heavy:
>>  https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml
>> In short, the test gradually creates a lot of OVN resources and
>> checks that the network is configured correctly (by pinging different
>> namespaces).  The test includes 120 chassis (created by
>> ovn-fake-multinode), 31250 LSPs spread evenly across 120 LSes, 3 LBs
>> with 15625 VIPs each, attached to all node LSes, etc.  The test was
>> performed with monitor-all=true.
>>
>> Note 1:
>>  - Memory consumption is checked at the end of a test in the following
>>    way: 1) check RSS, 2) compact the database, 3) check RSS again.
>>    It's observed that ovn-controllers in this test are fairly slow,
>>    and a backlog builds up on monitors, because ovn-controllers are
>>    not able to receive updates fast enough.  This contributes to the
>>    RSS of the process, especially in combination with a glibc bug
>>    (glibc doesn't free fastbins back to the system).  Memory trimming
>>    on compaction is enabled in the test, so after compaction we can
>>    see a more or less real value of the RSS at the end of the test
>>    without the backlog noise.  (Compaction on a relay in this case is
>>    just a plain malloc_trim().)
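
For reference, I assume the per-server check boils down to something
like the following (the pidof/ctl-file details are just how I would
script it, not necessarily how it was actually done):

  # 1) RSS before compaction (assuming a single ovsdb-server process)
  grep VmRSS /proc/$(pidof ovsdb-server)/status

  # 2) trigger compaction (on a relay this is just malloc_trim())
  ovs-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact

  # 3) RSS after compaction
  grep VmRSS /proc/$(pidof ovsdb-server)/status
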
>>
>> Note 2:
>>  - I didn't collect memory consumption (RSS) after compaction for the
>>    test with 10 relays, because I got the idea only after that test
>>    had finished and another one had already started, and a run takes
>>    a significant amount of time.  So, values marked with a star (*)
>>    are an approximation based on results from other tests and hence
>>    might not be fully correct.
>>
>> Note 3:
>>  - 'Max. poll' is the maximum of the 'long poll intervals' logged by
>>    ovsdb-server during the test.  Poll intervals that involved database
>>    compaction (huge disk writes) are the same in all tests and excluded
>>    from the results.  (The Sb DB size in the test is 256MB, fully
>>    compacted.)  'Number of intervals' is just the number of logged
>>    unreasonably long poll intervals.
>>    Also note that ovsdb-server logs only compactions that took > 1s,
>>    so poll intervals that involved compaction but took under 1s cannot
>>    be reliably excluded from the test results.
>>    'central' - main Sb DB servers.
>>    'relay'   - relay servers connected to the central ones.
>>    'before'/'after' - RSS before and after compaction + malloc_trim().
>>    'time' - total time the process spent in the Running state.
>>
>>
>> Baseline (3 main servers, 0 relays):
>> ++++++++++++++++++++++++++++++++++++++++
>>
>>                RSS
>> central  before    after    clients  time     Max. poll   Number of intervals
>>          7552924   3828848   ~41     109:50   5882        1249
>>          7342468   4109576   ~43     108:37   5717        1169
>>          5886260   4109496   ~39      96:31   4990        1233
>>          ---------------------------------------------------------------------
>>              20G       12G   126     314:58   5882        3651
>>
>> 3x3 (3 main servers, 3 relays):
>> +++++++++++++++++++++++++++++++
>>
>>                 RSS
>> central  before    after    clients  time     Max. poll   Number of intervals
>>          6228176   3542164   ~1-5    36:53    2174        358
>>          5723920   3570616   ~1-5    24:03    2205        382
>>          5825420   3490840   ~1-5    35:42    2214        309
>>          ---------------------------------------------------------------------
>>            17.7G     10.6G      9    96:38    2214        1049
>>
>> relay    before    after    clients  time     Max. poll   Number of intervals
>>          2174328    726576    37     69:44    5216        627
>>          2122144    729640    32     63:52    4767        625
>>          2824160    751384    51     89:09    5980        627
>>          ---------------------------------------------------------------------
>>               7G      2.2G    120   222:45    5980        1879
>>
>> Total:   =====================================================================
>>            24.7G     12.8G    129    319:23   5980        2928
>>
>> 3x10 (3 main servers, 10 relays):
>> +++++++++++++++++++++++++++++++++
>>
>>                RSS
>> central  before    after    clients  time    Max. poll   Number of intervals
>>          6190892    ---      ~1-6    42:43   2041         634
>>          5687576    ---      ~1-5    27:09   2503         405
>>          5958432    ---      ~1-7    40:44   2193         450
>>          ---------------------------------------------------------------------
>>            17.8G   ~10G*       16   110:36   2503         1489
>>
>> relay    before    after    clients  time    Max. poll   Number of intervals
>>          1331256    ---       9      22:58   1327         140
>>          1218288    ---      13      28:28   1840         621
>>          1507644    ---      19      41:44   2869         623
>>          1257692    ---      12      27:40   1532         517
>>          1125368    ---       9      22:23   1148         105
>>          1380664    ---      16      35:04   2422         619
>>          1087248    ---       6      18:18   1038           6
>>          1277484    ---      14      34:02   2392         616
>>          1209936    ---      10      25:31   1603         451
>>          1293092    ---      12      29:03   2071         621
>>          ---------------------------------------------------------------------
>>            12.6G    5-7G*    120    285:11   2869         4319
>>
>> Total:   =====================================================================
>>            30.4G    15-17G*  136    395:47   2869         5808

This is very cool, thanks for taking the time to share all this data!

>>
>>
>>  Conclusions from the test:
>>  ==========================
>>
>> 1. Relays relieve a lot of pressure from the main Sb DB servers.
>>    In my testing, total CPU time on the main servers goes down from
>>    314 to 96-110 minutes, which is 3 times lower.
>>    During the test, the number of registered 'unreasonably long poll
>>    interval's on the main servers goes down by 3-4 times.  At the
>>    same time the maximum duration of these intervals goes down by a
>>    factor of 2.5.  Also, the factor should be higher with an
>>    increased number of clients.
>>
>> 2. Since the number of clients is significantly lower, memory
>>    consumption of the main Sb DB servers also goes down by ~12%.
>>
>> 3. For the 3x3 test, total memory consumed by all processes increased
>>    only by 6%, and total CPU usage increased by 1.2%.  Poll intervals
>>    on the relay servers are comparable to poll intervals on the main
>>    servers with no relays, but poll intervals on the main servers are
>>    significantly better (see conclusion #1).  In general, it seems
>>    that for this test running 3 relays next to the 3 main Sb DB
>>    servers significantly increases cluster stability and
>>    responsiveness without a noticeable increase in memory or CPU usage.
>>
>> 4. For the 3x10 test, total memory consumed by all processes increased
>>    by ~50-70%*, and total CPU usage increased by 26% compared with the
> 
> ~50-70%* should be ~25-40%*.  I miscalculated because I used 10G from
> the 3x3 test instead of 12G from the baseline.
> 
>>    baseline setup.  At the same time, poll intervals on both the main
>>    and relay servers are lower by a factor of 2-4 (depending on the
>>    particular server).  In general, the cluster with 10 relays is much
>>    more stable and responsive, with reasonably low memory consumption
>>    and CPU time overhead.
>>
>>

Nice!

>>
>> Future work:
>> - Add support for transaction history (it could simply be inherited
>>   from the transaction ids received from the relay source).  This
>>   will allow clients to utilize monitor_cond_since while working
>>   with a relay.
>> - Possibly try to inherit min_index from the relay source to give
>>   clients the ability to detect relays with stale data.
>> - Probably add support for both of the above to standalone databases,
>>   so relays will be able to inherit not only from clustered ones.

Nit: I don't think this should block the series, but I think the above
should be added to ovsdb/TODO.rst in a follow-up patch.

I just acked the single patch I hadn't acked in v2 (7/9) and left a
minor comment on 5/9 (which can be fixed at apply time).

The series looks good to me.

Regards,
Dumitru


