[ovs-dev] [PATCH v3 0/9] OVSDB Relay Service Model. (Was: OVSDB 2-Tier deployment)
Dumitru Ceara
dceara at redhat.com
Thu Jul 15 13:32:59 UTC 2021
Hi Ilya,
On 7/14/21 6:52 PM, Ilya Maximets wrote:
> On 7/14/21 3:50 PM, Ilya Maximets wrote:
>> Replication can be used to scale out read-only access to the database.
>> However, some clients are not strictly read-only but read-mostly.
>> One of the main examples is ovn-controller, which mostly monitors
>> updates from the Southbound DB, but needs to claim ports by sending
>> transactions that change some database tables.
>>
>> The Southbound database serves lots of connections: all connections
>> from ovn-controllers and some service connections from the cloud
>> infrastructure, e.g. OpenStack agents monitoring updates.
>> At high scale and with a large database, ovsdb-server spends too
>> much time processing monitor updates, and this load needs to be
>> moved somewhere else. This patch set introduces the functionality
>> required to scale out read-mostly connections via a new OVSDB
>> 'relay' service model.
>>
>> In this new service model ovsdb-server connects to an existing OVSDB
>> server and maintains an in-memory copy of the database. It serves
>> read-only transactions and monitor requests on its own, but forwards
>> write transactions to the relay source.
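>>
>> For example, a client connected to the relay issues ordinary OVSDB
>> requests; reads and monitors are answered locally, while a write
>> like the purely illustrative one below (table/column names from the
>> OVN Southbound schema, values made up, sent with the stock
>> ovsdb-client tool) is forwarded to the relay source and the reply is
>> passed back to the client:
>>
>>   ovsdb-client transact unix:db.sock \
>>     '["OVN_Southbound",
>>       {"op": "update",
>>        "table": "Port_Binding",
>>        "where": [["logical_port", "==", "lsp-1"]],
>>        "row": {"external_ids": ["map", [["claimed-by", "chassis-1"]]]}}]'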
>>
>> Key differences from the active-backup replication:
>> - support for "write" transactions.
>> - no on-disk storage. (probably faster operation)
>> - support for multiple remotes (connect to the clustered db).
>> - doesn't try to keep the connection alive as long as possible, but
>>   reconnects to other remotes faster to avoid missing updates.
>> - no need to know the complete database schema beforehand,
>>   only the schema name.
>> - can be used along with other standalone and clustered databases
>>   by the same ovsdb-server process. (doesn't turn the whole
>>   jsonrpc server into read-only mode)
>> - supports the modern version of monitors (monitor_cond_since),
>>   because it is based on ovsdb-cs.
>> - could be chained, i.e. multiple relays could be connected
>>   one to another in a row or in a tree-like form.
>>
>> Bringing all of the above functionality to the existing active-backup
>> replication doesn't look right, as it would make it less reliable
>> for the actual backup use case, and it would also be much harder
>> from an implementation point of view, because the current
>> replication code is not based on ovsdb-cs or idl; all the required
>> features would likely be duplicated, or replication would have to be
>> fully re-written on top of ovsdb-cs with severe modifications.
>>
>> Relay sits somewhere in the middle between active-backup replication
>> and the clustered model, taking a lot from both, and is therefore
>> hard to implement on top of either of them.
>>
>> To run ovsdb-server in relay mode, the user simply needs to run:
>>
>> ovsdb-server --remote=punix:db.sock relay:<schema-name>:<remotes>
>>
>> e.g.
>>
>> ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:127.0.0.1:6642
>>
>> More details and examples can be found in the documentation in the
>> last patch of the series.
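>>
>> For a clustered relay source, <remotes> can list several servers
>> (comma-separated, like other OVSDB remote lists), and since relays
>> can be chained, a second-level relay can simply point at a
>> first-level one. A rough sketch with made-up addresses:
>>
>>   # first-level relay connected to the Sb DB cluster
>>   ovsdb-server --remote=ptcp:16642 \
>>       relay:OVN_Southbound:tcp:10.0.0.1:6642,tcp:10.0.0.2:6642,tcp:10.0.0.3:6642
>>
>>   # second-level relay chained to the first-level relay
>>   ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:10.0.0.4:16642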
>>
>> I actually tried to implement transaction forwarding on top of
>> active-backup replication in v1 of this series, but it required
>> a lot of tricky changes, including schema format changes in order
>> to bring the required information to the end clients, so I decided
>> to fully rewrite the functionality in v2 with a different approach.
>>
>>
>> Testing
>> =======
>>
>> Some scale tests were performed with OVSDB relays, mimicking OVN
>> workloads with ovn-kubernetes.
>> Tests were performed with ovn-heater (https://github.com/dceara/ovn-heater)
>> on the ocp-120-density-heavy scenario:
>> https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml
>> In short, the test gradually creates a lot of OVN resources and
>> checks that the network is configured correctly (by pinging different
>> namespaces). The test includes 120 chassis (created by
>> ovn-fake-multinode), 31250 LSPs spread evenly across 120 LSes, 3 LBs
>> with 15625 VIPs each attached to all node LSes, etc. The test was
>> performed with monitor-all=true.
>>
>> Note 1:
>> - Memory consumption is checked at the end of a test in the following
>>   way: 1) check RSS, 2) compact the database, 3) check RSS again.
>>   It's observed that ovn-controllers in this test are fairly slow
>>   and a backlog builds up on monitors, because ovn-controllers are
>>   not able to receive updates fast enough. This contributes to the
>>   RSS of the process, especially in combination with a glibc bug
>>   (glibc doesn't free fastbins back to the system). Memory trimming
>>   on compaction is enabled in the test, so after compaction we can
>>   see a more or less real value of the RSS at the end of the test
>>   without backlog noise. (Compaction on a relay in this case is
>>   just a plain malloc_trim().)
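>>
>>   For reference, this kind of check can be done with the existing
>>   appctl commands, roughly as below (socket/pidfile paths and the
>>   database name are illustrative, not necessarily what was used in
>>   the test):
>>
>>     # let compaction return freed memory back to the system
>>     ovs-appctl -t /run/ovn/ovnsb_db.ctl ovsdb-server/memory-trim-on-compaction on
>>
>>     # check RSS, compact, check RSS again
>>     grep VmRSS /proc/$(cat /run/ovn/ovnsb_db.pid)/status
>>     ovs-appctl -t /run/ovn/ovnsb_db.ctl ovsdb-server/compact OVN_Southbound
>>     grep VmRSS /proc/$(cat /run/ovn/ovnsb_db.pid)/status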
>>
>> Note 2:
>> - I didn't collect memory consumption (RSS) after compaction for the
>>   test with 10 relays, because I got the idea only after the test
>>   had finished and another one had already started, and a run takes
>>   a significant amount of time. So, values marked with a star (*)
>>   are an approximation based on results from other tests and hence
>>   might not be fully correct.
>>
>> Note 3:
>> - 'Max. poll' is the maximum of the 'long poll intervals' (in
>>   milliseconds) logged by ovsdb-server during the test. Poll
>>   intervals that involved database compaction (huge disk writes)
>>   are the same in all tests and are excluded from the results.
>>   (The Sb DB size in the test is 256MB, fully compacted.)
>>   'Number of intervals' is just the number of logged unreasonably
>>   long poll intervals.
>>   Also note that ovsdb-server only logs compactions that took > 1s,
>>   so poll intervals that involved compaction but were under 1s
>>   cannot be reliably excluded from the test results.
>>   'central' - main Sb DB servers.
>>   'relay' - relay servers connected to the central ones.
>>   'before'/'after' - RSS (in kB) before and after compaction +
>>   malloc_trim().
>>   'time' - total time the process spent in the Running state.
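>>
>>   For what it's worth, both values can be extracted from the
>>   ovsdb-server logs, assuming the usual "Unreasonably long ...ms
>>   poll interval" warning format (log path below is illustrative):
>>
>>     # number of unreasonably long poll intervals
>>     grep -c 'Unreasonably long' ovsdb-server.log
>>
>>     # maximum poll interval, in ms
>>     grep -o 'Unreasonably long [0-9]*ms' ovsdb-server.log | \
>>         grep -o '[0-9]*' | sort -n | tail -1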
>>
>>
>> Baseline (3 main servers, 0 relays):
>> ++++++++++++++++++++++++++++++++++++++++
>>
>>                RSS
>> central   before     after      clients  time     Max. poll  Number of intervals
>>           7552924    3828848    ~41      109:50   5882       1249
>>           7342468    4109576    ~43      108:37   5717       1169
>>           5886260    4109496    ~39      96:31    4990       1233
>> ---------------------------------------------------------------------
>>           20G        12G        126      314:58   5882       3651
>>
>> 3x3 (3 main servers, 3 relays):
>> +++++++++++++++++++++++++++++++
>>
>>                RSS
>> central   before     after      clients  time     Max. poll  Number of intervals
>>           6228176    3542164    ~1-5     36:53    2174       358
>>           5723920    3570616    ~1-5     24:03    2205       382
>>           5825420    3490840    ~1-5     35:42    2214       309
>> ---------------------------------------------------------------------
>>           17.7G      10.6G      9        96:38    2214       1049
>>
>> relay     before     after      clients  time     Max. poll  Number of intervals
>>           2174328    726576     37       69:44    5216       627
>>           2122144    729640     32       63:52    4767       625
>>           2824160    751384     51       89:09    5980       627
>> ---------------------------------------------------------------------
>>           7G         2.2G       120      222:45   5980       1879
>>
>> Total: =====================================================================
>>           24.7G      12.8G      129      319:23   5980       2928
>>
>> 3x10 (3 main servers, 10 relays):
>> +++++++++++++++++++++++++++++++++
>>
>>                RSS
>> central   before     after      clients  time     Max. poll  Number of intervals
>>           6190892    ---        ~1-6     42:43    2041       634
>>           5687576    ---        ~1-5     27:09    2503       405
>>           5958432    ---        ~1-7     40:44    2193       450
>> ---------------------------------------------------------------------
>>           17.8G      ~10G*      16       110:36   2503       1489
>>
>> relay     before     after      clients  time     Max. poll  Number of intervals
>>           1331256    ---        9        22:58    1327       140
>>           1218288    ---        13       28:28    1840       621
>>           1507644    ---        19       41:44    2869       623
>>           1257692    ---        12       27:40    1532       517
>>           1125368    ---        9        22:23    1148       105
>>           1380664    ---        16       35:04    2422       619
>>           1087248    ---        6        18:18    1038       6
>>           1277484    ---        14       34:02    2392       616
>>           1209936    ---        10       25:31    1603       451
>>           1293092    ---        12       29:03    2071       621
>> ---------------------------------------------------------------------
>>           12.6G      5-7G*      120      285:11   2869       4319
>>
>> Total: =====================================================================
>>           30.4G      15-17G*    136      395:47   2869       5808
This is very cool, thanks for taking the time to share all this data!
>>
>>
>> Conclusions from the test:
>> ==========================
>>
>> 1. Relays relieve a lot of pressure from the main Sb DB servers.
>>    In my testing the total CPU time on the main servers goes down
>>    from 314 to 96-110 minutes, i.e. roughly 3 times lower.
>>    During the test, the number of registered 'unreasonably long
>>    poll interval's on the main servers goes down by a factor of 3-4.
>>    At the same time the maximum duration of these intervals goes
>>    down by a factor of 2.5. The factor should be higher with an
>>    increased number of clients.
>>
>> 2. Since the number of clients is significantly lower, the memory
>>    consumption of the main Sb DB servers also goes down by ~12%.
>>
>> 3. For the 3x3 test, the total memory consumed by all processes
>>    increased by only 6%, and total CPU usage increased by 1.2%.
>>    Poll intervals on the relay servers are comparable to poll
>>    intervals on the main servers with no relays, but poll intervals
>>    on the main servers are significantly better (see conclusion #1).
>>    In general, it seems that for this test running 3 relays next to
>>    3 main Sb DB servers significantly increases cluster stability
>>    and responsiveness without a noticeable increase in memory or
>>    CPU usage.
>>
>> 4. For the 3x10 test, the total memory consumed by all processes
>>    increased by ~50-70%*, and total CPU usage increased by 26%
>>    compared with the
>
> ~50-70%* should be ~25-40%*. I miscalculated because I used 10G from
> the 3x3 test instead of 12G from the baseline.
>
>>    baseline setup. At the same time, poll intervals on both the
>>    main and relay servers are lower by a factor of 2-4 (depending
>>    on the particular server). In general, the cluster with 10
>>    relays is much more stable and responsive, with reasonably low
>>    memory consumption and CPU time overhead.
>>
>>
Nice!
>>
>> Future work:
>> - Add support for transaction history (it could just be inherited
>>   from the transaction ids received from the relay source). This
>>   will allow clients to utilize monitor_cond_since while working
>>   with a relay.
>> - Possibly try to inherit min_index from the relay source to give
>>   clients the ability to detect relays with stale data.
>> - Probably add support for both of the above to standalone databases,
>>   so relays will be able to inherit them not only from clustered ones.
Nit: I don't think this should block the series, but I think the above
should be added to ovsdb/TODO.rst in a follow-up patch.
I just acked the single patch I hadn't acked in v2 (7/9) and left a
minor comment on 5/9 (which can be fixed at apply time).
The series looks good to me.
Regards,
Dumitru