[ovs-dev] [PATCH v3 0/9] OVSDB Relay Service Model. (Was: OVSDB 2-Tier deployment)
Ilya Maximets
i.maximets at ovn.org
Thu Jul 15 22:33:50 UTC 2021
On 7/15/21 3:32 PM, Dumitru Ceara wrote:
> Hi Ilya,
>
> On 7/14/21 6:52 PM, Ilya Maximets wrote:
>> On 7/14/21 3:50 PM, Ilya Maximets wrote:
>>> Replication can be used to scale out read-only access to the database.
>>> But there are clients that are not strictly read-only, only read-mostly.
>>> One of the main examples is ovn-controller, which mostly monitors
>>> updates from the Southbound DB, but needs to claim ports by sending
>>> transactions that change some database tables.
>>>
>>> The Southbound database serves lots of connections: all the
>>> connections from ovn-controllers plus some service connections from
>>> the cloud infrastructure, e.g. OpenStack agents monitoring updates.
>>> At high scale and with a large database, ovsdb-server spends too
>>> much time processing monitor updates, so this load needs to be
>>> moved somewhere else. This patch-set introduces the functionality
>>> required to scale out read-mostly connections via a new OVSDB
>>> 'relay' service model.
>>>
>>> In this new service model, ovsdb-server connects to an existing
>>> OVSDB server and maintains an in-memory copy of the database. It
>>> serves read-only transactions and monitor requests on its own, but
>>> forwards write transactions to the relay source.
>>>
>>> Key differences from the active-backup replication:
>>> - support for "write" transactions.
>>> - no on-disk storage. (probably, faster operation)
>>> - support for multiple remotes (connect to the clustered db).
>>> - doesn't try to keep the connection alive as long as possible, but
>>> reconnects to other remotes faster to avoid missing updates.
>>> - no need to know the complete database schema beforehand,
>>> only the schema name.
>>> - can be used along with other standalone and clustered databases
>>> by the same ovsdb-server process. (doesn't turn the whole
>>> jsonrpc server into read-only mode)
>>> - supports the modern version of monitors (monitor_cond_since),
>>> because it is based on ovsdb-cs.
>>> - could be chained, i.e. multiple relays could be connected
>>> one to another in a row or in a tree-like form (see the
>>> example right after this list).
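>>>
>>> As an illustration of chaining (addresses and ports here are purely
>>> hypothetical), a second-level relay simply uses a first-level relay
>>> as its relay source instead of the main cluster:
>>>
>>>   # First-level relay connected to the main Sb DB:
>>>   ovsdb-server --remote=ptcp:16642 relay:OVN_Southbound:tcp:10.0.0.1:6642
>>>
>>>   # Second-level relay connected to the first-level relay
>>>   # (assumed to be reachable at 10.0.0.4):
>>>   ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:10.0.0.4:16642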
>>>
>>> Bringing all of the above functionality to the existing active-backup
>>> replication doesn't look right, as it would make replication less
>>> reliable for the actual backup use case. It would also be much
>>> harder from the implementation point of view, because the current
>>> replication code is not based on ovsdb-cs or idl, so all the required
>>> features would likely be duplicated, or replication would have to be
>>> fully re-written on top of ovsdb-cs with severe modifications.
>>>
>>> Relay sits somewhere in the middle between active-backup replication
>>> and the clustered model, taking a lot from both, and is therefore
>>> hard to implement on top of either of them.
>>>
>>> To run ovsdb-server in relay mode, the user simply needs to run:
>>>
>>> ovsdb-server --remote=punix:db.sock relay:<schema-name>:<remotes>
>>>
>>> e.g.
>>>
>>> ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:127.0.0.1:6642
>>>
>>> More details and examples can be found in the documentation added
>>> in the last patch of the series.
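>>>
>>> For example, connecting a relay to a 3-node clustered Southbound DB
>>> could look like this (addresses are purely illustrative), with the
>>> remotes of all cluster members listed separated by commas:
>>>
>>>   ovsdb-server --remote=punix:db.sock \
>>>       relay:OVN_Southbound:tcp:10.0.0.1:6642,tcp:10.0.0.2:6642,tcp:10.0.0.3:6642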
>>>
>>> I actually tried to implement transaction forwarding on top of
>>> active-backup replication in v1 of this series, but it required
>>> a lot of tricky changes, including schema format changes in order
>>> to bring the required information to the end clients, so I decided
>>> to fully rewrite the functionality in v2 with a different approach.
>>>
>>>
>>> Testing
>>> =======
>>>
>>> Some scale tests that mimic OVN workloads with ovn-kubernetes were
>>> performed with OVSDB Relays.
>>> Tests were performed with ovn-heater (https://github.com/dceara/ovn-heater)
>>> on the scenario ocp-120-density-heavy:
>>> https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml
>>> In short, the test gradually creates a lot of OVN resources and
>>> checks that the network is configured correctly (by pinging different
>>> namespaces). The test includes 120 chassis (created by
>>> ovn-fake-multinode), 31250 LSPs spread evenly across 120 LSes, and
>>> 3 LBs with 15625 VIPs each, attached to all node LSes, etc. Tests
>>> were performed with monitor-all=true.
>>>
>>> Note 1:
>>> - Memory consumption is checked at the end of a test in the following
>>> way: 1) check RSS, 2) compact the database, 3) check RSS again.
>>> It's observed that ovn-controllers in this test are fairly slow
>>> and a backlog builds up on monitors, because ovn-controllers are
>>> not able to receive updates fast enough. This contributes to the
>>> RSS of the process, especially in combination with a glibc bug
>>> (glibc doesn't free fastbins back to the system). Memory trimming
>>> on compaction is enabled in the test, so after compaction we can
>>> see a more or less real value of the RSS at the end of the test
>>> without backlog noise. (Compaction on a relay in this case is
>>> just a plain malloc_trim()).
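>>>
>>> For reference, such a check can be done with standard tools, e.g.
>>> (the pidfile and control socket paths below are only illustrative):
>>>
>>>   # 1) RSS before compaction:
>>>   grep VmRSS /proc/$(cat /var/run/ovn/ovnsb_db.pid)/status
>>>   # 2) Compact the database (on a relay this boils down to malloc_trim()):
>>>   ovs-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact
>>>   # 3) RSS after compaction:
>>>   grep VmRSS /proc/$(cat /var/run/ovn/ovnsb_db.pid)/status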
>>>
>>> Note 2:
>>> - I didn't collect memory consumption (RSS) after compaction for the
>>> test with 10 relays, because I got the idea only after the test
>>> had finished and another one had already started, and a run takes
>>> a significant amount of time. So, values marked with a star (*)
>>> are an approximation based on results from other tests and hence
>>> might not be fully correct.
>>>
>>> Note 3:
>>> - 'Max. poll' is the maximum of the 'long poll intervals' logged by
>>> ovsdb-server during the test. Poll intervals that involved database
>>> compaction (huge disk writes) are the same in all tests and excluded
>>> from the results. (Sb DB size in the test is 256MB, fully
>>> compacted). 'Number of intervals' is just the number of logged
>>> unreasonably long poll intervals.
>>> Also note that ovsdb-server only logs compactions that took > 1s,
>>> so poll intervals that involved compaction but were under 1s cannot
>>> be reliably excluded from the test results.
>>> 'central' - main Sb DB servers.
>>> 'relay' - relay servers connected to the central ones.
>>> 'before'/'after' - RSS before and after compaction + malloc_trim().
>>> 'time' - total time the process spent in the Running state.
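>>>
>>> For reference, 'Max. poll' and 'Number of intervals' can be pulled
>>> from the ovsdb-server logs with something like the following (the
>>> log file name is illustrative):
>>>
>>>   # Number of logged long poll intervals:
>>>   grep -c 'Unreasonably long' ovsdb-server-sb.log
>>>   # Maximum logged poll interval in ms:
>>>   grep -o 'Unreasonably long [0-9]*ms' ovsdb-server-sb.log \
>>>       | grep -oE '[0-9]+' | sort -n | tail -1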
>>>
>>>
>>> Baseline (3 main servers, 0 relays):
>>> ++++++++++++++++++++++++++++++++++++++++
>>>
>>> central  RSS before  RSS after  clients    time  Max. poll  Number of intervals
>>>             7552924    3828848      ~41  109:50       5882                 1249
>>>             7342468    4109576      ~43  108:37       5717                 1169
>>>             5886260    4109496      ~39   96:31       4990                 1233
>>> -------------------------------------------------------------------------------
>>>                 20G        12G      126  314:58       5882                 3651
>>>
>>> 3x3 (3 main servers, 3 relays):
>>> +++++++++++++++++++++++++++++++
>>>
>>> central  RSS before  RSS after  clients    time  Max. poll  Number of intervals
>>>             6228176    3542164     ~1-5   36:53       2174                  358
>>>             5723920    3570616     ~1-5   24:03       2205                  382
>>>             5825420    3490840     ~1-5   35:42       2214                  309
>>> -------------------------------------------------------------------------------
>>>               17.7G      10.6G        9   96:38       2214                 1049
>>>
>>> relay    RSS before  RSS after  clients    time  Max. poll  Number of intervals
>>>             2174328     726576       37   69:44       5216                  627
>>>             2122144     729640       32   63:52       4767                  625
>>>             2824160     751384       51   89:09       5980                  627
>>> -------------------------------------------------------------------------------
>>>                  7G       2.2G      120  222:45       5980                 1879
>>>
>>> Total: =====================================================================
>>>               24.7G      12.8G      129  319:23       5980                 2928
>>>
>>> 3x10 (3 main servers, 10 relays):
>>> +++++++++++++++++++++++++++++++++
>>>
>>> central  RSS before  RSS after  clients    time  Max. poll  Number of intervals
>>>             6190892        ---     ~1-6   42:43       2041                  634
>>>             5687576        ---     ~1-5   27:09       2503                  405
>>>             5958432        ---     ~1-7   40:44       2193                  450
>>> -------------------------------------------------------------------------------
>>>               17.8G      ~10G*       16  110:36       2503                 1489
>>>
>>> relay    RSS before  RSS after  clients    time  Max. poll  Number of intervals
>>>             1331256        ---        9   22:58       1327                  140
>>>             1218288        ---       13   28:28       1840                  621
>>>             1507644        ---       19   41:44       2869                  623
>>>             1257692        ---       12   27:40       1532                  517
>>>             1125368        ---        9   22:23       1148                  105
>>>             1380664        ---       16   35:04       2422                  619
>>>             1087248        ---        6   18:18       1038                    6
>>>             1277484        ---       14   34:02       2392                  616
>>>             1209936        ---       10   25:31       1603                  451
>>>             1293092        ---       12   29:03       2071                  621
>>> -------------------------------------------------------------------------------
>>>               12.6G      5-7G*      120  285:11       2869                 4319
>>>
>>> Total: =====================================================================
>>>               30.4G    15-17G*      136  395:47       2869                 5808
>
> This is very cool, thanks for taking the time to share all this data!
>
>>>
>>>
>>> Conclusions from the test:
>>> ==========================
>>>
>>> 1. Relays relieve a lot of pressure from the main Sb DB servers.
>>> In my testing, total CPU time on the main servers goes down from
>>> 314 to 96-110 minutes, i.e. 3 times lower.
>>> During the test, the number of registered 'unreasonably long poll
>>> interval's on the main servers goes down by 3-4 times, and the
>>> maximum duration of these intervals goes down by a factor of 2.5.
>>> The factor should be even higher with an increased number of clients.
>>>
>>> 2. Since the number of clients is significantly lower, memory
>>> consumption of the main Sb DB servers also goes down by ~12%.
>>>
>>> 3. For the 3x3 test, total memory consumed by all processes increased
>>> by only 6%, and total CPU usage increased by 1.2%. Poll intervals
>>> on relay servers are comparable to poll intervals on main servers
>>> with no relays, but poll intervals on the main servers are
>>> significantly better (see conclusion #1). In general, it seems that
>>> for this test running 3 relays next to 3 main Sb DB servers
>>> significantly increases cluster stability and responsiveness without
>>> a noticeable increase in memory or CPU usage.
>>>
>>> 4. For the 3x10 test, total memory consumed by all processes increased
>>> by ~50-70%*, and total CPU usage increased by 26% compared with the
>>
>> ~50-70%* should be ~25-40%*. I miscalculated because I used 10G from
>> the 3x3 test instead of 12G from the baseline.
>>
>>> baseline setup. At the same time, poll intervals on both main
>>> and relay servers are lower by a factor of 2-4 (depending on the
>>> particular server). In general, the cluster with 10 relays is much
>>> more stable and responsive, with reasonably low memory consumption
>>> and CPU time overhead.
>>>
>>>
>
> Nice!
>
>>>
>>> Future work:
>>> - Add support for transaction history (it could simply be inherited
>>> from the transaction ids received from the relay source). This
>>> will allow clients to utilize monitor_cond_since while working
>>> with a relay.
>>> - Possibly try to inherit min_index from the relay source to give
>>> clients the ability to detect relays with stale data.
>>> - Probably, add support for both of the above to standalone databases,
>>> so relays will be able to inherit them not only from clustered ones.
>
> Nit: I don't think this should block the series but I think the above
> should be added to ovsdb/TODO.rst in a follow-up patch.
Will do. TODO.rst also needs some clean-up, as it seems that some of
the bits there are already implemented.
>
> I just acked the single patch I hadn't acked in v2 (7/9) and left a
> minor comment on 5/9 (which can be fixed at apply time).
>
> The series looks good to me.
Thanks, Mark and Dumitru!
I fixed the small comment on patch 5/9 and applied the series to master
with a minor rebase due to a memory leak fix that got accepted in the
meantime.
Best regards, Ilya Maximets.