[ovs-dev] [PATCH v3 0/9] OVSDB Relay Service Model. (Was: OVSDB 2-Tier deployment)
Mark Gray
mark.d.gray at redhat.com
Thu Jul 15 16:45:15 UTC 2021
On 14/07/2021 14:50, Ilya Maximets wrote:
> Replication can be used to scale out read-only access to the database.
> But there are clients that are not read-only, but read-mostly.
> One of the main examples is ovn-controller, which mostly monitors
> updates from the Southbound DB but needs to claim ports by sending
> transactions that change some database tables.
>
> The Southbound database serves a lot of connections: all the
> connections from ovn-controllers plus some service connections from
> the cloud infrastructure, e.g. OpenStack agents monitoring updates.
> At high scale, and with a large database, ovsdb-server spends too
> much time processing monitor updates, so this load needs to be moved
> somewhere else. This patch set aims to introduce the functionality
> required to scale out read-mostly connections by introducing a new
> OVSDB 'relay' service model.
>
> In this new service model, ovsdb-server connects to an existing OVSDB
> server and maintains an in-memory copy of the database. It serves
> read-only transactions and monitor requests on its own, but forwards
> write transactions to the relay source.
>
> Key differences from active-backup replication:
> - support for "write" transactions.
> - no on-disk storage (likely faster operation).
> - support for multiple remotes (connect to a clustered db).
> - doesn't try to keep a connection open as long as possible;
> instead reconnects quickly to other remotes to avoid missing
> updates.
> - no need to know the complete database schema beforehand,
> only the schema name.
> - can be used along with other standalone and clustered databases
> by the same ovsdb-server process (doesn't turn the whole
> jsonrpc server to read-only mode).
> - supports the modern version of monitors (monitor_cond_since),
> because it is based on ovsdb-cs.
> - can be chained, i.e. multiple relays can be connected
> one to another in a row or in a tree-like form.
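As a concrete illustration of the chaining point above, a second-level relay can point at a first-level relay using the same relay:&lt;schema-name&gt;:&lt;remotes&gt; syntax shown later in this cover letter. The sockets and port numbers below are made up for the example; only the relay: syntax itself comes from the patch set:

```shell
# First-level relay: serves clients on TCP port 16642 (hypothetical)
# and connects to the main Sb DB server.
ovsdb-server --remote=ptcp:16642 \
    relay:OVN_Southbound:tcp:127.0.0.1:6642

# Second-level relay: chained off the first-level relay above.
ovsdb-server --remote=punix:db2.sock \
    relay:OVN_Southbound:tcp:127.0.0.1:16642
```

The same pattern extends to a tree: several second-level relays can all use the first-level relay as their source.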
>
> Bringing all of the above functionality to the existing active-backup
> replication doesn't look right, as it would make replication less
> reliable for the actual backup use case. It would also be much
> harder from the implementation point of view, because the current
> replication code is not based on ovsdb-cs or idl, so all the required
> features would likely be duplicated, or replication would have to be
> fully re-written on top of ovsdb-cs with severe modifications.
>
> Relay sits somewhere between active-backup replication and the
> clustered model, taking a lot from both, and is therefore hard to
> implement on top of either of them.
>
> To run ovsdb-server in relay mode, a user simply needs to run:
>
> ovsdb-server --remote=punix:db.sock relay:<schema-name>:<remotes>
>
> e.g.
>
> ovsdb-server --remote=punix:db.sock relay:OVN_Southbound:tcp:127.0.0.1:6642
>
> More details and examples are in the documentation in the last patch
> of the series.
>
> I actually tried to implement transaction forwarding on top of
> active-backup replication in v1 of this series, but it required
> a lot of tricky changes, including schema format changes, in order
> to bring the required information to the end clients, so I decided
> to fully rewrite the functionality in v2 with a different approach.
>
>
> Testing
> =======
>
> Some scale tests were performed with OVSDB relays, mimicking OVN
> workloads with ovn-kubernetes.
> Tests were performed with ovn-heater (https://github.com/dceara/ovn-heater)
> on the ocp-120-density-heavy scenario:
> https://github.com/dceara/ovn-heater/blob/master/test-scenarios/ocp-120-density-heavy.yml
> In short, the test gradually creates a lot of OVN resources and
> checks that the network is configured correctly (by pinging different
> namespaces). The test includes 120 chassis (created by
> ovn-fake-multinode), 31250 LSPs spread evenly across 120 LSes, and 3 LBs
> with 15625 VIPs each, attached to all node LSes. The test was
> performed with monitor-all=true.
>
> Note 1:
> - Memory consumption is checked at the end of a test in the following
> way: 1) check RSS, 2) compact the database, 3) check RSS again.
> It's observed that ovn-controllers in this test are fairly slow,
> and backlog builds up on monitors because ovn-controllers are
> not able to receive updates fast enough. This contributes to the
> RSS of the process, especially in combination with a glibc bug
> (glibc doesn't free fastbins back to the system). Memory trimming
> on compaction is enabled in the test, so after compaction we can
> see a more or less real value of the RSS at the end of the test
> without backlog noise. (Compaction on a relay in this case is
> just a plain malloc_trim().)
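For readers who want to repeat the check-RSS / compact / check-RSS procedure from Note 1, it can be scripted roughly as below. The `ovsdb-server/compact` appctl command triggers the compaction; the pidfile and ctl socket paths in the commented usage are assumptions about a typical OVN deployment and may differ on your system:

```shell
# Print the resident set size (in kB) of a process, read from /proc.
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# Usage sketch (paths below are deployment-specific assumptions):
#   PID=$(cat /var/run/ovn/ovnsb_db.pid)
#   BEFORE=$(rss_kb "$PID")
#   ovs-appctl -t /var/run/ovn/ovnsb_db.ctl ovsdb-server/compact
#   AFTER=$(rss_kb "$PID")
#   echo "RSS before=${BEFORE}kB after=${AFTER}kB"
```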
>
> Note 2:
> - I didn't collect memory consumption (RSS) after compaction for the
> test with 10 relays, because I got the idea only after the test
> was finished and another one had already started, and each run takes
> a significant amount of time. So, values marked with a star (*)
> are an approximation based on results from other tests, hence
> might not be fully accurate.
>
> Note 3:
> - 'Max. poll' is the maximum of the 'long poll intervals' logged by
> ovsdb-server during the test. Poll intervals that involved database
> compaction (huge disk writes) are the same in all tests and are
> excluded from the results. (Sb DB size in the test is 256MB, fully
> compacted.) 'Number of intervals' is just the number of logged
> unreasonably long poll intervals.
> Also note that ovsdb-server logs only compactions that took > 1s,
> so poll intervals that involved compaction but took under 1s cannot
> be reliably excluded from the test results.
> 'central' - main Sb DB servers.
> 'relay' - relay servers connected to the central ones.
> 'before'/'after' - RSS before and after compaction + malloc_trim().
> 'time' - total time the process spent in the Running state.
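The 'Max. poll' and 'Number of intervals' figures can be extracted from the logs with a helper like the one below. The warning text "Unreasonably long <N>ms poll interval" is the message ovsdb-server logs for long poll intervals; how you feed the log file in (path, rotation) is deployment-specific:

```shell
# Read ovsdb-server log lines on stdin; print the largest logged
# "Unreasonably long <N>ms poll interval" value, followed by the
# count of such warnings.
poll_interval_stats() {
    grep -o 'Unreasonably long [0-9]\+ms poll interval' |
        grep -o '[0-9]\+' |
        sort -n |
        awk '{max = $1; n++} END {print max, n}'
}
```

Usage would be e.g. `poll_interval_stats < ovsdb-server-sb.log`.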
>
>
> Baseline (3 main servers, 0 relays):
> ++++++++++++++++++++++++++++++++++++++++
>
>               RSS
> central   before    after     clients  time    Max. poll  Number of intervals
>           7552924   3828848   ~41      109:50  5882       1249
>           7342468   4109576   ~43      108:37  5717       1169
>           5886260   4109496   ~39      96:31   4990       1233
> ---------------------------------------------------------------------
>           20G       12G       126      314:58  5882       3651
>
> 3x3 (3 main servers, 3 relays):
> +++++++++++++++++++++++++++++++
>
>               RSS
> central   before    after     clients  time    Max. poll  Number of intervals
>           6228176   3542164   ~1-5     36:53   2174       358
>           5723920   3570616   ~1-5     24:03   2205       382
>           5825420   3490840   ~1-5     35:42   2214       309
> ---------------------------------------------------------------------
>           17.7G     10.6G     9        96:38   2214       1049
>
> relay     before    after     clients  time    Max. poll  Number of intervals
>           2174328   726576    37       69:44   5216       627
>           2122144   729640    32       63:52   4767       625
>           2824160   751384    51       89:09   5980       627
> ---------------------------------------------------------------------
>           7G        2.2G      120      222:45  5980       1879
>
> Total: =====================================================================
>           24.7G     12.8G     129      319:23  5980       2928
>
> 3x10 (3 main servers, 10 relays):
> +++++++++++++++++++++++++++++++++
>
>               RSS
> central   before    after     clients  time    Max. poll  Number of intervals
>           6190892   ---       ~1-6     42:43   2041       634
>           5687576   ---       ~1-5     27:09   2503       405
>           5958432   ---       ~1-7     40:44   2193       450
> ---------------------------------------------------------------------
>           17.8G     ~10G*     16       110:36  2503       1489
>
> relay     before    after     clients  time    Max. poll  Number of intervals
>           1331256   ---       9        22:58   1327       140
>           1218288   ---       13       28:28   1840       621
>           1507644   ---       19       41:44   2869       623
>           1257692   ---       12       27:40   1532       517
>           1125368   ---       9        22:23   1148       105
>           1380664   ---       16       35:04   2422       619
>           1087248   ---       6        18:18   1038       6
>           1277484   ---       14       34:02   2392       616
>           1209936   ---       10       25:31   1603       451
>           1293092   ---       12       29:03   2071       621
> ---------------------------------------------------------------------
>           12.6G     5-7G*     120      285:11  2869       4319
>
> Total: =====================================================================
>           30.4G     15-17G*   136      395:47  2869       5808
>
>
Thanks for running these and sharing the data. It looks promising, and it
is not a hugely intrusive change to the code base, so it looks good to me.

Acked-by: Mark D. Gray <mark.d.gray at redhat.com>
> Conclusions from the test:
> ==========================
>
> 1. Relays relieve a lot of pressure from the main Sb DB servers.
> In my testing, total CPU time on the main servers goes down from
> 314 to 96-110 minutes, i.e. by a factor of 3.
> During the test, the number of registered 'unreasonably long poll
> interval's on the main servers goes down by 3-4 times. At the same
> time, the maximum duration of these intervals goes down by a factor
> of 2.5. This factor should be higher with an increased number of
> clients.
>
> 2. Since the number of clients is significantly lower, memory
> consumption of the main Sb DB servers also goes down by ~12%.
>
> 3. For the 3x3 test, total memory consumed by all processes increased
> only by 6%, and total CPU usage increased by 1.2%. Poll intervals
> on the relay servers are comparable to poll intervals on the main
> servers with no relays, but poll intervals on the main servers are
> significantly better (see conclusion #1). In general, it seems that
> for this test, running 3 relays next to the 3 main Sb DB servers
> significantly increases cluster stability and responsiveness without
> a noticeable increase in memory or CPU usage.
>
> 4. For the 3x10 test, total memory consumed by all processes increased
> by ~50-70%*, and total CPU usage increased by 26% compared with the
> baseline setup. At the same time, poll intervals on both the main
> and relay servers are lower by a factor of 2-4 (depending on the
> particular server). In general, the cluster with 10 relays is much
> more stable and responsive, with reasonably low memory consumption
> and CPU time overhead.
>
>
>
> Future work:
> - Add support for transaction history (it could be simply inherited
> from the transaction ids received from the relay source). This
> will allow clients to utilize monitor_cond_since while working
> with a relay.
> - Possibly try to inherit min_index from the relay source to give
> clients the ability to detect relays with stale data.
> - Probably, add support for both of the above to standalone databases,
> so relays will be able to inherit them not only from clustered ones.
>
>
> Version 3:
> - Fixed issue with incorrect schema equality check.
> - Fixed transaction leak if inconsistent data received from the
> source.
> - Minor fixes for style, wording and typos.
>
> Version 2:
> - Dropped implementation on top of active-backup replication.
> - Implemented new 'relay' service model.
> - Updated documentation and wrote a separate topic with examples
> and ascii graphics. That's why v2 seems larger.
>
> Ilya Maximets (9):
> jsonrpc-server: Wake up jsonrpc session if there are completed
> triggers.
> ovsdb: storage: Allow setting the name for the unbacked storage.
> ovsdb: table: Expose functions to execute operations on ovsdb tables.
> ovsdb: row: Add support for xor-based row updates.
> ovsdb: New ovsdb 'relay' service model.
> ovsdb: relay: Add support for transaction forwarding.
> ovsdb: relay: Reflect connection status in _Server database.
> ovsdb: Make clients aware of relay service model.
> docs: Add documentation for ovsdb relay mode.
>
> Documentation/automake.mk | 1 +
> Documentation/ref/ovsdb.7.rst | 62 ++++-
> Documentation/topics/index.rst | 1 +
> Documentation/topics/ovsdb-relay.rst | 124 +++++++++
> NEWS | 3 +
> lib/ovsdb-cs.c | 15 +-
> ovsdb/_server.ovsschema | 7 +-
> ovsdb/_server.xml | 35 +--
> ovsdb/automake.mk | 4 +
> ovsdb/execution.c | 18 +-
> ovsdb/file.c | 2 +-
> ovsdb/jsonrpc-server.c | 3 +-
> ovsdb/ovsdb-client.c | 2 +-
> ovsdb/ovsdb-server.1.in | 27 +-
> ovsdb/ovsdb-server.c | 105 +++++---
> ovsdb/ovsdb.c | 11 +
> ovsdb/ovsdb.h | 9 +-
> ovsdb/relay.c | 385 +++++++++++++++++++++++++++
> ovsdb/relay.h | 38 +++
> ovsdb/replication.c | 83 +-----
> ovsdb/row.c | 30 ++-
> ovsdb/row.h | 6 +-
> ovsdb/storage.c | 13 +-
> ovsdb/storage.h | 2 +-
> ovsdb/table.c | 70 +++++
> ovsdb/table.h | 14 +
> ovsdb/transaction-forward.c | 182 +++++++++++++
> ovsdb/transaction-forward.h | 44 +++
> ovsdb/trigger.c | 49 +++-
> ovsdb/trigger.h | 41 +--
> python/ovs/db/idl.py | 16 ++
> tests/ovsdb-server.at | 85 +++++-
> tests/test-ovsdb.c | 6 +-
> 33 files changed, 1297 insertions(+), 196 deletions(-)
> create mode 100644 Documentation/topics/ovsdb-relay.rst
> create mode 100644 ovsdb/relay.c
> create mode 100644 ovsdb/relay.h
> create mode 100644 ovsdb/transaction-forward.c
> create mode 100644 ovsdb/transaction-forward.h
>