[ovs-discuss] OVN SB DB server overload when restarted at large scale environment

Han Zhou zhouhan at gmail.com
Thu Jan 24 23:36:58 UTC 2019


On Wed, Oct 31, 2018 at 4:34 PM Ben Pfaff <blp at ovn.org> wrote:
>
> On Tue, Oct 30, 2018 at 11:51:05PM -0700, Han Zhou wrote:
> > On Tue, Oct 30, 2018 at 11:15 AM Ben Pfaff <blp at ovn.org> wrote:
> > >
> > > On Wed, Oct 24, 2018 at 05:42:15PM -0700, Han Zhou wrote:
> > > > On Tue, Sep 25, 2018 at 10:18 AM Han Zhou <zhouhan at gmail.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Sep 20, 2018 at 4:43 PM Ben Pfaff <blp at ovn.org> wrote:
> > > > > >
> > > > > > On Thu, Sep 13, 2018 at 12:28:27PM -0700, Han Zhou wrote:
> > > > > > > In scalability tests with ovn-scale-test, the SB ovsdb-server load
> > > > > > > is not a problem, at least with 1k HVs. However, if we restart the
> > > > > > > ovsdb-server, then depending on the number of HVs and the scale of
> > > > > > > logical objects (e.g. the number of logical ports), the SB
> > > > > > > ovsdb-server becomes an obvious bottleneck.
> > > > > > >
> > > > > > > In our test with 1k HVs and 20k logical ports (200 lports * 100
> > > > > > > lswitches connected by one single logical router), restarting the
> > > > > > > SB ovsdb-server resulted in 100% CPU usage by ovsdb-server for more
> > > > > > > than 1 hour. All HVs (and northd) reconnect and resync the large
> > > > > > > amount of data at the same time. Considering the amount of data
> > > > > > > and the JSON-RPC cost, this is not surprising.
> > > > > > >
> > > > > > > At this scale, the SB ovsdb-server process has RES 303848KB before
> > > > > > > restart. It is likely that a big proportion of this is SB DB data
> > > > > > > that is going to be transferred to all 1,001 clients, which is
> > > > > > > about 300GB in total. With a 10Gbps NIC, even the pure network
> > > > > > > transmission would take ~5 minutes. Considering that the actual
> > > > > > > size of the JSON-RPC messages would be much bigger than the raw
> > > > > > > data, and the processing cost of the single-threaded ovsdb-server,
> > > > > > > 1 hour is reasonable.
> > > > > > >
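The estimate above can be redone as a quick calculation. This is only a
back-of-the-envelope sketch: it assumes the whole RES is sent to every
client, that KB means 1000 bytes, and that the NIC runs at full line
rate with no protocol overhead. None of these assumptions were measured.

```python
# Back-of-the-envelope estimate of the resync volume after an SB DB restart.
# Assumptions (not measured): the full RES is the per-client payload,
# 1 KB = 1000 bytes, and the NIC is driven at full line rate.
RES_KB = 303_848   # ovsdb-server resident size before the restart
CLIENTS = 1_001    # 1k HVs plus northd
NIC_GBPS = 10      # NIC line rate in gigabits per second

total_bytes = RES_KB * 1000 * CLIENTS
total_gb = total_bytes / 1e9
seconds = total_bytes * 8 / (NIC_GBPS * 1e9)

print(f"total data to transfer: {total_gb:.0f} GB")
print(f"time at line rate: {seconds / 60:.1f} minutes")
```

Under these assumptions the total comes to roughly 300GB and a few
minutes of pure transmission, the same order of magnitude as the
figures quoted above.
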
> > > > > > > In addition to the CPU cost of ovsdb-server, the memory consumption
> > > > > > > could also be a problem. Since all clients are syncing data from
> > > > > > > it, probably due to buffering, RES increased quickly and spiked to
> > > > > > > 10G at some point. After all the syncing finished, RES went back
> > > > > > > to a size similar to that before the restart. The client side
> > > > > > > (ovn-controller, northd) also saw a memory spike: the new snapshot
> > > > > > > of the whole DB arrives as one huge JSON-RPC message, so it is
> > > > > > > buffered until the whole message is received. RES peaked at double
> > > > > > > its original size, and then went back to the original size after
> > > > > > > the first round of processing of the new snapshot. This means that
> > > > > > > when deploying OVN, this memory spike should be considered for the
> > > > > > > SB DB restart scenario, especially on the central node.
> > > > > > >
> > > > > > > Here is some brainstorming on how we could improve on this (very
> > > > > > > rough at this stage). There are two directions: 1) reducing the
> > > > > > > size of data to be transferred; 2) scaling out ovsdb-server.
> > > > > > >
> > > > > > > 1) Reducing the size of data to be transferred.
> > > > > > >
> > > > > > > 1.1) Using BSON instead of JSON. It could reduce the size of data,
> > > > > > > but I am not sure yet how much it could help, since most of the
> > > > > > > data are strings. It might be even worse, since the bottleneck is
> > > > > > > not yet the network bandwidth but the processing power of
> > > > > > > ovsdb-server.
> > > > > > >
> > > > > > > 1.2) Move northd processing to HVs - only relevant NB data needs to
> > > > > > > be transferred, which is much smaller than the SB DB because there
> > > > > > > are no logical flows. However, this would lead to more processing
> > > > > > > load on ovn-controller on HVs. Also, it is a huge architecture
> > > > > > > change.
> > > > > > >
> > > > > > > 1.3) Incremental data transfer. The way the IDL works is like a
> > > > > > > cache. Now, when the connection is reset, the cache has to be
> > > > > > > rebuilt. But if we know the version of the current snapshot, then
> > > > > > > even when the connection is reset, the client can still
> > > > > > > communicate with the newly started server to work out the
> > > > > > > difference between its current data and the new data, so that only
> > > > > > > the delta is transferred, as if the server had not been restarted
> > > > > > > at all.
> > > > > > >
> > > > > > > 2) Scaling out the ovsdb-server.
> > > > > > >
> > > > > > > 2.1) Currently ovsdb-server is single-threaded, so that single
> > > > > > > thread has to take care of transmission to all clients with 100%
> > > > > > > CPU. If it is multi-threaded, more cores can be utilized to make
> > > > > > > this much faster.
> > > > > > >
> > > > > > > 2.2) Using an ovsdb cluster. This feature is supported already but
> > > > > > > I haven't tested it in this scenario yet. If everything works as
> > > > > > > expected, there can be 3 - 5 servers sharing the load, so the
> > > > > > > transfer should complete 3 - 5 times faster than it does right
> > > > > > > now. However, there is a limit on how many nodes there can be in a
> > > > > > > cluster, so the problem can be alleviated but may still be a
> > > > > > > problem if the data size grows.
> > > > > > >
> > > > > > > 2.3) Using read-only copies via ovsdb replication. If
> > > > > > > ovn-controller connects to read-only copies, we can deploy a large
> > > > > > > number of SB ovsdb-servers, which replicate from a common source:
> > > > > > > the read/write one populated by ovn-northd. It can be a
> > > > > > > multi-layered (2 - 3 layers is big enough) tree structure, so that
> > > > > > > each server only serves a small number of clients. However, today
> > > > > > > there are some scenarios that require ovn-controller to write data
> > > > > > > to SB, such as dynamic mac-binding (neighbor table populating),
> > > > > > > the nb_cfg sync feature, etc.
> > > > > > >
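To get a feel for the tree in 2.3, here is a rough sizing sketch. The
function name and the fan-out values are hypothetical, chosen only for
illustration. It assumes every server, including the read/write root,
serves at most `fanout` peers (replicas or clients), and it ignores the
write path mentioned above.

```python
def replica_layers(clients, fanout):
    """Return how many layers of read-only replicas are needed below the
    read/write root so that no server serves more than `fanout` peers.

    Zero layers means the root can serve all clients directly.
    """
    if clients < 1 or fanout < 2:
        raise ValueError("need at least 1 client and a fan-out of 2 or more")
    layers = 0
    capacity = fanout  # the root alone can serve `fanout` clients
    while capacity < clients:
        layers += 1
        capacity *= fanout  # each extra layer multiplies capacity by fanout
    return layers


# 1k HVs with at most 10 peers per server: two replica layers suffice.
print(replica_layers(1000, 10))
# Allowing 32 peers per server, a single replica layer is enough.
print(replica_layers(1000, 32))
```

With these assumed numbers the result stays within the 2 - 3 layers
mentioned above.
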
> > > > > > > These ideas are not mutually exclusive, and the order is random,
> > > > > > > just following my thought process. I think most of them are worth
> > > > > > > trying, but I am not sure about priority (except that 1.2 is
> > > > > > > almost out of the question, since I don't think it is a good idea
> > > > > > > to do any architecture-level change at this phase). Among the
> > > > > > > ideas, I think 1.3), 2.1) and 2.3) are the ones that should have
> > > > > > > the best result (if they can be implemented with reasonable
> > > > > > > effort).
> > > > > >
> > > > > > It sounds like reducing the size is essential, because you say that
> > > > > > the sheer quantity of data is 5 minutes worth of raw bandwidth.
> > > > > > Let's go through the options there.
> > > > > >
> > > > > > 1.1, using BSON instead of JSON, won't help sizewise.  See
> > > > > > http://bsonspec.org/faq.html.
> > > > > >
> > > > > > 1.2 would change the OVN architecture, so I don't think it's a good
> > > > > > idea.
> > > > > >
> > > > > > 1.3, incremental data transfer, is an idea that Andy Zhou explored a
> > > > > > little bit before he left.  There is some description of the
> > > > > > approach I suggested in ovn/TODO.rst:
> > > > > >
> > > > > >   * Reducing startup time.
> > > > > >
> > > > > >     As-is, if ovsdb-server restarts, every client will fetch a fresh
> > > > > >     copy of the part of the database that it cares about.  With
> > > > > >     hundreds of clients, this could cause heavy CPU load on
> > > > > >     ovsdb-server and use excessive network bandwidth.  It would be
> > > > > >     better to allow incremental updates even across connection loss.
> > > > > >     One way might be to use "Difference Digests" as described in
> > > > > >     Epstein et al., "What's the Difference? Efficient Set
> > > > > >     Reconciliation Without Prior Context".  (I'm not yet aware of
> > > > > >     previous non-academic use of this technique.)
> > > > > >
> > > > > > When Andy left VMware, the project got dropped, but it could be
> > > > > > picked up again.
> > > > > >
> > > > > > There are other ways to implement incremental data transfer, too.
> > > > > >
> > > > > > Scaling out ovsdb-server is a good idea too, but I think it's
> > > > > > probably less important for this particular case than reducing
> > > > > > bandwidth requirements.
> > > > > >
> > > > > > 2.1, multithreading, is also something that Andy explored; again,
> > > > > > the project would have to be resumed or restarted.
> > > > >
> > > > > Thanks Ben! It seems 1.3 (incremental data transfer) is the most
> > > > > effective approach to solve this problem.
> > > > > I briefly studied the "Difference Digests" paper. As I understand it,
> > > > > it is particularly useful when there is no prior context. However, in
> > > > > the OVSDB use case, especially in this OVN DB restart scenario, we do
> > > > > have the context about the last data received from the server. I
> > > > > think it would be more efficient (no full data scanning and encoding)
> > > > > and maybe simpler to implement based on the append-only (in most
> > > > > cases, for the part of data that hasn't been compacted yet) nature of
> > > > > OVSDB. Here is what I have in mind:
> > > > >
> > > > With some more study, here are some more details added inline:
> > > >
> > > > > - We need a versioning mechanism. Each transaction record in the
> > > > >   OVSDB file needs a unique version. The hash string may be used for
> > > > >   this purpose, so that the file format doesn't need to be changed.
> > > > >   If we allow the file format to be changed, it may be better to have
> > > > >   a version number that is sequentially increased.
> > > > >
> > > > For the standalone format, the hash value can be used as a unique
> > > > version id. (What's the chance of the 10-byte hash value having a
> > > > conflict?)
> > > > For the clustered DB format, the eid is perfect for this purpose.
> > > > (Sequentially increasing versions do not seem really necessary, since
> > > > we only need to keep a small number of historical transactions for the
> > > > DB restart scenario.)
> > > >
> > > > > - We can include the latest version number in every OVSDB
> > > > >   notification from server to client, and the client IDL records the
> > > > >   version.
> > > > >
> > > > > - When a client reconnects to the server, it can request to get only
> > > > >   the changes after “last version”.
> > > > >
> > > > > - When a server starts up, it reads the DB file and keeps track of the
> > > > >   “version” for the last N (e.g. 100) transactions, and maintains the
> > > > >   changes of those N transactions in memory.
> > > >
> > > > The current implementation keeps track in monitors for the transactions
> > > > that haven't been flushed to all clients. We can extend this by keeping
> > > > track of extra N previous transactions. Now for the DB restart scenario,
> > > > the previous N transactions are read from DB file.
> > > >
> > > > Currently, while reading the DB file, there is no monitor connected
> > > > yet, so replaying transactions will not trigger monitor data
> > > > population. We can create a fake monitor for all tables beforehand so
> > > > that we can reuse the code to populate monitor data while replaying DB
> > > > file transactions. When a real monitor is created, it copies and
> > > > converts the data in the fake monitor to its own table according to
> > > > the monitor criteria. Flushing the data for the fake monitor can be
> > > > done so that at most M transactions will be kept in the monitor
> > > > (because there is no real client to consume the fake monitor data).
> > > >
> > > > >
> > > > > - When a new client asks for data after a given “version”, if the
> > > > >   version is among the N transactions, the server sends the data
> > > > >   after it to the client. If the given version is not found (e.g.
> > > > >   by the time the client reconnects, a lot of changes have already
> > > > >   happened on the server and the old changes were already flushed
> > > > >   out, or a DB compaction has been performed and thus the
> > > > >   transaction data is gone), the server can:
> > > > >     - Option1: return an error telling the client the version is
> > > > >       not available, and the client can re-request the whole snapshot
> > > > >     - Option2: directly send the whole snapshot, with a flag
> > > > >       indicating this is the whole snapshot instead of the delta
> > > > >
> > > > Option2 seems better.
> > > >
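The bounded history and the Option2 fallback described above can be
sketched roughly as follows. This is a toy model, not ovsdb-server
code: the class and function names are hypothetical, version ids and
changes are opaque values, and per-monitor filtering is ignored.

```python
from collections import OrderedDict


class TxnHistory:
    """Hypothetical in-memory record of the last N transactions."""

    def __init__(self, max_txns=100):
        self.max_txns = max_txns
        self._txns = OrderedDict()  # version id -> change, oldest first

    def append(self, version, change):
        self._txns[version] = change
        while len(self._txns) > self.max_txns:
            self._txns.popitem(last=False)  # forget the oldest transaction

    def changes_since(self, last_version):
        """Return (found, changes made after last_version)."""
        if last_version not in self._txns:
            return False, None
        versions = list(self._txns)
        start = versions.index(last_version) + 1
        return True, [self._txns[v] for v in versions[start:]]


def reply_to_reconnect(history, snapshot, last_version):
    """Option2: send the delta if possible, else the flagged full snapshot."""
    found, changes = history.changes_since(last_version)
    if found:
        return {"full_snapshot": False, "data": changes}
    return {"full_snapshot": True, "data": snapshot}


history = TxnHistory(max_txns=3)
for version in ["v1", "v2", "v3", "v4"]:
    history.append(version, f"change-{version}")

print(reply_to_reconnect(history, "WHOLE-DB", "v3"))  # delta: only v4
print(reply_to_reconnect(history, "WHOLE-DB", "v1"))  # v1 evicted: snapshot
```

A client whose version has been flushed out gets the whole snapshot in
a single round trip, which is the reason Option2 looks preferable.
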
> > > > As to the OVSDB protocol change, since we need to add a version id to
> > > > every update notification, it would be better to have a new method
> > > > "monitor_cond_since":
> > > >
> > > > "method": "monitor_cond_since"
> > > > "params": [<db-name>, <json-value>, <monitor-cond-requests>,
> > > > <latest_version_id>]
> > > > "id": <nonnull-json-value>
> > > >
> > > > <latest_version_id> is the version id that identifies the latest data
> > > > the client already has. Everything else is the same as in the
> > > > "monitor_cond" method.
> > > >
> > > > The response object has the following members:
> > > > "result": [<found>, <latest_version_id>, <table-updates2>]
> > > > "error": null
> > > > "id": same "id" as request
> > > >
> > > > <found> is a boolean value that tells whether the <latest_version_id>
> > > > requested by the client is found in history or not. If true, the data
> > > > after that version up to current is sent. Otherwise, all data is sent.
> > > > <latest_version_id> is the version id that identifies the latest change
> > > > involved in this response, so that the client can keep track.
> > > > Subsequent changes will be notified to the client using the "update3"
> > > > method:
> > > >
> > > > "method": "update3"
> > > > "params": [<json-value>, <latest_version_id>, <table-updates2>]
> > > > "id": null
> > > >
> > > > Similar to the response to "monitor_cond_since", <latest_version_id> is
> > > > added in the update3 method.
> > > >
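From the client side, the proposed messages could be handled roughly as
in the sketch below. It only follows the message shapes given above;
the class name, the monitor id "mon1" and the version strings are
made up for illustration, and applying <table-updates2> to the local
replica is reduced to recording the updates.

```python
import json


def monitor_cond_since_request(db, monitor_id, cond_requests,
                               last_version, msg_id):
    """Build the proposed "monitor_cond_since" request."""
    return {
        "method": "monitor_cond_since",
        "params": [db, monitor_id, cond_requests, last_version],
        "id": msg_id,
    }


class ClientReplica:
    """Hypothetical client-side state: cached updates plus last version."""

    def __init__(self):
        self.last_version = None
        self.applied = []  # stand-in for the real IDL replica

    def handle_reply(self, reply):
        found, latest_version, updates = reply["result"]
        if not found:
            # Version unknown to the server: the reply carries the whole
            # snapshot, so the old copy must be dropped first.
            self.applied.clear()
        self.applied.append(updates)
        self.last_version = latest_version

    def handle_update3(self, notification):
        _monitor_id, latest_version, updates = notification["params"]
        self.applied.append(updates)
        self.last_version = latest_version


request = monitor_cond_since_request("OVN_Southbound", "mon1", {}, "txn-41", 1)
print(json.dumps(request))

replica = ClientReplica()
replica.handle_reply({"result": [True, "txn-50", {}], "error": None, "id": 1})
replica.handle_update3({"method": "update3",
                        "params": ["mon1", "txn-51", {}], "id": None})
print(replica.last_version)
```

After a reconnect, the client would pass its recorded last_version in
the next "monitor_cond_since" request.
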
> > > > > - The client will not destroy the old copy of data, unless the
> > > > >   requested version is not available and it has to reinitialize with
> > > > >   the whole DB.
> > > > >
> > > > > This is less general than the Difference Digests approach, but I
> > > > > think it is sufficient and more relevant for the OVSDB use case. I am
> > > > > sure there will be details that need more consideration, e.g. the
> > > > > OVSDB protocol update, caching in the server, etc., but do you think
> > > > > this is the right direction?
> > > >
> > > > Ben, please suggest if this is reasonable so that I could go ahead
> > > > with a POC, or please let me know if you see obvious
> > > > problems/pitfalls.
> > >
> > > I think that a lot of this analysis is on point.
> > >
> > > I'm pretty worried about the actual implementation.  The monitor code is
> > > already hard to understand.  I am not confident that there is a
> > > straightforward way to keep track of the last 100 transactions in a
> > > sensible way.  (Bear in mind that different clients may monitor
> > > different portions of the database.)  It might make sense to introduce
> > > a multiversioned data structure to keep track of the database
> > > content--maybe it would actually simplify some things.
> >
> > I understand the concern about the complexity. Thinking more about it, the
> > problem is only about DB restart, so we don't really need to keep track of
> > the last N transactions all the time. We only need to have them available
> > within a short period after the initial DB read. Once all the old clients
> > have reconnected and requested the difference, this data is useless, and
> > going forward, new clients won't need any previous transactions either. So
> > instead of always keeping track of the last N and maintaining a sliding
> > window, we just need the initial part, which can be generated by using the
> > "fake monitor" approach I mentioned before. Each new real monitor (created
> > after clients reconnect) then just selectively copies the data, with a
> > *timestamp* added to each data entry, so that we can free them after a
> > certain amount of time (e.g. 5 minutes). The entries that are NOT copied
> > from the fake monitor, which means they are new transactions, can be freed
> > any time their predecessors are freed. The "fake monitor" itself will have
> > a timestamp, too, so that it can be deleted after the initial period.
> > Would this simplify the problem? (or maybe my description makes it sound
> > more complex :))
> >
> > Could you explain a little more about the multiversioned data structure
> > idea? I am not sure I understand it correctly.
>
> Well, how do you plan to maintain the multiple copies of the database
> that will be necessary?  Presumably each monitor needs a copy of a
> slightly different database.  Or maybe I just don't understand your plan
> yet.
>

I sent out RFC patches for the solution. They are compatible with all
three modes of ovsdb-server, but only clustered mode can benefit for
now, because clustered mode is the only one that currently supports
transaction IDs, which are used to identify the data version in this
solution. Please take a look:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=88095

Thanks,
Han

> > > If we do this, we need solid and thorough tests to ensure that it's
> > > reliable.  It might make sense to start by thinking through the tests,
> > > rather than the implementation.
> >
> > Good point, I may start with tests first.

