[ovs-discuss] OVN SB DB server overload when restarted at large scale environment

Han Zhou zhouhan at gmail.com
Tue Sep 25 17:18:52 UTC 2018


On Thu, Sep 20, 2018 at 4:43 PM Ben Pfaff <blp at ovn.org> wrote:
>
> On Thu, Sep 13, 2018 at 12:28:27PM -0700, Han Zhou wrote:
> > In scalability tests with ovn-scale-test, ovsdb-server SB load is not a
> > problem, at least with 1k HVs. However, if we restart the ovsdb-server,
> > depending on the number of HVs and the scale of logical objects, e.g. the
> > number of logical ports, the SB ovsdb-server becomes an obvious
> > bottleneck.
> >
> > In our test with 1k HVs and 20k logical ports (200 lports * 100 lswitches
> > connected by one single logical router), restarting the SB ovsdb-server
> > resulted in 100% CPU usage by ovsdb-server for more than 1 hour. All HVs
> > (and northd) reconnect and resync the large amount of data at the same
> > time. Considering the amount of data and the JSON-RPC cost, this is not
> > surprising.
> >
> > At this scale, the SB ovsdb-server process has RES 303848KB before
> > restart. It is likely that a big proportion of this size is SB DB data
> > that is going to be transferred to all 1,001 clients, which is about
> > 300GB in total. With a 10Gbps NIC, even the pure network transmission
> > would take ~5 minutes. Considering that the actual size of the JSON-RPC
> > messages would be much bigger than the raw data, and the processing cost
> > of the single-threaded ovsdb-server, 1 hour is reasonable.
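
The back-of-the-envelope estimate above can be reproduced with a few lines of
Python. This is only a rough sketch: it assumes RES is reported in units of
1024 bytes, that the whole resident size is DB data sent in full to every
client, and that the link runs at its nominal 10Gbps with no protocol
overhead. The function and variable names are made up for illustration.

```python
RES_KB = 303848      # RES of the SB ovsdb-server before restart, from the test
CLIENTS = 1001       # 1k HVs plus northd
LINK_BPS = 10e9      # nominal 10Gbps NIC, assumed fully usable


def full_resync_estimate(res_kb, clients, link_bps):
    """Return (total bytes sent, seconds on the wire) for a full resync."""
    total_bytes = res_kb * 1024 * clients   # every client gets a full copy
    seconds = total_bytes * 8 / link_bps    # bytes -> bits, then divide by rate
    return total_bytes, seconds


total_bytes, seconds = full_resync_estimate(RES_KB, CLIENTS, LINK_BPS)
print(f"total: {total_bytes / 1e9:.0f} GB, wire time: {seconds / 60:.1f} min")
```

The result is an upper bound on the raw data only; the JSON encoding and the
single-threaded processing are not modelled.
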
> >
> > In addition to the CPU cost of ovsdb-server, the memory consumption could
> > also be a problem. Since all clients are syncing data from it, probably
> > due to buffering, RES increases quickly and spiked to 10G at some point.
> > After all the syncing finished, RES went back to a size similar to that
> > before the restart. The client side (ovn-controller, northd) also saw a
> > memory spike: the new snapshot of the whole DB arrives as one huge
> > JSON-RPC message, which is buffered until the whole message is received.
> > RES peaked at double its original size, and then went back to the
> > original size after the first round of processing of the new snapshot.
> > This means that when deploying OVN, this memory spike should be
> > considered for the SB DB restart scenario, especially on the central node.
> >
> > Here is some of my brainstorming on how we could improve on this (very
> > rough ideas at this stage).
> > There are two directions: 1) reducing the size of data to be transferred;
> > 2) scaling out ovsdb-server.
> >
> > 1) Reducing the size of data to be transferred.
> >
> > 1.1) Using BSON instead of JSON. It could reduce the size of data, but I
> > am not sure yet how much it could help since most of the data are
> > strings. It might be even worse, since the bottleneck is not yet the
> > network bandwidth but the processing power of ovsdb-server.
> >
> > 1.2) Move northd processing to HVs - only relevant NB data needs to be
> > transferred, which is much smaller than the SB DB because there are no
> > logical flows. However, this would lead to more processing load on
> > ovn-controller on HVs. Also, it is a huge architecture change.
> >
> > 1.3) Incremental data transfer. The way IDL works is like a cache. Now,
> > when the connection is reset, the cache has to be rebuilt. But if we know
> > the version of the current snapshot, then even when the connection is
> > reset, the client can still communicate with the newly started server to
> > work out the difference between its current data and the new data, so
> > that only the delta is transferred, as if the server had not been
> > restarted at all.
> >
> > 2) Scaling out the ovsdb-server.
> >
> > 2.1) Currently ovsdb-server is single-threaded, so that single thread has
> > to take care of transmission to all clients with 100% CPU. If it were
> > multi-threaded, more cores could be utilized to make this much faster.
> >
> > 2.2) Using ovsdb cluster. This feature is supported already but I haven't
> > tested it in this scenario yet. If everything works as expected, there
> > can be 3 - 5 servers sharing the load, so the transfer should be
> > completed 3 - 5 times faster than it is right now. However, there is a
> > limit to how many nodes there can be in a cluster, so the problem can be
> > alleviated but may still be a problem if the data size grows.
> >
> > 2.3) Using read-only copies via ovsdb replication. If ovn-controller
> > connects to read-only copies, we can deploy a large number of SB
> > ovsdb-servers, which replicate from a common source - the read/write one
> > populated by ovn-northd. It can be a multi-layered (2 - 3 layers is big
> > enough) tree structure, so that each server only serves a small number
> > of clients. However, today there are some scenarios that require
> > ovn-controller to write data to SB, such as dynamic mac-binding
> > (neighbor table populating), the nb_cfg sync feature, etc.
> >
> > These ideas are not mutually exclusive, and the order is arbitrary, just
> > following my thought process. I think most of them are worth trying, but
> > I am not sure about priority (except that 1.2 is almost out of the
> > question, since I don't think it is a good idea to make any
> > architecture-level change at this phase). Among the ideas, I think 1.3),
> > 2.1) and 2.3) are the ones that should have the best result (if they can
> > be implemented with reasonable effort).
>
> It sounds like reducing the size is essential, because you say that the
> sheer quantity of data is 5 minutes worth of raw bandwidth.  Let's go
> through the options there.
>
> 1.1, using BSON instead of JSON, won't help sizewise.  See
> http://bsonspec.org/faq.html.
>
> 1.2 would change the OVN architecture, so I don't think it's a good
> idea.
>
> 1.3, incremental data transfer, is an idea that Andy Zhou explored a
> little bit before he left.  There is some description of the approach I
> suggested in ovn/TODO.rst:
>
>   * Reducing startup time.
>
>     As-is, if ovsdb-server restarts, every client will fetch a fresh copy
>     of the part of the database that it cares about.  With hundreds of
>     clients, this could cause heavy CPU load on ovsdb-server and use
>     excessive network bandwidth.  It would be better to allow incremental
>     updates even across connection loss.  One way might be to use
>     "Difference Digests" as described in Epstein et al., "What's the
>     Difference? Efficient Set Reconciliation Without Prior Context".  (I'm
>     not yet aware of previous non-academic use of this technique.)
>
> When Andy left VMware, the project got dropped, but it could be picked
> up again.
>
> There are other ways to implement incremental data transfer, too.
>
> Scaling out ovsdb-server is a good idea too, but I think it's probably
> less important for this particular case than reducing bandwidth
> requirements.
>
> 2.1, multithreading, is also something that Andy explored; again, the
> project would have to be resumed or restarted.

Thanks Ben! It seems 1.3 (incremental data transfer) is the most effective
approach to solve this problem.
I briefly studied the "Difference Digests" paper. As I understand it, the
technique is particularly useful when there is no prior context. However,
in the OVSDB use case, especially in this OVN DB restart scenario, we do
have context: the last data received from the server. I think it would be
more efficient (no full data scanning and encoding) and maybe simpler to
implement something based on the append-only nature of OVSDB (in most
cases, for the part of the data that hasn't been compacted yet). Here is
what I have in mind:

- We need a versioning mechanism. Each transaction record in the OVSDB file
needs a unique version. The hash string may be used for this purpose, so
that the file format doesn't need to be changed. If we allow the file
format to be changed, it may be better to have a version number that
increases sequentially.

- We can include the latest version number in every OVSDB notification from
the server to the client, and the client IDL records the version.

- When a client reconnects to the server, it can request only the
changes after “last version”.

- When a server starts up, it reads the DB file, keeps track of the
“version” of the last N (e.g. 100) transactions, and keeps the changes made
by those N transactions in memory.

- When a client asks for data after a given “version”, if the version is
among the N transactions, the server sends the subsequent changes to the
client. If the given version is not found (e.g. by the time the client
reconnects, many changes have already happened on the server and the old
changes have been flushed out, or a DB compaction has been performed and
the transaction data is gone), the server can:
    - Option1: return an error telling the client the version is not
available, and the client can re-request the whole snapshot
    - Option2: directly send the whole snapshot, with a flag
indicating this is the whole snapshot instead of the delta

- The client does not discard its old copy of the data unless the requested
version is not available and it has to reinitialize with the whole DB.
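
To make the steps above concrete, here is a rough Python sketch of the idea.
It is not OVSDB code: the class and method names (DeltaServer, DeltaClient,
changes_since) are made up for illustration, the database is modelled as a
flat key/value dict, a sequential version number is assumed, and the server
follows Option2 (send the whole snapshot with a flag). Recovering the version
history from the DB file after a restart is not modelled.

```python
from collections import deque


class DeltaServer:
    """Hypothetical server that remembers the last N transactions."""

    def __init__(self, max_history=100):
        self.version = 0                          # sequential version number
        self.data = {}                            # current contents: key -> value
        self.history = deque(maxlen=max_history)  # (version, changes) pairs

    def commit(self, changes):
        """Apply one transaction; a value of None deletes the key."""
        self.version += 1
        for key, value in changes.items():
            if value is None:
                self.data.pop(key, None)
            else:
                self.data[key] = value
        self.history.append((self.version, dict(changes)))
        return self.version

    def changes_since(self, last_version):
        """Return a delta if the version is still known, else a snapshot."""
        if last_version == self.version:
            return {"full": False, "version": self.version, "changes": {}}
        if self.history and last_version is not None:
            oldest = self.history[0][0]
            if oldest - 1 <= last_version < self.version:
                merged = {}
                for version, changes in self.history:
                    if version > last_version:
                        merged.update(changes)    # later changes win
                return {"full": False, "version": self.version,
                        "changes": merged}
        # Option2: version unknown or flushed out, send the whole snapshot.
        return {"full": True, "version": self.version,
                "changes": dict(self.data)}


class DeltaClient:
    """Hypothetical IDL-like cache that survives reconnects."""

    def __init__(self):
        self.version = None                       # no data received yet
        self.cache = {}

    def sync(self, server):
        """Resync with the server; return True if a full snapshot was sent."""
        reply = server.changes_since(self.version)
        if reply["full"]:
            self.cache = {}                       # only now drop the old copy
        for key, value in reply["changes"].items():
            if value is None:
                self.cache.pop(key, None)
            else:
                self.cache[key] = value
        self.version = reply["version"]
        return reply["full"]
```

A client that reconnects within the last N transactions gets only the merged
delta; one that has fallen further behind gets the flagged full snapshot and
rebuilds its cache, which matches today's behaviour.
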

This is less general than the Difference Digests approach, but I think it
is sufficient and more relevant for the OVSDB use case. I am sure there will
be details that need more consideration, e.g. the OVSDB protocol update,
caching in the server, etc., but do you think this is the right direction?

Thanks,
Han