[ovs-discuss] OVN SB DB server overload when restarted at large scale environment

Han Zhou zhouhan at gmail.com
Thu Sep 13 19:28:27 UTC 2018


In scalability tests with ovn-scale-test, the load on the SB ovsdb-server is
not a problem, at least with 1k HVs. However, if we restart the ovsdb-server,
then depending on the number of HVs and the scale of logical objects (e.g. the
number of logical ports), the SB ovsdb-server becomes an obvious bottleneck.

In our test with 1k HVs and 20k logical ports (200 lports * 100 lswitches
connected by one single logical router), restarting the SB ovsdb-server
resulted in 100% CPU usage by ovsdb-server for more than 1 hour. All HVs (and
northd) reconnect and resync the large amount of data at the same time.
Considering the amount of data and the JSON-RPC cost, this is not
surprising.

At this scale, the SB ovsdb-server process has RES 303848KB before restart.
It is likely that a big proportion of this is SB DB data that has to be
transferred to each of the 1,001 clients, which is about 300GB in total.
With a 10Gbps NIC, even the pure network transmission would take ~5 minutes.
Considering that the actual JSON-RPC messages would be much bigger than the
raw data, and the processing cost of the single-threaded ovsdb-server, 1 hour
is reasonable.
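
The estimate above can be reproduced with a few lines of Python. This is only
a back-of-envelope sketch: it assumes the whole RES is DB data sent once to
every client, and it ignores JSON-RPC overhead and CPU cost, so it gives a
lower bound on wire time, not a prediction. The function names are made up
for illustration.

```python
# Back-of-envelope estimate using the figures quoted in this mail.
RES_KB = 303_848      # RES of the SB ovsdb-server before restart
CLIENTS = 1_001       # 1k HVs plus northd
NIC_GBPS = 10         # link speed assumed in the text


def total_bytes(res_kb, clients):
    """Bytes sent if every client downloads a copy of the whole RES."""
    return res_kb * 1024 * clients


def transfer_seconds(nbytes, gbps):
    """Pure wire time at the given link speed, with no protocol overhead."""
    return nbytes * 8 / (gbps * 1e9)


total = total_bytes(RES_KB, CLIENTS)
secs = transfer_seconds(total, NIC_GBPS)
print(f"total ~ {total / 1e9:.0f} GB, wire time ~ {secs / 60:.1f} min")
```

The result is a little over 300GB and roughly 4 minutes on the wire, in line
with the rough figures above.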

In addition to the CPU cost of ovsdb-server, the memory consumption could
also be a problem. Since all clients are syncing data from it, RES increases
quickly (probably due to buffering) and spiked to 10G at some point. After
all the syncing finished, RES went back to a size similar to that before the
restart. The client side (ovn-controller, northd) also saw a memory
spike - the new snapshot of the whole DB arrives as one huge JSON-RPC
message, which is buffered until the whole message is received -
RES peaked at double its original size, and then went back to
the original size after the first round of processing of the new snapshot.
This means that when deploying OVN, this memory spike should be taken into
account for the SB DB restart scenario, especially on the central node.

Here is some of my brainstorming on how we could improve on this (very
rough ideas at this stage).
There are two directions: 1) reducing the size of the data to be transferred;
2) scaling out ovsdb-server.

1) Reducing the size of data to be transferred.

1.1) Using BSON instead of JSON. It could reduce the size of the data, but I
am not sure yet how much it would help, since most of the data are strings.
It might even be worse, since the bottleneck is not yet the network bandwidth
but the processing power of ovsdb-server.

1.2) Move northd processing to HVs - only the relevant NB data needs to be
transferred, which is much smaller than the SB DB because there are no
logical flows. However, this would lead to more processing load on
ovn-controller on HVs. Also, it is a huge architecture change.

1.3) Incremental data transfer. The way IDL works is like a cache. Today,
when the connection is reset, the cache has to be rebuilt. But if we know the
version of the current snapshot, then even when the connection is reset, the
client can still tell the newly started server which version it holds, so
that the difference between its current data and the new data can be worked
out and only the delta is transferred, as if the server had not been
restarted at all.
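
To make 1.3) a bit more concrete, here is a toy sketch of the idea in Python.
It is not the OVSDB protocol or the IDL API: ToyServer, ToyClient, sync() and
resync() are hypothetical names, the "version" is just a counter, and the
sketch assumes the server keeps its full change log across the restart. A
real design would also need a way to tell that client and server share the
same history, which a bare counter does not give.

```python
class ToyServer:
    """Hypothetical server: a row store plus a log of versioned changes."""

    def __init__(self):
        self.version = 0
        self.rows = {}   # uuid -> row contents
        self.log = []    # (version, uuid, row), row is None for a delete

    def apply(self, uuid, row):
        """Apply one change and record it in the log."""
        self.version += 1
        if row is None:
            self.rows.pop(uuid, None)
        else:
            self.rows[uuid] = row
        self.log.append((self.version, uuid, row))

    def sync(self, client_version):
        """Return a delta if the client's version is known, else a snapshot."""
        if client_version is None or client_version > self.version:
            return "snapshot", self.version, dict(self.rows)
        delta = [(u, r) for v, u, r in self.log if v > client_version]
        return "delta", self.version, delta


class ToyClient:
    """Hypothetical client-side cache that remembers its last version."""

    def __init__(self):
        self.version = None
        self.cache = {}

    def resync(self, server):
        """Called after a reconnect; returns (kind, number of items sent)."""
        kind, version, payload = server.sync(self.version)
        if kind == "snapshot":
            self.cache = payload
        else:
            for uuid, row in payload:
                if row is None:
                    self.cache.pop(uuid, None)
                else:
                    self.cache[uuid] = row
        self.version = version
        return kind, len(payload)


server = ToyServer()
for i in range(3):
    server.apply(f"lport{i}", {"chassis": "hv1"})
client = ToyClient()
first = client.resync(server)     # first connection: full snapshot
server.apply("lport0", None)      # one change while the client is away
second = client.resync(server)    # reconnect: only that change is sent
```

In this sketch the first resync transfers all three rows, while the second
one transfers a single change, which is the behaviour described above.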

2) Scaling out the ovsdb-server.

2.1) Currently ovsdb-server is single-threaded, so that single thread has
to take care of the transmission to all clients, at 100% CPU. If it were
multi-threaded, more cores could be utilized to make this much faster.

2.2) Using an ovsdb cluster. This feature is supported already, but I haven't
tested it in this scenario yet. If everything works as expected, there can
be 3 - 5 servers sharing the load, so the transfer should complete 3 -
5 times faster than it does right now. However, there is a limit on how many
nodes there can be in a cluster, so the problem can be alleviated but may
still be a problem if the data size grows.

2.3) Using read-only copies via ovsdb replication. If ovn-controller
connects to read-only copies, we can deploy a large number of SB
ovsdb-servers that replicate from a common source - the read/write one
populated by ovn-northd. It can be a multi-layered tree structure (2 - 3
layers should be enough), so that each server only serves a small number of
clients. However, today there are some scenarios that require ovn-controller
to write data to SB, such as dynamic mac-binding (neighbor table population),
the nb_cfg sync feature, etc.
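
As a rough illustration of the tree in 2.3), the sketch below counts how many
read-only replicas each layer would need for a given number of clients and a
maximum number of connections per server. The fan-out values are arbitrary
assumptions for the example, not measured limits, and replica_layers() is a
hypothetical helper.

```python
import math


def replica_layers(clients, fanout):
    """Replica count per layer, leaf layer first, excluding the single
    read/write root. An empty list means the root can serve all clients."""
    layers = []
    n = clients
    while n > fanout:
        # Each server in the next layer up serves at most `fanout` peers.
        n = math.ceil(n / fanout)
        layers.append(n)
    return layers


small_fanout = replica_layers(1000, 10)   # [100, 10]
large_fanout = replica_layers(1000, 32)   # [32]
```

With at most 10 connections per server, 1k HVs need two layers of replicas
(100 leaf servers fed by 10 intermediate ones) under the root; with 32
connections per server a single layer of 32 replicas is enough.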

These ideas are not mutually exclusive, and the order is arbitrary, just
following my thought process. I think most of them are worth trying, but I am
not sure about priority (except that 1.2 is almost out of the question, since
I don't think it is a good idea to make any architecture-level change at this
phase). Among the ideas, I think 1.3), 2.1) and 2.3) are the ones that
should give the best results (if they can be implemented with reasonable
effort).

Any comments/suggestions are welcome!!

Thanks,
Han