[ovs-dev] General issue with database HA in ovsdb.

Anton Ivanov anton.ivanov at kot-begemot.co.uk
Mon Jul 13 09:03:36 UTC 2020


Hi All,

I have been observing an issue related to monitors and HA for quite a while as part of trying to develop an async IO framework for the JSON-RPC component in ovs/ovn.

I believe this issue is inherent to the current design and implementation. While my async IO patches make it more pronounced, it can be triggered on a fast enough machine using stock upstream master. For example, I can trigger it occasionally with master when running the test suite on a Ryzen 3200 (and never on something slow like an A6 at 2700).

The issue is:

When Raft or an administrative request takes a database offline, all monitors are cancelled with immediate effect and notifications about this are sent to the clients. The clients are supposed to re-establish monitors and reconnect as needed while doing so. This was introduced at some point as a more gentle replacement for just dropping all connections.

There are multiple issues with that.

1. The node emitting the cancels is not explicitly blacklisted for the next reconnect cycle.

The only thing the client does there is restart the FSM; this is in ovsdb_idl_handle_monitor_canceled().

While the reconnect logic round-robins by default, in some circumstances it may end up reconnecting to the faulty peer, attempting to set up the monitor on the missing database and getting a Syntax Error. That results in a test failure, and probably in production failures too.

2. There may be a transaction in flight. By default, JSON-RPC in ovsdb master will not process any incoming traffic while there is outgoing traffic in the queue. As a result, if the client sends a transaction (quite often as a single send() operation), it will sit in the receive socket queue on the server while the notifications are emitted, and it is only processed after that, resulting in a Syntax Error (see the first sketch after this list).

3. The "receive only if there is no backlog" logic is actually broken for SSL, because master double-buffers in stream-ssl.c and can have one full outgoing message in flight. So if this gating is really needed, SSL is broken, as the backlog does not account for the double-buffered message (see the second sketch after this list).
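
To illustrate point 2, here is a minimal, self-contained sketch of the "send before receive" gating. This is not OVS code and all names are made up; it only shows the ordering that lets a stale transaction be parsed after the database is gone:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct conn {
        size_t output_backlog;   /* notification bytes not yet written */
        bool   input_waiting;    /* client request sitting in the rx buffer */
    };

    static void run_once(struct conn *c)
    {
        if (c->output_backlog) {
            /* Flush the queued monitor cancellations; input is
             * deliberately not read while there is a backlog. */
            c->output_backlog = 0;
            return;
        }
        if (c->input_waiting) {
            /* Only now is the client's transaction parsed, and by this
             * point the database it refers to is already gone. */
            c->input_waiting = false;
            printf("stale transaction processed -> error reply\n");
        }
    }

    int main(void)
    {
        /* The database has just been removed: cancellations are queued
         * for output and the client's transaction is already queued on
         * the receive side. */
        struct conn c = { 4096, true };
        run_once(&c);   /* write-only pass */
        run_once(&c);   /* the stale transaction is parsed here */
        return 0;
    }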
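
And to illustrate point 3, a similarly made-up sketch of why a byte-counted backlog can read zero while the SSL stream still holds a complete outgoing message in its own buffer:

    #include <stddef.h>
    #include <stdio.h>

    struct ssl_stream {
        size_t txbuf;            /* record copied into the SSL layer but
                                  * not yet written to the socket */
    };

    struct conn {
        struct ssl_stream ssl;
        size_t backlog;          /* what the JSON-RPC layer thinks is queued */
    };

    /* "Send" a message: the SSL layer accepts it into its own buffer and
     * reports success, so the JSON-RPC backlog drops to zero even though
     * nothing has reached the peer yet. */
    static void send_msg(struct conn *c, size_t len)
    {
        c->ssl.txbuf = len;
        c->backlog = 0;
    }

    /* The check the gating relies on; a correct version would also have
     * to look at c->ssl.txbuf. */
    static size_t output_pending(const struct conn *c)
    {
        return c->backlog;
    }

    int main(void)
    {
        struct conn c = { { 0 }, 0 };
        send_msg(&c, 1500);
        printf("backlog seen by JSON-RPC: %zu\n", output_pending(&c)); /* 0 */
        printf("bytes still buffered in the SSL layer: %zu\n", c.ssl.txbuf);
        return 0;
    }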

The end result is that the following tests can flake on a fast enough machine (they flake out on a Ryzen at 3GHz+):

ovs: all schema conversion tests. Extremely low probability on master (observed only once or twice in 6 months), more common with the async-io patchset (~5-10% in my tests).

ovsdb-cluster: the 3- and 5-node cluster torture tests which administratively remove (not kill) peers. Regular on master (~5%), and regular with async IO as well, at a similar probability (~5%).

IMHO this looks like there is some code missing in lib/:

1. Code in reconnect to temporarily and explicitly prohibit a node as a reconnect choice until at least one successful completion of the reconnect cycle. A rough sketch of what I mean follows after this list.

2. Code to rewind all unconfirmed transactions in the client IDL and resubmit them to the new endpoint after a reconnect (second sketch below). Unless I have missed it, there is no such code in ovsdb-idl.c; it only seems to try to reload the schema and re-establish monitors.
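
For point 1, something along these lines is what I have in mind. This is only a sketch, not the actual lib/reconnect.c API; the structure and names are hypothetical:

    #include <stdbool.h>
    #include <stdio.h>

    #define N_REMOTES 3

    struct remote {
        const char *target;
        bool banned;             /* this peer has just cancelled our monitor */
    };

    static struct remote remotes[N_REMOTES] = {
        { "tcp:10.0.0.1:6641", false },
        { "tcp:10.0.0.2:6641", false },
        { "tcp:10.0.0.3:6641", false },
    };

    static int next;

    /* Round-robin selection that skips banned peers; falls back to plain
     * round robin if everything is banned. */
    static struct remote *pick_remote(void)
    {
        for (int i = 0; i < N_REMOTES; i++) {
            struct remote *r = &remotes[(next + i) % N_REMOTES];
            if (!r->banned) {
                next = (next + i + 1) % N_REMOTES;
                return r;
            }
        }
        return &remotes[next++ % N_REMOTES];
    }

    /* Called when a peer cancels monitors for a database it no longer has. */
    static void ban_remote(struct remote *r)
    {
        r->banned = true;
    }

    /* Called after one reconnect cycle completes successfully, i.e. the
     * schema is reloaded and monitors are re-established somewhere. */
    static void clear_bans(void)
    {
        for (int i = 0; i < N_REMOTES; i++) {
            remotes[i].banned = false;
        }
    }

    int main(void)
    {
        struct remote *r = pick_remote();
        printf("connected to %s\n", r->target);
        ban_remote(r);                        /* it cancelled our monitors */
        printf("reconnecting to %s\n", pick_remote()->target);
        clear_bans();                         /* monitors are back up */
        return 0;
    }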
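
And for point 2, again a hypothetical sketch rather than the real ovsdb-idl.c internals; a real implementation would also have to decide what to do about transactions the old server may have committed without replying:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_PENDING 16

    struct pending_txn {
        int id;                  /* JSON-RPC request id */
        const char *params;      /* serialized "transact" parameters */
        bool replied;            /* a result or error has arrived */
    };

    static struct pending_txn pending[MAX_PENDING];
    static int n_pending;

    static void txn_sent(int id, const char *params)
    {
        /* Bounds checking omitted for brevity. */
        pending[n_pending++] = (struct pending_txn) { id, params, false };
    }

    static void txn_replied(int id)
    {
        for (int i = 0; i < n_pending; i++) {
            if (pending[i].id == id) {
                pending[i].replied = true;
            }
        }
    }

    /* Called after a reconnect, once the schema has been reloaded and the
     * monitors re-established on the new endpoint: replay everything the
     * old endpoint never confirmed. */
    static void resubmit_unconfirmed(void)
    {
        for (int i = 0; i < n_pending; i++) {
            if (!pending[i].replied) {
                printf("resubmitting txn %d: %s\n",
                       pending[i].id, pending[i].params);
            }
        }
    }

    int main(void)
    {
        txn_sent(1, "[\"OVN_Northbound\", ...]");
        txn_sent(2, "[\"OVN_Northbound\", ...]");
        txn_replied(1);
        resubmit_unconfirmed();  /* only txn 2 is replayed */
        return 0;
    }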

-- 
Anton R. Ivanov
https://www.kot-begemot.co.uk/


