[ovs-discuss] Problems of prerequisite check failures in OVSDB Raft

Han Zhou zhouhan at gmail.com
Fri Mar 1 19:13:08 UTC 2019

On Tue, Feb 26, 2019 at 1:57 PM Han Zhou <zhouhan at gmail.com> wrote:
> I have been debugging OVSDB Raft scale-testing problems recently,
> and here is a major finding. I thought there were several different
> problems, but it turned out that the one major problem is the
> prerequisite-checking mechanism. My conclusion is that we'd better
> abandon the prerequisite checking and switch to the alternative
> approach Ben mentioned at an earlier OVN meeting: the follower sends
> the operations requested by clients directly to the leader, without
> any prerequisite assumption. Here is the detailed analysis.
> The test environment is ovn-scale-test, with 3-node clusters for NB
> and SB. There are 1k HVs connecting to the SB DB, with connections
> randomly distributed across the 3 cluster nodes. The test creates
> 10k lports and binds them on the 1k HVs. The bottleneck is in the SB
> DB only, because of the large number of clients and the continuous
> write requests from those clients (port-binding updates).
> The first problem I encountered was that leader re-election happened
> frequently during the testing, because the leader wasn't able to
> respond to heartbeat messages in time due to the CPU load. The ideal
> solution would be a dedicated thread for Raft message handling, but
> to circumvent the problem and see what happens, I increased the
> election base time from 1024 ms to 10000 ms. Heartbeat messages are
> sent at 1/3 of the election timeout, so they now go out roughly
> every 3.3 seconds.
> With that change the leader re-election problem was solved. However,
> it revealed the real reason the CPU load was high: CPU usage stayed
> at 100% on the leader, and the rate of INFO logs about prerequisite
> check failures was extremely high. For example:
> 2019-02-20T19:27:02.001Z|00382|raft|INFO|Dropped 299433 log messages
> in last 12 seconds (most recently, 0 seconds ago) due to excessive
> rate
> 2019-02-20T19:27:02.001Z|00383|raft|INFO|current entry eid
> b4b77aaf-4f69-449b-a689-c6c2a02f452f does not match prerequisite
> 9477ce8d-1b8f-46af-8ab4-0ea044253742 in execute_command_request
> 2019-02-20T19:27:14.001Z|00388|raft|INFO|Dropped 300646 log messages
> in last 12 seconds (most recently, 0 seconds ago) due to excessive
> rate
> 2019-02-20T19:27:14.001Z|00389|raft|INFO|current entry eid
> c062d334-c6b8-4e8d-9ae8-38ed650c120b does not match prerequisite
> a87df6e0-661d-404c-8d7f-ad6de210cc41 in execute_command_request
> 2019-02-20T19:27:26.001Z|00394|raft|INFO|Dropped 301084 log messages
> in last 12 seconds (most recently, 0 seconds ago) due to excessive
> rate
> 2019-02-20T19:27:26.001Z|00395|raft|INFO|current entry eid
> 0f74e572-777e-4d6e-859a-739fac0293e5 does not match prerequisite
> def7f61f-f865-423b-a735-5c3c78d982e8 in execute_command_request
> This is ~30k failures per second, caused by the prereq check
> failures and the continuous retries from the followers.
> My first reaction was to avoid the continuous retries from followers
> by checking whether the DB's prereq has changed, since it makes no
> sense to retry with the same prereq as the previous failed attempt.
> I made the change and tried again. It did solve the CPU problem, but
> it revealed two new problems.
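The retry-suppression change described here can be modeled roughly as follows. This is a Python sketch with illustrative names, not the actual ovsdb-server C code:

```python
# Hypothetical model of the follower-side retry suppression described
# above; class and method names are illustrative, not actual OVS code.

class FollowerRetryGate:
    """Skip re-proposing a command whose prereq has not changed since
    the last attempt that failed the leader's prerequisite check."""

    def __init__(self):
        self.last_failed_prereq = None

    def should_propose(self, current_prereq):
        # Retrying with the same prereq as a failed attempt is certain
        # to fail again, so suppress it.
        return current_prereq != self.last_failed_prereq

    def on_prereq_failure(self, prereq):
        # Remember the prereq that the leader just rejected.
        self.last_failed_prereq = prereq

    def on_commit_applied(self):
        # A newly applied commit changes the local prereq, so a retry
        # becomes worthwhile again.
        self.last_failed_prereq = None
```

Under this model, the follower stops hammering the leader with proposals that are guaranteed to fail, which is what eliminated the 100% CPU load.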
> Problem 1):
> There were still hundreds of prerequisite check failures per second
> in the leader's logs, and the whole scale test was extremely slow,
> just as in the earlier high-CPU situation (6 hours to complete the
> whole test). By checking the Raft messages in the debug logs, I
> found that this is because there are so many clients requesting
> updates to the DB that each follower can pile up a batch of
> transactions from different clients and then send them one by one to
> the leader, all with *the same prereq*. In this case, even with no
> racing from other nodes, only the first request from a given
> follower succeeds; the later ones are certain to fail the prereq
> check, because the first transaction proposed by that follower
> causes the leader to update its expected prereq. I tried another
> patch to solve this, updating the follower's "prereq" to the
> "result" - the eid of the last proposed commit - so that the *batch*
> of transactions proposed by the same follower would either all
> succeed if there is no racing from other nodes, or all fail if there
> is. This patch worked, and the whole test completed in 2 hours,
> similar to the result with a standalone DB. However, I realized that
> this trick violates basic transactional correctness. A proposed
> commit should be based on a previously *committed* version; if the
> previous proposed transaction in the *batch* is not yet committed,
> setting its eid as the prereq is wrong. Going further, the attempt
> to propose a batch of transactions seems wrong in itself - each
> transaction should be examined on top of previously committed
> transactions (unless we are doing concurrency control such as MVCC,
> which is obviously not the case in the current OVSDB design). So, to
> ensure correctness, the prereq has to be the last applied eid on the
> follower side (as is implemented today). Because of this, the only
> optimization we can do within the prereq checking mechanism is
> probably to propose commits to the leader only when:
> a) the Raft applied index is caught up with the latest Raft eid in
> the follower's log - because otherwise new proposals from other
> nodes have already happened, and it is just a waste of CPU to
> propose a commit that is bound to fail the prereq check; and
> b) there is no pending proposal from the current follower node that
> was sent to the leader with the same prereq - because otherwise the
> new proposal would fail the prereq check, too.
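Conditions a) and b) amount to a single gating check before a follower proposes a commit. A sketch, with hypothetical names (the real check would live in the Raft C code):

```python
# Hypothetical sketch of gating conditions a) and b) above; the
# argument names are illustrative, not actual raft struct members.

def can_propose(last_applied_eid, log_end_eid, pending_prereqs, prereq):
    # a) Only propose when the applied entry has caught up with the
    #    local log end; otherwise another node's proposal is already
    #    in flight, and our prereq check would fail anyway.
    if last_applied_eid != log_end_eid:
        return False
    # b) Do not send a second proposal with the same prereq while one
    #    is still outstanding; it is guaranteed to fail the check.
    if prereq in pending_prereqs:
        return False
    return True
```

In this model, a follower that fails either check simply holds the transaction until its local state advances, instead of sending a doomed proposal.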

I submitted a patch for a), which is low-hanging fruit. b) is not
difficult to do, but it does not seem to provide much value, since a)
already solves the CPU problem (though not the latency - "problem 2"
described below).

> However, while this optimization solves the CPU load problem, it
> does not solve the performance problem. The scale test would still
> be slow, due to problem 2) described below.
> Problem 2):
> The prereq check mechanism introduces latency. Take a simple
> scenario as an example: assume there is no racing with other nodes,
> and follower A is currently the only node proposing commits. Assume
> there are 3 transactions T1 - T3 accumulated in the current
> main-loop iteration, to be sent to the leader. The Raft message
> sequence would be as follows:
> 1. Follower A: send CommandExec(T1) using prereq P1
> 2. Leader: handle T1, send AppendReq(T1) to all followers
> 3. Follower A: receive AppendReq(T1), update local log_end (but not
> updating commit_index or last_applied), reply to leader
> 4. Leader: receive replies from a majority, reply success to
> follower A for the CommandExec(T1), with the new commit_index
> 5. Follower A: handle reply for CommandExec(T1). However, it does
> nothing but some cleanup; the commit_index in the reply appears
> unused. (This is correct, since Raft doesn't suggest any action at
> this point.)
> 6. ... 1/3 heartbeat timer reached ...
> 7. Leader: send AppendReq(heartbeat) with the new commit_index to follower A
> 8. Follower A: handle AppendReq(heartbeat) and update local
> commit_index, and now follower A can apply T1 to its local committed
> data, and T2 can be proposed to leader with the new prereq P2.
> 9. Follower A: send CommandExec(T2) using prereq P2.
> ...
> As we can see, just to complete T1 there is a latency caused by the
> heartbeat timer at step 6, and T2 cannot start until step 8 (it can
> start earlier, as currently implemented, but that only results in
> prereq check failures and useless retries that waste more CPU
> cycles). So the batch of transactions on follower A has to be
> executed one by one, each waiting at least one heartbeat interval
> (1/3 of the election timer). This causes large latencies and very
> low overall performance. An optimization may be possible at step 5,
> when the follower receives the CommandExec reply: move the local
> commit_index forward without waiting for the next heartbeat.
> However, I am not sure whether that introduces any problem, since it
> is not something mentioned in the Raft paper.
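A back-of-the-envelope model of this serialization, using the 10000 ms election base from this test (illustrative arithmetic only):

```python
# Rough model of problem 2): with the prereq mechanism, each
# transaction in a follower's batch waits for the next heartbeat
# (sent at 1/3 of the election timer) before the next transaction can
# be proposed, so the batch serializes behind heartbeat intervals.

def batch_latency_ms(num_txns, election_base_ms):
    heartbeat_ms = election_base_ms / 3  # heartbeat at 1/3 of timer
    # Each transaction in the batch waits roughly one heartbeat
    # interval before the next can be proposed.
    return num_txns * heartbeat_ms

# With the 10000 ms election base used in the test, a batch of 3
# transactions spends about 10 seconds in heartbeat delays alone.
print(batch_latency_ms(3, 10000))  # -> 10000.0
```

Even with the default 1024 ms election base, each transaction in a batch pays on the order of 340 ms, which dominates throughput for write-heavy followers.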
> Because of the above problems, I believe the alternative approach
> would be better than the prereq check mechanism and would solve the
> problems easily. Instead of sending the final result to the leader,
> followers can simply forward the "transaction" RPC from the client
> to the leader, and let the leader perform ovsdb_execute_compose()
> and ovsdb_txn_precommit() on top of the latest data on the leader.
> This way, both problems are solved:
> For problem 1), since the leader does the real transaction parsing
> and execution, the prerequisite is not needed - in fact, there is no
> defined order between the transactions forwarded by followers, since
> they originally came from different clients, so the leader is free
> to apply them in any order it considers consistent. For problem 2),
> a batch of requests can be sent to the leader without delay. Only
> the final commit_index update for the last transaction in the batch
> experiences the delay caused by the heartbeat timer (similar to step
> 6 above), which seems to be inherent in the Raft design.
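The forwarding approach can be modeled roughly like this. It is a toy Python model; the real implementation would forward the client's JSON-RPC request and run it through ovsdb_execute_compose()/ovsdb_txn_precommit() in C:

```python
# Toy model of the alternative approach: the follower forwards the
# client's raw transaction request to the leader, which applies it
# against its own latest committed data.  No prereq is attached, so
# the leader alone decides the serialization order.

class Leader:
    def __init__(self):
        self.data = {}   # latest committed data
        self.log = []    # committed entries, in leader-chosen order

    def handle_forwarded(self, txn):
        # No prereq check: forwarded transactions have no defined
        # order among themselves, so the leader is free to serialize
        # them in any order it considers consistent.
        self.data.update(txn)
        self.log.append(txn)

class Follower:
    def __init__(self, leader):
        self.leader = leader

    def handle_client_transact(self, txn):
        # Forward immediately; no waiting for the local commit_index
        # to catch up, so a batch incurs no per-transaction delay.
        self.leader.handle_forwarded(txn)

# A follower forwards a batch of three port-binding-style updates
# without waiting between them.
leader = Leader()
follower = Follower(leader)
for txn in [{"lp1": "hv1"}, {"lp2": "hv2"}, {"lp3": "hv3"}]:
    follower.handle_client_transact(txn)
```

The point of the model is that the whole batch reaches the leader in one round trip's worth of sends, rather than one transaction per heartbeat interval.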
> This alternative approach has side effects. The leader does a little
> more work than before, but compared with the CPU wasted on handling
> the useless failure retries, this is nothing. Even if the previously
> mentioned optimizations were implemented for the prereq mechanism to
> avoid the CPU waste, the alternative approach would still beat it on
> overall performance, because there is no delay in sending requests
> to the leader.
> Ben, please correct me if I misunderstood anything and suggest the next steps.
> Thanks,
> Han
