[ovs-discuss] Problems of prerequisite check failures in OVSDB Raft

Han Zhou zhouhan at gmail.com
Fri Mar 1 19:14:10 UTC 2019


On Fri, Mar 1, 2019 at 11:13 AM Han Zhou <zhouhan at gmail.com> wrote:
>
> On Tue, Feb 26, 2019 at 1:57 PM Han Zhou <zhouhan at gmail.com> wrote:
> >
> > I have been debugging OVSDB Raft scale testing problems recently and
> > here is a major finding. I thought there were several different
> > problems, but it turned out that the only major one is the
> > prerequisite checking mechanism. My conclusion is that we'd better
> > abandon the prerequisite checking and change to the alternative
> > approach mentioned by Ben at an earlier OVN meeting: the follower
> > sends the operations requested by clients directly to the leader,
> > without any prerequisite assumption. Here is the detailed analysis.
> >
> > The test environment is ovn-scale-test, with 3-node clusters for NB
> > and SB. There are 1k HVs connecting to the SB DB, with connections
> > randomly distributed across the 3 cluster nodes. The test creates 10k
> > lports and binds them on the 1k HVs. The bottleneck is in the SB DB
> > only, because of the large number of clients and the continuous write
> > requests from the clients (port-binding updates).
> >
> > The first problem I encountered was that leader re-election happened
> > a lot during the testing, because the leader wasn't able to respond
> > to heartbeat messages in time due to the CPU load. The ideal solution
> > would be to use a dedicated thread for Raft message handling, but to
> > circumvent the problem and see what happens I increased the election
> > base time from 1024ms to 10000ms. Heartbeat messages are sent at 1/3
> > of the expiration time, so now they go out roughly every 3+ sec.
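> >
> > (To show where the "3+ sec" comes from, here is a toy calculation
> > only - not OVS code. I am assuming the constant is the
> > ELECTION_BASE_MSEC define in ovsdb/raft.c; the actual change was
> > simply bumping that value and rebuilding.)
> >
> >     #include <stdio.h>
> >
> >     #define ELECTION_BASE_MSEC 10000   /* bumped from the default 1024 */
> >
> >     int main(void)
> >     {
> >         /* Heartbeats go out at roughly 1/3 of the election base time,
> >          * so 10000 ms gives a ~3.3 s heartbeat interval. */
> >         printf("heartbeat interval ~= %d ms\n", ELECTION_BASE_MSEC / 3);
> >         return 0;
> >     }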
> >
> > With this change the leader re-election problem was solved. However,
> > it revealed the real reason the CPU load was high. The CPU load
> > stayed at 100% on the leader, and the rate of INFO logs about
> > prerequisite check failures was extremely high. For example:
> > 2019-02-20T19:27:02.001Z|00382|raft|INFO|Dropped 299433 log messages
> > in last 12 seconds (most recently, 0 seconds ago) due to excessive
> > rate
> > 2019-02-20T19:27:02.001Z|00383|raft|INFO|current entry eid
> > b4b77aaf-4f69-449b-a689-c6c2a02f452f does not match prerequisite
> > 9477ce8d-1b8f-46af-8ab4-0ea044253742 in execute_command_request
> > 2019-02-20T19:27:14.001Z|00388|raft|INFO|Dropped 300646 log messages
> > in last 12 seconds (most recently, 0 seconds ago) due to excessive
> > rate
> > 2019-02-20T19:27:14.001Z|00389|raft|INFO|current entry eid
> > c062d334-c6b8-4e8d-9ae8-38ed650c120b does not match prerequisite
> > a87df6e0-661d-404c-8d7f-ad6de210cc41 in execute_command_request
> > 2019-02-20T19:27:26.001Z|00394|raft|INFO|Dropped 301084 log messages
> > in last 12 seconds (most recently, 0 seconds ago) due to excessive
> > rate
> > 2019-02-20T19:27:26.001Z|00395|raft|INFO|current entry eid
> > 0f74e572-777e-4d6e-859a-739fac0293e5 does not match prerequisite
> > def7f61f-f865-423b-a735-5c3c78d982e8 in execute_command_request
> >
> > This is ~30K failures per second, caused by the prereq check failures
> > and the continuous retries from the followers.
> >
> > My first reaction was to avoid the continuous retries from followers
> > by checking whether the db's prereq has changed, since it makes no
> > sense to retry with the same prereq as the previous failed attempt. I
> > made the change and tried again. It did solve the CPU problem, but
> > revealed two new problems.
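> >
> > (A rough sketch of that guard, with made-up names rather than the
> > actual patch, just to illustrate the idea of skipping retries until
> > the locally applied eid has moved:)
> >
> >     #include <stdbool.h>
> >     #include <stdio.h>
> >     #include <string.h>
> >
> >     struct eid { unsigned char u[16]; };   /* stand-in for struct uuid */
> >
> >     struct follower {
> >         struct eid last_applied_eid;    /* the prereq we would use now */
> >         struct eid last_failed_prereq;  /* prereq of the last failed try */
> >         bool have_failed_prereq;
> >     };
> >
> >     static bool
> >     should_retry(const struct follower *f)
> >     {
> >         /* Retrying with an unchanged prereq can only fail again, so
> >          * wait until a new commit has been applied locally. */
> >         return !f->have_failed_prereq
> >                || memcmp(&f->last_applied_eid, &f->last_failed_prereq,
> >                          sizeof f->last_applied_eid) != 0;
> >     }
> >
> >     int main(void)
> >     {
> >         struct follower f = { .have_failed_prereq = true };
> >         /* Applied eid still equals the failed prereq, so do not retry. */
> >         printf("retry now? %s\n", should_retry(&f) ? "yes" : "no");
> >         return 0;
> >     }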
> >
> > Problem 1):
> > There were still hundreds of prerequisite check failures per second
> > in the leader's logs, and the whole scale test execution was
> > extremely slow, just as in the previous high-CPU-load situation (6
> > hours to complete the whole test). By checking Raft messages in the
> > debug logs, I found out that this is because there are so many
> > clients requesting updates to the DB that each follower can pile up a
> > batch of transactions from different clients and then send them one
> > by one to the leader, all with *the same prereq*. In this case, even
> > if there is no racing with other nodes, only the first of the
> > requests from that follower will succeed, and the later ones are
> > guaranteed to fail the prereq check, because the first transaction
> > proposed by the follower causes the leader to update its expected
> > prereq. I tried another patch to solve this, updating the follower's
> > "prereq" to the "result" - the eid of the last proposed commit - so
> > that the *batch* of transactions proposed by the same follower would
> > either all succeed if there is no racing from other nodes, or all
> > fail if there is racing. This patch worked, and the whole test
> > completed in 2 hours, similar to the result when using a standalone
> > DB. However, I realized that this trick violates basic transactional
> > correctness. A proposed commit should be based on a previously
> > committed version; if the previously proposed transaction in the
> > *batch* is not yet committed, setting its eid as the prereq is wrong.
> > Going further, the attempt to propose a batch of transactions seems
> > to be wrong in itself - each transaction should be examined on top of
> > previously committed transactions (unless we are doing concurrency
> > control such as MVCC, which is obviously not the case in the current
> > OVSDB design). So, to ensure correctness, the prereq has to be the
> > last applied eid on the follower side, as is implemented today.
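> >
> > (A toy model of the batching problem, not OVSDB code; eids are plain
> > integers here just to show why only the first transaction in a batch
> > with a shared prereq can succeed:)
> >
> >     #include <stdio.h>
> >
> >     int main(void)
> >     {
> >         int leader_newest_eid = 100;  /* pretend eids are integers */
> >         int follower_prereq = 100;    /* last eid applied on the follower */
> >
> >         for (int t = 1; t <= 3; t++) {
> >             if (follower_prereq == leader_newest_eid) {
> >                 leader_newest_eid = 100 + t;   /* leader appends T(t) */
> >                 printf("T%d accepted, leader eid is now %d\n",
> >                        t, leader_newest_eid);
> >             } else {
> >                 printf("T%d rejected: prereq %d != leader eid %d\n",
> >                        t, follower_prereq, leader_newest_eid);
> >             }
> >             /* The follower's prereq only advances after it applies the
> >              * new commit, which needs the next heartbeat, so T2 and T3
> >              * are guaranteed to fail here. */
> >         }
> >         return 0;
> >     }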
> >
> > Because of this, the only optimization we can do with the prereq
> > checking mechanism is probably proposing commits to the leader only
> > when: a) the Raft applied index is the same as the latest Raft eid in
> > the follower's raft log - because otherwise new proposals have
> > already happened from other nodes and it would just be a waste of CPU
> > to propose a commit that is bound to fail the prereq check; and b)
> > there is no pending proposal from the current follower node that has
> > already been sent to the leader with the same prereq - because
> > otherwise the following proposal would fail the prereq check, too.
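> >
> > (A minimal sketch of the idea behind check a), using indexes and my
> > own field names modeled loosely on struct raft in ovsdb/raft.c; this
> > is not the submitted patch itself:)
> >
> >     #include <stdbool.h>
> >     #include <stdint.h>
> >     #include <stdio.h>
> >
> >     struct raft_sketch {
> >         uint64_t log_end;       /* index just past the newest log entry */
> >         uint64_t last_applied;  /* index of the newest applied entry */
> >     };
> >
> >     static bool
> >     ok_to_propose(const struct raft_sketch *r)
> >     {
> >         /* Unapplied entries in the log mean another proposal already
> >          * got in; our prereq is stale and a command would only fail
> >          * the check, so do not bother sending it yet. */
> >         return r->last_applied + 1 == r->log_end;
> >     }
> >
> >     int main(void)
> >     {
> >         struct raft_sketch r = { .log_end = 11, .last_applied = 10 };
> >         printf("propose now? %s\n", ok_to_propose(&r) ? "yes" : "no");
> >         return 0;
> >     }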
>
> I submitted a patch for a), which is low-hanging fruit. b) is not
> difficult to do, but it does not seem to provide much value, since a)
> already solves the CPU problem (though not the latency - "problem 2"
> described below).

Sorry, I forgot to paste link to the patch:
https://mail.openvswitch.org/pipermail/ovs-dev/2019-March/356891.html
>
> >
> > However, while this optimization can solve the CPU load problem, it
> > would not solve the performance problem. The scale test would still
> > be slow, due to problem 2) described below.
> >
> > Problem 2):
> > There is a latency introduced by the prereq check mechanism. Take a
> > simple scenario as an example: assume there is no racing with other
> > nodes, and follower A is currently the only node proposing commits.
> > Assume there are 3 transactions T1 - T3 accumulated in the current
> > main loop iteration to be sent to the leader. The Raft message
> > sequence would be like the following:
> >
> > 1. Follower A: send CommandExec(T1) using prereq P1
> > 2. Leader: handle T1, send AppendReq(T1) to all followers
> > 3. Follower A: receive AppendReq(T1), update local log_end (but not
> > updating commit_index or last_applied), reply to leader
> > 4. Leader: receive replies from the majority, reply success to
> > follower A for the CommandExec(T1), with the new commit_index
> > 5. Follower A: handle the reply for CommandExec(T1). However, it does
> > nothing but some cleanup; the commit_index in the reply does not seem
> > to be used. (This is correct since Raft doesn't suggest any action at
> > this point)
> > 6. ... 1/3 heartbeat timer reached ...
> > 7. Leader: send AppendReq(heartbeat) with the new commit_index to follower A
> > 8. Follower A: handle AppendReq(heartbeat) and update the local
> > commit_index; now follower A can apply T1 to its local committed
> > data, and T2 can be proposed to the leader with the new prereq P2.
> > 9. Follower A: send CommandExec(T2) using prereq P2.
> > ...
> >
> > As we can see, just to complete T1 there is a latency caused by the
> > heartbeat timer at step 6, and T2 cannot start until step 8 (it can
> > start earlier, as is currently implemented, but that can only result
> > in prereq check failures and useless retries wasting more CPU
> > cycles). So the batch of transactions on follower A has to be
> > executed one by one, each waiting at least 1/3 of the heartbeat
> > timer. This causes big latencies and very low performance overall. An
> > optimization could be done at step 5, when the follower receives the
> > CommandExec reply, to move the local commit_index forward without
> > waiting for the next heartbeat. However, I am not sure whether that
> > introduces any problems, since it is not something mentioned in the
> > Raft paper.
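> >
> > (A back-of-the-envelope model of that per-transaction delay, under
> > the assumption of one proposal per heartbeat interval per follower
> > and a heartbeat at 1/3 of the election base time; the numbers are
> > only illustrative:)
> >
> >     #include <stdio.h>
> >
> >     int main(void)
> >     {
> >         int election_base_ms = 10000;            /* value used in the test */
> >         int heartbeat_ms = election_base_ms / 3; /* ~3333 ms */
> >         int batch = 3;                           /* T1 - T3 from the example */
> >
> >         /* Each transaction waits for the next heartbeat before the
> >          * follower can propose the following one. */
> >         printf("batch of %d takes roughly %d ms\n",
> >                batch, batch * heartbeat_ms);
> >         return 0;
> >     }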
> >
> > Because of the above problems, I believe the alternative approach
> > would be better than the prereq check mechanism and would solve the
> > problems easily. Instead of sending the final result to the leader,
> > followers can simply forward the "transaction" RPC from the client to
> > the leader, and let the leader perform ovsdb_execute_compose() and
> > ovsdb_txn_precommit() on top of its own latest data. This way, both
> > problems are solved:
> >
> > For problem 1), since the leader will do the real transaction parsing
> > and execution, the prerequisite is not needed - in fact there is no
> > defined order between the transactions forwarded by followers, since
> > they originally came from different clients, so the leader has the
> > freedom to apply them in any order it considers consistent. For
> > problem 2), a batch of requests can be sent to the leader without
> > delay. Only the final commit_index update for the last transaction in
> > the batch will experience the delay caused by the heartbeat timer
> > (similar to step 6 above), which seems to be an inherent problem of
> > the Raft design?
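> >
> > (A very high-level sketch of that forwarding flow, with made-up
> > helper names; on the real leader the work would be
> > ovsdb_execute_compose() plus ovsdb_txn_precommit() as described
> > above, so this only illustrates the control flow:)
> >
> >     #include <stdbool.h>
> >     #include <stdio.h>
> >
> >     struct client_txn { const char *json; };   /* made-up type */
> >
> >     static void
> >     leader_execute_and_append(const struct client_txn *t)
> >     {
> >         /* Stand-in for executing the transaction against the leader's
> >          * latest data and appending the result to the Raft log; no
> >          * prereq is needed because the leader serializes the forwarded
> >          * transactions itself. */
> >         printf("leader executes: %s\n", t->json);
> >     }
> >
> >     static void
> >     handle_client_txn(bool is_leader, const struct client_txn *t)
> >     {
> >         if (is_leader) {
> >             leader_execute_and_append(t);
> >         } else {
> >             /* Instead of composing locally and proposing with a
> >              * prereq, just forward the client's request as-is. */
> >             printf("forward to leader: %s\n", t->json);
> >         }
> >     }
> >
> >     int main(void)
> >     {
> >         struct client_txn t = { "[\"OVN_Southbound\", ...]" };
> >         handle_client_txn(false, &t);
> >         return 0;
> >     }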
> >
> > There are side effects of this alternative approach. The leader will
> > do a little more work than before, but compared with the CPU wasted
> > on handling the useless failure retries this is nothing. Even if the
> > previously mentioned optimizations are implemented for the prereq
> > mechanism to avoid the CPU waste, the alternative approach would
> > still beat the prereq mechanism because of its overall performance
> > advantage (there is no delay in sending requests to the leader).
> >
> > Ben, please correct me if I misunderstood anything and suggest the next steps.
> >
> > Thanks,
> > Han

