[ovs-dev] [PATCH] raft: Transfer leadership before creating snapshots.

Ilya Maximets i.maximets at ovn.org
Tue May 18 11:48:32 UTC 2021


On 5/17/21 11:57 PM, Han Zhou wrote:
> 
> 
>> On Mon, May 17, 2021 at 2:02 PM Ilya Maximets <i.maximets at ovn.org> wrote:
>>
>> On 5/14/21 10:32 PM, Han Zhou wrote:
>> >
>> >
>> > On Fri, May 14, 2021 at 7:05 AM Ilya Maximets <i.maximets at ovn.org> wrote:
>> >>
>> >> On 5/10/21 3:16 PM, Dumitru Ceara wrote:
>> >> > On 5/6/21 2:47 PM, Ilya Maximets wrote:
>> >> >
>> >> > [...]
>> >> >
>> >> > Oops, I meant to send this along with the acked-by:
>> >> >
>> >> >>
>> >> >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
>> >> >> index d2ff643b2..658a016b2 100644
>> >> >> --- a/ovsdb/raft.c
>> >> >> +++ b/ovsdb/raft.c
>> >> >> @@ -4160,9 +4160,24 @@ raft_may_snapshot(const struct raft *raft)
>> >> >>              && !raft->leaving
>> >> >>              && !raft->left
>> >> >>              && !raft->failed
>> >> >> +            && raft->role != RAFT_LEADER
>> >> >>              && raft->last_applied >= raft->log_start);
>> >> >>  }
>> >> >>
>> >> >> +/* Prepares for soon snapshotting.  */
>> >> >
>> >> > Nit: one space too many.  And I'd probably rephrase this to "Prepare for
>> >> > snapshotting soon.", but I'm not a native speaker, I'm not sure if
>> >> > that's better.
>> >> >
>> >>
>> >> Thanks!  I'm not sure if this variant is better and I wanted to keep
>> >> comments uniform with other functions.  So, I removed the extra space,
>> >> but kept the text as is.
>> >>
>> >> Applied to master and backported down to 2.13 as this change is important
>> >> for large scale setups.
>> >>
>> >> Best regards, Ilya Maximets.
>> >
>> > Hi Ilya, thanks for fixing the problem and sorry for my late response. This approach looks reasonable and the code looks good, but I still have some concerns that I'm not sure you have already thought about or tested.
>> >
>> > This patch increases the chance that a leader is responsive and so the cluster can process new transactions. (I suppose the leader transfer process completes immediately, but I didn't test.)
>>
>> Yes, it's pretty fast.  The leadership receiver starts an election
>> immediately after receiving a transfer and the rest of the cluster
>> votes for it, because there are no other candidates.  In my
>> testing the original leader was typically able to vote before
>> starting the compaction process, i.e. right on the next iteration
>> of the main loop.
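>>
>> To make it concrete, the flow is roughly the following (a simplified
>> sketch; maybe_compact(), db_write_snapshot() and raft_is_leader() are
>> illustrative names rather than the exact ones in ovsdb/raft.c, only
>> raft_may_snapshot() and raft_transfer_leadership() correspond to real
>> functions):
>>
>>     /* Sketch: hand leadership over before a long, blocking snapshot.
>>      * Helper and type names are illustrative only. */
>>     static void
>>     maybe_compact(struct db *db)
>>     {
>>         struct raft *raft = db->raft;
>>
>>         if (raft_is_leader(raft)) {
>>             /* Ask another server to take over.  We'll snapshot once
>>              * we become a follower, on a later main loop iteration. */
>>             raft_transfer_leadership(raft, "snapshotting");
>>         } else if (raft_may_snapshot(raft)) {
>>             db_write_snapshot(db);   /* The long, blocking part. */
>>         }
>>     }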
>>
>> >
>> > However, it also increases the frequency of leader elections - every snapshot on the leader node would trigger a leader election. We set the leader election timer large so that in a large-scale deployment leader elections are not triggered unnecessarily, because they introduce churn in the cluster. For example, ovn-northd (and probably a lot of CMS writable client workers that use the leader-only option) would need to reconnect to the new leader. Also, ongoing transactions (including those initiated by followers) would all get disturbed and repropagated through the new leader (I'm not sure if there can be failures in untested corner cases).
>>
>> Yes, this triggers a leader election and, yes, northd will reconnect.
>> To be honest, I'd strongly argue that CMS clients should never connect
>> to the Southbound DB at all (I know that neutron agents do that, but
>> they, IMHO, shouldn't) because the Sb DB layout is not standardized and
>> it just happened to be available for users' connections.  In particular,
>> these should not be 'write'/'leader-only' connections.  For the
>> Northbound the problem is not that severe, and I don't think that under
>> normal circumstances the Nb DB will be compacted frequently.
> 
> Agree.
> 
>>
>> > If the snapshot time is within the response-time requirements of a production deployment, I'd rather let the leader keep its role and finish the snapshot instead of introducing this churn (and of course set the election timer long enough for the snapshot to complete). If the snapshot time is beyond the requirement, transferring the leader role
>> > sounds reasonable, but I'm not really sure whether the frequent leader elections in a large-scale deployment would increase the chance of hitting corner-case problems, given that we don't have enough test cases for raft leader transfer and its impact on ongoing transactions.
>>
>> Regarding the snapshot time: it would be great if we could reliably
>> predict how much time it will take to create a snapshot, but
>> unfortunately this highly depends on a variety of random factors, such
>> as the storage speed and whether other applications on the system are
>> using it at the same time, and also on the CPU speed and whether this
>> process shares a CPU with other processes in the system.  If you have
>> a good idea on how to predict the snapshot writing time, I'd be glad
>> to hear it.
>>
> I was thinking from an operational perspective rather than proposing an implementation that predicts the time required for snapshotting and reacts based on that. For example, an operator may be totally OK with an occasional ~10 seconds of unresponsiveness several times a day. But I agree that it is hard to predict accurately.

Yep.  But from this perspective we can only enable or disable transferring,
as we still can't reliably predict the time needed for compaction.
I think we can introduce an appctl for that if this becomes an issue
in the future.
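
If it comes to that, the knob could be as small as a boolean flipped from
a unixctl handler.  Just to illustrate the shape of it (the command name
and the flag below are made up; only unixctl_command_register() and
unixctl_command_reply() are the real helpers from lib/unixctl.h):

    /* Hypothetical "ovsdb-server/snapshot-transfer-leadership on|off".
     * Needs <stdbool.h>, <string.h> and "unixctl.h". */
    static bool transfer_before_snapshot = true;

    static void
    snapshot_transfer_cmd(struct unixctl_conn *conn, int argc OVS_UNUSED,
                          const char *argv[], void *aux OVS_UNUSED)
    {
        transfer_before_snapshot = !strcmp(argv[1], "on");
        unixctl_command_reply(conn, NULL);
    }

    /* Registered somewhere in ovsdb-server startup: */
    unixctl_command_register("ovsdb-server/snapshot-transfer-leadership",
                             "on|off", 1, 1, snapshot_transfer_cmd, NULL);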

> 
>> Regarding failed transactions: yes, it is true that they will need
>> to be retried.  As for the corner cases, I think it's a good thing to
>> hit all such cases as soon as possible and fix them.
> 
> Of course. My major concern is probably whether this feature is safe enough to be used directly in production. If not, how would we test it and ensure sufficient test coverage?

We have some torture tests that simulate leader crashes, i.e. non-graceful
election scenarios.  I think that these tests might cover some corner cases.
It should also be possible to write more tests like this with more incoming
transactions during leader changes.

> 
>>
>> Long election timers are bad in general, because detection of a real
>> problem, if any, is significantly delayed.  Leadership transfer, actually,
>> allows to significantly reduce election timers in test scenarios.
>>
> Long election timers are definitely not ideal. However, I doubt that we are now safe to use smaller election timers just because of this patch, because snapshotting is not the only reason a server can be unresponsive. For example, in a decent-scale environment with ~1k chassis, if all the ovn-controllers restart at the same time requesting the initial data sync, the SB servers will likely be unresponsive, and an election timer of at least 10 - 20 sec is required so that the already busy cluster can make progress instead of doing nothing but leader flapping. In addition, there is also a very small chance that two followers are doing snapshots at the same time, so they can't respond to the leader's heartbeat for a long time, which could result in the leader regarding itself as disconnected from the cluster; but of course this chance is really small, so maybe it is nothing to worry about in reality.

For sure, this change alone doesn't save us from other sources of
big piles of work.  There is one more change under development right
now to limit the time spent on processing updates for monitors within
a single processing loop iteration.  This should make ovsdb-server much
more responsive in big deployments.  At least it will be able to do
raft housekeeping more frequently to avoid election timeouts.
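
The general shape of that change is just a time budget per iteration,
something like the sketch below (the names and the 30 ms budget are made
up for illustration; time_msec() from lib/timeval.h and LIST_FOR_EACH_POP
from openvswitch/list.h are the real OVS helpers):

    /* Sketch: stop flushing monitor updates once a per-iteration budget
     * is spent, so that raft housekeeping isn't starved. */
    #define MONITOR_FLUSH_BUDGET_MS 30   /* Arbitrary value for the sketch. */

    static void
    flush_monitor_updates(struct ovs_list *pending)
    {
        long long int deadline = time_msec() + MONITOR_FLUSH_BUDGET_MS;
        struct monitor_update *upd;    /* Hypothetical type with a 'node'
                                        * list element. */

        LIST_FOR_EACH_POP (upd, node, pending) {
            send_monitor_update(upd);  /* Hypothetical helper. */
            if (time_msec() > deadline) {
                break;   /* Leave the rest for the next iteration. */
            }
        }
    }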

> 
>> >
>> > There may be even more complex scenarios that need to be considered if we do this. What if other nodes are doing snapshots at the same time or at very close times? Since all the nodes' logs are synced, it is likely that they would snapshot close to each other. Suppose the leader tries to do it first; then one round of snapshots may trigger the leader transfer 3 times (in a 3-node cluster). If there is already a follower node doing a snapshot in a 3-node cluster, then transferring the leadership would not help the response time, because for the next couple of seconds only the new leader is not doing a snapshot, but any transaction needs at least one of the followers to replicate the change and respond to the leader.
>>
>>
>> Snapshotting on many servers at the same or a very close time is
>> unlikely, because the default compaction timeout is 10 + random(10)
>> minutes, and it gets more and more random after every compaction
>> scheduling.  Of course, it's possible that all 3 servers will someday
>> hit compaction at the same or a very close time, but in the worst case
>> we will have a cluster that is non-operational for the time of writing
>> 1-2 snapshots, which is not worse than the case without transferring.
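>>
>> Schematically, the scheduling is just the following (not the exact code
>> in ovsdb/storage.c, only the idea; random_range() from lib/random.h and
>> time_msec() from lib/timeval.h are the real helpers):
>>
>>     /* Next compaction at now + 10 minutes + random(0..10 minutes), so
>>      * even servers that started in sync drift apart over time. */
>>     long long int next_compaction_ms =
>>         time_msec() + 10 * 60 * 1000LL + random_range(10 * 60 * 1000);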
>>
> Ok, sounds reasonable.
> 
>> > In addition, is there any chance the install_snapshot procedure kicks in and what would be the impact?
>>
>> On the leader's side, generation of an install_snapshot request is
>> cheaper than the database compaction, because there is no need to write
>> to disk and the JSON is already prepared.  It still takes time, but I'm
>> not sure how to avoid that.  On the follower's side, it's a heavy
>> operation, but it's a follower, so it's not that important if it's not
>> very responsive.
>>
> 
> Thanks for your explanation!
> 
> Han
> 
>> >
>> > These are just my random thoughts and maybe some of them aren't real problems at all, but I just want to put them here and see if they bring any cautions or are already ruled out.
>> >
>> > Thanks,
>> > Han


