[ovs-dev] [PATCH] raft: Transfer leadership before creating snapshots.

Han Zhou hzhou at ovn.org
Mon May 17 21:57:29 UTC 2021


On Mon, May 17, 2021 at 2:02 PM Ilya Maximets <i.maximets at ovn.org> wrote:
>
> On 5/14/21 10:32 PM, Han Zhou wrote:
> >
> >
> > On Fri, May 14, 2021 at 7:05 AM Ilya Maximets <i.maximets at ovn.org> wrote:
> >>
> >> On 5/10/21 3:16 PM, Dumitru Ceara wrote:
> >> > On 5/6/21 2:47 PM, Ilya Maximets wrote:
> >> >
> >> > [...]
> >> >
> >> > Oops, I meant to send this along with the acked-by:
> >> >
> >> >>
> >> >> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> >> >> index d2ff643b2..658a016b2 100644
> >> >> --- a/ovsdb/raft.c
> >> >> +++ b/ovsdb/raft.c
> >> >> @@ -4160,9 +4160,24 @@ raft_may_snapshot(const struct raft *raft)
> >> >>              && !raft->leaving
> >> >>              && !raft->left
> >> >>              && !raft->failed
> >> >> +            && raft->role != RAFT_LEADER
> >> >>              && raft->last_applied >= raft->log_start);
> >> >>  }
> >> >>
> >> >> +/* Prepares for soon snapshotting.  */
> >> >
> >> > Nit: one space too many.  And I'd probably rephrase this to "Prepare
> >> > for snapshotting soon.", but I'm not a native speaker, so I'm not sure
> >> > if that's better.
> >> >
> >>
> >> Thanks!  I'm not sure if this variant is better and I wanted to keep
> >> comments uniform with other functions.  So, I removed the extra space,
> >> but kept the text as is.
> >>
> >> Applied to master and backported down to 2.13 as this change is
> >> important for large scale setups.
> >>
> >> Best regards, Ilya Maximets.
> >
> > Hi Ilya, thanks for fixing the problem and sorry for my late response.
> > This approach looks reasonable and the code looks good, but I still have
> > some concerns that I'm not sure you have already thought about or tested.
> >
> > This patch increases the chance that a leader is responsive and so the
> > cluster can process new transactions.  (I suppose the leader transfer
> > process completes immediately, but I didn't test.)
>
> Yes, it's pretty fast.  The leadership receiver starts an election
> immediately after receiving a transfer and the rest of the cluster
> votes for it, because there are no other candidates.  In my
> testing the original leader was typically able to vote before
> starting the compaction process, i.e. right on the next iteration of
> the main loop.
>
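Thanks, that is good to know.  For other readers of the thread, my rough
mental model of the new "prepare for snapshotting" hook is something like
the sketch below (simplified from memory, not the exact committed code; it
assumes the existing raft.c helpers raft_transfer_leadership() and
raft_reset_election_timer()):

    /* Sketch only: before writing a snapshot, a leader hands off
     * leadership so that the cluster stays operational while this
     * server is busy writing the snapshot to disk. */
    void
    raft_notify_snapshot_recommended(struct raft *raft)
    {
        if (raft->role == RAFT_LEADER) {
            /* Creating a snapshot can take a long time; transfer
             * leadership before starting it. */
            raft_transfer_leadership(raft, "preparing to write snapshot");
            raft_reset_election_timer(raft);
        }
    }

With that in place, the 'raft->role != RAFT_LEADER' condition added to
raft_may_snapshot() in the diff above ensures the compaction itself only
starts once this server has actually stepped down.
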
> >
> > However, it also increases the frequency of leader elections - every
> > snapshot on the leader node would trigger a leader election.  We set the
> > leader election timer to a large value so that in a large scale deployment
> > leader elections are not triggered unnecessarily, because they introduce
> > churn in the cluster.  For example, ovn-northd (and probably a lot of CMS
> > writable client workers that use the leader-only option) would need to
> > reconnect to the new leader.  Also, ongoing transactions (including those
> > initiated by followers) would all get disturbed and re-propagated through
> > the new leader (not sure whether there can be failures in untested corner
> > cases).
>
> Yes, this triggers a leader election and, yes, northd will re-connect.
> To be honest, I'd strongly argue that CMS clients should never connect
> to the Southbound DB at all (I know that neutron agents do that, but they,
> IMHO, shouldn't) because Sb DB layout is not standardized and it just
> happened to be available for user's connections.  Especially, these
> should not be 'write'/'leader-only' connections.  For the Northbound
> the problem is not that severe and I don't think that under normal
> circumstances Nb DB will be compacted frequently.

Agree.
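
As a side note, clients that really do need to talk to the Sb DB but only
read from it can at least reduce this reconnect churn by not insisting on
the leader.  With the C IDL that is roughly the following (a sketch; the
header paths and the generated sbrec_idl_class name are assumptions for an
OVN Sb client):

    #include "ovsdb-idl.h"
    #include "ovn-sb-idl.h"      /* generated IDL for the Sb schema */

    /* Connect to the Sb DB accepting any cluster member instead of
     * chasing the leader. */
    static struct ovsdb_idl *
    sb_connect_any_server(const char *remote)
    {
        struct ovsdb_idl *idl = ovsdb_idl_create(remote, &sbrec_idl_class,
                                                 true, true);
        ovsdb_idl_set_leader_only(idl, false);   /* any server is fine */
        return idl;
    }

As far as I understand, writes are still forwarded to the leader by
whichever server the client lands on, so this mainly helps read-mostly
clients.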

>
> > If the snapshot time is not beyond the response time requirement of a
> > production deployment, I'd rather let the leader keep its role and
> > finish the snapshot instead of introducing this churn (and of course set
> > the election timer long enough for the snapshot to complete).  If the
> > snapshot time is beyond the requirement, transferring the leader role
> > sounds reasonable, but I'm not really sure whether frequent leader
> > elections in a large scale deployment would increase the chance of
> > hitting corner-case problems, given that we don't have enough test cases
> > for raft leader transfer and its impact on the ongoing transactions.
>
> Regarding the snapshot time: it would be great if we can reliably
> predict how much time it will take to create a snapshot, but
> unfortunately this highly depends on a variety of random factors, such
> as storage speed and whether other applications on the system are using
> it at the same time; it also depends on the CPU speed and whether this
> process shares a CPU with other processes in the system.  If you have a good
> idea on how to predict the snapshot writing time, I'd be glad to hear.
>
I was thinking from an operational perspective rather than proposing an
implementation that predicts the time required for snapshotting and reacts
based on that.  For example, an operator may be totally OK with an
occasional ~10 seconds of unresponsive time several times a day.  But I
agree that it is hard to predict accurately.

> Regarding failed transactions: yes, it is true that they will need
> to be re-tried.  For the corner cases, I think it's a good thing to
> hit all such cases as soon as possible and fix them.

Of course.  My major concern is whether this feature is safe enough to be
used directly in production.  If not, how would we test it and ensure
adequate test coverage?

>
> Long election timers are bad in general, because detection of a real
> problem, if any, is significantly delayed.  Leadership transfer actually
> makes it possible to significantly reduce election timers in test scenarios.
>
A long election timer is definitely not ideal.  However, I doubt that we
can safely use smaller election timers just because of this patch, because
snapshotting is not the only reason a server becomes non-responsive.  For
example, in a decent scale environment with ~1k chassis, if all the
ovn-controllers restart at the same time requesting the initial data sync,
the SB servers will likely be non-responsive, and an election timer of at
least 10 - 20 sec is required so that the already busy cluster can make
progress instead of doing nothing but leader flapping.  In addition, there
is also a very small chance that two followers are snapshotting at the
same time, so they can't respond to the leader's heartbeat for a long
time, which could result in the leader regarding itself as disconnected
from the cluster; but of course this chance is really small, so maybe it's
nothing to worry about in reality.

> >
> > There may be even more complex scenarios that need to be considered if
> > we do this.  What if other nodes are doing a snapshot at the same time
> > or at a very close time?  Since all the nodes' logs are synced, it is
> > likely that they would snapshot close to each other.  Suppose the leader
> > tries to do it first; then one round of snapshots may trigger a
> > leadership transfer 3 times (in a 3-node cluster).  If there is already
> > a follower node doing a snapshot in a 3-node cluster, then transferring
> > leadership would not help the response time, because in the next couple
> > of seconds only the new leader is not doing a snapshot, while any
> > transaction needs at least one of the followers to replicate the change
> > and respond to the leader.
>
>
> Snapshotting of many servers at the same or very close time is unlikely,
> because the default compaction timeout is 10 + random(10) minutes.
> And it gets more and more random after every compaction scheduling.
> Of course, it's possible that all 3 servers will someday hit compaction
> at the same or a very close time, but in the worst case we will have
> a non-operational cluster for one or two snapshot-writing periods,
> which is no worse than the case without transferring.
>
Ok, sounds reasonable.
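
Just to spell out my reading of that: each server schedules its next
compaction independently with its own jitter, roughly like the following
(illustrative only, not the actual ovsdb scheduling code):

    #include <stdlib.h>

    /* Pick the next compaction time as now + 10 minutes + random(10
     * minutes), so servers drift apart instead of compacting in
     * lockstep. */
    static long long
    next_compaction_msec(long long now_msec)
    {
        const long long ten_min_ms = 10LL * 60 * 1000;
        return now_msec + ten_min_ms + random() % ten_min_ms;
    }

and the accumulated jitter keeps simultaneous snapshots on multiple
servers unlikely.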

> > In addition, is there any chance that the install_snapshot procedure
> > kicks in, and what would be the impact?
>
> On the leader's side, generation of an install_snapshot request is cheaper
> than the database compaction, because there is no need to write to disk
> and the JSON is already prepared.  It still takes time, but I'm not sure
> how to avoid that.  On the follower's side, it's a heavy operation,
> but it's a follower, so it's not that important if it's not very
> responsive.
>
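Makes sense.  And just to confirm when that path is taken at all: the
leader only falls back to install_snapshot for a follower that has fallen
behind the compacted part of the log, i.e. roughly (made-up names,
illustrative only):

    #include <stdbool.h>
    #include <stdint.h>

    /* A leader cannot use append_entries if the entries the follower
     * needs were already compacted away; it has to send the full
     * snapshot instead. */
    static bool
    needs_install_snapshot(uint64_t follower_next_index, uint64_t log_start)
    {
        return follower_next_index < log_start;
    }

So as long as followers keep up with the log, transferring leadership
before snapshots shouldn't make install_snapshot any more frequent.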

Thanks for your explanation!

Han

> >
> > These are just my random thoughts, and maybe some of them aren't real
> > problems at all, but I just want to put them here and see whether they
> > raise any cautions or are already ruled out.
> >
> > Thanks,
> > Han

