[ovs-dev] [OVN][RAFT] Follower refusing new entries from leader

Thu Nov 28 03:47:55 UTC 2019

On Wed, Nov 27, 2019 at 7:22 PM taoyunupt <taoyunupt at 126.com> wrote:
>
> Hi,
>     My OVN cluster has 3 OVN-northd nodes, which are proxied by HAProxy
> with a VIP. Recently I have been restarting the OVN cluster frequently,
> and one of the members reports the logs below.
>     After reading the RAFT code and paper, this seems to be the normal
> process: if the follower does not find an entry in its log with the same
> index and term, it refuses the new entries.
>     I think it is reasonable to refuse. But since we cannot control which
> server HAProxy (or some other proxy) picks, errors occur when a session
> is assigned to the failed follower.
>
>     Is there some way to solve this problem? Maybe we can kick the failed
> follower out, or disconnect it from HAProxy and then synchronize the
> data? I hope to hear your suggestions.
>
>
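
To make the consistency check described above concrete, here is a minimal
sketch in Python of how a Raft follower decides whether to accept an
append request. It is not the OVS implementation: the class and method
names (`FollowerLog`, `append_request`) are made up for illustration, and
the rejection strings only borrow the wording of the log line quoted in
this mail. The numbers in the demo come from that log (follower
log_end=14891, leader's previous entry at term 1103, index 50975).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Entry:
    term: int
    data: str = ""


@dataclass
class FollowerLog:
    """Hypothetical follower log; entries[i] lives at index log_start + i."""
    log_start: int = 1
    entries: List[Entry] = field(default_factory=list)

    @property
    def log_end(self) -> int:
        # Index one past the last entry, as in "log_end=14891" in the log.
        return self.log_start + len(self.entries)

    def term_at(self, index: int) -> Optional[int]:
        if self.log_start <= index < self.log_end:
            return self.entries[index - self.log_start].term
        return None

    def append_request(self, prev_index: int, prev_term: int,
                       new_entries: List[Entry]) -> Tuple[bool, str]:
        # The leader's "previous entry" must exist locally with the same term.
        if prev_index >= self.log_end:
            return False, "mismatch past end of log"
        if prev_index < self.log_start - 1:
            # Simplification: a real server would compare against a snapshot.
            return False, "previous entry before start of log"
        if prev_index >= self.log_start and self.term_at(prev_index) != prev_term:
            return False, "term mismatch"
        # Accept: drop any conflicting suffix, then append what is missing.
        for offset, entry in enumerate(new_entries):
            index = prev_index + 1 + offset
            existing = self.term_at(index)
            if existing is None:
                self.entries.append(entry)
            elif existing != entry.term:
                del self.entries[index - self.log_start:]
                self.entries.append(entry)
        return True, "ok"


# Follower whose log ends at 14891, as in the quoted append_reply.
follower = FollowerLog(entries=[Entry(term=1102) for _ in range(14890)])
rejected = follower.append_request(50975, 1103, [Entry(term=1103)])
accepted = follower.append_request(14890, 1102, [Entry(term=1103)])
print(rejected, accepted, follower.log_end)
```

In the Raft paper, a leader that gets such a rejection retries with an
earlier previous entry, and falls back to sending a snapshot if it has
already discarded the entries the follower needs.
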
> 2019-11-27T14:22:17.060Z|00240|raft|INFO|rejecting append_request because previous entry 1103,50975 not in local log (mismatch past end of log)
> 2019-11-27T14:22:17.064Z|00241|raft|ERR|Dropped 34 log messages in last 12 seconds (most recently, 0 seconds ago) due to excessive rate
> 2019-11-27T14:22:17.064Z|00242|raft|ERR|internal error: deferred append_reply message completed but not ready to send because message index 14890 is past last synced index 0: a2b2 append_reply "mismatch past end of log": term=1103 log_end=14891 result="inconsistency"
> 2019-11-27T14:22:17.402Z|00243|raft|INFO|rejecting append_request because previous entry 1103,50975 not in local log (mismatch past end of log)
>
>
> [root@ovn1 ~]# ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound
> a2b2
> Name: OVN_Southbound
> Cluster ID: 4c54 (4c546513-77e3-4602-b211-2e200014ad79)
> Server ID: a2b2 (a2b2a9c5-cf58-4724-8421-88fd5ca5d94d)
> Address: tcp:10.254.8.209:6644
> Status: cluster member
> Role: leader
> Term: 1103
> Leader: self
> Vote: self
>
> Log: [42052, 51009]
> Entries not yet committed: 0
> Entries not yet applied: 0
> Connections: ->beaf ->9a33 <-9a33 <-beaf
> Servers:
>     a2b2 (a2b2 at tcp:10.254.8.209:6644) (self) next_index=15199 match_index=51008
>     beaf (beaf at tcp:10.254.8.208:6644) next_index=51009 match_index=0
>     9a33 (9a33 at tcp:10.254.8.210:6644) next_index=51009 match_index=51008
>
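
As a possible stopgap for the proxy question, the `Servers:` section of
the status output above could be parsed to spot a member whose
match_index is far behind, for example from a health-check script. The
following Python sketch assumes the line format shown in this mail (with
each server on a single line) and that the status is taken on the leader;
the function name `lagging_servers` and the `max_lag` threshold are
hypothetical, and the output format may differ between OVS versions.

```python
import re

# Matches e.g. "beaf (beaf at tcp:10.254.8.208:6644) next_index=51009 match_index=0"
SERVER_RE = re.compile(
    r"^\s*(?P<sid>[0-9a-f]{4}) \((?P=sid) at (?P<addr>\S+)\)"
    r"(?P<self> \(self\))?.*?match_index=(?P<match>\d+)"
)


def lagging_servers(status_text, max_lag=100):
    """Return (server id, address, lag) for members far behind the best one."""
    servers = []
    for line in status_text.splitlines():
        m = SERVER_RE.match(line)
        if m:
            servers.append((m["sid"], m["addr"], int(m["match"])))
    if not servers:
        return []
    best = max(match for _, _, match in servers)
    return [(sid, addr, best - match)
            for sid, addr, match in servers
            if best - match > max_lag]


# Servers section copied from the status output quoted in this mail.
STATUS = """\
Servers:
    a2b2 (a2b2 at tcp:10.254.8.209:6644) (self) next_index=15199 match_index=51008
    beaf (beaf at tcp:10.254.8.208:6644) next_index=51009 match_index=0
    9a33 (9a33 at tcp:10.254.8.210:6644) next_index=51009 match_index=51008
"""

for sid, addr, lag in lagging_servers(STATUS):
    print(f"{sid} at {addr} is {lag} entries behind")
```

With the output above this flags only beaf, whose match_index is 0. This
only hides the stuck member from clients; it does not repair its log.
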

I think it is a bug. I noticed that this problem happens when the cluster
is restarted after DB compaction. I mentioned it in one of the test cases:
https://github.com/openvswitch/ovs/blob/master/tests/ovsdb-cluster.at#L252
I also mentioned another problem related to compaction:
https://github.com/openvswitch/ovs/blob/master/tests/ovsdb-cluster.at#L239
I was planning to debug these but haven't had the time yet. I will try to
find some time next week (it would be great if you could figure it out and
submit patches).

Thanks,
Han