[ovs-discuss] [ovs-dev] [OVN][RAFT] Follower refusing new entries from leader

Han Zhou hzhou at ovn.org
Wed Dec 4 02:01:16 UTC 2019


Hi,

Could you see if this patch fixes your problem?
https://patchwork.ozlabs.org/patch/1203951/

Thanks,
Han


On Mon, Dec 2, 2019 at 12:28 AM Han Zhou <hzhou at ovn.org> wrote:

> Sorry for the late reply. It was a holiday here.
> I haven't seen this problem when there is no compaction. Did you see the
> problem when DB compaction hadn't happened? The difference is that after
> compaction the RAFT log doesn't have any entries and all the data is in
> the snapshot.
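To make the difference concrete, here is a tiny C illustration of the state a
follower restarts with in the two cases. The struct and field names are
simplified stand-ins, not the real ovsdb/raft.c layout, and the index values
are loosely based on the cluster/status output later in this thread:

    #include <stdint.h>

    /* Simplified view of what a follower has on disk when it restarts;
     * illustrative only, not the actual ovsdb/raft.c state. */
    struct follower_state {
        uint64_t snap_index;   /* Last index covered by the snapshot. */
        uint64_t log_start;    /* First index still kept as a log entry. */
        uint64_t log_end;      /* One past the last log entry. */
    };

    /* Plain restart: the log still holds every entry since the previous
     * snapshot, so there is a wide range of indexes the leader's
     * prev_index can be matched against. */
    static const struct follower_state plain_restart = {
        .snap_index = 42051, .log_start = 42052, .log_end = 51009,
    };

    /* Restart right after compaction: every entry has been folded into
     * the snapshot and the log itself is empty (log_start == log_end),
     * so any prev_index other than snap_index is "not in local log". */
    static const struct follower_state after_compaction = {
        .snap_index = 51008, .log_start = 51009, .log_end = 51009,
    };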
>
> On Fri, Nov 29, 2019 at 12:11 AM taoyunupt <taoyunupt at 126.com> wrote:
>
>> Hi, Han
>>           Hoping to receive your reply.
>>
>>
>> Thanks,
>> Yun
>>
>>
>>
>> On 2019-11-28 16:17:07, "taoyunupt" <taoyunupt at 126.com> wrote:
>>
>> Hi, Han
>>          Another question, about the case with NO COMPACTION: if a
>> follower is restarted and the leader sends some entries during the break,
>> will the same problem happen once the follower has started again?  What
>> is the difference between a simple restart and a restart after COMPACTION?
>>
>>
>> Thanks,
>> Yun
>>
>>
>>
>>
>>
>>
>>
>>
>> On 2019-11-28 13:58:36, "taoyunupt" <taoyunupt at 126.com> wrote:
>>
>> Hi, Han
>>          Thanks for your reply.  I think maybe we can disconnect the
>> failed follower from HAProxy, then synchronize the data, and once that is
>> all completed, reconnect it to HAProxy again. But I do not know how the
>> synchronization would actually be done.
>>          It is just my naive idea. Do you have any suggestions about how
>> to fix this problem?  If it is not too complicated, I will have a try.
>>
>>
>> Thanks
>> Yun
>>
>>
>>
>>
>>
>>
>> On 2019-11-28 11:47:55, "Han Zhou" <hzhou at ovn.org> wrote:
>>
>>
>>
>> On Wed, Nov 27, 2019 at 7:22 PM taoyunupt <taoyunupt at 126.com> wrote:
>> >
>> > Hi,
>> >     My OVN cluster has 3 OVN-northd nodes, and they are proxied by
>> HAProxy with a VIP. Recently I have been restarting the OVN cluster
>> frequently, and one of the members reports the logs below.
>> >     After reading the code and the RAFT paper, this looks like the
>> normal process: if the follower does not find an entry in its log with the
>> same index and term, then it refuses the new entries.
>> >     I think it's reasonable for the follower to refuse. But since we
>> cannot control HAProxy (or whatever proxy sits in front), errors will
>> occur whenever a session is assigned to the failed follower.
>> >
>> >     Is there some means or way to solve this problem? Maybe we could
>> kick the failed follower out, or disconnect it from HAProxy and then
>> synchronize the data?  Hope to hear your suggestions.
>> >
>> >
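The check being described is the standard RAFT AppendEntries consistency
check. A minimal sketch, with assumed names and a simplified layout rather
than the real ovsdb/raft.c structures, could look like the following; the
"mismatch past end of log" rejections in the log below correspond to the
final branch:

    #include <stdbool.h>
    #include <stdint.h>

    struct log_entry {
        uint64_t term;
    };

    /* Simplified follower log.  log_start is assumed to be >= 1;
     * everything before log_start lives only in the snapshot, and
     * log_start - 1 is the last index the snapshot covers. */
    struct follower_log {
        uint64_t log_start;            /* Index of entries[0]. */
        uint64_t log_end;              /* One past the last entry. */
        const struct log_entry *entries;
        uint64_t snap_term;            /* Term of entry log_start - 1. */
    };

    /* RAFT consistency check: may the entries that start right after
     * (prev_index, prev_term) be appended to this log? */
    static bool
    can_append(const struct follower_log *l,
               uint64_t prev_index, uint64_t prev_term)
    {
        if (prev_index + 1 < l->log_start) {
            /* The entry is already covered by the snapshot.  A real
             * implementation could accept and skip the entries it already
             * has; for simplicity this sketch just refuses. */
            return false;
        } else if (prev_index + 1 == l->log_start) {
            /* The entry is the last one folded into the snapshot. */
            return prev_term == l->snap_term;
        } else if (prev_index < l->log_end) {
            /* The entry is in the log; its term must match. */
            return l->entries[prev_index - l->log_start].term == prev_term;
        } else {
            /* The follower has never received this entry at all:
             * "mismatch past end of log". */
            return false;
        }
    }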
>> > 2019-11-27T14:22:17.060Z|00240|raft|INFO|rejecting append_request
>> because previous entry 1103,50975 not in local log (mismatch past end of
>> log)
>> > 2019-11-27T14:22:17.064Z|00241|raft|ERR|Dropped 34 log messages in last
>> 12 seconds (most recently, 0 seconds ago) due to excessive rate
>> > 2019-11-27T14:22:17.064Z|00242|raft|ERR|internal error: deferred
>> append_reply message completed but not ready to send because message index
>> 14890 is past last synced index 0: a2b2 append_reply "mismatch past end of
>> log": term=1103 log_end=14891 result="inconsistency"
>> > 2019-11-27T14:22:17.402Z|00243|raft|INFO|rejecting append_request
>> because previous entry 1103,50975 not in local log (mismatch past end of
>> log)
>> >
>> >
>> > [root at ovn1 ~]#  ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl
>> cluster/status OVN_Southbound
>> > a2b2
>> > Name: OVN_Southbound
>> > Cluster ID: 4c54 (4c546513-77e3-4602-b211-2e200014ad79)
>> > Server ID: a2b2 (a2b2a9c5-cf58-4724-8421-88fd5ca5d94d)
>> > Address: tcp:10.254.8.209:6644
>> > Status: cluster member
>> > Role: leader
>> > Term: 1103
>> > Leader: self
>> > Vote: self
>> >
>> > Log: [42052, 51009]
>> > Entries not yet committed: 0
>> > Entries not yet applied: 0
>> > Connections: ->beaf ->9a33 <-9a33 <-beaf
>> > Servers:
>> >     a2b2 (a2b2 at tcp:10.254.8.209:6644) (self) next_index=15199
>> match_index=51008
>> >     beaf (beaf at tcp:10.254.8.208:6644) next_index=51009 match_index=0
>> >     9a33 (9a33 at tcp:10.254.8.210:6644) next_index=51009
>> match_index=51008
>>
>> >
>>
>>
>> I think it is a bug. I noticed that this problem happens when the cluster
>> is restarted after DB compaction. I mentioned it in one of the test cases:
>> https://github.com/openvswitch/ovs/blob/master/tests/ovsdb-cluster.at#L252
>> I also mentioned another problem related to compaction:
>> https://github.com/openvswitch/ovs/blob/master/tests/ovsdb-cluster.at#L239
>> I was planning to debug these but haven't had the time yet. I will try to
>> find some time next week (it would be great if you could figure it out and
>> submit patches).
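For context on why a rejection should normally be self-healing: in the
textbook RAFT protocol the leader reacts to a rejected append_request by
backing off next_index and retrying, and once next_index reaches the start of
its own (possibly compacted) log it ships a snapshot instead. A rough sketch
of that leader-side decision, with made-up names and no claim that this is
how the ovsdb code handles it:

    #include <stdint.h>

    enum leader_action {
        RETRY_APPEND,       /* Back off next_index and resend entries. */
        SEND_SNAPSHOT,      /* Follower is too far behind; ship a snapshot. */
    };

    /* Textbook leader-side reaction when a follower rejects an
     * append_request; illustrative only. */
    static enum leader_action
    react_to_rejection(uint64_t leader_log_start, uint64_t *next_index)
    {
        if (*next_index > leader_log_start) {
            /* The leader still has the entry before *next_index, so it can
             * step back and look for a point where the two logs agree. */
            --*next_index;
            return RETRY_APPEND;
        }
        /* Everything before *next_index is only in the leader's snapshot,
         * so only an install_snapshot can bring the follower back in sync. */
        return SEND_SNAPSHOT;
    }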
>>
>>
>>
>> Thanks,
>> Han
>>
>