[ovs-discuss] [question] need help to understand raft leadership transfer reason

Vladislav Odintsov odivlad at gmail.com
Sun Aug 15 12:37:55 UTC 2021


Han, thanks for the answer.
My comments inline.

Regards,
Vladislav Odintsov

> On 12 Aug 2021, at 19:15, Han Zhou <hzhou at ovn.org> wrote:
> 
> 
> 
> On Thu, Aug 12, 2021 at 4:54 AM Vladislav Odintsov <odivlad at gmail.com <mailto:odivlad at gmail.com>> wrote:
> >
> > Hi,
> >
> > I’ve got a 3-node RAFT cluster, serving ovn northbound DB and trying to understand what triggered ovsdb-server leadership change.
> > Can somebody help explain that?
> >
> > This cluster runs ovs 2.13.4 and has ~25 active clients: 3 x ovn-northd, 3 x ovn-ic, 11 python-ovsdbapp clients (CMS).
> >
> > Server logs (per-node) listed above.
> >
> > Node 1:
> >
> > 2021-08-12T07:45:55.189Z|03131|raft|INFO|server be61 is leader for term 357
> > 2021-08-12T07:46:04.187Z|03132|raft|INFO|term 358: 1454 ms timeout expired, starting election
> > 2021-08-12T07:46:04.191Z|03133|raft|INFO|term 358: elected leader by 2+ of 3 servers
> > 2021-08-12T07:46:05.558Z|03134|raft|INFO|rejecting term 357 < current term 358 received in vote_reply message from server be61
> > 2021-08-12T07:46:29.181Z|03135|timeval|WARN|Unreasonably long 2122ms poll interval (2120ms user, 0ms system)
> > 2021-08-12T07:46:29.181Z|03136|timeval|WARN|context switches: 0 voluntary, 7 involuntary
> > 2021-08-12T07:46:29.181Z|03137|coverage|INFO|Event coverage, avg rate over last: 5 seconds, last minute, last hour,  hash=3b99bba2:
> > 2021-08-12T07:46:29.181Z|03138|coverage|INFO|hmap_pathological         94.0/sec     9.917/sec        0.1819/sec   total: 85714
> > 2021-08-12T07:46:29.181Z|03139|coverage|INFO|hmap_expand              39281.4/sec  4318.283/sec      112.9786/sec   total: 193444648
> > 2021-08-12T07:46:29.181Z|03140|coverage|INFO|lockfile_lock              0.0/sec     0.000/sec        0.0000/sec   total: 1
> > 2021-08-12T07:46:29.181Z|03141|coverage|INFO|poll_create_node         201.6/sec   109.617/sec       49.4042/sec   total: 425254534
> > 2021-08-12T07:46:29.181Z|03142|coverage|INFO|poll_zero_timeout          2.0/sec     1.567/sec        0.9350/sec   total: 2938208
> > 2021-08-12T07:46:29.181Z|03143|coverage|INFO|seq_change                13.0/sec    10.433/sec        5.5928/sec   total: 22320268
> > 2021-08-12T07:46:29.181Z|03144|coverage|INFO|pstream_open               0.0/sec     0.000/sec        0.0000/sec   total: 4
> > 2021-08-12T07:46:29.181Z|03145|coverage|INFO|stream_open                0.0/sec     0.000/sec        0.0000/sec   total: 20
> > 2021-08-12T07:46:29.181Z|03146|coverage|INFO|unixctl_received           0.0/sec     0.017/sec        0.0164/sec   total: 54512
> > 2021-08-12T07:46:29.181Z|03147|coverage|INFO|unixctl_replied            0.0/sec     0.017/sec        0.0164/sec   total: 54512
> > 2021-08-12T07:46:29.181Z|03148|coverage|INFO|util_xalloc              1693863.2/sec 181037.517/sec     3963.9467/sec   total: 4830747405
> > 2021-08-12T07:46:29.181Z|03149|coverage|INFO|97 events never hit
> > 2021-08-12T07:46:29.186Z|03150|raft|WARN|ignoring vote request received as leader
> > 2021-08-12T07:46:29.186Z|03151|raft|INFO|server be61 is leader for term 359
> > 2021-08-12T07:46:29.187Z|03152|raft|INFO|1076 truncating 1 entries from end of log
> > 2021-08-12T07:46:29.187Z|03153|raft|INFO|rejected append_reply (not leader)
> > 2021-08-12T07:46:29.187Z|03154|raft|INFO|rejected append_reply (not leader)
> > 2021-08-12T07:46:29.187Z|03155|raft|INFO|rejected append_reply (not leader)
> > 2021-08-12T07:46:29.191Z|03156|raft|INFO|rejected append_reply (not leader)
> > 2021-08-12T07:46:29.221Z|03157|jsonrpc|WARN|Dropped 4 log messages in last 14866 seconds (most recently, 14865 seconds ago) due to excessive rate
> > 2021-08-12T07:46:29.221Z|03158|jsonrpc|WARN|tcp:client-1:42402: receive error: Connection reset by peer
> > 2021-08-12T07:46:29.222Z|03159|reconnect|WARN|tcp:client-1:42402: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:29.222Z|03160|jsonrpc|WARN|tcp:client-2:45746: receive error: Connection reset by peer
> > 2021-08-12T07:46:29.222Z|03161|reconnect|WARN|tcp:client-2:45746: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:29.225Z|03162|jsonrpc|WARN|tcp:client-3:54218: receive error: Connection reset by peer
> > 2021-08-12T07:46:29.225Z|03163|reconnect|WARN|tcp:client-3:54218: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:29.232Z|03164|jsonrpc|WARN|tcp:client-4:48064: receive error: Connection reset by peer
> > 2021-08-12T07:46:29.232Z|03165|reconnect|WARN|tcp:client-4:48064: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:29.232Z|03166|jsonrpc|WARN|tcp:client-5:49022: receive error: Connection reset by peer
> > 2021-08-12T07:46:29.232Z|03167|reconnect|WARN|tcp:client-5:49022: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:29.566Z|03168|reconnect|WARN|tcp:node-3:40902: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:30.047Z|03169|poll_loop|INFO|Dropped 64 log messages in last 14879 seconds (most recently, 14876 seconds ago) due to excessive rate
> > 2021-08-12T07:46:30.047Z|03170|poll_loop|INFO|wakeup due to [POLLOUT] on fd 49 (node-1:6641<->ip7:48658) at lib/stream-fd.c:153 (72% CPU usage)
> > 2021-08-12T07:46:30.195Z|03171|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
> > 2021-08-12T07:46:30.351Z|03172|poll_loop|INFO|wakeup due to [POLLOUT] on fd 43 (node-1:6641<->ip8:48656) at lib/stream-fd.c:153 (72% CPU usage)
> > 2021-08-12T07:46:30.529Z|03173|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
> > 2021-08-12T07:46:30.753Z|03174|poll_loop|INFO|wakeup due to [POLLOUT] on fd 39 (node-1:6641<->ip9:48062) at lib/stream-fd.c:153 (72% CPU usage)
> > 2021-08-12T07:46:30.864Z|03175|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
> > 2021-08-12T07:46:30.980Z|03176|poll_loop|INFO|wakeup due to [POLLOUT] on fd 49 (node-1:6641<->ip10:48658) at lib/stream-fd.c:153 (72% CPU usage)
> > 2021-08-12T07:46:31.156Z|03177|poll_loop|INFO|wakeup due to [POLLOUT] on fd 45 (node-1:6641<->ip11:54216) at lib/stream-fd.c:153 (72% CPU usage)
> > 2021-08-12T07:46:31.197Z|03178|poll_loop|INFO|wakeup due to [POLLIN] on fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
> > 2021-08-12T07:46:31.198Z|03179|poll_loop|INFO|wakeup due to 0-ms timeout at lib/reconnect.c:664 (72% CPU usage)
> > 2021-08-12T07:46:46.654Z|03180|raft|INFO|server 9f6e is leader for term 360
> > 2021-08-12T07:46:46.954Z|03181|raft|INFO|rejecting term 359 < current term 360 received in append_request message from server be61
> >
> > Node 2:
> >
> > 2021-08-12T07:45:54.748Z|02385|raft|INFO|Transferring leadership to write a snapshot.
> > 2021-08-12T07:45:55.058Z|02386|raft|INFO|server be61 is leader for term 357
> > 2021-08-12T07:45:55.185Z|02387|raft|INFO|rejecting term 356 < current term 357 received in append_reply message from server 1076
> > 2021-08-12T07:46:04.192Z|02388|raft|INFO|server 1076 is leader for term 358
> > 2021-08-12T07:46:28.339Z|02389|raft|INFO|server be61 is leader for term 359
> > 2021-08-12T07:46:29.177Z|02390|raft|INFO|rejecting term 358 < current term 359 received in append_request message from server 1076
> > 2021-08-12T07:46:29.183Z|02391|raft|INFO|rejecting term 358 < current term 359 received in append_request message from server 1076
> > 2021-08-12T07:46:46.650Z|02392|raft|INFO|term 360: 1425 ms timeout expired, starting election
> > 2021-08-12T07:46:46.654Z|02393|raft|INFO|term 360: elected leader by 2+ of 3 servers
> > 2021-08-12T07:46:46.954Z|02394|raft|INFO|rejecting term 359 < current term 360 received in append_request message from server be61
> >
> > Node 3:
> >
> > 2021-08-12T07:45:54.750Z|03526|raft|INFO|received leadership transfer from 9f6e in term 356
> > 2021-08-12T07:45:54.750Z|03527|raft|INFO|term 357: starting election
> > 2021-08-12T07:45:55.057Z|03528|raft|INFO|term 357: elected leader by 2+ of 3 servers
> > 2021-08-12T07:46:05.558Z|03529|raft|WARN|ignoring vote request received as leader
> > 2021-08-12T07:46:05.558Z|03530|raft|INFO|server 1076 is leader for term 358
> > 2021-08-12T07:46:05.558Z|03531|timeval|WARN|Unreasonably long 2735ms poll interval (2733ms user, 0ms system)
> > 2021-08-12T07:46:05.558Z|03532|timeval|WARN|faults: 5583 minor, 0 major
> > 2021-08-12T07:46:05.558Z|03533|timeval|WARN|disk: 0 reads, 8 writes
> > 2021-08-12T07:46:05.558Z|03534|timeval|WARN|context switches: 0 voluntary, 19 involuntary
> > 2021-08-12T07:46:05.558Z|03535|coverage|INFO|Skipping details of duplicate event coverage for hash=3b99bba2
> > 2021-08-12T07:46:05.559Z|03536|poll_loop|INFO|Dropped 59 log messages in last 8160 seconds (most recently, 8157 seconds ago) due to excessive rate
> > 2021-08-12T07:46:05.559Z|03537|poll_loop|INFO|wakeup due to [POLLOUT] on fd 37 (node-3:6641<->client-1:49952) at lib/stream-fd.c:153 (57% CPU usage)
> > 2021-08-12T07:46:05.559Z|03538|poll_loop|INFO|wakeup due to [POLLOUT] on fd 46 (node-3:6641<->client-2:39512) at lib/stream-fd.c:153 (57% CPU usage)
> > 2021-08-12T07:46:05.559Z|03539|poll_loop|INFO|wakeup due to [POLLOUT] on fd 35 (node-3:6641<->client-3:35686) at lib/stream-fd.c:153 (57% CPU usage)
> > 2021-08-12T07:46:05.559Z|03540|poll_loop|INFO|wakeup due to [POLLOUT] on fd 47 (node-3:6641<->client-4:49954) at lib/stream-fd.c:153 (57% CPU usage)
> > 2021-08-12T07:46:05.559Z|03541|poll_loop|INFO|wakeup due to [POLLOUT] on fd 39 (node-3:6641<->client-5:33490) at lib/stream-fd.c:153 (57% CPU usage)
> > 2021-08-12T07:46:05.559Z|03542|poll_loop|INFO|wakeup due to [POLLOUT] on fd 45 (node-3:6641<->client-6:41156) at lib/stream-fd.c:153 (57% CPU usage)
> > 2021-08-12T07:46:05.559Z|03543|poll_loop|INFO|wakeup due to [POLLOUT] on fd 40 (node-3:6641<->client-7:40098) at lib/stream-fd.c:153 (57% CPU usage)
> > 2021-08-12T07:46:05.559Z|03544|poll_loop|INFO|wakeup due to [POLLOUT] on fd 43 (node-3:6641<->client-8:38632) at lib/stream-fd.c:153 (57% CPU usage)
> > 2021-08-12T07:46:05.559Z|03545|poll_loop|INFO|wakeup due to [POLLOUT] on fd 36 (node-3:6641<->client-9:34912) at lib/stream-fd.c:153 (57% CPU usage)
> > 2021-08-12T07:46:05.559Z|03546|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41 (node-3:6641<->client-10:55638) at lib/stream-fd.c:153 (57% CPU usage)
> > 2021-08-12T07:46:05.896Z|03547|jsonrpc|WARN|tcp:client-11:43772: send error: Broken pipe
> > 2021-08-12T07:46:05.896Z|03548|reconnect|WARN|tcp:client-11:43772: connection dropped (Broken pipe)
> > 2021-08-12T07:46:05.928Z|03549|jsonrpc|WARN|tcp:client-12:55640: receive error: Connection reset by peer
> > 2021-08-12T07:46:05.929Z|03550|reconnect|WARN|tcp:client-12:55640: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:19.068Z|03551|jsonrpc|WARN|tcp:client-13:33490: receive error: Connection reset by peer
> > 2021-08-12T07:46:19.069Z|03552|reconnect|WARN|tcp:client-13:33490: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:19.176Z|03553|jsonrpc|WARN|tcp:client-14:38632: receive error: Connection reset by peer
> > 2021-08-12T07:46:19.177Z|03554|reconnect|WARN|tcp:client-14:38632: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:20.712Z|03555|jsonrpc|WARN|tcp:client-15:34912: receive error: Connection reset by peer
> > 2021-08-12T07:46:20.713Z|03556|reconnect|WARN|tcp:client-15:34912: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:23.453Z|03557|jsonrpc|WARN|tcp:client-16:41158: receive error: Connection reset by peer
> > 2021-08-12T07:46:23.453Z|03558|reconnect|WARN|tcp:client-16:41158: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:28.336Z|03559|raft|INFO|term 359: 1318 ms timeout expired, starting election
> > 2021-08-12T07:46:28.339Z|03560|raft|INFO|term 359: elected leader by 2+ of 3 servers
> > 2021-08-12T07:46:29.178Z|03561|raft|INFO|rejecting term 358 < current term 359 received in append_request message from server 1076
> > 2021-08-12T07:46:29.182Z|03562|raft|INFO|rejecting term 358 < current term 359 received in append_request message from server 1076
> > 2021-08-12T07:46:29.187Z|03563|raft|INFO|rejecting term 358 < current term 359 received in vote_reply message from server 1076
> > 2021-08-12T07:46:46.957Z|03564|timeval|WARN|Unreasonably long 1645ms poll interval (1642ms user, 0ms system)
> > 2021-08-12T07:46:46.957Z|03565|timeval|WARN|faults: 163 minor, 0 major
> > 2021-08-12T07:46:46.957Z|03566|timeval|WARN|context switches: 0 voluntary, 22 involuntary
> > 2021-08-12T07:46:46.957Z|03567|coverage|INFO|Skipping details of duplicate event coverage for hash=3b99bba2
> > 2021-08-12T07:46:47.233Z|03568|raft|INFO|rejected append_reply (not leader)
> > 2021-08-12T07:46:47.233Z|03569|raft|WARN|ignoring vote request received after only 3 ms (minimum election time is 1000 ms)
> > 2021-08-12T07:46:47.233Z|03570|raft|INFO|server 9f6e is leader for term 360
> > 2021-08-12T07:46:47.234Z|03571|raft|INFO|rejected append_reply (not leader)
> > 2021-08-12T07:46:48.290Z|03572|poll_loop|INFO|Dropped 49 log messages in last 43 seconds (most recently, 40 seconds ago) due to excessive rate
> > 2021-08-12T07:46:48.290Z|03573|poll_loop|INFO|wakeup due to [POLLOUT] on fd 43 (node-3:6641<->client-17:34970) at lib/stream-fd.c:153 (64% CPU usage)
> > 2021-08-12T07:46:48.446Z|03574|poll_loop|INFO|wakeup due to [POLLIN] on fd 30 (node-3:6643<->node-2:54684) at lib/stream-fd.c:157 (64% CPU usage)
> > 2021-08-12T07:46:48.647Z|03575|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41 (node-3:6641<->client-18:38648) at lib/stream-fd.c:153 (64% CPU usage)
> > 2021-08-12T07:46:48.780Z|03576|poll_loop|INFO|wakeup due to [POLLIN] on fd 30 (node-3:6643<->node-2:54684) at lib/stream-fd.c:157 (64% CPU usage)
> > 2021-08-12T07:46:48.919Z|03577|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41 (node-3:6641<->client-19:38648) at lib/stream-fd.c:153 (64% CPU usage)
> > 2021-08-12T07:46:49.113Z|03578|poll_loop|INFO|wakeup due to [POLLIN] on fd 30 (1node-32.20.1.125:6643<->node-2:54684) at lib/stream-fd.c:157 (64% CPU usage)
> > 2021-08-12T07:46:49.296Z|03579|poll_loop|INFO|wakeup due to [POLLOUT] on fd 41 (node-3:6641<->client-20:38648) at lib/stream-fd.c:153 (64% CPU usage)
> > 2021-08-12T07:46:50.837Z|03580|jsonrpc|WARN|tcp:client-21:35700: receive error: Connection reset by peer
> > 2021-08-12T07:46:50.837Z|03581|reconnect|WARN|tcp:client-21:35700: connection dropped (Connection reset by peer)
> > 2021-08-12T07:46:55.042Z|03582|jsonrpc|WARN|tcp:client-22:39840: receive error: Connection reset by peer
> > 2021-08-12T07:46:55.043Z|03583|reconnect|WARN|tcp:client-22:39840: connection dropped (Connection reset by peer)
> > 2021-08-12T07:47:01.015Z|03584|jsonrpc|WARN|tcp:client-23:55826: receive error: Connection reset by peer
> > 2021-08-12T07:47:01.015Z|03585|reconnect|WARN|tcp:client-23:55826: connection dropped (Connection reset by peer)
> >
> >
> > First leader ownership transfer (at 07:45:54) is understandable.
> > node-3 received leadership transfer from node-1 because node-1 was prepearing to take a DB snapshot/compaction.
> >
> > But then at 07:46:04.187Z node-1 wrote:
> > 2021-08-12T07:46:04.187Z|03132|raft|INFO|term 358: 1454 ms timeout expired, starting election
> > 2021-08-12T07:46:04.191Z|03133|raft|INFO|term 358: elected leader by 2+ of 3 servers
> >
> > I’m not sure, but I suppose node-1 didn’t get heartbet from leader (node-3), started leader election and won it. Right?
> >
> > My question is: how to understand the reason why node-3 couldn’t send raft heartbeat in time?
> >
> > Can this long poll be the indicator that server was not answering heartbeats?
> 
> Yes. But not sure what caused the long poll on node-3.
> Are these logs from NB-DB servers? (I guessed from the port 6641). Are the NB-DB clients (CMS) using the "leader_only" option - meaning connecting to RAFT leaders only? If so, when node-1 transfers leadship (because of doing snapshot), all the clients to node-1 would reconnect to the new leader (node-3), which could cause some extra load if node-3 just did snapshot right before the leadership transfer and misses the last transaction ID requested by the clients. Not sure if this is what happened, but it is just one possibility.
> 
> I'd recommend:
> 1) Try disable "leader_only" from the clients, so that when leadership changes, they don't reconnect. The cost is that for each write operation the cost is higher on the RAFT cluster. If the write is very heavy/frequent in your cluster it is probably not good to do so.
Thanks for the advice - we’ll try this options. Traffic shouldn’t be very heavy.
> 2) Try increasing the election-timer (heart-beat). It looks your are using the default timer. It can be increase by ovs-appctl -t ... cluster/change-election-timer OVN_Northbound <new timer in ms>. The long poll is 2 - 3 seconds, so at least setting the timer to 3 - 5 seconds. It can be specified while creating the cluster as well.
Am I right that if we switch clients to no_leader_only mode, the reason for these long polls we be eliminated and we can leave default timer settings (which you correctly pointed - 1 second).
> 3) If occasional long latency is acceptable when the leader is doing snapshot, it is better not transferring the leadership before snapshot, if clients are connected using leader-only mode. However, it seems nothing can be done from your side. I think we could add this option to OVSDB so that leadership is not transferred due to snapshot. @Ilya Maximets <mailto:i.maximets at ovn.org> what do you think?
Yeah, that could be a good point to have a choice, whether do snaphot with transferring leadership or not.

But from my side an additional question here: I’ve never seen follower ovsdb-servers taking snapshot/compaction.
Moreover, on a Southbound DB, which is much more write-intensive, than NBDB, I’ve seen many times that leader preparing to take a snapshot, transfers leadership, does a snap. And after some seconds, or maybe in that particular second, when follower node became a leader, this new leader also wants to take a snap! And it starts transferring a leader.

Please correct me, if follower also takes snap and it was just a coincidence. On not?
PS: I’m talking about OVS 2.13.
> 
> Thanks,
> Han
> 
> >
> > 2021-08-12T07:46:05.558Z|03531|timeval|WARN|Unreasonably long 2735ms poll interval (2733ms user, 0ms system)
> > 2021-08-12T07:46:05.558Z|03532|timeval|WARN|faults: 5583 minor, 0 major
> > 2021-08-12T07:46:05.558Z|03533|timeval|WARN|disk: 0 reads, 8 writes
> > 2021-08-12T07:46:05.558Z|03534|timeval|WARN|context switches: 0 voluntary, 19 involuntary
> >
> > Thanks in advance for your help.
> >
> > Regards,
> > Vladislav Odintsov
> >
> > _______________________________________________
> > discuss mailing list
> > discuss at openvswitch.org <mailto:discuss at openvswitch.org>
> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss <https://mail.openvswitch.org/mailman/listinfo/ovs-discuss>_______________________________________________
> discuss mailing list
> discuss at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openvswitch.org/pipermail/ovs-discuss/attachments/20210815/873576f1/attachment-0001.html>


More information about the discuss mailing list