[ovs-discuss] [question] need help to understand raft leadership transfer reason

Han Zhou hzhou at ovn.org
Thu Aug 12 16:15:55 UTC 2021


On Thu, Aug 12, 2021 at 4:54 AM Vladislav Odintsov <odivlad at gmail.com>
wrote:
>
> Hi,
>
> I’ve got a 3-node RAFT cluster, serving ovn northbound DB and trying to
understand what triggered ovsdb-server leadership change.
> Can somebody help explain that?
>
> This cluster runs ovs 2.13.4 and has ~25 active clients: 3 x ovn-northd,
3 x ovn-ic, 11 python-ovsdbapp clients (CMS).
>
> Server logs (per-node) listed above.
>
> Node 1:
>
> 2021-08-12T07:45:55.189Z|03131|raft|INFO|server be61 is leader for term
357
> 2021-08-12T07:46:04.187Z|03132|raft|INFO|term 358: 1454 ms timeout
expired, starting election
> 2021-08-12T07:46:04.191Z|03133|raft|INFO|term 358: elected leader by 2+
of 3 servers
> 2021-08-12T07:46:05.558Z|03134|raft|INFO|rejecting term 357 < current
term 358 received in vote_reply message from server be61
> 2021-08-12T07:46:29.181Z|03135|timeval|WARN|Unreasonably long 2122ms poll
interval (2120ms user, 0ms system)
> 2021-08-12T07:46:29.181Z|03136|timeval|WARN|context switches: 0
voluntary, 7 involuntary
> 2021-08-12T07:46:29.181Z|03137|coverage|INFO|Event coverage, avg rate
over last: 5 seconds, last minute, last hour,  hash=3b99bba2:
> 2021-08-12T07:46:29.181Z|03138|coverage|INFO|hmap_pathological
94.0/sec     9.917/sec        0.1819/sec   total: 85714
> 2021-08-12T07:46:29.181Z|03139|coverage|INFO|hmap_expand
 39281.4/sec  4318.283/sec      112.9786/sec   total: 193444648
> 2021-08-12T07:46:29.181Z|03140|coverage|INFO|lockfile_lock
 0.0/sec     0.000/sec        0.0000/sec   total: 1
> 2021-08-12T07:46:29.181Z|03141|coverage|INFO|poll_create_node
201.6/sec   109.617/sec       49.4042/sec   total: 425254534
> 2021-08-12T07:46:29.181Z|03142|coverage|INFO|poll_zero_timeout
 2.0/sec     1.567/sec        0.9350/sec   total: 2938208
> 2021-08-12T07:46:29.181Z|03143|coverage|INFO|seq_change
 13.0/sec    10.433/sec        5.5928/sec   total: 22320268
> 2021-08-12T07:46:29.181Z|03144|coverage|INFO|pstream_open
0.0/sec     0.000/sec        0.0000/sec   total: 4
> 2021-08-12T07:46:29.181Z|03145|coverage|INFO|stream_open
 0.0/sec     0.000/sec        0.0000/sec   total: 20
> 2021-08-12T07:46:29.181Z|03146|coverage|INFO|unixctl_received
0.0/sec     0.017/sec        0.0164/sec   total: 54512
> 2021-08-12T07:46:29.181Z|03147|coverage|INFO|unixctl_replied
 0.0/sec     0.017/sec        0.0164/sec   total: 54512
> 2021-08-12T07:46:29.181Z|03148|coverage|INFO|util_xalloc
 1693863.2/sec 181037.517/sec     3963.9467/sec   total: 4830747405
> 2021-08-12T07:46:29.181Z|03149|coverage|INFO|97 events never hit
> 2021-08-12T07:46:29.186Z|03150|raft|WARN|ignoring vote request received
as leader
> 2021-08-12T07:46:29.186Z|03151|raft|INFO|server be61 is leader for term
359
> 2021-08-12T07:46:29.187Z|03152|raft|INFO|1076 truncating 1 entries from
end of log
> 2021-08-12T07:46:29.187Z|03153|raft|INFO|rejected append_reply (not
leader)
> 2021-08-12T07:46:29.187Z|03154|raft|INFO|rejected append_reply (not
leader)
> 2021-08-12T07:46:29.187Z|03155|raft|INFO|rejected append_reply (not
leader)
> 2021-08-12T07:46:29.191Z|03156|raft|INFO|rejected append_reply (not
leader)
> 2021-08-12T07:46:29.221Z|03157|jsonrpc|WARN|Dropped 4 log messages in
last 14866 seconds (most recently, 14865 seconds ago) due to excessive rate
> 2021-08-12T07:46:29.221Z|03158|jsonrpc|WARN|tcp:client-1:42402: receive
error: Connection reset by peer
> 2021-08-12T07:46:29.222Z|03159|reconnect|WARN|tcp:client-1:42402:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:29.222Z|03160|jsonrpc|WARN|tcp:client-2:45746: receive
error: Connection reset by peer
> 2021-08-12T07:46:29.222Z|03161|reconnect|WARN|tcp:client-2:45746:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:29.225Z|03162|jsonrpc|WARN|tcp:client-3:54218: receive
error: Connection reset by peer
> 2021-08-12T07:46:29.225Z|03163|reconnect|WARN|tcp:client-3:54218:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:29.232Z|03164|jsonrpc|WARN|tcp:client-4:48064: receive
error: Connection reset by peer
> 2021-08-12T07:46:29.232Z|03165|reconnect|WARN|tcp:client-4:48064:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:29.232Z|03166|jsonrpc|WARN|tcp:client-5:49022: receive
error: Connection reset by peer
> 2021-08-12T07:46:29.232Z|03167|reconnect|WARN|tcp:client-5:49022:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:29.566Z|03168|reconnect|WARN|tcp:node-3:40902:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:30.047Z|03169|poll_loop|INFO|Dropped 64 log messages in
last 14879 seconds (most recently, 14876 seconds ago) due to excessive rate
> 2021-08-12T07:46:30.047Z|03170|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 49 (node-1:6641<->ip7:48658) at lib/stream-fd.c:153 (72% CPU usage)
> 2021-08-12T07:46:30.195Z|03171|poll_loop|INFO|wakeup due to [POLLIN] on
fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
> 2021-08-12T07:46:30.351Z|03172|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 43 (node-1:6641<->ip8:48656) at lib/stream-fd.c:153 (72% CPU usage)
> 2021-08-12T07:46:30.529Z|03173|poll_loop|INFO|wakeup due to [POLLIN] on
fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
> 2021-08-12T07:46:30.753Z|03174|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 39 (node-1:6641<->ip9:48062) at lib/stream-fd.c:153 (72% CPU usage)
> 2021-08-12T07:46:30.864Z|03175|poll_loop|INFO|wakeup due to [POLLIN] on
fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
> 2021-08-12T07:46:30.980Z|03176|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 49 (node-1:6641<->ip10:48658) at lib/stream-fd.c:153 (72% CPU usage)
> 2021-08-12T07:46:31.156Z|03177|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 45 (node-1:6641<->ip11:54216) at lib/stream-fd.c:153 (72% CPU usage)
> 2021-08-12T07:46:31.197Z|03178|poll_loop|INFO|wakeup due to [POLLIN] on
fd 20 (node-1:6643<->node-3:47488) at lib/stream-fd.c:157 (72% CPU usage)
> 2021-08-12T07:46:31.198Z|03179|poll_loop|INFO|wakeup due to 0-ms timeout
at lib/reconnect.c:664 (72% CPU usage)
> 2021-08-12T07:46:46.654Z|03180|raft|INFO|server 9f6e is leader for term
360
> 2021-08-12T07:46:46.954Z|03181|raft|INFO|rejecting term 359 < current
term 360 received in append_request message from server be61
>
> Node 2:
>
> 2021-08-12T07:45:54.748Z|02385|raft|INFO|Transferring leadership to write
a snapshot.
> 2021-08-12T07:45:55.058Z|02386|raft|INFO|server be61 is leader for term
357
> 2021-08-12T07:45:55.185Z|02387|raft|INFO|rejecting term 356 < current
term 357 received in append_reply message from server 1076
> 2021-08-12T07:46:04.192Z|02388|raft|INFO|server 1076 is leader for term
358
> 2021-08-12T07:46:28.339Z|02389|raft|INFO|server be61 is leader for term
359
> 2021-08-12T07:46:29.177Z|02390|raft|INFO|rejecting term 358 < current
term 359 received in append_request message from server 1076
> 2021-08-12T07:46:29.183Z|02391|raft|INFO|rejecting term 358 < current
term 359 received in append_request message from server 1076
> 2021-08-12T07:46:46.650Z|02392|raft|INFO|term 360: 1425 ms timeout
expired, starting election
> 2021-08-12T07:46:46.654Z|02393|raft|INFO|term 360: elected leader by 2+
of 3 servers
> 2021-08-12T07:46:46.954Z|02394|raft|INFO|rejecting term 359 < current
term 360 received in append_request message from server be61
>
> Node 3:
>
> 2021-08-12T07:45:54.750Z|03526|raft|INFO|received leadership transfer
from 9f6e in term 356
> 2021-08-12T07:45:54.750Z|03527|raft|INFO|term 357: starting election
> 2021-08-12T07:45:55.057Z|03528|raft|INFO|term 357: elected leader by 2+
of 3 servers
> 2021-08-12T07:46:05.558Z|03529|raft|WARN|ignoring vote request received
as leader
> 2021-08-12T07:46:05.558Z|03530|raft|INFO|server 1076 is leader for term
358
> 2021-08-12T07:46:05.558Z|03531|timeval|WARN|Unreasonably long 2735ms poll
interval (2733ms user, 0ms system)
> 2021-08-12T07:46:05.558Z|03532|timeval|WARN|faults: 5583 minor, 0 major
> 2021-08-12T07:46:05.558Z|03533|timeval|WARN|disk: 0 reads, 8 writes
> 2021-08-12T07:46:05.558Z|03534|timeval|WARN|context switches: 0
voluntary, 19 involuntary
> 2021-08-12T07:46:05.558Z|03535|coverage|INFO|Skipping details of
duplicate event coverage for hash=3b99bba2
> 2021-08-12T07:46:05.559Z|03536|poll_loop|INFO|Dropped 59 log messages in
last 8160 seconds (most recently, 8157 seconds ago) due to excessive rate
> 2021-08-12T07:46:05.559Z|03537|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 37 (node-3:6641<->client-1:49952) at lib/stream-fd.c:153 (57% CPU usage)
> 2021-08-12T07:46:05.559Z|03538|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 46 (node-3:6641<->client-2:39512) at lib/stream-fd.c:153 (57% CPU usage)
> 2021-08-12T07:46:05.559Z|03539|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 35 (node-3:6641<->client-3:35686) at lib/stream-fd.c:153 (57% CPU usage)
> 2021-08-12T07:46:05.559Z|03540|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 47 (node-3:6641<->client-4:49954) at lib/stream-fd.c:153 (57% CPU usage)
> 2021-08-12T07:46:05.559Z|03541|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 39 (node-3:6641<->client-5:33490) at lib/stream-fd.c:153 (57% CPU usage)
> 2021-08-12T07:46:05.559Z|03542|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 45 (node-3:6641<->client-6:41156) at lib/stream-fd.c:153 (57% CPU usage)
> 2021-08-12T07:46:05.559Z|03543|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 40 (node-3:6641<->client-7:40098) at lib/stream-fd.c:153 (57% CPU usage)
> 2021-08-12T07:46:05.559Z|03544|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 43 (node-3:6641<->client-8:38632) at lib/stream-fd.c:153 (57% CPU usage)
> 2021-08-12T07:46:05.559Z|03545|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 36 (node-3:6641<->client-9:34912) at lib/stream-fd.c:153 (57% CPU usage)
> 2021-08-12T07:46:05.559Z|03546|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 41 (node-3:6641<->client-10:55638) at lib/stream-fd.c:153 (57% CPU usage)
> 2021-08-12T07:46:05.896Z|03547|jsonrpc|WARN|tcp:client-11:43772: send
error: Broken pipe
> 2021-08-12T07:46:05.896Z|03548|reconnect|WARN|tcp:client-11:43772:
connection dropped (Broken pipe)
> 2021-08-12T07:46:05.928Z|03549|jsonrpc|WARN|tcp:client-12:55640: receive
error: Connection reset by peer
> 2021-08-12T07:46:05.929Z|03550|reconnect|WARN|tcp:client-12:55640:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:19.068Z|03551|jsonrpc|WARN|tcp:client-13:33490: receive
error: Connection reset by peer
> 2021-08-12T07:46:19.069Z|03552|reconnect|WARN|tcp:client-13:33490:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:19.176Z|03553|jsonrpc|WARN|tcp:client-14:38632: receive
error: Connection reset by peer
> 2021-08-12T07:46:19.177Z|03554|reconnect|WARN|tcp:client-14:38632:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:20.712Z|03555|jsonrpc|WARN|tcp:client-15:34912: receive
error: Connection reset by peer
> 2021-08-12T07:46:20.713Z|03556|reconnect|WARN|tcp:client-15:34912:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:23.453Z|03557|jsonrpc|WARN|tcp:client-16:41158: receive
error: Connection reset by peer
> 2021-08-12T07:46:23.453Z|03558|reconnect|WARN|tcp:client-16:41158:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:28.336Z|03559|raft|INFO|term 359: 1318 ms timeout
expired, starting election
> 2021-08-12T07:46:28.339Z|03560|raft|INFO|term 359: elected leader by 2+
of 3 servers
> 2021-08-12T07:46:29.178Z|03561|raft|INFO|rejecting term 358 < current
term 359 received in append_request message from server 1076
> 2021-08-12T07:46:29.182Z|03562|raft|INFO|rejecting term 358 < current
term 359 received in append_request message from server 1076
> 2021-08-12T07:46:29.187Z|03563|raft|INFO|rejecting term 358 < current
term 359 received in vote_reply message from server 1076
> 2021-08-12T07:46:46.957Z|03564|timeval|WARN|Unreasonably long 1645ms poll
interval (1642ms user, 0ms system)
> 2021-08-12T07:46:46.957Z|03565|timeval|WARN|faults: 163 minor, 0 major
> 2021-08-12T07:46:46.957Z|03566|timeval|WARN|context switches: 0
voluntary, 22 involuntary
> 2021-08-12T07:46:46.957Z|03567|coverage|INFO|Skipping details of
duplicate event coverage for hash=3b99bba2
> 2021-08-12T07:46:47.233Z|03568|raft|INFO|rejected append_reply (not
leader)
> 2021-08-12T07:46:47.233Z|03569|raft|WARN|ignoring vote request received
after only 3 ms (minimum election time is 1000 ms)
> 2021-08-12T07:46:47.233Z|03570|raft|INFO|server 9f6e is leader for term
360
> 2021-08-12T07:46:47.234Z|03571|raft|INFO|rejected append_reply (not
leader)
> 2021-08-12T07:46:48.290Z|03572|poll_loop|INFO|Dropped 49 log messages in
last 43 seconds (most recently, 40 seconds ago) due to excessive rate
> 2021-08-12T07:46:48.290Z|03573|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 43 (node-3:6641<->client-17:34970) at lib/stream-fd.c:153 (64% CPU usage)
> 2021-08-12T07:46:48.446Z|03574|poll_loop|INFO|wakeup due to [POLLIN] on
fd 30 (node-3:6643<->node-2:54684) at lib/stream-fd.c:157 (64% CPU usage)
> 2021-08-12T07:46:48.647Z|03575|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 41 (node-3:6641<->client-18:38648) at lib/stream-fd.c:153 (64% CPU usage)
> 2021-08-12T07:46:48.780Z|03576|poll_loop|INFO|wakeup due to [POLLIN] on
fd 30 (node-3:6643<->node-2:54684) at lib/stream-fd.c:157 (64% CPU usage)
> 2021-08-12T07:46:48.919Z|03577|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 41 (node-3:6641<->client-19:38648) at lib/stream-fd.c:153 (64% CPU usage)
> 2021-08-12T07:46:49.113Z|03578|poll_loop|INFO|wakeup due to [POLLIN] on
fd 30 (1node-32.20.1.125:6643<->node-2:54684) at lib/stream-fd.c:157 (64%
CPU usage)
> 2021-08-12T07:46:49.296Z|03579|poll_loop|INFO|wakeup due to [POLLOUT] on
fd 41 (node-3:6641<->client-20:38648) at lib/stream-fd.c:153 (64% CPU usage)
> 2021-08-12T07:46:50.837Z|03580|jsonrpc|WARN|tcp:client-21:35700: receive
error: Connection reset by peer
> 2021-08-12T07:46:50.837Z|03581|reconnect|WARN|tcp:client-21:35700:
connection dropped (Connection reset by peer)
> 2021-08-12T07:46:55.042Z|03582|jsonrpc|WARN|tcp:client-22:39840: receive
error: Connection reset by peer
> 2021-08-12T07:46:55.043Z|03583|reconnect|WARN|tcp:client-22:39840:
connection dropped (Connection reset by peer)
> 2021-08-12T07:47:01.015Z|03584|jsonrpc|WARN|tcp:client-23:55826: receive
error: Connection reset by peer
> 2021-08-12T07:47:01.015Z|03585|reconnect|WARN|tcp:client-23:55826:
connection dropped (Connection reset by peer)
>
>
> First leader ownership transfer (at 07:45:54) is understandable.
> node-3 received leadership transfer from node-1 because node-1 was
prepearing to take a DB snapshot/compaction.
>
> But then at 07:46:04.187Z node-1 wrote:
> 2021-08-12T07:46:04.187Z|03132|raft|INFO|term 358: 1454 ms timeout
expired, starting election
> 2021-08-12T07:46:04.191Z|03133|raft|INFO|term 358: elected leader by 2+
of 3 servers
>
> I’m not sure, but I suppose node-1 didn’t get heartbet from leader
(node-3), started leader election and won it. Right?
>
> My question is: how to understand the reason why node-3 couldn’t send
raft heartbeat in time?
>
> Can this long poll be the indicator that server was not answering
heartbeats?

Yes. But not sure what caused the long poll on node-3.
Are these logs from NB-DB servers? (I guessed from the port 6641). Are the
NB-DB clients (CMS) using the "leader_only" option - meaning connecting to
RAFT leaders only? If so, when node-1 transfers leadship (because of doing
snapshot), all the clients to node-1 would reconnect to the new leader
(node-3), which could cause some extra load if node-3 just did snapshot
right before the leadership transfer and misses the last transaction ID
requested by the clients. Not sure if this is what happened, but it is just
one possibility.

I'd recommend:
1) Try disable "leader_only" from the clients, so that when leadership
changes, they don't reconnect. The cost is that for each write operation
the cost is higher on the RAFT cluster. If the write is very heavy/frequent
in your cluster it is probably not good to do so.
2) Try increasing the election-timer (heart-beat). It looks your are using
the default timer. It can be increase by ovs-appctl -t ...
cluster/change-election-timer OVN_Northbound <new timer in ms>. The long
poll is 2 - 3 seconds, so at least setting the timer to 3 - 5 seconds. It
can be specified while creating the cluster as well.
3) If occasional long latency is acceptable when the leader is doing
snapshot, it is better not transferring the leadership before snapshot, if
clients are connected using leader-only mode. However, it seems nothing can
be done from your side. I think we could add this option to OVSDB so that
leadership is not transferred due to snapshot. @Ilya Maximets
<i.maximets at ovn.org> what do you think?

Thanks,
Han

>
> 2021-08-12T07:46:05.558Z|03531|timeval|WARN|Unreasonably long 2735ms poll
interval (2733ms user, 0ms system)
> 2021-08-12T07:46:05.558Z|03532|timeval|WARN|faults: 5583 minor, 0 major
> 2021-08-12T07:46:05.558Z|03533|timeval|WARN|disk: 0 reads, 8 writes
> 2021-08-12T07:46:05.558Z|03534|timeval|WARN|context switches: 0
voluntary, 19 involuntary
>
> Thanks in advance for your help.
>
> Regards,
> Vladislav Odintsov
>
> _______________________________________________
> discuss mailing list
> discuss at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openvswitch.org/pipermail/ovs-discuss/attachments/20210812/b7761e14/attachment-0001.html>


More information about the discuss mailing list