[ovs-dev] raft ovsdb clustering with scale test

aginwala aginwala at asu.edu
Mon Apr 23 08:46:24 UTC 2018


Hi :

With debug-level logging enabled, I was able to get a little further and see a
proper error.
However, I made a couple of changes to the current setup first.
For the controller setup, since I am using 3 VMs, to keep the specs in line
with the current OVN scale env:

   1. Increased the memory of my VMs to 40 GB, as I saw out-of-memory
   errors with 4 GB.
   2. Also set the MTU to 9000 instead of the 1500 used on the scale-env BMs.
   3. Also set net.core.somaxconn=2048, since the Ubuntu image defaults to
   128 and the connection rate is high. (These commands are sketched below.)
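
For reference, the tuning was applied roughly as below (a sketch; "ens3" is a
placeholder for the actual management interface on the VMs):

sysctl -w net.core.somaxconn=2048
echo "net.core.somaxconn = 2048" >> /etc/sysctl.conf   # persist across reboots
ip link set dev ens3 mtu 9000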



After 1.5k lports were created, a further request for 250 lports took 20+
minutes to go through according to the logs, which is huge.

With the above settings in place, the active northd, although processing a
little slowly, did not show any errors and got to around 2k lports instead of
the 1k lports in the previous runs.

For reference, northd logs in debug mode while the wait-until command is
running are at
https://github.com/noah8713/ovn-scale-test/blob/scale_results/results/active_northd_nod3.txt
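
Debug logging was enabled roughly as below (a sketch; the ctl socket path is an
assumption based on the default ovn-ctl rundir):

ovs-appctl -t ovn-northd vlog/set dbg
ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl vlog/set dbg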


Finally, with wait-until, the NB DB process exited with a segmentation fault
on all 3 nodes.

e.g. ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:192.168.220.103:6641" show
ovn-nbctl: tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:192.168.220.103:6641: database connection failed (Connection refused)
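
While the servers are still up, the Raft state on each node can also be
inspected with something like the following (the ctl socket path is again an
assumption based on the default ovn-ctl rundir):

ovs-appctl -t /var/run/openvswitch/ovnnb_db.ctl cluster/status OVN_Northbound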

#node3 ovsdb-server-nb logs

2018-04-23T05:11:34.237Z|00015|raft|INFO|tcp:192.168.10.10:43412: learned
server ID 9e1c
2018-04-23T05:11:34.237Z|00016|raft|INFO|tcp:192.168.10.10:43412: learned
remote address tcp:192.168.220.101:6643
2018-04-23T05:11:34.240Z|00017|raft|INFO|tcp:192.168.10.10:43006: learned
server ID f038
2018-04-23T05:11:34.240Z|00018|raft|INFO|tcp:192.168.10.10:43006: learned
remote address tcp:192.168.220.102:6643
2018-04-23T05:11:44.216Z|00019|memory|INFO|2536 kB peak resident set size
after 10.0 seconds
2018-04-23T05:11:44.216Z|00020|memory|INFO|cells:28 monitors:0
2018-04-23T05:21:04.276Z|00021|memory|INFO|peak resident set size grew 51%
in last 560.1 seconds, from 2536 kB to 3828 kB
2018-04-23T05:21:04.276Z|00022|memory|INFO|cells:5588 monitors:0
2018-04-23T05:24:14.327Z|00023|memory|INFO|peak resident set size grew 76%
in last 190.1 seconds, from 3828 kB to 6744 kB
2018-04-23T05:24:14.328Z|00024|memory|INFO|cells:13588 monitors:0
2018-04-23T05:26:54.365Z|00025|memory|INFO|peak resident set size grew 52%
in last 160.0 seconds, from 6744 kB to 10220 kB
2018-04-23T05:26:54.365Z|00026|memory|INFO|cells:13588 monitors:0
2018-04-23T05:27:04.370Z|00027|memory|INFO|peak resident set size grew 51%
in last 10.0 seconds, from 10220 kB to 15472 kB
2018-04-23T05:27:04.370Z|00028|memory|INFO|cells:13588 monitors:0
2018-04-23T05:28:54.388Z|00029|memory|INFO|peak resident set size grew 60%
in last 110.0 seconds, from 15472 kB to 24780 kB
2018-04-23T05:28:54.388Z|00030|memory|INFO|cells:17588 monitors:0
2018-04-23T05:29:44.399Z|00031|memory|INFO|peak resident set size grew 59%
in last 50.0 seconds, from 24780 kB to 39480 kB
2018-04-23T05:29:44.400Z|00032|memory|INFO|cells:17588 monitors:0
2018-04-23T05:31:44.470Z|00033|memory|INFO|peak resident set size grew 52%
in last 120.1 seconds, from 39480 kB to 60000 kB
2018-04-23T05:31:44.470Z|00034|memory|INFO|cells:17588 monitors:0
2018-04-23T05:32:24.474Z|00035|memory|INFO|peak resident set size grew 56%
in last 40.0 seconds, from 60000 kB to 93528 kB
2018-04-23T05:32:24.475Z|00036|memory|INFO|cells:17588 monitors:0
2018-04-23T05:33:44.482Z|00037|memory|INFO|peak resident set size grew 52%
in last 80.0 seconds, from 93528 kB to 142304 kB
2018-04-23T05:33:44.482Z|00038|memory|INFO|cells:17588 monitors:0
2018-04-23T05:37:34.511Z|00039|memory|INFO|peak resident set size grew 52%
in last 230.0 seconds, from 142304 kB to 215640 kB
2018-04-23T05:37:34.511Z|00040|memory|INFO|cells:21588 monitors:0
2018-04-23T05:56:30.576Z|00002|daemon_unix(monitor)|ERR|1 crashes: pid
16603 died, killed (Segmentation fault), core dumped, restarting



#node 2 ovsdb-server-nb logs
2018-04-23T05:56:30.298Z|00040|reconnect|INFO|tcp:192.168.220.103:6643:
connection closed by peer
2018-04-23T05:56:31.298Z|00041|reconnect|INFO|tcp:192.168.220.103:6643:
connecting...
2018-04-23T05:56:31.299Z|00042|reconnect|INFO|tcp:192.168.220.103:6643:
connection attempt failed (Connection refused)
2018-04-23T05:56:31.299Z|00043|reconnect|INFO|tcp:192.168.220.103:6643:
waiting 2 seconds before reconnect
2018-04-23T05:56:33.298Z|00044|reconnect|INFO|tcp:192.168.220.103:6643:
connecting...
2018-04-23T05:56:33.298Z|00045|reconnect|INFO|tcp:192.168.220.103:6643:
connection attempt failed (Connection refused)
2018-04-23T05:56:33.298Z|00046|reconnect|INFO|tcp:192.168.220.103:6643:
waiting 4 seconds before reconnect
2018-04-23T05:56:37.298Z|00047|reconnect|INFO|tcp:192.168.220.103:6643:
connecting...
2018-04-23T05:56:37.298Z|00048|reconnect|INFO|tcp:192.168.220.103:6643:
connection attempt failed (Connection refused)
2018-04-23T05:56:37.298Z|00049|reconnect|INFO|tcp:192.168.220.103:6643:
continuing to reconnect in the background but suppressing further logging
2018-04-23T05:58:30.824Z|00050|jsonrpc|WARN|tcp:192.168.220.101:6643:
receive error: Connection reset by peer
2018-04-23T05:58:30.825Z|00051|reconnect|WARN|tcp:192.168.220.101:6643:
connection dropped (Connection reset by peer)
2018-04-23T05:58:31.826Z|00052|reconnect|INFO|tcp:192.168.220.101:6643:
connecting...
2018-04-23T05:58:31.827Z|00053|reconnect|INFO|tcp:192.168.220.101:6643:
connection attempt failed (Connection refused)
2018-04-23T05:58:31.827Z|00054|reconnect|INFO|tcp:192.168.220.101:6643:
waiting 2 seconds before reconnect
2018-04-23T05:58:32.108Z|00055|raft|INFO|term 2: 1737 ms timeout expired,
starting election
2018-04-23T05:58:33.372Z|00056|raft|INFO|term 3: 1263 ms timeout expired,
starting election
2018-04-23T05:58:33.827Z|00057|reconnect|INFO|tcp:192.168.220.101:6643:
connecting...
2018-04-23T05:58:33.828Z|00058|reconnect|INFO|tcp:192.168.220.101:6643:
connection attempt failed (Connection refused)
2018-04-23T05:58:33.828Z|00059|reconnect|INFO|tcp:192.168.220.101:6643:
waiting 4 seconds before reconnect
2018-04-23T05:58:34.987Z|00060|raft|INFO|term 4: 1614 ms timeout expired,
starting election
2018-04-23T05:58:36.122Z|00061|raft|INFO|term 5: 1135 ms timeout expired,
starting election
2018-04-23T05:58:37.646Z|00062|raft|INFO|term 6: 1525 ms timeout expired,
starting election
2018-04-23T05:58:37.829Z|00063|reconnect|INFO|tcp:192.168.220.101:6643:
connecting...
2018-04-23T05:58:37.829Z|00064|reconnect|INFO|tcp:192.168.220.101:6643:
connection attempt failed (Connection refused)
2018-04-23T05:58:37.829Z|00065|reconnect|INFO|tcp:192.168.220.101:6643:
continuing to reconnect in the background but suppressing further logging
2018-04-23T05:59:26.219Z|00002|daemon_unix(monitor)|ERR|1 crashes: pid
16471 died, killed (Segmentation fault), core dumped, restarting



#node1 ovsdb-server-nb logs
2018-04-23T05:56:22.299Z|00384|poll_loop|INFO|wakeup due to [POLLIN] on fd
32 (192.168.220.101:6641<->192.168.10.10:60270) at lib/stream-fd.c:157 (55%
CPU usage)
2018-04-23T05:56:22.302Z|00385|poll_loop|INFO|wakeup due to [POLLIN] on fd
23 (FIFO pipe:[3682840]) at ovsdb/log.c:994 (55% CPU usage)
2018-04-23T05:56:22.302Z|00386|poll_loop|INFO|wakeup due to [POLLIN] on fd
27 (192.168.220.101:6643<->192.168.10.10:53852) at lib/stream-fd.c:157 (55%
CPU usage)
2018-04-23T05:56:22.302Z|00387|poll_loop|INFO|wakeup due to [POLLIN] on fd
30 (192.168.220.101:6643<->192.168.10.10:58108) at lib/stream-fd.c:157 (55%
CPU usage)
2018-04-23T05:56:22.302Z|00388|poll_loop|INFO|wakeup due to 0-ms timeout at
ovsdb/jsonrpc-server.c:598 (55% CPU usage)
2018-04-23T05:56:22.303Z|00389|poll_loop|INFO|wakeup due to [POLLIN] on fd
33 (192.168.220.101:6641<->192.168.220.101:60212) at lib/stream-fd.c:157
(55% CPU usage)
2018-04-23T05:56:24.536Z|00390|poll_loop|INFO|Dropped 1359 log messages in
last 2 seconds (most recently, 0 seconds ago) due to excessive rate
2018-04-23T05:56:24.536Z|00391|poll_loop|INFO|wakeup due to 20-ms timeout
at ovsdb/raft.c:1825 (55% CPU usage)
2018-04-23T05:56:30.235Z|00392|jsonrpc|WARN|tcp:192.168.10.10:58108:
receive error: Connection reset by peer
2018-04-23T05:56:30.235Z|00393|reconnect|WARN|tcp:192.168.10.10:58108:
connection dropped (Connection reset by peer)
2018-04-23T05:56:30.235Z|00394|reconnect|INFO|tcp:192.168.220.103:6643:
connection closed by peer
2018-04-23T05:56:31.236Z|00395|reconnect|INFO|tcp:192.168.220.103:6643:
connecting...
2018-04-23T05:56:31.237Z|00396|reconnect|INFO|tcp:192.168.220.103:6643:
connection attempt failed (Connection refused)
2018-04-23T05:56:31.237Z|00397|reconnect|INFO|tcp:192.168.220.103:6643:
waiting 2 seconds before reconnect
2018-04-23T05:56:32.799Z|00398|raft|INFO|Dropped 12983 log messages in last
14 seconds (most recently, 10 seconds ago) due to excessive rate
2018-04-23T05:56:32.799Z|00399|raft|INFO|current entry eid
7ead0c6c-e30a-485c-bce5-098e6d714c72 does not match prerequisite
843202ec-313e-4b39-bcf2-a0cc4c440887
2018-04-23T05:56:33.238Z|00400|reconnect|INFO|tcp:192.168.220.103:6643:
connecting...
2018-04-23T05:56:33.238Z|00401|reconnect|INFO|tcp:192.168.220.103:6643:
connection attempt failed (Connection refused)
2018-04-23T05:56:33.238Z|00402|reconnect|INFO|tcp:192.168.220.103:6643:
waiting 4 seconds before reconnect
2018-04-23T05:56:37.239Z|00403|reconnect|INFO|tcp:192.168.220.103:6643:
connecting...
2018-04-23T05:56:37.242Z|00404|reconnect|INFO|tcp:192.168.220.103:6643:
connection attempt failed (Connection refused)
2018-04-23T05:56:37.242Z|00405|reconnect|INFO|tcp:192.168.220.103:6643:
continuing to reconnect in the background but suppressing further logging
2018-04-23T05:56:42.413Z|00406|raft|INFO|Dropped 15487 log messages in last
10 seconds (most recently, 0 seconds ago) due to excessive rate
2018-04-23T05:56:42.413Z|00407|raft|INFO|current entry eid
05ee84d3-1569-4707-9727-c6ea2635173c does not match prerequisite
acf1dc8e-64ae-43d9-ae0e-1d7f75d24e87
2018-04-23T05:57:38.509Z|00408|raft|INFO|Dropped 10955 log messages in last
56 seconds (most recently, 45 seconds ago) due to excessive rate
2018-04-23T05:57:38.509Z|00409|raft|INFO|current entry eid
24d45f3f-ec8c-4297-bdb6-fe5a2b53fedb does not match prerequisite
d63e09ac-104e-4ab8-a872-38db93df9b8a
2018-04-23T05:57:38.509Z|00410|raft|INFO|current entry eid
24d45f3f-ec8c-4297-bdb6-fe5a2b53fedb does not match prerequisite
d63e09ac-104e-4ab8-a872-38db93df9b8a
2018-04-23T05:57:38.509Z|00411|raft|INFO|current entry eid
24d45f3f-ec8c-4297-bdb6-fe5a2b53fedb does not match prerequisite
d63e09ac-104e-4ab8-a872-38db93df9b8a
2018-04-23T05:57:38.510Z|00412|raft|INFO|current entry eid
24d45f3f-ec8c-4297-bdb6-fe5a2b53fedb does not match prerequisite
d63e09ac-104e-4ab8-a872-38db93df9b8a
2018-04-23T05:57:42.421Z|00413|raft|INFO|Dropped 11522 log messages in last
4 seconds (most recently, 0 seconds ago) due to excessive rate
2018-04-23T05:57:42.421Z|00414|raft|INFO|current entry eid
fc4b668c-c6b4-43ea-b371-1ee106715c57 does not match prerequisite
e50b2fd1-7718-48c9-99b5-23b47709ee1c
2018-04-23T05:58:19.675Z|00415|raft|INFO|Dropped 3656 log messages in last
37 seconds (most recently, 33 seconds ago) due to excessive rate
2018-04-23T05:58:19.676Z|00416|raft|INFO|current entry eid
c398d0de-c4e0-474e-aefb-d9de1d2f7353 does not match prerequisite
a06e0870-41c1-48d7-8ec1-0ef6c90a1d7c
2018-04-23T05:58:19.676Z|00417|raft|INFO|current entry eid
c398d0de-c4e0-474e-aefb-d9de1d2f7353 does not match prerequisite
a06e0870-41c1-48d7-8ec1-0ef6c90a1d7c
2018-04-23T05:58:19.677Z|00418|raft|INFO|current entry eid
c398d0de-c4e0-474e-aefb-d9de1d2f7353 does not match prerequisite
a06e0870-41c1-48d7-8ec1-0ef6c90a1d7c
2018-04-23T05:58:30.761Z|00002|daemon_unix(monitor)|ERR|1 crashes: pid
31577 died, killed (Segmentation fault), core dumped, restarting


#dmesg logs on node1
[22471.454414] ovsdb-server[31577]: segfault at 7f96ebe83310 ip
00007f96e9d1445b sp 00007ffe6ac36120 error 4 in libc-2.23.so
[7f96e9c96000+1c0000]

#node2
[21707.901519] ovsdb-server[16471]: segfault at 7fce382bc520 ip
00007fce2f30645b sp 00007fff14533760 error 4 in libc-2.23.so
[7fce2f288000+1c0000]

#node3
[21300.376651] ovsdb-server[16603]: segfault at 7f17ba556c40 ip
00007f17ad5a745b sp 00007ffc195657f0 error 4 in libc-2.23.so
[7f17ad529000+1c0000]
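
Since cores were dumped, a backtrace should show where ovsdb-server died;
roughly as below, assuming a packaged binary at /usr/sbin/ovsdb-server and
available debug symbols (the core path is a placeholder and depends on
kernel.core_pattern):

gdb /usr/sbin/ovsdb-server /path/to/core
(gdb) bt full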

The SB DB worked fine with no issues.

Restarting the NB DB now fails with the below error on all 3 nodes:
ovsdb-server: syntax "{"cluster_id":"391406ed-dc0a-44cd-8483-acae65dcc405
....... syntax error: Parsing raft header failed: Type mismatch for member
'prev_servers'.
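
Presumably the crash left the on-disk cluster log in a bad state. For what it
is worth, the database file can be sanity-checked with ovsdb-tool, and if it is
unrecoverable the cluster can be recreated; a rough sketch, assuming the
default ovn-ctl file locations:

ovsdb-tool check-cluster /etc/openvswitch/ovnnb_db.db
# If unrecoverable, recreate on node 1 (contents may be the schema or a standalone backup db):
ovsdb-tool create-cluster /etc/openvswitch/ovnnb_db.db /usr/share/openvswitch/ovn-nb.ovsschema tcp:192.168.220.101:6643
# Then join the other nodes, e.g. on node 2:
ovsdb-tool join-cluster /etc/openvswitch/ovnnb_db.db OVN_Northbound tcp:192.168.220.102:6643 tcp:192.168.220.101:6643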

Let me know if this gives you any further clues as to what could be causing
this. I have not gone into the raft code in detail, but it now seems worth
digging into with the hints above.


Regards,


On Tue, Apr 17, 2018 at 6:07 PM, aginwala <aginwala at asu.edu> wrote:

> Thanks Numan:
>
>
> Will also try to dig a little deeper and see if we get some more hints by
> increasing the log levels to debug, and will wait for further input from you
> and Ben. Hope we can find the culprit soon :)
>
>
> Regards,
>
>
> On Tue, Apr 17, 2018 at 12:13 AM, Numan Siddique <nusiddiq at redhat.com>
> wrote:
>
>>
>>
>> On Tue, Apr 17, 2018 at 5:52 AM, aginwala <aginwala at asu.edu> wrote:
>>
>>> Hi Ben/Numan:
>>>
>>>
>>> Just following up again. Let me know if there are any further
>>> improvements/suggestions on this.
>>>
>>>
>> Hi Aliasgar,
>>
>> I need to educate myself first by studying raft and Ben's patches :). I
>> hope I will do it soon.
>>
>> Thanks
>> Numan
>>
>>
>>
>>
>>> Regards,
>>>
>>>
>>> On Mon, Apr 9, 2018 at 10:53 AM, aginwala <aginwala at asu.edu> wrote:
>>>
>>> > Further analysis indicates that the cluster is not able to elect a new
>>> > leader, and hence every node keeps retrying:
>>> >
>>> > ovn-northd logs around the same time on all three nodes complain about
>>> > disconnection from the cluster and failure to choose a leader:
>>> >
>>> >
>>> > #Node1
>>> > 2018-04-06T01:19:06.481Z|00466|ovn_northd|INFO|ovn-northd lock lost.
>>> This
>>> > ovn-northd instance is now on standby.
>>> > 2018-04-06T01:19:07.668Z|00467|ovn_northd|INFO|ovn-northd lock
>>> acquired.
>>> > This ovn-northd instance is now active.
>>> > 2018-04-06T01:19:07.671Z|00468|ovsdb_idl|INFO|tcp:192.168.220.102:6642
>>> :
>>> > clustered database server is not cluster leader; trying another server
>>> > 2018-04-06T01:19:15.673Z|00469|reconnect|INFO|tcp:192.168.220.103:6642
>>> :
>>> > connected
>>> > 2018-04-06T01:19:15.673Z|00470|ovn_northd|INFO|ovn-northd lock lost.
>>> This
>>> > ovn-northd instance is now on standby.
>>> >
>>> > #Node2
>>> > 2018-04-06T01:19:58.249Z|00487|ovn_northd|INFO|ovn-northd lock
>>> acquired.
>>> > This ovn-northd instance is now active.
>>> > 2018-04-06T01:19:58.255Z|00488|ovsdb_idl|INFO|tcp:192.168.220.103:6642
>>> :
>>> > clustered database server is disconnected from cluster; trying another
>>> > server
>>> > 2018-04-06T01:20:09.261Z|00489|reconnect|INFO|tcp:192.168.220.102:6642
>>> :
>>> > connected
>>> > 2018-04-06T01:20:09.261Z|00490|ovn_northd|INFO|ovn-northd lock lost.
>>> This
>>> > ovn-northd instance is now on standby.
>>> >
>>> >
>>> > #Node3
>>> > 2018-04-06T01:19:01.654Z|00964|ovn_northd|INFO|ovn-northd lock
>>> acquired.
>>> > This ovn-northd instance is now active.
>>> > 2018-04-06T01:19:01.711Z|00965|ovsdb_idl|INFO|tcp:192.168.220.102:6642
>>> :
>>> > clustered database server is disconnected from cluster; trying another
>>> > server
>>> > 2018-04-06T01:19:09.715Z|00966|reconnect|INFO|tcp:192.168.220.103:6642
>>> :
>>> > connected
>>> > 2018-04-06T01:19:09.716Z|00967|ovn_northd|INFO|ovn-northd lock lost.
>>> This
>>> > ovn-northd instance is now on standby.
>>> >
>>> >
>>> >
>>> > Regards,
>>> > Aliasgar
>>> >
>>> > On Fri, Apr 6, 2018 at 2:16 PM, aginwala <aginwala at asu.edu> wrote:
>>> >
>>> >> Hi Ben/Numan:
>>> >>
>>> >> So I went ahead and tried out the clustered DB in the scale env, and
>>> >> the results are not as expected.
>>> >> OVS branch used is master, commit id
>>> >> 2062840612904ff0873d46b2f4f226bc23f2296d.
>>> >>
>>> >> The setup uses 3 nodes.
>>> >>
>>> >> Also disabled inactivity_probe:
>>> >> ovn-nbctl --db="tcp:192.168.220.101:6641,tcp:192.168.220.102:6641,tcp:192.168.220.103:6641" set NB . external_ids:inactivity_probe=0
>>> >> ovn-sbctl --db="tcp:192.168.220.101:6642,tcp:192.168.220.102:6642,tcp:192.168.220.103:6642" set SB . external_ids:inactivity_probe=0
>>> >>
>>> >>
>>> >> 1. With wait_up = true for 10k lports and 1k HVs, which internally uses
>>> >> wait-until, it was able to create around 1k lports before rally exited
>>> >> due to a DB connection error.
>>> >> The "database connection failed" cause of failure is at
>>> >> https://raw.githubusercontent.com/noah8713/ovn-scale-test/e398ccdd77b5c0bfed3d1cfe37c7e7ac5d8d8f81/results/output_raft_fail.html
>>> >>
>>> >> Some data from the SB DB logs is below:
>>> >> 2018-04-05T06:29:32.628Z|00065|poll_loop|INFO|wakeup due to 9-ms
>>> timeout
>>> >> at lib/reconnect.c:643 (90% CPU usage)
>>> >> 2018-04-05T06:30:19.432Z|00066|raft|INFO|current entry eid
>>> >> 23a85453-70a1-49ae-bf8f-22a1eeda6a60 does not match prerequisite
>>> >> 966d3234-961a-4dc5-b2ad-4f12766fd610 in execute_command_request
>>> >> 2018-04-05T06:31:03.428Z|00067|raft|INFO|current entry eid
>>> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>>> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>>> >> 2018-04-05T06:31:03.469Z|00068|raft|INFO|current entry eid
>>> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>>> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>>> >> 2018-04-05T06:31:03.492Z|00069|raft|INFO|current entry eid
>>> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>>> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>>> >> 2018-04-05T06:31:03.505Z|00070|raft|INFO|current entry eid
>>> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>>> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>>> >> 2018-04-05T06:31:03.510Z|00071|raft|INFO|current entry eid
>>> >> f0e3517a-e59a-4533-9e60-a3ded2ba704b does not match prerequisite
>>> >> 43ed7278-e1de-44b8-83ea-1cc9578c4d41 in execute_command_request
>>> >> 2018-04-05T06:32:04.056Z|00072|memory|INFO|peak resident set size
>>> grew
>>> >> 51% in last 35031.7 seconds, from 24748 kB to 37332 kB
>>> >> 2018-04-05T06:32:04.056Z|00073|memory|INFO|cells:59908 json-caches:1
>>> >> monitors:3 sessions:431
>>> >> 2018-04-05T06:32:27.956Z|00074|raft|WARN|ignoring vote request
>>> received
>>> >> as leader
>>> >> 2018-04-05T06:32:27.958Z|00075|timeval|WARN|Unreasonably long 1278ms
>>> >> poll interval (1044ms user, 232ms system)
>>> >> 2018-04-05T06:32:27.958Z|00076|timeval|WARN|faults: 61669 minor, 0
>>> major
>>> >> 2018-04-05T06:32:27.958Z|00077|timeval|WARN|context switches: 0
>>> >> voluntary, 3 involuntary
>>> >> 2018-04-05T06:32:27.958Z|00078|coverage|INFO|Event coverage, avg rate
>>> >> over last: 5 seconds, last minute, last hour,  hash=1dd2af20:
>>> >> 2018-04-05T06:32:27.959Z|00079|coverage|INFO|hmap_pathological
>>> >>  25.0/sec    24.633/sec        1.5550/sec   total: 19274
>>> >> 2018-04-05T06:32:27.959Z|00080|coverage|INFO|hmap_expand
>>> >> 4202.6/sec  4041.417/sec      578.9233/sec   total: 14158936
>>> >> 2018-04-05T06:32:27.959Z|00081|coverage|INFO|lockfile_lock
>>> >> 0.0/sec     0.000/sec        0.0000/sec   total: 1
>>> >> 2018-04-05T06:32:27.960Z|00082|coverage|INFO|poll_create_node
>>> >>  50281.4/sec 47195.233/sec    29424.6886/sec   total: 1001238470
>>> >> 2018-04-05T06:32:27.960Z|00083|coverage|INFO|poll_zero_timeout
>>> >> 9.8/sec    12.167/sec        8.2392/sec   total: 312540
>>> >> 2018-04-05T06:32:27.960Z|00084|coverage|INFO|seq_change
>>> >>  122.8/sec   115.967/sec       68.2761/sec   total: 2309640
>>> >> 2018-04-05T06:32:27.961Z|00085|coverage|INFO|pstream_open
>>> >>  0.0/sec     0.000/sec        0.0000/sec   total: 4
>>> >> 2018-04-05T06:32:27.961Z|00086|coverage|INFO|stream_open
>>> >> 0.0/sec     0.000/sec        0.0000/sec   total: 9
>>> >> 2018-04-05T06:32:27.961Z|00087|coverage|INFO|util_xalloc
>>> >> 229362.2/sec 218086.767/sec    46494.5058/sec   total: 1344641131
>>> >>
>>> >>
>>> >>
>>> >> Some data from the NB DB logs:
>>> >> 2018-04-05T23:35:33.458Z|00020|raft|INFO|adding b7bd (b7bd at tcp:
>>> >> 192.168.220.103:6645) to cluster a36d succeeded (completed)
>>> >> 2018-04-05T23:53:06.360Z|00021|jsonrpc|WARN|tcp:192.168.10.10:57298:
>>> >> receive error: Connection reset by peer
>>> >> 2018-04-05T23:53:06.360Z|00022|reconnect|WARN|tcp:192.168.10.10:57298
>>> :
>>> >> connection dropped (Connection reset by peer)
>>> >> 2018-04-05T23:53:09.071Z|00023|jsonrpc|WARN|tcp:192.168.10.10:57348:
>>> >> receive error: Connection reset by peer
>>> >> 2018-04-05T23:53:09.071Z|00024|reconnect|WARN|tcp:192.168.10.10:57348
>>> :
>>> >> connection dropped (Connection reset by peer)
>>> >> ^2018-04-06T00:02:03.864Z|00076|raft|INFO|current entry eid
>>> >> 0b9eeff2-8377-4f6d-a907-cb4e98396d14 does not match prerequisite
>>> >> 277c09b4-96b3-43e9-b482-a9df29576988
>>> >> 2018-04-06T00:02:06.882Z|00077|poll_loop|INFO|Dropped 6187 log
>>> messages
>>> >> in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
>>> >> 2018-04-06T00:02:06.882Z|00078|poll_loop|INFO|wakeup due to [POLLIN]
>>> on
>>> >> fd 31 (192.168.220.102:6645<->192.168.10.10:54782) at
>>> >> lib/stream-fd.c:157 (62% CPU usage)
>>> >> 2018-04-06T00:02:09.130Z|00079|jsonrpc|WARN|tcp:192.168.10.10:33592:
>>> >> receive error: Connection reset by peer
>>> >> 2018-04-06T00:02:09.131Z|00080|reconnect|WARN|tcp:192.168.10.10:33592
>>> :
>>> >> connection dropped (Connection reset by peer)
>>> >> 2018-04-06T00:02:33.986Z|00081|poll_loop|INFO|Dropped 1509 log
>>> messages
>>> >> in last 26 seconds (most recently, 24 seconds ago) due to excessive
>>> rate
>>> >> 2018-04-06T00:02:33.986Z|00082|poll_loop|INFO|wakeup due to [POLLIN]
>>> on
>>> >> fd 24 (FIFO pipe:[40290445]) at ovsdb/log.c:994 (51% CPU usage)
>>> >> 2018-04-06T00:02:33.986Z|00083|poll_loop|INFO|wakeup due to [POLLIN]
>>> on
>>> >> fd 29 (192.168.220.102:6645<->192.168.10.10:55738) at
>>> >> lib/stream-fd.c:157 (51% CPU usage)
>>> >>
>>> >>
>>> >>
>>> >> 2018-04-06T00:13:48.141Z|00037|raft|INFO|rejecting term 4 < current
>>> term
>>> >> 5 received in vote_reply message from server b7bd
>>> >> 2018-04-06T00:14:21.138Z|00038|jsonrpc|WARN|tcp:192.168.10.10:33990:
>>> >> receive error: Connection reset by peer
>>> >> 2018-04-06T00:14:21.138Z|00039|reconnect|WARN|tcp:192.168.10.10:33990
>>> :
>>> >> connection dropped (Connection reset by peer)
>>> >> 2018-04-06T00:17:01.532Z|00040|raft|INFO|current entry eid
>>> >> e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite
>>> >> d47dbdf1-0981-4077-baab-372b05aeb0bb
>>> >> 2018-04-06T00:17:01.532Z|00041|raft|INFO|current entry eid
>>> >> e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite
>>> >> d47dbdf1-0981-4077-baab-372b05aeb0bb
>>> >> 2018-04-06T00:17:01.532Z|00042|raft|INFO|current entry eid
>>> >> e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite
>>> >> d47dbdf1-0981-4077-baab-372b05aeb0bb
>>> >> 2018-04-06T00:17:01.533Z|00043|raft|INFO|current entry eid
>>> >> e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite
>>> >> d47dbdf1-0981-4077-baab-372b05aeb0bb
>>> >> 2018-04-06T00:17:01.533Z|00044|raft|INFO|current entry eid
>>> >> e07229fe-99d8-436b-8ff5-ee533ed3c825 does not match prerequisite
>>> >> d47dbdf1-0981-4077-baab-372b05aeb0bb
>>> >> 2018-04-06T00:19:32.039Z|00045|timeval|WARN|Unreasonably long 1373ms
>>> >> poll interval (1340ms user, 24ms system)
>>> >> 2018-04-06T00:19:32.039Z|00046|timeval|WARN|faults: 319 minor, 0
>>> major
>>> >> 2018-04-06T00:19:32.039Z|00047|timeval|WARN|disk: 0 reads, 368 writes
>>> >> 2018-04-06T00:19:32.039Z|00048|timeval|WARN|context switches: 5
>>> >> voluntary, 19 involuntary
>>> >> 2018-04-06T00:19:32.039Z|00049|coverage|INFO|Event coverage, avg rate
>>> >> over last: 5 seconds, last minute, last hour,  hash=1dd2af20:
>>> >> 2018-04-06T00:19:32.039Z|00050|coverage|INFO|hmap_pathological
>>> >> 0.0/sec     0.000/sec        3.4164/sec   total: 12306
>>> >> 2018-04-06T00:19:32.039Z|00051|coverage|INFO|hmap_expand
>>> >>  73.6/sec    71.267/sec      729.9656/sec   total: 2628980
>>> >> 2018-04-06T00:19:32.039Z|00052|coverage|INFO|lockfile_lock
>>> >> 0.0/sec     0.000/sec        0.0003/sec   total: 1
>>> >> 2018-04-06T00:19:32.039Z|00053|coverage|INFO|poll_create_node
>>> >>  129.6/sec   122.333/sec      866.7178/sec   total: 3120304
>>> >> 2018-04-06T00:19:32.039Z|00054|coverage|INFO|poll_zero_timeout
>>> >> 0.0/sec     0.267/sec       27.4550/sec   total: 98838
>>> >> 2018-04-06T00:19:32.039Z|00055|coverage|INFO|seq_change
>>> >> 10.8/sec    10.217/sec      135.4125/sec   total: 487498
>>> >> 2018-04-06T00:19:32.039Z|00056|coverage|INFO|pstream_open
>>> >>  0.0/sec     0.000/sec        0.0011/sec   total: 4
>>> >> 2018-04-06T00:19:32.039Z|00057|coverage|INFO|stream_open
>>> >> 0.0/sec     0.000/sec        0.0019/sec   total: 7
>>> >> 2018-04-06T00:19:32.039Z|00058|coverage|INFO|util_xalloc
>>> >> 1542.2/sec  1488.750/sec    21346.2483/sec   total: 76872987
>>> >> 2018-04-06T00:19:32.039Z|00059|coverage|
>>> >>
>>> >>
>>> >>
>>> >> To confirm that wait-until was the only bottleneck, I skipped wait-until
>>> >> and the test completed in 2.4 hours, but things did not go as expected.
>>> >> All the chassis ovn-controllers crashed, and ports did not actually get
>>> >> bound all the way down to the computes, as the OVS SB DB was running
>>> >> super hot.
>>> >>
>>> >> Below is the controller log from a sample HV without wait-until.
>>> >> 2018-04-06T04:50:40.274Z|01703|poll_loop|INFO|Dropped 5903 log
>>> messages
>>> >> in last 6 seconds (most recently, 0 seconds ago) due to excessive rate
>>> >> 2018-04-06T04:50:40.274Z|01704|poll_loop|INFO|wakeup due to
>>> >> [POLLIN][POLLHUP] on fd 13 (<->/root/sandbox-192.168.72.4
>>> 5/br-int.mgmt)
>>> >> at ../lib/stream-fd.c:157 (100% CPU usage)
>>> >> 2018-04-06T04:50:41.060Z|01705|ovsdb_idl|INFO|tcp:192.168.22
>>> 0.102:6642:
>>> >> clustered database server is disconnected from cluster; trying another
>>> >> server
>>> >> 2018-04-06T04:50:41.066Z|01706|binding|INFO|Claiming lport
>>> >> lport_a3060c_mlAOEq for this chassis.
>>> >> 2018-04-06T04:50:41.066Z|01707|binding|INFO|lport_a3060c_mlAOEq:
>>> >> Claiming a2:1c:47:32:30:52 172.145.23.37
>>> >> 2018-04-06T04:50:41.066Z|01708|vconn_stream|ERR|send: Broken pipe
>>> >> 2018-04-06T04:50:41.066Z|01709|rconn|WARN|unix:/root/sandbox-
>>> >> 192.168.72.45/br-int.mgmt: connection dropped (Broken pipe)
>>> >> 2018-04-06T04:50:41.067Z|01710|rconn|INFO|unix:/root/sandbox-
>>> >> 192.168.72.45/br-int.mgmt: connected
>>> >> 2018-04-06T04:50:41.067Z|01711|rconn|WARN|unix:/root/sandbox-
>>> >> 192.168.72.45/br-int.mgmt: connection dropped (Broken pipe)
>>> >> 2018-04-06T04:50:41.067Z|01712|util|EMER|../lib/ofp-msgs.c:1062:
>>> >> assertion version >= info->min_version && version <= info->max_version
>>> >> failed in raw_instance_get()
>>> >>
>>> >>
>>> >>
>>> >> Also, northd sees many transaction errors without wait-until:
>>> >> northd.log
>>> >> 2018-04-06T04:25:26.594Z|01076|ovsdb_idl|WARN|transaction error:
>>> >> {"details":"Transaction causes multiple rows in \"Port_Binding\"
>>> table to
>>> >> have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 104)
>>> for
>>> >> index on columns \"datapath\" and \"tunnel_key\".  First row, with
>>> UUID
>>> >> 489f777f-d7f6-4b92-80b9-c1c64c083ea4, existed in the database before
>>> >> this transaction and was not modified by the transaction.  Second
>>> row, with
>>> >> UUID b59a2500-3da1-40ab-932c-5e2dd4acc3b4, was inserted by this
>>> >> transaction.","error":"constraint violation"}
>>> >> 2018-04-06T04:25:27.202Z|01077|ovsdb_idl|WARN|transaction error:
>>> >> {"details":"Transaction causes multiple rows in \"Port_Binding\"
>>> table to
>>> >> have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 26)
>>> for
>>> >> index on columns \"datapath\" and \"tunnel_key\".  First row, with
>>> UUID
>>> >> 0668b300-55f8-4bdd-b0a8-8038258d4502, was inserted by this
>>> transaction.
>>> >> Second row, with UUID 63972d45-8845-4489-ac3e-916909a593c7, existed
>>> in
>>> >> the database before this transaction and was not modified by the
>>> >> transaction.","error":"constraint violation"}
>>> >> 2018-04-06T04:25:32.036Z|01078|ovsdb_idl|WARN|transaction error:
>>> >> {"details":"Transaction causes multiple rows in \"Port_Binding\"
>>> table to
>>> >> have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 224)
>>> for
>>> >> index on columns \"datapath\" and \"tunnel_key\".  First row, with
>>> UUID
>>> >> 528dea47-f284-40d6-bc7e-a97a084f49eb, existed in the database before
>>> >> this transaction and was not modified by the transaction.  Second
>>> row, with
>>> >> UUID 6fcc2500-fedb-4ee9-b54d-ea548dd36936, was inserted by this
>>> >> transaction.","error":"constraint violation"}
>>> >> 2018-04-06T04:25:32.482Z|01079|ovsdb_idl|WARN|transaction error:
>>> >> {"details":"Transaction causes multiple rows in \"Port_Binding\"
>>> table to
>>> >> have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 23)
>>> for
>>> >> index on columns \"datapath\" and \"tunnel_key\".  First row, with
>>> UUID
>>> >> 925eb001-adeb-4748-a55f-6c5812e70a7e, was inserted by this
>>> transaction.
>>> >> Second row, with UUID 5ed7bbc5-b7f5-4d4e-8e99-d8c4adcb2b34, existed
>>> in
>>> >> the database before this transaction and was not modified by the
>>> >> transaction.","error":"constraint violation"}
>>> >> 2018-04-06T04:25:35.204Z|01080|ovsdb_idl|WARN|transaction error:
>>> >> {"details":"Transaction causes multiple rows in \"Port_Binding\"
>>> table to
>>> >> have identical values (44d6726d-2a3a-428d-b8d4-b0d1e4c9593b and 18)
>>> for
>>> >> index on columns \"datapath\" and \"tunnel_key\".  First row, with
>>> UUID
>>> >> 4b429b00-75ca-4e38-904d-80a11e6c0f38, was inserted by this
>>> transaction.
>>> >> Second row, with UUID c2db03fe-18b6-4a2b-8db0-bf37ef93db92, existed
>>> in
>>> >> the database before this transaction and was not modified by the
>>> >> transaction.","error":"constraint violation"}
>>> >> 2018-04-06T04:25:35.933Z|01081|ovsdb_idl|INFO|tcp:192.168.22
>>> 0.101:6642:
>>> >> clustered database server is not cluster leader; trying another server
>>> >>
>>> >>
>>> >> To summarize:
>>> >>
>>> >>    - ovn NB DB has issues with wait-until and works fine without
>>> >>    wait-until
>>> >>    - ovn SB DB causes ovn-controller to crash with more HVs and just a
>>> >>    few hundred ports, with no solid root cause, without wait-until. Even
>>> >>    sbctl commands hang as the SB DB is at 100% CPU.
>>> >>    - ovn-northd sees transaction errors without wait-until
>>> >>
>>> >>
>>> >>
>>> >> Will continue to troubleshoot and would welcome any
>>> >> improvements/suggestions.
>>> >>
>>> >>
>>> >> Regards,
>>> >>
>>> >>
>>> >
>>> _______________________________________________
>>> dev mailing list
>>> dev at openvswitch.org
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>
>>
>>
>

