[ovs-dev] [ovs-discuss] ovsdb-server core dump and ovsdb corruption using raft cluster
Yifeng Sun
pkusunyifeng at gmail.com
Tue Jul 24 20:40:36 UTC 2018
Hi Yun and Girish,
I submitted a patch; would you mind testing and reviewing it? Thanks.
[PATCH] dynamic-string: Fix a bug that leads to assertion fail
diff --git a/lib/dynamic-string.c b/lib/dynamic-string.c
index 6f7b610a9908..4564e420544d 100644
--- a/lib/dynamic-string.c
+++ b/lib/dynamic-string.c
@@ -158,7 +158,7 @@ ds_put_format_valist(struct ds *ds, const char *format, va_list args_)
     if (needed < available) {
         ds->length += needed;
     } else {
-        ds_reserve(ds, ds->length + needed);
+        ds_reserve(ds, ds->allocated + needed);
         va_copy(args, args_);
         available = ds->allocated - ds->length + 1;
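
For anyone reviewing, here is a minimal standalone sketch of the pattern the
patched function uses: format once to learn how many bytes are needed, grow
the buffer if they do not fit, then format again and assert that the result
fits. It is not the OVS code. 'struct buf', buf_reserve() and
buf_put_format() are hypothetical stand-ins for 'struct ds', ds_reserve() and
ds_put_format_valist(), and the growth policy is an assumption. The reserve
call uses the expression from the patch.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for OVS's 'struct ds'; not the real definition. */
struct buf {
    char *string;       /* Null-terminated contents, or NULL if unused. */
    size_t length;      /* Bytes in use, not counting the terminator. */
    size_t allocated;   /* Bytes allocated, not counting the terminator. */
};

/* Ensures that 'b' can hold at least 'min_length' bytes plus a terminator.
 * The growth policy here is an assumption made for illustration. */
static void
buf_reserve(struct buf *b, size_t min_length)
{
    if (min_length > b->allocated || !b->string) {
        size_t grow = min_length > b->allocated ? min_length : b->allocated;
        char *p;

        b->allocated += grow;
        if (b->allocated < 8) {
            b->allocated = 8;
        }
        p = realloc(b->string, b->allocated + 1);
        if (!p) {
            abort();
        }
        b->string = p;
    }
}

/* Appends formatted output to 'b' using two vsnprintf() passes. */
static void
buf_put_format(struct buf *b, const char *format, ...)
{
    va_list args;
    size_t available;
    int needed;

    /* First pass: try the space we already have and learn the size. */
    va_start(args, format);
    available = b->string ? b->allocated - b->length + 1 : 0;
    needed = vsnprintf(b->string ? &b->string[b->length] : NULL,
                       available, format, args);
    va_end(args);
    assert(needed >= 0);

    if ((size_t) needed < available) {
        b->length += needed;
    } else {
        /* Same expression as the patch: current allocation plus 'needed'. */
        buf_reserve(b, b->allocated + needed);

        /* Second pass: the output must fit now, terminator included. */
        va_start(args, format);
        available = b->allocated - b->length + 1;
        needed = vsnprintf(&b->string[b->length], available, format, args);
        va_end(args);

        assert(needed >= 0 && (size_t) needed < available);
        b->length += needed;
    }
}
```

The assertion after the second pass is the check that must hold for the
append to be safe: the formatted text, plus its terminator, has to fit in the
space left after 'length'.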
Thanks,
Yifeng Sun
On Wed, Jul 18, 2018 at 10:48 AM, Girish Moodalbail <gmoodalbail at gmail.com>
wrote:
> Hello all,
>
> We are able to reproduce this issue on OVS 2.9.2 at will. The OVSDB NB
> server or OVSDB SB server dumps core while it is trying to compact the
> database.
>
> You can reproduce the issue by using:
>
> root@u1804-HVM-domU:/var/crash# ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl ovsdb-server/compact OVN_Southbound
>
> 2018-07-18T17:34:29Z|00001|unixctl|WARN|error communicating with unix:/var/run/openvswitch/ovnsb_db.ctl: End of file
> ovs-appctl: /var/run/openvswitch/ovnsb_db.ctl: transaction error (End of file)
> root@u1804-HVM-domU:/var/crash#
> root@u1804-HVM-domU:/var/crash#
> root@u1804-HVM-domU:/var/crash# ERROR: apport (pid 17393) Wed Jul 18 10:34:23 2018: called for pid 14683, signal 6, core limit 0, dump mode 1
> ERROR: apport (pid 17393) Wed Jul 18 10:34:23 2018: executable:
> /usr/sbin/ovsdb-server (command line "ovsdb-server -vconsole:off
> -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-sb.log
> --remote=punix:/var/run/openvswitch/ovnsb_db.sock
> --pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl
> --detach
> --monitor --remote=db:OVN_Southbound,SB_Global,connections
> --private-key=db:OVN_Southbound,SSL,private_key
> --certificate=db:OVN_Southbound,SSL,certificate
> --ca-cert=db:OVN_Southbound,SSL,ca_cert
> --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols
> --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers
> --remote=ptcp:6642:10.0.7.33 /etc/openvswitch/ovnsb_db.db")
> ERROR: apport (pid 17393) Wed Jul 18 10:34:23 2018: is_closing_session():
> no DBUS_SESSION_BUS_ADDRESS in environment
> ERROR: apport (pid 17393) Wed Jul 18 10:34:29 2018: wrote report
> /var/crash/_usr_sbin_ovsdb-server.0.crash
>
> Looking through the crash we see the following stack:
>
> (gdb) bt
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x00007f7c9a43c801 in __GI_abort () at abort.c:79
> #2  0x00007f7c9aaa633c in json_serialize (json=<optimized out>,
>     s=<optimized out>) at lib/json.c:1554
> #3  0x00007f7c9aaa63ab in json_serialize_object_member (i=<optimized out>,
>     s=<optimized out>, node=<optimized out>, node=<optimized out>)
>     at lib/json.c:1583
> #4  0x00007f7c9aaa62f2 in json_serialize_object (s=0x7ffca2173ea0,
>     object=0x5568dc5d5b10) at lib/json.c:1612
> #5  json_serialize (json=<optimized out>, s=0x7ffca2173ea0)
>     at lib/json.c:1533
> #6  0x00007f7c9aaa863c in json_to_ds (json=json@entry=0x5568dc5d4a20,
>     flags=flags@entry=0, ds=ds@entry=0x7ffca2173f30) at lib/json.c:1511
> #7  0x00007f7c9ae6750f in ovsdb_log_compose_record
>     (json=json@entry=0x5568dc5d4a20, magic=0x5568dc5d5a60 "CLUSTER",
>     header=header@entry=0x7ffca2173f10, data=data@entry=0x7ffca2173f30)
>     at ovsdb/log.c:570
> #8  0x00007f7c9ae677ef in ovsdb_log_write (file=0x5568dc5d5a80,
>     json=0x5568dc5d4a20) at ovsdb/log.c:618
> #9  0x00007f7c9ae6796e in ovsdb_log_write_and_free
>     (log=log@entry=0x5568dc5d5a80, json=0x5568dc5d4a20) at ovsdb/log.c:651
> #10 0x00007f7c9ae6d684 in raft_write_snapshot (raft=raft@entry=0x5568dc1e3720,
>     log=0x5568dc5d5a80, new_log_start=new_log_start@entry=539578,
>     new_snapshot=new_snapshot@entry=0x7ffca21740e0) at ovsdb/raft.c:3588
> #11 0x00007f7c9ae6dbf3 in raft_save_snapshot (raft=raft@entry=0x5568dc1e3720,
>     new_start=new_start@entry=539578,
>     new_snapshot=new_snapshot@entry=0x7ffca21740e0) at ovsdb/raft.c:3647
> #12 0x00007f7c9ae757bd in raft_store_snapshot (raft=0x5568dc1e3720,
>     new_snapshot_data=new_snapshot_data@entry=0x5568dc5d49a0)
>     at ovsdb/raft.c:3849
> #13 0x00007f7c9ae7c7ae in ovsdb_storage_store_snapshot__
>     (storage=0x5568dc6b2fb0, schema=0x5568dd66f5a0, data=0x5568dca67880)
>     at ovsdb/storage.c:541
> #14 0x00007f7c9ae7d1de in ovsdb_storage_store_snapshot
>     (storage=0x5568dc6b2fb0, schema=schema@entry=0x5568dd66f5a0,
>     data=data@entry=0x5568dca67880) at ovsdb/storage.c:568
> #15 0x00007f7c9ae69cab in ovsdb_snapshot (db=0x5568dc6b3020)
>     at ovsdb/ovsdb.c:519
> #16 0x00005568daec1f82 in main_loop (is_backup=0x7ffca21742be,
>     exiting=0x7ffca21742bf, run_process=0x0, remotes=0x7ffca2174310,
>     unixctl=0x5568dc71ade0, all_dbs=0x7ffca2174350, jsonrpc=0x5568dc1e36a0,
>     config=0x7ffca2174370) at ovsdb/ovsdb-server.c:239
> #17 main (argc=<optimized out>, argv=<optimized out>)
>     at ovsdb/ovsdb-server.c:457
>
> Walking through the JSON objects being serialized we see that
> "prev_servers" is malformed.
>
> (gdb) print *((struct shash *)0x5568dc5d5b10)
> $3 = {
> map = {
> buckets = 0x5568dc5d1d30,
> one = 0x0,
> mask = 7,
> n = 9
> }
> }
>
> (gdb) x/6a 0x5568dc5d1d30
> 0x5568dc5d1d30: 0x5568dc5d6000 0x0
> 0x5568dc5d1d40: 0x0 0x5568dc5d5f30
> 0x5568dc5d1d50: 0x5568dc5d5e30 0x5568dc5d5bc0
>
> Let us look at the next one
>
> (gdb) print *((struct shash_node *)0x5568dc5d5e30)
> $7 = {
> node = {
> hash = 2043875868,
> next = 0x0
> },
> name = 0x5568dc5d5e10 "prev_servers",
> data = 0x5568dc688cd0
> }
>
> (gdb) print *((struct json *)0x5568dc688cd0)
> $10 = {
> type = 3697839232,
> count = 34,
> u = {
> object = 0x5568dc688cb0,
> array = {
> n = 93908862799024,
> n_allocated = 93908862798944,
> elems = 0x5568dc22f050
> },
> integer = 93908862799024,
> real = 4.6397142949016804e-310,
> string = 0x5568dc688cb0 "\a"
> }
> }
>
> So this value is malformed: somehow "prev_servers" is getting corrupted.
>
> That information comes from the 'snap.servers' member of 'struct raft'.
>
> Has anyone seen this before?
>
>
> On Fri, Jul 13, 2018 at 3:49 PM, Yun Zhou <yunz at nvidia.com> wrote:
>
> > Hi,
> >
> > We are running into some issues while trying out a 3-node Raft OVSDB
> > cluster in our lab, and we hope to get some help from the community.
> >
> > We are using ovs 2.9.2.
> > -------------------------
> >
> > We found that on one of the 3 nodes, the SB ovsdb-server was not running
> > and could not be restarted because its database was already corrupted:
> >
> > "ovsdb-server: syntax "{"encaps":["uuid","7f0f7605-c1d1-43fb-826a-1718ea70e088"],"hostname":"nd-sdn-dgx-010"}": syntax error: hostname is not a UUID"
> >
> > According to the ovsdb-server-sb log history, the SB ovsdb-server dumped
> > core several days ago:
> >
> > "2018-07-08T06:58:15.267Z|00002|daemon_unix(monitor)|ERR|1 crashes: pid 937 died, killed (Aborted), core dumped, restarting"
> >
> > Unfortunately, no core dump was generated.
> >
> > FWIW, we saw core dumps for the NB ovsdb on all 3 cluster nodes; here is
> > one of the stacks:
> >
> > (gdb) bt
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> > #1  0x00007fc48f8c2801 in __GI_abort () at abort.c:79
> > #2  0x00007fc48ff2c33c in ?? ()
> >     from /usr/lib/x86_64-linux-gnu/libopenvswitch-2.9.so.0
> > #3  0x00007fc48ff2c2f2 in ?? ()
> >     from /usr/lib/x86_64-linux-gnu/libopenvswitch-2.9.so.0
> > #4  0x00007fc48ff2e63c in json_to_ds ()
> >     from /usr/lib/x86_64-linux-gnu/libopenvswitch-2.9.so.0
> > #5  0x00007fc4902ed50f in ovsdb_log_compose_record ()
> >     from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> > #6  0x00007fc4902ed7ef in ovsdb_log_write ()
> >     from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> > #7  0x00007fc4902ed96e in ovsdb_log_write_and_free ()
> >     from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> > #8  0x00007fc4902f3684 in ?? ()
> >     from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> > #9  0x00007fc4902f3bf3 in ?? ()
> >     from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> > #10 0x00007fc4902fb7bd in raft_store_snapshot ()
> >     from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> > #11 0x00007fc4903027ae in ?? ()
> >     from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> > #12 0x00007fc4903031de in ovsdb_storage_store_snapshot ()
> >     from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> > #13 0x00007fc4902efcab in ovsdb_snapshot ()
> >     from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> > #14 0x0000561e47a8cf82 in ?? ()
> > #15 0x00007fc48f8a3b97 in __libc_start_main (main=0x561e47a8bef0, argc=17,
> >     argv=0x7ffe000ce2c8, init=<optimized out>, fini=<optimized out>,
> >     rtld_fini=<optimized out>, stack_end=0x7ffe000ce2b8)
> >     at ../csu/libc-start.c:310
> > #16 0x0000561e47a8db9a in ?? ()
> >
> > Please let us know if any more information is needed. Thanks very much!
> >
> > - Yun