[ovs-discuss] [ovs-dev] ovsdb-server core dump and ovsdb corruption using raft cluster

aginwala aginwala at asu.edu
Tue Jul 24 23:44:40 UTC 2018


Hi,

Glad to see more people picking up on raft testing.

Just to add on, you can also refer to
https://mail.openvswitch.org/pipermail/ovs-dev/2018-May/347765.html and
https://mail.openvswitch.org/pipermail/ovs-dev/2018-April/346375.html, where
Ben also gives a couple of suggestions. See if you can skip the snapshot
code and still see the error. Note, however, that the request to skip
snapshots was made to see whether performance would improve for testing
purposes. I remember tuning my VM memory, vCPUs, etc., and I never ran into
the core dump issue again.



Regards,


On Tue, Jul 24, 2018 at 4:41 PM Yifeng Sun <pkusunyifeng at gmail.com> wrote:

> My apologies, the patch has some issues. I need to dig further.
>
> Yifeng
>
> On Tue, Jul 24, 2018 at 1:40 PM, Yifeng Sun <pkusunyifeng at gmail.com>
> wrote:
>
> > Hi Yun and Girish,
> >
> > I submitted a patch. Do you mind testing and reviewing it? Thanks.
> >
> > [PATCH] dynamic-string: Fix a bug that leads to assertion fail
> >
> > diff --git a/lib/dynamic-string.c b/lib/dynamic-string.c
> > index 6f7b610a9908..4564e420544d 100644
> > --- a/lib/dynamic-string.c
> > +++ b/lib/dynamic-string.c
> > @@ -158,7 +158,7 @@ ds_put_format_valist(struct ds *ds, const char *format, va_list args_)
> >      if (needed < available) {
> >          ds->length += needed;
> >      } else {
> > -        ds_reserve(ds, ds->length + needed);
> > +        ds_reserve(ds, ds->allocated + needed);
> >
> >          va_copy(args, args_);
> >          available = ds->allocated - ds->length + 1;
> >
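
For anyone following the patch discussion, here is a minimal,
self-contained sketch of the two-pass "format, grow, format again" pattern
that the hunk above touches. It is not the OVS code: struct mini_ds,
mini_ds_reserve() and mini_ds_put_format() are made-up names, and the
growth policy is an assumption chosen for illustration. In this simplified
model, reserving 'length + needed' bytes is already enough for the second
vsnprintf() call to fit, so the final assertion holds.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of a growable string; not the OVS 'struct ds'. */
struct mini_ds {
    char *string;      /* NUL-terminated buffer, or NULL if never allocated. */
    size_t length;     /* Bytes in use, excluding the terminator. */
    size_t allocated;  /* Usable bytes, excluding the terminator. */
};

/* Ensure the buffer can hold at least 'min_length' bytes plus a terminator.
 * The growth policy (add max(min_length, allocated)) is an assumption. */
static void
mini_ds_reserve(struct mini_ds *ds, size_t min_length)
{
    if (min_length > ds->allocated || !ds->string) {
        size_t grow = min_length > ds->allocated ? min_length : ds->allocated;
        char *p;

        ds->allocated += grow;
        p = realloc(ds->string, ds->allocated + 1);
        if (!p) {
            abort();
        }
        ds->string = p;
        ds->string[ds->length] = '\0';
    }
}

/* Append printf-style output using two passes over the arguments. */
static void
mini_ds_put_format(struct mini_ds *ds, const char *format, ...)
{
    va_list args;
    size_t available;
    int needed;

    /* Pass 1: try to format into whatever room is left. */
    available = ds->string ? ds->allocated - ds->length + 1 : 0;
    va_start(args, format);
    needed = vsnprintf(ds->string ? &ds->string[ds->length] : NULL,
                       available, format, args);
    va_end(args);
    if (needed < 0) {
        abort();
    }
    if ((size_t) needed < available) {
        ds->length += needed;
        return;
    }

    /* Pass 2: grow so that 'length + needed' bytes fit, then format again. */
    mini_ds_reserve(ds, ds->length + needed);
    available = ds->allocated - ds->length + 1;
    va_start(args, format);
    needed = vsnprintf(&ds->string[ds->length], available, format, args);
    va_end(args);

    /* After the reserve, allocated >= length + needed, so this must hold. */
    assert(needed >= 0 && (size_t) needed < available);
    ds->length += needed;
}
```

Starting from a zeroed struct mini_ds, repeated calls to
mini_ds_put_format() append to the same buffer; the caller frees
ds.string when done.
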
> >
> > Thanks,
> > Yifeng Sun
> >
> > On Wed, Jul 18, 2018 at 10:48 AM, Girish Moodalbail
> > <gmoodalbail at gmail.com> wrote:
> >
> >> Hello all,
> >>
> >> We are able to reproduce this issue on OVS 2.9.2 at will. The OVSDB NB
> >> server or OVSDB SB server dumps core while it is trying to compact the
> >> database.
> >>
> >> You can reproduce the issue by using:
> >>
> >> root@u1804-HVM-domU:/var/crash# ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl ovsdb-server/compact OVN_Southbound
> >>
> >> 2018-07-18T17:34:29Z|00001|unixctl|WARN|error communicating with unix:/var/run/openvswitch/ovnsb_db.ctl: End of file
> >> ovs-appctl: /var/run/openvswitch/ovnsb_db.ctl: transaction error (End of file)
> >> root@u1804-HVM-domU:/var/crash#
> >> root@u1804-HVM-domU:/var/crash#
> >> root@u1804-HVM-domU:/var/crash# ERROR: apport (pid 17393) Wed Jul 18 10:34:23 2018: called for pid 14683, signal 6, core limit 0, dump mode 1
> >> ERROR: apport (pid 17393) Wed Jul 18 10:34:23 2018: executable:
> >> /usr/sbin/ovsdb-server (command line "ovsdb-server -vconsole:off
> >> -vfile:info --log-file=/var/log/openvswitch/ovsdb-server-sb.log
> >> --remote=punix:/var/run/openvswitch/ovnsb_db.sock
> >> --pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl
> >> --detach --monitor --remote=db:OVN_Southbound,SB_Global,connections
> >> --private-key=db:OVN_Southbound,SSL,private_key
> >> --certificate=db:OVN_Southbound,SSL,certificate
> >> --ca-cert=db:OVN_Southbound,SSL,ca_cert
> >> --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols
> >> --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers
> >> --remote=ptcp:6642:10.0.7.33 /etc/openvswitch/ovnsb_db.db")
> >> ERROR: apport (pid 17393) Wed Jul 18 10:34:23 2018: is_closing_session(): no DBUS_SESSION_BUS_ADDRESS in environment
> >> ERROR: apport (pid 17393) Wed Jul 18 10:34:29 2018: wrote report /var/crash/_usr_sbin_ovsdb-server.0.crash
> >>
> >> Looking through the crash we see the following stack:
> >>
> >> (gdb) bt
> >> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> >> #1  0x00007f7c9a43c801 in __GI_abort () at abort.c:79
> >> #2  0x00007f7c9aaa633c in json_serialize (json=<optimized out>, s=<optimized out>) at lib/json.c:1554
> >> #3  0x00007f7c9aaa63ab in json_serialize_object_member (i=<optimized out>, s=<optimized out>, node=<optimized out>, node=<optimized out>)
> >>     at lib/json.c:1583
> >> #4  0x00007f7c9aaa62f2 in json_serialize_object (s=0x7ffca2173ea0, object=0x5568dc5d5b10) at lib/json.c:1612
> >> #5  json_serialize (json=<optimized out>, s=0x7ffca2173ea0) at lib/json.c:1533
> >> #6  0x00007f7c9aaa863c in json_to_ds (json=json@entry=0x5568dc5d4a20, flags=flags@entry=0, ds=ds@entry=0x7ffca2173f30) at lib/json.c:1511
> >> #7  0x00007f7c9ae6750f in ovsdb_log_compose_record (json=json@entry=0x5568dc5d4a20, magic=0x5568dc5d5a60 "CLUSTER",
> >>     header=header@entry=0x7ffca2173f10, data=data@entry=0x7ffca2173f30) at ovsdb/log.c:570
> >> #8  0x00007f7c9ae677ef in ovsdb_log_write (file=0x5568dc5d5a80, json=0x5568dc5d4a20) at ovsdb/log.c:618
> >> #9  0x00007f7c9ae6796e in ovsdb_log_write_and_free (log=log@entry=0x5568dc5d5a80, json=0x5568dc5d4a20) at ovsdb/log.c:651
> >> #10 0x00007f7c9ae6d684 in raft_write_snapshot (raft=raft@entry=0x5568dc1e3720, log=0x5568dc5d5a80, new_log_start=new_log_start@entry=539578,
> >>     new_snapshot=new_snapshot@entry=0x7ffca21740e0) at ovsdb/raft.c:3588
> >> #11 0x00007f7c9ae6dbf3 in raft_save_snapshot (raft=raft@entry=0x5568dc1e3720, new_start=new_start@entry=539578,
> >>     new_snapshot=new_snapshot@entry=0x7ffca21740e0) at ovsdb/raft.c:3647
> >> #12 0x00007f7c9ae757bd in raft_store_snapshot (raft=0x5568dc1e3720, new_snapshot_data=new_snapshot_data@entry=0x5568dc5d49a0)
> >>     at ovsdb/raft.c:3849
> >> #13 0x00007f7c9ae7c7ae in ovsdb_storage_store_snapshot__ (storage=0x5568dc6b2fb0, schema=0x5568dd66f5a0, data=0x5568dca67880)
> >>     at ovsdb/storage.c:541
> >> #14 0x00007f7c9ae7d1de in ovsdb_storage_store_snapshot (storage=0x5568dc6b2fb0, schema=schema@entry=0x5568dd66f5a0,
> >>     data=data@entry=0x5568dca67880) at ovsdb/storage.c:568
> >> #15 0x00007f7c9ae69cab in ovsdb_snapshot (db=0x5568dc6b3020) at ovsdb/ovsdb.c:519
> >> #16 0x00005568daec1f82 in main_loop (is_backup=0x7ffca21742be, exiting=0x7ffca21742bf, run_process=0x0, remotes=0x7ffca2174310,
> >>     unixctl=0x5568dc71ade0, all_dbs=0x7ffca2174350, jsonrpc=0x5568dc1e36a0, config=0x7ffca2174370) at ovsdb/ovsdb-server.c:239
> >> #17 main (argc=<optimized out>, argv=<optimized out>) at ovsdb/ovsdb-server.c:457
> >>
> >> Walking through the JSON objects being serialized we see that
> >> "prev_servers" is malformed.
> >>
> >> (gdb) print *((struct shash *)0x5568dc5d5b10)
> >> $3 = {
> >>   map = {
> >>     buckets = 0x5568dc5d1d30,
> >>     one = 0x0,
> >>     mask = 7,
> >>     n = 9
> >>   }
> >> }
> >>
> >> (gdb) x/6a 0x5568dc5d1d30
> >> 0x5568dc5d1d30:    0x5568dc5d6000    0x0
> >> 0x5568dc5d1d40:    0x0    0x5568dc5d5f30
> >> 0x5568dc5d1d50:    0x5568dc5d5e30    0x5568dc5d5bc0
> >>
> >> Let us look at the next one
> >>
> >> (gdb) print *((struct shash_node *)0x5568dc5d5e30)
> >> $7 = {
> >>   node = {
> >>     hash = 2043875868,
> >>     next = 0x0
> >>   },
> >>   name = 0x5568dc5d5e10 "prev_servers",
> >>   data = 0x5568dc688cd0
> >> }
> >>
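
As a cross-check of the dump above: assuming the map picks a bucket with
'hash & mask' (the usual scheme when the bucket count is a power of two;
bucket_index() is a made-up helper, not an OVS function), the
"prev_servers" node with hash 2043875868 and mask 7 should land in bucket
4. That is the fifth pointer printed by x/6a, 0x5568dc5d5e30, which matches
the node address examined here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: select a bucket in a power-of-two-sized hash table
 * by masking the hash, as the 'mask' field in the dump suggests. */
static size_t
bucket_index(uint32_t hash, size_t mask)
{
    return hash & mask;
}

/* Address of bucket slot 'index', given the start of an array of 8-byte
 * pointers (an assumption that holds on the x86_64 host in the dump). */
static uint64_t
bucket_slot_address(uint64_t buckets, size_t index)
{
    return buckets + (uint64_t) index * 8;
}
```

With the values from the dump, bucket_slot_address(0x5568dc5d1d30, 4) is
0x5568dc5d1d50, the first word on the third line of the x/6a output.
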
> >> (gdb) print *((struct json *)0x5568dc688cd0)
> >> $10 = {
> >>   type = 3697839232,
> >>   count = 34,
> >>   u = {
> >>     object = 0x5568dc688cb0,
> >>     array = {
> >>       n = 93908862799024,
> >>       n_allocated = 93908862798944,
> >>       elems = 0x5568dc22f050
> >>     },
> >>     integer = 93908862799024,
> >>     real = 4.6397142949016804e-310,
> >>     string = 0x5568dc688cb0 "\a"
> >>   }
> >> }
> >>
> >> So the "prev_servers" JSON value is malformed: its type field does not
> >> hold a valid JSON type.
> >>
> >> That value comes from the snap.servers member of 'struct raft'.
> >>
> >> Has anyone seen this before?
> >>
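
One way to read the corrupted 'type' value, offered only as an observation
and not as a diagnosis: 3697839232 is 0xdc688c80, which has the same low 32
bits as an address such as 0x5568dc688c80, in the same heap region as the
object itself (0x5568dc688cd0). The sketch below assumes a type tag with
eight valid values numbered from zero; the enum and helper names are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: a JSON type tag with eight valid values. */
enum sketch_json_type {
    SKETCH_JSON_NULL,
    SKETCH_JSON_FALSE,
    SKETCH_JSON_TRUE,
    SKETCH_JSON_OBJECT,
    SKETCH_JSON_ARRAY,
    SKETCH_JSON_INTEGER,
    SKETCH_JSON_REAL,
    SKETCH_JSON_STRING,
    SKETCH_JSON_N_TYPES
};

/* True if the raw tag is one of the values the enum defines. */
static bool
type_tag_is_plausible(uint32_t raw_type)
{
    return raw_type < SKETCH_JSON_N_TYPES;
}

/* Low 32 bits of a 64-bit value such as a heap address. */
static uint32_t
low32(uint64_t value)
{
    return (uint32_t) (value & UINT32_C(0xffffffff));
}
```

A check like type_tag_is_plausible() rejects the value from the dump, and
low32(0x5568dc688c80) equals it, which would be consistent with the memory
having been overwritten or reused, although the dump alone does not prove
that.
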
> >>
> >> On Fri, Jul 13, 2018 at 3:49 PM, Yun Zhou <yunz at nvidia.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > We are running into some issues while trying out a 3-node raft
> >> > ovsdb cluster in our lab, and we hope to get some help from the
> >> > community.
> >> >
> >> > We are using ovs 2.9.2.
> >> > -------------------------
> >> >
> >> > We found that on one of the 3 nodes, the SB ovsdb-server was not
> >> > running and could not be restarted because its database was already
> >> > corrupted:
> >> >
> >> >    "ovsdb-server: syntax "{"encaps":["uuid","7f0f7605-c1d1-43fb-826a-1718ea70e088"],"hostname":"nd-sdn-dgx-010"}": syntax error: hostname is not a UUID"
> >> >
> >> > According to the ovsdb-server-sb log history, the SB ovsdb-server
> >> > dumped core several days ago:
> >> >
> >> >        "2018-07-08T06:58:15.267Z|00002|daemon_unix(monitor)|ERR|1 crashes: pid 937 died, killed (Aborted), core dumped, restarting"
> >> >
> >> > Unfortunately, no core dump was generated.
> >> >
> >> > FWIW, we saw core dumps for the NB ovsdb on all 3 cluster nodes; here
> >> > is one of the stacks:
> >> >
> >> > (gdb) bt
> >> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> >> > #1  0x00007fc48f8c2801 in __GI_abort () at abort.c:79
> >> > #2  0x00007fc48ff2c33c in ?? () from /usr/lib/x86_64-linux-gnu/libopenvswitch-2.9.so.0
> >> > #3  0x00007fc48ff2c2f2 in ?? () from /usr/lib/x86_64-linux-gnu/libopenvswitch-2.9.so.0
> >> > #4  0x00007fc48ff2e63c in json_to_ds ()
> >> >    from /usr/lib/x86_64-linux-gnu/libopenvswitch-2.9.so.0
> >> > #5  0x00007fc4902ed50f in ovsdb_log_compose_record ()
> >> >    from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> >> > #6  0x00007fc4902ed7ef in ovsdb_log_write ()
> >> >    from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> >> > #7  0x00007fc4902ed96e in ovsdb_log_write_and_free ()
> >> >    from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> >> > #8  0x00007fc4902f3684 in ?? () from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> >> > #9  0x00007fc4902f3bf3 in ?? () from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> >> > #10 0x00007fc4902fb7bd in raft_store_snapshot ()
> >> >    from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> >> > #11 0x00007fc4903027ae in ?? () from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> >> > #12 0x00007fc4903031de in ovsdb_storage_store_snapshot ()
> >> >    from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> >> > #13 0x00007fc4902efcab in ovsdb_snapshot ()
> >> >    from /usr/lib/x86_64-linux-gnu/libovsdb-2.9.so.0
> >> > #14 0x0000561e47a8cf82 in ?? ()
> >> > #15 0x00007fc48f8a3b97 in __libc_start_main (main=0x561e47a8bef0, argc=17,
> >> >     argv=0x7ffe000ce2c8, init=<optimized out>, fini=<optimized out>,
> >> >     rtld_fini=<optimized out>, stack_end=0x7ffe000ce2b8) at ../csu/libc-start.c:310
> >> > #16 0x0000561e47a8db9a in ?? ()
> >> >
> >> > Please let us know if any more information is needed. Thanks very much!
> >> >
> >> > - Yun
> >> >
> >> > _______________________________________________
> >> > discuss mailing list
> >> > discuss at openvswitch.org
> >> > https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
> >> >
> >> _______________________________________________
> >> dev mailing list
> >> dev at openvswitch.org
> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >>
> >
> >