[ovs-dev] intermittent crash during ovsdb updates with ovs-2.4

Sabyasachi Sengupta Sabyasachi.Sengupta at alcatel-lucent.com
Tue Sep 22 00:20:45 UTC 2015


Hi,

We are moving to the recently released ovs-2.4 and are seeing intermittent 
crashes while ovs-vswitchd pushes its periodic updates into ovsdb. Some 
preliminary analysis is below. The crash typically happens after we have 
successfully brought up OVS, downloaded some configs, and started initiating 
traffic, but with no particular pattern. It appears the ovs-vswitchd main 
thread crashes while periodically writing stats and Controller table updates 
into ovsdb. The debugging below is from a case where it is writing a 
Controller table update.

(gdb) bt
#0  0x00007f9532a052b6 in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x00000000004b7b42 in atom_arrays_compare_3way (a=0xa9b898,
     b=0x7fff8a0359e0, type=0x896df0) at lib/ovsdb-data.c:1582
#2  ovsdb_datum_compare_3way (a=0xa9b898, b=0x7fff8a0359e0, type=0x896df0)
     at lib/ovsdb-data.c:1616
#3  0x00000000004b7b69 in ovsdb_datum_equals (a=<value optimized out>,
     b=<value optimized out>, type=<value optimized out>)
     at lib/ovsdb-data.c:1596
#4  0x00000000004bb36e in ovsdb_idl_txn_write__ (row_=0xa9b5b0,
     column=0x896de8, datum=0x7fff8a0359e0, owns_datum=true)
     at lib/ovsdb-idl.c:2087
#5  0x00000000004f7d24 in ovsrec_controller_set_status (row=0xa9b5b0,
     status=0xc8a308) at lib/vswitch-idl.c:5254
#6  0x0000000000411d5b in refresh_controller_status ()
     at vswitchd/bridge.c:2741
#7  run_stats_update () at vswitchd/bridge.c:2801
#8  bridge_run () at vswitchd/bridge.c:3073
#9  0x00000000004121ad in main (argc=10, argv=0x7fff8a035c38)
     at vswitchd/ovs-vswitchd.c:131
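
For reference, the path in frames 4-6 reduces to roughly the sketch below. 
This is a paraphrase, not the actual bridge.c or IDL code; the function name 
and the example key/value pairs are illustrative. The periodic stats pass in 
bridge_run() builds a status smap per controller and hands it to the 
generated setter, which converts the smap into a datum and calls 
ovsdb_idl_txn_write().

#include "smap.h"
#include "vswitch-idl.h"

/* Illustrative sketch of the write path in frames 4-6; not the real
 * refresh_controller_status(), just its shape. */
static void
update_controller_status(const struct ovsrec_controller *row)
{
    struct smap status = SMAP_INITIALIZER(&status);

    /* Example key/value pairs, matching what shows up in the dump below. */
    smap_add(&status, "state", "ACTIVE");
    smap_add(&status, "sec_since_connect", "288");
    smap_add(&status, "last_error", "Connection timed out");

    /* The generated setter turns the smap into a sorted string->string
     * datum and passes it to ovsdb_idl_txn_write(), which is where the
     * comparison in frames 0-3 runs. */
    ovsrec_controller_set_status(row, &status);

    smap_destroy(&status);
}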

Looking a bit deeper, we find that one of the array elements being read from 
the in-core IDL is corrupt.

(gdb) frame 1
#1  0x00000000004b7b42 in atom_arrays_compare_3way (a=0xa9b898,
     b=0x7fff8a0359e0, type=0x896df0) at lib/ovsdb-data.c:1582
1582            int cmp = ovsdb_atom_compare_3way(&a[i], &b[i], type);
(gdb) p a
$1 = (const union ovsdb_atom *) 0xc38f10
(gdb) p a[0]
$2 = {integer = 8, real = 3.9525251667299724e-323, boolean = 8,
   string = 0x8 <Address 0x8 out of bounds>, uuid = {parts = {8, 0, 13182880,
       0}}}
(gdb) p a[1]
$3 = {integer = 11240608, real = 5.5535982511682826e-317, boolean = 160,
   string = 0xab84a0 "288", uuid = {parts = {11240608, 0, 13183312, 0}}}
(gdb) p a[2]
$4 = {integer = 10932464, real = 5.4013548867961776e-317, boolean = 240,
   string = 0xa6d0f0 "296", uuid = {parts = {10932464, 0, 13183680, 0}}}
(gdb) p a[3]
$5 = {integer = 11271008, real = 5.5686178468018565e-317, boolean = 96,
   string = 0xabfb60 "ACTIVE", uuid = {parts = {11271008, 0, 13184096, 0}}}
(gdb) p a[4]
$6 = {integer = 13126128, real = 6.4851689077148698e-317, boolean = 240,
   string = 0xc849f0 "\300", uuid = {parts = {13126128, 0, 33, 0}}}

The above indicates that we are tripping over a bad pointer, viz. 
a[0].string, which is curiously the only bad value in the array.

(gdb) p type
$7 = OVSDB_TYPE_STRING
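
Since the column type is OVSDB_TYPE_STRING, the per-atom comparison in 
frame 1 ends up as a strcmp() on the two string pointers, so an atom whose 
string member is 0x8 faults as soon as __strcmp_sse42 dereferences it. 
Condensed from lib/ovsdb-data.c in 2.4 (non-string cases elided):

#include <string.h>
#include "ovsdb-data.h"

static int
atom_arrays_compare_3way(const union ovsdb_atom *a, const union ovsdb_atom *b,
                         enum ovsdb_atomic_type type, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int cmp = ovsdb_atom_compare_3way(&a[i], &b[i], type); /* line 1582 */
        if (cmp) {
            return cmp;
        }
    }
    return 0;
}

int
ovsdb_atom_compare_3way(const union ovsdb_atom *a, const union ovsdb_atom *b,
                        enum ovsdb_atomic_type type)
{
    switch (type) {
    /* ... integer, real, boolean and uuid cases elided ... */
    case OVSDB_TYPE_STRING:
        return strcmp(a->string, b->string); /* a->string == 0x8 faults here */
    default:
        OVS_NOT_REACHED();
    }
}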

However, all the other values appear to be good, including the ones that are 
about to be written.

(gdb) p b[0]
$9 = {integer = 12475120, real = 6.1635282197470516e-317, boolean = 240,
   string = 0xbe5af0 "Connection timed out", uuid = {parts = {12475120, 0,
       44, 1}}}
(gdb) p b[1]
$10 = {integer = 12475056, real = 6.1634965995457177e-317, boolean = 176,
   string = 0xbe5ab0 "293", uuid = {parts = {12475056, 0, 34, 1}}}
(gdb) p b[2]
$11 = {integer = 11010656, real = 5.4399868677757963e-317, boolean = 96,
   string = 0xa80260 "301", uuid = {parts = {11010656, 0, 851889880, 32661}}}
(gdb) p b[3]
$12 = {integer = 11010720, real = 5.4400184879771301e-317, boolean = 160,
   string = 0xa802a0 "ACTIVE", uuid = {parts = {11010720, 0, 27, 1}}}

Mapping the row's UUID (from frame 5 below) back to the database, the table 
itself appears to be sane.

(gdb) frame 5
#5  0x00000000004f7d24 in ovsrec_controller_set_status (row=0xa9b5b0,
     status=0xc8a308) at lib/vswitch-idl.c:5254
5254        ovsdb_idl_txn_write(&row->header_,

(gdb) p/x row->header_
$20 = {hmap_node = {hash = 0xea5d6304, next = 0x0}, uuid = {parts = {
       0xea5d6304, 0x328a492f, 0xbabbae6c, 0xa26600f1}}, src_arcs = {
     prev = 0xa9b5d0, next = 0xa9b5d0}, dst_arcs = {prev = 0xa9fd40,
     next = 0xa9fd40}, table = 0xa584f0, old = 0xa9b730, new = 0xa9b730,
   prereqs = 0x0, written = 0x0, txn_node = {hash = 0xea5d6304, next = 0x1}}
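
Note that the 'a' side of the comparison in frame 1 is not the smap we are 
writing; it is the datum the IDL already caches for this row's status column 
(reached through row->old / row->new), which ovsdb_idl_txn_write__() reads 
in order to skip no-op writes. Roughly, paraphrased from lib/ovsdb-idl.c in 
2.4 with assertions and bookkeeping elided:

#include "ovsdb-data.h"
#include "ovsdb-idl.h"

static void
ovsdb_idl_txn_write__(const struct ovsdb_idl_row *row_,
                      const struct ovsdb_idl_column *column,
                      struct ovsdb_datum *datum, bool owns_datum)
{
    struct ovsdb_idl_row *row = CONST_CAST(struct ovsdb_idl_row *, row_);

    /* 'old' is the IDL's in-memory replica of the column (the 'a' array in
     * frame 1); 'datum' is the freshly built value (the 'b' array), which
     * looks fine. */
    const struct ovsdb_datum *old = ovsdb_idl_read(row, column);

    /* Frame 4, lib/ovsdb-idl.c:2087: skip the write if nothing changed. */
    if (ovsdb_datum_equals(old, datum, &column->type)) {
        /* No change: discard 'datum' and return without touching the txn. */
        return;
    }
    /* ... otherwise record the write in the current transaction ... */
}

So the corruption appears to be in vswitchd's in-core copy of the row, while 
the copy held by ovsdb-server looks fine: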

Fishing for UUID ea5d6304 in ovsdb-client dump:

[root@ovs-1 ~]# ovsdb-client dump Controller
Controller table

_uuid                  : 9b64de9b-d55d-4065-b534-4708c18b780a
config_role            : master
connection_mode        : []
controller_burst_limit : []
controller_rate_limit  : []
enable_async_messages  : []
external_ids           : {}
inactivity_probe       : 5000
is_connected           : true
local_gateway          : []
local_ip               : []
local_netmask          : []
max_backoff            : []
name                   : "ctrl1"
other_config           : {}
role                   : master
status                 : {last_error="No route to host", sec_since_connect="297", sec_since_disconnect="312", state=ACTIVE}
target                 : "tcp:10.10.13.7:6633"

_uuid                  : ea5d6304-328a-492f-babb-ae6ca26600f1
config_role            : slave
connection_mode        : []
controller_burst_limit : []
controller_rate_limit  : []
enable_async_messages  : []
external_ids           : {}
inactivity_probe       : 5000
is_connected           : true
local_gateway          : []
local_ip               : []
local_netmask          : []
max_backoff            : []
name                   : "ctrl2"
other_config           : {}
role                   : slave
status                 : {last_error="Connection timed out", sec_since_connect="288", sec_since_disconnect="296", state=ACTIVE}
target                 : "tcp:10.10.15.9:6633"


Could anyone please take a look and let me know whether this has been seen 
before, or whether there is a patch or fix that addresses it?

Thanks,
Sabya
