[ovs-discuss] ovsdb relay server segment fault

Ilya Maximets i.maximets at ovn.org
Thu Aug 5 12:24:01 UTC 2021


On 8/5/21 1:47 PM, 贾文涛 wrote:
> 
> Hi,Ilya
> 
> Thanks for your reply
> *coredump:*
> 
> Program terminated with signal 11, Segmentation fault.
> #0  0x00007f56bff49783 in hmap_remove (node=0x55a436a4aaa0, hmap=0x55a3d28bb2d8) at include/openvswitch/hmap.h:293
> 293         while (*bucket != node) {
> (gdb) bt
> #0  0x00007f56bff49783 in hmap_remove (node=0x55a436a4aaa0, hmap=0x55a3d28bb2d8) at include/openvswitch/hmap.h:293
> #1  ovsdb_txn_forward_unlist (db=0x55a3d28bb230, txn_fwd=txn_fwd at entry=0x55a436a4aa90) at ovsdb/transaction-forward.c:67
> #2  0x00007f56bff4982e in ovsdb_txn_forward_destroy (db=<optimized out>, txn_fwd=0x55a436a4aa90) at ovsdb/transaction-forward.c:79
> #3  0x00007f56bff468ea in ovsdb_trigger_destroy (trigger=0x55a5afbbcc40) at ovsdb/trigger.c:70
> #4  0x00007f56bff291ac in ovsdb_jsonrpc_trigger_complete (t=0x55a5afbbcc40) at ovsdb/jsonrpc-server.c:1192
> #5  0x00007f56bff29325 in ovsdb_jsonrpc_trigger_remove__ (s=s at entry=0x55a3d2b585d0, db=db at entry=0x0) at ovsdb/jsonrpc-server.c:1204
> #6  0x00007f56bff2aa5c in ovsdb_jsonrpc_trigger_complete_all (s=0x55a3d2b585d0) at ovsdb/jsonrpc-server.c:1223
> #7  ovsdb_jsonrpc_session_run (s=0x55a3d2b585d0) at ovsdb/jsonrpc-server.c:546
> #8  ovsdb_jsonrpc_session_run_all (remote=0x55a531045090) at ovsdb/jsonrpc-server.c:591
> #9  ovsdb_jsonrpc_server_run (svr=svr at entry=0x55a3d28bb170) at ovsdb/jsonrpc-server.c:406
> #10 0x000055a3d19f4442 in main_loop (is_backup=0x7ffebc27bb3a, exiting=0x7ffebc27bb3b, run_process=0x0, remotes=0x7ffebc27bb90, unixctl=0x55a3d28d9070, all_dbs=0x7ffebc27bbd0, jsonrpc=<optimized out>, config=0x7ffebc27bc30)
>     at ovsdb/ovsdb-server.c:219
> #11 main (argc=3, argv=0x7ffebc27be28) at ovsdb/ovsdb-server.c:490
> (gdb) frame 1
> #1  ovsdb_txn_forward_unlist (db=0x55a3d28bb230, txn_fwd=txn_fwd at entry=0x55a436a4aa90) at ovsdb/transaction-forward.c:67
> 67              hmap_remove(&db->txn_forward_sent, &txn_fwd->sent_node);
> 
> (gdb) print db->name
> $1 = 0x55a3d28baef0 "OVN_Southbound"
> ...
> (gdb) print db->txn_forward_sent
> $20 = {buckets = 0x55a58b4cf8b0, one = 0x0, mask = 63, n = 0}
> (gdb) print txn_fwd->sent_node
> $24 = {hash = 0, next = 0x0}
> (gdb)

Thanks for the detailed trace!
I think I know what happened.  txn_fwd->sent_node is filled with zeroes,
because the structure is allocated with xzalloc().  However, HMAP_NODE_NULL
is not zero; it actually equals 1.  So the !hmap_node_is_null(&txn_fwd->sent_node)
check passes when it should not, and the code tries to remove a node that is
not in the hash map.  This should be fixed by correctly initializing
'sent_node'.  The following change should fix the issue:

diff --git a/ovsdb/transaction-forward.c b/ovsdb/transaction-forward.c
index 8ff12ef4b..d15f2f1d6 100644
--- a/ovsdb/transaction-forward.c
+++ b/ovsdb/transaction-forward.c
@@ -52,6 +52,7 @@ ovsdb_txn_forward_create(struct ovsdb *db, const struct jsonrpc_msg *request)
     COVERAGE_INC(txn_forward_create);
     txn_fwd->request = jsonrpc_msg_clone(request);
     ovs_list_push_back(&db->txn_forward_new, &txn_fwd->new_node);
+    hmap_node_nullify(&txn_fwd->sent_node);
 
     return txn_fwd;
 }
---
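
For reference, the relevant hmap bits look roughly like this (paraphrased
from include/openvswitch/hmap.h; the exact code may differ slightly between
versions):

/* Paraphrased from include/openvswitch/hmap.h. */
#define HMAP_NODE_NULL ((struct hmap_node *) 1)

struct hmap_node {
    size_t hash;               /* Hash value. */
    struct hmap_node *next;    /* In singly linked list. */
};

/* Returns true only if 'node' has been marked as not being in any hmap. */
static inline bool
hmap_node_is_null(const struct hmap_node *node)
{
    return node->next == HMAP_NODE_NULL;
}

/* Marks 'node' as not being in any hmap. */
static inline void
hmap_node_nullify(struct hmap_node *node)
{
    node->next = HMAP_NODE_NULL;
}

Since xzalloc() leaves 'sent_node' as {hash = 0, next = 0x0} (exactly what
the gdb output above shows), hmap_node_is_null() returns false, so the
destroy path believes the node is in 'txn_forward_sent' even though it was
never inserted.  Calling hmap_node_nullify() right after creation makes the
node look "not in a map" until it is actually added.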

I'll prepare and send a formal patch.

The way to trigger the issue is to disconnect the client while the relay
still has a transaction from that client that has not yet been sent to the
relay source.
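
In that case, ovsdb_txn_forward_unlist() (frame #1 in the trace) sees a
'sent_node' that does not look null and calls hmap_remove() on a node that
was never inserted into 'txn_forward_sent'.  hmap_remove() is, roughly
(again paraphrased from include/openvswitch/hmap.h):

/* Paraphrased from include/openvswitch/hmap.h; details may differ slightly. */
static inline void
hmap_remove(struct hmap *hmap, struct hmap_node *node)
{
    /* Pick the bucket that should contain 'node' based on its hash. */
    struct hmap_node **bucket = &hmap->buckets[node->hash & hmap->mask];

    /* Walk the bucket's singly linked list until 'node' is found.
     * This is the loop at hmap.h:293 in the backtrace. */
    while (*bucket != node) {
        bucket = &(*bucket)->next;
    }
    *bucket = node->next;
    hmap->n--;
}

With 'sent_node' having hash == 0 and 'txn_forward_sent' being empty
(n = 0 in the gdb output), buckets[0] is NULL, so the loop never finds the
node and ends up dereferencing a null 'next' pointer, which is the fault
shown in frame #0.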

Best regards, Ilya Maximets.
 
> 
> 
> Best regards, wentao Jia
> 
> 
> 
> 
> From: Ilya Maximets <i.maximets at ovn.org>
> Date: 2021-08-04 21:30:40
> To: "贾文涛" <wentao.jia at easystack.cn>
> Cc: i.maximets at ovn.org, ovs-discuss at openvswitch.org
> Subject: Re: [ovs-discuss] ovsdb relay server segment fault
>> OVN scale test: 3 clustered SB servers, 10 SB relay servers, 1000 sandboxes.
>>> The SB relay server will occasionally crash with a segmentation fault.
>>> [root at node-4 ~]# kubectl logs -n openstack ovn-ovsdb-sb-relay-79d5dd7ff4-tqbbd --tail 10 -p
>>> 2021-08-01T03:09:44Z|15758|poll_loop|INFO|wakeup due to [POLLOUT] on fd 101 (10.232.2.213:6642<->10.232.7.147:39998) at lib/stream-fd.c:153 (66% CPU usage)
>>> 2021-08-01T03:09:52Z|15759|timeval|WARN|Unreasonably long 5223ms poll interval (2209ms user, 126ms system)
>>> 2021-08-01T03:09:52Z|15760|timeval|WARN|faults: 19955 minor, 0 major
>>> 2021-08-01T03:09:52Z|15761|timeval|WARN|context switches: 0 voluntary, 5818 involuntary
>>> 2021-08-01T03:09:55Z|15762|timeval|WARN|Unreasonably long 3550ms poll interval (2277ms user, 71ms system)
>>> 2021-08-01T03:09:55Z|15763|timeval|WARN|faults: 3652 minor, 0 major
>>> 2021-08-01T03:09:55Z|15764|timeval|WARN|context switches: 0 voluntary, 1438 involuntary
>>> 2021-08-01T03:09:55Z|15765|poll_loop|INFO|Dropped 43 log messages in last 11 seconds (most recently, 10 seconds ago) due to excessive rate
>>> 2021-08-01T03:09:55Z|15766|poll_loop|INFO|wakeup due to [POLLOUT] on fd 95 (10.232.2.213:6642<->10.232.7.132:53042) at lib/stream-fd.c:153 (67% CPU usage)
>>> /tmp/start_sb_relay.sh: line 5:     9 Segmentation fault      ovsdb-server --remote=db:OVN_Southbound,SB_Global,connections relay:OVN_Southbound:tcp:${SERVICE_NAME}.${NAMESPACE}.svc.cluster.local:6642
>>> 
>>
>>Hi.  Thanks for checking the relays!
>>
>>Do you have a coredump or at least a stack trace of this crash?
>>Otherwise it's not possible to figure out why this happened.
>>
>>Best regards, Ilya Maximets.
> 
> 


