[ovs-discuss] [IDL] unhandled/unexpected OVSDB reply to set_db_change_aware method

Thu Jul 9 13:40:11 UTC 2020

On Mon, 2020-07-06 at 23:57 -0700, Girish Moodalbail wrote:
> Hello all,
> 
> In the OVN Kubernetes project, we are seeing a ton of following
> errors in
> ovsdb-server-nb.log and ovsdb-server-sb.log.
> 
> ---------8<------------8<------
> 2020-07-07T06:30:53.320Z|20733|jsonrpc|WARN|unix#3145: send error:
> Broken
> pipe
> 2020-07-07T06:30:53.320Z|20734|reconnect|WARN|unix#3145: connection
> dropped
> (Broken pipe)
> ---------8<------------8<------
> 
> (Note: The N in uniX#N used for unbound unix sockets is monotonically
> increasing number. So, with unix#3145 there are more than 3000 such
> errors
> in the log file)
> 
> I am going to be vague in explaining why it might be happening, so
> sincere
> apologies for that. So, this is what we are seeing and how to
> reproduce
> the issue.
> 
> 1. Unhandled reply case
> ========================
> 
> (a) Client
> ~~~~~~~~~~~
> $  ovn-sbctl -vjsonrpc -vstream --no-leader-only
> --db=unix:/var/run/openvswitch/ovnsb_db.sock get SB_Global .
> options:e2e_timestamp
> <output snipped>
> 2020-07-
> 07T04:54:22Z|00006|jsonrpc|DBG|unix:/var/run/openvswitch/ovnsb_db.soc
> k:
> send request, method="set_db_change_aware", params=[true], id=3
> <output snipped>
> "1594097605"  (<-- returned value of e2e_timestamp)
> 
> (b) Server
> ~~~~~~~~~~~
> <output snipped>
> 2020-07-07T06:30:53.320Z|20731|jsonrpc|DBG|unix#3145: received
> request,
> method="set_db_change_aware", params=[true], id=3
> 2020-07-07T06:30:53.320Z|20732|jsonrpc|DBG|unix#3145: send reply,
> result={}, id=3
> 2020-07-07T06:30:53.320Z|20733|jsonrpc|WARN|unix#3145: send error:
> Broken
> pipe
> 2020-07-07T06:30:53.320Z|20734|reconnect|WARN|unix#3145: connection
> dropped
> (Broken pipe)
> <output snipped>
> 
> So, the OVSDB server did try to send the reply. However, before it
> could
> send it through stream-fd.c`fd_send() the client program exits and
> closed
> the Unix Socket and we see broken pipe on the sender
> 
> 2. Unexpected reply case
> =========================
> (a) Client
> ~~~~~~~~~~~
> $ ovn-sbctl -v --no-leader-only
> --db=unix:/var/run/openvswitch/ovnsb_db.sock get datapath_binding
> 3b735851-db34-489a-a54c-8bb947431a74 external_ids:logical-switch
> 
> <output snipped>
> 2020-07-
> 07T06:43:18Z|00021|ovsdb_idl|DBG|unix:/var/run/openvswitch/ovnsb_db.s
> ock:
> DATA_MONITOR_COND_SINCE_REQUESTED -> MONITORING at lib/ovsdb-
> idl.c:764
> 2020-07-
> 07T06:43:18Z|00022|jsonrpc|DBG|unix:/var/run/openvswitch/ovnsb_db.soc
> k:
> received reply, result={}, id=3
> 2020-07-
> 07T06:43:18Z|00023|ovsdb_idl|DBG|unix:/var/run/openvswitch/ovnsb_db.s
> ock:
> received unexpected reply message: {"error":null,"id":3,"result":{}}
> <output snipped>
> 
> 
> (b) Server
> ~~~~~~~~~~~
> <output snipped>
> 2020-07-07T06:43:18.402Z|21691|jsonrpc|DBG|unix#3281: received
> request,
> method="set_db_change_aware", params=[true], id=3
> 2020-07-07T06:43:18.402Z|21692|jsonrpc|DBG|unix#3281: send reply,
> result={}, id=3
> <output snipped>
> 
> So, in the 2nd case the Server's reply makes it all the way to the
> Client,
> but the client is not expecting it.
> 
> So, looking at the lib/ovsdb-idl.c's ovsdb_idl_process_response()....
> 
> ovsdb_idl_process_response() {
> 
>     case IDL_S_SERVER_MONITOR_COND_REQUESTED:
>         if (ok) {
>             idl->server.monitoring = OVSDB_IDL_MONITORING_COND;
>             ovsdb_idl_db_parse_monitor_reply(&idl->server, msg-
> >result,
>                                              OVSDB_IDL_MM_MONITOR_CON
> D);
>             if (ovsdb_idl_check_server_db(idl)) {
>                 ovsdb_idl_send_db_change_aware(idl);
>             }
>         } else {
> 
> }
> 
> ... I see that we call the OVSDB set_db_change_aware method through
> ovsdb_idl_send_db_change_aware(), but there is no state machine to
> process
> the reply for it. In the 1st case, the client exits early and
> connection is
> broken? Whilst, in the 2nd case the reply comes through but the code
> is not
> expecting it.

I guess we're not seeing this at Red Hat because we're using
monitor_all still... anyone have thoughts about this issue?

Dan