[ovs-discuss] "ovs|01253|reconnect|ERR|tcp:127.0.0.1:50814: no response to inactivity probe after 5.01 seconds, disconnecting" messages and lost packets

Jean-Philippe Méthot jp.methot at planethoster.info
Thu Sep 27 19:59:15 UTC 2018


> So something in this OpenStack driver is broken, because it does not respond to server probes.

I must specify, it’s not broken ALL the time. It seems to start breaking when traffic increases on the OpenStack setup. This increase is not accompanied by a CPU overload, though: it’s happening right now and the load is barely over 1 on a 16-core Xeon. The client trying to connect also appears to be ovs-vswitchd.
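
In case a single busy thread is hiding behind that low load average: a load of 1 on a 16-core box is exactly what one pegged thread looks like. A quick per-thread check for the OVS daemons (a sketch, assuming pidof and the sysstat pidstat tool are available):

    # Per-thread CPU for ovs-vswitchd and ovsdb-server; one thread
    # stuck near 100% would explain probe timeouts despite low load.
    top -H -b -n 1 -p "$(pidof ovs-vswitchd)" | head -n 20
    pidstat -t -p "$(pidof ovsdb-server)" 1 5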


Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.




> On Sep 27, 2018, at 15:47, Paul Greenberg <greenpau at outlook.com> wrote:
> 
> This specific error is triggered by the following. When a client connects to the ovsdb JSON-RPC server, it has to follow a certain protocol. In this case, the server sends probes, and the client must acknowledge them by sending the exact message it received from the server back to the server. If a client does not do that in time, the server drops the client.
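> 
> (For reference, the probe is the JSON-RPC "echo" method from RFC 7047, sent with id "echo"; a minimal exchange looks roughly like this, with the client echoing the request's params and id back as the result:)
> 
>     server -> client: {"id":"echo","method":"echo","params":[]}
>     client -> server: {"id":"echo","result":[],"error":null}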
> 
> So something in this OpenStack driver is broken, because it does not respond to server probes.
> 
> Best Regards,
> Paul Greenberg
>  
> From: Guru Shetty <guru at ovn.org>
> Sent: Thursday, September 27, 2018 3:40 PM
> To: jp.methot at planethoster.info
> Cc: ovs-discuss at openvswitch.org
> Subject: Re: [ovs-discuss] "ovs|01253|reconnect|ERR|tcp:127.0.0.1:50814: no response to inactivity probe after 5.01 seconds, disconnecting" messages and lost packets
>  
> 
> ovs-vswitchd is multi-threaded. ovsdb-server is single-threaded.
> (You did not answer my question about which file the logs in your email came from.)
> 
> Who is at 127.0.0.1:45928 and 127.0.0.1:45930?
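> 
> (While one of those connections is still established, ss can answer that directly; a sketch, assuming the iproute2 ss tool is installed:)
> 
>     # Show which local processes hold either end of each connection.
>     ss -tnp | grep -E ':45928|:45930'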
> 
> On Thu, 27 Sep 2018 at 11:14, Jean-Philippe Méthot <jp.methot at planethoster.info> wrote:
> Thank you for your reply.
> 
> This is OpenStack with the ML2 plugin. There’s no other third-party application used with our network, so no OVN or anything of the sort. Essentially, to give a quick idea of the topology, we have our VMs on our compute nodes going through GRE tunnels toward network nodes, where they are routed in network namespaces toward a flat external network.
> 
>> Generally, the above indicates that a daemon fronting an Open vSwitch database hasn't been able to connect to its client. Usually happens when CPU consumption is very high.
> 
> Our network nodes’ CPUs are practically idle. Is Open vSwitch single-threaded or multi-threaded, though? If OVS overloaded a single thread, it’s possible I may have missed it.
> 
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
> 
> 
> 
> 
>> On Sep 27, 2018, at 14:04, Guru Shetty <guru at ovn.org> wrote:
>> 
>> 
>> 
>> On Wed, 26 Sep 2018 at 12:59, Jean-Philippe Méthot via discuss <ovs-discuss at openvswitch.org> wrote:
>> Hi,
>> 
>> I’ve been using Open vSwitch as my networking backend on OpenStack for several years now. Lately, as our network has grown, we’ve started noticing intermittent packet drops accompanied by the following error messages in Open vSwitch:
>> 
>> 2018-09-26T04:15:20.676Z|00005|reconnect|ERR|tcp:127.0.0.1:45928: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:15:20.677Z|00006|reconnect|ERR|tcp:127.0.0.1:45930: no response to inactivity probe after 5 seconds, disconnecting
>> 
>> Open vSwitch is a project with multiple daemons. Since you are using OpenStack, it is not clear from your message what type of networking plugin you are using. Do you use OVN?
>> Also, you did not mention from which file you got the above errors.
>> 
>> Generally, the above indicates that a daemon fronting an Open vSwitch database hasn't been able to connect to its client. Usually happens when CPU consumption is very high.
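>> 
>> (If the stalls are transient, a common mitigation is raising the probe interval on the database side; a sketch, assuming the clients connect through a single Manager record — the value is in milliseconds, and the 5000 ms default matches the "5 seconds" in the logs:)
>> 
>>     # Inspect the configured manager target(s), then raise the
>>     # inactivity probe interval from the 5000 ms default to 30 s.
>>     ovs-vsctl get-manager
>>     ovs-vsctl set Manager . inactivity_probe=30000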
>> 
>>  
>> 2018-09-26T04:15:30.409Z|00007|reconnect|ERR|tcp:127.0.0.1:45874: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:15:33.661Z|00008|reconnect|ERR|tcp:127.0.0.1:45934: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:15:33.847Z|00009|reconnect|ERR|tcp:127.0.0.1:45894: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:16:03.247Z|00010|reconnect|ERR|tcp:127.0.0.1:45958: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:16:21.534Z|00011|reconnect|ERR|tcp:127.0.0.1:45956: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:16:21.786Z|00012|reconnect|ERR|tcp:127.0.0.1:45974: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:16:47.085Z|00013|reconnect|ERR|tcp:127.0.0.1:45988: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:16:49.618Z|00014|reconnect|ERR|tcp:127.0.0.1:45982: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:16:53.321Z|00015|reconnect|ERR|tcp:127.0.0.1:45964: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:17:15.543Z|00016|reconnect|ERR|tcp:127.0.0.1:45986: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:17:24.767Z|00017|reconnect|ERR|tcp:127.0.0.1:45990: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:17:31.735Z|00018|reconnect|ERR|tcp:127.0.0.1:45998: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:20:12.593Z|00019|reconnect|ERR|tcp:127.0.0.1:46014: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:23:51.996Z|00020|reconnect|ERR|tcp:127.0.0.1:46028: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:25:12.187Z|00021|reconnect|ERR|tcp:127.0.0.1:46022: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:25:28.871Z|00022|reconnect|ERR|tcp:127.0.0.1:46056: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:27:11.663Z|00023|reconnect|ERR|tcp:127.0.0.1:46046: no response to inactivity probe after 5 seconds, disconnecting
>> 2018-09-26T04:29:56.161Z|00024|jsonrpc|WARN|tcp:127.0.0.1:46018: receive error: Connection reset by peer
>> 2018-09-26T04:29:56.161Z|00025|reconnect|WARN|tcp:127.0.0.1:46018: connection dropped (Connection reset by peer)
>> 
>> This definitely kills the connection for a few seconds before it reconnects. So, I’ve been wondering, what is this probe and what is really happening here? What’s the cause and is there a way to fix this? 
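>> 
>> (To watch the probes themselves while debugging, the reconnect module's log level can be raised in the running daemons; a sketch using ovs-appctl:)
>> 
>>     # Log probe sends/receives to each daemon's log file.
>>     ovs-appctl -t ovsdb-server vlog/set reconnect:file:dbg
>>     ovs-appctl -t ovs-vswitchd vlog/set reconnect:file:dbg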
>> 
>> Open vSwitch version is 2.9.0-3 on CentOS 7 with OpenStack Pike running on it (but the issues show up on Queens too).
>> 
>>  
>> Jean-Philippe Méthot
>> Openstack system administrator
>> Administrateur système Openstack
>> PlanetHoster inc.
>> 
>> 
>> 
>> 
>> _______________________________________________
>> discuss mailing list
>> discuss at openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
