[ovs-discuss] "ovs|01253|reconnect|ERR|tcp:127.0.0.1:50814: no response to inactivity probe after 5.01 seconds, disconnecting" messages and lost packets

Guru Shetty guru at ovn.org
Thu Sep 27 20:09:37 UTC 2018


On Thu, 27 Sep 2018 at 12:52, Jean-Philippe Méthot <
jp.methot at planethoster.info> wrote:

> Sorry, the log file is ovsdb-server.log. ovs-vswitchd.log seems to show
> the counterpart of this error:
>
> 2018-09-27T19:38:01.217Z|00783|rconn|ERR|br-tun<->tcp:127.0.0.1:6633: no
> response to inactivity probe after 5 seconds, disconnecting
> 2018-09-27T19:38:01.218Z|00784|rconn|ERR|br-ex<->tcp:127.0.0.1:6633: no
> response to inactivity probe after 5 seconds, disconnecting
>

1. Who is at 127.0.0.1:6633? This is likely an OpenFlow controller.
2. What does `ovs-vsctl list controller` say?
3. What does `ovs-vsctl list manager` say?
4. What does `ovs-appctl -t ovsdb-server ovsdb-server/list-remotes` say?
5. What does `ps -ef | grep ovs` say?

I am asking these simple questions because I am not familiar with
OpenStack ml2.
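
(For context, the "inactivity probe" in these messages is just a keepalive.
Once a connection has been idle for the probe interval, whether it is an
OpenFlow connection to a controller or a JSON-RPC connection to ovsdb-server,
OVS sends an echo request and drops the connection if no reply arrives within
the same interval. If the answers above show that 127.0.0.1:6633 is the
Neutron OVS agent's local controller, which is only an assumption at this
point, one mitigation people commonly apply is to raise the probe interval on
the Controller and Manager records rather than keeping the 5-second interval
your logs show. A minimal sketch, using the bridge names from your logs:

  # Interval is in milliseconds; 0 disables the probe entirely.
  ovs-vsctl set controller br-tun inactivity_probe=30000
  ovs-vsctl set controller br-ex inactivity_probe=30000
  # "." selects the record when exactly one Manager is configured.
  ovs-vsctl set manager . inactivity_probe=30000

That only hides the symptom, of course; the real question is still why the
peer stops answering for more than 5 seconds at a time.)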





> 2018-09-27T19:38:02.218Z|00785|rconn|INFO|br-tun<->tcp:127.0.0.1:6633:
> connecting...
> 2018-09-27T19:38:02.218Z|00786|rconn|INFO|br-ex<->tcp:127.0.0.1:6633:
> connecting...
> 2018-09-27T19:38:03.218Z|00787|rconn|INFO|br-tun<->tcp:127.0.0.1:6633:
> connection timed out
> 2018-09-27T19:38:03.218Z|00788|rconn|INFO|br-tun<->tcp:127.0.0.1:6633:
> waiting 2 seconds before reconnect
> 2018-09-27T19:38:03.218Z|00789|rconn|INFO|br-ex<->tcp:127.0.0.1:6633:
> connection timed out
> 2018-09-27T19:38:03.218Z|00790|rconn|INFO|br-ex<->tcp:127.0.0.1:6633:
> waiting 2 seconds before reconnect
> 2018-09-27T19:38:05.218Z|00791|rconn|INFO|br-tun<->tcp:127.0.0.1:6633:
> connecting...
> 2018-09-27T19:38:05.218Z|00792|rconn|INFO|br-ex<->tcp:127.0.0.1:6633:
> connecting...
> 2018-09-27T19:38:06.221Z|00793|rconn|INFO|br-tun<->tcp:127.0.0.1:6633:
> connected
> 2018-09-27T19:38:06.222Z|00794|rconn|INFO|br-ex<->tcp:127.0.0.1:6633:
> connected
>
> Who is at 127.0.0.1:45928 and 127.0.0.1:45930?
>
>
> That seems to be ovs-vswitchd in that range. Of course, these ports seem
> to change all the time, but I think vswitchd tends to stay in that range.
>
> Here’s an example of "ss -anp | grep ovs" output so you can get an idea of
> the port mapping.
>
> tcp    LISTEN     0      10     127.0.0.1:6640      *:*                  users:(("ovsdb-server",pid=939,fd=19))
> tcp    ESTAB      0      0      127.0.0.1:6640      127.0.0.1:28720      users:(("ovsdb-server",pid=939,fd=20))
> tcp    ESTAB      0      0      127.0.0.1:6640      127.0.0.1:28734      users:(("ovsdb-server",pid=939,fd=18))
> tcp    ESTAB      0      0      127.0.0.1:6640      127.0.0.1:28754      users:(("ovsdb-server",pid=939,fd=25))
> tcp    ESTAB      0      0      127.0.0.1:6640      127.0.0.1:28730      users:(("ovsdb-server",pid=939,fd=24))
> tcp    ESTAB      0      0      127.0.0.1:28754     127.0.0.1:6640       users:(("ovsdb-client",pid=20965,fd=3))
> tcp    ESTAB      0      0      127.0.0.1:6640      127.0.0.1:28752      users:(("ovsdb-server",pid=939,fd=23))
> tcp    ESTAB      0      0      127.0.0.1:46917     127.0.0.1:6633       users:(("ovs-vswitchd",pid=1013,fd=214))
> tcp    ESTAB      0      0      127.0.0.1:6640      127.0.0.1:28750      users:(("ovsdb-server",pid=939,fd=22))
> tcp    ESTAB      0      0      127.0.0.1:6640      127.0.0.1:28722      users:(("ovsdb-server",pid=939,fd=21))
> tcp    ESTAB      0      0      127.0.0.1:28752     127.0.0.1:6640       users:(("ovsdb-client",pid=20363,fd=3))
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
>
>
> On Sep 27, 2018, at 15:39, Guru Shetty <guru at ovn.org> wrote:
>
>
> ovs-vswitchd is multi-threaded. ovsdb-server is single-threaded.
> (You did not answer my question about the file from which the logs were
> printed in your email)
>
> Who is at 127.0.0.1:45928 and 127.0.0.1:45930?
>
> On Thu, 27 Sep 2018 at 11:14, Jean-Philippe Méthot <
> jp.methot at planethoster.info> wrote:
>
>> Thank you for your reply.
>>
>> This is Openstack with the ml2 plugin. There’s no other third-party
>> application used with our network, so no OVN or anything of the sort.
>> Essentially, to give a quick idea of the topology, we have our VMs on our
>> compute nodes going through GRE tunnels toward network nodes, where they
>> are routed in network namespaces toward a flat external network.
>>
>> Generally, the above indicates that a daemon fronting an Open vSwitch
>> database hasn't been able to connect to its client. Usually happens when
>> CPU consumption is very high.
>>
>>
>> Our network nodes' CPUs are literally sleeping. Is openvswitch
>> single-threaded or multi-threaded, though? If ovs overloaded a single thread,
>> it’s possible I may have missed it.
>>
>> Jean-Philippe Méthot
>> Openstack system administrator
>> Administrateur système Openstack
>> PlanetHoster inc.
>>
>>
>>
>>
>> On Sep 27, 2018, at 14:04, Guru Shetty <guru at ovn.org> wrote:
>>
>>
>>
>> On Wed, 26 Sep 2018 at 12:59, Jean-Philippe Méthot via discuss <
>> ovs-discuss at openvswitch.org> wrote:
>>
>>> Hi,
>>>
>>> I’ve been using openvswitch for my networking backend on openstack for
>>> several years now. Lately, as our network has grown, we’ve started noticing
>>> some intermittent packet drops accompanied by the following error message
>>> in openvswitch:
>>>
>>> 2018-09-26T04:15:20.676Z|00005|reconnect|ERR|tcp:127.0.0.1:45928: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:15:20.677Z|00006|reconnect|ERR|tcp:127.0.0.1:45930: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>>
>>
>> Open vSwitch is a project with multiple daemons. Since you are using
>> OpenStack, it is not clear from your message what type of networking
>> plugin you are using. Do you use OVN?
>> Also, you did not mention from which file you have gotten the above
>> errors.
>>
>> Generally, the above indicates that a daemon fronting an Open vSwitch
>> database hasn't been able to connect to its client. Usually happens when
>> CPU consumption is very high.
>>
>>
>>
>>> 2018-09-26T04:15:30.409Z|00007|reconnect|ERR|tcp:127.0.0.1:45874: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:15:33.661Z|00008|reconnect|ERR|tcp:127.0.0.1:45934: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:15:33.847Z|00009|reconnect|ERR|tcp:127.0.0.1:45894: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:16:03.247Z|00010|reconnect|ERR|tcp:127.0.0.1:45958: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:16:21.534Z|00011|reconnect|ERR|tcp:127.0.0.1:45956: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:16:21.786Z|00012|reconnect|ERR|tcp:127.0.0.1:45974: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:16:47.085Z|00013|reconnect|ERR|tcp:127.0.0.1:45988: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:16:49.618Z|00014|reconnect|ERR|tcp:127.0.0.1:45982: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:16:53.321Z|00015|reconnect|ERR|tcp:127.0.0.1:45964: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:17:15.543Z|00016|reconnect|ERR|tcp:127.0.0.1:45986: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:17:24.767Z|00017|reconnect|ERR|tcp:127.0.0.1:45990: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:17:31.735Z|00018|reconnect|ERR|tcp:127.0.0.1:45998: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:20:12.593Z|00019|reconnect|ERR|tcp:127.0.0.1:46014: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:23:51.996Z|00020|reconnect|ERR|tcp:127.0.0.1:46028: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:25:12.187Z|00021|reconnect|ERR|tcp:127.0.0.1:46022: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:25:28.871Z|00022|reconnect|ERR|tcp:127.0.0.1:46056: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:27:11.663Z|00023|reconnect|ERR|tcp:127.0.0.1:46046: no
>>> response to inactivity probe after 5 seconds, disconnecting
>>> 2018-09-26T04:29:56.161Z|00024|jsonrpc|WARN|tcp:127.0.0.1:46018:
>>> receive error: Connection reset by peer
>>> 2018-09-26T04:29:56.161Z|00025|reconnect|WARN|tcp:127.0.0.1:46018:
>>> connection dropped (Connection reset by peer)
>>>
>>> This definitely kills the connection for a few seconds before it
>>> reconnects. So, I’ve been wondering, what is this probe and what is really
>>> happening here? What’s the cause and is there a way to fix this?
>>>
>>> Openvswitch version is 2.9.0-3 on CentOS 7 with Openstack Pike running
>>> on it (but the issues show up on Queens too).
>>>
>>>
>>> Jean-Philippe Méthot
>>> Openstack system administrator
>>> Administrateur système Openstack
>>> PlanetHoster inc.
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> discuss at openvswitch.org
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>>
>>
>>
>