[ovs-discuss] bond members flapping in state up and down

Daniel Pfanz Daniel.Pfanz at koerber-pharma.com
Tue Mar 2 20:58:00 UTC 2021


Hi OVS-team,

We don't know whether this is really a bug, or whether it has anything to do with OVS at all, but we hope it does, because so far we have no other ideas for solving our issue.

Environment:
We are running an OpenStack cluster based on CentOS 8 with nearly 50 host systems in total.
The controller cluster consists of three nodes. Each of these control nodes is connected to a 4-member switch stack by two cables via two dual-port "Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)" cards.
These ports have LACP enabled and form a "port-channel", as Dell calls it. In other words, we have two cables, coming from different network cards (but the same model), building an LACP-based bond.
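For context, our setup is roughly equivalent to the following ovs-vsctl command (a sketch only; the bridge/bond names and the second member interface are placeholders, the real configuration is in the attached conf.db):

  # Sketch of an OVS LACP bond; "br-ex", "bond1" and the second member
  # interface are placeholders for illustration, not our actual names.
  $> sudo ovs-vsctl add-bond br-ex bond1 enp3s0f1 enp4s0f1 \
         bond_mode=balance-tcp lacp=active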

Issue:
For two months now we have had a recurring issue where both interfaces go down, a few seconds apart. Sometimes one of the interfaces comes back online and goes down a second time before the bond finally breaks.
In most cases the links come back up ("link state up") within 2-10 seconds, and the link aggregation is then negotiated again.
The last time, on 2021-02-24, this happened 15 times within one hour.

This issue has occurred several times out of the blue in the middle of the day, in most cases on only one of the mentioned host machines, sometimes on two. But the last two times it happened at almost the same time on all three machines while we were running a so-called "overcloud deployment", which means a rollout of Ansible- and Puppet-based configuration changes to every host system of the whole cluster.

Do not confuse cause and effect:
We don't know exactly what happens at the moment the links start to flap, but we did see a growing load average on these servers.
It is completely unclear whether the load is the cause or just an effect, because we know that when the nodes can no longer see each other, they start all the (not really) lost resources of the other, unreachable cluster nodes. This raises the load in any case and can be reproduced. Another problem is that once the link state is "up" again, the nodes can see each other and the cluster heals itself, the resources are torn down, and then after a couple of minutes or just seconds the issue appears again, everything starts from the beginning, and the load keeps rising.

So we hope you can tell us whether such an increased load average can cause these problems or not.
Unfortunately our sysstat reporting was set to a 10-minute interval that day, so we only got these values from the database (see also the note after the tables on reducing the sampling interval):

                CPU     %user     %nice   %system   %iowait    %steal     %idle
02:10:00 PM     all     18.67      0.00      8.61      0.01      0.00     72.72
02:20:00 PM     all     41.89      0.00     35.00      0.00      0.00     23.11
02:30:00 PM     all     30.02      0.00     37.79      0.00      0.00     32.19
02:40:00 PM     all     53.12      0.00     12.50      0.01      0.00     34.37
02:50:00 PM     all     53.57      0.00     12.28      0.01      0.00     34.15
03:00:00 PM     all     47.39      0.00     11.99      0.01      0.00     40.61
...
Average:        all     13.40      0.00      6.19      0.01      0.00     80.40

03:20:00 PM       DEV       tps     rkB/s     wkB/s   areq-sz    aqu-sz     await     svctm     %util
02:10:00 PM    dev8-0    142.92      5.41   6676.74     46.75      0.07      0.76      0.20      2.80
02:20:00 PM    dev8-0    239.82     12.81   8050.04     33.62      0.09      0.72      0.14      3.27
02:30:00 PM    dev8-0    178.67      2.63   5310.33     29.74      0.04      0.61      0.08      1.45
02:40:00 PM    dev8-0    318.51      2.07  10603.47     33.30      0.08      0.56      0.17      5.33
02:50:00 PM    dev8-0    389.29      0.11  11048.52     28.38      0.11      0.58      0.16      6.23
03:00:00 PM    dev8-0    421.79      0.03  11976.57     28.39      0.12      0.65      0.13      5.27
...
Average:       dev8-0    142.71      2.36   7260.18     50.89      0.09      0.90      0.18      2.52
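A note on the 10-minute interval: for the next occurrence we could reduce it to one minute. On CentOS 8 the collection is driven by a systemd timer (assuming the default sysstat packaging), roughly like this:

  # Assuming the default sysstat packaging on CentOS 8 (systemd timer);
  # this drops the collection interval from 10 minutes to 1 minute.
  $> sudo systemctl edit sysstat-collect.timer
     [Timer]
     OnCalendar=
     OnCalendar=*:00/1
  $> sudo systemctl restart sysstat-collect.timer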

Information / version:
All hosts are servers with ...

-        2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz with 16 threads each, or 32 threads in total.

-        256GB RAM

-        2x 450GB SSDs in RAID1

$> ovs-vswitchd --version
ovs-vswitchd (Open vSwitch) 2.12.0
# No patches applied by us

# Package version
network-scripts-openvswitch.x86_64            2.12.0-1.1.el8
openvswitch.x86_64                            2.12.0-1.1.el8

# Package info
$> sudo yum info openvswitch.x86_64
Last metadata expiration check: 0:41:55 ago on Tue 02 Mar 2021 08:47:40 PM CET.
Installed Packages
Name         : openvswitch
Version      : 2.12.0
Release      : 1.1.el8
Architecture : x86_64
Size         : 5.7 M
Source       : openvswitch-2.12.0-1.1.el8.src.rpm
Repository   : @System
From repo    : delorean-ussuri-testing
Summary      : Open vSwitch daemon/database/utilities
URL          : http://www.openvswitch.org/
License      : ASL 2.0 and LGPLv2+ and SISSL
Description  : Open vSwitch provides standard network bridging functions and
             : support for the OpenFlow protocol for remote per-flow control of
             : traffic.

$> cat /etc/centos-release
CentOS Linux release 8.2.2004 (Core)

$> cat /proc/version
Linux version 4.18.0-193.28.1.el8_2.x86_64 (mockbuild at kbuilder.bsys.centos.org) (gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #1 SMP Thu Oct 22 00:20:22 UTC 2020

Output of "ovs-dpctl show" is attached to this e-mail.

/etc/openvswitch/conf.db is attached with shrunk content.

Logs:
We set "ovs-appctl vlog/set file:bond:dbg syslog:bond:dbg file:lacp:dbg syslog:lacp:dbg" before the last occurrence.
The following lines mark the first "link state down" event in the logs:
  2021-02-24T13:13:59.710Z|98452|bond|INFO|interface enp3s0f1: link state down
  2021-02-24T13:13:59.710Z|98453|bond|INFO|interface enp3s0f1: disabled
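For the next occurrence we also plan to dump the bond and LACP state directly while the flapping is going on (a sketch; "bond1" stands for whatever the bond port is actually called in our setup):

  # Sketch: capture OVS bond/LACP state during an event
  # ("bond1" is a placeholder for the real bond port name).
  $> ovs-appctl bond/show bond1
  $> ovs-appctl lacp/show bond1
  $> ovs-appctl vlog/list | grep -iE 'bond|lacp'   # confirm the dbg levels are active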

Also interesting: around this first event, messages like "Unreasonably long XXXXms poll interval" start to appear as well.
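To see how closely those warnings line up with the flap events, something like the following counts them per minute (the default log path is assumed here):

  # Count "Unreasonably long ... poll interval" warnings per minute
  # (default log path assumed).
  $> grep 'Unreasonably long' /var/log/openvswitch/ovs-vswitchd.log \
         | cut -c1-16 | sort | uniq -c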

Tasks until now:
We have worked hard to find out where the root cause may lie. We have updated the firmware of the switches, updated the BIOS of the server systems and the firmware of the network cards. We also changed cables and ports on the switch.
Nothing helped so far.

Questions:

-        What exactly do "link state down" and "disabled" (as seen above) mean? Is it like the cable being pulled, or the other side bringing the link down? Or could it mean the kernel switched the Ethernet port off? (See the sketch of kernel-side checks after these questions.)

-        In any case, is it possible to debug this further or to increase the verbosity of the interface-related logs?

-        Do you see a correlation between the rising system/CPU load and the links being brought down?
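To help answer the first question, these are the kernel-side checks we would capture at the next occurrence (a sketch; ixgbe is assumed to be the driver for the X540 cards):

  # Kernel/driver view of the same interface (enp3s0f1, as in the log above).
  $> ip -s link show enp3s0f1                      # carrier state and error counters
  $> ethtool enp3s0f1 | grep -i 'link detected'
  $> dmesg -T | grep -iE 'enp3s0f1|ixgbe'          # driver messages around the flap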

Kind regards
Daniel Pfanz

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ovs-vswitchd.controller01.log
Type: application/octet-stream
Size: 4416301 bytes
Desc: ovs-vswitchd.controller01.log
URL: <http://mail.openvswitch.org/pipermail/ovs-discuss/attachments/20210302/3c148765/attachment-0001.obj>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ovs-dpctl-show-output.txt
URL: <http://mail.openvswitch.org/pipermail/ovs-discuss/attachments/20210302/3c148765/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: conf.db.shrinked.txt
URL: <http://mail.openvswitch.org/pipermail/ovs-discuss/attachments/20210302/3c148765/attachment-0003.txt>

