[ovs-dev] [Bug Report] Cisco Switch will randomly get err-disabled when inter-connected with Open vSwitch

Ben Pfaff blp at ovn.org
Sat Jun 18 16:50:34 UTC 2016


On Sat, Jun 18, 2016 at 04:57:59PM +0800, 渔舟 wrote:
> I have run into the same problem that Liang Dong reported:
> http://thread.gmane.org/gmane.linux.network.openvswitch.general/11704
> 
> 
> The message body below is copied from Liang Dong's report.
> 
> 
> Hi
> 
> 
> We have found a very strange bug in Open vSwitch: when it is connected to a Cisco switch port, the port will randomly get err-disabled.
> 
> 
> We have 76 Debian servers running Open vSwitch (2.4.0), each connected to a port on a Cisco Switch 3110. Roughly every week or two, one of those ports becomes err-disabled. From the Cisco switch's perspective, the port was disabled because it detected a loopback: it received a keepalive message that had originated from that same switch port.
> 
> 
> The keepalive message looks like this:
> 
> 
> 11:37:01.749102 e8:04:62:c8:6e:81 > e8:04:62:c8:6e:81, ethertype Loopback (0x9000), length 60: Loopback, skipCount 0, Reply, receipt number 0, data (40 octets)
> 0x0000: 0000 0100 0000 0000 0000 0000 0000 0000 ................
> 0x0010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
> 0x0020: 0000 0000 0000 0000 0000 0000 0000    ..............
> 
> 
> Our first guess was that Open vSwitch accidentally sends the keepalive message it received back out the same port, which leads to the err-disabled state. Normally Open vSwitch discards this message, but about once every week or two, somewhere among the 76 servers, the message gets back to the port on the Cisco switch and the port is err-disabled.
> 
> 
> The workarounds we are using now are either disabling keepalive messages on the Cisco switch or explicitly adding a flow rule on Open vSwitch that discards the keepalive message.

It's really hard to debug problems that are intermittent and require
specific hardware.  If you can eliminate one of those parts of the
problem, then it's easier to deal with.  To attack the intermittent
part, perhaps you could make the Cisco switch send these keepalive
messages much more frequently.  To attack the specific hardware part,
maybe you could reproduce this by sending similar keepalive messages in
software and demonstrate that sometimes OVS sends them back.
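
As a starting point for the software reproduction, here is a minimal
sketch in Python that rebuilds a frame with the same layout as the
capture quoted above (same source and destination MAC, ethertype 0x9000,
skipCount 0, function Reply, receipt number 0, 40 zero octets of data)
and sends it repeatedly.  Assumptions not taken from the report: a Linux
host with AF_PACKET raw sockets and root privileges, and the names
build_keepalive, send_keepalives, and the interface "eth1", which are
purely illustrative.

```python
import socket
import struct
import time

ETH_P_LOOP = 0x9000  # ethertype shown in the capture ("Loopback")


def build_keepalive(mac: str) -> bytes:
    """Build a 60-byte frame laid out like the captured keepalive.

    The source and destination MAC are identical, as in the capture.
    """
    addr = bytes.fromhex(mac.replace(":", ""))
    header = addr + addr + struct.pack("!H", ETH_P_LOOP)
    # skipCount=0, function=1 (Reply), receipt number=0, stored
    # little-endian so the bytes match the dump: 0000 0100 0000.
    payload = struct.pack("<HHH", 0, 1, 0) + bytes(40)
    return header + payload


def send_keepalives(ifname: str, mac: str, count: int, interval: float) -> None:
    """Send `count` keepalive-like frames out of `ifname` (Linux, needs root)."""
    frame = build_keepalive(mac)
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        for _ in range(count):
            sock.send(frame)
            time.sleep(interval)


# Hypothetical usage: flood the OVS-facing interface with keepalives.
# send_keepalives("eth1", "e8:04:62:c8:6e:81", count=100000, interval=0.001)
```

Running tcpdump on the sending interface with a filter such as
"ether proto 0x9000" while this runs should show whether any of the
frames come back from the OVS side.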


