[ovs-dev] ovs-vswitch kernel panic randomly started after 400+ days uptime

Uri Foox uri at zoey.com
Fri Jan 6 20:38:46 UTC 2017


Hey Joe,

I agree with you.

It dumbfounded us that a single packet could cause a kernel panic on a host
so easily, and in fact it made me believe for at least a few days that this
was a red herring. The fact that we cannot replicate it and that it occurs
randomly (within a given time period) also makes it impossible for us to
properly test and resolve. Given your explanation, I can only infer that
this could be exploitable if you had deep enough network access, causing
just enough trouble to drive someone insane, or worse. To set expectations:
this is well outside my area of expertise, and I am trying to figure out as
much as I can on the fly to resolve this issue. If there is any way that my
team can help find the root cause of the issue and work towards a patch,
please let me know. I would like to see it resolved, because our fix is not
really a fix - it's just a way to restore stability by bypassing what's
broken.

So, I defer to you or someone else on this mailing list to tell me what
next steps you're interested in - if any :) - otherwise, we'll probably end
up rebooting the switch, re-enabling the interfaces, and seeing what
happens. If that solves the problem we'll post it here; if it doesn't, we'll
just leave the switch disconnected and hope someone figures out a fix.

Have a great weekend!

Thanks,
Uri


On Fri, Jan 6, 2017 at 3:11 PM, Joe Stringer <joe at ovn.org> wrote:

>
>
> On 6 January 2017 at 11:47, Uri Foox <uri at zoey.com> wrote:
>
>> Hey Joe,
>>
>> I do agree that the patches for the Linux kernel were not a 1:1 match for
>> what our stack trace showed, but they were the only thing we found that
>> even remotely explained our issue. Granted, after upgrading the kernel it
>> was clear that it fixed nothing - so, back to the drawing board...
>>
>> Given your initial comment that something above the stack is most likely
>> causing the issue, we went through our network switches and disconnected
>> one of the network interfaces on each of the computing nodes that
>> communicate with our Juniper switch, which routes internet traffic.
>> Looking at the Juniper switch, we see a lot of errors about interfaces
>> flapping on/off. Their timing does not correlate exactly with the timing
>> of the crashes (they occur within a few minutes before or after a crash),
>> but these errors appear to begin at the same day/time as our first kernel
>> panic and have continued since. As soon as we disconnected the network
>> interface, the Juniper stopped logging any error messages, and we have
>> not experienced a kernel panic in nearly six hours, whereas before it was
>> happening as frequently as every two hours. I won't declare victory yet,
>> but it's the first time in a couple of weeks that we've had stability.
>>
>
> For completeness, I want to say - there's no good reason that Linux should
> crash if it receives a bad packet. This condition may be triggered by
> something external, but it's a bug in the kernel. I think that there's
> supposed to be a check after IPGRE decap to ensure the packet is big
> enough, and that doesn't exist. Somewhere between gre_cisco_rcv() and
> ovs_flow_extract() (or, in newer kernels, key_extract()), there is supposed
> to be this check and it's missing.
>
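
As a rough illustration of the check described above (a minimal,
hypothetical user-space sketch, not the actual kernel code; the helper name
inner_frame_ok() and the constants are made up for illustration), this is
the sort of length validation that would reject a runt inner frame before
flow extraction:

    /* gre_len_check.c - minimal user-space sketch of the kind of length
     * check described above; not the actual kernel code. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ETH_HLEN    14        /* Ethernet header length */
    #define ETH_P_IP    0x0800    /* IPv4 ethertype */
    #define IP_MIN_HLEN 20        /* minimal IPv4 header length */

    /* Return true only if a decapsulated inner frame is long enough to be
     * handed to flow extraction without reading past the end of the buffer. */
    static bool inner_frame_ok(const uint8_t *frame, size_t len)
    {
        if (len < ETH_HLEN) {
            return false;         /* truncated Ethernet header */
        }
        uint16_t ethertype = (uint16_t) ((frame[12] << 8) | frame[13]);
        if (ethertype == ETH_P_IP && len < ETH_HLEN + IP_MIN_HLEN) {
            return false;         /* claims IPv4, but too short to hold it */
        }
        return true;
    }

    int main(void)
    {
        uint8_t runt[10] = { 0 }; /* shorter than an Ethernet header */
        printf("runt accepted? %s\n",
               inner_frame_ok(runt, sizeof runt) ? "yes" : "no");
        return 0;
    }

In the kernel itself, the equivalent guard would presumably be a
pskb_may_pull()-style check on the skb before the flow key is extracted.
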
> If you can alleviate your issue, that's great for you; if we can fix this
> problem for other users of GRE (and potentially other tunnel types), that's
> great for everybody. So I think it's worth digging a bit further if we can.
> Having a minimal reproduction environment is always nice so we can verify
> that any proposed fix does address the issue. I also wonder whether this
> even affects the latest kernels, although the code has been refactored
> considerably from v4.3 onwards.
>
>
>>
>> Here is a sample of the error messages in the Juniper log, in case they
>> tell you anything:
>>
>> Jan  2 00:34:23  pod2-core dfwd[1114]: CH_NET_SERV_KNOB_STATE read failed
>> (rtslib err 2 - No such file or directory). Setting chassis state to NORMAL
>> (All FPC) and retry in the idle phase (59 retries)
>> Jan  2 00:34:48  pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 567,
>> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/30
>> Jan  2 00:35:06  pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 569,
>> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/31
>> Jan  2 00:36:06  pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 569,
>> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/31
>> Jan  2 00:39:06  pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 569,
>> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/31
>> Jan  2 00:44:33  pod2-core mib2d[1101]: SNMP_TRAP_LINK_DOWN: ifIndex 567,
>> ifAdminStatus up(1), ifOperStatus down(2), ifName ge-0/0/30
>>
>> The interface ge-0/0/30 can be replaced with any interface that we have
>> plugged into our computing nodes - they all showed errors.
>>
>> I suspect that your analysis is somewhat accurate: essentially, this
>> switch suffered some sort of failure that has manifested itself in an
>> extremely odd way, sending rogue packets that either the kernel or the
>> version of OVS we are running cannot recover from.
>>
>> root at node-2:~# ovs-vswitchd -V
>> ovs-vswitchd (Open vSwitch) 2.0.2
>> Compiled Nov 28 2014 21:37:19
>> OpenFlow versions 0x1:0x1
>>
>
>> I figured I would follow up with what we did to "solve" the issue. We're
>> not really sure whether we should reboot or RMA the switch. For now, if
>> the above gives you or Pravin any more insights, please do share.
>>
>> As a side note, I have to say I am extremely thankful for the replies to
>> this thread. I figured posting something would have a low chance of getting
>> any attention but your confirmation of what I was able to piece together
>> gave us the confidence to move in a direction that hopefully brings back
>> stability.
>>
>
> It always helps when you bring very specific kernel traces,
> impact/behaviour and well-written descriptions. Thanks for reporting the
> issue!
>



-- 
Uri Foox | Zoey | Founder
http://www.zoey.com

