[ovs-dev] ovs-vswitch kernel panic randomly started after 400+ days uptime

Joe Stringer joe at ovn.org
Fri Jan 6 19:27:44 UTC 2017


On 5 January 2017 at 19:24, Uri Foox <uri at zoey.com> wrote:

> Hey Joe,
>
> Thank you so much for responding! After 10 days of trying to figure this
> out I'm at a loss.
>
> root@node-8:~# modinfo openvswitch
> filename:       /lib/modules/3.13.0-106-generic/kernel/net/openvswitch/openvswitch.ko
> license:        GPL
> description:    Open vSwitch switching datapath
> srcversion:     94294A72258BA583D666607
> depends:        libcrc32c,vxlan,gre
> intree:         Y
>

^ intree - that is, the version that comes with this kernel.


> vermagic:       3.13.0-106-generic SMP mod_unload modversions
>
>
> Everything you've mentioned is what I've understood so far including the
> line of code that's triggered. That is what led me to upgrade the kernel to
> 3.13.0-106 because it claims that the CHECKSUM problems are fixed which I
> thought this might be related, guess not.
>

I forgot to actually look through those before, but the call chain looks a
bit different there so I thought it may be a different issue altogether.


> You're saying that skb_headlen is too short for the ethernet header. Do
> you know what would cause this? This hardware configuration has been
> running for 400+ days of uptime with no errors or problems and this
> suddenly started to happen and no matter how many times we reboot things it
> doesn't go away.  I assume given your interpretation we should try to
> restart the switches connected to the servers. Is there any way to log what
> packet is causing this issue? Perhaps that would provide more insight?
>

One thing is that it depends on the packets and how they arrive. I'm not
too familiar with this code, but I could imagine a situation where the
IP+GRE packet gets fragmented, causing a single inner frame to be split
across multiple GRE packets. Then, when Linux receives the two separate
packets, there would be some point in the stack responsible for stitching
these packets back together; but it may not put them into a single
contiguous buffer. If this is subsequently decapped for local delivery of
the inner frame, then perhaps there is less than an ethernet header's worth
of packet in the first of these buffers. It seems unlikely that packets
would be deliberately fragmented like this, but if anyone had access to
your underlying network then they could throw any kind of packet they want
to your server.
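
To make the suspected condition concrete, here is a toy userspace model, not
kernel code. The toy_pkt type and both helpers are hypothetical names made up
for illustration; they only mirror the idea of skb_headlen() versus the total
packet length, assuming a packet with one linear head and one fragment.

```c
#include <stdbool.h>
#include <stddef.h>

#define ETH_HLEN 14  /* Ethernet header: 6 dst + 6 src + 2 ethertype */

/* Toy stand-in for an sk_buff: a linear head plus one paged fragment.
 * Hypothetical type for illustration; not the kernel's struct sk_buff. */
struct toy_pkt {
    size_t head_len;  /* bytes in the contiguous head (cf. skb_headlen()) */
    size_t frag_len;  /* bytes held in the non-linear fragment */
};

/* Total packet length across head and fragment (cf. skb->len). */
static size_t toy_len(const struct toy_pkt *p)
{
    return p->head_len + p->frag_len;
}

/* True only if the whole Ethernet header can be read from the head. */
static bool toy_eth_header_is_linear(const struct toy_pkt *p)
{
    return p->head_len >= ETH_HLEN;
}
```

With head_len = 8 and frag_len = 52, the packet is 60 bytes long in total,
yet the Ethernet header is not contiguous, so code that reads 14 bytes
straight from the head would run past the end of the linear data.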

There may be another, more likely, explanation - CC Pravin in case he has
any ideas.


> As far as 4.4/newer kernel - I wish. I tried to go that far up but Ubuntu
> wouldn't even boot. The best I could do is 3.13.0-106. I'll try to report
> it over there as well.
>

That's too bad.

FWIW, I see a check for pskb_may_pull() in the outer gre_rcv() function,
which would check on the whole GRE packet. This is then passed to
gre_cisco_rcv(), which does the decap and calls through to the OVS gre_rcv()
function. At a glance, following the OVS gre_rcv(), I didn't see another
pskb_may_pull() check for the inner packet. By the time it gets to
ovs_flow_extract(), there's an expectation that this call was made, but I'm
really not sure who was supposed to make that check. Also, it should be
ETH_HLEN, which is 14, not 12.
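
As a rough sketch of the kind of guard being discussed, the userspace model
below imitates the general contract of pskb_may_pull(): succeed immediately if
enough bytes are already linear, fail if the packet is too short overall, and
otherwise move bytes from the fragment into the head. This is not the kernel
implementation and not a proposed patch; toy_skb, toy_may_pull() and
toy_extract_ethertype() are hypothetical names, and the single fixed-size
fragment is a simplifying assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ETH_HLEN 14   /* Ethernet header length in bytes */
#define TOY_MAX  64   /* capacity of each toy buffer (assumption) */

/* Hypothetical model: a linear head and a single fragment. */
struct toy_skb {
    unsigned char head[TOY_MAX];
    size_t head_len;              /* linear bytes, cf. skb_headlen() */
    unsigned char frag[TOY_MAX];
    size_t frag_len;              /* non-linear bytes */
};

/* Ensure at least len bytes are contiguous in the head. */
static bool toy_may_pull(struct toy_skb *s, size_t len)
{
    if (len <= s->head_len)
        return true;                      /* already linear */
    if (len > s->head_len + s->frag_len || len > TOY_MAX)
        return false;                     /* packet too short, or no room */

    size_t need = len - s->head_len;      /* bytes to move out of the frag */
    memcpy(s->head + s->head_len, s->frag, need);
    memmove(s->frag, s->frag + need, s->frag_len - need);
    s->head_len += need;
    s->frag_len -= need;
    return true;
}

/* Read the ethertype only after the header is known to be linear.
 * Returns -1 if the packet cannot hold a full Ethernet header. */
static int toy_extract_ethertype(struct toy_skb *s)
{
    if (!toy_may_pull(s, ETH_HLEN))
        return -1;
    return (s->head[12] << 8) | s->head[13];
}
```

The point of the sketch is the ordering: the length check and pull happen
before any byte of the inner Ethernet header is read from the head.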

Outer gre_rcv():
http://lxr.free-electrons.com/source/net/ipv4/gre_demux.c?v=3.13#L270

Inner gre_rcv():
http://lxr.free-electrons.com/source/net/openvswitch/vport-gre.c?v=3.13#L92

