[ovs-dev] ovs-vswitch kernel panic randomly started after 400+ days uptime

Uri Foox uri at zoey.com
Fri Jan 6 01:13:10 UTC 2017


Hi,

Since about 10 days ago, every few hours one of the 10 compute nodes in
our OpenStack cluster kernel panics at the host level (captured through
netconsole). The kernel panic is identical across all 10 nodes and happens
at random times, but at least one node panics every 3-12 hours. We have
tried numerous things, including upgrading the kernel (Ubuntu 12.04 LTS
running 3.13.0-106-generic), modifying sysctl settings, restarting
switches, restarting all OpenStack networking services, and changing BIOS
settings, but no luck. We have not restarted the control nodes or the
Juniper switch that routes all inbound internet traffic.
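
A minimal sketch of how the netconsole captures from each node could be
compared to confirm that the panic really is identical everywhere. The
function names panic_signature and same_signature are hypothetical, and the
regular expressions only assume the oops format seen in the trace further
down in this message; other kernels may print these lines differently.

```python
import re

# "kernel BUG at <file>:<line>!" as printed in the trace in this message.
BUG_RE = re.compile(r"kernel BUG at\s+(\S+?):(\d+)!")

# Final "RIP  [<addr>] symbol+0xoff/0xlen [module]" line of the oops.
# The module name in brackets is optional (core kernel symbols have none).
RIP_RE = re.compile(
    r"RIP\s+\[<[0-9a-f]+>\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def panic_signature(log_text):
    """Return (source_file, line, function, module) for the first panic
    found in log_text, or None if no BUG/RIP pair is present."""
    bug = BUG_RE.search(log_text)
    rip = RIP_RE.search(log_text)
    if not bug or not rip:
        return None
    return (bug.group(1), int(bug.group(2)), rip.group(1), rip.group(2))


def same_signature(logs):
    """True if every log contains a panic and all panics share one signature."""
    signatures = {panic_signature(text) for text in logs}
    return len(signatures) == 1 and None not in signatures
```

Feeding it the text captured from each node (one string per node) would
show whether all of them die at the same BUG line in the same function.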

Based on research we did around skbuff.h, we found two kernel patches that
address a checksum failure, as well as some OVS discussions about it. I was
hoping that the kernel upgrade would solve it, but it did not. I do not know
if OpenStack will tolerate us upgrading OVS, and the fact that the panics
started completely at random leads me to believe some other factor that we
are unaware of is involved.


   - https://patchwork.ozlabs.org/patch/512625/
   - https://github.com/openvswitch/ovs/commit/51b7a90217369f6bbbf164ba471f54ec2817665e
   - https://patchwork.kernel.org/patch/7475491/
   - https://patchwork.ozlabs.org/patch/523632/


Here is one of the panic traces. If you have any ideas what we can do,
please let me know.

Thanks,
Uri


Connection from 172.25.2.157 port 5404 [udp/*] accepted
[68240.441681] ------------[ cut here ]------------
[68240.496918] kernel BUG at
/build/linux-lts-trusty-D60X6T/linux-lts-trusty-3.13.0/include/linux/skbuff.h:1486!
[68240.615520] invalid opcode: 0000 [#1] SMP
[68240.664751] Modules linked in: netconsole configfs xt_mac xt_physdev
xt_set ip_set_hash_ip ip_set nfnetlink vhost_net macvtap macvlan vhost veth
bridge stp llc ipt_REJECT xt_state xt_conntrack xt_multiport xt_CT
xt_comment iptable_raw xt_CHECKSUM xt_tcpudp iptable_mangle ipt_MASQUERADE
iptable_nat nf_nat_ipv4 nf_nat ip6table_filter ip6_tables iptable_filter
ip_tables ebtable_nat ebtables x_tables kvm_intel kvm nbd ib_iser rdma_cm
ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi openvswitch vxlan ip_tunnel gre nfsd nfs_acl
auth_rpcgss nfs fscache lockd sunrpc dm_multipath gpio_ich dcdbas scsi_dh
mei_me shpchp sb_edac mei edac_core lpc_ich joydev acpi_power_meter
nf_conntrack_ipv6 mac_hid nf_defrag_ipv6 wmi nf_conntrack_ipv4 ipmi_si xfs
nf_conntrack nf_defrag_ipv4 lp parport igb btrfs hid_generic dca
i2c_algo_bit usbhid raid6_pq ptp ahci bnx2x hid libahci mdio megaraid_sas
pps_core xor libcrc32c
[68241.670838] CPU: 33 PID: 0 Comm: swapper/33 Not tainted
3.13.0-106-generic #153~precise1-Ubuntu
[68241.774871] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.2.3
07/09/2014
[68241.864406] task: ffff881028b94800 ti: ffff881028ba0000 task.ti:
ffff881028ba0000
[68241.953939] RIP: 0010:[<ffffffffa052b4fe>]  [<ffffffffa052b4fe>]
__skb_pull.part.7+0x4/0x6 [openvswitch]
[68242.067531] RSP: 0018:ffff88203fb03b08  EFLAGS: 00010297
[68242.131087] RAX: ffff88165c791966 RBX: ffff88202639e900 RCX:
ffff88165c791900
[68242.216458] RDX: 0000000000000210 RSI: 000000000000001a RDI:
0000000000000214
[68242.301842] RBP: ffff88203fb03b08 R08: 0000000000000000 R09:
0000000000000140
[68242.387207] R10: 000000000000000c R11: 0000000072221c0c R12:
ffff88203fb03b70
[68242.472576] R13: ffff88402794d0c0 R14: ffff88203fb03b70 R15:
ffff88302324e180
[68242.557945] FS:  0000000000000000(0000) GS:ffff88203fb00000(0000)
knlGS:0000000000000000
[68242.654780] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[68242.723550] CR2: 00007f9c7466ab90 CR3: 000000302689e000 CR4:
00000000000427e0
[68242.808931] Stack:
[68242.832981]  ffff88203fb03b38 ffffffffa0524e64 ffffffff8112d1e1
ffff88202639e900
[68242.921980]  ffffe8e000305800 ffff88402794d0c0 ffff88203fb03c28
ffffffffa0523a80
[68243.010963]  ffff88203fb13180 ffff88203fb03b90 ffffffff810a090b
0000000100000000
[68243.099945] Call Trace:
[68243.129188]  <IRQ>
[68243.152204]  [<ffffffffa0524e64>] ovs_flow_extract+0x664/0x720
[openvswitch]
[68243.238893]  [<ffffffff8112d1e1>] ? tracing_record_cmdline+0x21/0x50
[68243.314912]  [<ffffffffa0523a80>]
ovs_dp_process_received_packet+0x60/0x130 [openvswitch]
[68243.412793]  [<ffffffff810a090b>] ? ttwu_do_wakeup+0xfb/0x110
[68243.481559]  [<ffffffffa0529e3a>] ovs_vport_receive+0x2a/0x30
[openvswitch]
[68243.564884]  [<ffffffffa052b374>] gre_rcv+0xa4/0xb8 [openvswitch]
[68243.637802]  [<ffffffffa03e2795>] gre_cisco_rcv+0x75/0xbc [gre]
[68243.708621]  [<ffffffffa03e22f5>] gre_rcv+0x65/0x90 [gre]
[68243.773214]  [<ffffffff816941d8>] ip_local_deliver_finish+0xa8/0x220
[68243.849244]  [<ffffffff816944db>] ip_local_deliver+0x4b/0x90
[68243.916951]  [<ffffffff81693ed1>] ip_rcv_finish+0x121/0x380
[68243.983627]  [<ffffffff816947a6>] ip_rcv+0x286/0x380
[68244.043023]  [<ffffffff8165b80a>] __netif_receive_skb_core+0x61a/0x760
[68244.121122]  [<ffffffff8165b971>] __netif_receive_skb+0x21/0x70
[68244.191942]  [<ffffffff8165c131>] process_backlog+0xb1/0x190
[68244.259642]  [<ffffffff8165ca09>] net_rx_action+0x139/0x280
[68244.326305]  [<ffffffff8107367d>] __do_softirq+0xed/0x360
[68244.390887]  [<ffffffff81073c8e>] irq_exit+0x11e/0x140
[68244.452358]  [<ffffffff8177d873>] do_IRQ+0x63/0xe0
[68244.509674]  [<ffffffff817728ad>] common_interrupt+0x6d/0x6d
[68244.577366]  <EOI>
[68244.600371]  [<ffffffff8109e353>] ? finish_task_switch+0x53/0x160
[68244.675630]  [<ffffffff8176e47e>] __schedule+0x38e/0x720
[68244.739175]  [<ffffffff8176e8c9>] schedule+0x29/0x70
[68244.798567]  [<ffffffff8176ebee>] schedule_preempt_disabled+0xe/0x10
[68244.874582]  [<ffffffff810c7f95>] cpu_idle_loop+0x255/0x2a0
[68244.941246]  [<ffffffff810ddba2>] ?
clockevents_register_device+0xe2/0x140
[68245.023512]  [<ffffffff810c804b>] cpu_startup_entry+0x6b/0x70
[68245.092269]  [<ffffffff81045bbd>] start_secondary+0xcd/0xd0
[68245.158929] Code: c7 e8 cb 52 a0 89 45 f8 e8 50 2e b4 e0 c6 05 15 2e 00
00 01 8b 45 f8 eb 0c 8b 16 48 8b 38 31 f6 e8 38 b9 15 e1 c9 c3 55 48 89 e5
<0f> 0b 8b 57 68 55 31 c0 48 89 e5 39 f2 72 13 2b 57 6c 29 d6 e8
[68245.392237] RIP  [<ffffffffa052b4fe>] __skb_pull.part.7+0x4/0x6
[openvswitch]
[68245.477737]  RSP <ffff88203fb03b08>
[68245.520082] ---[ end trace 383bac9f3e676970 ]---
[68245.583665] Kernel panic - not syncing: Fatal exception in interrupt
[68245.661910] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffff9fffffff)
[68245.792179] ------------[ cut here ]------------
[68245.847479] WARNING: CPU: 33 PID: 0 at
/build/linux-lts-trusty-D60X6T/linux-lts-trusty-3.13.0/arch/x86/kernel/smp.c:124
native_smp_send_reschedule+0x5e/0x60()
[68246.017113] Modules linked in: netconsole configfs xt_mac xt_physdev
xt_set ip_set_hash_ip ip_set nfnetlink vhost_net macvtap macvlan vhost veth
bridge stp llc ipt_REJECT xt_state xt_conntrack xt_multiport xt_CT
xt_comment iptable_raw xt_CHECKSUM xt_tcpudp iptable_mangle ipt_MASQUERADE
iptable_nat nf_nat_ipv4 nf_nat ip6table_filter ip6_tables iptable_filter
ip_tables ebtable_nat ebtables x_tables kvm_intel kvm nbd ib_iser rdma_cm
ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi openvswitch vxlan ip_tunnel gre nfsd nfs_acl
auth_rpcgss nfs fscache lockd sunrpc dm_multipath gpio_ich dcdbas scsi_dh
mei_me shpchp sb_edac mei edac_core lpc_ich joydev acpi_power_meter
nf_conntrack_ipv6 mac_hid nf_defrag_ipv6 wmi nf_conntrack_ipv4 ipmi_si xfs
nf_conntrack nf_defrag_ipv4 lp parport igb btrfs hid_generic dca
i2c_algo_bit usbhid raid6_pq ptp ahci bnx2x hid libahci mdio megaraid_sas
pps_core xor libcrc32c
[68247.030510] CPU: 33 PID: 0 Comm: swapper/33 Tainted: G      D
3.13.0-106-generic #153~precise1-Ubuntu
[68247.147123] Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.2.3
07/09/2014
[68247.236723]  0000000000000000 ffff88203fb03530 ffffffff81765c15
0000000000000000
[68247.326029]  000000000000007c ffff88203fb03570 ffffffff8106e2fc
239c8806e0dc2058
[68247.415346]  0000000000000000 0000000000000021 ffff88103fa13180
ffff88203fb13180
[68247.504655] Call Trace:
[68247.533961]  <IRQ>  [<ffffffff81765c15>] dump_stack+0x64/0x82
[68247.603052]  [<ffffffff8106e2fc>] warn_slowpath_common+0x8c/0xc0
[68247.674969]  [<ffffffff8106e34a>] warn_slowpath_null+0x1a/0x20
[68247.744808]  [<ffffffff8104458e>] native_smp_send_reschedule+0x5e/0x60
[68247.822965]  [<ffffffff810b0c3e>] trigger_load_balance+0x17e/0x1f0
[68247.896964]  [<ffffffff810a1e9f>] scheduler_tick+0xaf/0xf0
[68247.962645]  [<ffffffff8107d871>] update_process_times+0x61/0x80
[68248.034566]  [<ffffffff810e0293>] tick_sched_handle.isra.12+0x33/0x70
[68248.111675]  [<ffffffff810e03bc>] tick_sched_timer+0x4c/0x80
[68248.179432]  [<ffffffff810967a7>] __run_hrtimer+0x77/0x270
[68248.245113]  [<ffffffff810d87a2>] ? ktime_get_update_offsets+0x52/0xf0
[68248.323263]  [<ffffffff810e0370>] ? tick_nohz_handler+0xa0/0xa0
[68248.394139]  [<ffffffff81097147>] hrtimer_interrupt+0x107/0x260
[68248.465015]  [<ffffffff81446875>] ? erst_write+0x135/0x150
[68248.530692]  [<ffffffff81446b40>] ? erst_writer+0x2b0/0x380
[68248.597413]  [<ffffffff8104752b>] local_apic_timer_interrupt+0x3b/0x60
[68248.675571]  [<ffffffff8177d933>] smp_apic_timer_interrupt+0x43/0x60
[68248.751650]  [<ffffffff8177c29d>] apic_timer_interrupt+0x6d/0x80
[68248.823564]  [<ffffffff817584b0>] ? panic+0x19e/0x1e1
[68248.884046]  [<ffffffff81758412>] ? panic+0x100/0x1e1
[68248.944529]  [<ffffffff81773a5a>] oops_end+0x14a/0x160
[68249.006059]  [<ffffffff810196d8>] die+0x58/0x90
[68249.060303]  [<ffffffff8177315b>] do_trap+0xcb/0x170
[68249.119748]  [<ffffffff810166ec>] do_invalid_op+0xac/0x110
[68249.185436]  [<ffffffffa052b4fe>] ? __skb_pull.part.7+0x4/0x6
[openvswitch]
[68249.268785]  [<ffffffff8165dc62>] ? __dev_queue_xmit+0x92/0x500
[68249.339663]  [<ffffffff8177cd5e>] invalid_op+0x1e/0x30
[68249.401185]  [<ffffffffa052b4fe>] ? __skb_pull.part.7+0x4/0x6
[openvswitch]
[68249.484530]  [<ffffffffa0524e64>] ovs_flow_extract+0x664/0x720
[openvswitch]
[68249.568918]  [<ffffffff8112d1e1>] ? tracing_record_cmdline+0x21/0x50
[68249.644994]  [<ffffffffa0523a80>]
ovs_dp_process_received_packet+0x60/0x130 [openvswitch]
[68249.742909]  [<ffffffff810a090b>] ? ttwu_do_wakeup+0xfb/0x110
[68249.811710]  [<ffffffffa0529e3a>] ovs_vport_receive+0x2a/0x30
[openvswitch]
[68249.895055]  [<ffffffffa052b374>] gre_rcv+0xa4/0xb8 [openvswitch]
[68249.968009]  [<ffffffffa03e2795>] gre_cisco_rcv+0x75/0xbc [gre]
[68250.038879]  [<ffffffffa03e22f5>] gre_rcv+0x65/0x90 [gre]
[68250.103522]  [<ffffffff816941d8>] ip_local_deliver_finish+0xa8/0x220
[68250.179595]  [<ffffffff816944db>] ip_local_deliver+0x4b/0x90
[68250.247354]  [<ffffffff81693ed1>] ip_rcv_finish+0x121/0x380
[68250.314069]  [<ffffffff816947a6>] ip_rcv+0x286/0x380
[68250.373512]  [<ffffffff8165b80a>] __netif_receive_skb_core+0x61a/0x760
[68250.451662]  [<ffffffff8165b971>] __netif_receive_skb+0x21/0x70
[68250.522533]  [<ffffffff8165c131>] process_backlog+0xb1/0x190
[68250.590293]  [<ffffffff8165ca09>] net_rx_action+0x139/0x280
[68250.657019]  [<ffffffff8107367d>] __do_softirq+0xed/0x360
[68250.721659]  [<ffffffff81073c8e>] irq_exit+0x11e/0x140
[68250.783185]  [<ffffffff8177d873>] do_IRQ+0x63/0xe0
[68250.840557]  [<ffffffff817728ad>] common_interrupt+0x6d/0x6d
[68250.908314]  <EOI>  [<ffffffff8109e353>] ? finish_task_switch+0x53/0x160
[68250.988818]  [<ffffffff8176e47e>] __schedule+0x38e/0x720
[68251.052421]  [<ffffffff8176e8c9>] schedule+0x29/0x70
[68251.111863]  [<ffffffff8176ebee>] schedule_preempt_disabled+0xe/0x10
[68251.187933]  [<ffffffff810c7f95>] cpu_idle_loop+0x255/0x2a0
[68251.254651]  [<ffffffff810ddba2>] ?
clockevents_register_device+0xe2/0x140
[68251.336958]  [<ffffffff810c804b>] cpu_startup_entry+0x6b/0x70
[68251.405751]  [<ffffffff81045bbd>] start_secondary+0xcd/0xd0
[68251.472464] ---[ end trace 383bac9f3e676971 ]---

