[ovs-discuss] Kernel Panic with Openvswitch 2.7.3

Johannes Erdfelt johannes at erdfelt.com
Mon May 21 18:12:20 UTC 2018


I've been troubleshooting a kernel panic we've seen in our production
environment. First the kernel panic.

------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:3254!
invalid opcode: 0000 [#1] SMP
Modules linked in: zram vhost_vsock vmw_vsock_virtio_transport_common vsock nfnetlink_queue nfnetlink_log bluetooth iptable_nat xfs nf_conntrack_netlink nfnetlink ufs act_police cls_basic sch_ingress ebtable_filter ebtables ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw ip_tables x_tables vport_stt(OE) openvswitch(OE) nf_nat_ipv6 nf_nat_ipv4 nf_nat udp_tunnel dm_crypt ipmi_ssif bonding ipmi_devintf nf_conntrack_ftp nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 dcdbas intel_rapl nf_defrag_ipv4 sb_edac edac_core nf_conntrack x86_pkg_temp_thermal intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel dm_multipath aesni_intel aes_x86_64 lrw glue_helper ablk_helper kvm_intel cryptd intel_cstate intel_rapl_perf kvm irqbypass mei_me ipmi_si vhost_net mei lpc_ich ipmi_msghandler shpchp vhost acpi_power_meter macvtap mac_hid macvlan coretemp lp parport btrfs raid456 async_raid6_recov async_memcpy asyn
crc32c raid0 multipath linear raid1 raid10 ses enclosure scsi_transport_sas sfc(OE) mtd ptp ahci pps_core libahci mdio wmi megaraid_sas(OE) fjes [last unloaded: zram]
CPU: 10 PID: 39947 Comm: CPU 0/KVM Tainted: G           OE K 4.9.77-1-generic #4
Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS 1.3.6 06/03/2015
task: ffff9b01ed1eab80 task.stack: ffffa7c0a2b04000
RIP: 0010:[<ffffffffc0734e17>]  [<ffffffffc0734e17>] skb_segment+0xce7/0xed0
RSP: 0018:ffff9b237f943618  EFLAGS: 00010246
RAX: 00000000000089d5 RBX: ffff9b107c430f00 RCX: ffff9b107c431800
RDX: ffff9b22a5ab0d00 RSI: 00000000000060e2 RDI: 0000000000000440
RBP: ffff9b237f9436e8 R08: 00000000000060e2 R09: 000000000000626a
R10: 0000000000005ca2 R11: 0000000000000000 R12: ffff9b11279396c0
R13: ffff9b5360ff5500 R14: 00000000000060e2 R15: 0000000000000011
FS:  00007f557e58f700(0000) GS:ffff9b237f940000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1d5f7d30f2 CR3: 00000006e4f9a000 CR4: 0000000000162670
Stack:
 ffff9b107c431800 ffffffffffffffde fffffff400000000 ffff9b107c431800
 ffff9b00c625bdf0 00005b2e01b39740 0000000000000001 0000000000000088
 0000000000b39740 0000000000000022 0000000000000009 ffff9b5300000000
Call Trace:
 <IRQ> [<ffffffff94bc6137>] udp4_ufo_fragment+0x127/0x1a0
 [<ffffffff94bcf32d>] inet_gso_segment+0x16d/0x3c0
 [<ffffffff94b5293a>] skb_mac_gso_segment+0xaa/0x110
 [<ffffffff94b52a66>] __skb_gso_segment+0xc6/0x190
 [<ffffffff946760d0>] ? ep_read_events_proc+0xc0/0xc0
 [<ffffffffc0665b3f>] queue_gso_packets+0x7f/0x1b0 [openvswitch]
 [<ffffffffc069d88d>] ? udp_error+0x16d/0x1c0 [nf_conntrack]
 [<ffffffffc0695282>] ? nf_ct_get_tuple+0x82/0xa0 [nf_conntrack]
 [<ffffffffc069d910>] ? udp_packet+0x30/0x90 [nf_conntrack]
 [<ffffffffc066dabc>] ? flow_lookup.isra.6+0x7c/0xb0 [openvswitch]
 [<ffffffffc0697d95>] ? nf_conntrack_in+0x2d5/0x560 [nf_conntrack]
 [<ffffffffc0665dc1>] ovs_dp_upcall+0x31/0x60 [openvswitch]
 [<ffffffffc0665ef3>] ovs_dp_process_packet+0x103/0x120 [openvswitch]
 [<ffffffffc065f2d4>] do_execute_actions+0x834/0x1510 [openvswitch]
 [<ffffffffc066dabc>] ? flow_lookup.isra.6+0x7c/0xb0 [openvswitch]
 [<ffffffffc065fff3>] ovs_execute_actions+0x43/0x110 [openvswitch]
 [<ffffffffc0665e76>] ovs_dp_process_packet+0x86/0x120 [openvswitch]
 [<ffffffffc0670040>] ? netdev_port_receive+0x100/0x100 [openvswitch]
 [<ffffffffc066f576>] ovs_vport_receive+0x76/0xd0 [openvswitch]
 [<ffffffff94b4fc3c>] ? netif_rx+0x1c/0x70
 [<ffffffffc06703ec>] ? ovs_ip_tunnel_rcv+0x8c/0xe0 [openvswitch]
 [<ffffffff94b8ae2b>] ? nf_iterate+0x5b/0x70
 [<ffffffffc0672888>] ? nf_ip_hook+0x738/0xde0 [openvswitch]
 [<ffffffff94b91df9>] ? ip_rcv_finish+0x129/0x420
 [<ffffffff94b8ae9b>] ? nf_hook_slow+0x5b/0xa0
 [<ffffffffc066fff0>] netdev_port_receive+0xb0/0x100 [openvswitch]
 [<ffffffffc0670040>] ? netdev_port_receive+0x100/0x100 [openvswitch]
 [<ffffffffc0670078>] netdev_frame_hook+0x38/0x60 [openvswitch]
 [<ffffffff94b501b0>] __netif_receive_skb_core+0x220/0xac0
 [<ffffffffc028c1e0>] ? efx_fast_push_rx_descriptors+0x50/0x310 [sfc]
 [<ffffffff94b50a68>] __netif_receive_skb+0x18/0x60
 [<ffffffff94b51b99>] process_backlog+0x89/0x140
 [<ffffffff94b511ac>] net_rx_action+0x10c/0x360
 [<ffffffff94c6eb0f>] __do_softirq+0xdf/0x2bb
 [<ffffffffc0285642>] ? efx_ef10_msi_interrupt+0x62/0x70 [sfc]
 [<ffffffff94c6dc3b>] do_IRQ+0x8b/0xd0
 [<ffffffff94487816>] irq_exit+0xb6/0xc0
 [<ffffffff94c6b956>] common_interrupt+0x96/0x96
 <EOI> [<ffffffff94c6b798>] ?  irq_entries_start+0x578/0x6a0
 [<ffffffffc07b367b>] ? vmx_handle_external_intr+0x5b/0x60 [kvm_intel]
 [<ffffffffc052fe86>] vcpu_enter_guest+0x396/0x1290 [kvm]
 [<ffffffffc0536e07>] kvm_arch_vcpu_ioctl_run+0xb7/0x3d0 [kvm]
 [<ffffffffc051c6cf>] kvm_vcpu_ioctl+0x2af/0x570 [kvm]
 [<ffffffff94508362>] ? do_futex+0xb2/0x520
 [<ffffffff94641bb9>] do_vfs_ioctl+0x99/0x5f0
 [<ffffffffc052c6bf>] ? kvm_on_user_return+0x6f/0xa0 [kvm]
 [<ffffffff94642189>] SyS_ioctl+0x79/0x90
 [<ffffffff94c6aee4>] entry_SYSCALL_64_fastpath+0x24/0xcf
Code: 89 87 e0 00 00 00 49 8b 57 60 48 8b 43 60 48 89 53 60 49 89 47 60 49 8b 57 18 48 8b 43 18 48 89 53 18 49 89 47 18 e9 fa fb ff ff <0f> 0b 44 89 ee 48 89 df e8 6c 9a 40 d4 85 c0 0f 84 78 fe ff ff
RIP  [<ffffffffc0734e17>] skb_segment+0xce7/0xed0
 RSP <ffff9b237f943618>
---[ end trace f0d2cc8df9be8c23 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x13400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
Rebooting in 10 seconds..
ACPI MEMORY or I/O RESET_REG.

We are running a 4.9.77 kernel with one patch backported.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/core/skbuff.c?h=v4.17-rc6&id=13acc94eff122b260a387d68668bf9d670738e6a

That patch fixes a different kernel panic seen when using STT; however,
the panic described here is still reproducible without the patch applied.

We are using stock Open vSwitch 2.7.3.

The panic is very reproducible, but it does require some configuration.

In our case, we have two hosts acting as hypervisors. Each host has one
guest VM. An STT tunnel is set up between the two hosts and attached to
each guest. One guest acts as a source and the other as a destination.
The destination has connection tracking set up in its flows. We have a
script running `ovs-dpctl del-flows` in a loop to make reproducing the
crash easier, but it's not strictly necessary. (It just makes it more
likely that an upcall occurs, see below.)

The source guest then sends a couple of very large (>60k) UDP packets,
and the destination host crashes with the panic above.

The crash is the result of an skb that skb_segment in net/core/skbuff.c
does not understand.

The skb comes from the Solarflare NIC as one large skb, requiring the
use of a frag_list. It looks something like this (note the frag_list):

skb: ffff92544a0bf000
  len: 60177, data_len: 60169, nr_frags: 17
  frag_list: ffff92544a0bed00

skb: ffff92544a0bed00
  len: 24820, data_len: 24820, nr_frags: 17
  next: ffff92544a0be700

skb: ffff92544a0be700
  len: 10589, data_len: 10589, nr_frags: 8
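
(Dumps like these can be produced with a trivial debug helper along the
lines of the sketch below; it is not code from any tree, just an
illustration of which skb fields are being printed.)

/* Sketch of a debug helper (illustration only): print len, data_len,
 * nr_frags, frag_list and next for an skb and the skbs on its
 * frag_list, which is how the dumps in this mail are laid out. */
#include <linux/skbuff.h>
#include <linux/printk.h>

static void dump_one_skb(const struct sk_buff *skb)
{
    pr_info("skb: %p\n", skb);
    pr_info("  len: %u, data_len: %u, nr_frags: %u\n",
            skb->len, skb->data_len, skb_shinfo(skb)->nr_frags);
    if (skb_shinfo(skb)->frag_list)
        pr_info("  frag_list: %p\n", skb_shinfo(skb)->frag_list);
    if (skb->next)
        pr_info("  next: %p\n", skb->next);
}

static void dump_skb_chain(const struct sk_buff *skb)
{
    const struct sk_buff *iter;

    dump_one_skb(skb);
    /* Walk the first level of the frag_list (and its ->next siblings).
     * Nested frag_lists, as seen after reassembly below, would need
     * another level of walking. */
    skb_walk_frags(skb, iter)
        dump_one_skb(iter);
}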

The skb winds its way through the networking core and openvswitch (which
strips off the outer STT encapsulation), eventually requiring an upcall.

Since datapath/flow.c sets OVS_FRAG_TYPE_FIRST for any GSO UDP packet
and connection tracking is set up, the packet ends up being passed back
into the networking core to be reassembled.

https://github.com/openvswitch/ovs/blob/master/datapath/flow.c#L651
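
For context, the check there (paraphrased from memory, not a verbatim
quote of that line) looks roughly like this:

/* Paraphrase of the IPv4 fragment handling in datapath/flow.c
 * (not verbatim): any UDP GSO skb is marked as a first fragment,
 * whether or not the IP header indicates fragmentation. */
if (nh->frag_off & htons(IP_OFFSET))
    key->ip.frag = OVS_FRAG_TYPE_LATER;
else if (nh->frag_off & htons(IP_MF) ||
         skb_shinfo(skb)->gso_type & SKB_GSO_UDP)
    key->ip.frag = OVS_FRAG_TYPE_FIRST;
else
    key->ip.frag = OVS_FRAG_TYPE_NONE;

As I understand it, the conntrack action then treats any packet whose
key has ip.frag != OVS_FRAG_TYPE_NONE as needing reassembly, so
handle_fragments() in datapath/conntrack.c hands it to ip_defrag().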

ip_frag_reasm in net/ipv4/ip_fragment.c then rewrites the skb into what
appears to be a malformed form, because the skb already uses a frag_list.

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/net/ipv4/ip_fragment.c?h=v4.9.101#n583
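
The part of ip_frag_reasm that produces this shape is its handling of a
head fragment that already carries a frag_list; paraphrasing the v4.9
code linked above (not a verbatim quote):

/* Paraphrase of ip_frag_reasm() (v4.9), not verbatim: when the head
 * fragment already has a frag_list, an empty clone is allocated, the
 * head's existing frag_list is moved onto the clone, and the clone is
 * linked in front of the remaining fragments. The clone therefore has
 * no linear data and nr_frags == 0, only a frag_list.
 * Error handling and length/truesize bookkeeping omitted. */
if (skb_has_frag_list(head)) {
    struct sk_buff *clone;

    clone = alloc_skb(0, GFP_ATOMIC);
    clone->next = head->next;
    head->next = clone;
    skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
    skb_frag_list_init(head);
}

That clone appears to be the newly introduced skb (ffff92544a0bfe00,
nr_frags == 0) in the dump below.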

After ip_frag_reasm, the skb now looks like this:

skb: ffff92544a0bf000
  len: 60197, data_len: 60169, nr_frags: 17
  frag_list: ffff92544a0bfe00

skb: ffff92544a0bfe00
  len: 35409, data_len: 35409, nr_frags: 0
  frag_list: ffff92544a0bed00

skb: ffff92544a0bed00
  len: 24820, data_len: 24820, nr_frags: 17
  next: ffff92544a0be700

skb: ffff92544a0be700
  len: 10589, data_len: 10589, nr_frags: 8

There are now two nested levels of frag_list, with the newly introduced
skb ffff92544a0bfe00 having nr_frags == 0.

Eventually openvswitch wants to segment the GSO skb for the upcall
(queue_gso_packets), which ends up in skb_segment and finally crashes
on the BUG_ON(!nfrags) (for skb ffff92544a0bfe00 in the example above).
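
The relevant part of skb_segment in v4.9, paraphrased (not verbatim), is
the point where it moves on to the next skb on the frag_list once the
current frags[] array is exhausted; an entry with no linear data and
nr_frags == 0, which is exactly what ip_frag_reasm created above, passes
the first BUG_ON and trips the second:

/* Paraphrase of the frag_list advance inside skb_segment()
 * (net/core/skbuff.c, v4.9), not verbatim. */
while (pos < offset + len) {
    if (i >= nfrags) {
        BUG_ON(skb_headlen(list_skb)); /* frag_list skb must have no linear data */

        i = 0;
        nfrags = skb_shinfo(list_skb)->nr_frags;
        frag = skb_shinfo(list_skb)->frags;
        frag_skb = list_skb;

        BUG_ON(!nfrags);               /* ...and at least one page frag */

        list_skb = list_skb->next;
    }
    /* copy the frag into the new segment (omitted) */
}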

It's not clear to me if the problem is in openvswitch or in the
networking core.

Why does openvswitch set OVS_FRAG_TYPE_FIRST for any skb with
SKB_GSO_UDP set, even when it is not actually a fragmented packet?

Would it ever make sense for ip_frag_reasm to see an skb large enough to
require the use of frag_list?

JE


