[ovs-dev] Re: Re: Re: [PATCH] pkt reassemble: fix kernel panic for ovs reassemble

Greg Rose gvrose8192 at gmail.com
Tue Jun 27 14:52:05 UTC 2017


On 06/26/2017 05:51 PM, 王志克 wrote:
> Hi Greg,
>
> The exact issue occurred on the 20th test of check-kmod (sometimes there are other kernel issues: the kernel just hangs without a panic). This is OVS 2.6.0 on CentOS 7.2 with kernel 3.10.0-327.el7.x86_64. Some info is below, which I hope is helpful.

OK, I'll try with that kernel.  The three VMs I have running the test are still up after running overnight, so let me try the base install kernel.

Thanks,

- Greg

>
> datapath-sanity
>
>    1: datapath - ping between two ports               ok
>    2: datapath - http between two ports               ok
>    3: datapath - ping between two ports on vlan       ok
>    4: datapath - ping6 between two ports              ok
>    5: datapath - ping6 between two ports on vlan      ok
>    6: datapath - ping over vxlan tunnel               FAILED (system-traffic.at:159)
>    7: datapath - ping over gre tunnel                 FAILED (system-traffic.at:199)
>    8: datapath - ping over geneve tunnel              skipped (system-traffic.at:213)
>    9: datapath - basic truncate action                ok
>   10: datapath - truncate and output to gre tunnel    FAILED (system-traffic.at:445)
>   11: conntrack - controller                          FAILED (system-traffic.at:522)
>   12: conntrack - IPv4 HTTP                           ok
>   13: conntrack - IPv6 HTTP                           ok
>   14: conntrack - IPv4 ping                           ok
>   15: conntrack - IPv6 ping                           ok
>   16: conntrack - commit, recirc                      ok
>   17: conntrack - preserve registers                  ok
>   18: conntrack - invalid                             ok
>   19: conntrack - zones                               ok
>   20: conntrack - zones from field ....(system crash...)
>
>
> [root@localhost vmcore-127.0.0.1-2017-06-25-23:17:12]# ls
> analyzer      backtrace  count      last_occurrence  os_info     runlevel  type  username  vmcore
> architecture  component  event_log  machineid        os_release  time      uid   uuid      vmcore-dmesg.txt
> [root@localhost vmcore-127.0.0.1-2017-06-25-23:17:12]# cat backtrace
>
> Version: 3.10.0-327.el7.x86_64
> BUG: unable to handle kernel paging request at ffffffffa0715ae8
> IP: [<ffffffff8108e6a7>] get_next_timer_interrupt+0x97/0x270
> PGD 194d067 PUD 194e063 PMD b746f067 PTE 0
> Oops: 0000 [#1] SMP
> Modules linked in: nf_nat_ftp nf_conntrack_ftp nf_conntrack_netlink nfnetlink ip_gre ip_tunnel gre vxlan ip6_udp_tunnel udp_tunnel 8021q garp mrp veth xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter vmw_vsock_vmci_transport vsock bnep dm_mirror dm_region_hash dm_log dm_mod snd_seq_midi snd_seq_midi_event snd_ens1371 snd_rawmidi coretemp snd_ac97_codec ac97_bus crc32_pclmul snd_seq ghash_clmulni_intel ppdev
>   snd_seq_device cryptd btusb snd_pcm bluetooth snd_timer snd soundcore sg vmw_balloon rfkill pcspkr parport_pc parport i2c_piix4 vmw_vmci shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod crc_t10dif crct10dif_generic pata_acpi crct10dif_pclmul crct10dif_common crc32c_intel serio_raw vmwgfx drm_kms_helper ttm mptspi scsi_transport_spi e1000 mptscsih mptbase drm i2c_core ata_piix libata [last unloaded: openvswitch]
> CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           OE  ------------   3.10.0-327.el7.x86_64 #1
> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> task: ffff8800b9a81700 ti: ffff8800b9a8c000 task.ti: ffff8800b9a8c000
> RIP: 0010:[<ffffffff8108e6a7>]  [<ffffffff8108e6a7>] get_next_timer_interrupt+0x97/0x270
> RSP: 0018:ffff8800b9a8fdd8  EFLAGS: 00010012
> RAX: ffffffffa0715ad0 RBX: 00000863b6f08300 RCX: ffff8800b95a8d08
> RDX: 00000000000000ce RSI: 00000000000000ce RDI: 0000000100882cce
> RBP: ffff8800b9a8fe30 R08: 0000000000000202 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000001 R12: 0000000100882ccd
> R13: 7fffffffffffffff R14: ffff8800b95a8000 R15: 0000000100882ccd
> FS:  0000000000000000(0000) GS:ffff8800bb620000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffa0715ae8 CR3: 00000000b64d8000 CR4: 00000000003407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Stack:
>   ffff8800b9f5e780 0000000000000000 ffff8800b9a8dfd8 ffff8800b9a8fe10
>   ffff8800b9a8fe48 20cc1170855d3261 ffff8800bb62dbc0 00000863b6f08300
>   0000000000000001 ffff8800bb62cf00 0000000100882ccd ffff8800b9a8fe88
> Call Trace:
>   [<ffffffff810e0978>] tick_nohz_stop_sched_tick+0x1e8/0x2e0
>   [<ffffffff8101cd15>] ? native_sched_clock+0x35/0x80
>   [<ffffffff810e0b0e>] __tick_nohz_idle_enter+0x9e/0x150
>   [<ffffffff810e102d>] tick_nohz_idle_enter+0x3d/0x70
>   [<ffffffff810d615e>] cpu_startup_entry+0x9e/0x290
>   [<ffffffff810475fa>] start_secondary+0x1ba/0x230
> Code: 18 49 8b 7e 10 48 39 cf 48 89 ca 78 5a 40 0f b6 d7 89 d6 48 63 c6 48 c1 e0 04 49 8d 0c 06 48 8b 41 28 48 83 c1 28 48 39 c8 74 0e <f6> 40 18 01 74 23 48 8b 00 48 39 c8 75 f2 83 c6 01 40 0f b6 f6
> RIP  [<ffffffff8108e6a7>] get_next_timer_interrupt+0x97/0x270
>   RSP <ffff8800b9a8fdd8>
>
>
> Wang Zhike
>
> -----Original Message-----
> From: Greg Rose [mailto:gvrose8192 at gmail.com]
> Sent: June 27, 2017 6:26
> To: 王志克
> Cc: dev at openvswitch.org; Joe Stringer
> Subject: Re: [ovs-dev] Re: Re: [PATCH] pkt reassemble: fix kernel panic for ovs reassemble
>
> On 06/26/2017 04:56 AM, 王志克 wrote:
> > Hi Joe,
> >
> > I will try to check how to send the patch. Maybe tomorrow since I am quite busy now.
> >
> > Regarding the crash, I can reproduce it even with an official OVS release such as OVS 2.6.0 (I just run check-kmod in a loop until the kernel panics), so it is not related to the new fix.
> >
> > Br,
> > Wang Zhike
> I've been running 'make check-kmod' in a continuous loop on 3 virtual machines since this morning.  So far no kernel splats but plenty of errors:
>
> This is on the Ubuntu machine running a 4.0 kernel:
>
> ERROR: 66 tests were run,
> 24 failed unexpectedly.
> 23 tests were skipped.
> ## -------------------------------------- ##
> ## system-kmod-testsuite.log was created. ##
> ## -------------------------------------- ##
>
> Please send `tests/system-kmod-testsuite.log' and all information you think might help:
>
>      To: <bugs at openvswitch.org>
>         Subject: [openvswitch 2.7.90] system-kmod-testsuite: 16 17 35 57 58 59 60 61 62 63 70 71 72 75 76 81 82 83 84 85 86 87 88 89 failed
>
> CentOS 7.2 running the 4.9.24 kernel:
>
> ## ------------- ##
> ## Test results. ##
> ## ------------- ##
>
> ERROR: 76 tests were run,
> 34 failed unexpectedly.
> 13 tests were skipped.
> ## -------------------------------------- ##
> ## system-kmod-testsuite.log was created. ##
> ## -------------------------------------- ##
>
> Please send `tests/system-kmod-testsuite.log' and all information you think might help:
>
>      To: <bugs at openvswitch.org>
>         Subject: [openvswitch 2.7.90] system-kmod-testsuite: 2 14 15 20 21 22 23 24 25 26 27 28 29 30 31 32 47 48 49 50 51 57 59 60 61 62 70 71 75 76 84 85 86 87 failed
>
> CentOS 7.2 running the 4.10.17 kernel:
>
> ## ------------- ##
> ## Test results. ##
> ## ------------- ##
>
> ERROR: 74 tests were run,
> 34 failed unexpectedly.
> 15 tests were skipped.
> ## -------------------------------------- ##
> ## system-kmod-testsuite.log was created. ##
> ## -------------------------------------- ##
>
> Please send `tests/system-kmod-testsuite.log' and all information you think might help:
>
>      To: <bugs at openvswitch.org>
>         Subject: [openvswitch 2.7.90] system-kmod-testsuite: 2 14 15 20 21 22 23 24 25 26 27 28 29 30 31 32 47 48 49 50 51 57 59 60 61 62 70 71 75 76 84 85 86 87 failed
>
> I confess to not spending a lot of time running check-kmod.  I certainly intend to in the future.
>
> - Greg
>
> >
> > -----Original Message-----
> > From: Joe Stringer [mailto:joe at ovn.org]
> > Sent: June 24, 2017 5:15
> > To: 王志克
> > Cc: dev at openvswitch.org
> > Subject: Re: Re: [ovs-dev] [PATCH] pkt reassemble: fix kernel panic for ovs reassemble
> >
> > Hi Wang Zhike,
> >
> > I'd like others such as Greg to take a look as well, since this code is delicate. The more review it gets, the better. It seems that the version of your email that reaches the list does not include the attachment. Perhaps you could try sending the patch using git send-email, or put the patch on GitHub and link to it here.
> >
> > For what it's worth, I did run your patch for a while and it seemed
> > OK, but when I tried again today on an Ubuntu Trusty (Linux
> > 3.13.0-119-generic) box, running make check-kmod, I saw an issue with
> > get_next_timer_interrupt():
> >
> > [181250.892557] BUG: unable to handle kernel paging request at ffffffffa03317e0
> > [181250.892557] IP: [<ffffffff81079606>] get_next_timer_interrupt+0x86/0x250
> > [181250.892557] PGD 1c11067 PUD 1c12063 PMD 1381a2067 PTE 0
> > [181250.892557] Oops: 0000 [#1] SMP
> > [181250.892557] Modules linked in: nf_nat_ipv6 nf_nat_ipv4 nf_nat gre(-) nf_conntrack_ipv6 nf_conntrack_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_netlink nfnetlink nf_conntrack bonding 8021q garp stp mrp llc veth nfsd auth_rpcgss nfs_acl nfs lockd sunrpc fscache dm_crypt kvm_intel kvm serio_raw netconsole configfs crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse floppy ahci libahci [last unloaded: libcrc32c]
> > [181250.892557] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           OX 3.13.0-119-generic #166-Ubuntu
> > [181250.892557] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > [181250.892557] task: ffffffff81c15480 ti: ffffffff81c00000 task.ti: ffffffff81c00000
> > [181250.892557] RIP: 0010:[<ffffffff81079606>]  [<ffffffff81079606>] get_next_timer_interrupt+0x86/0x250
> > [181250.892557] RSP: 0018:ffffffff81c01e00  EFLAGS: 00010002
> > [181250.892557] RAX: ffffffffa03317c8 RBX: 0000000102b245da RCX: 00000000000000db
> > [181250.892557] RDX: ffffffff81ebac58 RSI: 00000000000000db RDI: 0000000102b245db
> > [181250.892557] RBP: ffffffff81c01e48 R08: 0000000000c88c1c R09: 0000000000000000
> > [181250.892557] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000142b245d9
> > [181250.892557] R13: ffffffff81eb9e80 R14: 0000000102b245da R15: 0000000000cd63e8
> > [181250.892557] FS:  0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
> > [181250.892557] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [181250.892557] CR2: ffffffffa03317e0 CR3: 000000003707f000 CR4: 00000000000006f0
> > [181250.892557] Stack:
> > [181250.892557]  0000000000000000 ffffffff81c01e30 ffffffff810a3af5 ffff88013fc13bc0
> > [181250.892557]  ffff88013fc0dce0 0000000102b245da 0000000000000000 00000063ae154000
> > [181250.892557]  0000000000cd63e8 ffffffff81c01ea8 ffffffff810da655 0000a4d8c2cb6200
> > [181250.892557] Call Trace:
> > [181250.892557]  [<ffffffff810a3af5>] ? set_next_entity+0x95/0xb0
> > [181250.892557]  [<ffffffff810da655>] tick_nohz_stop_sched_tick+0x1e5/0x340
> > [181250.892557]  [<ffffffff810da851>] __tick_nohz_idle_enter+0xa1/0x160
> > [181250.892557]  [<ffffffff810dab4d>] tick_nohz_idle_enter+0x3d/0x70
> > [181250.892557]  [<ffffffff810c2af7>] cpu_startup_entry+0x87/0x2b0
> > [181250.892557]  [<ffffffff8171b387>] rest_init+0x77/0x80
> > [181250.892557]  [<ffffffff81d34f6a>] start_kernel+0x432/0x43d
> > [181250.892557]  [<ffffffff81d34941>] ? repair_env_string+0x5c/0x5c
> > [181250.892557]  [<ffffffff81d34120>] ? early_idt_handler_array+0x120/0x120
> > [181250.892557]  [<ffffffff81d345ee>] x86_64_start_reservations+0x2a/0x2c
> > [181250.892557]  [<ffffffff81d34733>] x86_64_start_kernel+0x143/0x152
> > [181250.892557] Code: 8b 7d 10 4d 8b 75 18 4c 39 f7 78 5c 40 0f b6 cf 89 ce 48 63 c6 48 c1 e0 04 49 8d 54 05 00 48 8b 42 28 48 83 c2 28 48 39 d0 74 0e <f6> 40 18 01 74 24 48 8b 00 48 39 d0 75 f2 83 c6 01 40 0f b6 f6
> > [181250.892557] RIP  [<ffffffff81079606>] get_next_timer_interrupt+0x86/0x250
> > [181250.892557]  RSP <ffffffff81c01e00>
> > [181250.892557] CR2: ffffffffa03317e0
> >
> > It seems that a fragment timer registered by OVS may still be pending
> > when the OVS module is unloaded, so the kernel may attempt to clean up
> > an entry using OVS code that has already been unloaded. This might be
> > related to the IPv6 cvlan test, which seems to be where my VM froze and
> > went to 100% CPU, but I would think that the IPv6 fragmentation cleanup
> > test is more likely to cause this, since it leaves fragments behind in
> > the cache after the test finishes. I've only hit this when running all
> > of the tests in make check-kmod.
> >
> > Cheers,
> > Joe
> >
> > On 22 June 2017 at 17:53, 王志克 <wangzhike at jd.com> wrote:
> >> Hi Joe,
> >>
> >> Please check the attachment. Thanks.
> >>
> >> Br,
> >> Wang Zhike
> >>
> >> -----Original Message-----
> >> From: Joe Stringer [mailto:joe at ovn.org]
> >> Sent: June 23, 2017 8:20
> >> To: 王志克
> >> Cc: dev at openvswitch.org
> >> Subject: Re: [ovs-dev] [PATCH] pkt reassemble: fix kernel panic for ovs reassemble
> >>
> >> On 21 June 2017 at 18:54, 王志克 <wangzhike at jd.com> wrote:
> >>> OVS and the kernel stack add frag_queue entries to the same
> >>> netns_frags list. As a result, OVS and the kernel may access a
> >>> frag_queue without the correct lock. Also, struct ipq may differ on
> >>> kernels older than 4.3, which leads to invalid pointer access.
> >>>
> >>> The fix creates specific netns_frags for ovs.
> >>>
> >>> Signed-off-by: wangzhike <wangzhike at jd.com>
> >>> ---
> >>
> >> Hi,
> >>
> >> It looks like the whitespace has been corrupted in this version of the patch, so I cannot apply it. Your email client probably mangles it when sending. A reliable way to send patches correctly via email is the command-line client 'git send-email'; this is the preferred method. If you are unable to set that up, consider attaching the patch to the email (or sending a pull request on GitHub).
> >>
> >> Cheers,
> >> Joe
> > _______________________________________________
> > dev mailing list
> > dev at openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
>


