[ovs-dev] [PATCH RFC v2 0/8] Introduce connection tracking tc offload

Paul Blakey paulb at mellanox.com
Mon Jul 29 13:05:50 UTC 2019


On 7/23/2019 6:42 PM, Marcelo Ricardo Leitner wrote:

> On Sat, Jul 20, 2019 at 08:26:59AM +0000, Paul Blakey wrote:
>> Hi Marcelo, thanks for reporting this, can you dump the datapath rules via ovs-appctl dpctl/dump-flows -m --names
>> Also running "tc filter show dev ns2-veth-ab ingress" while it happened can show us more details.
>>
>> I'll try and reproduce it on my end.
>> How often does this happen? And what is the setup?
> Quite often.
> Setup is:
> ns1                                                             ns2
> --------------.                                     .--------------
>                |         v------ br0 ------v         |
>   ns1-veth-ba--x----ns1-veth-ab       ns2-veth-ab----x-- ns2-veth-ba
>                |                                     |
> --------------'                                     '--------------
>
> Seems related to the skb_ext patches.. with a kernel with more
> debugging stuff enabled, I'm seeing:
>
> [   17.467576] =============================
> [   17.468717] WARNING: suspicious RCU usage
> [   17.469687] 5.2.0.c1f3d.g83fb7bc8ff16+ #2 Not tainted
> [   17.471496] -----------------------------
> [   17.472922] net/sched/sch_ingress.c:52 suspicious rcu_dereference_protected() usage!
> [   17.474715]
>                 other info that might help us debug this:
>
> [   17.476219]
>                 rcu_scheduler_active = 2, debug_locks = 1
> [   17.477408] no locks held by ovs-vswitchd/1063.
> [   17.478267]
>                 stack backtrace:
> [   17.479545] CPU: 10 PID: 1063 Comm: ovs-vswitchd Not tainted 5.2.0.c1f3d.g83fb7bc8ff16+ #2
> [   17.481932] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
> [   17.483725] Call Trace:
> [   17.484179]  dump_stack+0x85/0xc0
> [   17.484782]  ingress_tcf_block+0x4c/0x50 [sch_ingress]
> [   17.485717]  __tcf_block_find+0x28/0x80
> [   17.486419]  tc_new_tfilter+0x1b6/0x960
> [   17.487093]  ? tc_del_tfilter+0x720/0x720
> [   17.487854]  rtnetlink_rcv_msg+0x389/0x4b0
> [   17.488590]  ? netlink_deliver_tap+0x95/0x400
> [   17.489343]  ? rtnl_dellink+0x2d0/0x2d0
> [   17.490034]  netlink_rcv_skb+0x49/0x110
> [   17.490726]  netlink_unicast+0x171/0x200
> [   17.491409]  netlink_sendmsg+0x21e/0x3e0
> [   17.492096]  sock_sendmsg+0x5e/0x60
> [   17.492792]  ___sys_sendmsg+0x2ae/0x330
> [   17.493725]  ? ___sys_recvmsg+0x159/0x1f0
> [   17.494734]  ? up_write+0x1c/0xc0
> [   17.495515]  ? ext4_file_write_iter+0xd1/0x3b0
> [   17.496624]  ? find_held_lock+0x2b/0x80
> [   17.497362]  ? ksys_write+0xc0/0xe0
> [   17.498000]  __sys_sendmsg+0x59/0xa0
> [   17.498653]  do_syscall_64+0x5c/0xa0
> [   17.499307]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   17.500225] RIP: 0033:0x7f6bd4ab2b45
>
> And I could trigger a crash with this trace:
> (crash was on a different run, while the RCU trace above and outputs below are
> from the same run)
>
> [  382.670318] BUG: kernel NULL pointer dereference, address: 0000000000000030
> [  382.671927] #PF: supervisor read access in kernel mode
> [  382.673070] #PF: error_code(0x0000) - not-present page
> [  382.674170] PGD 0 P4D 0
> [  382.674873] Oops: 0000 [#1] SMP NOPTI
> [  382.675753] CPU: 1 PID: 6988 Comm: nc Kdump: loaded Not tainted 5.2.0.c1f3d.g83fb7bc8ff16+ #1
> [  382.677350] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
> [  382.679255] RIP: 0010:tcf_classify+0x104/0x1e0
> [  382.680942] Code: 5c 41 5d 41 5e 41 5f c3 48 8b 55 00 be 01 00 00 00 4c 89 ff 48 89 14 24 e8 79 e3 f9 ff 48 8b 14 24 48 85 c0 0f 84 cb 00 00
>   00 <48> 8b 4a 30 8b 49 40 89 08 41 8d 45 01 41 83 fd 03 7e ac e8 84 37
> [  382.684792] RSP: 0018:ffffbd19000f4dd0 EFLAGS: 00010282
> [  382.685746] RAX: ffff9e04772b0008 RBX: ffff9e0479088360 RCX: 0000000000000001
> [  382.686929] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9e047ad9a140
> [  382.688144] RBP: ffffbd19000f4e40 R08: ffff9e047b870b60 R09: 0000000000000002
> [  382.689347] R10: ffff9e04550ccd68 R11: ffff9e0476050a00 R12: 0000000000000001
> [  382.690534] R13: 0000000000000000 R14: ffff9e0479088360 R15: ffff9e0479229ce0
> [  382.691730] FS:  00007efd4d75bb80(0000) GS:ffff9e047b840000(0000) knlGS:0000000000000000
> [  382.693033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  382.694061] CR2: 0000000000000030 CR3: 0000000163abc000 CR4: 00000000003406e0
> [  382.695283] Call Trace:
> [  382.695962]  <IRQ>
> [  382.696573]  __netif_receive_skb_core+0x3c0/0xcf0
> [  382.697492]  ? reweight_entity+0x15b/0x1a0
> [  382.698340]  __netif_receive_skb_one_core+0x37/0x90
> [  382.699274]  process_backlog+0x9c/0x150
> [  382.700199]  net_rx_action+0xff/0x350
> [  382.701093]  __do_softirq+0xee/0x2ff
> [  382.701976]  do_softirq_own_stack+0x2a/0x40
> [  382.702948]  </IRQ>
> [  382.703626]  do_softirq.part.0+0x41/0x50
> [  382.704556]  __local_bh_enable_ip+0x4b/0x50
> [  382.705536]  ip_finish_output2+0x1a9/0x580
> [  382.706518]  ip_output+0x71/0xf0
> [  382.707392]  ? __ip_finish_output+0x1c0/0x1c0
> [  382.708422]  __ip_queue_xmit+0x16c/0x420
> [  382.709423]  ? __kmalloc_reserve.isra.0+0x2d/0x70
> [  382.711184]  __tcp_transmit_skb+0x52c/0xab0
> [  382.712582]  tcp_connect+0xaed/0xda0
> [  382.713403]  ? kvm_clock_get_cycles+0xd/0x10
> [  382.714315]  tcp_v4_connect+0x3f1/0x500
> [  382.715156]  __inet_stream_connect+0xc5/0x360
> [  382.716070]  inet_stream_connect+0x37/0x50
> [  382.716947]  __sys_connect+0xd6/0x100
> [  382.717766]  ? do_fcntl+0x1a5/0x650
> [  382.718588]  ? __x64_sys_fcntl+0x84/0xb0
> [  382.719450]  __x64_sys_connect+0x16/0x20
> [  382.720310]  do_syscall_64+0x5b/0x180
> [  382.721146]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>

Hi,

I reproduced both traces.

The first RCU trace is because we dereference the qdisc block without rtnl 
locked (flower allows running unlocked),

and the second trace is because res->goto_tp can be NULL if the last 
prio of the chain we goto was just deleted, and we try to dereference it 
unconditionally when we set the skb extension in tcf_classify.

This can happen when a filter is deleted while there is still traffic. I 
have a test for it and will try it against the fixes.

I'll post fixed patches (patch by VladB) for the tc skb extension to fix 
both of the above, and an OVS userspace RFC v2 to fix the NAT issue.

Thanks,

Paul.


