[ovs-discuss] OVS 2.0.0: rmmod hangs while trying to remove openvswitch.ko

lin shaopeng slin0209 at gmail.com
Thu Mar 20 08:25:58 UTC 2014


Greetings folks,

I am experiencing an issue with openvswitch-2.0.0, and I believe there is a
deadlock in it.

Steps to reproduce the issue:

1. Install openvswitch-2.0.0 on a CentOS machine with kernel version 3.0.57.
2. Start up vswitchd and ovsdb.
3. Create a bridge and add a VXLAN port:
 # ovs-vsctl add-br ovsbr1
 # ovs-vsctl add-port ovsbr1 eth1
 # ovs-vsctl add-port ovsbr1 vxlan2 -- set interface vxlan2 type=vxlan
options:local_ip=11.12.13.4 options:remote_ip=flow
4. Stop vswitchd and ovsdb.
5. Rmmod openvswitch. The command hangs and never returns. After 120
seconds, the following appears in the system log:

Mar 19 01:51:53 NODE-4 kernel: [49343.232348] ovs_workq       D ffff88007f052000     0  7558      2 0x00000000
Mar 19 01:51:53 NODE-4 kernel: [49343.232355]  ffff880044397d60 0000000000000246 ffff88004417e040 0000000000012000
Mar 19 01:51:53 NODE-4 kernel: [49343.232362]  ffff880044397fd8 ffff880000000000 ffff88004417e040 0000000000012000
Mar 19 01:51:53 NODE-4 kernel: [49343.232369]  ffff880044397fd8 ffff880044396010 ffff880044397fd8 0000000000012000
Mar 19 01:51:53 NODE-4 kernel: [49343.232377] Call Trace:
Mar 19 01:51:53 NODE-4 kernel: [49343.232386]  [<ffffffff81974b4b>] ? xen_hypervisor_callback+0x1b/0x20
Mar 19 01:51:53 NODE-4 kernel: [49343.232390]  [<ffffffff8197378a>] ? error_exit+0x2a/0x60
Mar 19 01:51:53 NODE-4 kernel: [49343.232393]  [<ffffffff819732e1>] ? retint_restore_args+0x5/0x6
Mar 19 01:51:53 NODE-4 kernel: [49343.232395]  [<ffffffff81972da9>] ? _raw_spin_lock+0x9/0x10
Mar 19 01:51:53 NODE-4 kernel: [49343.232401]  [<ffffffff810e4dc6>] ? kmem_cache_free+0xc6/0x1a0
Mar 19 01:51:53 NODE-4 kernel: [49343.232404]  [<ffffffff8197108a>] schedule+0x3a/0x50
Mar 19 01:51:53 NODE-4 kernel: [49343.232407]  [<ffffffff8197197f>] __mutex_lock_slowpath+0xdf/0x160
Mar 19 01:51:53 NODE-4 kernel: [49343.232413]  [<ffffffffa00636b0>] ? vxlan_udp_encap_recv+0x160/0x160 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232416]  [<ffffffff819717ce>] mutex_lock+0x1e/0x40
Mar 19 01:51:53 NODE-4 kernel: [49343.232420]  [<ffffffff81851378>] unregister_pernet_device+0x18/0x50
Mar 19 01:51:53 NODE-4 kernel: [49343.232424]  [<ffffffffa0063704>] vxlan_del_work+0x54/0x70 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232429]  [<ffffffffa0063e01>] worker_thread+0xe1/0x1d0 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232433]  [<ffffffff8106f020>] ? wake_up_bit+0x40/0x40
Mar 19 01:51:53 NODE-4 kernel: [49343.232438]  [<ffffffffa0063d20>] ? ovs_workqueues_exit+0x30/0x30 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232440]  [<ffffffff8106eb66>] kthread+0x96/0xa0
Mar 19 01:51:53 NODE-4 kernel: [49343.232443]  [<ffffffff81974a24>] kernel_thread_helper+0x4/0x10
Mar 19 01:51:53 NODE-4 kernel: [49343.232446]  [<ffffffff81973b36>] ? int_ret_from_sys_call+0x7/0x1b
Mar 19 01:51:53 NODE-4 kernel: [49343.232448]  [<ffffffff819732e1>] ? retint_restore_args+0x5/0x6
Mar 19 01:51:53 NODE-4 kernel: [49343.232451]  [<ffffffff81974a20>] ? gs_change+0x13/0x13
Mar 19 01:51:53 NODE-4 kernel: [49343.232453] INFO: task rmmod:8479 blocked for more than 120 seconds.
Mar 19 01:51:53 NODE-4 kernel: [49343.232455] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 19 01:51:53 NODE-4 kernel: [49343.232456] rmmod           D ffff88007f1d2000     0  8479   1095 0x00000000
Mar 19 01:51:53 NODE-4 kernel: [49343.232459]  ffff880049c5bc18 0000000000000282 ffffffff32323532 ffffffff81d6730a
Mar 19 01:51:53 NODE-4 kernel: [49343.232462]  ffff880049c5bb78 ffff880000000000 ffff8800442844c0 0000000000012000
Mar 19 01:51:53 NODE-4 kernel: [49343.232465]  ffff880049c5bfd8 ffff880049c5a010 ffff880049c5bfd8 0000000000012000
Mar 19 01:51:53 NODE-4 kernel: [49343.232468] Call Trace:
Mar 19 01:51:53 NODE-4 kernel: [49343.232471]  [<ffffffff81972e29>] ? _raw_spin_unlock_irqrestore+0x19/0x20
Mar 19 01:51:53 NODE-4 kernel: [49343.232474]  [<ffffffff81095332>] ? irq_to_desc+0x12/0x20
Mar 19 01:51:53 NODE-4 kernel: [49343.232477]  [<ffffffff810979d9>] ? irq_get_irq_data+0x9/0x10
Mar 19 01:51:53 NODE-4 kernel: [49343.232480]  [<ffffffff8141d669>] ? info_for_irq+0x9/0x20
Mar 19 01:51:53 NODE-4 kernel: [49343.232483]  [<ffffffff8197108a>] schedule+0x3a/0x50
Mar 19 01:51:53 NODE-4 kernel: [49343.232486]  [<ffffffff819714a5>] schedule_timeout+0x1a5/0x200
Mar 19 01:51:53 NODE-4 kernel: [49343.232490]  [<ffffffff81004bcf>] ? xen_pte_val+0x2f/0x80
Mar 19 01:51:53 NODE-4 kernel: [49343.232492]  [<ffffffff810e5837>] ? kfree+0x117/0x210
Mar 19 01:51:53 NODE-4 kernel: [49343.232495]  [<ffffffff810e5837>] ? kfree+0x117/0x210
Mar 19 01:51:53 NODE-4 kernel: [49343.232497]  [<ffffffff81970531>] wait_for_common+0xd1/0x180
Mar 19 01:51:53 NODE-4 kernel: [49343.232501]  [<ffffffff813b6e20>] ? kobject_del+0x40/0x40
Mar 19 01:51:53 NODE-4 kernel: [49343.232506]  [<ffffffff8104e210>] ? try_to_wake_up+0x260/0x260
Mar 19 01:51:53 NODE-4 kernel: [49343.232509]  [<ffffffff81970688>] wait_for_completion+0x18/0x20
Mar 19 01:51:53 NODE-4 kernel: [49343.232513]  [<ffffffffa0063f71>] __cancel_work_timer+0x81/0x1b0 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232516]  [<ffffffff8104e183>] ? try_to_wake_up+0x1d3/0x260
Mar 19 01:51:53 NODE-4 kernel: [49343.232521]  [<ffffffffa00640d0>] ? rpl_cancel_delayed_work_sync+0x10/0x10 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232526]  [<ffffffffa00640ab>] cancel_work_sync+0xb/0x20 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232530]  [<ffffffffa00643a3>] ovs_exit_net+0x73/0x82 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232533]  [<ffffffff818511d9>] ops_exit_list+0x39/0x60
Mar 19 01:51:53 NODE-4 kernel: [49343.232535]  [<ffffffff8185132d>] unregister_pernet_operations+0x3d/0x70
Mar 19 01:51:53 NODE-4 kernel: [49343.232538]  [<ffffffff81851389>] unregister_pernet_device+0x29/0x50
Mar 19 01:51:53 NODE-4 kernel: [49343.232541]  [<ffffffffa00583c8>] dp_cleanup+0x58/0x80 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232546]  [<ffffffff810885a8>] sys_delete_module+0x178/0x240

Some investigation into the code:

Thread rmmod:

dp_cleanup -->
 unregister_netdevice_notifier(&ovs_dp_device_notifier); -->
  dp_device_event -->    (event == NETDEV_UNREGISTER)
   queue_work(&ovs_net->dp_notify_work); -->  ovs_workq will handle this
 unregister_pernet_device(&ovs_net_ops); -->  acquires net_mutex
  unregister_pernet_operations -->
   ops_exit_list -->
    ovs_exit_net -->
     cancel_work_sync -->
      __cancel_work_timer -->
       workqueue_barrier -->    queues a barrier work item on ovs_workq and waits for it
        wait_for_completion(&barr.done);

Thread ovs_workq:

worker_thread -->
 run_workqueue -->
  ovs_dp_notify_wq -->
   dp_detach_port_notify -->
    ovs_dp_detach_port -->
     ovs_vport_del -->
      vport->ops->destroy(vport); -->
       vxlan_tnl_destroy -->
        vxlan_sock_release -->
         queue_work(&vs->del_work); -->  queues vxlan_del_work on ovs_workq


worker_thread -->
 run_workqueue -->
  vxlan_del_work -->
   vxlan_cleanup_module -->
    unregister_pernet_device(&vxlan_net_ops); -->  acquires net_mutex

I believe this is a deadlock.

After rmmod is issued, unregister_netdevice_notifier() queues dp_notify_work
on ovs_workq, and the rmmod thread then proceeds to
unregister_pernet_device(&ovs_net_ops). There, the rmmod thread takes
net_mutex and runs ovs_exit_net(), which adds a barrier work item to the tail
of the ovs_workq workqueue and then waits for that barrier to complete while
still holding net_mutex.

Meanwhile, the ovs_workq thread is doing the vxlan cleanup: to call
unregister_pernet_device(&vxlan_net_ops) it must acquire net_mutex, which is
held by rmmod. So vxlan_del_work never finishes, the barrier work item never
runs, and rmmod can never continue and release net_mutex.
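
To make the cycle easier to see, here is a minimal user-space sketch of the
same pattern, written with POSIX threads instead of the kernel workqueue API.
The names (net_mutex, worker, cleanup_work, barrier_work) are stand-ins for
the kernel objects described above, not the actual kernel code; it only
models the lock ordering, under the assumption that the cleanup work ends up
running while rmmod already holds net_mutex:

/*
 * deadlock_sketch.c - hypothetical user-space model of the hang above.
 * Build with:  cc -pthread deadlock_sketch.c -o deadlock_sketch
 *
 * "net_mutex" stands in for the kernel's net_mutex, "worker" for the single
 * ovs_workq thread, "cleanup_work" for vxlan_del_work (which needs
 * net_mutex), and "barrier_done" for the completion that
 * __cancel_work_timer() waits on.
 */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static pthread_mutex_t net_mutex = PTHREAD_MUTEX_INITIALIZER;
static sem_t work_queued;     /* signalled once per queued work item  */
static sem_t barrier_done;    /* signalled when the barrier item runs */

/* Work item 1: models vxlan_del_work -> unregister_pernet_device(). */
static void cleanup_work(void)
{
    printf("worker: cleanup work needs net_mutex...\n");
    pthread_mutex_lock(&net_mutex);   /* blocks forever: rmmod holds it */
    pthread_mutex_unlock(&net_mutex);
}

/* Work item 2: models the barrier queued by __cancel_work_timer(). */
static void barrier_work(void)
{
    sem_post(&barrier_done);
}

/* Single worker thread: models ovs_workq, runs items strictly in order. */
static void *worker(void *arg)
{
    (void)arg;
    sem_wait(&work_queued);
    cleanup_work();                   /* never returns, so...          */
    sem_wait(&work_queued);
    barrier_work();                   /* ...the barrier never executes */
    return NULL;
}

int main(void)
{
    pthread_t tid;

    sem_init(&work_queued, 0, 0);
    sem_init(&barrier_done, 0, 0);
    pthread_create(&tid, NULL, worker, NULL);

    /* "rmmod": unregister_pernet_device() takes net_mutex.  (In the real
     * trace the cleanup work was queued earlier by the netdev notifier but
     * only ran after net_mutex was taken; taking the mutex first here keeps
     * the ordering deterministic.) */
    pthread_mutex_lock(&net_mutex);
    sem_post(&work_queued);           /* cleanup work is now pending    */

    /* cancel_work_sync() queues a barrier behind it and waits for it,
     * still holding net_mutex.  This wait never finishes. */
    sem_post(&work_queued);
    printf("rmmod: waiting for barrier while holding net_mutex\n");
    sem_wait(&barrier_done);          /* deadlock: hangs here forever   */

    pthread_mutex_unlock(&net_mutex);
    pthread_join(tid, NULL);
    return 0;
}

Built with cc -pthread, this prints both messages and then hangs, mirroring
the hung rmmod and ovs_workq tasks in the trace above.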

Can anyone confirm this? And is there a fix for this issue?

Thanks,
Lin
