[ovs-dev] Hypervisor down during upgrade OVS 2.10.x to 2.10.y

Jin, Liang liangjin at ebay.com
Wed Aug 28 21:08:33 UTC 2019


Hi,
We recently upgraded OVS from one 2.10 release to another. On some hypervisors (HVs), the upgrade took the HV down while running force-reload-kmod.
In the ovs-ctl log, killing ovs-vswitchd failed, but the script still went on to reload the kernel modules.
```
ovsdb-server is running with pid 2431
ovs-vswitchd is running with pid 2507
Thu Aug 22 23:13:49 UTC 2019:stop
2019-08-22T23:13:50Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
Alarm clock
2019-08-22T23:13:51Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
Alarm clock
* Exiting ovs-vswitchd (2507)
* Killing ovs-vswitchd (2507)
* Killing ovs-vswitchd (2507) with SIGKILL
* Killing ovs-vswitchd (2507) failed
* Exiting ovsdb-server (2431)
Thu Aug 22 23:14:58 UTC 2019:load-kmod
Thu Aug 22 23:14:58 UTC 2019:start --system-id=random --no-full-hostname
/usr/share/openvswitch/scripts/ovs-ctl: unknown option "--no-full-hostname" (use --help for help)
* Starting ovsdb-server
* Configuring Open vSwitch system IDs
* ovs-vswitchd is already running
* Enabling remote OVSDB managers
ovsdb-server is running with pid 3860447
ovs-vswitchd is running with pid 2507
ovsdb-server is running with pid 3860447
ovs-vswitchd is running with pid 2507
Thu Aug 22 23:15:09 UTC 2019:load-kmod
Thu Aug 22 23:15:09 UTC 2019:force-reload-kmod --system-id=random --no-full-hostname
/usr/share/openvswitch/scripts/ovs-ctl: unknown option "--no-full-hostname" (use --help for help)
* Detected internal interfaces: br-int
Thu Aug 22 23:37:08 UTC 2019:stop
2019-08-22T23:37:09Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
Alarm clock
2019-08-22T23:37:10Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
Alarm clock
* Exiting ovs-vswitchd (2507)
* Killing ovs-vswitchd (2507)
* Killing ovs-vswitchd (2507) with SIGKILL
* Killing ovs-vswitchd (2507) failed
* Exiting ovsdb-server (3860447)
Thu Aug 22 23:40:42 UTC 2019:load-kmod
* Inserting openvswitch module
Thu Aug 22 23:40:42 UTC 2019:start --system-id=random --no-full-hostname
/usr/share/openvswitch/scripts/ovs-ctl: unknown option "--no-full-hostname" (use --help for help)
* Starting ovsdb-server
* Configuring Open vSwitch system IDs
* Starting ovs-vswitchd
* Enabling remote OVSDB managers
ovsdb-server is running with pid 2399
ovs-vswitchd is running with pid 2440
ovsdb-server is running with pid 2399
ovs-vswitchd is running with pid 2440
Thu Aug 22 23:46:18 UTC 2019:load-kmod
Thu Aug 22 23:46:18 UTC 2019:force-reload-kmod --system-id=random --no-full-hostname
/usr/share/openvswitch/scripts/ovs-ctl: unknown option "--no-full-hostname" (use --help for help)
* Detected internal interfaces: br-int br0
* Saving flows
* Exiting ovsdb-server (2399)
* Starting ovsdb-server
* Configuring Open vSwitch system IDs
* Flush old conntrack entries
* Exiting ovs-vswitchd (2440)
* Saving interface configuration
* Removing datapath: system at ovs-system
* Removing openvswitch module
rmmod: ERROR: Module vxlan is in use by: i40e
* Forcing removal of vxlan module
* Inserting openvswitch module
* Starting ovs-vswitchd
* Restoring saved flows
* Enabling remote OVSDB managers
* Restoring interface configuration
```
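For illustration only, here is a minimal sketch of the kind of guard we expected around this step: refuse to go on to the module reload if the stop step fails to kill ovs-vswitchd. The pidfile path and the overall wrapper are assumptions of ours, not part of ovs-ctl.
```
#!/bin/sh
# Hypothetical upgrade guard, for illustration only. The pidfile path is an
# assumption (the default on our hosts); adjust as needed.
OVS_CTL=/usr/share/openvswitch/scripts/ovs-ctl
PIDFILE=/var/run/openvswitch/ovs-vswitchd.pid

"$OVS_CTL" stop

# Proceed to the kernel module reload only if ovs-vswitchd is really gone,
# instead of reloading the openvswitch module underneath a live daemon.
if [ -e "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    echo "ovs-vswitchd did not exit; skipping force-reload-kmod" >&2
    exit 1
fi

"$OVS_CTL" force-reload-kmod --system-id=random
```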

But in kern.log we see the messages below. The process could not exit because it was waiting for br0 to be released, so ovs-ctl's `kill -TERM` and `kill -9` had no effect; the kernel was stuck in an endless loop. ovs-ctl then went on to save the flows, and at that point a kernel crash (core dump) occurred. The HV stayed down until we restarted it.
```
Aug 22 16:13:45 slx11c-9gjm kernel: [21177057.998961] device br0 left promiscuous mode
Aug 22 16:13:55 slx11c-9gjm kernel: [21177068.044859] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:05 slx11c-9gjm kernel: [21177078.068429] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:15 slx11c-9gjm kernel: [21177088.312001] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:25 slx11c-9gjm kernel: [21177098.359556] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:35 slx11c-9gjm kernel: [21177108.603058] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:45 slx11c-9gjm kernel: [21177118.658643] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:14:55 slx11c-9gjm kernel: [21177128.678211] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:05 slx11c-9gjm kernel: [21177138.721732] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:15 slx11c-9gjm kernel: [21177148.817317] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:25 slx11c-9gjm kernel: [21177158.828853] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:35 slx11c-9gjm kernel: [21177168.860459] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:46 slx11c-9gjm kernel: [21177179.104006] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:15:56 slx11c-9gjm kernel: [21177189.143518] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:06 slx11c-9gjm kernel: [21177199.183075] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:16 slx11c-9gjm kernel: [21177209.262617] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:26 slx11c-9gjm kernel: [21177219.510207] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:36 slx11c-9gjm kernel: [21177229.521757] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:46 slx11c-9gjm kernel: [21177239.765278] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:16:57 slx11c-9gjm kernel: [21177250.008826] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.016400] unregister_netdevice: waiting for br0 to become free. Usage count = 1
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.256456] INFO: task revalidator221:3859855 blocked for more than 120 seconds.
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.264121]       Tainted: G           OE   4.4.0-104-generic #127-Ubuntu
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271179] revalidator221  D ffff8826844eba48     0 3859855   2506 0x00000084
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271180]  ffff8826844eba48 0000000000000000 ffff884fc2e69540 ffff882758688000
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271181]  ffff8826844ec000 ffffffffc0463564 ffff882758688000 00000000ffffffff
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271182]  ffffffffc0463568 ffff8826844eba60 ffffffff81840765 ffffffffc0463560
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271182] Call Trace:
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271191]  [<ffffffff81840765>] schedule+0x35/0x80
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271192]  [<ffffffff81840a0e>] schedule_preempt_disabled+0xe/0x10
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271193]  [<ffffffff81842649>] __mutex_lock_slowpath+0xb9/0x130
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271195]  [<ffffffff818426df>] mutex_lock+0x1f/0x30
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271207]  [<ffffffffc042ef56>] ovs_dp_cmd_get+0x46/0xf0 [openvswitch]
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271211]  [<ffffffff8176d287>] genl_family_rcv_msg+0x1e7/0x3c0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271213]  [<ffffffff8176d460>] ? genl_family_rcv_msg+0x3c0/0x3c0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271214]  [<ffffffff8176d4d6>] genl_rcv_msg+0x76/0xb0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271215]  [<ffffffff8176a97c>] ? netlink_lookup+0xdc/0x140
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271217]  [<ffffffff8176c9a4>] netlink_rcv_skb+0xa4/0xc0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271218]  [<ffffffff8176d088>] genl_rcv+0x28/0x40
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271219]  [<ffffffff8176c37a>] netlink_unicast+0x18a/0x240
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271221]  [<ffffffff8176c72b>] netlink_sendmsg+0x2fb/0x3a0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271225]  [<ffffffff813a1c11>] ? aa_sock_msg_perm+0x61/0x150
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271227]  [<ffffffff8171bad8>] sock_sendmsg+0x38/0x50
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271228]  [<ffffffff8171c581>] ___sys_sendmsg+0x281/0x290
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271231]  [<ffffffff810badf7>] ? should_numa_migrate_memory+0x57/0x130
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271233]  [<ffffffff811c2713>] ? handle_mm_fault+0xbc3/0x1820
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271235]  [<ffffffff812101b4>] ? new_sync_read+0x94/0xd0
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271237]  [<ffffffff8122e9e5>] ? __fget_light+0x25/0x60
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271238]  [<ffffffff8171ced1>] __sys_sendmsg+0x51/0x90
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271239]  [<ffffffff8171cf22>] SyS_sendmsg+0x12/0x20
Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271240]  [<ffffffff81844872>] entry_SYSCALL_64_fastpath+0x16/0x71
```
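For what it is worth, here is a rough pre-check sketch that could have stopped the upgrade before the module removal. The grep pattern comes from the kern.log lines above; scanning only the most recent lines is an arbitrary assumption.
```
#!/bin/sh
# Hypothetical pre-check, for illustration only: abort the upgrade if the kernel
# is already stuck waiting to unregister a net device. The pattern matches the
# kern.log messages above; looking at the last 200 lines is an arbitrary choice.
if dmesg | tail -n 200 | grep -q 'unregister_netdevice: waiting for'; then
    echo "kernel is stuck releasing a net device; not safe to reload openvswitch" >&2
    exit 1
fi
```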
My questions are:

  1.  Why do we need to stop the openvswitch service before backing up the flows? (See the dump-flows sketch below.)
  2.  If openvswitch is in the middle of exiting, why can it still receive requests?
  3.  Why did br0 take so long to be released?

I think this is a bug in the upgrade procedure that needs to be fixed.
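For context on question 1: as far as I understand, the flow tables can be dumped from a running switch without stopping anything, for example (the bridge names are just the ones from our log):
```
# Illustration only: dumping flows from the live bridges before the upgrade.
# ovs-ctl's force-reload-kmod already does its own save/restore, as shown in
# the log above; this only shows that stopping the daemons is not required
# to read the flows.
ovs-ofctl dump-flows br-int > /tmp/br-int.flows.backup
ovs-ofctl dump-flows br0   > /tmp/br0.flows.backup
```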

Thanks
L.J.

