[ovs-discuss] [ovs-dev] Hypervisor down during upgrade OVS 2.10.x to 2.10.y

aginwala aginwala at asu.edu
Fri Sep 6 23:17:36 UTC 2019


Hi:

Adding the correct ovs-discuss ML. I did get a chance to take a look at
this a bit. I think this is a bug in the 4.4.0-104-generic kernel on
Ubuntu 16.04, as discussed on the Ubuntu tracker at
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407, where it
can be hit all of a sudden, matching the kernel logs shared:
"unregister_netdevice: waiting for br0 to become free. Usage count = 1".
Folks on that thread propose upgrading to a newer kernel to get rid of
the issue. Upstream Linux proposed a relevant fix at
https://github.com/torvalds/linux/commit/ee60ad219f5c7c4fb2f047f88037770063ef785f
to address related issues. I guess the kernel folks can comment on this
more; not sure if I missed anything else.
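
For quick triage, something like the following can confirm a host has hit
this condition (a generic sketch, not anything OVS-specific; the log file
path is an assumption, pass whatever kern.log you have):

```shell
#!/bin/sh
# Hypothetical triage helper: count the netdevice-refcount-leak messages
# in a kernel log file. A non-zero count means the host has hit the
# "unregister_netdevice: waiting for <dev> to become free" condition.
count_netdev_leak_msgs() {
    # grep -c prints the number of matching lines (0 when there are none).
    grep -c 'unregister_netdevice: waiting for .* to become free' "$1"
}
```

E.g. `count_netdev_leak_msgs /var/log/kern.log` on the affected HV.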

Maybe we can do some improvements in force-reload-kmod: right now
stop_forwarding stops ovs-vswitchd, but the system stalls because br0
(eth0 is added to br0) is busy, causing loss of network connectivity. In
the current case the host recovers only after a restart. Not sure if we
need to handle this corner case in OVS?
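
One possible shape for such a guard (purely a sketch under assumptions,
NOT the real ovs-ctl code; the function names are made up): before
removing the module, check whether any thread of the daemon is stuck in
uninterruptible sleep ("D" state), in which case even SIGKILL cannot be
delivered and the reload should be aborted rather than continued:

```shell
#!/bin/sh
# Sketch of a guard for force-reload-kmod (hypothetical, not ovs-ctl's
# actual logic). A thread in "D" state is blocked inside the kernel, so
# proceeding to rmmod openvswitch would pull the module out from under it.

any_thread_uninterruptible() {
    pid=$1
    for task in /proc/"$pid"/task/*; do
        [ -r "$task/status" ] || continue
        # The "State:" line of /proc/<tid>/status holds the one-letter
        # scheduling state (R, S, D, Z, ...).
        state=$(awk '/^State:/ {print $2}' "$task/status")
        [ "$state" = "D" ] && return 0
    done
    return 1
}

guard_force_reload() {
    pid=$1
    if any_thread_uninterruptible "$pid"; then
        echo "daemon ($pid) stuck in D state; refusing to reload kmod" >&2
        return 1
    fi
    echo "ok to proceed with module reload"
}
```

Hooked in before the "Removing openvswitch module" step, something like
this would have turned the hang below into a clean failure of
force-reload-kmod instead of a dead host.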

On Wed, Aug 28, 2019 at 2:21 PM Jin, Liang via dev <ovs-dev at openvswitch.org>
wrote:

>
> Hi,
> We recently upgraded OVS from one 2.10 version to another 2.10 version.
> On some HVs, the upgrade brought the HV down while running
> force-reload-kmod. In the ovs-ctl log, killing ovs-vswitchd failed, but
> the script still went on to reload the modules.
> ```
> ovsdb-server is running with pid 2431
> ovs-vswitchd is running with pid 2507
> Thu Aug 22 23:13:49 UTC 2019:stop
> 2019-08-22T23:13:50Z|00001|fatal_signal|WARN|terminating with signal 14
> (Alarm clock)
> Alarm clock
> 2019-08-22T23:13:51Z|00001|fatal_signal|WARN|terminating with signal 14
> (Alarm clock)
> Alarm clock
> * Exiting ovs-vswitchd (2507)
> * Killing ovs-vswitchd (2507)
> * Killing ovs-vswitchd (2507) with SIGKILL
> * Killing ovs-vswitchd (2507) failed
> * Exiting ovsdb-server (2431)
> Thu Aug 22 23:14:58 UTC 2019:load-kmod
> Thu Aug 22 23:14:58 UTC 2019:start --system-id=random --no-full-hostname
> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
> "--no-full-hostname" (use --help for help)
> * Starting ovsdb-server
> * Configuring Open vSwitch system IDs
> * ovs-vswitchd is already running
> * Enabling remote OVSDB managers
> ovsdb-server is running with pid 3860447
> ovs-vswitchd is running with pid 2507
> ovsdb-server is running with pid 3860447
> ovs-vswitchd is running with pid 2507
> Thu Aug 22 23:15:09 UTC 2019:load-kmod
> Thu Aug 22 23:15:09 UTC 2019:force-reload-kmod --system-id=random
> --no-full-hostname
> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
> "--no-full-hostname" (use --help for help)
> * Detected internal interfaces: br-int
> Thu Aug 22 23:37:08 UTC 2019:stop
> 2019-08-22T23:37:09Z|00001|fatal_signal|WARN|terminating with signal 14
> (Alarm clock)
> Alarm clock
> 2019-08-22T23:37:10Z|00001|fatal_signal|WARN|terminating with signal 14
> (Alarm clock)
> Alarm clock
> * Exiting ovs-vswitchd (2507)
> * Killing ovs-vswitchd (2507)
> * Killing ovs-vswitchd (2507) with SIGKILL
> * Killing ovs-vswitchd (2507) failed
> * Exiting ovsdb-server (3860447)
> Thu Aug 22 23:40:42 UTC 2019:load-kmod
> * Inserting openvswitch module
> Thu Aug 22 23:40:42 UTC 2019:start --system-id=random --no-full-hostname
> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
> "--no-full-hostname" (use --help for help)
> * Starting ovsdb-server
> * Configuring Open vSwitch system IDs
> * Starting ovs-vswitchd
> * Enabling remote OVSDB managers
> ovsdb-server is running with pid 2399
> ovs-vswitchd is running with pid 2440
> ovsdb-server is running with pid 2399
> ovs-vswitchd is running with pid 2440
> Thu Aug 22 23:46:18 UTC 2019:load-kmod
> Thu Aug 22 23:46:18 UTC 2019:force-reload-kmod --system-id=random
> --no-full-hostname
> /usr/share/openvswitch/scripts/ovs-ctl: unknown option
> "--no-full-hostname" (use --help for help)
> * Detected internal interfaces: br-int br0
> * Saving flows
> * Exiting ovsdb-server (2399)
> * Starting ovsdb-server
> * Configuring Open vSwitch system IDs
> * Flush old conntrack entries
> * Exiting ovs-vswitchd (2440)
> * Saving interface configuration
> * Removing datapath: system at ovs-system
> * Removing openvswitch module
> rmmod: ERROR: Module vxlan is in use by: i40e
> * Forcing removal of vxlan module
> * Inserting openvswitch module
> * Starting ovs-vswitchd
> * Restoring saved flows
> * Enabling remote OVSDB managers
> * Restoring interface configuration
> ```
>
> But in kern.log we see the messages below: the process could not exit
> because it was waiting for br0 to be released. ovs-ctl then tried `kill
> -TERM` and `kill -9` on the process, but neither worked, because the
> kernel was stuck in an infinite loop. When ovs-ctl went on to save the
> flows, a crash happened in the kernel, and the HV was down until we
> restarted it.
> ```
> Aug 22 16:13:45 slx11c-9gjm kernel: [21177057.998961] device br0 left
> promiscuous mode
> Aug 22 16:13:55 slx11c-9gjm kernel: [21177068.044859]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:14:05 slx11c-9gjm kernel: [21177078.068429]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:14:15 slx11c-9gjm kernel: [21177088.312001]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:14:25 slx11c-9gjm kernel: [21177098.359556]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:14:35 slx11c-9gjm kernel: [21177108.603058]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:14:45 slx11c-9gjm kernel: [21177118.658643]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:14:55 slx11c-9gjm kernel: [21177128.678211]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:15:05 slx11c-9gjm kernel: [21177138.721732]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:15:15 slx11c-9gjm kernel: [21177148.817317]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:15:25 slx11c-9gjm kernel: [21177158.828853]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:15:35 slx11c-9gjm kernel: [21177168.860459]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:15:46 slx11c-9gjm kernel: [21177179.104006]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:15:56 slx11c-9gjm kernel: [21177189.143518]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:16:06 slx11c-9gjm kernel: [21177199.183075]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:16:16 slx11c-9gjm kernel: [21177209.262617]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:16:26 slx11c-9gjm kernel: [21177219.510207]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:16:36 slx11c-9gjm kernel: [21177229.521757]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:16:46 slx11c-9gjm kernel: [21177239.765278]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:16:57 slx11c-9gjm kernel: [21177250.008826]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.016400]
> unregister_netdevice: waiting for br0 to become free. Usage count = 1
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.256456] INFO: task
> revalidator221:3859855 blocked for more than 120 seconds.
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.264121]       Tainted: G
>        OE   4.4.0-104-generic #127-Ubuntu
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271176] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271179] revalidator221  D
> ffff8826844eba48     0 3859855   2506 0x00000084
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271180]  ffff8826844eba48
> 0000000000000000 ffff884fc2e69540 ffff882758688000
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271181]  ffff8826844ec000
> ffffffffc0463564 ffff882758688000 00000000ffffffff
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271182]  ffffffffc0463568
> ffff8826844eba60 ffffffff81840765 ffffffffc0463560
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271182] Call Trace:
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271191]
> [<ffffffff81840765>] schedule+0x35/0x80
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271192]
> [<ffffffff81840a0e>] schedule_preempt_disabled+0xe/0x10
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271193]
> [<ffffffff81842649>] __mutex_lock_slowpath+0xb9/0x130
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271195]
> [<ffffffff818426df>] mutex_lock+0x1f/0x30
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271207]
> [<ffffffffc042ef56>] ovs_dp_cmd_get+0x46/0xf0 [openvswitch]
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271211]
> [<ffffffff8176d287>] genl_family_rcv_msg+0x1e7/0x3c0
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271213]
> [<ffffffff8176d460>] ? genl_family_rcv_msg+0x3c0/0x3c0
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271214]
> [<ffffffff8176d4d6>] genl_rcv_msg+0x76/0xb0
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271215]
> [<ffffffff8176a97c>] ? netlink_lookup+0xdc/0x140
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271217]
> [<ffffffff8176c9a4>] netlink_rcv_skb+0xa4/0xc0
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271218]
> [<ffffffff8176d088>] genl_rcv+0x28/0x40
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271219]
> [<ffffffff8176c37a>] netlink_unicast+0x18a/0x240
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271221]
> [<ffffffff8176c72b>] netlink_sendmsg+0x2fb/0x3a0
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271225]
> [<ffffffff813a1c11>] ? aa_sock_msg_perm+0x61/0x150
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271227]
> [<ffffffff8171bad8>] sock_sendmsg+0x38/0x50
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271228]
> [<ffffffff8171c581>] ___sys_sendmsg+0x281/0x290
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271231]
> [<ffffffff810badf7>] ? should_numa_migrate_memory+0x57/0x130
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271233]
> [<ffffffff811c2713>] ? handle_mm_fault+0xbc3/0x1820
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271235]
> [<ffffffff812101b4>] ? new_sync_read+0x94/0xd0
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271237]
> [<ffffffff8122e9e5>] ? __fget_light+0x25/0x60
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271238]
> [<ffffffff8171ced1>] __sys_sendmsg+0x51/0x90
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271239]
> [<ffffffff8171cf22>] SyS_sendmsg+0x12/0x20
> Aug 22 16:17:07 slx11c-9gjm kernel: [21177260.271240]
> [<ffffffff81844872>] entry_SYSCALL_64_fastpath+0x16/0x71
> ```
> My questions are:
>
>   1.  Why do we need to stop the openvswitch service before backing up
> the flows?
>   2.  If openvswitch is in exiting status, why can it still receive
> requests?
>   3.  Why did br0 take so long to exit?
>
> I think this is a bug in the upgrade path that needs to be fixed.
>
> Thanks
> L.J.
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>