[ovs-dev] [PATCH] ovs-lib: Handle daemon segfaults during exit.

Gurucharan Shetty guru at ovn.org
Fri Sep 18 22:33:21 UTC 2020

Currently, we terminate a daemon by trying
"ovs-appctl exit", "SIGTERM" and finally "SIGKILL".
But this logic fails if the daemon crashes (segfaults)
during "ovs-appctl exit". The monitor automatically
restarts the daemon with a new pid. The current check for
the nonexistence of the old pid then succeeds, and we proceed
on the assumption that the daemon is dead.

This is a problem during OVS upgrades as we will continue
to run the older version of OVS.

This commit handles that situation. If there is a segfault,
the pidfile is not deleted. So, when the old pid is gone but
the pidfile still exists, we wait 2 seconds to give the monitor
time to restart the daemon (which is usually instantaneous)
and then re-read the pidfile.
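
The re-check described above can be sketched in isolation as follows.
This is only an illustration, not the code in the patch: the helper
names pid_alive and recheck_pidfile are hypothetical, and "kill -0"
is assumed as the liveness test in place of ovs-lib's pid_exists.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of ovs-lib.

# True if a process with the given pid exists and can be signalled.
pid_alive () {
    [ -n "$1" ] && kill -0 "$1" 2>/dev/null
}

# Usage: recheck_pidfile PIDFILE [WAIT_SECONDS]
# Call this once the old pid has disappeared.
# Prints the pid of a restarted daemon and returns 0 if one is found;
# returns 1 if the daemon appears to be gone for good.
recheck_pidfile () {
    pidfile=$1
    wait_secs=${2:-2}

    # A clean exit removes the pidfile; a crash leaves it behind.
    [ -e "$pidfile" ] || return 1

    # Give the monitor time to restart the daemon and rewrite the pidfile.
    sleep "$wait_secs"

    new_pid=$(cat "$pidfile" 2>/dev/null) || return 1
    if pid_alive "$new_pid"; then
        echo "$new_pid"
        return 0
    fi
    return 1
}
```

A caller would continue the stop sequence against the printed pid
whenever recheck_pidfile returns 0, and treat the daemon as stopped
otherwise.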

VMware-BZ: #2633995
Signed-off-by: Gurucharan Shetty <guru at ovn.org>
---
 utilities/ovs-lib.in | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/utilities/ovs-lib.in b/utilities/ovs-lib.in
index d646b44..f7e9756 100644
--- a/utilities/ovs-lib.in
+++ b/utilities/ovs-lib.in
@@ -255,20 +255,36 @@ stop_daemon () {
             if version_geq "$version" "2.5.90"; then
                 actions="$graceful $actions"
+            actiontype=""
             for action in $actions; do
                 if pid_exists "$pid" >/dev/null 2>&1; then :; else
-                    return 0
+                    # pid does not exist.
+                    if [ -n "$actiontype" ]; then
+                        return 0
+                    fi
+                    # But, does the file exist? We may have had a daemon
+                    # segfault with `ovs-appctl exit`. Check one more time
+                    # before deciding that the daemon is dead.
+                    [ -e "$rundir/$1.pid" ] && sleep 2 && pid=`cat "$rundir/$1.pid"` 2>/dev/null
+                    if pid_exists "$pid" >/dev/null 2>&1; then :; else
+                        return 0
+                    fi
                 case $action in
                         action "Exiting $1 ($pid)" \
                             ${bindir}/ovs-appctl -T 1 -t $rundir/$1.$pid.ctl exit $2
+                        # The above command could have resulted in delayed
+                        # daemon segfault. And if a monitor is running, it
+                        # would restart the daemon giving it a new pid.
                         action "Killing $1 ($pid)" kill $pid
+                        actiontype="force"
                         action "Killing $1 ($pid) with SIGKILL" kill -9 $pid
+                        actiontype="force"
                         log_failure_msg "Killing $1 ($pid) failed"