[ovs-dev] [PATCH v3 1/4] ovn: ovn-ctl support for HA ovn DB servers

Fri Oct 14 09:52:01 UTC 2016

On Friday 14 October 2016 04:00 AM, Andy Zhou wrote:
>
>
> Done. Now it shows the following.
>
>     [root at h2 ovs]# crm configure show
>
>     node1: h1 \
>
>     attributes
>
>     node2: h2
>
>     primitiveClusterIP IPaddr2 \
>
>     paramsip=10.33.75.200cidr_netmask=32\
>
>     opstart interval=0stimeout=20s\
>
>     opstop interval=0stimeout=20s\
>
>     opmonitor interval=30s
>
>     primitiveovndb ocf:ovn:ovndb-servers \
>
>     opstart interval=0stimeout=30s\
>
>     opstop interval=0stimeout=20s\
>
>     oppromote interval=0stimeout=50s\
>
>     opdemote interval=0stimeout=50s\
>
>     opmonitor interval=1min\
>
>     meta
>
>     msovndb-master ovndb\
>
>     metanotify=true
>
>     colocationcolocation-ovndb-master-ClusterIP-INFINITY inf:
>     ovndb-master:Started ClusterIP:Master
>
>     orderorder-ClusterIP-ovndb-master-mandatory inf: ClusterIP:start
>     ovndb-master:start
>
>     propertycib-bootstrap-options: \
>
>     have-watchdog=false\
>
>     dc-version=1.1.13-10.el7_2.4-44eb2dd\
>
>     cluster-infrastructure=corosync\
>
>     cluster-name=mycluster\
>
>     stonith-enabled=false
>
>     propertyovn_ovsdb_master_server: \
>
>
>
>
> My installation does not have ocf-tester,  There is a program called 
> ocft with a test option. Not sure if this is a suitable replacement. 
> If not, how could I get
> the ocf-tester program? I ran the ocft program and get the following 
> output. Not sure what it means.
>
>  [root at h2 ovs]# ocft test -n test-ovndb -o master_ip 10.0.0.1 
> /usr/share/openvswitch/scripts/ovndb-servers.ocf
>
> ERROR: cases directory not found.
>
>

I have attached ocf-tester with this mail. I guess it's a standalone 
script. If it does not work, I think it's better not to attempt anymore 
as we have another way to find out.

>
>
>     Alternately, you can check if the ovsdb servers are started
>     properly by running
>
>     /usr/share/openvswitch/scripts/ovn-ctl --db-sb-sync-from=10.0.0.1
>     --db-nb-sync-from=10.0.0.1 start_ovsdb
>
>
> The output are as follows. Should we use --db-sb-sync-from-addr instead?
>  [root at h2 ovs]# /usr/share/openvswitch/scripts/ovn-ctl 
> --db-sb-sync-from=10.0.0.1 --db-nb-sync-from=10.0.0.1 start_ovsdb
>
> /usr/share/openvswitch/scripts/ovn-ctl: unknown option 
> "--db-sb-sync-from=10.0.0.1" (use --help for help)
>
> /usr/share/openvswitch/scripts/ovn-ctl: unknown option 
> "--db-nb-sync-from=10.0.0.1" (use --help for help)
>
> 'ovn-ctl' runs without any error message after I fixed the command 
> line parameter.

I am sorry for the misinformation, Andy. What you ran is correct. Could 
you check the status of the ovsdb servers in h2 after you run the above 
command using the following commands.

ovn-ctl status_ovnnb
ovn-ctl status_ovnsb

Both the above commands should return "running/backup". If you see in 
the OCF script in the function ovsdb_server_start(), we wait 
indefinitely till  the DB servers are started. Since the 'start' action 
on h2 times out, I doubt that the servers are not started properly.

>
>     I think this is mostly due to the crm configuration. Once you add
>     the 'ms' and 'colocation' resources, you should be able to
>     overcome this problem.
>
> No, ovndb still failed to launch on h2.
>
> [root at h2 ovs]# crm status
>
> Last updated: Thu Oct 13 11:27:42 2016Last change: Thu Oct 13 11:17:25 
> 2016 by root via cibadmin on h2
>
> Stack: corosync
>
> Current DC: h2 (version 1.1.13-10.el7_2.4-44eb2dd) - partition with quorum
>
> 2 nodes and 3 resources configured
>
>
> *Online*: [ h1 h2 ]
>
>
> Full list of resources:
>
>
> *ClusterIP*(ocf::heartbeat:IPaddr2):*Started*h1
>
> *Master*/*Slave*Set: ovndb-*master*[ovndb]
>
> *Master*s: [ h1 ]
>
> *Stopped*: [ h2 ]
>
>
> Failed Actions:
>
> * ovndb_start_0 on h2 '*unknown error*' (1): call=39, status=*Timed 
> Out*, exitreason='none',
>
>     last-rc-change='Wed Oct 12 14:43:03 2016', queued=0ms, exec=30003
>
>     I have never tried colocating two resources with the ClusterIP
>     resource. Just for testing, is it possible to drop the WebServer
>     resource?
>
> Done.  It did not make any difference that I can see.
>

-------------- next part --------------
#!/bin/sh
#
#	$Id: ocf-tester,v 1.2 2006/08/14 09:38:20 andrew Exp $
#
# Copyright (c) 2006 Novell Inc, Andrew Beekhof
#                    All Rights Reserved.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of version 2 of the GNU General Public License as
# published by the Free Software Foundation.
#
# This program is distributed in the hope that it would be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# Further, this software is distributed without any warranty that it is
# free of the rightful claim of any third person regarding infringement
# or the like.  Any license provided herein, whether implied or
# otherwise, applies only to this software file.  Patent licenses, if
# any, provided herein do not apply to combinations of this program with
# other software, or any other product whatsoever.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write the Free Software Foundation,
# Inc., 59 Temple Place - Suite 330, Boston MA 02111-1307, USA.
#

LRMD=/usr/lib/heartbeat/lrmd
LRMADMIN=/usr/sbin/lrmadmin
DATADIR=/usr/share
METADATA_LINT="xmllint --noout --valid -"

# set some common meta attributes, which are expected to be
# present by resource agents
export OCF_RESKEY_CRM_meta_timeout=20000  # 20 seconds timeout
export OCF_RESKEY_CRM_meta_interval=10000  # reset this for probes

num_errors=0

info() {
    [ "$quiet" -eq 1 ] && return
    echo "$*"
}
debug() {
    [ "$verbose" -eq 0 ] && return
    echo "$*"
}
usage() {
    # make sure to output errors on stderr
    [ "x$1" = "x0" ] || exec >&2

    echo "Tool for testing if a cluster resource is OCF compliant"
    echo ""
    echo "Usage: ocf-tester [-Lh] -n resource_name [-o name=value]* /full/path/to/resource/agent"
    echo ""
    echo "Options:"
    echo "  -h       		This text"
    echo "  -v       		Be verbose while testing"
    echo "  -q       		Be quiet while testing"
    echo "  -d       		Turn on RA debugging"
    echo "  -n name		Name of the resource"	
    echo "  -o name=value		Name and value of any parameters required by the agent"
    echo "  -L			Use lrmadmin/lrmd for tests"
    exit $1
}

assert() {
    rc=$1; shift
    target=$1; shift
    msg=$1; shift
    local targetrc matched

    if [ $# = 0 ]; then
	exit_code=0
    else
	exit_code=$1; shift
    fi

    for targetrc in `echo $target | tr ':' ' '`; do
        [ $rc -eq $targetrc ] && matched=1
    done
    if [ "$matched" != 1 ]; then
	num_errors=`expr $num_errors + 1`
	echo "* rc=$rc: $msg"
	if [ $exit_code != 0 ]; then
	    [ -n "$command_output" ] && cat<<EOF
$command_output
EOF
	    echo "Aborting tests"
	    exit $exit_code
	fi
    fi
    command_output=""
}

done=0
ra_args=""
verbose=0
quiet=0
while test "$done" = "0"; do
    case "$1" in
	-n) OCF_RESOURCE_INSTANCE=$2; ra_args="$ra_args OCF_RESOURCE_INSTANCE=$2"; shift; shift;;
	-o) name=${2%%=*}; value=${2#*=}; 
		lrm_ra_args="$lrm_ra_args $2";
		ra_args="$ra_args OCF_RESKEY_$name='$value'"; shift; shift;;
	-L) use_lrmd=1; shift;;
	-v) verbose=1; shift;;
	-d) export HA_debug=1; shift;;
	-q) quiet=1; shift;;
	-?|--help) usage 0;;
	--version) echo "UNKNOWN"; exit 0;;
	-*) echo "unknown option: $1" >&2; usage 1;;
	*) done=1;;
    esac
done

if [ "x" = "x$OCF_ROOT" ]; then
    if [ -d /usr/lib/ocf ]; then
	export OCF_ROOT=/usr/lib/ocf
    else
	echo "You must supply the location of OCF_ROOT (common location is /usr/lib/ocf)" >&2
	usage 1
    fi
fi

if [ "x" = "x$OCF_RESOURCE_INSTANCE" ]; then
    echo "You must give your resource a name, set OCF_RESOURCE_INSTANCE" >&2
    usage 1
fi

agent=$1
if [ ! -e $agent ]; then
    echo "You must provide the full path to your resource agent" >&2
    usage 1
fi
installed_rc=5
stopped_rc=7
has_demote=1
has_promote=1

start_lrmd() {
	lrmd_timeout=0
	lrmd_interval=0
	lrmd_target_rc=EVERYTIME
	lrmd_started=""
	$LRMD -s 2>/dev/null
	rc=$?
	if [ $rc -eq 3 ]; then
		lrmd_started=1
		$LRMD &
		sleep 1
		$LRMD -s 2>/dev/null
	else
		return $rc
	fi
}
add_resource() {
	$LRMADMIN -A $OCF_RESOURCE_INSTANCE \
		ocf \
		`basename $agent` \
		$(basename `dirname $agent`) \
		$lrm_ra_args > /dev/null
}
del_resource() {
	$LRMADMIN -D $OCF_RESOURCE_INSTANCE
}
parse_lrmadmin_output() {
	awk '
BEGIN{ rc=1; }
/Waiting for lrmd to callback.../ { n=1; next; }
n==1 && /----------------operation--------------/ { n++; next; }
n==2 && /return code:/ { rc=$0; sub("return code: *","",rc); next }
n==2 && /---------------------------------------/ {
        n++;
        next;
}
END{
	if( n!=3 ) exit 1;
	else exit rc;
}
'
}
exec_resource() {
	op="$1"
	args="$2"
	$LRMADMIN -E $OCF_RESOURCE_INSTANCE \
		$op $lrmd_timeout $lrmd_interval \
		$lrmd_target_rc \
		$args | parse_lrmadmin_output
}

if [ "$use_lrmd" = 1 ]; then
	echo "Using lrmd/lrmadmin for all tests"
	start_lrmd || {
		echo "could not start lrmd" >&2
		exit 1
	}
	trap '
		[ "$lrmd_started" = 1 ] && $LRMD -k
	' EXIT
	add_resource || {
		echo "failed to add resource to lrmd" >&2
		exit 1
	}
fi

lrm_test_command() {
	action="$1"
	msg="$2"
	debug "$msg"
	exec_resource $action "$lrm_ra_args"
}

test_permissions() {
    action=meta-data
    debug ${1:-"Testing permissions with uid nobody"}
    su nobody -s /bin/sh $agent $action > /dev/null
}

test_metadata() {
    action=meta-data
    msg=${1:-"Testing: $action"}
    debug $msg
    bash $agent $action | (cd $DATADIR/resource-agents && $METADATA_LINT)
    rc=$?
    #echo rc: $rc
    return $rc
}

test_command() {
    action=$1; shift
    export __OCF_ACTION=$action
    msg=${1:-"Testing: $action"}
    if [ "$use_lrmd" = 1 ]; then
    	lrm_test_command $action "$msg"
    	return $?
    fi
    #echo Running: "export $ra_args; bash $agent $action 2>&1 > /dev/null"
    if [ $verbose -eq 0 ]; then
	command_output=`bash $agent $action 2>&1`
    else
    	debug $msg
	bash $agent $action
    fi
    rc=$?
    #echo rc: $rc
    return $rc
}

# Begin tests
info "Beginning tests for $agent..."

if [ ! -f $agent ]; then
    assert 7 0 "Could not find file: $agent"
fi

if [ `id -u` = 0 ]; then
	test_permissions
	assert $? 0 "Your agent has too restrictive permissions: should be 755"
else
	echo "WARN: Can't check agent's permissions because we're not root; they should be 755"
fi

test_metadata
assert $? 0 "Your agent produces meta-data which does not conform to ra-api-1.dtd"

OCF_TESTER_FAIL_HAVE_BINARY=1
export OCF_TESTER_FAIL_HAVE_BINARY
test_command meta-data
rc=$?
if [ $rc -eq 3 ]; then
    assert $rc 0 "Your agent does not support the meta-data action"
else
    assert $rc 0 "The meta-data action cannot fail and must return 0"
fi
unset OCF_TESTER_FAIL_HAVE_BINARY

ra_args="export $ra_args"
eval $ra_args
test_command validate-all
rc=$?
if [ $rc -eq 3 ]; then
    assert $rc 0 "Your agent does not support the validate-all action"
elif [ $rc -ne 0 ]; then
    assert $rc 0 "Validation failed.  Did you supply enough options with -o ?" 1
    usage $rc
fi

test_command monitor "Checking current state"
rc=$?
if [ $rc -eq 3 ]; then
    assert $rc 7 "Your agent does not support the monitor action" 1

elif [ $rc -eq 8 ]; then
    test_command demote "Cleanup, demote"
    assert $? 0 "Your agent was a master and could not be demoted" 1

    test_command stop "Cleanup, stop"
    assert $? 0 "Your agent was a master and could not be stopped" 1

elif [ $rc -ne 7 ]; then
    test_command stop
    assert $? 0 "Your agent was active and could not be stopped" 1
fi

test_command monitor
assert $? $stopped_rc "Monitoring a stopped resource should return $stopped_rc"

OCF_TESTER_FAIL_HAVE_BINARY=1
export OCF_TESTER_FAIL_HAVE_BINARY
OCF_RESKEY_CRM_meta_interval=0
test_command monitor
assert $? $stopped_rc:$installed_rc "The initial probe for a stopped resource should return $stopped_rc or $installed_rc even if all binaries are missing"
unset OCF_TESTER_FAIL_HAVE_BINARY
OCF_RESKEY_CRM_meta_interval=20000

test_command start
assert $? 0 "Start failed.  Did you supply enough options with -o ?" 1

test_command monitor
assert $? 0 "Monitoring an active resource should return 0"

OCF_RESKEY_CRM_meta_interval=0
test_command monitor
assert $? 0 "Probing an active resource should return 0"
OCF_RESKEY_CRM_meta_interval=20000

test_command notify
rc=$?
if [ $rc -eq 3 ]; then
    info "* Your agent does not support the notify action (optional)"
else
    assert $rc 0 "The notify action cannot fail and must return 0"
fi

test_command demote "Checking for demote action"
if [ $? -eq 3 ]; then
    has_demote=0
    info "* Your agent does not support the demote action (optional)"
fi

test_command promote "Checking for promote action"
if [ $? -eq 3 ]; then
    has_promote=0
    info "* Your agent does not support the promote action (optional)"
fi

if [ $has_promote -eq 1 -a $has_demote -eq 1 ]; then
    test_command demote "Testing: demotion of started resource"
    assert $? 0 "Demoting a start resource should not fail"

    test_command promote
    assert $? 0 "Promote failed"

    test_command demote
    assert $? 0 "Demote failed" 1

    test_command demote "Testing: demotion of demoted resource"
    assert $? 0 "Demoting a demoted resource should not fail"

    test_command promote "Promoting resource"
    assert $? 0 "Promote failed" 1

    test_command promote "Testing: promotion of promoted resource"
    assert $? 0 "Promoting a promoted resource should not fail"

    test_command demote "Demoting resource"
    assert $? 0 "Demote failed" 1

elif [ $has_promote -eq 0 -a $has_demote -eq 0 ]; then
    info "* Your agent does not support master/slave (optional)"

else
    echo "* Your agent partially supports master/slave"
    num_errors=`expr $num_errors + 1`
fi

test_command stop
assert $? 0 "Stop failed" 1

test_command monitor
assert $? $stopped_rc "Monitoring a stopped resource should return $stopped_rc"

test_command start "Restarting resource..."
assert $? 0 "Start failed" 1

test_command monitor
assert $? 0 "Monitoring an active resource should return 0"

test_command start "Testing: starting a started resource"
assert $? 0 "Starting a running resource is required to succeed"

test_command monitor
assert $? 0 "Monitoring an active resource should return 0"

test_command stop "Stopping resource"
assert $? 0 "Stop could not clean up after multiple starts" 1

test_command monitor
assert $? $stopped_rc "Monitoring a stopped resource should return $stopped_rc"

test_command stop "Testing: stopping a stopped resource"
assert $? 0 "Stopping a stopped resource is required to succeed"

test_command monitor
assert $? $stopped_rc "Monitoring a stopped resource should return $stopped_rc"

test_command migrate_to "Checking for migrate_to action"
rc=$?
if [ $rc -ne 3 ]; then
    test_command migrate_from "Checking for migrate_from action"
fi
if [ $? -eq 3 ]; then
    info "* Your agent does not support the migrate action (optional)"
fi

test_command reload "Checking for reload action"
if [ $? -eq 3 ]; then
    info "* Your agent does not support the reload action (optional)"
fi

if [ $num_errors -gt 0 ]; then
    echo "Tests failed: $agent failed $num_errors tests" >&2
    exit 1
else 
    echo $agent passed all tests
    exit 0
fi

# vim:et:ts=8:sw=4