[ovs-dev] [PATCH] ovn pacemaker: Fix the promotion issue in other cluster nodes when the master node is reset

Numan Siddique nusiddiq at redhat.com
Fri May 18 06:23:22 UTC 2018


On Fri, May 18, 2018 at 4:24 AM, aginwala <aginwala at asu.edu> wrote:

> Hi:
>
> I tried it and it didn't help; the IP resource always shows as stopped.
> My private VIP is 192.168.220.108.
> # kernel panic on  active node
> root at test7:~# echo c > /proc/sysrq-trigger
>
>
> root at test6:~# crm stat
> Last updated: Thu May 17 22:46:38 2018 Last change: Thu May 17 22:45:03
> 2018 by root via cibadmin on test6
> Stack: corosync
> Current DC: test7 (version 1.1.14-70404b0) - partition with quorum
> 2 nodes and 3 resources configured
>
> Online: [ test6 test7 ]
>
> Full list of resources:
>
>  VirtualIP (ocf::heartbeat:IPaddr2): Started test7
>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>      Masters: [ test7 ]
>      Slaves: [ test6 ]
>
> root at test6:~# crm stat
> Last updated: Thu May 17 22:46:38 2018 Last change: Thu May 17 22:45:03
> 2018 by root via cibadmin on test6
> Stack: corosync
> Current DC: test6 (version 1.1.14-70404b0) - partition WITHOUT quorum
> 2 nodes and 3 resources configured
>
> Online: [ test6 ]
> OFFLINE: [ test7 ]
>
> Full list of resources:
>
>  VirtualIP (ocf::heartbeat:IPaddr2): Stopped
>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>      Slaves: [ test6 ]
>      Stopped: [ test7 ]
>
> root at test6:~# crm stat
> Last updated: Thu May 17 22:49:26 2018 Last change: Thu May 17 22:45:03
> 2018 by root via cibadmin on test6
> Stack: corosync
> Current DC: test6 (version 1.1.14-70404b0) - partition WITHOUT quorum
> 2 nodes and 3 resources configured
>
> Online: [ test6 ]
> OFFLINE: [ test7 ]
>
> Full list of resources:
>
>  VirtualIP (ocf::heartbeat:IPaddr2): Stopped
>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>      Stopped: [ test6 test7 ]
>
> I think this change is not needed, or something else is wrong when
> using the virtual IP resource.
>

Hi Aliasgar, I think you haven't created the resource properly, or
haven't set the colocation constraints properly. What pcs/crm commands
did you use to create the OVN db resources?
Can you share the output of "pcs resource show ovndb_servers" and "pcs
constraint"?
In the case of tripleo we create the resource like this -
https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L80
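
For reference, a rough sketch of the kind of pcs commands I mean,
modeled on the tripleo manifest above (the resource names, VIP address
and ports are placeholders matching your setup, not commands we ship):

  # VIP managed by IPaddr2 (placeholder address)
  pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \
      ip=192.168.220.108 op monitor interval=30s

  # OVN db servers as a master/slave resource; master_ip must match
  # the VIP above
  pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
      master_ip=192.168.220.108 manage_northd=yes \
      nb_master_port=6641 sb_master_port=6642 \
      master

  # Keep the VIP on the promoted node, and only start it after the
  # promotion has happened
  pcs constraint colocation add VirtualIP with master ovndb_servers-master
  pcs constraint order promote ovndb_servers-master then start VirtualIP

Without the colocation constraint, pacemaker is free to place the VIP
and the ovndb_servers master on different nodes.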


>
> Maybe you need promotion logic similar to what we have for LB with
> pacemaker in the discussion (will submit a formal patch soon). I did
> test a kernel panic with the LB code change and it works fine; node2
> gets promoted. The below works fine for LB even with a kernel panic,
> without this change:
>

This issue is not seen all the time. I have another setup where I don't
see it at all. The issue is seen when the IPaddr2 resource is moved to
another slave node and the ovsdb-servers start reporting as master as
soon as the IP address is configured.

When the issue is seen, we hit the code here -
https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L412.
Ideally, when the promote action is called, the ovsdb servers are
running as slaves/standby, and the promote action promotes them to
master. But when the issue occurs, the ovsdb servers report their
status as active. Because of this, we don't complete the full promote
action and return at L412. Later, when the notify action is called, we
demote the servers because of this -
https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L176
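
To make that concrete, this is the shape of the pre-patch promote logic
(paraphrased from ovndb-servers.ocf, not a verbatim excerpt):

  ovsdb_server_promote() {
      ovsdb_server_check_status
      rc=$?
      case $rc in
          ${OCF_SUCCESS}) ;;            # standby, as expected: proceed
          ${OCF_RUNNING_MASTER})        # already active: bail out early,
              return ${OCF_SUCCESS};;   # so the master score below is
                                        # never recorded
          *)
              ovsdb_server_master_update $OCF_RUNNING_MASTER
              return ${rc}
              ;;
      esac
      ...
      # only reached when we fall through the case statement above:
      ${CRM_MASTER} -N ${host_name} -v ${master_score}
  }

Because the early return skips the $CRM_MASTER call, the notify action
later finds a last-known master that doesn't match this node and
demotes the servers again.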

For a use case like yours (where a load balancer VIP is used), you may
not see this issue at all, since you will not be using the IPaddr2
resource as the master IP.




> root at test-pace1-2365293:~# echo c > /proc/sysrq-trigger
> root at test-pace2-2365308:~# crm stat
> Last updated: Thu May 17 15:15:45 2018 Last change: Wed May 16 23:10:52
> 2018 by root via cibadmin on test-pace2-2365308
> Stack: corosync
> Current DC: test-pace1-2365293 (version 1.1.14-70404b0) - partition with
> quorum
> 2 nodes and 2 resources configured
>
> Online: [ test-pace1-2365293 test-pace2-2365308 ]
>
> Full list of resources:
>
>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>      Masters: [ test-pace1-2365293 ]
>      Slaves: [ test-pace2-2365308 ]
>
> root at test-pace2-2365308:~# crm stat
> Last updated: Thu May 17 15:15:45 2018 Last change: Wed May 16 23:10:52
> 2018 by root via cibadmin on test-pace2-2365308
> Stack: corosync
> Current DC: test-pace2-2365308 (version 1.1.14-70404b0) - partition
> WITHOUT quorum
> 2 nodes and 2 resources configured
>
> Online: [ test-pace2-2365308 ]
> OFFLINE: [ test-pace1-2365293 ]
>
> Full list of resources:
>
>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>      Slaves: [ test-pace2-2365308 ]
>      Stopped: [ test-pace1-2365293 ]
>
> root at test-pace2-2365308:~# ps aux | grep ovs
> root     15175  0.0  0.0  18048   372 ?        Ss   15:15   0:00
> ovsdb-server: monitoring pid 15176 (healthy)
> root     15176  0.0  0.0  18312  4096 ?        S    15:15   0:00
> ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/
> openvswitch/ovsdb-server-nb.log --remote=punix:/var/run/openvswitch/ovnnb_db.sock
> --pidfile=/var/run/openvswitch/ovnnb_db.pid --unixctl=ovnnb_db.ctl
> --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections
> --private-key=db:OVN_Northbound,SSL,private_key --certificate=db:OVN_Northbound,SSL,certificate
> --ca-cert=db:OVN_Northbound,SSL,ca_cert --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols
> --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers
> --remote=ptcp:6641:0.0.0.0 --sync-from=tcp:192.0.2.254:6641
> /etc/openvswitch/ovnnb_db.db
> root     15184  0.0  0.0  18048   376 ?        Ss   15:15   0:00
> ovsdb-server: monitoring pid 15185 (healthy)
> root     15185  0.0  0.0  18300  4480 ?        S    15:15   0:00
> ovsdb-server -vconsole:off -vfile:info --log-file=/var/log/
> openvswitch/ovsdb-server-sb.log --remote=punix:/var/run/openvswitch/ovnsb_db.sock
> --pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl
> --detach --monitor --remote=db:OVN_Southbound,SB_Global,connections
> --private-key=db:OVN_Southbound,SSL,private_key --certificate=db:OVN_Southbound,SSL,certificate
> --ca-cert=db:OVN_Southbound,SSL,ca_cert --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols
> --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers
> --remote=ptcp:6642:0.0.0.0 --sync-from=tcp:192.0.2.254:6642
> /etc/openvswitch/ovnsb_db.db
> root     15398  0.0  0.0  12940   972 pts/0    S+   15:15   0:00 grep
> --color=auto ovs
>
> >>> I just want to point out that I am also seeing the errors below
> when setting the target with the master IP using the IPaddr2 resource:
> 2018-05-17T21:58:51.889Z|00011|ovsdb_jsonrpc_server|ERR|ptcp:6641:
> 192.168.220.108: listen failed: Cannot assign requested address
> 2018-05-17T21:58:51.889Z|00012|socket_util|ERR|6641:192.168.220.108:
> bind: Cannot assign requested address
> That needs to be handled too, since the existing code does throw this
> error! The error goes away only if I skip setting the target.
>

In the case of tripleo, we handle this error by setting the sysctl
value net.ipv4.ip_nonlocal_bind to 1 -
https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L67
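
For example (the sysctl.d file name here is just a convention):

  # Allow ovsdb-server to bind the master IP before IPaddr2 has
  # actually configured it on this node.
  sysctl -w net.ipv4.ip_nonlocal_bind=1

  # Persist across reboots.
  echo 'net.ipv4.ip_nonlocal_bind = 1' > /etc/sysctl.d/99-ovn-nonlocal-bind.conf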



>
>
>
> Regards,
> Aliasgar
>
>
> On Thu, May 17, 2018 at 3:04 AM, <nusiddiq at redhat.com> wrote:
>
>> From: Numan Siddique <nusiddiq at redhat.com>
>>
>> When a node 'A' in the pacemaker cluster running the OVN db servers as
>> master is brought down ungracefully ('echo b > /proc/sysrq-trigger'
>> for example), pacemaker is not able to promote any other node in the
>> cluster to master. When pacemaker selects a node 'B', for instance, to
>> promote, it moves the IPaddr2 resource (i.e. the master IP) to node
>> 'B'. As soon as the node is configured with the IP address, when the
>> issue is seen, the OVN db servers which were running as standby
>> earlier transition to active. Ideally this should not happen; the
>> ovsdb-servers are expected to remain in standby until they are
>> promoted. (This needs separate investigation.) When pacemaker calls
>> the OVN OCF script's promote action, the ovsdb_server_promote function
>> returns almost immediately without recording the present master.
>> Later, in the notify action, it demotes the OVN db servers back since
>> the last known master doesn't match node 'B's hostname. This results
>> in pacemaker promoting/demoting in a loop.
>>
>> This patch fixes the issue by not returning immediately when the
>> promote action is called and the OVN db servers are running as active.
>> Now it continues with the ovsdb_server_promote function and records
>> the new master by setting the proper master score
>> ($CRM_MASTER -N $host_name -v ${master_score}).
>>
>> This issue is not seen when a node is brought down gracefully, since
>> pacemaker calls the stop, start and then promote actions before
>> promoting a node. It is not clear why pacemaker doesn't call the stop,
>> start and promote actions when a node is reset ungracefully.
>>
>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1579025
>> Signed-off-by: Numan Siddique <nusiddiq at redhat.com>
>> ---
>>  ovn/utilities/ovndb-servers.ocf | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/ovn/utilities/ovndb-servers.ocf
>> b/ovn/utilities/ovndb-servers.ocf
>> index 164b6bce6..23dc70056 100755
>> --- a/ovn/utilities/ovndb-servers.ocf
>> +++ b/ovn/utilities/ovndb-servers.ocf
>> @@ -409,7 +409,7 @@ ovsdb_server_promote() {
>>      rc=$?
>>      case $rc in
>>          ${OCF_SUCCESS}) ;;
>> -        ${OCF_RUNNING_MASTER}) return ${OCF_SUCCESS};;
>> +        ${OCF_RUNNING_MASTER}) ;;
>>          *)
>>              ovsdb_server_master_update $OCF_RUNNING_MASTER
>>              return ${rc}
>> --
>> 2.17.0
>>
>> _______________________________________________
>> dev mailing list
>> dev at openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
>
>

