[ovs-discuss] Question to OVN DB pacemaker script

aginwala aginwala at asu.edu
Fri May 11 21:25:49 UTC 2018


Thanks, Han, for the additional suggestions.


I tested failover by gracefully stopping pacemaker+corosync on the master
node, and also with crm move. Both work as expected: crm move triggers
promotion of the new master, and the old master is demoted to a slave that
listens on the sync-from node. Hence the code change I posted earlier is
sufficient.

# crm stat
Stack: corosync
Current DC: test-pace1-2365293 (version 1.1.14-70404b0) - partition with
quorum
2 nodes and 2 resources configured

Online: [ test-pace1-2365293 test-pace2-2365308 ]

Full list of resources:

 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ test-pace2-2365308 ]
     Slaves: [ test-pace1-2365293 ]

#crm --debug resource move ovndb_servers test-pace1-2365293
DEBUG: pacemaker version: [err: ][out: CRM Version: 1.1.14 (70404b0)]
DEBUG: found pacemaker version: 1.1.14
DEBUG: invoke: crm_resource --quiet --move -r 'ovndb_servers'
--node='test-pace1-2365293'
# crm stat

Stack: corosync
Current DC: test-pace1-2365293 (version 1.1.14-70404b0) - partition with
quorum
2 nodes and 2 resources configured

Online: [ test-pace1-2365293 test-pace2-2365308 ]

Full list of resources:

 Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ test-pace1-2365293 ]
     Slaves: [ test-pace2-2365308 ]

Failed Actions:
* ovndb_servers_monitor_10000 on test-pace2-2365308 'master' (8): call=46,
status=complete, exitreason='none',
    last-rc-change='Fri May 11 14:08:35 2018', queued=0ms, exec=83ms

Note: The Failed Actions warning appears only with the crm move command, not
with reboot/kill or service pacemaker/corosync stop/start.

I cleaned up the warning using the command below:
#crm_resource -P
Waiting for 1 replies from the CRMd. OK
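
For scripted failover checks, the promoted node can be read out of the crm
stat output instead of being checked by eye. This is a minimal sketch that
assumes the "Masters: [ ... ]" line layout shown in the output above;
current_masters is a hypothetical helper name, not part of pacemaker or of
the OCF script.

```shell
# Print the node(s) listed on the "Masters:" line of `crm stat` output
# read from stdin. The line layout is assumed from the output shown above.
current_masters() {
    sed -n 's/^[[:space:]]*Masters: \[ \(.*\) ]$/\1/p'
}

# Hypothetical usage on a cluster node:
#   crm stat | current_masters
```

Comparing the result before and after crm resource move gives a simple
pass/fail signal for the graceful failover test.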

I also want to call out that, per the pacemaker logs, ocf_attribute_target is
not getting called. The code says it will not work with older pacemaker
versions, but I am not sure which versions exactly; I am on version 1.1.14.
# pacemaker logs
 notice: operation_finished: ovndb_servers_monitor_10000:7561:stderr [
/usr/lib/ocf/resource.d/ovn/ovndb-servers: line 31: ocf_attribute_target:
command not found ]
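
The "command not found" message means the function is not defined by the OCF
shell function library on this node. One possible guard is sketched below,
under the assumption that `uname -n` matches the pacemaker node name on the
cluster; local_node_name is a hypothetical helper, not something the current
script contains.

```shell
# Hypothetical helper: use ocf_attribute_target when the sourced OCF shell
# functions provide it, otherwise fall back to the local node name.
# Assumption: `uname -n` matches the pacemaker node name on this cluster.
local_node_name() {
    if command -v ocf_attribute_target >/dev/null 2>&1; then
        ocf_attribute_target
    else
        uname -n
    fi
}
```

With a guard like this the monitor action would not print the error on
installations where the function is missing.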


# Also, the nb db logs are showing socket_util errors. I think this needs a
code change too, to skip logging them, since the functionality still works as
expected (maybe in a separate commit since it's an ovsdb change)
2018-05-11T21:14:25.958Z|00560|socket_util|ERR|6641:10.149.4.252: bind:
Cannot assign requested address
2018-05-11T21:14:25.958Z|00561|socket_util|ERR|6641:10.149.4.252: bind:
Cannot assign requested address
2018-05-11T21:14:27.859Z|00562|socket_util|ERR|6641:10.149.4.252: bind:
Cannot assign requested address



Let me know if you have any further suggestions.


Regards,
Aliasgar


On Thu, May 10, 2018 at 3:49 PM, Han Zhou <zhouhan at gmail.com> wrote:

> Good progress!
>
> I think at least one more change is needed to ensure when demote happens,
> the TCP port is shut down. Otherwise, the LB will be confused again and
> can't figure out which one is active. This is the graceful failover
> scenario which can be tested by crm resource move instead of reboot/killing
> process.
>
> This may be done by the same approach you did for promote, i.e. stop ovsdb
> and then call ovsdb_server_start() so the parameters are reset correctly
> before starting. Alternatively we can add a command in ovsdb-server, in
> addition to the commands that switch to/from active/backup modes, to
> open/close the TCP ports, to avoid restarting during failover, but I am not
> sure if this is valuable. It depends on whether restarting ovsdb-server
> during failover is sufficient. Could you add the restart logic for
> demote and try more? Thanks!
>
> Thanks,
> Han
>
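
The restart-on-demote idea above could look roughly like the sketch below. It
mirrors the two lines added to ovsdb_server_promote() in the diff quoted
further down. restart_for_demote is a hypothetical helper name, and where
exactly it would be called inside the demote path is not settled in this
thread.

```shell
# Hypothetical helper mirroring the promote hunk quoted below: stop ovsdb so
# that ovsdb_server_start() can rebuild its arguments for the backup role.
# Assumes OVN_CTL and ovsdb_server_start are defined as in ovndb-servers.ocf.
restart_for_demote() {
    ${OVN_CTL} stop_ovsdb
    ovsdb_server_start
}
```

Since ovsdb_server_start() already passes --db-nb-create-insecure-remote=no
and --db-sb-create-insecure-remote=no when another master is present, the
restart should leave the TCP ports closed on the demoted node.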
> On Thu, May 10, 2018 at 1:54 PM, aginwala <aginwala at asu.edu> wrote:
>
>> Hi,
>>
>> Just to update further: I am able to re-open the tcp port in the failover
>> scenario when the new master is promoted, with the additional code changes
>> below. They do require stopping the ovs service on the newly selected master
>> to reset the tcp settings:
>>
>>
>> diff --git a/ovn/utilities/ovndb-servers.ocf
>> b/ovn/utilities/ovndb-servers.ocf
>> index 164b6bc..8cb4c25 100755
>> --- a/ovn/utilities/ovndb-servers.ocf
>> +++ b/ovn/utilities/ovndb-servers.ocf
>> @@ -295,8 +295,8 @@ ovsdb_server_start() {
>>
>>      set ${OVN_CTL}
>>
>> -    set $@ --db-nb-addr=${MASTER_IP} --db-nb-port=${NB_MASTER_PORT}
>> -    set $@ --db-sb-addr=${MASTER_IP} --db-sb-port=${SB_MASTER_PORT}
>> +    set $@ --db-nb-port=${NB_MASTER_PORT}
>> +    set $@ --db-sb-port=${SB_MASTER_PORT}
>>
>>      if [ "x${NB_MASTER_PROTO}" = xtcp ]; then
>>          set $@ --db-nb-create-insecure-remote=yes
>> @@ -307,6 +307,8 @@ ovsdb_server_start() {
>>      fi
>>
>>      if [ "x${present_master}" = x ]; then
>> +        set $@ --db-nb-create-insecure-remote=yes
>> +        set $@ --db-sb-create-insecure-remote=yes
>>          # No master detected, or the previous master is not among the
>>          # set starting.
>>          #
>> @@ -316,6 +318,8 @@ ovsdb_server_start() {
>>          set $@ --db-nb-sync-from-addr=${INVALID_IP_ADDRESS}
>> --db-sb-sync-from-addr=${INVALID_IP_ADDRESS}
>>
>>      elif [ ${present_master} != ${host_name} ]; then
>> +        set $@ --db-nb-create-insecure-remote=no
>> +        set $@ --db-sb-create-insecure-remote=no
>>          # An existing master is active, connect to it
>>          set $@ --db-nb-sync-from-addr=${MASTER_IP}
>> --db-sb-sync-from-addr=${MASTER_IP}
>>          set $@ --db-nb-sync-from-port=${NB_MASTER_PORT}
>> @@ -416,6 +420,8 @@ ovsdb_server_promote() {
>>              ;;
>>      esac
>>
>> +    ${OVN_CTL} stop_ovsdb
>> +    ovsdb_server_start
>>      ${OVN_CTL} promote_ovnnb
>>      ${OVN_CTL} promote_ovnsb
>>
>>
>>
>> Below are the scenarios tested:
>>
>> Master | Slave | Scenario       | Result
>> -      | -     | reboot/failure | New master gets promoted with tcp ports
>>        |       |                | enabled to start taking LB traffic.
>> -      | -     | reboot/failure | No change and current master continues
>>        |       |                | taking traffic with slave continuing to
>>        |       |                | sync from master.
>> -      | -     | reboot/failure | New master gets promoted with tcp ports
>>        |       |                | enabled to start taking LB traffic.
>>
>>
>> Also sync on slaves from master works as expected:
>> # On master
>> ovn-nbctl --db=tcp:10.169.129.33:6641 ls-add  556
>> # on slave port is shutdown as expected
>> ovn-nbctl --db=tcp:10.169.129.34:6641 show
>> ovn-nbctl: tcp:10.169.129.34:6641: database connection failed
>> (Connection refused)
>> # on slave local unix socket, above lswitch 556 gets replicated too as
>> --sync-from=tcp:10.149.4.252:6641
>> ovn-nbctl show
>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>>
>> # Same testing for sb db too
>> # Slave port 6642 is shut down too; the command below hangs
>> ovn-sbctl --db=tcp:10.169.129.34:6642 show
>> # Using master ip works
>>  ovn-sbctl --db=tcp:10.169.129.33:6642 show
>> Chassis "21f12bd6-e9e8-4ee2-afeb-28b331df6715"
>>     hostname: "test-pace2-2365308.lvs02.dev.ebayc3.com"
>>     Encap geneve
>>         ip: "10.169.129.34"
>>         options: {csum="true"}
>>
>>
>>
>> # Accessing via LB vip works fine too as only one member is active:
>> for i in `seq 1 500`; do ovn-sbctl --db=tcp:10.149.4.252:6642 show; done
>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>> switch 2bd07b67-fd6b-401d-9612-da75e8f9ffc8 (556)
>>
>>
>> Everything works as expected. Let me know if I missed any corner case.
>> I will submit a formal patch that uses LISTEN_ON_MASTER_IP_ONLY for the
>> LB-with-tcp case, to avoid breaking existing functionality.
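
A rough sketch of how such a switch might select the listen options is shown
here. listen_opts is a hypothetical helper, and the "yes"/"no" spelling of the
LISTEN_ON_MASTER_IP_ONLY value is an assumption; only the --db-nb-addr,
--db-nb-port, --db-sb-addr and --db-sb-port options come from the diffs in
this thread.

```shell
# Hypothetical helper: print the ovn-ctl listen options for a given setting.
#   $1  LISTEN_ON_MASTER_IP_ONLY value ("yes" or "no"; spelling is assumed)
#   $2  master IP (the LB VIP in this setup)
#   $3  NB port
#   $4  SB port
listen_opts() {
    if [ "x$1" = xno ]; then
        # LB case: omit the address so ovsdb-server listens on 0.0.0.0.
        echo "--db-nb-port=$3 --db-sb-port=$4"
    else
        # Default: keep the existing behavior and bind to the master IP.
        echo "--db-nb-addr=$2 --db-nb-port=$3 --db-sb-addr=$2 --db-sb-port=$4"
    fi
}
```

Inside ovsdb_server_start() the result could be appended with something like
set $@ $(listen_opts ...), so that any value other than "no" keeps today's
behavior.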
>>
>>
>>
>> Regards,
>> Aliasgar
>>
>>
>>
>> On Thu, May 10, 2018 at 9:55 AM, aginwala <aginwala at asu.edu> wrote:
>>
>>> Thanks, folks, for the suggestions.
>>>
>>> For the LB vip configuration, I did further testing and yes, it does
>>> try to hit the slave db, as per the logs below. It fails because the slave
>>> does not have write permission, which the LB is not aware of:
>>> for i in `seq 1 500`; do ovn-nbctl --db=tcp:10.149.4.252:6641 ls-add
>>> ${i}590;done
>>> ovn-nbctl: transaction error: {"details":"insert operation not allowed
>>> when database server is in read only mode","error":"not allowed"}
>>> ovn-nbctl: transaction error: {"details":"insert operation not allowed
>>> when database server is in read only mode","error":"not allowed"}
>>> ovn-nbctl: transaction error: {"details":"insert operation not allowed
>>> when database server is in read only mode","error":"not allowed"}
>>>
>>> Hence, with a few more code changes (in the same patch, without the flag
>>> variable suggestion), I am able to shut down the tcp port on the slave, and
>>> it works fine as below:
>>> #Master Node
>>> # ovn-nbctl --db=tcp:10.169.129.33:6641 ls-add test444
>>> #Slave Node
>>> # ovn-nbctl --db=tcp:10.169.129.34:6641 ls-add test444
>>> ovn-nbctl: tcp:10.169.129.34:6641: database connection failed
>>> (Connection refused)
>>>
>>> Code to shut down the tcp port on the slave db, so that only the master
>>> listens on tcp ports:
>>> diff --git a/ovn/utilities/ovndb-servers.ocf
>>> b/ovn/utilities/ovndb-servers.ocf
>>> index 164b6bc..b265df6 100755
>>> --- a/ovn/utilities/ovndb-servers.ocf
>>> +++ b/ovn/utilities/ovndb-servers.ocf
>>> @@ -295,8 +295,8 @@ ovsdb_server_start() {
>>>
>>>      set ${OVN_CTL}
>>>
>>> -    set $@ --db-nb-addr=${MASTER_IP} --db-nb-port=${NB_MASTER_PORT}
>>> -    set $@ --db-sb-addr=${MASTER_IP} --db-sb-port=${SB_MASTER_PORT}
>>> +    set $@ --db-nb-port=${NB_MASTER_PORT}
>>> +    set $@ --db-sb-port=${SB_MASTER_PORT}
>>>
>>>      if [ "x${NB_MASTER_PROTO}" = xtcp ]; then
>>>          set $@ --db-nb-create-insecure-remote=yes
>>> @@ -307,6 +307,8 @@ ovsdb_server_start() {
>>>      fi
>>>
>>>      if [ "x${present_master}" = x ]; then
>>> +        set $@ --db-nb-create-insecure-remote=yes
>>> +        set $@ --db-sb-create-insecure-remote=yes
>>>          # No master detected, or the previous master is not among the
>>>          # set starting.
>>>          #
>>> @@ -316,6 +318,8 @@ ovsdb_server_start() {
>>>          set $@ --db-nb-sync-from-addr=${INVALID_IP_ADDRESS}
>>> --db-sb-sync-from-addr=${INVALID_IP_ADDR
>>>
>>>      elif [ ${present_master} != ${host_name} ]; then
>>> +        set $@ --db-nb-create-insecure-remote=no
>>> +        set $@ --db-sb-create-insecure-remote=no
>>>
>>>
>>> But I noticed that if the slave becomes active after failover following an
>>> active node reboot/failure, pacemaker shows it online but I am not able to
>>> access the dbs.
>>>
>>> # crm status
>>> Online: [ test-pace2-2365308 ]
>>> OFFLINE: [ test-pace1-2365293 ]
>>>
>>> Full list of resources:
>>>
>>>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>>>      Masters: [ test-pace2-2365308 ]
>>>      Stopped: [ test-pace1-2365293 ]
>>>
>>>
>>> # ovn-nbctl --db=tcp:10.169.129.33:6641 ls-add test444
>>> ovn-nbctl: tcp:10.169.129.33:6641: database connection failed
>>> (Connection refused)
>>> # ovn-nbctl --db=tcp:10.169.129.34:6641 ls-add test444
>>> ovn-nbctl: tcp:10.169.129.34:6641: database connection failed
>>> (Connection refused)
>>>
>>> Hence, if failover happens, slave is already running with
>>> --sync-from=lbVIP:6641/6642 for nb and sb db respectively. Thus, re-opening
>>> of tcp ports for nb and sb db on the slave that is getting promoted to
>>> master is not happening automatically.
>>>
>>> Let me know if there is a valid approach that I am missing to handle this
>>> in the slave promote logic. I will make further code changes accordingly.
>>>
>>> Note: The current code changes for use with an LB will need to be handled
>>> for ssl too. I will handle that separately; I want to get tcp working
>>> first, and we can add ssl support later.
>>>
>>>
>>> Regards,
>>> Aliasgar
>>>
>>> On Wed, May 9, 2018 at 12:19 PM, Numan Siddique <nusiddiq at redhat.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, May 10, 2018 at 12:44 AM, Han Zhou <zhouhan at gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, May 9, 2018 at 11:51 AM, Numan Siddique <nusiddiq at redhat.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 10, 2018 at 12:15 AM, Han Zhou <zhouhan at gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Ali for the quick patch. Please see my comments inline.
>>>>>>>
>>>>>>> On Wed, May 9, 2018 at 9:30 AM, aginwala <aginwala at asu.edu> wrote:
>>>>>>> >
>>>>>>> > Thanks Han and Numan for the clarity to help sort it out.
>>>>>>> >
>>>>>>> > To make the vip work with an LB in my two-node setup, I changed the
>>>>>>> > code below to skip setting the master IP when creating the pcs
>>>>>>> > resource for ovndbs and listen on 0.0.0.0 instead. Hence, the
>>>>>>> > discussion seems in line with the code change, which is small:
>>>>>>> >
>>>>>>> >
>>>>>>> > diff --git a/ovn/utilities/ovndb-servers.ocf
>>>>>>> b/ovn/utilities/ovndb-servers.ocf
>>>>>>> > index 164b6bc..d4c9ad7 100755
>>>>>>> > --- a/ovn/utilities/ovndb-servers.ocf
>>>>>>> > +++ b/ovn/utilities/ovndb-servers.ocf
>>>>>>> > @@ -295,8 +295,8 @@ ovsdb_server_start() {
>>>>>>> >
>>>>>>> >      set ${OVN_CTL}
>>>>>>> >
>>>>>>> > -    set $@ --db-nb-addr=${MASTER_IP}
>>>>>>> --db-nb-port=${NB_MASTER_PORT}
>>>>>>> > -    set $@ --db-sb-addr=${MASTER_IP}
>>>>>>> --db-sb-port=${SB_MASTER_PORT}
>>>>>>> > +    set $@ --db-nb-port=${NB_MASTER_PORT}
>>>>>>> > +    set $@ --db-sb-port=${SB_MASTER_PORT}
>>>>>>> >
>>>>>>> >      if [ "x${NB_MASTER_PROTO}" = xtcp ]; then
>>>>>>> >          set $@ --db-nb-create-insecure-remote=yes
>>>>>>> >
>>>>>>>
>>>>>>> This change solves the IP binding problem. It will just listen on
>>>>>>> 0.0.0.0.
>>>>>>>
>>>>>>
>>>>>> One problem I see with this approach is that it would listen on all
>>>>>> the IPs. Maybe that's not a good idea and may have some security issues.
>>>>>>
>>>>>> Can we instead check the value of the MASTER_IP param, something like
>>>>>> below?
>>>>>>
>>>>>>  if [ "$MASTER_IP" == "0.0.0.0" ]; then
>>>>>>      set $@ --db-nb-addr=${MASTER_IP} --db-nb-port=${NB_MASTER_PORT}
>>>>>>      set $@ --db-sb-addr=${MASTER_IP} --db-sb-port=${SB_MASTER_PORT}
>>>>>> else
>>>>>>      set $@ --db-nb-port=${NB_MASTER_PORT}
>>>>>>      set $@ --db-sb-port=${SB_MASTER_PORT}
>>>>>> fi
>>>>>>
>>>>>> And when you create the OVN pacemaker resource in your deployment, you
>>>>>> can pass master_ip=0.0.0.0
>>>>>>
>>>>>> Will this work?
>>>>>>
>>>>>>
>>>>> Maybe some misunderstanding here. We still need to use master_ip = LB
>>>>> VIP, so that the standby nodes can "sync-from" the active node. So we
>>>>> cannot pass 0.0.0.0 explicitly.
>>>>>
>>>>
>>>> I misunderstood earlier. I thought you wouldn't need master ip at all.
>>>> Thanks for the clarification.
>>>>
>>>>>
>>>>> I didn't understand your code above either. Why would we specify the
>>>>> master_ip if we know it is 0.0.0.0? Or do you mean the other way around but
>>>>> just a typo in the code?
>>>>>
>>>>> As for the security of listening on any IP, I am not quite sure. It may
>>>>> be a problem if the nodes sit on multiple networks, some of them are
>>>>> considered insecure, and you want to listen on the secure one only. If
>>>>> this is the concern, we can add a parameter e.g. LISTEN_ON_MASTER_IP_ONLY,
>>>>> and set it to true by default. What do you think?
>>>>>
>>>>
>>>> I would prefer adding the parameter as you have suggested so that the
>>>> existing behavior remains intact.
>>>>
>>>> Thanks
>>>> Numan
>>>>
>>>>
>>>>> Thanks,
>>>>> Han
>>>>>
>>>>>
>>>>
>>>
>>
>