[ovs-discuss] OVN Controller - High Availability via Pacemaker

Stephen Flynn sflynn at staff.atlantic.net
Wed May 1 16:21:36 UTC 2019


Numan:

Thank you very much.  The problem turned out to be that the northd service was not started/running.

/// BEFORE ///

pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
    master_ip=192.168.100.10 \
    ovn_ctl=/usr/share/openvswitch/scripts/ovn-ctl \
    op monitor interval="10s" \
    op monitor role=Master interval="15s"

# ovn-sbctl show
[[ no output ]]



/// CURRENT ///
pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
    master_ip=192.168.100.10 \
    manage_northd="yes" \
    ovn_ctl=/usr/share/openvswitch/scripts/ovn-ctl \
    op monitor interval="10s" \
    op monitor role=Master interval="15s"

# ovn-sbctl show
Chassis "3fd25b76-3170-4eab-8604-690182500478"
   hostname: "lab-vxlan-cn03"
    Encap geneve
        ip: "192.168.100.78"
        options: {csum="true"}
Chassis "bda632c5-afeb-41bd-80e5-5c423172a771"
    hostname: "lab-vxlan-cn02"
    Encap geneve
        ip: "192.168.100.77"
        options: {csum="true"}
Chassis "5bd199bf-9e3e-41b1-bdb0-b1fab1adff7c"
    hostname: "lab-vxlan-cn04"
    Encap geneve
        ip: "192.168.100.79"
        options: {csum="true"}
Chassis "0bd6c91a-10d3-4a1f-86bf-bdeb4bf110a3"
    hostname: "lab-vxlan-cn01"
    Encap geneve
        ip: "192.168.100.76"
        options: {csum="true"}
    Port_Binding "9de72cf5-cbb2-4ebe-9c89-3962a29ed869"   <<<<< VM Port Binding is active.


Now to test multi-vm and failover scenarios.
I have also corrected my documentation regarding your statement about “OVN Controller” vs “OVN DB” – thank you for making that clarification.
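
For the failover test, the plan is roughly the following -- a sketch, using standby rather than a hard kill (newer pcs versions use "pcs node standby" instead):

# push the master role and VIP off the current master, then watch them move
pcs cluster standby ctrl02
pcs status
ovn-sbctl --db=tcp:192.168.100.10:6642 show    # reachable via the virtual IP
# bring the node back into the cluster
pcs cluster unstandby ctrl02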

Regards,

Stephen Flynn

From: Numan Siddique <nusiddiq at redhat.com>
Sent: Wednesday, May 1, 2019 12:03 PM
To: Stephen Flynn <sflynn at staff.atlantic.net>
Cc: ovs-discuss at openvswitch.org
Subject: Re: [ovs-discuss] OVN Controller - High Availability via Pacemaker



On Wed, May 1, 2019 at 9:07 PM Numan Siddique <nusiddiq at redhat.com> wrote:

Hi Stephen,

Your setup seems fine to me. Please see below for some comments.


On Wed, May 1, 2019 at 8:23 PM Stephen Flynn via discuss <ovs-discuss at openvswitch.org> wrote:
Greetings OVS Discuss Group:

First – I want to apologize for the wall of text to follow – I wasn’t sure how much information would be wanted and I didn’t want everyone to have to ask 100 questions to get what they needed.

I am working on setting up an OVN Controller cluster (3 nodes) using Pacemaker / Corosync.   For reference -- a single-node controller (no Pacemaker / Corosync) operates without issue.
Root Goal:   Failover redundancy for the OVN Controller – allows for maintenance of controller nodes and/or failure of a controller node.

Just a small correction - Pacemaker is used to provide active/passive HA for the OVN DB servers. Referring to this as the "OVN Controller" is a bit confusing, since there is a service
called ovn-controller (provided by the ovn-host package) which runs on each host/hypervisor.



I have been following the very limited documentation on how to set up this environment, but I don't seem to be having much luck getting it 100% stable or operational.
http://docs.openvswitch.org/en/latest/topics/integration/?highlight=pacemaker

If I could ask someone to assist in reviewing the installation below and provide some insight into what I may be doing wrong, or have configured wrong (need a newer version of the code, etc.), I would be grateful.  I've currently only attempted this using "packaged" versions of the code to "keep it simple" … but I do realize that getting this to work may require a newer code release.
Along with this, once I am able to get a stable environment, I would like to contribute updated documentation to the community on how to perform a full setup.


-- Environment –

# cat /etc/lsb-release | grep DESCRIPTION
DISTRIB_DESCRIPTION="Ubuntu 18.04.2 LTS"

# ovn-northd --version
ovn-northd (Open vSwitch) 2.9.2

# ovn-nbctl --version
ovn-nbctl (Open vSwitch) 2.9.2
DB Schema 5.10.0

# ovn-sbctl --version
ovn-sbctl (Open vSwitch) 2.9.2
DB Schema 1.15.0

-- Setup Steps –
# cat /etc/hosts

# LAB - Compute Nodes
192.168.100.10  ctrl00
192.168.100.11  ctrl01
192.168.100.12  ctrl02
192.168.100.13  ctrl03
192.168.100.76  cn01
192.168.100.77  cn02
192.168.100.78  cn03
192.168.100.79  cn04


### All Controllers

## System Package Updates
apt clean all; apt update; apt -y dist-upgrade

## Time Sync Services (NTP) [ctrl01, ctrl02, ctrl03]
apt install -y ntp

## Install OVN Central Controller [ctrl01, ctrl02, ctrl03]
apt install -y openvswitch-common openvswitch-switch python-openvswitch python3-openvswitch ovn-common ovn-central

## Install pacemaker and corosync [ctrl01, ctrl02, ctrl03]
apt install -y pcs pacemaker pacemaker-cli-utils

## Reset the Pacemaker Password [ctrl01, ctrl02, ctrl03]
Password = 6B43WAmuPzM2Ewsr

enc_passwd=$(python3 -c 'import crypt; print(crypt.crypt("6B43WAmuPzM2Ewsr", crypt.mksalt(crypt.METHOD_SHA512)))')
usermod -p "${enc_passwd}" hacluster

## Enable and Start the PCS Daemon [ctrl01, ctrl02, ctrl03]
sudo systemctl enable pcsd;
sudo systemctl start pcsd;
sudo systemctl status pcsd;

### Cluster Controller #1 (ONLY)

## Enable Cluster Services [ctrl01 --ONLY-- ]

pcs cluster auth ctrl01 ctrl02 ctrl03 -u hacluster -p '6B43WAmuPzM2Ewsr' --force
pcs cluster setup --name OVN-CLUSTER ctrl01 ctrl02 ctrl03 --force

pcs cluster enable --all;
pcs cluster start --all;

pcs property set stonith-enabled=false
pcs property set no-quorum-policy=ignore

pcs status

[[ output ]]
Cluster name: OVN-CLUSTER
Stack: corosync
Current DC: ctrl01 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Wed May  1 09:42:33 2019
Last change: Wed May  1 09:42:29 2019 by root via cibadmin on ctrl01

3 nodes configured
0 resources configured

Online: [ ctrl01 ctrl02 ctrl03 ]

No resources


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled



## Add cluster resources [ctrl01 --ONLY-- ]
pcs resource create ovn-virtual-ip ocf:heartbeat:IPaddr2 nic=ens192 ip=192.168.100.10 cidr_netmask=24 op monitor interval=30s

pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
    master_ip=192.168.100.10 \
    ovn_ctl=/usr/share/openvswitch/scripts/ovn-ctl \
    op monitor interval="10s" \
    op monitor role=Master interval="15s"

pcs resource master ovndb_servers-master ovndb_servers \
    meta notify="true"

pcs constraint order promote ovndb_servers-master then ovn-virtual-ip
pcs constraint colocation add ovn-virtual-ip with master ovndb_servers-master score=INFINITY

pcs status

[[ output ]]
Cluster name: OVN-CLUSTER
Stack: corosync
Current DC: ctrl01 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Wed May  1 09:46:02 2019
Last change: Wed May  1 09:44:53 2019 by root via crm_attribute on ctrl02

3 nodes configured
4 resources configured

Online: [ ctrl01 ctrl02 ctrl03 ]

Full list of resources:

ovn-virtual-ip (ocf::heartbeat:IPaddr2):       Started ctrl02
Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ ctrl02 ]
     Slaves: [ ctrl01 ctrl03 ]

Failed Actions:
* ovndb_servers_monitor_10000 on ctrl01 'master' (8): call=18, status=complete, exitreason='',
    last-rc-change='Wed May  1 09:43:28 2019', queued=0ms, exec=73ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


/////

At this point, I execute ‘pcs cluster stop {{ controller }}; pcs cluster start {{ controller }}’ for each controller one at a time and eventually everything clears up.
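
(Roughly, something along these lines, waiting for the cluster to settle between nodes:)

for node in ctrl01 ctrl02 ctrl03; do
    pcs cluster stop "${node}"
    pcs cluster start "${node}"
    sleep 30   # give pacemaker time to settle before moving on
done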

# pcs status
Cluster name: OVN-CLUSTER
Stack: corosync
Current DC: ctrl02 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Wed May  1 09:47:50 2019
Last change: Wed May  1 09:44:53 2019 by root via crm_attribute on ctrl02

3 nodes configured
4 resources configured

Online: [ ctrl01 ctrl02 ctrl03 ]

Full list of resources:

ovn-virtual-ip (ocf::heartbeat:IPaddr2):       Started ctrl02
Master/Slave Set: ovndb_servers-master [ovndb_servers]
     Masters: [ ctrl02 ]
     Slaves: [ ctrl01 ctrl03 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

/////

lab-kvmctrl-01:~# ovn-sbctl show
Chassis "3fd25b76-3170-4eab-8604-690182500478"
    hostname: "lab-vxlan-cn03"
    Encap geneve
        ip: "192.168.100.78"
        options: {csum="true"}
Chassis "0bd6c91a-10d3-4a1f-86bf-bdeb4bf110a3"
    hostname: "lab-vxlan-cn01"
    Encap geneve
        ip: "192.168.100.76"
        options: {csum="true"}
Chassis "bda632c5-afeb-41bd-80e5-5c423172a771"
    hostname: "lab-vxlan-cn02"
    Encap geneve
        ip: "192.168.100.77"
        options: {csum="true"}
Chassis "5bd199bf-9e3e-41b1-bdb0-b1fab1adff7c"
    hostname: "lab-vxlan-cn04"
    Encap geneve
        ip: "192.168.100.79"
        options: {csum="true"}

/////

Now I turn up a VM on CN01 and it connects to ‘br-int’ for one of the interfaces.
OVN_NB has the logical switch configured, and the port assignment.
OVN_SB never receives the port state ‘online’ from the CN.
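
(One thing worth confirming on CN01 is that libvirt stamped the OVN logical port id onto the OVS interface -- a sketch using the interface and port names shown below:)

# on cn01: the iface-id must match the logical switch port name in the NB DB
ovs-vsctl get Interface 525401c1d4e1 external_ids:iface-id
# expected: "9de72cf5-cbb2-4ebe-9c89-3962a29ed869"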

# ovs-vsctl show
aca349bd-6a23-47f6-98da-a35773753858
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port "ovn-bda632-0"
            Interface "ovn-bda632-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.100.77"}
        Port "ovn-3fd25b-0"
            Interface "ovn-3fd25b-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.100.78"}
        Port "ovn-5bd199-0"
            Interface "ovn-5bd199-0"
                type: geneve
                options: {csum="true", key=flow, remote_ip="192.168.100.79"}
        Port "525401c1d4e1"  <<<<<<<<<  VM PORT
            Interface "525401c1d4e1" <<<<<<<<<  VM PORT

/////
VM DOMXML – INTERFACE
/////
    <interface type='bridge'>
      <mac address='52:54:01:c1:d4:e1'/>
      <source bridge='br-int'/>
      <virtualport type='openvswitch'>
        <parameters interfaceid='9de72cf5-cbb2-4ebe-9c89-3962a29ed869'/>
      </virtualport>
      <target dev='525401c1d4e1'/>
      <model type='virtio'/>
      <alias name='net1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </interface>


# ovn-nbctl show
switch 9f87e014-4b3d-40e4-9a42-9f0b28957c05 (ls_1234)
    port 9de72cf5-cbb2-4ebe-9c89-3962a29ed869
        addresses: ["52:54:01:c1:d4:e1"]

# ovn-sbctl show
Chassis "3fd25b76-3170-4eab-8604-690182500478"
    hostname: "lab-vxlan-cn03"
    Encap geneve
        ip: "192.168.100.78"
        options: {csum="true"}
Chassis "5bd199bf-9e3e-41b1-bdb0-b1fab1adff7c"
    hostname: "lab-vxlan-cn04"
    Encap geneve
        ip: "192.168.100.79"
        options: {csum="true"}
Chassis "bda632c5-afeb-41bd-80e5-5c423172a771"
    hostname: "lab-vxlan-cn02"
    Encap geneve
        ip: "192.168.100.77"
        options: {csum="true"}
Chassis "0bd6c91a-10d3-4a1f-86bf-bdeb4bf110a3"
    hostname: "lab-vxlan-cn01"
    Encap geneve
        ip: "192.168.100.76"
        options: {csum="true"}

lab-vxlan-cn01# ovs-vsctl list open_vswitch | grep external_ids
external_ids        : {hostname="lab-vxlan-cn01", ovn-encap-ip="192.168.100.76", ovn-encap-type=geneve, ovn-nb="tcp:192.168.100.10:6641", ovn-remote="tcp:192.168.100.10:6642", rundir="/var/run/openvswitch", system-id="0bd6c91a-10d3-4a1f-86bf-bdeb4bf110a3"}
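
(For reference, those external_ids were set on each compute node roughly as follows -- the encap IP differs per node:)

ovs-vsctl set open_vswitch . \
    external_ids:ovn-remote="tcp:192.168.100.10:6642" \
    external_ids:ovn-encap-type=geneve \
    external_ids:ovn-encap-ip=192.168.100.76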



If you are familiar with Puppet, you can refer to this [1], which creates the ocf:ovn:ovndb-servers resource. But the steps you provided above to set up the pacemaker cluster
seem fine to me.

[1] - https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp


You can do a few things to check whether your setup is fine:

1. Check that the ovn-controllers on nodes cn[01-04] are able to communicate with the OVN Southbound DB.
On the master node you can delete a chassis, e.g. "ovn-sbctl chassis-del 3fd25b76-3170-4eab-8604-690182500478",
and then run "ovn-sbctl show". If the chassis record for cn03 reappears, then it's fine.

2. Run "ovn-nbctl --db=tcp:192.168.100.10:6641 show" to make sure you are able to talk to the OVN DB servers.

3. Check the ovn-controller.log on CN01 to see whether it has claimed the port. In the logs you should see "Claiming port ...."
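
For example, something like this (assuming the default log location on Ubuntu):

grep -i "claiming" /var/log/openvswitch/ovn-controller.log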


It looks to me like ovn-northd is not running on the master controller node, ctrl02.

Run "ovn-sbctl list port_binding". If this is empty then ovn-northd is not running.

In order for the OVN OCF pacemaker script to start ovn-northd on the master node you need to pass
manage_northd="yes".

Hope this helps.

Thanks


Thanks
Numan

/////
Netstat output from “master” controller
/////

# netstat -antp | grep -v sshd | grep -v WAIT
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:2224            0.0.0.0:*               LISTEN      1885/ruby
tcp        0      0 192.168.100.10:6641     0.0.0.0:*               LISTEN      3493/ovsdb-server
tcp        0      0 192.168.100.10:6642     0.0.0.0:*               LISTEN      3503/ovsdb-server
tcp        0      0 192.168.100.12:48362    192.168.100.12:2224     ESTABLISHED 1885/ruby
tcp        0      0 192.168.100.12:2224     192.168.100.11:58166    ESTABLISHED 1885/ruby
tcp        0      0 192.168.100.10:6642     192.168.100.13:34044    ESTABLISHED 3503/ovsdb-server
tcp        0      0 192.168.100.12:55480    192.168.100.13:2224     ESTABLISHED 1885/ruby
tcp        0      0 192.168.100.12:2224     192.168.100.12:48362    ESTABLISHED 1885/ruby
tcp        0      0 192.168.100.12:2224     192.168.100.13:43488    ESTABLISHED 1885/ruby
tcp        0      0 192.168.100.10:6641     192.168.100.13:34148    ESTABLISHED 3493/ovsdb-server
tcp        0      0 192.168.100.10:6642     192.168.100.79:47974    ESTABLISHED 3503/ovsdb-server
tcp        0      0 192.168.100.10:6641     192.168.100.11:60226    ESTABLISHED 3493/ovsdb-server
tcp        0      0 192.168.100.10:6642     192.168.100.76:55570    ESTABLISHED 3503/ovsdb-server
tcp        0      0 192.168.100.10:6642     192.168.100.78:36428    ESTABLISHED 3503/ovsdb-server
tcp        0      0 192.168.100.10:6642     192.168.100.11:40682    ESTABLISHED 3503/ovsdb-server
tcp        0      0 192.168.100.10:6642     192.168.100.77:58772    ESTABLISHED 3503/ovsdb-server
tcp        0      0 192.168.100.12:41974    192.168.100.11:2224     ESTABLISHED 1885/ruby
tcp6       0      0 :::2224                 :::*                    LISTEN      1885/ruby



Regards,

Stephen Flynn

_______________________________________________
discuss mailing list
discuss at openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss