[ovs-discuss] raft ovsdb clustering

aginwala aginwala at asu.edu
Wed Mar 21 16:49:58 UTC 2018


Thanks Numan:

Yup, agreed on the locking part. For now, yes, I am running northd on one
node. I might write a script to monitor northd in the cluster so that if the
node where it is running goes down, the script can spin up northd on one of
the other active nodes as a dirty hack.
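
A minimal sketch of what such a watchdog could look like. It assumes
passwordless ssh between the nodes and pgrep on each of them; the function
names and the NODES / NORTHD_START_CMD variables are hypothetical, and the
node IPs are just the ones from my test setup. It is not a real lock: two
nodes checking at the same moment could still both start northd.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: try to keep ovn-northd running on one node.
# Assumptions (not part of OVS/OVN): passwordless ssh, pgrep on every node.

OVN_CTL=${OVN_CTL:-/usr/share/openvswitch/scripts/ovn-ctl}
NODES=${NODES:-"10.169.125.152 10.169.125.131 10.148.181.162"}
# Command used to start northd locally; adjust the options for your setup.
NORTHD_START_CMD=${NORTHD_START_CMD:-"$OVN_CTL start_northd"}

# Return 0 if an ovn-northd process is found on the given node.
northd_running_on() {
    ssh -o ConnectTimeout=5 "$1" pgrep -x ovn-northd >/dev/null 2>&1
}

# Print the first node in NODES that reports a running ovn-northd.
# Returns 1 if no node reports one (or none is reachable).
find_active_node() {
    for node in $NODES; do
        if northd_running_on "$node"; then
            echo "$node"
            return 0
        fi
    done
    return 1
}

# Start northd locally only when no node reports it running.
ensure_one_northd() {
    if active=$(find_active_node); then
        echo "northd active on $active"
    else
        echo "no active northd found, starting locally"
        $NORTHD_START_CMD
    fi
}
```

On each node this could be driven from cron or a simple loop such as
`while :; do ensure_one_northd; sleep 10; done`. Depending on the ovn-ctl
version, start_northd may need extra options so that it does not also try
to manage the local ovsdb-servers.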

Sure, I will wait for Ben's input on this too and see how complex it
would be to roll out this feature.


Regards,


On Wed, Mar 21, 2018 at 5:43 AM, Numan Siddique <nusiddiq at redhat.com> wrote:

> Hi Aliasgar,
>
> ovsdb-server maintains locks per connection and not across the db. A
> workaround for you now would be to configure all the ovn-northd instances
> to connect to one ovsdb-server if you want to have active/standby.
>
> Probably Ben can answer if there is a plan to support ovsdb locks across
> the db. We also need this support in networking-ovn as it also uses ovsdb
> locks.
>
> Thanks
> Numan
>
>
> On Wed, Mar 21, 2018 at 1:40 PM, aginwala <aginwala at asu.edu> wrote:
>
>> Hi Numan:
>>
>> As I continued testing, I figured out that ovn-northd is running as
>> active on all 3 nodes instead of as one active instance, which results
>> in db errors as per the logs.
>>
>>
>> # on node 3, I run ovn-nbctl ls-add ls2; it produces the logs below in
>> ovn-northd
>> 2018-03-21T06:01:59.442Z|00007|ovsdb_idl|WARN|transaction error:
>> {"details":"Transaction causes multiple rows in \"Datapath_Binding\" table
>> to have identical values (1) for index on column \"tunnel_key\".  First
>> row, with UUID 8c5d9342-2b90-4229-8ea1-001a733a915c, was inserted by
>> this transaction.  Second row, with UUID 8e06f919-4cc7-4ffc-9a79-20ce6663b683,
>> existed in the database before this transaction and was not modified by the
>> transaction.","error":"constraint violation"}
>>
>> In the southbound datapath list, 2 duplicate records get created for the
>> same switch.
>>
>> # ovn-sbctl list Datapath
>> _uuid               : b270ae30-3458-445f-95d2-b14e8ebddd01
>> external_ids        : {logical-switch="4d6674e3-ff9f-4f38-b050-0fa9bec9e34d",
>> name="ls2"}
>> tunnel_key          : 2
>>
>> _uuid               : 8e06f919-4cc7-4ffc-9a79-20ce6663b683
>> external_ids        : {logical-switch="4d6674e3-ff9f-4f38-b050-0fa9bec9e34d",
>> name="ls2"}
>> tunnel_key          : 1
>>
>>
>>
>> # on nodes 1 and 2 where northd is running, it gives below error:
>> 2018-03-21T06:01:59.437Z|00008|ovsdb_idl|WARN|transaction error:
>> {"details":"cannot delete Datapath_Binding row
>> 8e06f919-4cc7-4ffc-9a79-20ce6663b683 because of 17 remaining
>> reference(s)","error":"referential integrity violation"}
>>
>> As per commit message, for northd I re-tried setting
>> --ovnnb-db="tcp:10.169.125.152:6641,tcp:10.169.125.131:6641,tcp:10.148.181.162:6641"
>> and
>> --ovnsb-db="tcp:10.169.125.152:6642,tcp:10.169.125.131:6642,tcp:10.148.181.162:6642"
>> and it did not help either.
>>
>> There is no issue if I keep running only one instance of northd on any of
>> these 3 nodes. Hence, I wanted to know whether something else is missing
>> here to make only one northd instance active and the rest standby.
>>
>>
>> Regards,
>>
>> On Thu, Mar 15, 2018 at 3:09 AM, Numan Siddique <nusiddiq at redhat.com>
>> wrote:
>>
>>> That's great
>>>
>>> Numan
>>>
>>>
>>> On Thu, Mar 15, 2018 at 2:57 AM, aginwala <aginwala at asu.edu> wrote:
>>>
>>>> Hi Numan:
>>>>
>>>> I tried on new nodes (kernel: 4.4.0-104-generic, Ubuntu 16.04) with a
>>>> fresh installation and it worked super fine for both sb and nb dbs. It
>>>> seems there was some kernel issue on the previous nodes when I
>>>> re-installed the raft patch, as I was running a different ovs version
>>>> on those nodes before.
>>>>
>>>>
>>>> For 2 HVs, I now set
>>>> ovn-remote="tcp:10.169.125.152:6642, tcp:10.169.125.131:6642, tcp:10.148.181.162:6642"
>>>> and started controller and it works super fine.
>>>>
>>>>
>>>> Did some failover testing by rebooting/killing the leader
>>>> (10.169.125.152) and bringing it back up and it works as expected.
>>>> Nothing weird noted so far.
>>>>
>>>> # check-cluster gives the data below on one of the nodes
>>>> (10.148.181.162) post leader failure
>>>>
>>>> ovsdb-tool check-cluster /etc/openvswitch/ovnsb_db.db
>>>> ovsdb-tool: leader /etc/openvswitch/ovnsb_db.db for term 2 has log
>>>> entries only up to index 18446744073709551615, but index 9 was committed in
>>>> a previous term (e.g. by /etc/openvswitch/ovnsb_db.db)
>>>>
>>>>
>>>> For check-cluster, are we planning to add more output showing which
>>>> node is active(leader), etc in upcoming versions ?
>>>>
>>>>
>>>> Thanks a ton for helping sort this out.  I think the patch looks good
>>>> to be merged once Justin's comments are addressed, along with the man
>>>> page details for ovsdb-tool.
>>>>
>>>>
>>>> I will do some more crash testing for the cluster along with the scale
>>>> test and keep you posted if something unexpected is noted.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>>
>>>>
>>>> On Tue, Mar 13, 2018 at 11:07 PM, Numan Siddique <nusiddiq at redhat.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 14, 2018 at 7:51 AM, aginwala <aginwala at asu.edu> wrote:
>>>>>
>>>>>> Sure.
>>>>>>
>>>>>> To add on, I also ran it for the nb db using a different port, and
>>>>>> Node 2 crashes with the same error:
>>>>>> # Node 2
>>>>>> /usr/share/openvswitch/scripts/ovn-ctl --db-nb-addr=10.99.152.138
>>>>>> --db-nb-port=6641 --db-nb-cluster-remote-addr="tcp:10.99.152.148:6645"
>>>>>> --db-nb-cluster-local-addr="tcp:10.99.152.138:6645" start_nb_ovsdb
>>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnnb_db.db: cannot
>>>>>> identify file type
>>>>>>
>>>>>>
>>>>>>
>>>>> Hi Aliasgar,
>>>>>
>>>>> It worked for me. Can you delete the old db files in /etc/openvswitch/
>>>>> and try running the commands again ?
>>>>>
>>>>> Below are the commands I ran in my setup.
>>>>>
>>>>> Node 1
>>>>> -------
>>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl
>>>>> --db-sb-addr=192.168.121.91 --db-sb-port=6642 --db-sb-create-insecure-remote=yes
>>>>> --db-sb-cluster-local-addr=tcp:192.168.121.91:6644 start_sb_ovsdb
>>>>>
>>>>> Node 2
>>>>> ---------
>>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl
>>>>> --db-sb-addr=192.168.121.87 --db-sb-port=6642 --db-sb-create-insecure-remote=yes
>>>>> --db-sb-cluster-local-addr="tcp:192.168.121.87:6644"
>>>>> --db-sb-cluster-remote-addr="tcp:192.168.121.91:6644"  start_sb_ovsdb
>>>>>
>>>>> Node 3
>>>>> ---------
>>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl
>>>>> --db-sb-addr=192.168.121.78 --db-sb-port=6642 --db-sb-create-insecure-remote=yes
>>>>> --db-sb-cluster-local-addr="tcp:192.168.121.78:6644"
>>>>> --db-sb-cluster-remote-addr="tcp:192.168.121.91:6644"  start_sb_ovsdb
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>> Numan
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> On Tue, Mar 13, 2018 at 9:40 AM, Numan Siddique <nusiddiq at redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 13, 2018 at 9:46 PM, aginwala <aginwala at asu.edu> wrote:
>>>>>>>
>>>>>>>> Thanks Numan for the response.
>>>>>>>>
>>>>>>>> There is no command start_cluster_sb_ovsdb in the source code either.
>>>>>>>> Is that in a separate commit somewhere? Hence, I used start_sb_ovsdb,
>>>>>>>> which I think might not be the right choice?
>>>>>>>>
>>>>>>>
>>>>>>> Sorry, I meant start_sb_ovsdb. Strange that it didn't work for you.
>>>>>>> Let me try it out again and update this thread.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Numan
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> # Node1  came up as expected.
>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.148 --db-sb-port=6642
>>>>>>>> --db-sb-create-insecure-remote=yes
>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.148:6644" start_sb_ovsdb
>>>>>>>>
>>>>>>>> # verifying its a clustered db with ovsdb-tool db-local-address
>>>>>>>> /etc/openvswitch/ovnsb_db.db
>>>>>>>> tcp:10.99.152.148:6644
>>>>>>>> # ovn-sbctl show works fine and chassis are being populated
>>>>>>>> correctly.
>>>>>>>>
>>>>>>>> #Node 2 fails with error:
>>>>>>>> /usr/share/openvswitch/scripts/ovn-ctl --db-sb-addr=10.99.152.138
>>>>>>>> --db-sb-port=6642 --db-sb-create-insecure-remote=yes
>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644"
>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644" start_sb_ovsdb
>>>>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnsb_db.db: cannot
>>>>>>>> identify file type
>>>>>>>>
>>>>>>>> # So I started the sb db the usual way using start_ovsdb just to
>>>>>>>> get the db file created, killed the sb pid, and re-ran the command,
>>>>>>>> which gave the actual error: it complains about the join-cluster
>>>>>>>> command that is called internally
>>>>>>>> /usr/share/openvswitch/scripts/ovn-ctl --db-sb-addr=10.99.152.138
>>>>>>>> --db-sb-port=6642 --db-sb-create-insecure-remote=yes
>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644"
>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644" start_sb_ovsdb
>>>>>>>> ovsdb-tool: /etc/openvswitch/ovnsb_db.db: not a clustered database
>>>>>>>>  * Backing up database to /etc/openvswitch/ovnsb_db.db.backup1.15.0-70426956
>>>>>>>> ovsdb-tool: 'join-cluster' command requires at least 4 arguments
>>>>>>>>  * Creating cluster database /etc/openvswitch/ovnsb_db.db from
>>>>>>>> existing one
>>>>>>>>
>>>>>>>>
>>>>>>>> # based on the above error, I killed the sb db pid again, tried to
>>>>>>>> create a local cluster on the node, and then re-ran the join operation
>>>>>>>> as per the source code function.
>>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db OVN_Southbound
>>>>>>>> tcp:10.99.152.138:6644 tcp:10.99.152.148:6644 which still complains
>>>>>>>> ovsdb-tool: I/O error: /etc/openvswitch/ovnsb_db.db: create failed
>>>>>>>> (File exists)
>>>>>>>>
>>>>>>>>
>>>>>>>> # Node 3: I did not try as I am assuming the same failure as node 2
>>>>>>>>
>>>>>>>>
>>>>>>>> Let me know how to proceed further.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 13, 2018 at 3:08 AM, Numan Siddique <
>>>>>>>> nusiddiq at redhat.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Aliasgar,
>>>>>>>>>
>>>>>>>>> On Tue, Mar 13, 2018 at 7:11 AM, aginwala <aginwala at asu.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ben/Numan:
>>>>>>>>>>
>>>>>>>>>> I am trying to set up a 3-node southbound db cluster using raft10
>>>>>>>>>> <https://patchwork.ozlabs.org/patch/854298/>, which is in review.
>>>>>>>>>>
>>>>>>>>>> # Node 1 create-cluster
>>>>>>>>>> ovsdb-tool create-cluster /etc/openvswitch/ovnsb_db.db
>>>>>>>>>> /root/ovs-reviews/ovn/ovn-sb.ovsschema tcp:10.99.152.148:6642
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> A different port is used for RAFT. So you have to choose another
>>>>>>>>> port like 6644 for example.
>>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> # Node 2
>>>>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db
>>>>>>>>>> OVN_Southbound tcp:10.99.152.138:6642 tcp:10.99.152.148:6642 --cid
>>>>>>>>>> 5dfcb678-bb1d-4377-b02d-a380edec2982
>>>>>>>>>>
>>>>>>>>>> #Node 3
>>>>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db
>>>>>>>>>> OVN_Southbound tcp:10.99.152.101:6642 tcp:10.99.152.138:6642
>>>>>>>>>> tcp:10.99.152.148:6642 --cid 5dfcb678-bb1d-4377-b02d-a380edec2982
>>>>>>>>>>
>>>>>>>>>> # ovn remote is set to all 3 nodes
>>>>>>>>>> external_ids:ovn-remote="tcp:10.99.152.148:6642, tcp:10.99.152.138:6642, tcp:10.99.152.101:6642"
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> # Starting sb db on node 1 using the command below:
>>>>>>>>>>
>>>>>>>>>> ovsdb-server --detach --monitor -vconsole:off -vraft -vjsonrpc
>>>>>>>>>> --log-file=/var/log/openvswitch/ovsdb-server-sb.log
>>>>>>>>>> --pidfile=/var/run/openvswitch/ovnsb_db.pid
>>>>>>>>>> --remote=db:OVN_Southbound,SB_Global,connections
>>>>>>>>>> --unixctl=ovnsb_db.ctl --private-key=db:OVN_Southbound,SSL,private_key
>>>>>>>>>> --certificate=db:OVN_Southbound,SSL,certificate
>>>>>>>>>> --ca-cert=db:OVN_Southbound,SSL,ca_cert
>>>>>>>>>> --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols
>>>>>>>>>> --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers
>>>>>>>>>> --remote=punix:/var/run/openvswitch/ovnsb_db.sock
>>>>>>>>>> /etc/openvswitch/ovnsb_db.db
>>>>>>>>>>
>>>>>>>>>> # check-cluster is returning nothing
>>>>>>>>>> ovsdb-tool check-cluster /etc/openvswitch/ovnsb_db.db
>>>>>>>>>>
>>>>>>>>>> # ovsdb-server-sb.log below shows the leader is elected with only
>>>>>>>>>> one server and there are rbac related debug logs with rpc replies and empty
>>>>>>>>>> params with no errors
>>>>>>>>>>
>>>>>>>>>> 2018-03-13T01:12:02Z|00002|raft|DBG|server 63d1 added to
>>>>>>>>>> configuration
>>>>>>>>>> 2018-03-13T01:12:02Z|00003|raft|INFO|term 6: starting election
>>>>>>>>>> 2018-03-13T01:12:02Z|00004|raft|INFO|term 6: elected leader by
>>>>>>>>>> 1+ of 1 servers
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Now starting the ovsdb-server on the other cluster nodes fails, saying
>>>>>>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnsb_db.db: cannot
>>>>>>>>>> identify file type
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Also noticed that man ovsdb-tool is missing cluster details.
>>>>>>>>>> Might want to address it in the same patch or a different one.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Please advise on what is missing here for running ovn-sbctl show,
>>>>>>>>>> as this command hangs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think you can use the ovn-ctl command "start_cluster_sb_ovsdb"
>>>>>>>>> for your testing (at least for now)
>>>>>>>>>
>>>>>>>>> For your setup, I think you can start the cluster as
>>>>>>>>>
>>>>>>>>> # Node 1
>>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.148 --db-sb-port=6642
>>>>>>>>> --db-sb-create-insecure-remote=yes
>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.148:6644" start_cluster_sb_ovsdb
>>>>>>>>>
>>>>>>>>> # Node 2
>>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.138 --db-sb-port=6642
>>>>>>>>> --db-sb-create-insecure-remote=yes
>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644"
>>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" start_cluster_sb_ovsdb
>>>>>>>>>
>>>>>>>>> # Node 3
>>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.101 --db-sb-port=6642
>>>>>>>>> --db-sb-create-insecure-remote=yes
>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.101:6644"
>>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" start_cluster_sb_ovsdb
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Let me know how it goes.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Numan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> discuss mailing list
>>>>>>>>>> discuss at openvswitch.org
>>>>>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>