[ovs-discuss] raft ovsdb clustering

Numan Siddique nusiddiq at redhat.com
Wed Mar 21 12:43:47 UTC 2018


Hi Aliasgar,

ovsdb-server maintains locks per connection, not across the db, so a lock
taken through one server in the cluster is not seen by clients connected to
another server. A workaround for now would be to configure all the ovn-northd
instances to connect to one ovsdb-server if you want active/standby.
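
As a rough sketch of that workaround (not something I have tested on your
setup): every node would start ovn-northd with the same single remote for
each db, using the --ovnnb-db/--ovnsb-db options you already tried. The
address is just one of your three nodes picked as a placeholder, and the
variable names are made up for illustration.

```shell
# Sketch only: DB_HOST is a placeholder (one of the three nodes in this
# thread); 6641/6642 are the NB/SB client ports used earlier in the thread.
DB_HOST="10.169.125.152"            # the one ovsdb-server every northd shares
NB_REMOTE="tcp:${DB_HOST}:6641"     # northbound db remote
SB_REMOTE="tcp:${DB_HOST}:6642"     # southbound db remote

# Build the same command line on every node. It is printed rather than
# executed so the sketch has no side effects.
NORTHD_CMD="ovn-northd --ovnnb-db=${NB_REMOTE} --ovnsb-db=${SB_REMOTE}"
echo "${NORTHD_CMD}"
```

With all instances pointed at the same server, they contend for the lock in
one place, so only one should become active. The trade-off, as far as I can
tell, is that northd then depends on that one server being reachable.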

Probably Ben can answer if there is a plan to support ovsdb locks across
the db. We also need this support in networking-ovn as it also uses ovsdb
locks.

Thanks
Numan


On Wed, Mar 21, 2018 at 1:40 PM, aginwala <aginwala at asu.edu> wrote:

> Hi Numan:
>
> Just figured out, as I continued to test further, that ovn-northd is
> running as active on all 3 nodes instead of as one active instance, which
> results in db errors as per the logs below.
>
>
> # on node 3, I ran ovn-nbctl ls-add ls2; it produces the logs below in
> ovn-northd
> 2018-03-21T06:01:59.442Z|00007|ovsdb_idl|WARN|transaction error:
> {"details":"Transaction causes multiple rows in \"Datapath_Binding\" table
> to have identical values (1) for index on column \"tunnel_key\".  First
> row, with UUID 8c5d9342-2b90-4229-8ea1-001a733a915c, was inserted by this
> transaction.  Second row, with UUID 8e06f919-4cc7-4ffc-9a79-20ce6663b683,
> existed in the database before this transaction and was not modified by the
> transaction.","error":"constraint violation"}
>
> In the southbound datapath list, 2 duplicate records get created for the
> same switch.
>
> # ovn-sbctl list Datapath
> _uuid               : b270ae30-3458-445f-95d2-b14e8ebddd01
> external_ids        : {logical-switch="4d6674e3-ff9f-4f38-b050-0fa9bec9e34d",
> name="ls2"}
> tunnel_key          : 2
>
> _uuid               : 8e06f919-4cc7-4ffc-9a79-20ce6663b683
> external_ids        : {logical-switch="4d6674e3-ff9f-4f38-b050-0fa9bec9e34d",
> name="ls2"}
> tunnel_key          : 1
>
>
>
> # on nodes 1 and 2 where northd is running, it gives below error:
> 2018-03-21T06:01:59.437Z|00008|ovsdb_idl|WARN|transaction error:
> {"details":"cannot delete Datapath_Binding row
> 8e06f919-4cc7-4ffc-9a79-20ce6663b683 because of 17 remaining
> reference(s)","error":"referential integrity violation"}
>
> As per commit message, for northd I re-tried setting
> --ovnnb-db="tcp:10.169.125.152:6641,tcp:10.169.125.131:6641,tcp:10.148.181.162:6641"
> and
> --ovnsb-db="tcp:10.169.125.152:6642,tcp:10.169.125.131:6642,tcp:10.148.181.162:6642"
> and it did not help either.
>
> There is no issue if I keep running only one instance of northd on any of
> these 3 nodes. Hence, I wanted to know: is something else missing here
> to make only one northd instance active and the rest standby?
>
>
> Regards,
>
> On Thu, Mar 15, 2018 at 3:09 AM, Numan Siddique <nusiddiq at redhat.com>
> wrote:
>
>> That's great
>>
>> Numan
>>
>>
>> On Thu, Mar 15, 2018 at 2:57 AM, aginwala <aginwala at asu.edu> wrote:
>>
>>> Hi Numan:
>>>
>>> I tried on new nodes (kernel: 4.4.0-104-generic, Ubuntu 16.04) with a
>>> fresh installation and it worked super fine for both sb and nb dbs. It
>>> seems there was some kernel issue on the previous nodes when I re-installed
>>> the raft patch, as I was running a different ovs version on those nodes
>>> before.
>>>
>>>
>>> For 2 HVs, I now set
>>> ovn-remote="tcp:10.169.125.152:6642, tcp:10.169.125.131:6642, tcp:10.148.181.162:6642"
>>> and started controller and it works super fine.
>>>
>>>
>>> Did some failover testing by rebooting/killing the leader
>>> (10.169.125.152) and bringing it back up and it works as expected.
>>> Nothing weird noted so far.
>>>
>>> # check-cluster gives the data below on one of the nodes (10.148.181.162)
>>> post leader failure
>>>
>>> ovsdb-tool check-cluster /etc/openvswitch/ovnsb_db.db
>>> ovsdb-tool: leader /etc/openvswitch/ovnsb_db.db for term 2 has log
>>> entries only up to index 18446744073709551615, but index 9 was committed in
>>> a previous term (e.g. by /etc/openvswitch/ovnsb_db.db)
>>>
>>>
>>> For check-cluster, are we planning to add more output showing which node
>>> is active (leader), etc. in upcoming versions?
>>>
>>>
>>> Thanks a ton for helping sort this out.  I think the patch looks good to
>>> be merged once the comments by Justin are addressed, along with the man page
>>> details for ovsdb-tool.
>>>
>>>
>>> I will do some more crash testing for the cluster along with the scale
>>> test and keep you posted if something unexpected is noted.
>>>
>>>
>>>
>>> Regards,
>>>
>>>
>>>
>>> On Tue, Mar 13, 2018 at 11:07 PM, Numan Siddique <nusiddiq at redhat.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Mar 14, 2018 at 7:51 AM, aginwala <aginwala at asu.edu> wrote:
>>>>
>>>>> Sure.
>>>>>
>>>>> To add on, I also ran this for the nb db using a different port, and
>>>>> Node 2 crashes with the same error:
>>>>> # Node 2
>>>>> /usr/share/openvswitch/scripts/ovn-ctl --db-nb-addr=10.99.152.138
>>>>> --db-nb-port=6641 --db-nb-cluster-remote-addr="tcp:10.99.152.148:6645"
>>>>> --db-nb-cluster-local-addr="tcp:10.99.152.138:6645" start_nb_ovsdb
>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnnb_db.db: cannot
>>>>> identify file type
>>>>>
>>>>>
>>>>>
>>>> Hi Aliasgar,
>>>>
>>>> It worked for me. Can you delete the old db files in /etc/openvswitch/
>>>> and try running the commands again ?
>>>>
>>>> Below are the commands I ran in my setup.
>>>>
>>>> Node 1
>>>> -------
>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl
>>>> --db-sb-addr=192.168.121.91 --db-sb-port=6642 --db-sb-create-insecure-remote=yes
>>>> --db-sb-cluster-local-addr=tcp:192.168.121.91:6644 start_sb_ovsdb
>>>>
>>>> Node 2
>>>> ---------
>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl
>>>> --db-sb-addr=192.168.121.87 --db-sb-port=6642 --db-sb-create-insecure-remote=yes
>>>> --db-sb-cluster-local-addr="tcp:192.168.121.87:6644"
>>>> --db-sb-cluster-remote-addr="tcp:192.168.121.91:6644"  start_sb_ovsdb
>>>>
>>>> Node 3
>>>> ---------
>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl
>>>> --db-sb-addr=192.168.121.78 --db-sb-port=6642 --db-sb-create-insecure-remote=yes
>>>> --db-sb-cluster-local-addr="tcp:192.168.121.78:6644"
>>>> --db-sb-cluster-remote-addr="tcp:192.168.121.91:6644"  start_sb_ovsdb
>>>>
>>>>
>>>>
>>>> Thanks
>>>> Numan
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>> On Tue, Mar 13, 2018 at 9:40 AM, Numan Siddique <nusiddiq at redhat.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 13, 2018 at 9:46 PM, aginwala <aginwala at asu.edu> wrote:
>>>>>>
>>>>>>> Thanks Numan for the response.
>>>>>>>
>>>>>>> There is no command start_cluster_sb_ovsdb in the source code either.
>>>>>>> Is that in a separate commit somewhere? Hence, I used start_sb_ovsdb,
>>>>>>> which I think would not be the right choice?
>>>>>>>
>>>>>>
>>>>>> Sorry, I meant start_sb_ovsdb. Strange that it didn't work for you.
>>>>>> Let me try it out again and update this thread.
>>>>>>
>>>>>> Thanks
>>>>>> Numan
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> # Node1  came up as expected.
>>>>>>> ovn-ctl --db-sb-addr=10.99.152.148 --db-sb-port=6642
>>>>>>> --db-sb-create-insecure-remote=yes
>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.148:6644" start_sb_ovsdb
>>>>>>>
>>>>>>> # verifying it's a clustered db with ovsdb-tool db-local-address
>>>>>>> /etc/openvswitch/ovnsb_db.db
>>>>>>> tcp:10.99.152.148:6644
>>>>>>> # ovn-sbctl show works fine and chassis are being populated
>>>>>>> correctly.
>>>>>>>
>>>>>>> #Node 2 fails with error:
>>>>>>> /usr/share/openvswitch/scripts/ovn-ctl --db-sb-addr=10.99.152.138
>>>>>>> --db-sb-port=6642 --db-sb-create-insecure-remote=yes
>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644"
>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644" start_sb_ovsdb
>>>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnsb_db.db: cannot
>>>>>>> identify file type
>>>>>>>
>>>>>>> # So I started the sb db the usual way using start_ovsdb just to
>>>>>>> get the db file created, killed the sb pid, and re-ran the command, which
>>>>>>> gave the actual error: it complains about the join-cluster command that
>>>>>>> is being called internally
>>>>>>> /usr/share/openvswitch/scripts/ovn-ctl --db-sb-addr=10.99.152.138
>>>>>>> --db-sb-port=6642 --db-sb-create-insecure-remote=yes
>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644"
>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644" start_sb_ovsdb
>>>>>>> ovsdb-tool: /etc/openvswitch/ovnsb_db.db: not a clustered database
>>>>>>>  * Backing up database to /etc/openvswitch/ovnsb_db.db.backup1.15.0-70426956
>>>>>>> ovsdb-tool: 'join-cluster' command requires at least 4 arguments
>>>>>>>  * Creating cluster database /etc/openvswitch/ovnsb_db.db from
>>>>>>> existing one
>>>>>>>
>>>>>>>
>>>>>>> # based on the above error I killed the sb db pid again and tried to
>>>>>>> create a local cluster on the node, then re-ran the join operation as per
>>>>>>> the source code function.
>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db OVN_Southbound
>>>>>>> tcp:10.99.152.138:6644 tcp:10.99.152.148:6644 which still complains
>>>>>>> ovsdb-tool: I/O error: /etc/openvswitch/ovnsb_db.db: create failed
>>>>>>> (File exists)
>>>>>>>
>>>>>>>
>>>>>>> # Node 3: I did not try as I am assuming the same failure as node 2
>>>>>>>
>>>>>>>
>>>>>>> Let me know if you need any further details.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 13, 2018 at 3:08 AM, Numan Siddique <nusiddiq at redhat.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Hi Aliasgar,
>>>>>>>>
>>>>>>>> On Tue, Mar 13, 2018 at 7:11 AM, aginwala <aginwala at asu.edu> wrote:
>>>>>>>>
>>>>>>>>> Hi Ben/Numan:
>>>>>>>>>
>>>>>>>>> I am trying to set up a 3-node southbound db cluster using raft10
>>>>>>>>> <https://patchwork.ozlabs.org/patch/854298/>, which is in review.
>>>>>>>>>
>>>>>>>>> # Node 1 create-cluster
>>>>>>>>> ovsdb-tool create-cluster /etc/openvswitch/ovnsb_db.db
>>>>>>>>> /root/ovs-reviews/ovn/ovn-sb.ovsschema tcp:10.99.152.148:6642
>>>>>>>>>
>>>>>>>>
>>>>>>>> A different port is used for RAFT. So you have to choose another
>>>>>>>> port like 6644 for example.
>>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>> # Node 2
>>>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db
>>>>>>>>> OVN_Southbound tcp:10.99.152.138:6642 tcp:10.99.152.148:6642 --cid
>>>>>>>>> 5dfcb678-bb1d-4377-b02d-a380edec2982
>>>>>>>>>
>>>>>>>>> #Node 3
>>>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db
>>>>>>>>> OVN_Southbound tcp:10.99.152.101:6642 tcp:10.99.152.138:6642
>>>>>>>>> tcp:10.99.152.148:6642 --cid 5dfcb678-bb1d-4377-b02d-a380edec2982
>>>>>>>>>
>>>>>>>>> # ovn remote is set to all 3 nodes
>>>>>>>>> external_ids:ovn-remote="tcp:10.99.152.148:6642, tcp:10.99.152.138:6642, tcp:10.99.152.101:6642"
>>>>>>>>>
>>>>>>>>
>>>>>>>>> # Starting sb db on node 1 using the command below:
>>>>>>>>>
>>>>>>>>> ovsdb-server --detach --monitor -vconsole:off -vraft -vjsonrpc
>>>>>>>>> --log-file=/var/log/openvswitch/ovsdb-server-sb.log
>>>>>>>>> --pidfile=/var/run/openvswitch/ovnsb_db.pid
>>>>>>>>> --remote=db:OVN_Southbound,SB_Global,connections
>>>>>>>>> --unixctl=ovnsb_db.ctl --private-key=db:OVN_Southbound,SSL,private_key
>>>>>>>>> --certificate=db:OVN_Southbound,SSL,certificate
>>>>>>>>> --ca-cert=db:OVN_Southbound,SSL,ca_cert
>>>>>>>>> --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols
>>>>>>>>> --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers
>>>>>>>>> --remote=punix:/var/run/openvswitch/ovnsb_db.sock
>>>>>>>>> /etc/openvswitch/ovnsb_db.db
>>>>>>>>>
>>>>>>>>> # check-cluster is returning nothing
>>>>>>>>> ovsdb-tool check-cluster /etc/openvswitch/ovnsb_db.db
>>>>>>>>>
>>>>>>>>> # ovsdb-server-sb.log below shows the leader is elected with only
>>>>>>>>> one server and there are rbac related debug logs with rpc replies and empty
>>>>>>>>> params with no errors
>>>>>>>>>
>>>>>>>>> 2018-03-13T01:12:02Z|00002|raft|DBG|server 63d1 added to
>>>>>>>>> configuration
>>>>>>>>> 2018-03-13T01:12:02Z|00003|raft|INFO|term 6: starting election
>>>>>>>>> 2018-03-13T01:12:02Z|00004|raft|INFO|term 6: elected leader by 1+
>>>>>>>>> of 1 servers
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Now starting the ovsdb-server on the other nodes fails, saying
>>>>>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnsb_db.db: cannot
>>>>>>>>> identify file type
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Also noticed that man ovsdb-tool is missing cluster details. Might
>>>>>>>>> want to address it in the same patch or a different one.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please advise on what is missing here for running ovn-sbctl show,
>>>>>>>>> as this command hangs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> I think you can use the ovn-ctl command "start_cluster_sb_ovsdb"
>>>>>>>> for your testing (at least for now)
>>>>>>>>
>>>>>>>> For your setup, I think you can start the cluster as
>>>>>>>>
>>>>>>>> # Node 1
>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.148 --db-sb-port=6642
>>>>>>>> --db-sb-create-insecure-remote=yes
>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.148:6644" start_cluster_sb_ovsdb
>>>>>>>>
>>>>>>>> # Node 2
>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.138 --db-sb-port=6642
>>>>>>>> --db-sb-create-insecure-remote=yes
>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644"
>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" start_cluster_sb_ovsdb
>>>>>>>>
>>>>>>>> # Node 3
>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.101 --db-sb-port=6642
>>>>>>>> --db-sb-create-insecure-remote=yes
>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.101:6644"
>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" start_cluster_sb_ovsdb
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Let me know how it goes.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Numan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>