[ovs-dev] [PATCH 15/15] ovsdb: Introduce experimental support for clustered databases. (1/2)

Fri Feb 16 09:37:04 UTC 2018

> On Dec 31, 2017, at 9:16 PM, Ben Pfaff <blp at ovn.org> wrote:

I still haven't finished, but I think there are a few items you could start looking at while I finish the review later today.

> diff --git a/Documentation/ref/ovsdb.5.rst b/Documentation/ref/ovsdb.5.rst
> index f3e50976b5c7..0ab888996eac 100644
> --- a/Documentation/ref/ovsdb.5.rst
> +++ b/Documentation/ref/ovsdb.5.rst
> 
> ...
> +<cluster-txn>
> +    A JSON array with two elements:
> +
> +    1. The first element is either a <database-schema> or ``null``.  It is
> +       always present in the first record of a clustered database to indicate
> +       the database's initial schema.  If it is present in a later record, it
> +       indicates a change of schema for the database.

Assuming my reading is correct, I think it might be clearer as the following:

    1. The first element is either a <database-schema> or ``null``.
       A <database-schema> element is always present in the first
       record of a clustered database to indicate the database's initial
       schema.  If it is not ``null`` in a later record, it indicates a
       change of schema for the database.	

> +    2. The second element is either a transaction record in the format
> +       described under ``Transaction Records'' above, or ``null``.

I didn't see "Transaction Records" in the file.

> ...
> +``name": <id>``
> +    The database schema name.  It is only important when a server is in the
> +    process of a joining a cluster: a server will only join a cluster if the
> +    name matches.  (If the database schema name were unique, then we would
> +    not also need a cluster ID.)

I think you can drop the first "a" in "a joining a".

> ...
> +Entry
> +    A Raft log entry.
> +
> +Term
> +    The start of a new term.
> +
> +Vote
> +    The server's vote for a leader in the current term.
> +
> +The following additional types of records aid debugging and troubleshooting,
> +but they do not affect correctness.
> +
> +Note
> +    A human-readable description of some event.
> +
> +Commit Index
> +    An update to the server's ``commit_index``.
> +
> +Leader
> +    Identifies a newly elected leader for the current term.
> +
> +The table below identifies the members that each type of record contains.
> +"yes" indicates that a member is required, "?" that it is optional, blank that
> +it is forbidden, and [1] that ``data`` and ``eid`` must be either both present
> +or both absent.
> +
> +============  ====  =====  ====  ======  ============  ====
> +member        Term  Entry  Vote  Leader  Commit Index  Note
> +============  ====  =====  ====  ======  ============  ====
> +comment         ?     ?      ?      ?          ?         ?
> +term           yes   yes    yes    yes
> +index                yes
> +servers               ?
> +data                 [1]
> +eid                  [1]
> +vote                        yes
> +leader                             yes
> +commit_index                                  yes
> +note                                                   yes
> +============  ====  =====  ====  ======  ============  ====

I know it's just my OCD, but it might be nice to list the record types in the same order as the column headers.

> 
> +Joining a Cluster
> +~~~~~~~~~~~~~~~~~
> +
> +In addition to general format for a clustered database, there is also a special
> +case for a database file created by ``ovsdb-tool join-cluster``.  Such a file
> +contains exactly one record, which conveys the information passed to the
> +``join-cluster`` command.  It has the following members:
> +
> +``"server_id": <raw-uuid>`` and ``"local_address": <address>`` and ``"name": <id>``
> +    These have the same semantics described above in the general description
> +    of the format.
> +
> +``"cluster_id": <raw-uuid>``
> +    This is provided only if the user gave the ``--cid`` option to
> +    ``join-cluster``.  It has the same semantics described above.
> +
> +
> +``"remote_addresses"; [<address>*]``
> +    One or more remote servers to contact for joining the cluster.

I think there's an extra blank line above this section.

> diff --git a/Documentation/ref/ovsdb.7.rst b/Documentation/ref/ovsdb.7.rst
> index 6adef73826e8..75c2542cdffc 100644
> --- a/Documentation/ref/ovsdb.7.rst
> +++ b/Documentation/ref/ovsdb.7.rst
> ...
> +To configure a client, such as ``ovs-vswitchd`` or ``ovs-vsctl``, to use a
> +clustered database, configure the server to listen on a "connection method"
> +that the client can reach, then point the client to that connection method.
> +See `Connection Methods`_ below for information about connection methods.

This is in the "Standalone Database Service Model" section, which seems incorrect.

> +Clustered Database Service Model
> +--------------------------------
> +
> +A **clustered** database runs across 3 or 5 database servers (the **cluster**)
> +on different hosts.  Servers in a cluster automatically synchronize writes
> +within the cluster.  A 3-server cluster can remain available in the face of at
> +most 1 server failure; a 5-server cluster tolerates up to 2 failures.
> +Clusters larger than 5 servers will also work, with every 2 added servers
> +allowing the cluster to tolerate 1 more failure, but write performance
> +decreases.  The number of servers should be odd: a 4- or 6-server cluster
> +cannot tolerate more failures than a 3- or 5-server cluster, respectively.

The first sentence makes it sound like only 3 or 5 database servers is supported in a cluster, but the rest of the paragraph make it clear that more can be run.

> +To configure a client, such as ``ovn-controller`` or ``ovn-sbctl``, to use a
> +clustered database, first configure all of the servers to listen on a
> +connection method that the client can reach, then point the client to all of
> +the connection methods, comma-separated.  See `Connection Methods`_, below, for
> +more detail.

The phrase 'then point the client to all of the connection methods" sounds odd to me.  Maybe the "servers' connection methods"?

> +* A client uses a database on one server in a cluster, then another server in
> +  the cluster (perhaps because the first server failed) could observe stale
> +  data.  Clustered OVSDB clients, however, can use a column in the ``_Server``
> +  database to detect that data on a server is older than data that the client
> +  previously read.  The OVSDB client library in Open vSwitch uses this feature
> +  to avoid servers with stale data.

I found the first sentence a little confusing.  How about?

	+* If a client uses a database on one server in a cluster, then another server
	+  in the cluster (perhaps because the first server failed) the client could
	+  observe stale data.

> -OVSDB also supports online database schema conversion.
> -To convert a database online, use ``ovsdb-client convert``.
> +OVSDB also supports online database schema conversion, for any of its database

I don't think the comma is necessary.

> diff --git a/lib/jsonrpc.c b/lib/jsonrpc.c
> index f8786f909ac8..0b8c1468b8d1 100644
> --- a/lib/jsonrpc.c
> +++ b/lib/jsonrpc.c
> ...
> +struct jsonrpc *
> +jsonrpc_session_steal(struct jsonrpc_session *s)
> +{
> +    struct jsonrpc *rpc = s->rpc;
> +    s->rpc = NULL;
> +    jsonrpc_session_close(s);
> +    return rpc;
> +}

I don't think there are any external callers to this function, so it could be declared static.

> diff --git a/lib/ovsdb-idl.c b/lib/ovsdb-idl.c
> index 24ba5b50fddc..c88d93b14a61 100644
> --- a/lib/ovsdb-idl.c
> +++ b/lib/ovsdb-idl.c
> ...
> +static void
> +ovsdb_idl_process_msg(struct ovsdb_idl *idl, struct jsonrpc_msg *msg)
> +{
> ...
> +    /* Unknown message.  Log at a low level because this can happen if
> +     * ovsdb_idl_txn_destroy() is called to destroy a transaction
> +     * before we receive the reply.
> +     *
> +     * (We could sort those out from other kinds of unknown messages by
> +     * using distinctive IDs for transactions, if it seems valuable to
> +     * do so, and then it would be possible to use different log
> +     * levels. XXX?) */

Pointing out the "XXX", but it does seem like something that would be addressed at a future date.

> diff --git a/ovn/controller/ovn-controller.c b/ovn/controller/ovn-controller.c
> index c286ccbcaf8d..103261853952 100644
> --- a/ovn/controller/ovn-controller.c
> +++ b/ovn/controller/ovn-controller.c
> @@ -615,6 +615,7 @@ main(int argc, char *argv[])
>     char *ovnsb_remote = get_ovnsb_remote(ovs_idl_loop.idl);
>     struct ovsdb_idl_loop ovnsb_idl_loop = OVSDB_IDL_LOOP_INITIALIZER(
>         ovsdb_idl_create(ovnsb_remote, &sbrec_idl_class, true, true));
> +    ovsdb_idl_set_leader_only(ovnsb_idl_loop.idl, false);
> 
>     create_ovnsb_indexes(ovnsb_idl_loop.idl);
>     lport_init(ovnsb_idl_loop.idl);

Is there a reason ovn-controller doesn't try to connect the leader?  My guess is that this will have better performance, and most of the writing it does is fairly siloed, but there are things such as MAC bindings that could be shared.  I suppose if there are multiple writers to the same address there may be more serious problems.

> diff --git a/ovn/utilities/ovn-ctl b/ovn/utilities/ovn-ctl
> index 0e56bf8c501b..2b3c9215e5fa 100755
> --- a/ovn/utilities/ovn-ctl
> +++ b/ovn/utilities/ovn-ctl
> @@ -95,10 +95,37 @@ promote_ovnsb() {
> 
> start_nb_ovsdb() {
> ...
> +    if test X$mode != "Xcluster"; then
>         upgrade_db "$DB_NB_FILE" "$DB_NB_SCHEMA" 1>/dev/null 2>/dev/null
> +    else
> +        if test -z "$DB_NB_CLUSTER_REMOTE_ADDR"; then
> +            create_cluster "$DB_NB_FILE" "$DB_NB_SCHEMA" \
> +"$DB_NB_CLUSTER_LOCAL_ADDR"
> +        else
> +            join_cluster "$DB_NB_FILE" "OVN_Northbound" \
> +"$DB_NB_CLUSTER_LOCAL_ADDR" "$DB_NB_CLUSTER_REMOTE_ADDR"
> +        fi

I find both of these "if test"s confusing, since they're checking for a negative.  What about something like the following?

	    if test X$mode = "Xcluster"; then
	        if test -n "$DB_NB_CLUSTER_REMOTE_ADDR"; then
	            join_cluster "$DB_NB_FILE" "OVN_Northbound" \
	"$DB_NB_CLUSTER_LOCAL_ADDR" "$DB_NB_CLUSTER_REMOTE_ADDR"
	        else
	            create_cluster "$DB_NB_FILE" "$DB_NB_SCHEMA" \
	"$DB_NB_CLUSTER_LOCAL_ADDR"
	        fi
	    else
	        upgrade_db "$DB_NB_FILE" "$DB_NB_SCHEMA" 1>/dev/null 2>/dev/null
	    fi

> +    set ovsdb-server
> +    set "$@" --detach --monitor

This is always setting "--detach --monitor", but then there's a check a few lines down that may or may not set it.  I suspect this line shouldn't be here.

> +    set "$@" $OVN_NB_LOG --log-file=$OVN_NB_LOGFILE
> +    set "$@" --remote=punix:$DB_NB_SOCK --pidfile=$DB_NB_PID
...
>         if test X"$DB_NB_DETACH" != Xno; then
>             set "$@" --detach --monitor
> ...
>         set "$@" $OVN_NB_LOG --log-file=$OVN_NB_LOGFILE
>         set "$@" --remote=punix:$DB_NB_SOCK --pidfile=$DB_NB_PID

Those last two lines were already executed a few lines earlier.

Also, the indentation for this block seems too deep.

> +    # TODO (numans): Remove this 'if' once we have the fix to
> +    # start ovsdb-server with the below options for the cluster db.
> +    if test X$mode != "Xcluster"; then
>         set "$@" --remote=db:OVN_Northbound,NB_Global,connections
>         set "$@" --private-key=db:OVN_Northbound,SSL,private_key
>         set "$@" --certificate=db:OVN_Northbound,SSL,certificate
>         set "$@" --ca-cert=db:OVN_Northbound,SSL,ca_cert
>         set "$@" --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols
>         set "$@" --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers
> +    fi

Should this TODO item be addressed?  I'm not really clear on what's missing from the implementation anyway.  I didn't notice any indication that support for these was missing.

If it stays, I'd recommend putting this TODO block down a line or two to make it easier to spot.

> +    if test -e $ovnnb_active_conf_file; then
> +        set "$@" --sync-from=`cat $ovnnb_active_conf_file`
> +    fi

I wonder if this could cause problems when someone was running active-passive, and then switches to clustered.  Should you either remove this file at the beginning of start_nb_ovsdb() or only add "--sync-from" if "mode" is "active_passive"?

> +    if test -z "$DB_NB_CLUSTER_REMOTE_ADDR"; then
>         ovn-nbctl init
>     fi

Once again, I find these negative tests confusing, but I can see it needs to capture multiple things.  What about a comment along the lines of:

	# Initialize the database if it's running standalone,
	# active-passive, or is the first server in a cluster. 

> start_sb_ovsdb() {
> -    # Check and eventually start ovsdb-server for Southbound DB
> -    if ! pidfile_is_running $DB_SB_PID; then
> +    # Check and eventually start ovsdb-server for Northbound DB

This should probably be "Southbound".

Since the functions are basically identical in structure, the comments I mentioned in ovn_nb_ovsdb() apply here as well.

> @@ -494,6 +555,14 @@ File location options:
>   --db-sb-sync-from-port=ADDR OVN Southbound active db tcp port (default: $DB_SB_SYNC_FROM_PORT)
>   --db-sb-sync-from-proto=PROTO OVN Southbound active db transport (default: $DB_SB_SYNC_FROM_PROTO)
>   --db-sb-create-insecure-remote=yes|no Create ptcp OVN Southbound remote (default: $DB_SB_CREATE_INSECURE_REMOTE)
> +  --db-nb-cluster-local-addr=ADDR OVN_Northbound cluster local address \
> +  (default: $DB_NB_CLUSTER_LOCAL_ADDR)
> +  --db-nb-cluster-remote-addr=ADDR OVN_Northbound cluster remote address \
> +  (default: $DB_NB_CLUSTER_REMOTE_ADDR)
> +  --db-sb-cluster-local-addr=ADDR OVN_Northbound cluster local address \
> +  (default: $DB_SB_CLUSTER_LOCAL_ADDR)
> +  --db-sb-cluster-remote-addr=ADDR OVN_Northbound cluster remote address \
> +  (default: $DB_SB_CLUSTER_REMOTE_ADDR)

I believe these last two should say "OVN_Southbound cluster".

These commands don't appear to be documented in the ovn-ctl man page.  It looks like that man page is generally missing a lot descriptions for options, but I'm wondering if some better guidance for starting up OVN in a cluster would be helpful.  For example, I think it would be a problem if one were to specify both a local and remote address.

> diff --git a/ovn/utilities/ovn-nbctl.c b/ovn/utilities/ovn-nbctl.c
> index c920ad878ab6..fccbc7cdd9fe 100644
> --- a/ovn/utilities/ovn-nbctl.c
> +++ b/ovn/utilities/ovn-nbctl.c
> ...
> @@ -182,6 +186,8 @@ parse_options(int argc, char *argv[], struct shash *local_options)
>         {"help", no_argument, NULL, 'h'},
>         {"commands", no_argument, NULL, OPT_COMMANDS},
>         {"options", no_argument, NULL, OPT_OPTIONS},
> +        {"leader-only", no_argument, &leader_only, true},
> +        {"no-leader-only", no_argument, &leader_only, false},

Do you think it's worth adding these to usage()?

> diff --git a/ovn/utilities/ovn-sbctl.c b/ovn/utilities/ovn-sbctl.c
> index f16cefedd897..9f09b83e8454 100644
> --- a/ovn/utilities/ovn-sbctl.c
> +++ b/ovn/utilities/ovn-sbctl.c
> ...
> @@ -178,6 +182,8 @@ parse_options(int argc, char *argv[], struct shash *local_options)
>         {"help", no_argument, NULL, 'h'},
>         {"commands", no_argument, NULL, OPT_COMMANDS},
>         {"options", no_argument, NULL, OPT_OPTIONS},
> +        {"leader-only", no_argument, &leader_only, true},
> +        {"no-leader-only", no_argument, &leader_only, false},

Same question as ovn-nbctl.

> --- /dev/null
> +++ b/ovsdb/TODO.rst
> @@ -0,0 +1,61 @@
> ...
> +===========================
> +OVSDB Clustering To-do List
> +===========================
> +
> +* Unit test snapshotting.

Did you want to add ephemeral columns to the todo list?

> diff --git a/ovsdb/_server.xml b/ovsdb/_server.xml
> index 8ef782fb97b2..e4536671ccbe 100644
> --- a/ovsdb/_server.xml
> +++ b/ovsdb/_server.xml
> @@ -37,13 +37,13 @@
> 
>     <p>
>       When a database is removed from the server, in addition to
> +      <code>Database</code> table updates, the server sends
> +      <code>canceled</code> messages, as described in RFC 7047 section 4.1.4,

Should that be "cancel" message instead of "canceled"?

Also, from the description in 4.1.4, it sounds like just a client sends the "cancel" message to the server, not vice versa.

> @@ -58,8 +58,63 @@
> ...
>     <column name="schema">
> -      The database schema, as a JSON string.
> +      The database schema, as a JSON string.  Until a clustered database
> +      finishes joining its cluster, this is empty.
>     </column>

To me, this construed as a standalone database would always be empty, but I suspect that's not the case. Assuming I'm correct, maybe: "In the case of a clustered database, this is empty until it finishes joining its cluster."

> diff --git a/ovsdb/file.c b/ovsdb/file.c
> index dadb988d3088..d01e54fbe6ae 100644
> --- a/ovsdb/file.c
> +++ b/ovsdb/file.c
> ...
> struct json *
> +ovsdb_to_txn_json(const struct ovsdb *db, const char *comment)
> {

I think it would be nice to mention that 'comment' is not consumed.

> +struct json *
> +ovsdb_file_txn_annotate(struct json *json, const char *comment)
> {

Same here.

> +static struct ovsdb *
> +ovsdb_file_read__(const char *filename, bool rw,
> +                  struct ovsdb_schema *new_schema)
> {
> ...
> +    struct ovsdb_schema *schema = ovsdb_storage_read_schema(storage);
> ...
> +    for (;;) {
> +        /* Read a transaction.  Bail if end-of-file. */
> +        struct json *txn_json;
> +        struct ovsdb_schema *schema;

This 'schema' shadows the 'schema' defined earlier.

> +struct ovsdb *
> +ovsdb_file_read_as_schema(const char *filename, struct ovsdb_schema *schema)
> +{
> +    return ovsdb_file_read__(filename, false, schema);
> }

I believe this function consumes 'schema'.  It might be nice to indicate that.

> diff --git a/ovsdb/jsonrpc-server.c b/ovsdb/jsonrpc-server.c
> index df268cd4eedc..0df08da341be 100644
> --- a/ovsdb/jsonrpc-server.c
> +++ b/ovsdb/jsonrpc-server.c
> 
> @@ -1099,7 +1126,10 @@ ovsdb_jsonrpc_trigger_create(struct ovsdb_jsonrpc_session *s, struct ovsdb *db,
>     }
> 
>     if (disconnect_all) {
> -        ovsdb_jsonrpc_server_reconnect(s->remote->server, false);
> +        ovsdb_jsonrpc_server_reconnect(s->remote->server, false,
> +                                       xasprintf("committed %s database "
> +                                                 "schema conversion",
> +                                                 db->name));

Is this the correct comment for this circumstance?

> diff --git a/ovsdb/ovsdb-client.c b/ovsdb/ovsdb-client.c
> index b00f04147d39..de23cc14bbb0 100644
> --- a/ovsdb/ovsdb-client.c
> +++ b/ovsdb/ovsdb-client.c
> ...
> +static int
> +open_rpc(int min_args, enum args_needed need,
> +         int argc, char *argv[], struct jsonrpc **rpcp, char **databasep)
> +{
> +    struct svec remotes = SVEC_EMPTY_INITIALIZER;
> +    struct uuid cid = UUID_ZERO;
> +
> +    /* First figure out the remote(s).  If the first command-line argument has
> +     * the form of a remote, use it, otherwise use the default. */

Do you think it's worth specifying what the default is?  I think it's "Open_vSwitch" if it's defined or the database that doesn't begin with "_" if there's a single database.

> +/* Parses 'reply', a JSON-RPC reply to our request asking for the status of
> + * 'database' on 'server'.  Determines whether this server is acceptable for
> + * the transaction we want to make and return true if so or false to disconnect
> + * and try a different server. */

s/return/returns/

> diff --git a/ovsdb/ovsdb-server.1.in b/ovsdb/ovsdb-server.1.in
> index dfca40d4ef79..51a2c2facc9e 100644
> --- a/ovsdb/ovsdb-server.1.in
> +++ b/ovsdb/ovsdb-server.1.in
> 
> ...
> +.IP "\fBcluster/cid \fIdb\fR"
> +Prints the cluster ID for \fIdb\fR, which is a UUID that identifies
> +the cluster.  If \fIdb\fR is a database newly created by
> +\fBovsdb\-tool cluster\-join\fR, that has not yet successfully joined

I don't think this comma is necessary.

> +.IP "\fBcluster/leave \fR[\fB\-\-force\fR] \fIdb\fR"
> +.IP

This puts in a blank line not present in the other commands.

> +Without \fB\-\-force\fR, this command starts the server gracefully
> +removing itself from its cluster.  At least one server must remain,
> +and the cluster must be healthy, that is, over half its servers are

I think either a semicolon or colon after "healthy" would make the sentence clearer.

> diff --git a/ovsdb/ovsdb-server.c b/ovsdb/ovsdb-server.c
> index f7bf1e270120..68584f396d10 100644
> --- a/ovsdb/ovsdb-server.c
> +++ b/ovsdb/ovsdb-server.c
> 
> @@ -201,10 +216,25 @@ main_loop(struct ovsdb_jsonrpc_server *jsonrpc, struct shash *all_dbs,
> 
> +        struct shash_node *next;
> +        SHASH_FOR_EACH_SAFE (node, next, all_dbs) {
>             struct db *db = node->data;
>             if (ovsdb_trigger_run(db->db, time_msec())) {
> +                ovsdb_jsonrpc_server_reconnect(
> +                    jsonrpc, false,
> +                    xasprintf("committed %s database schema conversion",
> +                              db->db->name));

Is this the correct message?

> diff --git a/ovsdb/ovsdb-tool.1.in b/ovsdb/ovsdb-tool.1.in
> index 7b89ffeec8bf..3efa6a3e7032 100644
> --- a/ovsdb/ovsdb-tool.1.in
> +++ b/ovsdb/ovsdb-tool.1.in
> @@ -15,6 +15,9 @@ ovsdb\-tool \- Open vSwitch database management utility
> .IP "Database Creation Commands:"
> \fBovsdb\-tool \fR[\fIoptions\fR] \fBcreate \fR[\fIdb\fR [\fIschema\fR]]
> .br
> +\fBovsdb\-tool \fR[\fIoptions\fR] \fBcreate\-cluster \fIdb contents address\fR
> +.br
> +\fBovsdb\-tool [\fB\-\-cid=\fIuuid\fR] \fBjoin\-cluster\fI db name local remote\fR...

The synopsis for "join-cluster" does not include "[options]", which would imply they're not supported, but I doubt that's the case.

This isn't just a documentation issue, but the other ovsdb-tool commands allow the "db" and "schema" arguments to be optional, but none of the new cluster commands do.  Was that intentional?

> @@ -37,6 +40,8 @@ ovsdb\-tool \- Open vSwitch database management utility
> .br
> \fBovsdb\-tool \fR[\fIoptions\fR] [\fB\-m\fR | \fB\-\-more\fR]... \fBshow\-log \fR[\fIdb\fR]
> .br
> +\fBovsdb\-tool \fR[\fIoptions\fR] \fBcheck\-cluster \fIdb\fR...
> +.br
> \fBovsdb\-tool \fR[\fIoptions\fR] \fBdb\-name \fR[\fIdb\fR]

The "db-cid", "db-sid", and "db-local-address" commands are not listed in the Synopsis.  Was that intentional?

> -.IP "\fBcreate\fI db schema\fR"
> +.IP "\fBcreate\fI [db [schema]]\fR"

I don't think the square brackets should be part of the italicization.  Most of the documented commands don't indicate that "db" and "schema" are optional, so I sent a patch to do that, which will also address this line.

> +.IP "\fBcreate\-cluster\fI db contents local"
> ...
> +point of failure.  It creates clustered database file \fIdb\fR and
> +configures the server to listen on \fIlocal\fR, which must take the
> +form \fIprotocol\fB:\fIip\fB:\fIport\fR, where \fIprotocol\fR is
> +\fBtcp\fR or \fBssl\fR, \fIip\fR is the server's IP (either an IPv4
> +address or an IPv6 address enclosed in square brackets), and
> +\fIport\fR is a TCP port number.  Only one address is specified, for

> +\fIport\fR is a TCP port number.  Only one address is specified, for
> +the first server in the cluster, ordinarily the one for the server
> +running \fBcreate\-cluster\fR.  The address is used for communication

For "create-cluster", it sounds like "local" will set be set up in what we'd call a "passive" mode, but the argument is one that is in the style of "active".  I wonder if that will cause any confusion.

I don't think the comma after "specified" is necessary.

> .IP "\fBdb\-version\fI db\fR"
> .IQ "\fBschema\-version\fI schema\fR"
> Prints the version number in the schema embedded within the database
> @@ -115,6 +190,10 @@ If \fIschema\fR or \fIdb\fR was created before schema versioning was
> introduced, then it will not have a version number and this command
> will print a blank line.
> .IP
> +The \fBschema\-version\fR command is for standalone and active-backup
> +databases only.  For clustered databases, use \fBovsdb\-client\fR's
> +\fBschema\-version\fR command instead.

I get the impression that "db-version" also doesn't work with clustered databases.

> .IP "\fBdb\-cksum\fI db\fR"
> .IQ "\fBschema\-cksum\fI schema\fR"
> Prints the checksum in the schema embedded within the database
> @@ -123,6 +202,10 @@ If \fIschema\fR or \fIdb\fR was created before schema checksums were
> introduced, then it will not have a checksum and this command
> will print a blank line.
> .IP
> +The \fBschema\-cksum\fR command is for standalone and active-backup
> +databases only.  For clustered databases, use \fBovsdb\-client\fR's
> +\fBschema\-cksum\fR command instead.

Ditto "db-cksum".

> +.IP "\fBcheck\-cluster \fIdb\fR..."
> +Reads all of the records in the supplied database logs, which must be
> +logs collected from different servers (and ideally all the servers) in
> +a single cluster.  Checks each log for self-consistency and the set of
> +logs together for cross-consistency.  If \fBovsdb\-tool\fR detects

Other than "show-log", the man page doesn't describe "db" as a database log, which I think some might find confusing.  I think it would be clearer to reference "db" and call the arguments a set of databases.

> +unusual but not necessary incorrect content, it prints a warning or

s/necessary/necessarily/

> +.IP "\fBdb\-sid\fI db\fR"
> +Prints the server ID, which is a UUID that identifies the server, for
> +\fIdb\fR.  This command works only with clustered databases.  It works
> +regardless of whether \fIdb\fR has joined the cluster.

What does that last sentence mean if it hasn't joined the cluster?  It looks like when you run join-cluster, it generates a random sid, but I assume that changes when it actually joins, so the sid may not be stable.

> +.IP "\fBdb\-local\-address db\fR"
> +Prints the address used for database clustering for \fIdb\fR, in the

It's probably obvious from the name, but I'd still put "local address" in this description.

> diff --git a/ovsdb/ovsdb-tool.c b/ovsdb/ovsdb-tool.c
> index cec64152f079..bef40e81e62b 100644
> --- a/ovsdb/ovsdb-tool.c
> +++ b/ovsdb/ovsdb-tool.c
> ...
> @@ -203,6 +229,16 @@ check_ovsdb_error(struct ovsdb_error *error)
> 
> +static struct ovsdb_schema *
> +read_schema(const char *filename)
> +{
> +    struct ovsdb_storage *storage = ovsdb_storage_open_standalone(filename,
> +                                                                  false);
> +    struct ovsdb_schema *schema = ovsdb_storage_read_schema(storage);
> +    ovsdb_storage_close(storage);
> +    return schema;
> +}

I think it might be clearer to name this "read_standalone_schema" and/or document that it will have a fatal error if it's used with a database that's not standalone.

> +static struct ovsdb_error *
> +write_and_free_json(struct ovsdb_log *log, struct json *json)
> +{
> +    struct ovsdb_error *error = ovsdb_log_write(log, json);
> +    json_destroy(json);
> +    return error;
> +}

I think you can also use this function in do_create().

> +static struct ovsdb_error *
> +write_db(const char *file_name, const char *comment, const struct ovsdb *db)
> +{
> +    struct ovsdb_log *log;
> +    struct ovsdb_error *error = ovsdb_log_open(file_name, OVSDB_MAGIC,
> +                                               OVSDB_LOG_CREATE, false, &log);

This seems to only work with standalone databases.  What about calling it "write_standalone_db"?

> +static void
> compact_or_convert(const char *src_name_, const char *dst_name_,
> +                   struct ovsdb_schema *new_schema, const char *comment)

In general, do you think it's worth adding a comment when a function only works with a standalone or a clustered database?  For example, this would only work with a standalone database.

> +static void
> +do_show_log_standalone(struct ovsdb_log *log)
> +{
> +    struct shash names = SHASH_INITIALIZER(&names);
> ...
> +    ovsdb_schema_destroy(schema);
> +    /* XXX free 'names'. */
> +}

This comment about freeing 'names' was in the previous version, too.  Should it be addressed?

> +static void
> +print_raft_record(const struct raft_record *r,
> +                  struct ovsdb_schema **schemap, struct shash *names)
> +{
> +    if (r->comment) {
> +        printf(" comment: \"%s\"\n", r->comment);
> +    }
> +    if (r->term) {
> +        printf(" term: %"PRIu64"\n", r->term);
> +    }
> +
> +    switch (r->type) {
> +    case RAFT_REC_ENTRY:
> +        printf(" index: %"PRIu64"\n", r->entry.index);
> +        print_servers("servers", r->entry.servers);
> +        if (!uuid_is_zero(&r->entry.eid)) {
> +            printf(" eid: %04x\n", uuid_prefix(&r->entry.eid, 4));
> +        }
> +        print_data("", r->entry.data, schemap, names);
> +        break;
> 

The record printouts seem to contain a lot of newlines.  I suspect (hope!) most people using the logs are trying to debug their applications and not Raft, so I'm wondering if it would be better if printing the log took up less vertical space.  Would it make sense to combine some of this information on the same line

> +static void
> +do_show_log_cluster(struct ovsdb_log *log)
> +{
> +    struct shash names = SHASH_INITIALIZER(&names);
> +    struct ovsdb_schema *schema = NULL;
> +    unsigned int i;
> +
> +    shash_init(&names);
> +    schema = NULL;

I think the initialization of 'names' and 'schema' have already been handled.

> +    for (i = 0; ; i++) {

Since you moved some other declarations closer to their use, I think you could move the declaration of 'i' to here.

>     ovsdb_schema_destroy(schema);
>     /* XXX free 'names'. */

Is this something that should be addressed?

> +static void
> +do_check_cluster(struct ovs_cmdl_context *ctx)
> +{
> ...
> +                        ovs_fatal(0, "%s: record %llu reports leader %s "
> +                                  "for term %"PRIu64" but record %llu "
> +                                  "previously reported the leader as %s "
> +                                  "in that term",
> +                                  s->filename, rec_idx,
> +                                  get_server_name(&c, &r->sid,
> +                                                  buf1, sizeof buf1),
> +                                  term, leader_rec_idx,
> +                                  get_server_name(&c, &leader,
> +                                                  buf2, sizeof buf2));

I get the following compilation error:

ovsdb/ovsdb-tool.c: In function ‘do_check_cluster’:
ovsdb/ovsdb-tool.c:1334:38: error: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 7 has type ‘uint64_t {aka long unsigned int}’ [-Werror=format=]
                         ovs_fatal(0, "%s: record %llu reports leader %s "
                                      ^

This fixed the issue for me:

-                                  "for term %"PRIu64" but record %llu "
+                                  "for term %"PRIu64" but record %"PRIu64 " "

> +                ovs_fatal(0, "log entries with index %"PRIu64" differ:\n"
> +                          "%s has %s\n"
> +                          "%s has %s\n",

I don't think ovs_fatal() format strings are supposed to end with a newline.

> static const struct ovs_cmdl_command all_commands[] = {
>     { "create", "[db [schema]]", 0, 2, do_create, OVS_RW },
> +    { "create-cluster", "db contents local", 3, 3, do_create_cluster, OVS_RW },
> +    { "join-cluster", "db name local remote...", 4, INT_MAX, do_join_cluster,
> +      OVS_RW },
>     { "compact", "[db [dst]]", 0, 2, do_compact, OVS_RW },
>     { "convert", "[db [schema [dst]]]", 0, 3, do_convert, OVS_RW },
>     { "needs-conversion", NULL, 0, 2, do_needs_conversion, OVS_RO },
>     { "db-name", "[db]",  0, 1, do_db_name, OVS_RO },
>     { "db-version", "[db]",  0, 1, do_db_version, OVS_RO },
>     { "db-cksum", "[db]", 0, 1, do_db_cksum, OVS_RO },
> +    { "db-cid", "db", 1, 1, do_db_cid, OVS_RO },
> +    { "db-sid", "db", 1, 1, do_db_sid, OVS_RO },
> +    { "db-local-address", "db", 1, 1, do_db_local_address, OVS_RO },
>     { "schema-name", "[schema]", 0, 1, do_schema_name, OVS_RO },
>     { "schema-version", "[schema]", 0, 1, do_schema_version, OVS_RO },
>     { "schema-cksum", "[schema]", 0, 1, do_schema_cksum, OVS_RO },
>     { "query", "[db] trns", 1, 2, do_query, OVS_RO },
>     { "transact", "[db] trns", 1, 2, do_transact, OVS_RO },
>     { "show-log", "[db]", 0, 1, do_show_log, OVS_RO },
> +    { "check-cluster", "db...", 1, INT_MAX, do_check_cluster, OVS_RO },
>     { "help", NULL, 0, INT_MAX, do_help, OVS_RO },
>     { "list-commands", NULL, 0, INT_MAX, do_list_commands, OVS_RO },
>     { NULL, NULL, 0, 0, NULL, OVS_RO },

The man page description of many of these commands indicate that they don't work in a clustered database (e.g, schema-version, schema-cksum, and convert), but I don't think there's anything that enforces that in a clean way.

Also, as I mentioned in the man page, these new arguments don't allow an optional "db" or "schema", unlike all the other commands.

> diff --git a/ovsdb/ovsdb.c b/ovsdb/ovsdb.c
> index 89f530bcccfb..2ee2e51b4b80 100644
> --- a/ovsdb/ovsdb.c
> +++ b/ovsdb/ovsdb.c
> ...
> +/* XXX add prereq parameter? */
> struct ovsdb *
> +ovsdb_create(struct ovsdb_schema *schema, struct ovsdb_storage *storage)

Did you want to address this "XXX" comment?  Based on usage, I think it would be worth describing the arguments.  For example, I've seen callers supply one or the other argument as NULL, but if both are NULL, then it will segfault.

> -void
> ovsdb_destroy(struct ovsdb *db)
> {

Should this function be freeing 'triggers'?

> diff --git a/ovsdb/ovsdb.h b/ovsdb/ovsdb.h
> index c3e8f2091e35..c5dad832c45d 100644
> --- a/ovsdb/ovsdb.h
> +++ b/ovsdb/ovsdb.h
> ...
> /* Database. */
> +enum ovsdb_state {
> +    OVSDB_LOADING,
> +    OVSDB_RUNNING
> +};

I don't think this enum gets used anywhere and can be removed.

> diff --git a/ovsdb/storage.c b/ovsdb/storage.c
> new file mode 100644
> index 000000000000..34590f374188
> --- /dev/null
> +++ b/ovsdb/storage.c
> ...
> +/* Attempts to read a log record from 'storage'.
> + *
> + * If successful, returns NULL and stores in '*jsonp' the JSON object that the
> + * record contains.  The caller owns the data and must eventually free it (with
> + * json_destroy()).
> + *
> + * If a read error occurs, returns the error and stores NULL in '*jsonp'.
> + *
> + * If the read reaches end of file, returns NULL and stores NULL in
> + * '*jsonp'. */
> +struct ovsdb_error * OVS_WARN_UNUSED_RESULT
> +ovsdb_storage_read(struct ovsdb_storage *storage,
> +                   struct ovsdb_schema **schemap,
> +                   struct json **txnp,
> +                   struct uuid *txnid)

The function description references 'jsonp', but I think it should be 'txnp'.  I think it would be helpful to describe all the arguments, too.

> +bool
> +ovsdb_storage_read_wait(struct ovsdb_storage *storage)
> +{
> +    if (storage->raft) {
> +        return raft_has_next_entry(storage->raft);
> +    } else {
> +        /* XXX */
> +        return false;

Should that "XXX" be addressed?

> +/* Not suitable for writing transactions that change the schema. */
> +struct ovsdb_write * OVS_WARN_UNUSED_RESULT
> +ovsdb_storage_write(struct ovsdb_storage *storage, const struct json *data,
> +                    const struct uuid *prereq, struct uuid *resultp,
> +                    bool durable)
> +{
> ...
> +    if (resultp) {
> +        *resultp = result;
> +    }
> 

"ovsdb/transaction.c" is the only caller of this function, and it doesn't appear to use 'resultp'.  I think you could either drop that argument from this function, or have that caller pass in NULL.

> +struct ovsdb_write * OVS_WARN_UNUSED_RESULT
> +ovsdb_storage_write_schema_change(struct ovsdb_storage *storage,
> +                                  const struct json *schema,
> +                                  const struct json *data,
> +                                  const struct uuid *prereq,
> +                                  struct uuid *resultp)
> +{
> ...
> +    if (resultp) {
> +        *resultp = result;
> +    }

Same thing with this function.

> +struct ovsdb_storage *
> +ovsdb_storage_open_standalone(const char *filename, bool rw)
> +{
> +    struct ovsdb_storage *storage;
> +    struct ovsdb_error *error = ovsdb_storage_open__(filename, rw, false,
> +                                                     &storage);
> +    if (error) {
> +        ovs_fatal(0, "%s", ovsdb_error_to_string_free(error));
> +    }
> +    return storage;
> +}

This is super minor, but I think this code block would make more logical sense if it were near ovsdb_storage_open() and ovsdb_storage_open__()

> +
> +struct ovsdb_schema *
> +ovsdb_storage_read_schema(struct ovsdb_storage *storage)
> +{

I think it might be worth either adding a comment or modifying the name to make it clear that this only works on standalone databases.  Also, it could be moved near the other read functions.  :-)

> diff --git a/ovsdb/storage.h b/ovsdb/storage.h
> new file mode 100644
> index 000000000000..cb5bcb656080
> --- /dev/null
> +++ b/ovsdb/storage.h
> ...
> +struct json;
> +struct ovsdb_schema;
> +struct ovsdb_storage;
> +struct ovsdb_completion;

I don't think "ovsdb_completion" is needed in the header file.

> diff --git a/ovsdb/transaction.c b/ovsdb/transaction.c
> index 893ea1152c5a..de3cb5995af1 100644
> --- a/ovsdb/transaction.c
> +++ b/ovsdb/transaction.c
> ...
> +/* Proposes 'txn' for commitment and then waits for the commit to succeed or
> + * fail  Returns null if successful, otherwise the error.

There's a period missing after "fail".

> diff --git a/ovsdb/trigger.c b/ovsdb/trigger.c
> index 346db7b5fb28..543b31d68dd9 100644
> --- a/ovsdb/trigger.c
> +++ b/ovsdb/trigger.c
> 
> 
> +/* Returns a JSON-RPC message that may be sent to a client to indicate that
> + * 'trigger' was canceled. */
> void
> +ovsdb_trigger_cancel(struct ovsdb_trigger *trigger, const char *reason)
> {

I don't think this description matches the function any longer.

> +    if (trigger->progress) {
> +        /* XXX The transaction still might complete asynchronously. */

I'm guessing this "XXX" is not something that's easily addressable...

> @@ -147,81 +185,190 @@ ovsdb_trigger_wait(struct ovsdb *db, long long int now)
> static bool
> ovsdb_trigger_try(struct ovsdb_trigger *t, long long int now)
> {
> ...
> +            if (!strcmp(ovsdb_error_get_tag(error), "cluster error")) {
> +                /* Temporary error.  Transition back to "initialized" state to
> +                 * try again. */
> +                jsonrpc_msg_destroy(t->reply);
> +                t->reply = NULL;
> +                t->db->run_triggers = true; /* XXX? */
> +                ovsdb_error_destroy(error);

Should this "XXX be addressed?

> diff --git a/ovsdb/trigger.h b/ovsdb/trigger.h
> index d9df97f31222..74636baba8b2 100644
> --- a/ovsdb/trigger.h
> +++ b/ovsdb/trigger.h
> @@ -20,13 +20,35 @@
> 
> struct ovsdb;
> 
> +/* Triggers have the following states:
> + *
> + *    - Initialized (reply == NULL, progress == NULL): Executing the trigger
> + *      can keep it in the initialized state, if it has a "wait" condition that
> + *      isn't met.  Executing the trigger can also yield an error, in which
> + *      case it transition to "complete".  Otherwise, execution yields a

s/transition/transitions/

> diff --git a/tutorial/ovs-sandbox b/tutorial/ovs-sandbox
> index 1632ad15da5a..f3aeafd55ec7 100755
> --- a/tutorial/ovs-sandbox
> +++ b/tutorial/ovs-sandbox
> 
> 
> @@ -348,36 +384,85 @@ fi
> rungdb $gdb_ovsdb $gdb_ovsdb_ex ovsdb-server --detach --no-chdir --pidfile -vconsole:off --log-file \
>     --remote=punix:"$sandbox"/db.sock $ovsdb_server_args
> if $ovn; then
> 
> +    ovn_start_db() {
> +	local db=$1 model=$2 servers=$3 schema=$4

This isn't introduced by this patch, but there are a number of tab/spaces inconsistencies in this file.  However, this patch adds some more, starting with the above two lines.

> diff --git a/utilities/ovs-lib.in b/utilities/ovs-lib.in
> index 1bccea0c5b87..2ed238d62b05 100644
> --- a/utilities/ovs-lib.in
> +++ b/utilities/ovs-lib.in
> @@ -440,3 +440,50 @@ upgrade_db () {
> 
> +create_cluster () {
> +    DB_FILE="$1"
> +    DB_SCHEMA="$2"
> +    LOCAL_ADDR="$3"
> +
> +    if test ! -e "$DB_FILE"; then
> +        action "Creating cluster database $DB_FILE" ovsdb_tool create-cluster "$DB_FILE" "$DB_SCHEMA" "$LOCAL_ADDR"
> +    else
> +        # DB file exists. Check if it is a clustered db or not. If it is a
> +        # clustered db, nothing to be done. Else create a clustered db from that.

This line is too long, but I also think the comment could be more clearly written.

> +        ovsdb_tool db-local-address $DB_FILE
> +        if [ "$?" = "1" ]; then
> +            version=`ovsdb_tool db-version "$DB_FILE"`
> +            cksum=`ovsdb_tool db-cksum "$DB_FILE" | awk '{print $1}'`
> +            backup=$DB_FILE.backup$version-$cksum
> +            action "Backing up database to $backup" cp "$DB_FILE" "$backup" || return 1
> +
> +            action "Creating cluster database $DB_FILE from existing one" \
> +            ovsdb_tool create-cluster "$DB_FILE" "$backup" "$LOCAL_ADDR"
> +        fi
> +    fi
> +}

ovn-ctl calls this function if it needs to create a cluster, and it calls upgrade_db() if the database is being run standalone.  upgrade_db() does a lot more, including upgrading to the latest schema and compacting the database.  Should this function do something similar if it's converting an existing standalone database?

Also, what is the process for upgrade?  In standalone, one just needed to tell ovn-ctl to restart, and the database is upgraded. How will that be handled for existing clustered databases?

> +join_cluster() {
> +    DB_FILE="$1"
> +    SCHEMA_NAME="$2"
> +    LOCAL_ADDR="$3"
> +    REMOTE_ADDR="$4"
> +
> +    if test ! -e "$DB_FILE"; then
> +        ovsdb_tool join-cluster "$DB_FILE" "$SCHEMA_NAME" "$LOCAL_ADDR" "$REMOTE_ADDR"
> +    else
> +        # DB file exists. Check if it is a clustered db or not. If it is a
> +        # clustered db, nothing to be done. Else backup the db and join the cluster.

This line is too long, but I also think the comment could be more clearly written.

> +        ovsdb_tool db-local-address $DB_FILE
> +        if [ "$?" = "1" ]; then
> +            version=`ovsdb_tool db-version "$DB_FILE"`
> +            cksum=`ovsdb_tool db-cksum "$DB_FILE" | awk '{print $1}'`
> +            backup=$DB_FILE.backup$version-$cksum
> +            action "Backing up database to $backup" cp "$DB_FILE" "$backup" || return 1

There are three identical blocks of code in this file that backup the database.  It might be worth making this a function.

> +            action "Creating cluster database $DB_FILE from existing one" \
> +            ovsdb_tool join-cluster "$DB_FILE" "$SCHEMA_NAME" "$LOCAL_ADDR"

I think the description provided to "action" is incorrect.

Here are a few other random things that i think might be worth looking at:

	- I'm seeing some reported memory leaks.  You may want to try running some of the unit tests under valgrind (e.g., "2242: insert row, query table, commit durably - cluster of 5").

	- checkpatch has quite a few errors and warnings.  A few of them actually look correct.

	- It looks like support for the clustered database wasn't added to the Python IDL.  Do you plan on adding that?

This is really an impressive set of changes.  Sorry it's taken so long to review--there's a lot of code!  Thanks for adding this critical feature, and thanks for writing such extensive tests, which give me a lot more confidence than my ability to review it all.

I should be able to finish up the review later today.

--Justin