[ovs-dev] [PATCH 15/15] ovsdb: Introduce experimental support for clustered databases. (2/2)

Justin Pettit jpettit at ovn.org
Sat Feb 24 23:17:51 UTC 2018


> 
> On Dec 31, 2017, at 9:16 PM, Ben Pfaff <blp at ovn.org> wrote:
> 
> diff --git a/Documentation/ref/ovsdb.5.rst b/Documentation/ref/ovsdb.5.rst
> index f3e50976b5c7..0ab888996eac 100644
> --- a/Documentation/ref/ovsdb.5.rst
> +++ b/Documentation/ref/ovsdb.5.rst
> 
> +When the server successfully joins the cluster, the database file is replaced
> +by one in the general format described earlier.

Just to be crystal clear, I think it be good to reference that it's the one described in the "Clustered Format" section.

> diff --git a/ovsdb/raft-private.c b/ovsdb/raft-private.c
> new file mode 100644
> index 000000000000..457d1292a949
> --- /dev/null
> +++ b/ovsdb/raft-private.c
> 
> +struct ovsdb_error * OVS_WARN_UNUSED_RESULT
> +raft_address_validate(const char *address)
> +{
> ...
> +        return ovsdb_error(NULL, "%s: expected \"tcp\" or \"ssl\" address",
> +                           address);

I assume you're downplaying "unix:", since it only makes sense for testing purposes?

> +struct ovsdb_error * OVS_WARN_UNUSED_RESULT
> +raft_address_validate_json(const struct json *address)

This function is only used within this file, so it could be declared static if you don't think it's generally useful.

> +char *
> +raft_address_to_nickname(const char *address, const struct uuid *sid)
> +{

I think it may be worth explaining what a nickname is and how one gets one.

> +static struct ovsdb_error * OVS_WARN_UNUSED_RESULT
> +raft_servers_from_json__(const struct json *json, struct hmap *servers)
> +{
> ...
> +        if (!uuid_from_string(&sid, node->name)) {
> +            return ovsdb_syntax_error(json, NULL, "%s is a not a UUID",

I think you can drop the first "a" in that error message.

> +void
> +raft_servers_format(const struct hmap *servers, struct ds *ds)
> +{
> +    int i = 0;
> +    const struct raft_server *s;
> +    HMAP_FOR_EACH (s, hmap_node, servers) {
> +        if (i++) {
> +            ds_put_cstr(ds, ", ");
> +        }
> +        ds_put_format(ds, SID_FMT"(%s)", SID_ARGS(&s->sid), s->address);

Do you think it's worth putting a space between the SID and parenthetical address?

> +static void
> +raft_header_from_json__(struct raft_header *h, struct ovsdb_parser *p)
> +{
> ...
> +    /* Parse "remotes", if present.

Should that be "remote_addresses"?

> +static void
> +raft_record_from_json__(struct raft_record *r, struct ovsdb_parser *p)
> +{
> 
> +    /* All remaining types of log records include "term", plus at most one of:
> +     *
> +     *     - "index" plus zero or more of "data" and "servers".  If "data" is
> +     *       present then "eid" may also be present.
> ...
> +        r->entry.data = json_nullable_clone(
> +            ovsdb_parser_member(p, "data",
> +                                OP_OBJECT | OP_ARRAY | OP_OPTIONAL));
> +        r->entry.eid = (r->entry.data
> +                        ? raft_parse_required_uuid(p, "eid")
> +                        : UUID_ZERO);

The description of "index" makes it sound like "eid" is optional if "data" is present, but this code seems to expect it to be set.

> diff --git a/ovsdb/raft-private.h b/ovsdb/raft-private.h
> new file mode 100644
> index 000000000000..6e147fadb0ac
> --- /dev/null
> +++ b/ovsdb/raft-private.h
> ...
> +    char *nickname;             /* 1ab3(s3) */

It might be nice to put the example in quotes.

> +const char *raft_servers_get_nickname__(const struct hmap *servers,
> +                                        const struct uuid *sid);
> ...
> +bool raft_parse_uuid__(struct ovsdb_parser *, const char *name, bool optional,
> +                       struct uuid *);

The coding style states that functions with a "__" suffix are used to indicate "internal use only".  I suppose this entire file is supposed to be internal, but it still strikes me as funny to see these function in header files.  What do you think?

> diff --git a/ovsdb/raft-rpc.c b/ovsdb/raft-rpc.c
> new file mode 100644
> index 000000000000..617d4aa4eaad
> --- /dev/null
> +++ b/ovsdb/raft-rpc.c
> 
> ...
> +static void
> +raft_append_request_from_jsonrpc(struct ovsdb_parser *p,
> +                                 struct raft_append_request *rq)
> +{
> ...
> +
> +    const struct json *log = ovsdb_parser_member(p, "log", OP_ARRAY);
> +    if (!log) {
> +        return;
> +    }

Should this parser argument include OP_OPTIONAL?

> +static void
> +raft_append_reply_to_jsonrpc(const struct raft_append_reply *rpy,
> +                             struct json *args)
> +{
> ...
> +    json_object_put_string(args, "result",
> +                           raft_append_result_to_string(rpy->result));

I believe raft_append_result_to_string() can return NULL, which could cause a problem here.

> +static void
> +raft_format_append_reply(const struct raft_append_reply *rpy, struct ds *s)
> +{
> ...
> +    ds_put_format(s, " result=\"%s\"",
> +                  raft_append_result_to_string(rpy->result));

I don't think this will crash, but it will print strange if 'rpy->result' is not valid.

Is it intentional that "n_entries" and "prev_log_*" aren't printed?

> +static void
> +raft_format_vote_request(const struct raft_vote_request *rq, struct ds *s)
> +{
> +    ds_put_format(s, " term=%"PRIu64, rq->term);
> +    ds_put_format(s, " last_log_index=%"PRIu64, rq->last_log_index);
> +    ds_put_format(s, " last_log_term=%"PRIu64, rq->last_log_term);
> +}

Should "leadership_transfer" be printed?

> +static void
> +raft_install_snapshot_request_to_jsonrpc(
> +    const struct raft_install_snapshot_request *rq, struct json *args)
> +{
> ...
> +    json_object_put(args, "last_servers", json_clone(rq->last_servers));
> ...
> +
> +    json_object_put(args, "data", json_clone(rq->data));

I believe "rq->last_servers" and 'rq->data' may be null from raft_install_snapshot_request_clone().

> +static void
> +raft_format_install_snapshot_reply(
> +    const struct raft_install_snapshot_reply *rpy, struct ds *s)
> +{
> +    ds_put_format(s, " term=%"PRIu64, rpy->term);
> +}

Was it intentional to leave off "last_index" and "last_term"?

> +static void
> +raft_execute_command_request_to_jsonrpc(
> +    const struct raft_execute_command_request *rq, struct json *args)
> +{
> +    json_object_put(args, "data", json_clone(rq->data));

Can 'rq->data' be null from raft_execute_command_request_clone()?

> +static void
> +raft_format_execute_command_request(
> +    const struct raft_execute_command_request *rq, struct ds *s)
> +{
> ...
> +    json_to_ds(rq->data, JSSF_SORT, s);

Same question about 'rq->data' being null.

> +static void
> +raft_execute_command_reply_to_jsonrpc(
> +    const struct raft_execute_command_reply *rpy, struct json *args)
> +{
> +    json_object_put_format(args, "result", UUID_FMT, UUID_ARGS(&rpy->result));
> +    json_object_put_string(args, "status",
> +                           raft_command_status_to_string(rpy->status));

I believe raft_command_status_to_string() can return null.

> +static void
> +raft_format_execute_command_reply(
> +    const struct raft_execute_command_reply *rpy, struct ds *s)
> +{
> +    ds_put_format(s, " result="UUID_FMT, UUID_ARGS(&rpy->result));
> +    ds_put_format(s, " status=\"%s\"",
> +                  raft_command_status_to_string(rpy->status));

Same concern, but this wouldn't cause a crash.

> +struct ovsdb_error * OVS_WARN_UNUSED_RESULT
> +raft_rpc_from_jsonrpc(struct uuid *cidp,
> +                      const struct uuid *sid,
> +                      const struct jsonrpc_msg *msg,
> +                      union raft_rpc *rpc)
> +{

I think it would be worth adding a comment to this function; behavior such as how 'cidp' is used is not obvious unless you read the code.

> +void
> +raft_rpc_format(const union raft_rpc *rpc, struct ds *s)
> +{
> +    ds_put_format(s, "%s", raft_rpc_type_to_string(rpc->type));
> +    if (rpc->common.comment) {
> +        ds_put_format(s, " \"%s\"", rpc->common.comment);
> +    }
> +    ds_put_char(s, ':');

Is printing the SID not important?

> diff --git a/lib/jsonrpc.h b/lib/jsonrpc.h
> index 969a6ed38cd6..a44114e8dcd9 100644
> --- a/lib/jsonrpc.h
> +++ b/lib/jsonrpc.h
> 
> struct raft_rpc_common {
>     enum raft_rpc_type type;
>     struct uuid sid;            /* SID of peer server. */

I found it surprising in the code that 'sid' could be the source or destination depending on whether we were converting to or from jsonrpc.  I think it could be made clearer here.

> diff --git a/ovsdb/raft.c b/ovsdb/raft.c
> new file mode 100644
> index 000000000000..9c358bf35687
> --- /dev/null
> +++ b/ovsdb/raft.c
> ...
> +/* An action deferred until a log write commits to disk. */
> +struct raft_waiter {
> +    struct ovs_list list_node;
> +    uint64_t commit_ticket;
> +
> +    enum raft_waiter_type type;
> +    union {
> +        /* RAFT_W_ENTRY.
> +         *
> +         * Waits for a RAFT_REC_ENTRY write to our local log to commit.  Upon
> +         * completion, updates log_synced to indicate that the new log entry or
> +         * entries are committed and, if we are leader, also updates our local
> +         * match_index. */
> +        struct {
> +            uint64_t index;
> +        } entry;
> +
> +        /* RAFT_W_TERM.
> +         *
> +         * Waits for a RAFT_REC_TERM or RAFT_REC_VOTE record write to commit.
> +         * Upon completion, updates synced_term and synced_vote, which triggers
> +         * sending RPCs deferred by the uncommitted term and vote. */
> +        struct {
> +            uint64_t term;
> +            struct uuid vote;
> +        } term;
> +
> +        /* RAFT_W_RPC.
> +         *
> +         * Sometimes, sending an RPC to a peer must be delayed until an entry,
> +         * a term, or a vote mentioned in the RPC is synced to disk.  This
> +         * waiter keeps a copy of such an RPC until the previous waiters have
> +         * committed. */
> +        union raft_rpc *rpc;
> +    };
> +};

Minor, but I think it's a little easier to read when single quotes are used around references to struct members.

> +/* The Raft state machine. */
> +struct raft {
> ...
> +/* Persistent derived state.
> + *
> + * This must be updated on stable storage before responding to RPCs.  It can be
> + * derived from the header, snapshot, and log in 'log'. */
> +
> +    struct uuid cid;            /* Cluster ID (immutable for the cluster). */
> +    struct uuid sid;            /* Server ID (immutable for the server). */
> +    char *local_address;        /* Local address (immutable for the server). */
> +    char *local_nickname;       /* Used for local server in log messages. */
> +    char *name;                 /* Cluster name (immutable for the cluster). */

I think 'name' is the schema name based on other references, including raft_get_name(), but this makes it sound different to me.

> +/* Persistent state on all servers.
> + *
> + * Must be updated on stable storage before responding to RPCs. */
> +
> +    /* Current term and vote, which might be on the way to disk now. */
> +    uint64_t term;              /* Initialized to 0 and only increases. */
> +    struct uuid vote;           /* In 'term', or all-zeros if none. */

I found the 'vote' description confusing.

> +/* Creates an on-disk file that represents a new Raft cluster and initializes
> + * it to consist of a single server, the one on which this function is called.
> + *
> + * Creates the local copy of the cluster's log in 'file_name', which must not
> + * already exist.  Gives it the name 'name', which should be the database
> + * schema name and which is used only to match up this database with server
> + * added to the cluster later if the cluster ID is unavailable.

"with *the* server"

> + * The new server is located at 'local_address', which must take one of the
> + * forms "tcp:IP[:PORT]" or "ssl:IP[:PORT]", where IP is an IPv4 address or a
> + * square bracket enclosed IPv6 address.  PORT, if present, is a port number
> + * that defaults to RAFT_PORT.

I think the port must be explicitly set.  I don't think there is a "RAFT_PORT" default defined.

> +/* Creates a database file that represents a new server in an existing Raft
> + * cluster.
> + *
> + * Creates the local copy of the cluster's log in 'file_name', which must not
> + * already exist.  Gives it the name 'name', which must be the same name
> + * passed in to raft_create_cluster() earlier.

Is raft_create_cluster() supposed to be called before this?  I think raft_create_cluster() would have already created the log file, which the previous sentence said must not already exist.

> + * 'cid' is optional.  If specified, the new server will join only the cluster
> + * with the given cluster ID.
> + *
> + * The new server is located at 'local_address', which must take one of the
> + * forms "tcp:IP[:PORT]" or "ssl:IP[:PORT]", where IP is an IPv4 address or a
> + * square bracket enclosed IPv6 address.  PORT, if present, is a port number
> + * that defaults to RAFT_PORT.

Same comment about explicitly setting the port and "RAFT_PORT" not being defined.

> + *
> + * Joining the cluster requiring contacting it.  Thus, 'remote_addresses'

s/requiring/requires/

> + * This only creates the on-disk file and does no network access.  Use
> + * raft_open() to start operating the new server.  (Until this happens, the
> + * new server has not joined the cluster.)

The name for this function (and the corresponding ovsdb-tool command) seems a bit confusing in this way.  Not a big deal, but it doesn't seem obvious.

> + * Returns null if successful, otherwise an ovsdb_error describing the
> + * problem. */
> +struct ovsdb_error * OVS_WARN_UNUSED_RESULT
> +raft_join_cluster(const char *file_name,
> +                  const char *name, const char *local_address,
> +                  const struct sset *remote_addresses,
> +                  const struct uuid *cid)
> +{
> ...
> +    /* Write log file. */
> +    struct raft_header h = {
> +        .sid = uuid_random(),
> +        .cid = cid ? *cid : UUID_ZERO,
> +        .name = xstrdup(name),
> +        .local_address = xstrdup(local_address),
> +        .joining = true,
> +    };
> +    sset_clone(&h.remote_addresses, remote_addresses);
> +    error = ovsdb_log_write_and_free(log, raft_header_to_json(&h));
> +    raft_header_uninit(&h);

I don't think anything initializes 'h->snap', which seems like it could be a problem for raft_header_to_json() and raft_header_uninit().

> +/* Reads the initial header record from 'log', which must be a Raft clustered
> + * database log, and populates '*md' with the information read from it.  The
> + * caller must eventually destroy 'md'.

You might mention that they do this by calling raft_metadata_destroy().

> + *
> + * On success, returns NULL.  On failure, returns a error that the caller must
> + * eventually destroy and zeros '*md'. */

s/a error/an error/

> +static struct json *
> +raft_servers_for_index(const struct raft *raft, uint64_t index)

I think it would be worth mentioning that the caller must free the return value.

> +static struct ovsdb_error * OVS_WARN_UNUSED_RESULT
> +raft_read_header(struct raft *raft)
> +{
> +    /* Read header record. */
> +    struct json *json;
> +    struct ovsdb_error *error = ovsdb_log_read(raft->log, &json);

I don't think anything free 'json'.

> +static void
> +raft_add_conn(struct raft *raft, struct jsonrpc_session *js,
> +              const struct uuid *sid, bool incoming)
> +{
> ...
> +    if (sid) {
> +        conn->sid = *sid;
> +    }
> +    conn->nickname = raft_address_to_nickname(jsonrpc_session_get_name(js),
> +                                              &conn->sid);

The 'sid' argument can be null, which seems like it could be a problem with raft_address_to_nickname().

> +/* Attempts to start 'raft' leaving its cluster.  The caller can check progress
> + * using raft_is_leaving() and raft_left(). */
> +void
> +raft_leave(struct raft *raft)
> +{
> ...
> +    raft->leaving = true;
> +    raft_transfer_leadership(raft, "this server is leaving the cluster");
> +    raft_become_follower(raft);
> +    raft_send_remove_server_requests(raft);
> +    raft->leave_timeout = time_msec() + ELECTION_BASE_MSEC;
> +    raft->leaving = true;

This sets 'raft->leaving' twice.  Was that intentional?

> +/* Returns true if 'raft' is successfully left its cluster. */
> +bool
> +raft_left(const struct raft *raft)

I think you can drop "is" in the comment describing the function.

> +static void
> +raft_close__(struct raft *raft)
> +{

I think it would be worth explaining why this would be called separately from raft_close().

> +    struct raft_server *rs = raft->remove_server;
> +    if (rs) {
> +        raft_send_remove_server_reply__(raft, &rs->sid, &rs->requester_sid,
> +                                        rs->requester_conn, false,
> +                                        RAFT_SERVER_SHUTDOWN);
> +        raft_server_destroy(raft->remove_server);
> +    }

I think it might be safer to set 'raft->remove_server' to null in case this gets called, and then raft_close() is called.

> void
> raft_close(struct raft *raft)
> {

Should something free the 'raft->waiters' entries?

> +static bool
> +raft_conn_receive(struct raft *raft, struct raft_conn *conn,
> +                  union raft_rpc *rpc)
> +{
> ...
> +    } else if (!uuid_equals(&conn->sid, &rpc->common.sid)) {
> +        static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
> +        VLOG_WARN_RL(&rl,
> +                     "%s: remote server ID changed from "SID_FMT" to "SID_FMT,
> +                     jsonrpc_session_get_name(conn->js),
> +                     SID_ARGS(&conn->sid), SID_ARGS(&rpc->common.sid));

If the SID changed, shouldn't we update "conn->sid"?

> +/* Returns true if 'conn' should stay open, 'conn' if it should be closed. */
> +static bool
> +raft_conn_should_stay_open(struct raft *raft, struct raft_conn *conn)

I assume "false" if it should be closed.  :-)

> +/* Allows 'raft' to maintain the distributed log.  Call this function as part
> + * of the process's main loop. */
> +void
> +raft_run(struct raft *raft)
> +{
> ...
> +    /* Close unneeded sessions. */
> +    struct raft_conn *next;
> +    LIST_FOR_EACH_SAFE (conn, next, list_node, &raft->conns) {
> +        if (!raft_conn_should_stay_open(raft, conn)) {
> +            jsonrpc_session_close(conn->js);
> +            ovs_list_remove(&conn->list_node);
> +            free(conn);

I think 'conn->nickname' needs to be freed, too.  Maybe a helper function, since there are a couple of places that free 'conn'

> +    if (raft->joining && time_msec() >= raft->join_timeout) {
> +        raft->join_timeout = time_msec() + 1000;
> +        struct raft_conn *conn;

This 'conn' shadows a previous declaration.  I think you can just drop declaring this one.

> +    if (time_msec() >= raft->ping_timeout) {
> +        if (raft->role == RAFT_LEADER) {
> +            raft_send_heartbeats(raft);
> +        } else {
> +            long long int now = time_msec();
> +            struct raft_command *cmd, *next;

This 'next' shadows a 'next' of a different type.

> +/* Frees 'cmd'. */
> +void
> +raft_command_unref(struct raft_command *cmd)
> +{

It seems like a more accurate description would be along the lines of "Decrements the ref count for 'cmd' and frees it if zero."

> +/* 'sid' is the server being added or removed. */
> +static void
> +raft_send_add_server_reply__(struct raft *raft, const struct uuid *sid,
> +                             const char *address,
> +                             bool success, const char *comment)

The description makes it sound like a server could be removed using this function, but I don't think that's the case.

> +static void
> +raft_server_init_leader(struct raft *raft, struct raft_server *s)
> +{
> +    s->next_index = raft->log_end;
> +    s->match_index = 0;
> +    s->phase = RAFT_PHASE_STABLE;
> +}

The name of this function is a little surprising to me.  It seems like it's initializing a server, not necessarily a leader.

> +/* Processes term 'term' received as part of RPC 'common'.  Returns true if the
> + * caller should continue processing the RPC, false if the caller should reject
> + * it due to a stale term. */
> +static bool
> +raft_receive_term__(struct raft *raft, const struct raft_rpc_common *common,
> +                    uint64_t term)
> +{
> +    /* Section 3.3 says:
> +     *
> +     *     Current terms are exchanged whenever servers communicate; if one
> +     *     server’s current term is smaller than the other’s, then it updates
> +     *     its current term to the larger value.  If a candidate or leader
> +     *     discovers that its term is out of date, it immediately reverts to
> +     *     follower state.  If a server receives a request with a stale term
> +     *     number, it rejects the request.
> +     */
> +    if (term > raft->term) {
> +        if (!raft_set_term(raft, term, NULL)) {
> +            return false;

I don't understand how the result of raft_set_term() plays into this decision to return false.

> +static const struct json *
> +raft_peek_next_entry(struct raft *raft, struct uuid *eid)
> +{
> +    /* Invariant: log_start - 2 <= last_applied <= commit_index < log_end. */
> +    ovs_assert(raft->log_start <= raft->last_applied + 2);
> +    ovs_assert(raft->last_applied <= raft->commit_index);
> +    ovs_assert(raft->commit_index < raft->log_end);
> +
> +    if (raft->joining || raft->failed) {        /* XXX needed? */
> +        return NULL;
> +    }

Did you want to check this "XXX"?

> +/* If 'prev_log_index' exists in 'raft''s log, term 'prev_log_term', returns
> + * NULL.  Otherwise, returns an explanation for the mismatch.  */
> +static const char *
> +match_index_and_term(const struct raft *raft,
> +                     uint64_t prev_log_index, uint64_t prev_log_term)

The description of 'prev_log_term' was confusing to me.  I think it may be missing some text about it matching the term pointed to by 'prev_log_index'.

> +/* Returns NULL on success, RAFT_IN_PROGRESS for an operation in progress,
> + * otherwise a brief comment explaining failure. */
> +static void
> +raft_handle_append_entries(struct raft *raft,
> +                           const struct raft_append_request *rq,
> +                           uint64_t prev_log_index, uint64_t prev_log_term,
> +                           const struct raft_entry *entries,
> +                           unsigned int n_entries)

This description doesn't seem to match the function.

> +        if (raft->entries[log_index - raft->log_start].term
> +            != entries[i].term) {
> +            if (raft_truncate(raft, log_index)) {
> +                servers_changed = true;

Does this always mean the servers changed?

> +static bool
> +raft_update_leader(struct raft *raft, const struct uuid *sid)
> +{
> +    if (raft->role == RAFT_LEADER && !uuid_equals(sid, &raft->sid)) {
> +        char buf[SID_LEN + 1];
> +        VLOG_ERR("this server is leader but server %s claims to be",
> +                 raft_get_nickname(raft, sid, buf, sizeof buf));
> +        return false;

If 'sid' and 'raft->sid' were equal, wouldn't that mean that this other system is using the same sid?  That also seems like it would be a problem.

> +static void
> +raft_handle_append_request__(struct raft *raft,
> +                             const struct raft_append_request *rq)
> +{
> ...
> +    /*
> +     * We now know that the data in rq->entries[] overlaps the data in
> +     * raft->entries[], as shown below, with some positive 'ofs':
> +     *
> +     *     rq->prev_log_index
> +     *       | first_entry_index
> +     *       |   |             nth_entry_index
> +     *       |   |               |
> +     *       v   v               v
> +     *         +---+---+---+---+---+
> +     *       T | T | T | T | T | T |
> +     *         +---+-------+---+---+
> +     *                     +---+---+---+---+
> +     *                   T | T | T | T | T |
> +     *                     +---+---+---+---+
> +     *                       ^               ^
> +     *                       |               |
> +     *                   log_start        log_end
> +     *
> +     *           |<-- ofs -->|

I think 'ofs' points to the middle of the 'log_start' record.  I think it should maybe be moved a couple of spaces to make it clearer.

> +/* Returns the next log entry or snapshot from 'raft', or NULL if there are
> + * none left to read..  Stores the entry ID of the log entry in '*eid'.  Stores
> + * true in '*is_snapshot' if the returned data is a snapshot, false if it is a
> + * log entry.*/
> +const struct json *
> +raft_next_entry(struct raft *raft, struct uuid *eid, bool *is_snapshot)
> +{

There's an extra period after "read".

> static void
> raft_handle_append_request(struct raft *raft,
>                            const struct raft_append_request *rq)
> {   
>     raft_handle_append_request__(raft, rq);
> }                              

Is there a reason not to fold in raft_handle_append_request__()?  There don't seem to be any other callers to it.

> +static void
> +raft_log_reconfiguration(struct raft *raft)
> +{
> +    /* Add the reconfiguration to the log.
> +     *
> +     * We ignore any */

Is this comment not finished?

> +static void
> +raft_run_reconfigure(struct raft *raft)
> +{
> ...
> +    /* Remove a server, if one is scheduled for removal. */
> +    HMAP_FOR_EACH (s, hmap_node, &raft->servers) {
> +        if (s->phase == RAFT_PHASE_REMOVE) {
> +            hmap_remove(&raft->servers, &s->hmap_node);
> +            raft->remove_server = s;

Are we sure that 'raft->remove_server' will only be assigned once?  I think it might be worth asserting that it's null before assigning here.

> +static void
> +raft_handle_remove_server_reply(struct raft *raft,
> +                                const struct raft_remove_server_reply *rpc)
> +{
> +    if (rpc->success) {
> +        VLOG_INFO("%04x: finished leaving cluster %04x",
> +                  uuid_prefix(&raft->sid, 4), uuid_prefix(&raft->cid, 4));

Do you want to use SID_FMT and CID_FMT here?

> +/* Returns true if 'raft' has grown enough that reducing the log to a snapshot
> + * would be valuable, false otherwise.  When this function returns true, the
> + * client should consider using raft_store_snapshot() to reduce the log storage
> + * requirements. */
> +bool
> +raft_grew_lots(const struct raft *raft)
> +{
> +    return (!raft->joining
> +            && !raft->leaving
> +            && !raft->left
> +            && !raft->failed
> +            && raft->last_applied - raft->log_start >= 100
> +            && ovsdb_log_grew_lots(raft->log));

ovsdb_log_grew_lots() is based on the filesize.  Raft entries are bigger.  It's probably not significant, but should this difference be taken that into account?

> +/* Replaces the log for 'raft', up to the last log entry read, by
> + * 'new_snapshot_data'.  Returns NULL if successful, otherwise an error that
> + * the caller must eventually free. */
> +struct ovsdb_error * OVS_WARN_UNUSED_RESULT
> +raft_store_snapshot(struct raft *raft, const struct json *new_snapshot_data)

I think it would be worth mentioning that this function doesn't take ownership of 'new_snapshot_data'.

> +    const struct raft_entry new_snapshot = {
> +        .term = raft_get_term(raft, new_log_start - 1),
> +        .data = CONST_CAST(struct json *, new_snapshot_data),
> +        .eid = *raft_get_eid(raft, new_log_start - 1),
> +        .servers = raft_servers_for_index(raft, new_log_start - 1),
> +    };
> +    struct ovsdb_error *error = raft_save_snapshot(raft, new_log_start,
> +                                                   &new_snapshot);
> +    if (error) {
> +        return error;
> +    }

I think this may leak 'new_snapshot.servers'.  If it doesn't under the normal circumstance, I think it can under an error.

> +static void
> +raft_unixctl_status(struct unixctl_conn *conn,
> +                    int argc OVS_UNUSED, const char *argv[],
> +                    void *aux OVS_UNUSED)
> +{
> ...
> +    ds_put_format(&s, "Cluster ID: ");
> +    if (!uuid_is_zero(&raft->cid)) {
> +        ds_put_format(&s, UUID_FMT"\n", UUID_ARGS(&raft->cid));
> +    } else {
> +        ds_put_format(&s, "not yet known\n");
> +    }
> +    ds_put_format(&s, "Server ID: "SID_FMT" ("UUID_FMT")\n",
> +                  SID_ARGS(&raft->sid), UUID_ARGS(&raft->sid));

Did you intentionally not use CID_FMT like you did for SID?

> +    ds_put_cstr(&s, "Servers:\n");
> +    struct raft_server *server;
> +    HMAP_FOR_EACH (server, hmap_node, &raft->servers) {
> +        ds_put_format(&s, "    %s ("SID_FMT" at %s)",
> +                      server->nickname,
> +                      SID_ARGS(&server->sid), server->address);
> +        if (uuid_equals(&server->sid, &raft->sid)) {
> +            ds_put_cstr(&s, " (me)");

raft_put_sid() seems to use "self" to describe its own SID.  Do you think it's worth using the same terminology?

> +static void
> +raft_unixctl_leave__(struct unixctl_conn *conn, struct raft *raft)
> +{
> +    if (raft_left(raft)) {
> +        unixctl_command_reply(conn, NULL);
> ...
> +    } else {
> +        raft_leave(raft);
> +        unixctl_command_reply(conn, NULL);
> +    }
> +}

It doesn't seem like this makes a distinction between starting to leave and already having left.  It seems like it would be helpful to provide a reply that indicates the difference.

> +static void
> +raft_unixctl_leave(struct unixctl_conn *conn, int argc, const char *argv[],
> +                   void *aux OVS_UNUSED)
> +{
> +    bool force = argc > 2 && !strcmp(argv[1], "--force");

It doesn't appear that "--force" actually does anything.

Some final random things:

	- It might be nice to add some tests that use "tcp" instead of "unix" sockets.  I thought I noticed some instances in the code that could be problematic.

	- I think corner cases, such as 'raft->failed' getting set to true aren't handled in the client code.

	- It's very difficult to determine whether an argument is consumed or not, or whether the return value should be freed.  Better documentation (particularly for public functions) would help.

	- Should the copyright dates just be 2017 and 2018 for new files?  A lot of them reference older years.

Thanks for implementing this important feature!

Acked-by: Justin Pettit <jpettit at ovn.org>

--Justin




More information about the dev mailing list