[ovs-discuss] the raft_is_connected state of a raft server stays as false and cannot recover

Yun Zhou yunz at nvidia.com
Fri Aug 14 00:26:28 UTC 2020


Hi,

We need an expert's view on a problem we see now and then: an ovsdb-server node in a 3-node raft cluster keeps printing the "raft_is_connected: false" message, and its "connected" state in its _Server DB stays false.

According to the ovsdb-server(5) manpage, this means the server is not in contact with a majority of its cluster.

Apart from its "connected" state, as far as we can tell this server is in the follower state and works fine, and the connections between it and the other two servers appear healthy as well.

Below is a snapshot of its raft structure at the time of the problem. Note that its candidate_retrying field stays true.

Hopefully the information provided can help figure out what is going wrong here. Unfortunately, we don't have a reliable way to reproduce it:

(gdb) print *(struct raft *)0xa872c0
$19 = {
  hmap_node = {
    hash = 2911123117,
    next = 0x0
  },
  log = 0xa83690,
  cid = {
    parts = {2699238234, 2258650653, 3035282424, 813064186}
  },
  sid = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  local_address = 0xa874e0 "tcp:10.8.51.55:6643",
  local_nickname = 0xa876d0 "3fdb",
  name = 0xa876b0 "OVN_Northbound",
  servers = {
    buckets = 0xad4bc0,
    one = 0x0,
    mask = 3,
    n = 3
  },
  election_timer = 1000,
  election_timer_new = 0,
  term = 3,
  vote = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  synced_term = 3,
  synced_vote = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  entries = 0xbf0fe0,
  log_start = 2,
  log_end = 312,
  log_synced = 311,
  allocated_log = 512,
  snap = {
    term = 1,
    data = 0xaafb10,
    eid = {
      parts = {1838862864, 1569866528, 2969429118, 3021055395}
    },
    servers = 0xaafa70,
    election_timer = 1000
  },
  role = RAFT_FOLLOWER,
  commit_index = 311,
  last_applied = 311,
  leader_sid = {
    parts = {642765114, 43797788, 2533161504, 3088745929}
  },
  election_base = 6043283367,
  election_timeout = 6043284593,
  joining = false,
  remote_addresses = {
    map = {
      buckets = 0xa87410,
      one = 0xa879c0,
      mask = 0,
      n = 1
    }
  },
  join_timeout = 6037634820,
  leaving = false,
  left = false,
  leave_timeout = 0,
  failed = false,
  waiters = {
    prev = 0xa87448,
    next = 0xa87448
  },
  listener = 0xaafad0,
  listen_backoff = -9223372036854775808,
  conns = {
    prev = 0xbcd660,
    next = 0xaafc20
  },
  add_servers = {
    buckets = 0xa87480,
    one = 0x0,
    mask = 0,
    n = 0
  },
  remove_server = 0x0,
  commands = {
    buckets = 0xa874a8,
    one = 0x0,
    mask = 0,
    n = 0
  },
  ping_timeout = 6043283700,
  n_votes = 1,
  candidate_retrying = true,
  had_leader = false,
  ever_had_leader = true
}
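
One way to read the dump above is to check which of its boolean flags could keep "connected" false. The sketch below is not taken from the OVS sources. It assumes, purely for illustration, that the server counts as connected only when none of candidate_retrying, joining, leaving, left or failed is set and ever_had_leader is true. The struct raft_flags type and the flags_imply_connected() helper are hypothetical names, and only the flag values are copied from the dump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the boolean fields printed in the gdb dump.
 * This is not the real struct raft. */
struct raft_flags {
    bool candidate_retrying;
    bool joining;
    bool leaving;
    bool left;
    bool failed;
    bool ever_had_leader;
};

/* Assumed rule (an illustration, not confirmed against the OVS code):
 * "connected" requires every trouble flag to be clear and a leader to
 * have been seen at least once. */
bool
flags_imply_connected(const struct raft_flags *f)
{
    assert(f != NULL);
    return !f->candidate_retrying
           && !f->joining
           && !f->leaving
           && !f->left
           && !f->failed
           && f->ever_had_leader;
}

/* Flag values copied from the dump above. */
const struct raft_flags dump_flags = {
    .candidate_retrying = true,
    .joining = false,
    .leaving = false,
    .left = false,
    .failed = false,
    .ever_had_leader = true,
};
```

Under that assumption, flags_imply_connected(&dump_flags) is false, and candidate_retrying is the only flag in the dump that would have to change for it to become true. That would be consistent with the "connected" state staying false while the role is RAFT_FOLLOWER.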

Thanks
- Yun

