[ovs-dev] ovsdb failures in cluster mode in IPv6 setup

Riccardo Ravaioli rravaiol at redhat.com
Wed Oct 6 20:02:26 UTC 2021


Hi all,

I have an issue with ovsdb in cluster mode when an instance of a db server
fails.

I'm running an HA single-stack IPv6 ovn-kubernetes Kind cluster, where we
have ovnnb_db and ovnsb_db replicated on three nodes. All control traffic
is IPv6.
Then, to simulate a node failure, I take one node, delete its db files,
and also delete the pod that hosts the db server.
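Concretely, this boils down to something like the following on the chosen
node (paths and pod names are from my setup and may differ elsewhere):

    rm -f /etc/ovn/ovnnb_db.db /etc/ovn/ovnsb_db.db
    kubectl -n ovn-kubernetes delete pod <ovnkube-db pod on that node>
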
The pod and the db files are recreated, but "ovs-appctl cluster/status
OVN_Northbound" still shows the *old* server instance alongside the new
one.
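(For reference, I run that against the NB db control socket, along the
lines of "ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status
OVN_Northbound"; the socket path may differ in other images.)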

Indeed, when I look at the ovsdb-server-nb debug logs on the affected node,
I see that it is still receiving heartbeat messages addressed to both the
new server (to which it correctly replies) and the old one, for which it
raises an error: "syntax error: Parsing raft append_request RPC failed:
misrouted message (addressed to 0227 but we're bcda)".
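If it helps, those hex prefixes look like abbreviated raft server IDs
(SIDs); the SID of the recreated server can be read straight from the new
db file with something along the lines of

    ovsdb-tool db-sid /etc/ovn/ovnnb_db.db

(path is from my setup), and it should match the "we're ..." part of the
error, while the "addressed to ..." one would be the wiped instance.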

On the other hand, in an HA single-stack IPv4 cluster, everything works as
expected:
1) for a few tens of seconds, the cluster/status command from above shows
both the old and the new server, as in the IPv6 case;
2) then the old server is removed and the new one is correctly added to
the cluster.

This is confirmed in the ovsdb-server-nb logs, where I see the
remove_server_request and remove_server_reply messages.
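For what it's worth, this is what I grep for on the db pods:

    grep -E 'remove_server_(request|reply)' <ovsdb-server-nb log file>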

However, in an HA IPv6 cluster, I keep seeing 4 servers and no
"remove_server_*" messages in the logs... so it is stuck at step 1)
above.

Is this a bug? Is there anything I can do to debug this further?

Thanks!

Riccardo

