[ovs-discuss] Issue with failover running ovsdb-server in A/P mode with Pacemaker

Lucas Alvares Gomes lucasagomes at gmail.com
Mon Jul 8 10:58:05 UTC 2019


Hi,

Thanks for reporting, Daniel.

On Mon, Jul 8, 2019 at 11:22 AM Daniel Alvarez Sanchez
<dalvarez at redhat.com> wrote:
>
> Hi folks,
>
> While working with an OpenStack environment running OVN and
> ovsdb-server in A/P configuration with Pacemaker we hit an issue that
> has probably been around for a long time. The bug itself seems to be
> related to ovsdb-server not updating the read-only flag properly.
>
> With a 3-node cluster running ovsdb-server in active/passive mode,
> when we restart the master node, Pacemaker promotes another node to
> master and moves the associated IPAddr2 resource to it.
> At this point, ovn-controller instances across the cloud reconnect to
> the new node but there's a window where ovsdb-server is still running
> as backup.
>
> For those ovn-controller instances that reconnect within that window,
> every attempt to write in the OVSDB will fail with "operation not
> allowed when database server is in read only mode". This state will
> remain forever unless a reconnection is forced. Restarting
> ovn-controller or killing the connection (for example with tcpkill)
> will make things work again.
>
> A workaround in the OVN OCF script could be to make the
> ovsdb_server_promote function wait until we get 'running/active' on
> that instance.
>
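
To make the workaround above concrete, here is a minimal sketch of the
polling logic only. It is written in Python for illustration, whereas the
real OCF agent is a shell script. `get_status` is a hypothetical callable
standing in for whatever the script already uses to query the instance
state; the timeout and interval values are assumptions as well.

```python
import time
from typing import Callable


def wait_until_active(get_status: Callable[[], str],
                      timeout: float = 30.0,
                      interval: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll get_status() until it reports 'running/active'.

    Returns True once the instance is active, or False if the timeout
    expires first. get_status is a hypothetical hook, not a real OVN API.
    """
    deadline = clock() + timeout
    while True:
        if get_status().strip() == "running/active":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

The promote step would only report success after this returns True, so
clients that follow the virtual IP never land on a server that still
considers itself a backup.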
> Another open question is what clients (in this case,
> ovn-controller) should do in such a situation. Should they log an
> error and attempt a (rate-limited) reconnection?
>

I would say so: ovn-controller _requires_ a read-write session to
function properly. It can either retry the connection forever, as you
suggested, or assert and exit if the connection is read-only, or
combine the two (retry first and then exit).
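
The "retry first and then exit" option could look roughly like the sketch
below. It is Python for illustration only (ovn-controller itself is
written in C), and `connect`, `is_read_only`, the attempt limit and the
backoff delays are all hypothetical names and values, not existing
ovn-controller code.

```python
import time
from typing import Any, Callable


class ReadOnlySessionError(RuntimeError):
    """Raised when no read-write session could be obtained."""


def ensure_read_write(connect: Callable[[], Any],
                      is_read_only: Callable[[Any], bool],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> Any:
    """Reconnect with exponential backoff until the session is read-write.

    connect() and is_read_only() are hypothetical hooks supplied by the
    caller. After max_attempts read-only sessions the function gives up
    and raises, which the caller can turn into a clean exit.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        session = connect()
        if not is_read_only(session):
            return session
        if attempt < max_attempts:
            # Rate-limit the reconnections: 1s, 2s, 4s, ... up to max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ReadOnlySessionError(
        "still read-only after %d connection attempts" % max_attempts)
```

Retrying forever is the same loop without the attempt limit; bounding it
just makes the failure visible to whatever supervises the process.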

Also, we need to improve the logs for such errors. While debugging the
problem it wasn't easy to find out why ovn-controller wasn't updating
the database (we were looking at the nb_cfg column of the Chassis
table in the Southbound OVSDB). We checked the state of the
connection (it was stable), the process was healthy, etc. It was only
when we enabled the DBG log level for ovn-controller that we
started seeing messages such as:

2019-07-04T15:11:19.522Z|00148|jsonrpc|DBG|tcp:172.17.1.27:6642:
received notification, method="update2",
params=[["monid","OVN_Southbound"],{"Chassis":
{"cb669c72-0f84-412c-a3bf-482119649d85":{"modify":{"nb_cfg":3300}}}}]
2019-07-04T15:11:19.522Z|00149|jsonrpc|DBG|tcp:172.17.1.27:6642:
received reply, result=[{"details":"update operation not allowed when
database server is in read only mode","error":"not allowed"}],
id=8062

So, perhaps logging it as ERROR would be better, because without the
DBG level all we could see in the logs were two INFO messages saying
that it had reconnected to the Southbound OVSDB.
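
As an illustration of what surfacing this at ERROR level might involve,
here is a small Python sketch that scans the "result" array of a
transaction reply, shaped like the one in the DBG output above, for
"not allowed" errors and logs them. The function and logger names are
made up for this example, and "not allowed" may cover conditions other
than read-only mode, so the "details" string is logged as-is.

```python
import logging
from typing import Any, List

log = logging.getLogger("ovsdb-client-sketch")  # hypothetical logger name


def not_allowed_errors(reply_result: List[Any]) -> List[str]:
    """Return the 'details' of every operation rejected as 'not allowed'."""
    return [op.get("details", "")
            for op in reply_result
            if isinstance(op, dict) and op.get("error") == "not allowed"]


def log_not_allowed_errors(remote: str, reply_result: List[Any]) -> int:
    """Log each rejected operation at ERROR level; return how many there were."""
    details = not_allowed_errors(reply_result)
    for detail in details:
        log.error("%s: transaction rejected: %s", remote, detail)
    return len(details)
```

A non-zero return value could also be the trigger for forcing a
reconnection instead of silently keeping the read-only session.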

Cheers,
Lucas

