[ovs-discuss] OVN: Delay in handling unixctl commands in ovsdb-server

Thu Feb 13 07:08:49 UTC 2020

On Wed, Feb 12, 2020 at 9:57 AM Numan Siddique <nusiddiq at redhat.com> wrote:
>
> Hi Ben/All,
>
> In an OVN deployment - with OVN dbs deployed as active/standby using
> pacemaker, we are seeing delays in response to unixctl command -
> ovsdb-server/sync-status.
>
> Pacemaker periodically calls the OVN pacemaker OCF script to get the
> status and this script internally invokes - ovs-appctl -t
> /var/run/openvswitch/ovnsb_db.ctl ovsdb-server/sync-status. In a large
> deployment with lots of OVN resources we see that ovsdb-server takes a
> lot of time (sometimes > 60 seconds) to respond to this command. This
> causes pacemaker to stop the service in that node and move the master
> to another node. This causes a lot of disruption.
>
> One approach to solving this issue is to handle unixctl commands in a
> separate thread. Commands like sync-status, get-** etc. can be
> easily handled in that thread. Still, there are many commands like
> ovsdb-server/set-active-ovsdb-server, ovsdb-server/compact etc. (which
> change the state) that need to be synchronized between the main
> ovsdb-server thread and the newly added thread using a mutex.
>
> Does this approach make sense? I have started working on it, but I wanted
> to check with the community before putting more effort into it.
>
> Are there better ways to solve this issue?
>
> Thanks
> Numan
>
Hi Numan,

It seems reasonable to me. Multi-threading would add a little complexity,
but in this case it should be straightforward. It merely requires mutexes
to synchronize between the threads for *writes*, and also for *reads* of
non-atomic data.
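
To make the locking concrete, here is a minimal sketch of what I have in mind.
All names in it (struct sync_state, sync_state_set_active(), and so on) are
hypothetical and are not existing ovsdb-server code, and it uses plain pthreads
rather than the OVS wrappers. The state-changing path and the read-only path
take the same mutex, and the reader only copies the value out while holding it:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state shared between the main thread and a unixctl thread.
 * Not an actual ovsdb-server structure. */
struct sync_state {
    pthread_mutex_t mutex;  /* Protects 'is_backup'. */
    bool is_backup;         /* Non-atomic data, so reads also take the lock. */
};

void
sync_state_init(struct sync_state *s)
{
    int error = pthread_mutex_init(&s->mutex, NULL);
    assert(!error);
    (void) error;           /* Avoid an unused warning when NDEBUG is set. */
    s->is_backup = true;
}

/* Write path: what a state-changing command such as
 * ovsdb-server/set-active-ovsdb-server would do (illustrative only). */
void
sync_state_set_active(struct sync_state *s)
{
    pthread_mutex_lock(&s->mutex);
    s->is_backup = false;
    pthread_mutex_unlock(&s->mutex);
}

/* Read path: what a sync-status style handler would do. It copies the value
 * under the lock and releases the lock before building any reply. */
bool
sync_state_is_backup(struct sync_state *s)
{
    pthread_mutex_lock(&s->mutex);
    bool is_backup = s->is_backup;
    pthread_mutex_unlock(&s->mutex);
    return is_backup;
}
```

The critical sections only cover the copy, so the unixctl thread should never
wait long for the lock as long as the main thread does not hold it across a
whole loop iteration.
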
The only side effect is that *if* the thread that does the DB work is really
stuck because of a bug and is not handling jobs at all, the
ovsdb-server/sync-status command served by the unixctl thread wouldn't detect
it, so pacemaker could report a *happy* status without detecting the problem.
First of all, this is unlikely to happen. But if we really think it is a
problem, we can still solve it by incrementing a counter in the main loop and
adding a new command (read-only, without a mutex) that checks whether this
counter is increasing, to tell if the server is really working.
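
A rough sketch of the counter idea, assuming C11 atomics are available. The
names (struct heartbeat, heartbeat_tick(), heartbeat_advanced()) are made up
for illustration and do not exist in OVS today:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical liveness counter; not existing ovsdb-server code. */
struct heartbeat {
    atomic_uint_fast64_t iterations;  /* Bumped once per main-loop pass. */
};

/* Called by the main (DB) thread once per loop iteration. */
void
heartbeat_tick(struct heartbeat *hb)
{
    assert(hb != NULL);
    atomic_fetch_add_explicit(&hb->iterations, 1, memory_order_relaxed);
}

/* Called by the unixctl thread.  A lock-free read, so it still returns
 * even if the main thread is stuck while holding a mutex. */
uint64_t
heartbeat_read(struct heartbeat *hb)
{
    return atomic_load_explicit(&hb->iterations, memory_order_relaxed);
}

/* Returns true if the main loop has advanced since '*last_seen', and
 * stores the current value in '*last_seen' for the next check. */
bool
heartbeat_advanced(struct heartbeat *hb, uint64_t *last_seen)
{
    uint64_t now = heartbeat_read(hb);
    bool advanced = now != *last_seen;

    *last_seen = now;
    return advanced;
}
```

The new command could simply print the value of heartbeat_read(), and the OCF
script could compare two samples taken some interval apart. How long that
interval should be is a separate question, since a busy but healthy server may
legitimately spend a long time in a single iteration.
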

Thanks,
Han