[ovs-discuss] Possible data loss of OVSDB active-backup mode
Numan Siddique
nusiddiq at redhat.com
Wed Sep 5 12:34:38 UTC 2018
On Wed, Sep 5, 2018 at 12:42 AM Han Zhou <zhouhan at gmail.com> wrote:
>
>
> On Sun, Sep 2, 2018 at 11:01 PM Numan Siddique <nusiddiq at redhat.com>
> wrote:
> >
> >
> >
> > On Fri, Aug 10, 2018 at 3:59 AM Ben Pfaff <blp at ovn.org> wrote:
> >>
> >> On Thu, Aug 09, 2018 at 09:32:21AM -0700, Han Zhou wrote:
> >> > On Thu, Aug 9, 2018 at 1:57 AM, aginwala <aginwala at asu.edu> wrote:
> >> > >
> >> > >
> >> > > To add on, we are using an LB VIP IP and no constraint with 3 nodes,
> >> > > as Han mentioned earlier, where the active node syncs from an invalid
> >> > > IP and the other two nodes sync from the LB VIP IP. Also, I was able
> >> > > to get some logs from one node that triggered:
> >> > > https://github.com/openvswitch/ovs/blob/master/ovsdb/ovsdb-server.c#L460
> >> > >
> >> > > 2018-08-04T01:43:39.914Z|03230|reconnect|DBG|tcp:10.189.208.16:50686: entering RECONNECT
> >> > > 2018-08-04T01:43:39.914Z|03231|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.16:50686: disconnecting (removing OVN_Northbound database due to server termination)
> >> > > 2018-08-04T01:43:39.932Z|03232|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.21:56160: disconnecting (removing _Server database due to server termination)
> >> > > 20
> >> > >
> >> > > I am not sure if sync_from on the active node via some invalid IP is
> >> > > causing some flaw when all nodes are down during the race condition
> >> > > in this corner case.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Aug 9, 2018 at 1:35 AM Numan Siddique <nusiddiq at redhat.com> wrote:
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Thu, Aug 9, 2018 at 1:07 AM Ben Pfaff <blp at ovn.org> wrote:
> >> > >>>
> >> > >>> On Wed, Aug 08, 2018 at 12:18:10PM -0700, Han Zhou wrote:
> >> > >>> > On Wed, Aug 8, 2018 at 11:24 AM, Ben Pfaff <blp at ovn.org> wrote:
> >> > >>> > >
> >> > >>> > > On Wed, Aug 08, 2018 at 12:37:04AM -0700, Han Zhou wrote:
> >> > >>> > > > Hi,
> >> > >>> > > >
> >> > >>> > > > We found an issue in our testing (thanks aginwala) with
> >> > >>> > > > active-backup mode in an OVN setup. In the 3-node setup with
> >> > >>> > > > pacemaker, after stopping pacemaker on all three nodes
> >> > >>> > > > (simulating a complete shutdown), and then starting all of
> >> > >>> > > > them simultaneously, there is a good chance that the whole DB
> >> > >>> > > > content gets lost.
> >> > >>> > > >
> >> > >>> > > > After studying the replication code, it seems there is a
> >> > >>> > > > phase in which the backup node deletes all its data and waits
> >> > >>> > > > for data to be synced from the active node:
> >> > >>> > > > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306
> >> > >>> > > >
> >> > >>> > > > At this state, if the node is set to active, then all data is
> >> > >>> > > > gone for the whole cluster. This can happen in different
> >> > >>> > > > situations. In the test scenario mentioned above it is very
> >> > >>> > > > likely to happen, since pacemaker just randomly selects one
> >> > >>> > > > node as master, not knowing the internal sync state of each
> >> > >>> > > > node. It could also happen when failover occurs right after a
> >> > >>> > > > new backup is started, although that is less likely in a real
> >> > >>> > > > environment, so starting up the nodes one by one may largely
> >> > >>> > > > reduce the probability.
> >> > >>> > > >
> >> > >>> > > > Does this analysis make sense? We will do more tests to
> >> > >>> > > > verify the conclusion, but would like to share it with the
> >> > >>> > > > community for discussion and suggestions. Once this happens
> >> > >>> > > > it is very critical - even more serious than just losing HA.
> >> > >>> > > > Without HA it is just a control plane outage, but this would
> >> > >>> > > > be a data plane outage, because the OVS flows will be removed
> >> > >>> > > > accordingly, since the data is considered deleted from
> >> > >>> > > > ovn-controller's point of view.
> >> > >>> > > >
> >> > >>> > > > We understand that active-standby is not the ideal HA
> >> > >>> > > > mechanism and that clustering is the future, and we are also
> >> > >>> > > > testing the clustering with the latest patch. But it would be
> >> > >>> > > > good if this problem could be addressed with some quick fix,
> >> > >>> > > > such as keeping a copy of the old data somewhere until the
> >> > >>> > > > first sync finishes?
> >> > >>> > >
> >> > >>> > > This does seem like a plausible bug, and at first glance I
> >> > >>> > > believe that you're correct about the race here. I guess that
> >> > >>> > > the correct behavior must be to keep the original data until a
> >> > >>> > > new copy of the data has been received, and only then
> >> > >>> > > atomically replace the original with the new.
> >> > >>> > >
> >> > >>> > > Is this something you have time and ability to fix?
> >> > >>> >
> >> > >>> > Thanks Ben for the quick response. I guess I will not have time
> >> > >>> > until I send out the next series for incremental processing :)
> >> > >>> > It would be good if someone could help; please reply to this
> >> > >>> > email if you start working on it so that we do not end up with
> >> > >>> > overlapping work.
> >> > >>
> >> > >>
> >> > >> I will give a shot at fixing this issue.
> >> > >>
> >> > >> In the case of tripleo we haven't hit this issue, but I haven't
> >> > >> tested this scenario; I will test it out. One difference compared
> >> > >> to your setup is that tripleo uses the IPAddr2 resource and a
> >> > >> colocation constraint set.
> >> > >>
> >> > >> Thanks
> >> > >> Numan
> >> > >>
> >> >
> >> > Thanks Numan for helping on this. I think IPAddr2 should have the same
> >> > problem, if my previous analysis was right, unless using IPAddr2 would
> >> > result in pacemaker always electing the node that is configured with
> >> > the master IP as the master when starting pacemaker on all nodes again.
> >> >
> >> > Ali, thanks for the information. Just to clarify, the log "removing xxx
> >> > database due to server termination" is not related to this issue. It
> >> > might be misleading, but it doesn't mean deleting the content of the
> >> > database; it is just doing clean-up of internal data structures before
> >> > exiting. The code that deletes the DB data is here:
> >> > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306,
> >> > and there is no log printing for this. You may add a log there to
> >> > verify when you reproduce the issue.
> >>
> >> Right, "removing" in this case just means "no longer serving".
> >
> >
> > Hi Han/Ben,
> >
> > I have submitted two possible solutions to solve this issue -
> > https://patchwork.ozlabs.org/patch/965246/ and
> > https://patchwork.ozlabs.org/patch/965247/
> > Han - can you please try these out and see if they solve the issue.
> >
> > Approach 1 resets the database just before processing the monitor reply.
> > This approach is simpler, but it has a small window of error: if the
> > function process_notification() fails for some reason, we could lose the
> > data. I am not sure whether that is a possibility or not.
> >
> > Approach 2, on the other hand, stores the monitor reply in an in-memory
> > ovsdb struct, resets the database, and then repopulates the db from the
> > in-memory ovsdb struct.
> >
> > Please let me know which approach seems better, or if there is any other
> > way.
> >
> > Thanks
> > Numan
> >
> >
> Thanks Numan! I like Approach 1 for its simplicity. For the error
> situation, if it happens in an extreme case, since the node is a standby
> we can make sure it never serves as the active node in that state - by
> simply exiting. What do you think?
>
I agree that approach 1 is simpler. I think simply exiting would not help.
If pacemaker is used for active/standby, which I suppose is the case with
your setup, pacemaker will restart ovsdb-server again when it sees that
the monitor action returns NOT_RUNNING. I think it should be fine, because
pacemaker would not promote this node as master since there is already a
master. But you found this issue by stopping/starting the pacemaker
resource, so I am not sure how it would behave.
Is it possible to test this patch the way you tested earlier? If you can
confirm that it fixes the issue, I will submit the patch without the RFC
tag.
Also, can you please try to fail the function process_notification() if
possible and see how it works.
Thanks
Numan
>
> Han
>
>