[ovs-discuss] Possible data loss of OVSDB active-backup mode

Han Zhou zhouhan at gmail.com
Tue Sep 4 19:11:44 UTC 2018


On Sun, Sep 2, 2018 at 11:01 PM Numan Siddique <nusiddiq at redhat.com> wrote:
>
>
>
> On Fri, Aug 10, 2018 at 3:59 AM Ben Pfaff <blp at ovn.org> wrote:
>>
>> On Thu, Aug 09, 2018 at 09:32:21AM -0700, Han Zhou wrote:
>> > On Thu, Aug 9, 2018 at 1:57 AM, aginwala <aginwala at asu.edu> wrote:
>> > >
>> > >
>> > > To add on, we are using the LB VIP IP and no constraint with 3 nodes,
>> > > as Han mentioned earlier, where the active node syncs from an invalid
>> > > IP and the other two nodes sync from the LB VIP IP. Also, I was able
>> > > to get some logs from one node that triggered:
>> > > https://github.com/openvswitch/ovs/blob/master/ovsdb/ovsdb-server.c#L460
>> > >
>> > > 2018-08-04T01:43:39.914Z|03230|reconnect|DBG|tcp:10.189.208.16:50686: entering RECONNECT
>> > > 2018-08-04T01:43:39.914Z|03231|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.16:50686: disconnecting (removing OVN_Northbound database due to server termination)
>> > > 2018-08-04T01:43:39.932Z|03232|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.21:56160: disconnecting (removing _Server database due to server termination)
>> > > 20
>> > >
>> > > I am not sure if sync_from on the active node, also via some invalid
>> > > IP, is causing some flaw when all nodes are down during the race
>> > > condition in this corner case.
>> > >
>> > > On Thu, Aug 9, 2018 at 1:35 AM Numan Siddique <nusiddiq at redhat.com> wrote:
>> > >>
>> > >> On Thu, Aug 9, 2018 at 1:07 AM Ben Pfaff <blp at ovn.org> wrote:
>> > >>>
>> > >>> On Wed, Aug 08, 2018 at 12:18:10PM -0700, Han Zhou wrote:
>> > >>> > On Wed, Aug 8, 2018 at 11:24 AM, Ben Pfaff <blp at ovn.org> wrote:
>> > >>> > >
>> > >>> > > On Wed, Aug 08, 2018 at 12:37:04AM -0700, Han Zhou wrote:
>> > >>> > > > Hi,
>> > >>> > > >
>> > >>> > > > We found an issue in our testing (thanks aginwala) with
>> > >>> > > > active-backup mode in an OVN setup. In the 3-node setup with
>> > >>> > > > pacemaker, after stopping pacemaker on all three nodes
>> > >>> > > > (simulating a complete shutdown) and then starting all of them
>> > >>> > > > simultaneously, there is a good chance that the whole DB
>> > >>> > > > content gets lost.
>> > >>> > > >
>> > >>> > > > After studying the replication code, it seems there is a phase
>> > >>> > > > in which the backup node deletes all its data and waits for the
>> > >>> > > > data to be synced from the active node:
>> > >>> > > > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306
>> > >>> > > >
>> > >>> > > > In this state, if the node is set to active, then all data is
>> > >>> > > > gone for the whole cluster. This can happen in different
>> > >>> > > > situations. In the test scenario mentioned above it is very
>> > >>> > > > likely to happen, since pacemaker just randomly selects one
>> > >>> > > > node as master, not knowing the internal sync state of each
>> > >>> > > > node. It could also happen when a failover occurs right after
>> > >>> > > > a new backup is started, although that is less likely in a real
>> > >>> > > > environment, so starting the nodes up one by one may largely
>> > >>> > > > reduce the probability.
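>> > >>> > > >
>> > >>> > > > To make the window concrete, the standby's cycle is roughly the
>> > >>> > > > following (a minimal sketch; all names here are illustrative,
>> > >>> > > > not the actual replication.c code):
>> > >>> > > >
>> > >>> > > > /* Sketch of the standby sync cycle; hypothetical names. */
>> > >>> > > > static void
>> > >>> > > > standby_sync_cycle(struct db *db, struct conn *active)
>> > >>> > > > {
>> > >>> > > >     db_clear(db);            /* All local data deleted here. */
>> > >>> > > >     send_monitor_request(active);
>> > >>> > > >     /* Window: if pacemaker promotes this node to active before
>> > >>> > > >      * the monitor reply is applied, the now-empty database is
>> > >>> > > >      * what clients and the other standbys will see. */
>> > >>> > > >     apply_monitor_reply(db, recv_monitor_reply(active));
>> > >>> > > > }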
>> > >>> > > >
>> > >>> > > > Does this analysis make sense? We will do more tests to verify
>> > >>> > > > the conclusion, but we would like to share it with the
>> > >>> > > > community for discussion and suggestions. Once this happens it
>> > >>> > > > is very critical - even more serious than just having no HA.
>> > >>> > > > Without HA it is just a control-plane outage, but this would be
>> > >>> > > > a data-plane outage, because OVS flows will be removed
>> > >>> > > > accordingly since the data is considered deleted from
>> > >>> > > > ovn-controller's point of view.
>> > >>> > > >
>> > >>> > > > We understand that active-standby is not the ideal HA
>> > >>> > > > mechanism and that clustering is the future, and we are also
>> > >>> > > > testing the clustering with the latest patch. But it would be
>> > >>> > > > good if this problem could be addressed with some quick fix,
>> > >>> > > > such as keeping a copy of the old data somewhere until the
>> > >>> > > > first sync finishes?
>> > >>> > >
>> > >>> > > This does seem like a plausible bug, and at first glance I
>> > >>> > > believe that you're correct about the race here.  I guess that
>> > >>> > > the correct behavior must be to keep the original data until a
>> > >>> > > new copy of the data has been received, and only then atomically
>> > >>> > > replace the original by the new.
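>> > >>> > >
>> > >>> > > Shape-wise, something like this (just a sketch with hypothetical
>> > >>> > > names, not a concrete patch):
>> > >>> > >
>> > >>> > > /* Build the new copy first; touch the live db only on success. */
>> > >>> > > struct db *staged = db_from_monitor_reply(reply);
>> > >>> > > if (staged) {
>> > >>> > >     db_swap_contents(db, staged);  /* Atomic replace. */
>> > >>> > >     db_destroy(staged);            /* Old data freed only now. */
>> > >>> > > }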
>> > >>> > >
>> > >>> > > Is this something you have time and ability to fix?
>> > >>> >
>> > >>> > Thanks Ben for the quick response. I guess I will not have time
>> > >>> > until I send out the next series for incremental processing :)
>> > >>> > It would be good if someone could help; please reply to this email
>> > >>> > if you start working on it so that we don't end up with overlapping
>> > >>> > work.
>> > >>
>> > >>
>> > >> I will take a shot at fixing this issue.
>> > >>
>> > >> In the case of tripleo we haven't hit this issue, but I haven't
>> > >> tested this scenario; I will test it out. One difference compared to
>> > >> your setup is that tripleo uses the IPAddr2 resource and a colocation
>> > >> constraint set.
>> > >>
>> > >> Thanks
>> > >> Numan
>> > >>
>> >
>> > Thanks Numan for helping on this. I think IPAddr2 should have the same
>> > problem, if my previous analysis is right, unless using IPAddr2 results
>> > in pacemaker always electing the node configured with the master IP as
>> > the master when pacemaker is started on all nodes again.
>> >
>> > Ali, thanks for the information. Just to clarify: the log "removing xxx
>> > database due to server termination" is not related to this issue. It
>> > might be misleading, but it doesn't mean the database content is
>> > deleted; it is just cleanup of internal data structures before exiting.
>> > The code that deletes the DB data is here:
>> > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306,
>> > and there is no log printed for this. You may add a log there to verify
>> > when you reproduce the issue.
>>
>> Right, "removing" in this case just means "no longer serving".
>
>
> Hi Han/Ben,
>
> I have submitted two possible solutions to solve this issue -
> https://patchwork.ozlabs.org/patch/965246/ and
> https://patchwork.ozlabs.org/patch/965247/
> Han - can you please try these out and see if they solve the issue?
>
> Approach 1 resets the database just before processing the monitor reply.
> This approach is simpler, but it has a small window of error: if the
> function process_notification() fails for some reason, we could lose the
> data. I am not sure if that is a possibility or not.
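>
> Roughly (a sketch; process_notification() is the function named above,
> but the surrounding names and the error-handling shape are illustrative):
>
> static void
> sync_once_approach1(struct ovsdb *db, struct conn *active)
> {
>     struct reply *reply = recv_monitor_reply(active);
>     reset_database(db);                 /* Reset as late as possible. */
>     if (process_notification(db, reply)) {
>         /* Error: the db was already cleared, hence the small window. */
>     }
> }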
>
> Approach 2, on the other hand, stores the monitor reply in an in-memory
> ovsdb struct, resets the database, and then repopulates the db from that
> in-memory struct.
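>
> That is, something like this (again only a sketch, illustrative names):
>
> static void
> sync_once_approach2(struct ovsdb *db, struct reply *reply)
> {
>     /* Parse the reply into a staging ovsdb first, so a failure is
>      * caught before the live database is touched. */
>     struct ovsdb *staging = ovsdb_from_monitor_reply(reply);
>     if (staging) {
>         reset_database(db);
>         repopulate_from_ovsdb(db, staging);  /* Copy staged rows in. */
>         ovsdb_destroy(staging);
>     }
> }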
>
> Please let me know which approach seems better, or if there is any other
> way.
>
> Thanks
> Numan
>
>
Thanks Numan! I like Approach 1 for its simplicity. As for the error
situation, if it happens in an extreme case, then since the node is a
standby we can make sure it never serves as the active node in that
state - by simply exiting. What do you think?
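
As a sketch of what I mean (the error-handling shape is illustrative;
VLOG_FATAL is OVS's log-and-terminate macro):

    if (process_notification(db, reply)) {
        /* A standby whose db was just cleared must never be promoted.
         * Exiting here guarantees it cannot serve as active with an
         * empty database; the HA manager can then restart it. */
        VLOG_FATAL("replication: failed to apply monitor reply");
    }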

Han