[ovs-discuss] Possible data loss of OVSDB active-backup mode

aginwala aginwala at asu.edu
Mon Sep 10 16:55:01 UTC 2018


Cool! Thanks a lot.

On Mon, Sep 10, 2018 at 12:57 AM Numan Siddique <nusiddiq at redhat.com> wrote:

>
>
> On Sun, Sep 9, 2018 at 8:38 AM aginwala <aginwala at asu.edu> wrote:
>
>> Hi:
>>
>> As agreed, I tested approach 1. The DB data is retained even in the
>> continuous failover scenario where all 3 nodes are started/stopped at the
>> same time multiple times in a loop. It also works as expected in the
>> normal failover scenarios.
>>
>> Since you also asked to test a failure in process_notification, I
>> introduced a 10-second sleep after line
>> https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L604
>> which resulted in a pacemaker 'unknown error' failure on the 2 slave
>> nodes, but the function did not report any of the error messages I was
>> logging. The DB data was still intact, since pacemaker always promoted
>> the 3rd node as master.
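>> (Roughly the kind of change meant here - a sketch of the idea only, the
>> exact call site around that line may differ:)
>>
>>     /* Test-only fault injection on the standby: stall for 10 seconds
>>      * right after a replication notification has been handled, so that
>>      * pacemaker's monitor/stop actions time out as in the output below. */
>>     sleep(10);   /* from <unistd.h>, inserted after process_notification() */
>>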
>>
>>
>> Output for the above failure test:
>> Online: [ test-pace1-2365293 test-pace2-2365308 test-pace3-2598581 ]
>>
>> Full list of resources:
>>
>>  Master/Slave Set: ovndb_servers-master [ovndb_servers]
>>      ovndb_servers (ocf::ovn:ovndb-servers): FAILED test-pace3-2598581
>> (unmanaged)
>>      ovndb_servers (ocf::ovn:ovndb-servers): FAILED test-pace2-2365308
>> (unmanaged)
>>      Masters: [ test-pace1-2365293 ]
>>
>> Failed Actions:
>> * ovndb_servers_stop_0 on test-pace3-2598581 'unknown error' (1):
>> call=12, status=Timed Out, exitreason='none',
>>     last-rc-change='Sat Sep  8 19:22:20 2018', queued=0ms, exec=20003ms
>> * ovndb_servers_stop_0 on test-pace2-2365308 'unknown error' (1):
>> call=12, status=Timed Out, exitreason='none',
>>     last-rc-change='Sat Sep  8 19:22:20 2018', queued=0ms, exec=20002ms
>>
>>
>> Another way I tried was to intentionally set the error to a non-null
>> string, which skipped the call to process_notification; that wipes out
>> the whole DB when the node is promoted, because no notification updates
>> were applied. Was this the approach you wanted me to test, or did you
>> have something else in mind (correct me if I am wrong)?
>>
>> Also, it would be good if you could add an info log statement in the
>> reset_database function in the formal patch. I used one in my
>> environment, and it makes the failover behavior clear from the logs as
>> well.
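>> (Something along these lines inside reset_database(), assuming the usual
>> OVS VLOG macros; the exact wording is of course up to you:)
>>
>>     /* Hypothetical example of the suggested log, emitted where the standby
>>      * wipes its local copy before syncing from the active server. */
>>     VLOG_INFO("Resetting the database before syncing from the active server");
>>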
>>
>
> Thanks for testing it out. I sent a formal patch adding the log message
> you suggested here - https://patchwork.ozlabs.org/patch/967888/
>
> Regards
> Numan
>
>
>>
>> As you guys mentioned, I am not sure what other corner cases might have
>> been missed, but this patch LGTM overall (it is safer than the current
>> code that wipes out the DB :))
>>
>> Regards,
>>
>> On Wed, Sep 5, 2018 at 1:24 PM Han Zhou <zhouhan at gmail.com> wrote:
>>
>>>
>>>
>>> On Wed, Sep 5, 2018 at 10:44 AM aginwala <aginwala at asu.edu> wrote:
>>> >
>>> > Thanks Numan:
>>> >
>>> > I will give it a shot and update the findings.
>>> >
>>> >
>>> > On Wed, Sep 5, 2018 at 5:35 AM Numan Siddique <nusiddiq at redhat.com>
>>> wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Sep 5, 2018 at 12:42 AM Han Zhou <zhouhan at gmail.com> wrote:
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Sun, Sep 2, 2018 at 11:01 PM Numan Siddique <nusiddiq at redhat.com>
>>> wrote:
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Fri, Aug 10, 2018 at 3:59 AM Ben Pfaff <blp at ovn.org> wrote:
>>> >>> >>
>>> >>> >> On Thu, Aug 09, 2018 at 09:32:21AM -0700, Han Zhou wrote:
>>> >>> >> > On Thu, Aug 9, 2018 at 1:57 AM, aginwala <aginwala at asu.edu>
>>> wrote:
>>> >>> >> > >
>>> >>> >> > >
>>> >>> >> > > To add on, we are using the LB VIP IP and no constraint with
>>> >>> >> > > 3 nodes, as Han mentioned earlier, where the active node syncs
>>> >>> >> > > from an invalid IP and the other two nodes sync from the LB VIP
>>> >>> >> > > IP. Also, I was able to get some logs from one node that
>>> >>> >> > > triggered:
>>> >>> >> > > https://github.com/openvswitch/ovs/blob/master/ovsdb/ovsdb-server.c#L460
>>> >>> >> > >
>>> >>> >> > > 2018-08-04T01:43:39.914Z|03230|reconnect|DBG|tcp:10.189.208.16:50686: entering RECONNECT
>>> >>> >> > > 2018-08-04T01:43:39.914Z|03231|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.16:50686: disconnecting (removing OVN_Northbound database due to server termination)
>>> >>> >> > > 2018-08-04T01:43:39.932Z|03232|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.21:56160: disconnecting (removing _Server database due to server termination)
>>> >>> >> > > 20
>>> >>> >> > >
>>> >>> >> > > I am not sure if the active node also doing sync_from via an
>>> >>> >> > > invalid IP is causing some flaw when all nodes are down during
>>> >>> >> > > the race condition in this corner case.
>>> >>> >> > >
>>> >>> >> > >
>>> >>> >> > >
>>> >>> >> > >
>>> >>> >> > >
>>> >>> >> > > On Thu, Aug 9, 2018 at 1:35 AM Numan Siddique <
>>> nusiddiq at redhat.com> wrote:
>>> >>> >> > >>
>>> >>> >> > >>
>>> >>> >> > >>
>>> >>> >> > >> On Thu, Aug 9, 2018 at 1:07 AM Ben Pfaff <blp at ovn.org>
>>> wrote:
>>> >>> >> > >>>
>>> >>> >> > >>> On Wed, Aug 08, 2018 at 12:18:10PM -0700, Han Zhou wrote:
>>> >>> >> > >>> > On Wed, Aug 8, 2018 at 11:24 AM, Ben Pfaff <blp at ovn.org>
>>> wrote:
>>> >>> >> > >>> > >
>>> >>> >> > >>> > > On Wed, Aug 08, 2018 at 12:37:04AM -0700, Han Zhou
>>> wrote:
>>> >>> >> > >>> > > > Hi,
>>> >>> >> > >>> > > >
>>> >>> >> > >>> > > > We found an issue in our testing (thanks aginwala)
>>> >>> >> > >>> > > > with active-backup mode in an OVN setup.
>>> >>> >> > >>> > > > In the 3-node setup with pacemaker, after stopping
>>> >>> >> > >>> > > > pacemaker on all three nodes (simulating a complete
>>> >>> >> > >>> > > > shutdown) and then starting all of them simultaneously,
>>> >>> >> > >>> > > > there is a good chance that the whole DB content gets
>>> >>> >> > >>> > > > lost.
>>> >>> >> > >>> > > >
>>> >>> >> > >>> > > > After studying the replication code, it seems there is
>>> >>> >> > >>> > > > a phase in which the backup node deletes all its data
>>> >>> >> > >>> > > > and waits for data to be synced from the active node:
>>> >>> >> > >>> > > >
>>> >>> >> > >>> > > > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306
>>> >>> >> > >>> > > >
>>> >>> >> > >>> > > > In this state, if the node is set to active, then all
>>> >>> >> > >>> > > > data is gone for the whole cluster. This can happen in
>>> >>> >> > >>> > > > different situations. In the test scenario mentioned
>>> >>> >> > >>> > > > above it is very likely to happen, since pacemaker just
>>> >>> >> > >>> > > > randomly selects one node as master, not knowing the
>>> >>> >> > >>> > > > internal sync state of each node. It could also happen
>>> >>> >> > >>> > > > when a failover occurs right after a new backup is
>>> >>> >> > >>> > > > started, although that is less likely in a real
>>> >>> >> > >>> > > > environment, so starting up the nodes one by one may
>>> >>> >> > >>> > > > largely reduce the probability.
>>> >>> >> > >>> > > >
>>> >>> >> > >>> > > > Does this analysis make sense? We will do more tests to
>>> >>> >> > >>> > > > verify the conclusion, but would like to share it with
>>> >>> >> > >>> > > > the community for discussion and suggestions. Once this
>>> >>> >> > >>> > > > happens it is very critical - even more serious than
>>> >>> >> > >>> > > > just having no HA. Without HA it is only a control plane
>>> >>> >> > >>> > > > outage, but this would be a data plane outage, because
>>> >>> >> > >>> > > > OVS flows will be removed accordingly, since the data is
>>> >>> >> > >>> > > > considered deleted from ovn-controller's point of view.
>>> >>> >> > >>> > > >
>>> >>> >> > >>> > > > We understand that active-standby is not the ideal HA
>>> >>> >> > >>> > > > mechanism and clustering is the future, and we are also
>>> >>> >> > >>> > > > testing the clustering with the latest patch. But it
>>> >>> >> > >>> > > > would be good if this problem could be addressed with
>>> >>> >> > >>> > > > some quick fix, such as keeping a copy of the old data
>>> >>> >> > >>> > > > somewhere until the first sync finishes.
>>> >>> >> > >>> > >
>>> >>> >> > >>> > > This does seem like a plausible bug, and at first glance
>>> >>> >> > >>> > > I believe that you're correct about the race here.  I
>>> >>> >> > >>> > > guess that the correct behavior must be to keep the
>>> >>> >> > >>> > > original data until a new copy of the data has been
>>> >>> >> > >>> > > received, and only then atomically replace the original
>>> >>> >> > >>> > > by the new.
>>> >>> >> > >>> > >
>>> >>> >> > >>> > > Is this something you have time and ability to fix?
>>> >>> >> > >>> >
>>> >>> >> > >>> > Thanks Ben for the quick response. I guess I will not have
>>> >>> >> > >>> > time until I send out the next series for incremental
>>> >>> >> > >>> > processing :)
>>> >>> >> > >>> > It would be good if someone could help; please reply to
>>> >>> >> > >>> > this email if you start working on it so that we do not end
>>> >>> >> > >>> > up with overlapping work.
>>> >>> >> > >>
>>> >>> >> > >>
>>> >>> >> > >> I will take a shot at fixing this issue.
>>> >>> >> > >>
>>> >>> >> > >> In the case of tripleo we haven't hit this issue, but I
>>> >>> >> > >> haven't tested this scenario; I will test it out. One
>>> >>> >> > >> difference compared to your setup is that tripleo uses an
>>> >>> >> > >> IPAddr2 resource and a collocation constraint set.
>>> >>> >> > >>
>>> >>> >> > >> Thanks
>>> >>> >> > >> Numan
>>> >>> >> > >>
>>> >>> >> >
>>> >>> >> > Thanks Numan for helping on this. I think IPAddr2 should have the
>>> >>> >> > same problem, if my previous analysis was right, unless using
>>> >>> >> > IPAddr2 would result in pacemaker always electing the node that is
>>> >>> >> > configured with the master IP as the master when starting
>>> >>> >> > pacemaker on all nodes again.
>>> >>> >> >
>>> >>> >> > Ali, thanks for the information. Just to clarify, the log
>>> >>> >> > "removing xxx database due to server termination" is not related
>>> >>> >> > to this issue. It might be misleading, but it does not mean the
>>> >>> >> > database content is being deleted; it is just cleaning up internal
>>> >>> >> > data structures before exiting. The code that deletes the DB data
>>> >>> >> > is here:
>>> >>> >> > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306,
>>> >>> >> > and there is no log printed for it. You may add a log there to
>>> >>> >> > verify when you reproduce the issue.
>>> >>> >>
>>> >>> >> Right, "removing" in this case just means "no longer serving".
>>> >>> >
>>> >>> >
>>> >>> > Hi Han/Ben,
>>> >>> >
>>> >>> > I have submitted two possible solutions for this issue -
>>> >>> > https://patchwork.ozlabs.org/patch/965246/ and
>>> >>> > https://patchwork.ozlabs.org/patch/965247/
>>> >>> > Han - can you please try these out and see if they solve the issue.
>>> >>> >
>>> >>> > Approach 1 resets the database just before processing the monitor
>>> >>> > reply. This approach is simpler, but it has a small window of error:
>>> >>> > if the function process_notification() fails for some reason, we
>>> >>> > could lose the data. I am not sure whether that is a possibility or
>>> >>> > not.
>>> >>> >
>>> >>> > Approach 2, on the other hand, stores the monitor reply in an
>>> >>> > in-memory ovsdb struct, resets the database, and then repopulates
>>> >>> > the db from that in-memory struct.
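>>> >>> > (To make the ordering difference concrete, a rough pseudocode sketch
>>> >>> > - the helper names here are hypothetical, only reset_database() and
>>> >>> > process_notification() are real functions mentioned in this thread:)
>>> >>> >
>>> >>> >     /* Approach 1: keep the old data until a monitor reply is in
>>> >>> >      * hand, then wipe and apply the reply directly. */
>>> >>> >     reply = wait_for_monitor_reply();           /* hypothetical */
>>> >>> >     reset_database(db);
>>> >>> >     error = process_notification(reply, db);    /* failure here loses data */
>>> >>> >
>>> >>> >     /* Approach 2: stage the reply in an in-memory ovsdb first, then
>>> >>> >      * wipe and repopulate from the staged copy. */
>>> >>> >     reply = wait_for_monitor_reply();           /* hypothetical */
>>> >>> >     staged = ovsdb_from_monitor_reply(reply);   /* hypothetical */
>>> >>> >     reset_database(db);
>>> >>> >     error = populate_from(db, staged);          /* hypothetical */
>>> >>> >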
>>> >>> >
>>> >>> > Please let me know which approach seems to be better or if there
>>> is any other way.
>>> >>> >
>>> >>> > Thanks
>>> >>> > Numan
>>> >>> >
>>> >>> >
>>> Thanks Numan! I like Approach 1 for its simplicity. For the error
>>> situation, if it happens in an extreme case, since the node is a standby
>>> we can make sure it never serves as the active node in that state - by
>>> simply exiting. What do you think?
>>> >>
>>> >>
>>> >> I agree that approach 1 is simpler. I think simply exiting would not
>>> >> help. If pacemaker is used for active/standby, which I suppose is the
>>> >> case with your setup, pacemaker will restart ovsdb-server again when it
>>> >> sees that the monitor action returns NOT_RUNNING. I think it should be
>>> >> fine, because pacemaker would not promote this node to master since
>>> >> there is already a master. But you found this issue by
>>> >> stopping/starting the pacemaker resource, so I am not sure how it would
>>> >> behave.
>>>
>>> Hi Numan, I agree with you after thinking about it again. Simply exiting
>>> would not solve the issue. It is less likely to happen than with the
>>> original implementation, but there is still a probability. It seems we
>>> will have to either do atomic swapping, to make sure there is never a
>>> state in which the ovsdb has no data in the disk file, or keep some state
>>> in the file to indicate that the DB is in an *incomplete* state and
>>> should not be used as the active node. For this reason, even approach 2
>>> still has a problem. Imagine the process got killed after resetting the
>>> database but before the new data has been fully written to the file; it
>>> would still leave the data on disk incomplete.
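>>> (As a minimal standalone sketch of the atomic-swap idea - not OVS code,
>>> and ignoring the ovsdb transaction-log format - the usual POSIX pattern
>>> is to write the newly synced contents to a temporary file and rename()
>>> it over the old one, so that the on-disk file is always either the
>>> complete old copy or the complete new copy, never empty or partial:)
>>>
>>>     #include <fcntl.h>
>>>     #include <stdio.h>
>>>     #include <stdlib.h>
>>>     #include <string.h>
>>>     #include <unistd.h>
>>>
>>>     /* Write new_contents to "<path>.tmp", flush it to disk, then
>>>      * atomically rename it over path.  The old file stays intact and
>>>      * readable right up to the rename. */
>>>     static int
>>>     replace_file_atomically(const char *path, const char *new_contents)
>>>     {
>>>         char tmp_path[1024];
>>>         snprintf(tmp_path, sizeof tmp_path, "%s.tmp", path);
>>>
>>>         int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
>>>         if (fd < 0) {
>>>             return -1;
>>>         }
>>>         size_t len = strlen(new_contents);
>>>         if (write(fd, new_contents, len) != (ssize_t) len || fsync(fd) < 0) {
>>>             close(fd);
>>>             unlink(tmp_path);
>>>             return -1;
>>>         }
>>>         close(fd);
>>>
>>>         /* The single atomic step. */
>>>         return rename(tmp_path, path);
>>>     }
>>>
>>>     int
>>>     main(void)
>>>     {
>>>         /* "sb.db" stands in for the standby's database file; a crash at
>>>          * any point before rename() leaves the previous copy untouched. */
>>>         if (replace_file_atomically("sb.db", "contents synced from active\n")) {
>>>             perror("replace_file_atomically");
>>>             return EXIT_FAILURE;
>>>         }
>>>         return EXIT_SUCCESS;
>>>     }
>>>
>>> A crash before the rename() keeps the old data and a crash after it
>>> leaves the new data, so there is never a window where the file is empty,
>>> which is exactly the guarantee the standby needs before it can safely be
>>> promoted.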
>>>
>>> Regards,
>>> Han
>>>
>>>

