[ovs-discuss] Possible data loss of OVSDB active-backup mode

Han Zhou zhouhan at gmail.com
Wed Sep 5 20:24:35 UTC 2018


On Wed, Sep 5, 2018 at 10:44 AM aginwala <aginwala at asu.edu> wrote:
>
> Thanks Numan:
>
> I will give it a shot and update the findings.
>
>
> On Wed, Sep 5, 2018 at 5:35 AM Numan Siddique <nusiddiq at redhat.com> wrote:
>>
>>
>>
>> On Wed, Sep 5, 2018 at 12:42 AM Han Zhou <zhouhan at gmail.com> wrote:
>>>
>>>
>>>
>>> On Sun, Sep 2, 2018 at 11:01 PM Numan Siddique <nusiddiq at redhat.com> wrote:
>>> >
>>> >
>>> >
>>> > On Fri, Aug 10, 2018 at 3:59 AM Ben Pfaff <blp at ovn.org> wrote:
>>> >>
>>> >> On Thu, Aug 09, 2018 at 09:32:21AM -0700, Han Zhou wrote:
>>> >> > On Thu, Aug 9, 2018 at 1:57 AM, aginwala <aginwala at asu.edu> wrote:
>>> >> > >
>>> >> > >
>>> >> > > To add on, we are using the LB VIP IP and no constraint with 3
>>> >> > > nodes, as Han mentioned earlier, where the active node syncs from an
>>> >> > > invalid IP and the other two nodes sync from the LB VIP IP. Also, I
>>> >> > > was able to get some logs from one node that triggered:
>>> >> > > https://github.com/openvswitch/ovs/blob/master/ovsdb/ovsdb-server.c#L460
>>> >> > >
>>> >> > > 2018-08-04T01:43:39.914Z|03230|reconnect|DBG|tcp:10.189.208.16:50686: entering RECONNECT
>>> >> > > 2018-08-04T01:43:39.914Z|03231|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.16:50686: disconnecting (removing OVN_Northbound database due to server termination)
>>> >> > > 2018-08-04T01:43:39.932Z|03232|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.21:56160: disconnecting (removing _Server database due to server termination)
>>> >> > > 20
>>> >> > >
>>> >> > > I am not sure if setting sync_from on the active node, too, via an
>>> >> > > invalid IP is causing some flaw when all nodes are down during the
>>> >> > > race condition in this corner case.
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > > On Thu, Aug 9, 2018 at 1:35 AM Numan Siddique <nusiddiq at redhat.com> wrote:
>>> >> > >>
>>> >> > >>
>>> >> > >>
>>> >> > >> On Thu, Aug 9, 2018 at 1:07 AM Ben Pfaff <blp at ovn.org> wrote:
>>> >> > >>>
>>> >> > >>> On Wed, Aug 08, 2018 at 12:18:10PM -0700, Han Zhou wrote:
>>> >> > >>> > On Wed, Aug 8, 2018 at 11:24 AM, Ben Pfaff <blp at ovn.org> wrote:
>>> >> > >>> > >
>>> >> > >>> > > On Wed, Aug 08, 2018 at 12:37:04AM -0700, Han Zhou wrote:
>>> >> > >>> > > > Hi,
>>> >> > >>> > > >
>>> >> > >>> > > > We found an issue in our testing (thanks aginwala) with
>>> >> > >>> > > > active-backup mode in an OVN setup.
>>> >> > >>> > > > In a 3-node setup with pacemaker, after stopping pacemaker on
>>> >> > >>> > > > all three nodes (simulating a complete shutdown) and then
>>> >> > >>> > > > starting all of them simultaneously, there is a good chance
>>> >> > >>> > > > that the whole DB content gets lost.
>>> >> > >>> > > >
>>> >> > >>> > > > After studying the replication code, it seems there is a
>>> >> > >>> > > > phase in which the backup node deletes all its data and waits
>>> >> > >>> > > > for data to be synced from the active node:
>>> >> > >>> > > > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306
>>> >> > >>> > > >
>>> >> > >>> > > > In this state, if the node is set to active, then all data
>>> >> > >>> > > > is gone for the whole cluster. This can happen in different
>>> >> > >>> > > > situations. In the test scenario mentioned above it is very
>>> >> > >>> > > > likely to happen, since pacemaker just randomly selects one
>>> >> > >>> > > > node as master, not knowing the internal sync state of each
>>> >> > >>> > > > node. It could also happen when a failover happens right
>>> >> > >>> > > > after a new backup is started, although that is less likely
>>> >> > >>> > > > in a real environment, so starting up the nodes one by one
>>> >> > >>> > > > may largely reduce the probability.
>>> >> > >>> > > >
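To illustrate the race in rough pseudocode (hypothetical helper names, not
the actual replication.c functions), the standby's sequence during that
phase is roughly:

    struct ovsdb;                              /* opaque in this sketch */
    void reset_database(struct ovsdb *);       /* hypothetical helpers */
    void send_monitor_request(struct ovsdb *);

    static void
    standby_begin_sync(struct ovsdb *db)
    {
        reset_database(db);        /* the local tables are emptied here */
        send_monitor_request(db);  /* the full snapshot arrives only
                                    * later, in the monitor reply */

        /* Window of data loss: between the reset and the monitor reply,
         * this server's database is empty.  If pacemaker promotes it to
         * active inside this window, the empty database becomes the
         * source of truth for the whole cluster. */
    }
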
>>> >> > >>> > > > Does this analysis make sense? We will do more tests to
>>> >> > >>> > > > verify the conclusion, but we would like to share it with
>>> >> > >>> > > > the community for discussion and suggestions. Once this
>>> >> > >>> > > > happens it is very critical - even more serious than just
>>> >> > >>> > > > having no HA. Without HA it is just a control plane outage,
>>> >> > >>> > > > but this would be a data plane outage, because the OVS flows
>>> >> > >>> > > > will be removed accordingly since the data is considered
>>> >> > >>> > > > deleted from ovn-controller's point of view.
>>> >> > >>> > > >
>>> >> > >>> > > > We understand that active-standby is not the ideal HA
>>> >> > >>> > > > mechanism and that clustering is the future, and we are also
>>> >> > >>> > > > testing the clustering with the latest patch. But it would
>>> >> > >>> > > > be good if this problem could be addressed with some quick
>>> >> > >>> > > > fix, such as keeping a copy of the old data somewhere until
>>> >> > >>> > > > the first sync finishes?
>>> >> > >>> > >
>>> >> > >>> > > This does seem like a plausible bug, and at first glance I
>>> >> > >>> > > believe that you're correct about the race here.  I guess that
>>> >> > >>> > > the correct behavior must be to keep the original data until a
>>> >> > >>> > > new copy of the data has been received, and only then
>>> >> > >>> > > atomically replace the original with the new.
>>> >> > >>> > >
>>> >> > >>> > > Is this something you have time and ability to fix?
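For reference, the usual way to get that atomicity on POSIX systems is to
write the complete new copy to a temporary file and rename() it over the old
one. A minimal sketch (hypothetical function name, not the actual
ovsdb-server code):

    #include <stdio.h>
    #include <unistd.h>

    static int
    replace_db_file_atomically(const char *db_path, const char *snapshot,
                               size_t len)
    {
        char tmp_path[1024];
        FILE *f;

        snprintf(tmp_path, sizeof tmp_path, "%s.tmp", db_path);

        /* Write the complete new copy to a temporary file first. */
        f = fopen(tmp_path, "w");
        if (!f) {
            return -1;
        }
        if (fwrite(snapshot, 1, len, f) != len
            || fflush(f) != 0
            || fsync(fileno(f)) != 0) {
            fclose(f);
            unlink(tmp_path);
            return -1;
        }
        fclose(f);

        /* rename() is atomic on POSIX: at any instant db_path refers
         * either to the complete old data or to the complete new data,
         * never to an empty or partially written file. */
        return rename(tmp_path, db_path);
    }
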
>>> >> > >>> >
>>> >> > >>> > Thanks Ben for the quick response. I guess I will not have time
>>> >> > >>> > until I send out the next series for incremental processing :)
>>> >> > >>> > It would be good if someone could help; please reply to this
>>> >> > >>> > email if you start working on it so that we do not end up with
>>> >> > >>> > overlapping work.
>>> >> > >>
>>> >> > >>
>>> >> > >> I will take a shot at fixing this issue.
>>> >> > >>
>>> >> > >> In the case of tripleo we haven't hit this issue, but I haven't
>>> >> > >> tested this scenario; I will test it out. One difference compared
>>> >> > >> to your setup is that tripleo uses an IPAddr2 resource and a
>>> >> > >> colocation constraint set.
>>> >> > >>
>>> >> > >> Thanks
>>> >> > >> Numan
>>> >> > >>
>>> >> >
>>> >> > Thanks Numan for helping on this. I think IPAddr2 should have the
>>> >> > same problem, if my previous analysis was right, unless using IPAddr2
>>> >> > would result in pacemaker always electing the node that is configured
>>> >> > with the master IP as the master when starting pacemaker on all nodes
>>> >> > again.
>>> >> >
>>> >> > Ali, thanks for the information. Just to clarify, the log message
>>> >> > "removing xxx database due to server termination" is not related to
>>> >> > this issue. It might be misleading, but it doesn't mean deleting the
>>> >> > content of the database; it is just cleaning up internal data
>>> >> > structures before exiting. The code that deletes the DB data is here:
>>> >> > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306,
>>> >> > and there is no log printed for it. You may add a log message there
>>> >> > to verify when you reproduce the issue.
>>> >>
>>> >> Right, "removing" in this case just means "no longer serving".
>>> >
>>> >
>>> > Hi Han/Ben,
>>> >
>>> > I have submitted two possible solutions to this issue:
>>> > https://patchwork.ozlabs.org/patch/965246/ and
>>> > https://patchwork.ozlabs.org/patch/965247/
>>> > Han - can you please try these out and see if they solve the issue.
>>> >
>>> > Approach 1 resets the database just before processing the monitor
>>> > reply. This approach is simpler, but it has a small window of error: if
>>> > the function process_notification() fails for some reason, we could
>>> > lose the data. I am not sure whether that is a real possibility or not.
>>> >
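In other words (a rough sketch with hypothetical names and signatures, not
the patch itself), the ordering in Approach 1 is:

    struct ovsdb;                        /* opaque in this sketch */
    struct json;
    void reset_database(struct ovsdb *); /* hypothetical */
    int process_notification(struct ovsdb *, struct json *);

    static void
    handle_monitor_reply(struct ovsdb *db, struct json *reply)
    {
        reset_database(db);  /* the old data is discarded here */

        /* Window of error: if this call fails, the old data is already
         * gone and the new data was never applied. */
        process_notification(db, reply);
    }
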
>>> > Approach 2, on the other hand, stores the monitor reply in an
>>> > in-memory ovsdb struct, resets the database, and then repopulates the
>>> > db from that in-memory ovsdb struct.
>>> >
>>> > Please let me know which approach seems better, or if there is any
>>> > other way.
>>> >
>>> > Thanks
>>> > Numan
>>> >
>>> >
>>> Thanks Numan! I like Approach 1 for its simplicity. As for the error
>>> situation, if it happens in an extreme case, since the node is a standby,
>>> we can make sure it never serves as the active node in that state - by
>>> simply exiting. What do you think?
>>
>>
>> I agree that approach 1 is simpler. I think simply exiting would not
>> help. If pacemaker is used for active/standby, which I suppose is the
>> case with your setup, pacemaker will restart ovsdb-server again when it
>> sees that the monitor action returns NOT_RUNNING. I think that should be
>> fine, because pacemaker would not promote this node to master since there
>> is already a master. But you found this issue by stopping/starting the
>> pacemaker resource, so I am not sure how it would behave.

Hi Numan, I agree with you after thinking about it again. Simply exiting
would not solve the issue. It is less likely to happen than with the
original implementation, but there is still a probability. It seems we will
have to either do atomic swapping to make sure there is never a state in
which the ovsdb has no data in the disk file, or keep some state in the file
to indicate that the DB is in an *incomplete* state and should not be used
as the active node. For this reason, even approach 2 still has a problem:
imagine the process gets killed after resetting the database but before the
new data has been completely written to the file; that would still leave the
data on disk incomplete.
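In rough pseudocode (hypothetical names, not a concrete patch), the
incomplete-state idea would look like:

    struct ovsdb;                              /* opaque in this sketch */

    enum db_sync_state {
        DB_COMPLETE,    /* safe to serve as the active node */
        DB_INCOMPLETE,  /* reset done, snapshot not yet fully applied */
    };

    /* Hypothetical helpers; write_sync_state persists the flag in the
     * DB file itself. */
    void write_sync_state(struct ovsdb *, enum db_sync_state);
    void reset_database(struct ovsdb *);
    void apply_snapshot(struct ovsdb *);

    static void
    standby_sync(struct ovsdb *db)
    {
        write_sync_state(db, DB_INCOMPLETE); /* durable, before the reset */
        reset_database(db);
        apply_snapshot(db);                  /* may be interrupted */
        write_sync_state(db, DB_COMPLETE);   /* only after a full sync */
    }

    /* The promotion path would refuse to become active while the file
     * says DB_INCOMPLETE, so a crash between the reset and a completed
     * sync could not publish an empty database to the cluster. */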

Regards,
Han

