[ovs-dev] [replication SMv2 7/7] ovsdb: Replication usability improvements

Tue Aug 30 18:33:04 UTC 2016

On Tue, Aug 30, 2016 at 4:17 AM, Numan Siddique <nusiddiq at redhat.com> wrote:

>
>
> On Tue, Aug 30, 2016 at 1:11 AM, Andy Zhou <azhou at ovn.org> wrote:
>
>>
>>
>> On Mon, Aug 29, 2016 at 3:14 AM, Numan Siddique <nusiddiq at redhat.com>
>> wrote:
>>
>>>
>>>
>>> On Sat, Aug 27, 2016 at 4:45 AM, Andy Zhou <azhou at ovn.org> wrote:
>>>
>>>> Added the '--no-sync' option based on feedback on the current
>>>> implementation.
>>>>
>>>> Added appctl command "ovsdb-server/sync-status" based on feedback
>>>> on the current implementation.
>>>>
>>>> Added a test to simulate the integration of HA manager with OVSDB
>>>> server using replication.
>>>>
>>>> Other documentation and API improvements.
>>>>
>>>> Signed-off-by: Andy Zhou <azhou at ovn.org>
>>>> ------
>>>>
>>>> I hope to get some review comments on the command line and appctl
>>>> interfaces for replication. Since 2.6 is the first release of those
>>>> interfaces, it is easier to make changes now than in future
>>>> releases.
>>>>
>>>> ----
>>>> v1->v2: Fix crashes reported at:
>>>> http://openvswitch.org/pipermail/dev/2016-August/078591.html
>>>> ---
>>>>
>>>
>>> I haven't tested these patches yet. This patch seems to have a
>>> whitespace warning when applied.
>>>
>> Thanks for the report. I will fold the fix into the next version when
>> posting.
>>
>> In case it helps, you can also access the patches from my private repo at:
>>       https://github.com/azhou-nicira/ovs-review/tree/ovsdb-replication-sm-v2
>>
>>
> 
> Hi Andy,
> 
> I am seeing the below crash when
>
>   - The ovsdb-server changes from master to standby and the
>     active-ovsdb-server it is about to connect to is killed just
>     before that or it is not reachable.
>
>   - The pacemaker OCF script calls the sync-status cmd soon after that.
>
>
> Please let me know if you need more information.
>
>
> Core was generated by `ovsdb-server -vdbg --log-file=/opt/stack/logs/ovsdb-server-sb.log
> --remote=puni'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0x000000000041241d in replication_status () at ovsdb/replication.c:875
> 875            SHASH_FOR_EACH (node, replication_dbs) {
> Missing separate debuginfos, use: dnf debuginfo-install
> glibc-2.23.1-10.fc24.x86_64 openssl-libs-1.0.2h-3.fc24.x86_64
> (gdb) bt
> #0  0x000000000041241d in replication_status () at ovsdb/replication.c:875
> #1  0x0000000000406eda in ovsdb_server_get_sync_status (conn=0x1421fd0,
> argc=<optimized out>, argv=<optimized out>, config_=<optimized out>)
>     at ovsdb/ovsdb-server.c:1480
> #2  0x00000000004324ee in process_command (request=0x1421f30,
> conn=0x1421fd0) at lib/unixctl.c:313
> #3  run_connection (conn=0x1421fd0) at lib/unixctl.c:347
> #4  unixctl_server_run (server=server at entry=0x141e140) at
> lib/unixctl.c:400
> #5  0x0000000000405bdc in main_loop (is_backup=0x7fff08062256,
> exiting=0x7fff08062257, run_process=0x0, remotes=0x7fff080622a0,
> unixctl=0x141e140,
>     all_dbs=0x7fff080622e0, jsonrpc=0x13f6f00) at ovsdb/ovsdb-server.c:182
> #6  main (argc=<optimized out>, argv=<optimized out>) at
> ovsdb/ovsdb-server.c:430
>

Numan, thanks for the report. I think I spotted the bug:

Currently, when the replication state machine is reset, the state update
takes place only after another round of the main loop. This time lag
could lead to the backtrace above if a unixctl command is issued during
the lag. I have a fix that adds another state to represent the reset
condition. The fix is at:

https://github.com/azhou-nicira/ovs-review/tree/ovsdb-replication-sm-v3

Would you please let me know if this version works any better? Thanks!
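
To make the race described above a bit more concrete, here is a minimal
standalone sketch of the idea. It is not the code in ovsdb/replication.c:
every name in it (the sketch_* functions, the SKETCH_* states, the dbs
array) is hypothetical and only stands in for the real state machine and
for replication_dbs. The point it tries to show is that the reset is
recorded in the state immediately, so a status query that arrives before
the next main loop iteration never walks per-database data that has
already been torn down.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical states; the real names in ovsdb/replication.c may differ. */
enum sketch_state {
    SKETCH_RESET,       /* Reset requested, per-db data already torn down. */
    SKETCH_CONNECTING,  /* Trying to reach the active server. */
    SKETCH_REPLICATING  /* Per-db data is valid and may be walked. */
};

struct sketch_replication {
    enum sketch_state state;
    const char **dbs;   /* Stand-in for replication_dbs; NULL when torn down. */
    size_t n_dbs;
};

/* Called when the server becomes a standby.  The state changes here,
 * right away, rather than on the next main loop iteration. */
void
sketch_reset(struct sketch_replication *r)
{
    r->dbs = NULL;
    r->n_dbs = 0;
    r->state = SKETCH_RESET;
}

/* One main loop iteration.  'reachable' simulates whether the active
 * server could be contacted in this iteration. */
void
sketch_run(struct sketch_replication *r, const char **dbs, size_t n_dbs,
           int reachable)
{
    switch (r->state) {
    case SKETCH_RESET:
        r->state = SKETCH_CONNECTING;
        break;
    case SKETCH_CONNECTING:
        if (reachable) {
            r->dbs = dbs;
            r->n_dbs = n_dbs;
            r->state = SKETCH_REPLICATING;
        }
        break;
    case SKETCH_REPLICATING:
        break;
    }
}

/* Status query, as a unixctl command handler might issue it.  The
 * per-db data is only touched in the SKETCH_REPLICATING state. */
const char *
sketch_status(const struct sketch_replication *r, char *buf, size_t size)
{
    switch (r->state) {
    case SKETCH_RESET:
        snprintf(buf, size, "resetting");
        break;
    case SKETCH_CONNECTING:
        snprintf(buf, size, "connecting");
        break;
    case SKETCH_REPLICATING:
        snprintf(buf, size, "replicating %zu database(s)", r->n_dbs);
        break;
    }
    return buf;
}
```

In this sketch, calling sketch_status() right after sketch_reset() reports
"resetting" instead of touching the dbs pointer, which is the window in
which the sync-status call from the OCF script appears to land.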