[ovs-dev] [replication SMv2 7/7] ovsdb: Replication usability improvements

Wed Aug 31 12:06:10 UTC 2016

On Wed, Aug 31, 2016 at 12:03 AM, Andy Zhou <azhou at ovn.org> wrote:

>
>
> On Tue, Aug 30, 2016 at 4:17 AM, Numan Siddique <nusiddiq at redhat.com>
> wrote:
>
>>
>>
>> On Tue, Aug 30, 2016 at 1:11 AM, Andy Zhou <azhou at ovn.org> wrote:
>>
>>>
>>>
>>> On Mon, Aug 29, 2016 at 3:14 AM, Numan Siddique <nusiddiq at redhat.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Sat, Aug 27, 2016 at 4:45 AM, Andy Zhou <azhou at ovn.org> wrote:
>>>>
>>>>> Added the '--no-sync' option based on feedback on the current
>>>>> implementation.
>>>>>
>>>>> Added appctl command "ovsdb-server/sync-status" based on feedback
>>>>> on the current implementation.
>>>>>
>>>>> Added a test to simulate the integration of HA manager with OVSDB
>>>>> server using replication.
>>>>>
>>>>> Other documentation and API improvements.
>>>>>
>>>>> Signed-off-by: Andy Zhou <azhou at ovn.org>
>>>>> ------
>>>>>
>>>>> I hope to get some review comments on the command line and appctl
>>>>> interfaces for replication. Since 2.6 is the first release of those
>>>>> interfaces, it is easier to make changes now than in future
>>>>> releases.
>>>>>
>>>>> ----
>>>>> v1->v2: Fix crashes reported at:
>>>>> http://openvswitch.org/pipermail/dev/2016-August/078591.html
>>>>> ---
>>>>>
>>>>
>>>> I haven't tested these patches yet. This patch seems to have a
>>>> whitespace warning when applied.
>>>>
>>> Thanks for the report. I will fold the fix into the next version when
>>> posting.
>>>
>>> In case it helps, you can also access the patches from my private repo
>>> at:
>>>       https://github.com/azhou-nicira/ovs-review/tree/ovsdb-replication-sm-v2
>>>
>>>
>> 
>> Hi Andy,
>> 
>> I am seeing the below crash when:
>>
>>   - The ovsdb-server changes from master to standby and the
>>     active-ovsdb-server it is about to connect to is killed just
>>     before that, or is not reachable.
>>
>>   - The pacemaker OCF script calls the sync-status cmd soon after that.
>>
>>
>> Please let me know if you need more information.
>>
>>
>> Core was generated by `ovsdb-server -vdbg --log-file=/opt/stack/logs/ovsdb-server-sb.log
>> --remote=puni'.
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  0x000000000041241d in replication_status () at ovsdb/replication.c:875
>> 875            SHASH_FOR_EACH (node, replication_dbs) {
>> Missing separate debuginfos, use: dnf debuginfo-install
>> glibc-2.23.1-10.fc24.x86_64 openssl-libs-1.0.2h-3.fc24.x86_64
>> (gdb) bt
>> #0  0x000000000041241d in replication_status () at ovsdb/replication.c:875
>> #1  0x0000000000406eda in ovsdb_server_get_sync_status (conn=0x1421fd0,
>> argc=<optimized out>, argv=<optimized out>, config_=<optimized out>)
>>     at ovsdb/ovsdb-server.c:1480
>> #2  0x00000000004324ee in process_command (request=0x1421f30,
>> conn=0x1421fd0) at lib/unixctl.c:313
>> #3  run_connection (conn=0x1421fd0) at lib/unixctl.c:347
>> #4  unixctl_server_run (server=server@entry=0x141e140) at
>> lib/unixctl.c:400
>> #5  0x0000000000405bdc in main_loop (is_backup=0x7fff08062256,
>> exiting=0x7fff08062257, run_process=0x0, remotes=0x7fff080622a0,
>> unixctl=0x141e140,
>>     all_dbs=0x7fff080622e0, jsonrpc=0x13f6f00) at ovsdb/ovsdb-server.c:182
>> #6  main (argc=<optimized out>, argv=<optimized out>) at
>> ovsdb/ovsdb-server.c:430
>>
> Numan, thanks for the report. I think I spotted the bug:
>
> Currently, when the replication state machine is reset, the state update
> takes place only after another round of the main loop. This time lag
> could lead to the backtrace above if a unixctl command is issued
> during it. I have a fix that adds another
> state to represent the reset condition. The fix is at:
>
> https://github.com/azhou-nicira/ovs-review/tree/ovsdb-replication-sm-v3
>
> Would you please let me know if this version works any better? Thanks!
>

Sure. I will test and let you know.

Thanks
Numan
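
For readers following the thread, the race described above (a reset that
invalidates the per-database table right away, while the state only
changes one main-loop iteration later) can be illustrated with a minimal
sketch. This is not the code from ovsdb/replication.c or from the v3
branch. The names `repl_state`, `REPL_INIT`, `REPL_REPLICATING`,
`struct repl`, `repl_reset`, `repl_run` and `repl_status` are all
hypothetical, and a plain array stands in for the shash. It only shows
the general idea of a dedicated "reset" state that a status handler can
check before walking the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical states; they only loosely mirror the discussion above. */
enum repl_state {
    REPL_INIT,         /* Reset requested; database table not rebuilt yet. */
    REPL_REPLICATING   /* Database table is valid and may be walked. */
};

struct repl {
    enum repl_state state;
    const char **dbs;  /* Stand-in for the per-database table. */
    size_t n_dbs;
};

/* Resets the state machine.  The state changes in the same call that
 * invalidates 'dbs', not one main-loop iteration later. */
void repl_reset(struct repl *r)
{
    free(r->dbs);
    r->dbs = NULL;
    r->n_dbs = 0;
    r->state = REPL_INIT;
}

/* One main-loop iteration: rebuilds the table if a reset is pending. */
void repl_run(struct repl *r, const char *const *names, size_t n)
{
    if (r->state != REPL_INIT) {
        return;
    }
    r->dbs = malloc((n ? n : 1) * sizeof *r->dbs);
    if (!r->dbs) {
        return;        /* Stay in REPL_INIT and retry on the next run. */
    }
    for (size_t i = 0; i < n; i++) {
        r->dbs[i] = names[i];
    }
    r->n_dbs = n;
    r->state = REPL_REPLICATING;
}

/* Status query, as a unixctl-style handler might call it.  Returns the
 * number of databases listed in 'buf', or -1 while a reset is pending. */
int repl_status(const struct repl *r, char *buf, size_t size)
{
    if (r->state == REPL_INIT) {
        snprintf(buf, size, "initializing");
        return -1;     /* Never touch 'dbs' in this state. */
    }
    assert(r->dbs != NULL);

    size_t used = (size_t) snprintf(buf, size, "replicating:");
    for (size_t i = 0; i < r->n_dbs; i++) {
        if (used < size) {
            used += (size_t) snprintf(buf + used, size - used,
                                      " %s", r->dbs[i]);
        }
    }
    return (int) r->n_dbs;
}
```

With this shape, a status query that arrives between `repl_reset()` and
the next `repl_run()` reports "initializing" instead of dereferencing a
table that no longer exists. Whether the actual fix is structured this
way is for the v3 branch to show.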