[ovs-dev] [PATCH v5] ovsdb raft: Sync commit index to followers without delay.

Han Zhou zhouhan at gmail.com
Wed Apr 10 01:48:09 UTC 2019


On Mon, Apr 8, 2019 at 4:08 PM Han Zhou <zhouhan at gmail.com> wrote:
>
>
>
> On Wed, Mar 27, 2019 at 4:10 PM Han Zhou <zhouhan at gmail.com> wrote:
> >
> > On Mon, Mar 25, 2019 at 4:24 PM Han Zhou <zhouhan at gmail.com> wrote:
> > >
> > > On Mon, Mar 25, 2019 at 2:00 PM Ben Pfaff <blp at ovn.org> wrote:
> > > >
> > > > On Sat, Mar 23, 2019 at 09:44:26AM -0700, Han Zhou wrote:
> > > > > From: Han Zhou <hzhou8 at ebay.com>
> > > > >
> > > > > When an update is requested from a follower, the leader sends
> > > > > AppendRequest to all followers and waits until AppendReply is
> > > > > received from the majority; then it updates the commit index -
> > > > > the new entry is regarded as committed in the raft log.
> > > > > However, this commit is not notified to the followers
> > > > > (including the one that initiated the request) until the next
> > > > > heartbeat (ping timeout), if there are no other pending
> > > > > requests. This results in long latency for updates made
> > > > > through followers, especially when a batch of updates is
> > > > > requested through the same follower.
> > > >
> > > > The tests pass now, but each one of them ends up with several
> > > > ovn-sbctl and ovsdb-server processes all trying to use 100% of a
> > > > CPU.  If I run the tests with 10-way parallelism, the load
> > > > average goes above 100.  Surely something has to be wrong in the
> > > > implementation here.
> > > >
> > > > Each of the ovn-sbctl processes is trying to push through only a
> > > > single transaction; after that, it exits.  If we think of
> > > > ovsdb-server as giving each of its clients one chance to execute
> > > > an RPC in round-robin order, which is approximately correct, then
> > > > one of those transactions should succeed per round.  I don't
> > > > understand why, if this model is correct, the ovn-sbctls would
> > > > burn so much CPU.  If the model is wrong, then we need to
> > > > understand why it is wrong and how we can fix it.  Maybe the
> > > > ovn-sbctl processes are retrying blindly without waiting for an
> > > > update from the server; if so, that's a bug and it should be
> > > > fixed.
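[Editorial note: the round-robin model above, and the blind-retry behavior suspected in the last sentence, can be contrasted with a small sketch. This is a hypothetical simplification, not ovsdb-server's real scheduler: if clients wait for a server update before resending, total attempts stay linear; if every still-pending client resends each round, attempts grow quadratically.]

```python
def total_rpc_attempts(num_clients, blind_retry):
    """Count RPC attempts until every client's single transaction
    commits, assuming exactly one transaction succeeds per round.

    Well-behaved clients send once and then wait for an update, so
    attempts are linear in the client count; blind retriers resend
    every round while still pending, making attempts quadratic."""
    attempts = 0
    pending = num_clients
    while pending:
        attempts += pending if blind_retry else 1
        pending -= 1  # one transaction succeeds per round
    return attempts


# 10 clients: 10 attempts if they wait for updates, but
# 55 (10+9+...+1) if they retry blindly every round.
```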
> > >
> > > Hi Ben, to make the test case more effective, in v4/v5 I enlarged
> > > the size of each transaction 50 times, specified by the variable
> > > "n3". This is just to create load in the test and slow down
> > > transaction execution, to ensure there are enough parallel requests
> > > ongoing and that we can test server membership changes during this
> > > period. Without this change (in the test case only) the CPU load
> > > was low, but the torture test was not torturing at all (it was
> > > either skipped because execution was too fast, or, with a sleep
> > > added, it lost the parallelism). The test case change proved to be
> > > more effective - it found an existing bug, as mentioned in the
> > > notes.
> > >
> > > If your question is why increasing the size of the transaction
> > > causes high CPU load, it seems expected if we look at what each
> > > client does:
> > > - download the initial data, which grows as more and more
> > > transactions execute, each adding 50 entries (10 * 5 * 50 = 2500);
> > > later clients take more CPU because of this.
> > > - send a transaction with the operations wait + update, which has
> > > double the size of the original data, plus the newly added 50
> > > entries.
> > > - occasionally hit a conflicting update (the wait operation
> > > fails), which triggers a retry. This doesn't happen a lot (from
> > > checking the logs), and I think it is expected given the purpose
> > > of the test - to increase parallelism.
> > > - process updates: at least once for its own update, but it could
> > > also process other clients' updates. (This step is not CPU costly
> > > since each update notification is not big.)
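[Editorial note: the wait + update transaction described above, whose request is roughly double the size of the data already present, can be sketched as follows. The JSON shapes are hypothetical, loosely modeled on OVSDB's transact format and the SB_Global external_ids map; this is not the exact output of the IDL.]

```python
import json


def build_wait_update_txn(current_ids, new_ids):
    """Build a transact request whose 'wait' op guards on the rows the
    client last read; if another client changed external_ids in the
    meantime, the wait fails and the client retries the transaction."""
    merged = dict(current_ids)
    merged.update(new_ids)
    ops = [
        {"op": "wait", "table": "SB_Global", "where": [],
         "columns": ["external_ids"], "until": "==",
         "rows": [{"external_ids": ["map", sorted(current_ids.items())]}]},
        {"op": "update", "table": "SB_Global", "where": [],
         "row": {"external_ids": ["map", sorted(merged.items())]}},
    ]
    return json.dumps(["OVN_Southbound"] + ops)


# The wait op carries a full copy of the current data and the update op
# carries current + new, which is why the request is roughly double the
# size of the data already in the database.
```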
> > >
> > > So I think the higher CPU usage from increasing the transaction
> > > size is expected due to these operations, considering the JSON-RPC
> > > cost. I ran the tests a lot of times and I didn't observe any
> > > client occupying the CPU for a long time - there were short spikes
> > > (not 100%) shown by "top", but I never saw the same client there
> > > twice. If you see any behavior like continuous retrying, could you
> > > please share the x-y.log file of that client?
> > >
> > > I agree the RPC performance of ovsdb/IDL could be improved, but
> > > that is not related to this patch (or maybe we can say the test in
> > > this patch exploits this slowness).
> > >
> >
> > Hi Ben, does my explanation make sense? To show that the CPU cost
> > is expected (and at least not caused by this patch), I compared with
> > the master branch using leader_only = true for the ovn-sbctl command
> > in the torture test, executing the same updated torture test case
> > (i.e. the same numbers n1 = 10, n2 = 5, and n3 = 50). With
> > leader_only = true there is no latency in transaction execution even
> > without this patch, because there is no idle waiting for the sync
> > from leader to followers, so all transactions execute at full speed
> > and the performance is comparable.
> >
> > time make check TESTSUITEFLAGS="2529"
> > (2529: OVSDB 3-server torture test - kill/restart leader)
> >
> > - *Master branch* with new test case, and use leader_only=true:
> > real    0m15.448s
> > user    0m15.384s
> > sys    0m13.687s
> >
> > - With this patch (leader_only=false):
> > real    0m13.904s
> > user    0m15.578s
> > sys    0m11.817s
> >
> > - With this patch but leader_only=true (exactly same test case as for
> > *Master branch*):
> > real    0m14.419s
> > user    0m14.972s
> > sys    0m12.467s
> >
> > As we can see, there is no obvious difference. During the tests I
> > was also watching the "top" output and I didn't notice any
> > difference either. The "leader_only" change on top of this patch is
> > simply:
> > ------8><------------------------------------------><8-----------
> > diff --git a/tests/ovsdb-cluster.at b/tests/ovsdb-cluster.at
> > index 5550a19..94ec9c7 100644
> > --- a/tests/ovsdb-cluster.at
> > +++ b/tests/ovsdb-cluster.at
> > @@ -140,7 +140,7 @@ ovsdb|WARN|schema: changed 2 columns in 'OVN_Southbound' database from ephemeral
> >               for k in $(seq $n3); do
> >                   txn="$txn -- add SB_Global . external_ids $i-$j-$k=$i-$j-$k"
> >               done
> > -             run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 --no-leader-only $txn
> > +             run_as "ovn-sbctl($i-$j)" ovn-sbctl "-vPATTERN:console:ovn-sbctl($i-$j)|%D{%H:%M:%S}|%05N|%c|%p|%m" --log-file=$i-$j.log -vfile -vsyslog:off -vtimeval:off --timeout=120 $txn
> >               status=$?
> >               if test $status != 0; then
> >                   echo "$i-$j exited with status $status" > $i-$j:$status
> >
> > --------------------------------------------------------------------------
> > Please try it and let me know if you still have concerns.
> >
> > Thanks,
> > Han
>
> Hi Ben, this one is small but important; it is just waiting for your
> final approval. Sorry for pushing this, but I am also working on UTs
> for different failure scenarios of Raft, and this patch helps with
> some of the test cases, so I hope it can be merged before I send out
> the UT series.
>
> Thanks,
> Han

I embedded this one as the first patch in a new series:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=101873
The new series is mainly for raft UTs and one bug fix that is not
closely related to this patch but depends on it.

