[ovs-dev] [PATCH v7] ovsdb: provide raft and command interfaces with priority

Wed Aug 18 07:47:49 UTC 2021

On 17/08/2021 13:52, Ilya Maximets wrote:
> On 8/17/21 1:27 PM, Anton Ivanov wrote:
>> Hi Ilya, hi list,
>>
>> I ran some detailed experiments and there is an issue with all forms of "skipping" and/or reordering processing.
>>
>> If the session list is skipped or reordered (I tried "fast-forwarding" the list to a new head position after hitting a time constraint), ovsdb fails to issue the response to some transactions when running the cluster test suite.
>>
>> At present I am unable to get to the root cause.
>>
>> The issue does not exist if processing bails out of the session loop and is re-run IN FULL (as in the earliest versions of the patch).
> That is weird.  I'm not sure how the re-ordering is different from
> the 're-run in full' here.  The only thing that is different is the
> actual order in which sessions are processed, because we're still
> re-running all of them in full for as long as time allows.

Statement of fact: every variety of skipping resulted in this failure from time to time. I went through the code about 20 times yesterday and I still cannot figure out what causes it.

The only thing that worked in the end was to reorder the list as follows (this is in version 8) and re-run the sessions in full:

1. Cut out the elements from the one following the point where processing was interrupted through to the tail. Create a new list from these elements (all the unprocessed ones). The "next" element is now at the head position.

2. Push the processed elements onto the end of this new list.

3. Replace the original list with this rearranged list.
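
A minimal sketch of the three steps above, in C. It is only an illustration: the list type (struct node) and the helper names (list_init, list_push_back, move_range_to_tail, requeue_unprocessed_first) are hypothetical stand-ins, not the ovsdb session structures, not the OVS list API, and not the code in version 8 of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical circular doubly linked list with a sentinel head.
 * 'id' stands in for a session; it is here only for illustration. */
struct node {
    struct node *prev, *next;
    int id;
};

static void list_init(struct node *head)
{
    head->prev = head->next = head;
}

static void list_push_back(struct node *head, struct node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlinks the inclusive range [first, last] from whatever list it is on
 * and appends it, in the same order, to the tail of 'dst'. */
static void move_range_to_tail(struct node *dst,
                               struct node *first, struct node *last)
{
    /* Detach the range from its current list. */
    first->prev->next = last->next;
    last->next->prev = first->prev;

    /* Attach it just before the sentinel of 'dst'. */
    first->prev = dst->prev;
    last->next = dst;
    dst->prev->next = first;
    dst->prev = last;
}

/* 'last_done' is the last element that was processed before the
 * interruption.  Passing the sentinel itself means "nothing processed". */
static void requeue_unprocessed_first(struct node *sessions,
                                      struct node *last_done)
{
    struct node rearranged;

    assert(sessions != NULL && last_done != NULL);
    if (last_done == sessions || last_done->next == sessions) {
        return;     /* Nothing or everything was processed: keep order. */
    }
    list_init(&rearranged);

    /* Step 1: cut the unprocessed elements out into a new list. */
    move_range_to_tail(&rearranged, last_done->next, sessions->prev);

    /* Step 2: push the processed elements onto the end of the new list. */
    move_range_to_tail(&rearranged, sessions->next, last_done);

    /* Step 3: replace the (now empty) original list with the new one. */
    move_range_to_tail(sessions, rearranged.next, rearranged.prev);
}
```

With sessions 1..5 on the list and processing interrupted after session 2, the list becomes 3, 4, 5, 1, 2. Every element stays on the list, so nothing is skipped; only the order changes.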

There is no skipping at the session level - they are all re-run immediately after giving raft_run a chance to work, just in a different order - first the unprocessed ones, then the ones that were processed prior to the interruption.

I also tried a few other approaches - e.g. rearranging the list order on each iteration. They can also be made to work, provided that there is a full re-run and no skipping.

A.

>
>> I am going to re-issue the patch without any skipping whatsoever (either at remotes or at sessions level), because that works and improves raft (and overall ovn) stability.
>>
>> While there may be some starvation of the sessions towards the end of the session list, it should be a second order effect, because re-processing sessions which have just been processed generates only a minimal amount of changes.
>>
>> Skipping (if any) will be a later optimization after I get to the bottom of this and figure out why monitor updates are not followed by the transaction response.
> This doesn't sound good to me.  It's pretty easy to spam the
> ovsdb-server with monitor requests or condition changes.  This
> requires a walk across the whole database.  And if the database
> is big enough, other sessions will never be served due to one
> faulty/malicious connection.   It's also possible that we
> have a few thousand connections and processing of all of them
> legitimately takes a lot of time.  This will be a problem
> if the rate of database changes is relatively high and constant.
>
> Best regards, Ilya Maximets.
>
-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/