[ovs-dev] [PATCH v3 1/9] jsonrpc: Avoid disconnecting prematurely due to long poll intervals.

Ilya Maximets i.maximets at ovn.org
Mon Dec 21 11:58:14 UTC 2020


On 12/19/20 3:11 AM, Ben Pfaff wrote:
> On Fri, Dec 18, 2020 at 10:11:15PM +0100, Ilya Maximets wrote:
>> On 12/18/20 6:31 AM, Ben Pfaff wrote:
>>> Open vSwitch has a few different jsonrpc-based protocols that depend on
>>> jsonrpc_session to make sure that the connection is up and working.
>>> In turn, jsonrpc_session uses the "reconnect" state machine to send
>>> probes if nothing is received.  This works fine in normal circumstances.
>>> In unusual circumstances, though, it can happen that the program is
>>> busy and doesn't even try to receive anything for a long time.  Then the
>>> timer can time out without a good reason; if it had tried to receive
>>> something, it would have.
>>>
>>> There's a solution that the clients of jsonrpc_session could adopt.
>>> Instead of first calling jsonrpc_session_run(), which is what calls into
>>> "reconnect" to deal with timing out, and then calling into
>>> jsonrpc_session_recv(), which is what tries to receive something, they
>>> could use the opposite order.  That would make sure that the timeout
>>> was always based on a recent attempt to receive something.  Great.
>>>
>>> The actual code in OVS that uses jsonrpc_session, though, tends to use
>>> the opposite order, and there are enough users and this is a subtle
>>> enough issue that it could get flipped back around even if we fixed it
>>> now.  So this commit takes a different approach.  Instead of fixing
>>> this in the users of jsonrpc_session, we fix it in the users of
>>> reconnect: make them tell when they've tried to receive something (or
>>> disable this particular feature).
>>>
>>> This commit fixes the problem that way.  It's kind of hard to reproduce
>>> but I'm pretty sure that I've seen it a number of times in testing.
>>>
>>> Signed-off-by: Ben Pfaff <blp at ovn.org>
>>> ---
>>>  lib/jsonrpc.c          |  5 ++++-
>>>  lib/reconnect.c        | 25 +++++++++++++++++++++++--
>>>  lib/reconnect.h        |  1 +
>>>  tests/test-reconnect.c |  1 +
>>>  4 files changed, 29 insertions(+), 3 deletions(-)
>>
>> Thanks for the fix!
>>
>> I'd still prefer to s/reconnect_try_receive/reconnect_receive_attempt/
>> or some other name since the 'try_receive' part makes me think that
>> we're actually asking it to receive something.
> 
> I am sorry that I forgot about that request from the OVN meeting.  I
> made that change just now.
> 
> I decided to call it reconnect_receive_attempted(), to indicate that the
> received attempt has already happened.
> 
>> It might also be good to add a specific unit test to test-reconnect
>> that will check that timeout is not set.  This should not be hard, but,
>> I guess, this could be a separate patch.
> 
> I wasn't sure how to test this.  Could you explain your idea?  I didn't
> quite follow.

I mean pure reconnect test in tests/reconnect.at.
Idea is to add new command 'do_receive_attempted' to test-reconnect.c and
check that in pair with 'advance' we will have particular timeouts or
'no timeout' if there was no receive attempted.  Does that make sense?

BTW, do we need the same fix for the python 'reconnect' library?

> 
>> For this patch with or without above changes:
>>
>> Acked-by: Ilya Maximets <i.maximets at ovn.org>
> .
>> It might make sense to backport this to 2.13 as OVN users might hit
>> this issue on a highly loaded cluster.
> 
> Thanks.  I applied this to master and branch-2.13.

We should, probably, apply this to 2.14 too to avoid feature/fix gap
during upgrade.

Best regards, Ilya Maximets.


More information about the dev mailing list