[ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

Ilya Maximets i.maximets at samsung.com
Wed Jun 20 07:42:58 UTC 2018


On 19.06.2018 19:03, Ben Pfaff wrote:
> On Tue, Jun 19, 2018 at 10:58:36AM +0300, Ilya Maximets wrote:
>> On 18.06.2018 22:24, Ben Pfaff wrote:
>>> On Mon, Jun 18, 2018 at 06:34:13PM +0300, Ilya Maximets wrote:
>>>> On 18.06.2018 18:07, Ben Pfaff wrote:
>>>>> On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
>>>>>>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
>>>>>>>> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
>>>>>>>>> On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff <blp at ovn.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Until now, rconn_get_version() has only reported the OpenFlow version in
>>>>>>>>>> use when the rconn is actually connected.  This makes sense, but it has a
>>>>>>>>>> harsh consequence.  Consider code like this:
>>>>>>>>>>
>>>>>>>>>>     if (rconn_is_connected(rconn) && rconn_get_version(rconn) >= 0) {
>>>>>>>>>>         for (int i = 0; i < 2; i++) {
>>>>>>>>>>             struct ofpbuf *b = ofputil_encode_echo_request(
>>>>>>>>>>                 rconn_get_version(rconn));
>>>>>>>>>>             rconn_send(rconn, b, NULL);
>>>>>>>>>>         }
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>> Maybe not the smartest code in the world, and probably no one would write
>>>>>>>>>> this exact code in any case, but it doesn't look too risky or crazy.
>>>>>>>>>>
>>>>>>>>>> But it is.  The second trip through the loop can assert-fail inside
>>>>>>>>>> ofputil_encode_echo_request() because rconn_get_version(rconn) returns -1
>>>>>>>>>> instead of a valid OpenFlow version.  That happens if the first call to
>>>>>>>>>> rconn_send() encounters an error while sending the message and therefore
>>>>>>>>>> destroys the underlying vconn and disconnects so that rconn_get_version()
>>>>>>>>>> doesn't have a vconn to query for its version.
>>>>>>>>>>
>>>>>>>>>> In a case like this where all the code to send the messages is close by,
>>>>>>>>> we
>>>>>>>>>> could just check rconn_get_version() in each loop iteration.  We could
>>>>>>>>> even
>>>>>>>>>> go through the tree and convince ourselves that individual bits of code
>>>>>>>>> are
>>>>>>>>>> safe, or be conservative and check rconn_get_version() >= 0 in the iffy
>>>>>>>>>> cases.  But this seems to me like an ongoing source of risk and a way to
>>>>>>>>>> get things wrong in corner cases.
>>>>>>>>>>
>>>>>>>>>> This commit takes a different approach.  It introduces a new invariant: if
>>>>>>>>>> an rconn has ever been connected, then it returns a valid OpenFlow version
>>>>>>>>>> from rconn_get_version().  In addition, if an rconn is currently
>>>>>>>>> connected,
>>>>>>>>>> then the OpenFlow version it returns is the correct one (that may be
>>>>>>>>>> obvious, but there were corner cases before where it returned -1 even
>>>>>>>>>> though rconn_is_connected() returned true).
>>>>>>>>>>
>>>>>>>>>> With this commit, the code above would work OK.  If the first call to
>>>>>>>>>> rconn_send() encounters an error sending the message, then
>>>>>>>>>> rconn_get_version() in the second iteration will return the same value as
>>>>>>>>>> in the first iteration.  The message passed to rconn_send() will end up
>>>>>>>>>> being discarded, but that's much better than either an assertion failure
>>>>>>>>> or
>>>>>>>>>> having to carefully analyze a lot of our code to deal with one unusual
>>>>>>>>>> corner case.
>>>>>>>>>>
>>>>>>>>>> Reported-by: Han Zhou <zhouhan at gmail.com>
>>>>>>>>>> Signed-off-by: Ben Pfaff <blp at ovn.org>
>>>>>>>>>> ---
>>>>>>>>>>  lib/learning-switch.c |  2 +-
>>>>>>>>>>  lib/rconn.c           | 41 ++++++++++++++++-------------------------
>>>>>>>>>>  lib/vconn.c           |  1 +
>>>>>>>>>>  3 files changed, 18 insertions(+), 26 deletions(-)
>>>>>>>>>>
>>>>>>>>> Acked-by: Han Zhou <hzhou8 at ebay.com>
>>>>>>>>
>>>>>>>> Thanks.  I applied this to master.  I'll backport it to older versions
>>>>>>>> if no one notices trouble soon.
>>>>>>>
>>>>>>> I backported as far as branch-2.5.
>>>>>>
>>>>>> Sorry for being late, but I just tried to use branch-2.9 on my test environment
>>>>>> and found that this patch produces a lot of "connected" logs that are a bit annoying:
>>>>>>
>>>>>> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
>>>>>> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
>>>>>> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
>>>>>> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
>>>>>> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
>>>>>> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
>>>>>> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
>>>>>> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
>>>>>> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
>>>>>> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
>>>>>> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
>>>>>> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
>>>>>> 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected
>>>>>> 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected
>>>>>> 2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected
>>>>>>
>>>>>> On my system it's about 2 log messages on each 2 seconds. It has only
>>>>>> ovsdb-server, ovs-vswitchd and ovsdb-client monitor running.
>>>>>>
>>>>>> Could you please check?
>>>>>
>>>>> Are you sure that it's this patch that makes a difference?  It doesn't
>>>>> change anything in the logging.
>>>>
>>>> I don't know much about this part of OVS. That is wat I have with reverted patch
>>>> and debug enabled:
>>>>
>>>> 2018-06-18T15:31:09.162Z|00423|rconn|DBG|br-int<->unix#150: entering ACTIVE
>>>> 2018-06-18T15:31:09.163Z|00424|rconn|DBG|br-int<->unix#150: connection closed by peer
>>>> 2018-06-18T15:31:09.163Z|00425|rconn|DBG|br-int<->unix#150: entering DISCONNECTED
>>>> 2018-06-18T15:31:11.162Z|00426|rconn|DBG|br-int<->unix#151: entering ACTIVE
>>>> 2018-06-18T15:31:11.162Z|00427|rconn|DBG|br-int<->unix#151: connection closed by peer
>>>> 2018-06-18T15:31:11.162Z|00428|rconn|DBG|br-int<->unix#151: entering DISCONNECTED
>>>> 2018-06-18T15:31:11.162Z|00429|rconn|DBG|br-int<->unix#152: entering ACTIVE
>>>> 2018-06-18T15:31:11.163Z|00430|rconn|DBG|br-int<->unix#152: connection closed by peer
>>>> 2018-06-18T15:31:11.163Z|00431|rconn|DBG|br-int<->unix#152: entering DISCONNECTED
>>>>
>>>> Without revert (just branch-2.9 with patch applied):
>>>>
>>>> 2018-06-18T15:26:32.366Z|00658|rconn|DBG|br-int<->unix#198: entering ACTIVE
>>>> 2018-06-18T15:26:32.366Z|00659|rconn|DBG|br-int<->unix#198: connection closed by peer
>>>> 2018-06-18T15:26:32.366Z|00660|rconn|DBG|br-int<->unix#198: entering DISCONNECTED
>>>> 2018-06-18T15:26:34.365Z|00661|rconn|DBG|br-int<->unix#199: entering CONNECTING
>>>> 2018-06-18T15:26:34.365Z|00662|rconn|INFO|br-int<->unix#199: connected
>>>> 2018-06-18T15:26:34.365Z|00663|rconn|DBG|br-int<->unix#199: entering ACTIVE
>>>> 2018-06-18T15:26:34.365Z|00664|rconn|DBG|br-int<->unix#199: connection closed by peer
>>>> 2018-06-18T15:26:34.365Z|00665|rconn|DBG|br-int<->unix#199: entering DISCONNECTED
>>>> 2018-06-18T15:26:34.365Z|00666|rconn|DBG|br-int<->unix#200: entering CONNECTING
>>>> 2018-06-18T15:26:34.365Z|00667|rconn|INFO|br-int<->unix#200: connected
>>>> 2018-06-18T15:26:34.365Z|00668|rconn|DBG|br-int<->unix#200: entering ACTIVE
>>>> 2018-06-18T15:26:34.366Z|00669|rconn|DBG|br-int<->unix#200: connection closed by peer
>>>> 2018-06-18T15:26:34.366Z|00670|rconn|DBG|br-int<->unix#200: entering DISCONNECTED
>>>> 2018-06-18T15:26:36.365Z|00671|rconn|DBG|br-int<->unix#201: entering CONNECTING
>>>> 2018-06-18T15:26:36.365Z|00672|rconn|INFO|br-int<->unix#201: connected
>>>> 2018-06-18T15:26:36.365Z|00673|rconn|DBG|br-int<->unix#201: entering ACTIVE
>>>> 2018-06-18T15:26:36.365Z|00674|rconn|DBG|br-int<->unix#201: connection closed by peer
>>>> 2018-06-18T15:26:36.365Z|00675|rconn|DBG|br-int<->unix#201: entering DISCONNECTED
>>>> 2018-06-18T15:26:36.365Z|00676|rconn|DBG|br-int<->unix#202: entering CONNECTING
>>>> 2018-06-18T15:26:36.366Z|00677|rconn|INFO|br-int<->unix#202: connected
>>>>
>>>> Looks like that patch just changed a workflow a bit introducing "CONNECTING"
>>>> state in case where was just direct transition to "ACTIVE".
>>>>
>>>> Best regards, Ilya Maximets.
>>>
>>> Thanks, I understand now.
>>>
>>> Does the following patch restore the previous behavior?
>>>
>>> diff --git a/lib/rconn.c b/lib/rconn.c
>>> index a3f811fedbe3..b30909fd1332 100644
>>> --- a/lib/rconn.c
>>> +++ b/lib/rconn.c
>>> @@ -511,7 +511,9 @@ run_CONNECTING(struct rconn *rc)
>>>  {
>>>      int retval = vconn_connect(rc->vconn);
>>>      if (!retval) {
>>> -        VLOG_INFO("%s: connected", rc->name);
>>> +        if (rc->reliable) {
>>> +            VLOG_INFO("%s: connected", rc->name);
>>> +        }
>>>          rc->n_successful_connections++;
>>>          state_transition(rc, S_ACTIVE);
>>>          rc->version = vconn_get_version(rc->vconn);
>>
>>
>> Yes. Thanks. It works.
>> I used following change for testing:
>>
>> diff --git a/lib/rconn.c b/lib/rconn.c
>> index 7af4e73..ece8769 100644
>> --- a/lib/rconn.c
>> +++ b/lib/rconn.c
>> @@ -511,7 +511,7 @@ run_CONNECTING(struct rconn *rc)
>>  {
>>      int retval = vconn_connect(rc->vconn);
>>      if (!retval) {
>> -        VLOG_INFO("%s: connected", rc->name);
>> +        VLOG(rc->reliable ? VLL_INFO : VLL_DBG, "%s: connected", rc->name);
>>          rc->n_successful_connections++;
>>          state_transition(rc, S_ACTIVE);
>>          rc->version = vconn_get_version(rc->vconn);
>> ---
>>
>>
>> Will you prepare a proper patch or should I?
> 
> Your version is better than mine.
> 
> Do you mind preparing the proper patch?

Sure, will send in a minute.


More information about the dev mailing list