[ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

Han Zhou zhouhan at gmail.com
Mon Jun 18 16:31:43 UTC 2018


On Mon, Jun 18, 2018 at 8:34 AM, Ilya Maximets <i.maximets at samsung.com>
wrote:
>
> On 18.06.2018 18:07, Ben Pfaff wrote:
> > On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
> >>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
> >>>> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
> >>>>> On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff <blp at ovn.org> wrote:
> >>>>>>
> >>>>>> Until now, rconn_get_version() has only reported the OpenFlow
version in
> >>>>>> use when the rconn is actually connected.  This makes sense, but
it has a
> >>>>>> harsh consequence.  Consider code like this:
> >>>>>>
> >>>>>>     if (rconn_is_connected(rconn) && rconn_get_version(rconn) >=
0) {
> >>>>>>         for (int i = 0; i < 2; i++) {
> >>>>>>             struct ofpbuf *b = ofputil_encode_echo_request(
> >>>>>>                 rconn_get_version(rconn));
> >>>>>>             rconn_send(rconn, b, NULL);
> >>>>>>         }
> >>>>>>     }
> >>>>>>
> >>>>>> Maybe not the smartest code in the world, and probably no one
would write
> >>>>>> this exact code in any case, but it doesn't look too risky or
crazy.
> >>>>>>
> >>>>>> But it is.  The second trip through the loop can assert-fail inside
> >>>>>> ofputil_encode_echo_request() because rconn_get_version(rconn)
returns -1
> >>>>>> instead of a valid OpenFlow version.  That happens if the first
call to
> >>>>>> rconn_send() encounters an error while sending the message and
therefore
> >>>>>> destroys the underlying vconn and disconnects so that
rconn_get_version()
> >>>>>> doesn't have a vconn to query for its version.
> >>>>>>
> >>>>>> In a case like this where all the code to send the messages is
close by,
> >>>>> we
> >>>>>> could just check rconn_get_version() in each loop iteration.  We
could
> >>>>> even
> >>>>>> go through the tree and convince ourselves that individual bits of
code
> >>>>> are
> >>>>>> safe, or be conservative and check rconn_get_version() >= 0 in the
iffy
> >>>>>> cases.  But this seems to me like an ongoing source of risk and a
way to
> >>>>>> get things wrong in corner cases.
> >>>>>>
> >>>>>> This commit takes a different approach.  It introduces a new
invariant: if
> >>>>>> an rconn has ever been connected, then it returns a valid OpenFlow
version
> >>>>>> from rconn_get_version().  In addition, if an rconn is currently
> >>>>> connected,
> >>>>>> then the OpenFlow version it returns is the correct one (that may
be
> >>>>>> obvious, but there were corner cases before where it returned -1
even
> >>>>>> though rconn_is_connected() returned true).
> >>>>>>
> >>>>>> With this commit, the code above would work OK.  If the first call
to
> >>>>>> rconn_send() encounters an error sending the message, then
> >>>>>> rconn_get_version() in the second iteration will return the same
value as
> >>>>>> in the first iteration.  The message passed to rconn_send() will
end up
> >>>>>> being discarded, but that's much better than either an assertion
failure
> >>>>> or
> >>>>>> having to carefully analyze a lot of our code to deal with one
unusual
> >>>>>> corner case.
> >>>>>>
> >>>>>> Reported-by: Han Zhou <zhouhan at gmail.com>
> >>>>>> Signed-off-by: Ben Pfaff <blp at ovn.org>
> >>>>>> ---
> >>>>>>  lib/learning-switch.c |  2 +-
> >>>>>>  lib/rconn.c           | 41
++++++++++++++++-------------------------
> >>>>>>  lib/vconn.c           |  1 +
> >>>>>>  3 files changed, 18 insertions(+), 26 deletions(-)
> >>>>>>
> >>>>> Acked-by: Han Zhou <hzhou8 at ebay.com>
> >>>>
> >>>> Thanks.  I applied this to master.  I'll backport it to older
versions
> >>>> if no one notices trouble soon.
> >>>
> >>> I backported as far as branch-2.5.
> >>
> >> Sorry for being late, but I just tried to use branch-2.9 on my test
environment
> >> and found that this patch produces a lot of "connected" logs that are
a bit annoying:
> >>
> >> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
> >> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
> >> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
> >> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
> >> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
> >> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
> >> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
> >> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
> >> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
> >> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
> >> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
> >> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
> >> 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected
> >> 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected
> >> 2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected
> >>
> >> On my system it's about 2 log messages on each 2 seconds. It has only
> >> ovsdb-server, ovs-vswitchd and ovsdb-client monitor running.
> >>
> >> Could you please check?
> >
> > Are you sure that it's this patch that makes a difference?  It doesn't
> > change anything in the logging.
>
> I don't know much about this part of OVS. That is wat I have with
reverted patch
> and debug enabled:
>
> 2018-06-18T15:31:09.162Z|00423|rconn|DBG|br-int<->unix#150: entering
ACTIVE
> 2018-06-18T15:31:09.163Z|00424|rconn|DBG|br-int<->unix#150: connection
closed by peer
> 2018-06-18T15:31:09.163Z|00425|rconn|DBG|br-int<->unix#150: entering
DISCONNECTED
> 2018-06-18T15:31:11.162Z|00426|rconn|DBG|br-int<->unix#151: entering
ACTIVE
> 2018-06-18T15:31:11.162Z|00427|rconn|DBG|br-int<->unix#151: connection
closed by peer
> 2018-06-18T15:31:11.162Z|00428|rconn|DBG|br-int<->unix#151: entering
DISCONNECTED
> 2018-06-18T15:31:11.162Z|00429|rconn|DBG|br-int<->unix#152: entering
ACTIVE
> 2018-06-18T15:31:11.163Z|00430|rconn|DBG|br-int<->unix#152: connection
closed by peer
> 2018-06-18T15:31:11.163Z|00431|rconn|DBG|br-int<->unix#152: entering
DISCONNECTED
>
> Without revert (just branch-2.9 with patch applied):
>
> 2018-06-18T15:26:32.366Z|00658|rconn|DBG|br-int<->unix#198: entering
ACTIVE
> 2018-06-18T15:26:32.366Z|00659|rconn|DBG|br-int<->unix#198: connection
closed by peer
> 2018-06-18T15:26:32.366Z|00660|rconn|DBG|br-int<->unix#198: entering
DISCONNECTED
> 2018-06-18T15:26:34.365Z|00661|rconn|DBG|br-int<->unix#199: entering
CONNECTING
> 2018-06-18T15:26:34.365Z|00662|rconn|INFO|br-int<->unix#199: connected
> 2018-06-18T15:26:34.365Z|00663|rconn|DBG|br-int<->unix#199: entering
ACTIVE
> 2018-06-18T15:26:34.365Z|00664|rconn|DBG|br-int<->unix#199: connection
closed by peer
> 2018-06-18T15:26:34.365Z|00665|rconn|DBG|br-int<->unix#199: entering
DISCONNECTED
> 2018-06-18T15:26:34.365Z|00666|rconn|DBG|br-int<->unix#200: entering
CONNECTING
> 2018-06-18T15:26:34.365Z|00667|rconn|INFO|br-int<->unix#200: connected
> 2018-06-18T15:26:34.365Z|00668|rconn|DBG|br-int<->unix#200: entering
ACTIVE
> 2018-06-18T15:26:34.366Z|00669|rconn|DBG|br-int<->unix#200: connection
closed by peer
> 2018-06-18T15:26:34.366Z|00670|rconn|DBG|br-int<->unix#200: entering
DISCONNECTED
> 2018-06-18T15:26:36.365Z|00671|rconn|DBG|br-int<->unix#201: entering
CONNECTING
> 2018-06-18T15:26:36.365Z|00672|rconn|INFO|br-int<->unix#201: connected
> 2018-06-18T15:26:36.365Z|00673|rconn|DBG|br-int<->unix#201: entering
ACTIVE
> 2018-06-18T15:26:36.365Z|00674|rconn|DBG|br-int<->unix#201: connection
closed by peer
> 2018-06-18T15:26:36.365Z|00675|rconn|DBG|br-int<->unix#201: entering
DISCONNECTED
> 2018-06-18T15:26:36.365Z|00676|rconn|DBG|br-int<->unix#202: entering
CONNECTING
> 2018-06-18T15:26:36.366Z|00677|rconn|INFO|br-int<->unix#202: connected
>
> Looks like that patch just changed a workflow a bit introducing
"CONNECTING"
> state in case where was just direct transition to "ACTIVE".
>
Hi Ilya,

>From the debug log it looks like the connection is flapping, even without
the patch. The patch just revealed the problem by an info level log.

Thanks,
Han


More information about the dev mailing list