[ovs-dev] [PATCH] rconn: Introduce new invariant to fix assertion failure in corner case.

Ben Pfaff blp at ovn.org
Mon Jun 18 19:25:11 UTC 2018


On Mon, Jun 18, 2018 at 09:31:43AM -0700, Han Zhou wrote:
> On Mon, Jun 18, 2018 at 8:34 AM, Ilya Maximets <i.maximets at samsung.com>
> wrote:
> >
> > On 18.06.2018 18:07, Ben Pfaff wrote:
> > > On Mon, Jun 18, 2018 at 05:18:49PM +0300, Ilya Maximets wrote:
> > >>> On Wed, May 23, 2018 at 09:28:59PM -0700, Ben Pfaff wrote:
> > >>>> On Wed, May 23, 2018 at 06:06:44PM -0700, Han Zhou wrote:
> > >>>>> On Wed, May 23, 2018 at 5:14 PM, Ben Pfaff <blp at ovn.org> wrote:
> > >>>>>>
> > >>>>>> Until now, rconn_get_version() has only reported the OpenFlow
> version in
> > >>>>>> use when the rconn is actually connected.  This makes sense, but
> it has a
> > >>>>>> harsh consequence.  Consider code like this:
> > >>>>>>
> > >>>>>>     if (rconn_is_connected(rconn) && rconn_get_version(rconn) >=
> 0) {
> > >>>>>>         for (int i = 0; i < 2; i++) {
> > >>>>>>             struct ofpbuf *b = ofputil_encode_echo_request(
> > >>>>>>                 rconn_get_version(rconn));
> > >>>>>>             rconn_send(rconn, b, NULL);
> > >>>>>>         }
> > >>>>>>     }
> > >>>>>>
> > >>>>>> Maybe not the smartest code in the world, and probably no one
> would write
> > >>>>>> this exact code in any case, but it doesn't look too risky or
> crazy.
> > >>>>>>
> > >>>>>> But it is.  The second trip through the loop can assert-fail inside
> > >>>>>> ofputil_encode_echo_request() because rconn_get_version(rconn)
> returns -1
> > >>>>>> instead of a valid OpenFlow version.  That happens if the first
> call to
> > >>>>>> rconn_send() encounters an error while sending the message and
> therefore
> > >>>>>> destroys the underlying vconn and disconnects so that
> rconn_get_version()
> > >>>>>> doesn't have a vconn to query for its version.
> > >>>>>>
> > >>>>>> In a case like this where all the code to send the messages is
> close by,
> > >>>>> we
> > >>>>>> could just check rconn_get_version() in each loop iteration.  We
> could
> > >>>>> even
> > >>>>>> go through the tree and convince ourselves that individual bits of
> code
> > >>>>> are
> > >>>>>> safe, or be conservative and check rconn_get_version() >= 0 in the
> iffy
> > >>>>>> cases.  But this seems to me like an ongoing source of risk and a
> way to
> > >>>>>> get things wrong in corner cases.
> > >>>>>>
> > >>>>>> This commit takes a different approach.  It introduces a new
> invariant: if
> > >>>>>> an rconn has ever been connected, then it returns a valid OpenFlow
> version
> > >>>>>> from rconn_get_version().  In addition, if an rconn is currently
> > >>>>> connected,
> > >>>>>> then the OpenFlow version it returns is the correct one (that may
> be
> > >>>>>> obvious, but there were corner cases before where it returned -1
> even
> > >>>>>> though rconn_is_connected() returned true).
> > >>>>>>
> > >>>>>> With this commit, the code above would work OK.  If the first call
> to
> > >>>>>> rconn_send() encounters an error sending the message, then
> > >>>>>> rconn_get_version() in the second iteration will return the same
> value as
> > >>>>>> in the first iteration.  The message passed to rconn_send() will
> end up
> > >>>>>> being discarded, but that's much better than either an assertion
> failure
> > >>>>> or
> > >>>>>> having to carefully analyze a lot of our code to deal with one
> unusual
> > >>>>>> corner case.
> > >>>>>>
> > >>>>>> Reported-by: Han Zhou <zhouhan at gmail.com>
> > >>>>>> Signed-off-by: Ben Pfaff <blp at ovn.org>
> > >>>>>> ---
> > >>>>>>  lib/learning-switch.c |  2 +-
> > >>>>>>  lib/rconn.c           | 41
> ++++++++++++++++-------------------------
> > >>>>>>  lib/vconn.c           |  1 +
> > >>>>>>  3 files changed, 18 insertions(+), 26 deletions(-)
> > >>>>>>
> > >>>>> Acked-by: Han Zhou <hzhou8 at ebay.com>
> > >>>>
> > >>>> Thanks.  I applied this to master.  I'll backport it to older
> versions
> > >>>> if no one notices trouble soon.
> > >>>
> > >>> I backported as far as branch-2.5.
> > >>
> > >> Sorry for being late, but I just tried to use branch-2.9 on my test
> environment
> > >> and found that this patch produces a lot of "connected" logs that are
> a bit annoying:
> > >>
> > >> 2018-06-18T12:55:06.610Z|03051|rconn|INFO|br-int<->unix#2873: connected
> > >> 2018-06-18T12:55:08.609Z|03052|rconn|INFO|br-int<->unix#2874: connected
> > >> 2018-06-18T12:55:08.610Z|03053|rconn|INFO|br-int<->unix#2875: connected
> > >> 2018-06-18T12:55:10.608Z|03054|rconn|INFO|br-int<->unix#2876: connected
> > >> 2018-06-18T12:55:10.609Z|03055|rconn|INFO|br-int<->unix#2877: connected
> > >> 2018-06-18T12:55:12.609Z|03056|rconn|INFO|br-int<->unix#2878: connected
> > >> 2018-06-18T12:55:12.609Z|03057|rconn|INFO|br-int<->unix#2879: connected
> > >> 2018-06-18T12:55:14.612Z|03058|rconn|INFO|br-int<->unix#2880: connected
> > >> 2018-06-18T12:55:14.613Z|03059|rconn|INFO|br-int<->unix#2881: connected
> > >> 2018-06-18T12:55:16.613Z|03060|rconn|INFO|br-int<->unix#2882: connected
> > >> 2018-06-18T12:55:16.614Z|03061|rconn|INFO|br-int<->unix#2883: connected
> > >> 2018-06-18T12:55:18.609Z|03062|rconn|INFO|br-int<->unix#2884: connected
> > >> 2018-06-18T12:55:18.610Z|03063|rconn|INFO|br-int<->unix#2885: connected
> > >> 2018-06-18T12:55:20.610Z|03064|rconn|INFO|br-int<->unix#2886: connected
> > >> 2018-06-18T12:55:20.611Z|03065|rconn|INFO|br-int<->unix#2887: connected
> > >>
> > >> On my system it's about 2 log messages on each 2 seconds. It has only
> > >> ovsdb-server, ovs-vswitchd and ovsdb-client monitor running.
> > >>
> > >> Could you please check?
> > >
> > > Are you sure that it's this patch that makes a difference?  It doesn't
> > > change anything in the logging.
> >
> > I don't know much about this part of OVS. That is wat I have with
> reverted patch
> > and debug enabled:
> >
> > 2018-06-18T15:31:09.162Z|00423|rconn|DBG|br-int<->unix#150: entering
> ACTIVE
> > 2018-06-18T15:31:09.163Z|00424|rconn|DBG|br-int<->unix#150: connection
> closed by peer
> > 2018-06-18T15:31:09.163Z|00425|rconn|DBG|br-int<->unix#150: entering
> DISCONNECTED
> > 2018-06-18T15:31:11.162Z|00426|rconn|DBG|br-int<->unix#151: entering
> ACTIVE
> > 2018-06-18T15:31:11.162Z|00427|rconn|DBG|br-int<->unix#151: connection
> closed by peer
> > 2018-06-18T15:31:11.162Z|00428|rconn|DBG|br-int<->unix#151: entering
> DISCONNECTED
> > 2018-06-18T15:31:11.162Z|00429|rconn|DBG|br-int<->unix#152: entering
> ACTIVE
> > 2018-06-18T15:31:11.163Z|00430|rconn|DBG|br-int<->unix#152: connection
> closed by peer
> > 2018-06-18T15:31:11.163Z|00431|rconn|DBG|br-int<->unix#152: entering
> DISCONNECTED
> >
> > Without revert (just branch-2.9 with patch applied):
> >
> > 2018-06-18T15:26:32.366Z|00658|rconn|DBG|br-int<->unix#198: entering
> ACTIVE
> > 2018-06-18T15:26:32.366Z|00659|rconn|DBG|br-int<->unix#198: connection
> closed by peer
> > 2018-06-18T15:26:32.366Z|00660|rconn|DBG|br-int<->unix#198: entering
> DISCONNECTED
> > 2018-06-18T15:26:34.365Z|00661|rconn|DBG|br-int<->unix#199: entering
> CONNECTING
> > 2018-06-18T15:26:34.365Z|00662|rconn|INFO|br-int<->unix#199: connected
> > 2018-06-18T15:26:34.365Z|00663|rconn|DBG|br-int<->unix#199: entering
> ACTIVE
> > 2018-06-18T15:26:34.365Z|00664|rconn|DBG|br-int<->unix#199: connection
> closed by peer
> > 2018-06-18T15:26:34.365Z|00665|rconn|DBG|br-int<->unix#199: entering
> DISCONNECTED
> > 2018-06-18T15:26:34.365Z|00666|rconn|DBG|br-int<->unix#200: entering
> CONNECTING
> > 2018-06-18T15:26:34.365Z|00667|rconn|INFO|br-int<->unix#200: connected
> > 2018-06-18T15:26:34.365Z|00668|rconn|DBG|br-int<->unix#200: entering
> ACTIVE
> > 2018-06-18T15:26:34.366Z|00669|rconn|DBG|br-int<->unix#200: connection
> closed by peer
> > 2018-06-18T15:26:34.366Z|00670|rconn|DBG|br-int<->unix#200: entering
> DISCONNECTED
> > 2018-06-18T15:26:36.365Z|00671|rconn|DBG|br-int<->unix#201: entering
> CONNECTING
> > 2018-06-18T15:26:36.365Z|00672|rconn|INFO|br-int<->unix#201: connected
> > 2018-06-18T15:26:36.365Z|00673|rconn|DBG|br-int<->unix#201: entering
> ACTIVE
> > 2018-06-18T15:26:36.365Z|00674|rconn|DBG|br-int<->unix#201: connection
> closed by peer
> > 2018-06-18T15:26:36.365Z|00675|rconn|DBG|br-int<->unix#201: entering
> DISCONNECTED
> > 2018-06-18T15:26:36.365Z|00676|rconn|DBG|br-int<->unix#202: entering
> CONNECTING
> > 2018-06-18T15:26:36.366Z|00677|rconn|INFO|br-int<->unix#202: connected
> >
> > Looks like that patch just changed a workflow a bit introducing
> "CONNECTING"
> > state in case where was just direct transition to "ACTIVE".
> >
> Hi Ilya,
> 
> From the debug log it looks like the connection is flapping, even without
> the patch. The patch just revealed the problem by an info level log.

Probably these are just connections from "ovs-ofctl" or similar.  We try
to suppress logging for them because it isn't too useful.


More information about the dev mailing list