[ovs-dev] [PATCH] ofproto: fix interactions between flow monitors and barriers
blp at nicira.com
Thu Feb 27 20:55:23 UTC 2014
On Thu, Feb 27, 2014 at 02:50:47PM +0900, YAMAMOTO Takashi wrote:
> >> On Wed, Feb 05, 2014 at 07:04:05PM +0900, YAMAMOTO Takashi wrote:
> >>> Following OpenFlow 1.4 semantics, make barriers wait for
> >>> flow monitor replies. This should fix a race in
> >>> "ofproto - flow monitoring pause and resume" test.
> >>> Signed-off-by: YAMAMOTO Takashi <yamamoto at valinux.co.jp>
> >> I still don't understand this. You cite the following:
> >> * OpenFlow 1.4.0 p.110:
> >> * A OFPMP_FLOW_MONITOR multipart reply can not cross a barrier
> >> * handshake. The switch must always send the OFPMP_FLOW_MONITOR
> >> * multipart reply for a given flow table change before the reply
> >> * to a OFPT_BARRIER_REQUEST request that follows the request
> >> * responsible for the flow table change.
> >> but can you elaborate on your thought process here? After all, when
> >> we're paused we're not sending any flow table changes at all.
> > right. and when resumed, we send flow table changes.
> > the "ofproto - flow monitoring pause and resume" test seems to
> > expect that it can wait for these flow table changes after resume
> > using a barrier. the expectation seems valid from my reading of
> > the specification.
> fwiw, i still see the failure with today's master.
OK, I spent some time trying to understand this, and I think that I
follow now. (My misunderstanding was partly because it doesn't
actually fail for me.)
This test starts an ovs-ofctl monitoring the flow table and makes that
process block so that it isn't actually reading any of the monitor
replies, causing the queue to fill up and ovs-vswitchd to send down a
"pause" message. Then it unblocks, sends a barrier, waits for the
reply, and then tells ovs-ofctl to exit. It checks the log to make
sure that the output includes a resume message. Thus, it implicitly
depended on the resume coming before the barrier reply or, at least,
before it tells ovs-ofctl to exit.
If I'm reading it right, this commit fixes the problem by never
sending a barrier reply while a monitor is paused, instead waiting for
the monitor to resume first. This will certainly avoid the problem.
I am not sure that the spec actually requires it (if it does, then it
is due to lack of alertness on my part, because the OpenFlow spec is
based on what Open vSwitch does). It seems to me, though, that it
adds a risk of not ever making forward progress, because it is
possible that we resume and then pause again before the barrier reply
Thus, I am inclined to consider this a bug in the test, and fixing it
by waiting until the "resume" shows up in the monitor log.
What do you think?
More information about the dev