[ovs-dev] [PATCH v2] ovn-controller: eliminate stall in ofctrl state machine

Ben Pfaff blp at ovn.org
Fri Jul 22 22:47:14 UTC 2016


On Thu, Jul 07, 2016 at 08:31:08PM -0400, Lance Richardson wrote:
> The "ovn -- 2 HVs, 3 LRs connected via LS, static routes"
> test case currently exhibits frequent failures. These failures
> occur because, at the time that the test packets are sent to
> verify forwarding, no flows have been installed in the vswitch
> for one of the hypervisors.
> 
> Investigation shows that, in the failing case, the ofctrl state
> machine has not yet transitioned to the S_UPDATE_FLOWS state.
> This occurrs when ofctrl_run() is called and:
>    1) The state is S_TLV_TABLE_MOD_SENT.
>    2) An OFPTYPE_NXT_TLV_TABLE_REPLY message is queued for reception.
>    3) No event (other than SB probe timer expiration) is expected
>       that would unblock poll_block() in the main ovn-controller
>       loop.
> 
> In this scenario, ofctrl_run() will move state to S_CLEAR_FLOWS
> and return, without having executed run_S_CLEAR_FLOWS() which
> would have immediately transitioned the state to S_UPDATE_FLOWS
> which is needed in order for ovn-controller to configure flows
> in ovs-vswitchd. After a delay of about 5 seconds (the default
> SB probe timer interval), ofctrl_run() would be called again
> to make the transition to S_UPDATE_FLOWS, but by this time
> the test case has already failed.
> 
> Fix by expanding the state machine's "while state != old_state"
> loop to include processing of receive events, with a maximum
> iteration limit to prevent excessive looping in pathological
> cases. Without this fix, around 40 failures are seen out of
> 100 attempts, with this fix no failures have been observed in
> several hundred attempts.
> 
> Signed-off-by: Lance Richardson <lrichard at redhat.com>
> ---
>  v2: Added maximum iteration limit to state machine loop.

Thanks for investigating this.

I understand why your patch fixes a problem, but I don't yet understand
the root cause of the problem, and I'd like to know more before I commit
the patch.  I have two questions:

First, I don't understand why, if there's a message queued for reception
as you describe in 2), the main loop would wait for a timer expiration.
Before poll_block(), the main loop should call ofctrl_wait(), which
calls rconn_recv_wait(), which should cause poll_block() to wake up if
there's anything incoming from the OpenFlow connection.

Second, I don't know why the test "state == old_state" is here in
ofctrl_run().  I think that we can delete it.  It could be a culprit,
but on the other hand it seems like it's not a big deal if poll_block()
properly wakes up when a message is received:

    for (int i = 0; state == old_state && i < 50; i++) {
        struct ofpbuf *msg = rconn_recv(swconn);
        if (!msg) {
            break;
        }

Thanks,

Ben.



More information about the dev mailing list