[ovs-discuss] ovs-master crashes due to double locking of ofproto_mutex.

Ben Pfaff blp at ovn.org
Mon Apr 15 22:59:41 UTC 2019


On Wed, Apr 03, 2019 at 03:36:36PM +0530, Anil Kumar Koli via discuss wrote:
> Hello OVS team,
> 
>  
> 
> OVS crash is observed in ovs-master when controller sends a packet with
> packet-out (output port as OFPP_TABLE) and any of the openflow table entry
> which gets hit results into a learn action.
> 
>  
> 
> Steps to reproduce:
> 
> 1. Start the ovs-vswitchd in dpdk mode.
> 
> 2. Configure the following flows
> 
> ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50,
> ct_state=-trk,ip, in_port=10 actions=ct(table=0)"
> 
> ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50,
> ct_state=+trk,ip, in_port=10 actions=ct(commit),resubmit(,1)"
> 
> ovs-ofctl -OOpenflow13 add-flow br-int "table=1
> actions=learn(table=2,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IN_PORT[
> ]->NXM_NX_REG0[0..15], output:NXM_NX_REG0[0..15]),resubmit(,2)"
> 
> 3. Send a packet with output as OFPP_TABLE
> 
> ovs-ofctl -OOpenflow13 packet-out br-int 'in_port=10
> packet=505400000007101111111111080045000028000000004006f97cc0a80001c0a800020
> 008000a0000000000000000500200002e7d0000, actions=TABLE'
> 
>  
> 
> This leads to a crash which is a case of double locking of oproto_mutex in
> handle_packet_out and ofproto_flow_mod_learn. 

Strangely, I did not have luck reproducing this, even though you gave a
good reproduction case, but I do recognize the problem.  (Your
reproduction case involves connection-tracking, which complicates
things.  Is connection-tracking an essential part of the problem?)

This use of ofproto_mutex was introduced in the initial commit that
created ofproto_mutex, back in 2013:

    commit abe7b10f58c1d0d1c6c556a89aa6bd9223f56944
    Author:	Ben Pfaff <blp at ovn.org>  Thu Sep 12 00:31:33 2013
    Committer:	Ben Pfaff <blp at ovn.org>  Thu Sep 12 17:43:56 2013

    ofproto: Drop 'expirable_mutex' in favor of new global 'ofproto_mutex'.

    This is the first step toward making a global lock that protects everything
    needed for updating the flow table.  This commit starts out by merging one
    lock into the new one, and later commits will continue that process.

    The mutex is initially a recursive one, because I wasn't sure that there
    were no nested acquisitions at this stage, but a later commit will change
    it to a regular error-checking mutex once it's settled down a bit.

    Signed-off-by: Ben Pfaff <blp at nicira.com>
    Acked-by: Ethan Jackson <ethan at nicira.com>

When I look at the implementation of handle_packet_out(), I don't see
much that requires ofproto_mutex (ofproto_check_ofpacts() does),
*unless* translation executes a learn action.

The same issue is going to come up for bundles, which also call into the
same xlate path from do_bundle_commit() while holding ofproto_mutex.

How do we fix this?

* We can't get rid of ofproto_mutex in do_bundle_commit(), or drop it
  temporarily around flow translation (i.e. the call to
  ofproto_packet_out_start()), because it might need to revert all the
  flow table changes and dropping the mutex would allow other threads to
  race in and make conflicting changes.

* We can't make flow translation require the lock unconditionally,
  because that would serialize flow translation, which would defeat the
  purpose of having multiple flow translation threads.

* We *can* change ofproto_mutex from an error-checking to recursive
  mutex.  Recursive mutexes are not very well respected, for a couple of
  good reasons, but they might be appropriate for this use case.

What do you think?


More information about the discuss mailing list