[ovs-discuss] ovs-master crashes due to double locking of ofproto_mutex.

Ben Pfaff blp at ovn.org
Wed Apr 24 00:33:15 UTC 2019


Thanks for the update.

On Tue, Apr 23, 2019 at 12:03:37PM +0530, Anil Kumar Koli wrote:
> Hi Ben,
> 
> Sorry for the late response. 
> Yes connection-tracking is an essential part of the problem.
> Based on my initial analysis, I also feel like changing ofproto_mutex from
> an error-checking to recursive mutex is the probable solution. As of now, I
> have not completed my analysis, I will update once I have finished my
> analysis. 
> 
> Thanks & Regards,
> Anil Kumar
> 
> -----Original Message-----
> From: Ben Pfaff <blp at ovn.org> 
> Sent: Tuesday, 16 April, 2019 04:30 AM
> To: Anil Kumar Koli <anilkumar.k at altencalsoftlabs.com>
> Cc: ovs-discuss at openvswitch.org
> Subject: Re: [ovs-discuss] ovs-master crashes due to double locking of
> ofproto_mutex.
> 
> On Wed, Apr 03, 2019 at 03:36:36PM +0530, Anil Kumar Koli via discuss wrote:
> > Hello OVS team,
> > 	
> >  
> > 
> > OVS crash is observed in ovs-master when controller sends a packet 
> > with packet-out (output port as OFPP_TABLE) and any of the openflow 
> > table entry which gets hit results into a learn action.
> > 
> >  
> > 
> > Steps to reproduce:
> > 
> > 1. Start the ovs-vswitchd in dpdk mode.
> > 
> > 2. Configure the following flows
> > 
> > ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50, 
> > ct_state=-trk,ip, in_port=10 actions=ct(table=0)"
> > 
> > ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50, 
> > ct_state=+trk,ip, in_port=10 actions=ct(commit),resubmit(,1)"
> > 
> > ovs-ofctl -OOpenflow13 add-flow br-int "table=1 
> > actions=learn(table=2,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IN
> > _PORT[ ]->NXM_NX_REG0[0..15], output:NXM_NX_REG0[0..15]),resubmit(,2)"
> > 
> > 3. Send a packet with output as OFPP_TABLE
> > 
> > ovs-ofctl -OOpenflow13 packet-out br-int 'in_port=10
> > packet=505400000007101111111111080045000028000000004006f97cc0a80001c0a
> > 800020 008000a0000000000000000500200002e7d0000, actions=TABLE'
> > 
> >  
> > 
> > This leads to a crash which is a case of double locking of 
> > oproto_mutex in handle_packet_out and ofproto_flow_mod_learn.
> 
> Strangely, I did not have luck reproducing this, even though you gave a good
> reproduction case, but I do recognize the problem.  (Your reproduction case
> involves connection-tracking, which complicates things.  Is
> connection-tracking an essential part of the problem?)
> 
> This use of ofproto_mutex was introduced in the initial commit that created
> ofproto_mutex, back in 2013:
> 
>     commit abe7b10f58c1d0d1c6c556a89aa6bd9223f56944
>     Author:	Ben Pfaff <blp at ovn.org>  Thu Sep 12 00:31:33 2013
>     Committer:	Ben Pfaff <blp at ovn.org>  Thu Sep 12 17:43:56 2013
> 
>     ofproto: Drop 'expirable_mutex' in favor of new global 'ofproto_mutex'.
> 
>     This is the first step toward making a global lock that protects
> everything
>     needed for updating the flow table.  This commit starts out by merging
> one
>     lock into the new one, and later commits will continue that process.
> 
>     The mutex is initially a recursive one, because I wasn't sure that there
>     were no nested acquisitions at this stage, but a later commit will
> change
>     it to a regular error-checking mutex once it's settled down a bit.
> 
>     Signed-off-by: Ben Pfaff <blp at nicira.com>
>     Acked-by: Ethan Jackson <ethan at nicira.com>
> 
> When I look at the implementation of handle_packet_out(), I don't see much
> that requires ofproto_mutex (ofproto_check_ofpacts() does),
> *unless* translation executes a learn action.
> 
> The same issue is going to come up for bundles, which also call into the
> same xlate path from do_bundle_commit() while holding ofproto_mutex.
> 
> How do we fix this?
> 
> * We can't get rid of ofproto_mutex in do_bundle_commit(), or drop it
>   temporarily around flow translation (i.e. the call to
>   ofproto_packet_out_start()), because it might need to revert all the
>   flow table changes and dropping the mutex would allow other threads to
>   race in and make conflicting changes.
> 
> * We can't make flow translation require the lock unconditionally,
>   because that would serialize flow translation, which would defeat the
>   purpose of having multiple flow translation threads.
> 
> * We *can* change ofproto_mutex from an error-checking to recursive
>   mutex.  Recursive mutexes are not very well respected, for a couple of
>   good reasons, but they might be appropriate for this use case.
> 
> What do you think?




More information about the discuss mailing list