[ovs-discuss] ovs-master crashes due to double locking of ofproto_mutex.

Anil Kumar Koli anilkumar.k at altencalsoftlabs.com
Tue Apr 23 06:33:37 UTC 2019


Hi Ben,

Sorry for the late response. 
Yes connection-tracking is an essential part of the problem.
Based on my initial analysis, I also feel like changing ofproto_mutex from
an error-checking to recursive mutex is the probable solution. As of now, I
have not completed my analysis, I will update once I have finished my
analysis. 

Thanks & Regards,
Anil Kumar

-----Original Message-----
From: Ben Pfaff <blp at ovn.org> 
Sent: Tuesday, 16 April, 2019 04:30 AM
To: Anil Kumar Koli <anilkumar.k at altencalsoftlabs.com>
Cc: ovs-discuss at openvswitch.org
Subject: Re: [ovs-discuss] ovs-master crashes due to double locking of
ofproto_mutex.

On Wed, Apr 03, 2019 at 03:36:36PM +0530, Anil Kumar Koli via discuss wrote:
> Hello OVS team,
> 	
>  
> 
> OVS crash is observed in ovs-master when controller sends a packet 
> with packet-out (output port as OFPP_TABLE) and any of the openflow 
> table entry which gets hit results into a learn action.
> 
>  
> 
> Steps to reproduce:
> 
> 1. Start the ovs-vswitchd in dpdk mode.
> 
> 2. Configure the following flows
> 
> ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50, 
> ct_state=-trk,ip, in_port=10 actions=ct(table=0)"
> 
> ovs-ofctl -OOpenflow13 add-flow br-int "table=0, priority=50, 
> ct_state=+trk,ip, in_port=10 actions=ct(commit),resubmit(,1)"
> 
> ovs-ofctl -OOpenflow13 add-flow br-int "table=1 
> actions=learn(table=2,NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:NXM_OF_IN
> _PORT[ ]->NXM_NX_REG0[0..15], output:NXM_NX_REG0[0..15]),resubmit(,2)"
> 
> 3. Send a packet with output as OFPP_TABLE
> 
> ovs-ofctl -OOpenflow13 packet-out br-int 'in_port=10
> packet=505400000007101111111111080045000028000000004006f97cc0a80001c0a
> 800020 008000a0000000000000000500200002e7d0000, actions=TABLE'
> 
>  
> 
> This leads to a crash which is a case of double locking of 
> oproto_mutex in handle_packet_out and ofproto_flow_mod_learn.

Strangely, I did not have luck reproducing this, even though you gave a good
reproduction case, but I do recognize the problem.  (Your reproduction case
involves connection-tracking, which complicates things.  Is
connection-tracking an essential part of the problem?)

This use of ofproto_mutex was introduced in the initial commit that created
ofproto_mutex, back in 2013:

    commit abe7b10f58c1d0d1c6c556a89aa6bd9223f56944
    Author:	Ben Pfaff <blp at ovn.org>  Thu Sep 12 00:31:33 2013
    Committer:	Ben Pfaff <blp at ovn.org>  Thu Sep 12 17:43:56 2013

    ofproto: Drop 'expirable_mutex' in favor of new global 'ofproto_mutex'.

    This is the first step toward making a global lock that protects
everything
    needed for updating the flow table.  This commit starts out by merging
one
    lock into the new one, and later commits will continue that process.

    The mutex is initially a recursive one, because I wasn't sure that there
    were no nested acquisitions at this stage, but a later commit will
change
    it to a regular error-checking mutex once it's settled down a bit.

    Signed-off-by: Ben Pfaff <blp at nicira.com>
    Acked-by: Ethan Jackson <ethan at nicira.com>

When I look at the implementation of handle_packet_out(), I don't see much
that requires ofproto_mutex (ofproto_check_ofpacts() does),
*unless* translation executes a learn action.

The same issue is going to come up for bundles, which also call into the
same xlate path from do_bundle_commit() while holding ofproto_mutex.

How do we fix this?

* We can't get rid of ofproto_mutex in do_bundle_commit(), or drop it
  temporarily around flow translation (i.e. the call to
  ofproto_packet_out_start()), because it might need to revert all the
  flow table changes and dropping the mutex would allow other threads to
  race in and make conflicting changes.

* We can't make flow translation require the lock unconditionally,
  because that would serialize flow translation, which would defeat the
  purpose of having multiple flow translation threads.

* We *can* change ofproto_mutex from an error-checking to recursive
  mutex.  Recursive mutexes are not very well respected, for a couple of
  good reasons, but they might be appropriate for this use case.

What do you think?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 5540 bytes
Desc: not available
URL: <http://mail.openvswitch.org/pipermail/ovs-discuss/attachments/20190423/b062f67b/attachment.p7s>


More information about the discuss mailing list