[ovs-dev] [PATCH V2] datapath: Prevent panic

Eric Garver e at erig.me
Fri Apr 20 12:39:49 UTC 2018


On Thu, Apr 19, 2018 at 08:07:33PM -0700, Gregory Rose wrote:
> On 4/19/2018 4:18 PM, Flavio Leitner wrote:
> > On Tue, Apr 17, 2018 at 12:34:08PM -0700, Greg Rose wrote:
> > > On RHEL 7.x kernels we observe a panic induced by a paging error
> > > when the timer kicks off a job that subsequently accesses memory
> > > that belonged to the openvswitch kernel module but was since
> > > unloaded - thus the paging error.
> > > 
> > > The panic can be induced on any RHEL 7.x kernel with the following test:
> > > 
> > > while `true`
> > > do
> > >      make check-kmod TESTSUITEFLAGS="-k \!gre"
> > > done
> > > 
> > > On the systems I've been testing on it generally takes anywhere from a
> > > minute to 15 minutes or so to repro but never longer than that.  Similar
> > > results have been seen by other testers.
> > > 
> > > This patch does not fix the underlying bug, which does need to be
> > > investigated and fixed, but it does prevent it from occurring. We
> > > would like to prevent customer systems from panicking while we do
> > > futher investigation to find the root cause.
> > This looks like related and it is fixed in kernel-3.10.0-842.el7 as
> > part of 7.5.

Agree. It seems likely to be the same issue. Sorry I missed this patch
being posted. Thanks far catching it Flavio!

> > 
> > Bug 1531680 - openvswitch: list_add corruption splat on running OVS
> > check-kernel tests
> > [...]
> > Running the OVS check-kernel on kernel-3.10.0-826.el7 results in a
> > splat and/or panic. I bisected this down to:
> > 
> > fc2302dde0d9 ("[net] openvswitch: Fix refcount leak on force commit")

This commit hash is wrong [0]. It's actually RHEL commit

    f37ed043ed24 ("[net] openvswitch: Add force commit")

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1531680#c2

> > 
> > I believe this is caused by a missing del_timer() when deleting the ct
> > in OVS. RHEL7 does not yet have the changes to remove the timers
> > (f330a7fdbe16 ("netfilter: conntrack: get rid of conntrack timer")).
> > The force commit feature was added upstream _after_ f330a7fdbe16.
> > 
> > And the RHEL-only patch is:
> > diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> > index 7582bf79033e..ef2a324e88b4 100644
> > --- a/net/openvswitch/conntrack.c
> > +++ b/net/openvswitch/conntrack.c
> > @@ -554,7 +554,8 @@  ovs_ct_expect_find(struct net *net, const struct
> > nf_conntrack_zone *zone,
> >   		if (h) {
> >   			struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
> > -			nf_ct_delete(ct, 0, 0);
> > +			if (del_timer(&ct->timeout))
> > +				nf_ct_delete(ct, 0, 0);
> >   			nf_conntrack_put(&ct->ct_general);
> >   		}
> >   	}
> > @@ -705,7 +706,8 @@  static bool skb_nfct_cached(struct net *net,
> >   		 * the reference.
> >   		 */
> >   		if (nf_ct_is_confirmed(ct))
> > -			nf_ct_delete(ct, 0, 0);
> > +			if (del_timer(&ct->timeout))
> > +				nf_ct_delete(ct, 0, 0);
> >   		nf_conntrack_put(&ct->ct_general);
> >   		nf_ct_set(skb, NULL, 0);
> > 
> > 
> > If that's the case, we should say thank Eric.
> > Thanks,
> > fbl
> 
> Fantastic, I'll test this and whip up a patch.
> 
> Thanks!
> 
> - Greg

I'll be on the lookout for it. Thanks.

[..]


More information about the dev mailing list