[ovs-dev] [PATCH] netdev-dummy: fix crash with more than one passive connection

Tue Jul 19 14:20:37 UTC 2016

Lance Richardson <lrichard at redhat.com> wrote on 07/19/2016 09:07:41 AM:

> From: Lance Richardson <lrichard at redhat.com>
> To: Ryan Moats/Omaha/IBM at IBMUS
> Cc: dev at openvswitch.org
> Date: 07/19/2016 09:07 AM
> Subject: Re: [ovs-dev] [PATCH] netdev-dummy: fix crash with more
> than one passive connection
>
> ----- Original Message -----
> > From: "Ryan Moats" <rmoats at us.ibm.com>
> > To: "Lance Richardson" <lrichard at redhat.com>
> > Cc: dev at openvswitch.org
> > Sent: Monday, July 18, 2016 11:36:27 PM
> > Subject: Re: [ovs-dev] [PATCH] netdev-dummy: fix crash with more
> than   one   passive connection
> >
> > "dev" <dev-bounces at openvswitch.org> wrote on 07/18/2016 11:29:28 AM:
> >
> > >
> > > > From: "Lance Richardson" <lrichard at redhat.com>
> > > > To: dev at openvswitch.org
> > > > Sent: Wednesday, July 6, 2016 7:39:52 PM
> > > > Subject: [ovs-dev] [PATCH] netdev-dummy: fix crash with more than
> > > one   passive connection
> > > >
> > > > Investigation found that Some of the occasional failures in the
> > > > "ovn -- vtep: 3 HVs, 1 VIFs/HV, 1 GW, 1 LS" test case are caused
> > > > by ovs-vswitchd crashing with SIGSEGV. It turns out that the
> > > > crash occurrs when the number of netdev-dummy passive connections
> > > > transitions from 1 to 2.  When xrealloc() copies the array of
> > > > dummy_packet_stream structures from the original buffer to a
> > > > newly allocated one, the struct ovs_list txq member of the
structure
> > > > becomes corrupt (e.g. if ovs_list_is_empty() would have returned
> > > > false before the copy, it will return true after the copy, which
> > > > will lead to a crash when the bogus packet buffer on the list is
> > > > dereferenced).
> > > >
> > > > Fix by taking a hint from David Wheeler and adding a level of
> > > > indirection.
> > > >
> > > > Signed-off-by: Lance Richardson <lrichard at redhat.com>
> >
> > [snip]
> >
> > >
> > > Here is a small script that reliably reproduces the crash in
> > ovs-vswitchd.
> > > I don't have an explanation for why we have two connections to the
same
> > > port in the "ovn -- vtep: 3 HVs, 1 VIFs/HV, 1 GW, 1 LS" test case
(this
> > > happens infrequently), perhaps it's something like the active connect
> > from
> > > a peer ovs-vswitchd being interrupted and re-tried.
> > >
> > > #!/bin/bash
> > >
> > > set -ex
> > >
> > > PWD=$(pwd)
> > > export PATH=${PWD}/vswitchd:${PWD}/utilities:${PWD}/ovsdb:$PATH
> > >
> > > ovs_setenv() {
> > >     export OVS_RUNDIR=${PWD}/$1
> > >     export OVS_LOGDIR=${PWD}/$1
> > >     export OVS_DBDIR=${PWD}/$1
> > >     export OVS_SYSCONFDIR=${PWD}/$1
> > >     export OVS_PKGDATADIR=${PWD}/$1
> > > }
> > >
> > > for x in repro_main repro_hv1 repro_hv2; do
> > >     mkdir -p $x
> > >     rm -f $x/*
> > >     ovs_setenv $x
> > >     : > $OVS_RUNDIR/.conf.db.~lock~
> > >     ovsdb-tool create $OVS_RUNDIR/conf.db vswitchd/vswitch.ovsschema
> > >     ovsdb-server --remote=punix:$OVS_RUNDIR/db.sock  -vconsole:off
> > > --detach --no-chdir --pidfile --log-file
> > >     ovs-vsctl --no-wait -- init
> > >     ovs-vswitchd --enable-dummy=system -vvconn -vofproto_dpif -
> > > vunixctl -vconsole:off --detach --no-chdir --pidfile --log-file
> > > done
> > >
> > > ovs_setenv repro_main
> > > ovs-vsctl add-br foo  \
> > >           -- add-port foo p1 \
> > >           -- set Interface p1
> > options:pstream="punix:$PWD/repro_main/p1.sock"
> > >
> > > ovs_setenv repro_hv1
> > > ovs-vsctl add-br br1 \
> > >           -- add-port br1 p1 \
> > >           -- set Interface p1
> > options:stream="unix:$PWD/repro_main/p1.sock"
> > >
> > > ovs_setenv repro_hv2
> > > ovs-vsctl add-br br1 \
> > >           -- add-port br1 p1 \
> > >           -- set Interface p1
> > options:stream="unix:$PWD/repro_main/p1.sock"
> >
> > I like this, but I think I'm going to be consistent and ask if you
> > can spin a new version with the above script as a unit test so that
> > we can see the crash before applying the rest of the patch and then
> > verify that it doesn't happen with the patch.
> >
> > Ryan
> >
>
> Hi Ryan,
>
> Thanks for the review.  I'm a little unclear on whether it's a good idea
> to add a test case that doesn't verify any functionality and is no longer
> useful as soon as it is merged.  I was rather hoping that the script
provided
> above would serve the "see the crash before applying" function.
>
> I'm fine with writing and posting such a test case if that's the policy,
> but I do wonder how many new test cases we'll have if one is written for
> every crash that has been fixed, and how effective this will be at
detecting
> future crash causes given the enormous number of ways code can be changed
> to produce a crash.
>
> On the other hand it probably would be a good idea to add test cases to
> verify netdev-dummy functionality (especially for things like multiple
> passive connection support that aren't already used in other test cases).
> This is probably a larger effort though.

My reasoning (and it is purely personal, I'm not a committer or anything :)
was that since we *know* there is a crash lurking there, a canary test
would be useful to make sure we don't regress and allow it to happen
again...

If others think that I'm being overly-paranoid, then I'll happily drop the
request.

Ryan