[ovs-discuss] null ptr exception in ovs_vport_get_stats+0x6a/0x130 [openvswitch]

Tue Jan 5 04:30:01 UTC 2016

On Mon, Jan 4, 2016 at 8:47 PM, Jesse Gross <jesse at kernel.org> wrote:

> On Mon, Jan 4, 2016 at 1:41 PM, Flavio Fernandes <ffernand at redhat.com>
> wrote:
> > So, I'm a happy camper, but can't help but worry a little about the
> > fragility of the system when one attempts to use a port type internal
> > 'directly' as bridged. The fix I have in mind is relatively simple: add
> > a check in internal_dev_get_stats to gracefully handle cases when
> > ovs_internal_dev_get_vport returns null. Too simple?
>
> I don't think that the problem is simply that we are returning NULL
> from ovs_internal_dev_get_vport(). ovs_internal_dev_get_vport() should
> never return NULL to internal_dev_get_stats() because it is checking
> whether the device has a ops structure that is equal to the one that
> leads to internal_dev_get_stats(). And in fact, if you look at the
> full stack trace, the address being dereferenced is 0x0000000000000060
> rather than 0x0 from a real NULL.
>

ack. If ovs_internal_dev_get_vport() is not returning NULL, then this is
not as simple as I had interpreted it. My thinking was that 0x60 is the
offset of

        &vport->err_stats.rx_errors

from line 306 in
http://lxr.oss.org.cn/source/net/openvswitch/vport.c#L306
but you may be right: if vport was not NULL, then the issue is in
what ovs_internal_dev_get_vport() is returning.
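
To make the 0x60 arithmetic concrete, here is a minimal userspace sketch of
what I had in mind. Everything in it is an assumption for illustration:
struct vport_model is a hypothetical stand-in (not the kernel's struct vport),
and it is padded by hand so that err_stats lands at offset 0x60. The real
offset depends on the kernel version and config.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the error counters; not the kernel definition. */
struct err_stats_model {
    uint64_t rx_errors;
    uint64_t tx_errors;
};

/* Hypothetical stand-in for struct vport. The padding size is an assumption
 * chosen so that err_stats sits at offset 0x60 in this model. */
struct vport_model {
    unsigned char other_fields[0x60];
    struct err_stats_model err_stats;
};

/* Compile-time check that the model really has the layout described above. */
static_assert(offsetof(struct vport_model, err_stats) == 0x60,
              "err_stats must sit at offset 0x60 in this model");

/* Address that would be touched when reading base->err_stats.rx_errors,
 * given the numeric value of the base pointer. */
static uintptr_t rx_errors_fault_address(uintptr_t base)
{
    return base
         + offsetof(struct vport_model, err_stats)
         + offsetof(struct err_stats_model, rx_errors);
}
```

With this model, rx_errors_fault_address(0) is 0x60, which is the faulting
address one would see if the vport pointer were NULL. The same address alone
cannot tell a NULL vport apart from some other bad pointer reached a different
way, so it does not settle which of the two readings is right.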

> This looks like something is overwriting the vport pointer in the
> device structure. If you follow where this is coming from you'll wind
> up at ovs_netdev_get_vport() which is a maze of twisty conditions that
> depend on what kernel version you are using. Particularly on the RHEL
> kernels (which based on your email address I'm guessing you're using),
> the pointer is stashed in a variety of places. My guess is that these
> are not entirely safe in some conditions - likely related to tap
> devices based on your other description. I think the best path forward
> is to try to see which of the conditions your kernel version falls
> into and try to see what might be stomping on the pointer.
>

I see. So it could be that I'm looking at the wrong source code. I am
using the CentOS 7.2 kernel (3.10.0-327.3.1.el7.x86_64 x86_64); I will
find out more about how that differs from the upstream kernel.

THANKS Jesse!

-- flaviof