[ovs-dev] [PATCH v3 0/5] create tunnel devices using rtnetlink interface

Jesse Gross jesse at kernel.org
Fri Jul 1 19:34:43 UTC 2016


On Fri, Jul 1, 2016 at 10:53 AM, Thadeu Lima de Souza Cascardo
<cascardo at redhat.com> wrote:
> On Thu, Jun 30, 2016 at 03:44:37PM -0700, Jesse Gross wrote:
>> On Thu, Jun 30, 2016 at 12:25 PM, Thadeu Lima de Souza Cascardo
>> <cascardo at redhat.com> wrote:
>> > On Wed, Jun 29, 2016 at 09:38:00PM -0700, Jesse Gross wrote:
>> >> On Wed, Jun 22, 2016 at 8:47 AM, Thadeu Lima de Souza Cascardo
>> >> <cascardo at redhat.com> wrote:
>> >> > This series adds support for the creation of tunnels using the rtnetlink
>> >> > interface. This will open the possibility for new features and flags on those
>> >> > vports without the need to change vport compatibility code.
>> >> >
>> >> > Support for STT and LISP has not been added because these are not upstream yet,
>> >> > so we don't know what the interface will look like upstream. And there are no
>> >> > features in the current drivers right now that we could make use of.
>> >>
>> >> I noticed some interesting test failures while looking at this series.
>> >> Some of them are existing but they appear to be related to the earlier
>> >> conversation that we had with the tunnel routing code.
>> >>
>> >> The background is that I normally run unit tests on my development
>> >> machine and real tests with traffic on a different machine. In that
>> >> case the problem does not appear. The issue came up when I happened to
>> >> run the unit tests on a machine with a running configuration of OVS.
>> >> In that case, it seems that something picks up the existing tunnel
>> >> kernel netdev, opens it, and then uses that for the unit tests (which
>> >> should be purely in userspace and theoretically not affected).
>> >>
>> >> Here is what I get in the above scenario (some other related tests
>> >> appear to fail intermittently as well):
>> >> tunnel_push_pop
>> >>
>> >> 750: tunnel_push_pop - action                        FAILED
>> >> (tunnel-push-pop.at:154)
>> >> 751: tunnel_push_pop - packet_out                    FAILED
>> >> (tunnel-push-pop.at:201)
>> >>
>> >> tunnel_push_pop_ipv6
>> >>
>> >> 752: tunnel_push_pop_ipv6 - action                   FAILED
>> >> (tunnel-push-pop-ipv6.at:149)
>> >>
>> >> This is presumably caused by the wrong type being used. Wouldn't
>> >> problems like this be avoided by using the other mechanism that you
>> >> proposed, whereby we identify the type on netdev_open() and use that?
>> >> I guess it doesn't seem like that big of a change to me and looks to
>> >> be more robust (or at least easier to diagnose if things go wrong).
>> >>
>> >
>> > What are the testsuite logs? If I have a vxlan_sys_4789 on my system, I can make
>> > at least 750 and 752 fail immediately with:
>> >
>> > 2016-06-30T18:51:07.978Z|00183|netdev_linux|ERR|failed to create raw socket (Operation not permitted)
>> > 2016-06-30T18:51:07.978Z|00184|dpif_netdev|ERR|vxlan_sys_4789: cannot receive packets on this network device (Operation not permitted)
>> >
>> > That is the same for almost every test that creates a bridge, if I have a br0 on
>> > my system. And this is pretty much related to this problem, of having the route
>> > table open the device. But setting vxlan_sys_* interfaces type to vxlan won't
>> > fix the br0 problem.
>> >
>> > One of the problems here is that the route table doesn't care if the port is in
>> > the database or not, and which type it has in the database. The userspace
>> > tunneling code, however, will only output to a bridge which must have been
>> > configured from the database, and I see no other current user of tnl ports and
>> > the routing table. So, maybe we need to ignore every route and stop listening to
>> > devices we won't get any packets from anyway? But I don't want this bug to block
>> > this patchset.
>>
>> I think this problem is too closely related to this patchset to gloss
>> over and it affects the course of action that we take. On the previous
>> version of the series, you gave multiple choices of how to solve the
>> problem related to restarting ovs-vswitchd and I asked that you look
>> at the version that ensures the open netdev is the right type. We now
>> have an example of an issue that comes up when types are not correct,
>> so I'm asking again that you look at the other solutions to patch #1.
>> Even if it is preexisting, I don't want to continue to pile more
>> complexity and workarounds on top.
>>
>
> OK, so let's make sure it is preexisting and that we are talking about the same
> issue here. I want to fix this, but I would rather make sure any fix I come up
> with does not fail to handle the issue you saw.
>
> Can you recover the logs from that test, or can you reproduce it and check this
> is the same problem?

I attached two test suite logs that seem to be representative of the
problem. I'm pretty sure that it's basically the same issue as you are
seeing (though the actual error message is different - this was run
with master + your patches); I just didn't see it before due to the
way I usually run the tests on different systems. The problems go away
if I delete the tunnel backing interface. In at least one of the logs
you can clearly see that the unit tests are trying to access the
Geneve device as if it were a system device.

Thanks for taking a look at these two issues. I know that you are
eager to get this series in but I do think that it's important that we
take care to avoid potential issues down the road.


