[ovs-discuss] Tunneling to the same machine

Wed Jun 26 07:42:41 UTC 2013

On Tue, Jun 25, 2013 at 09:56:16PM -0700, Jesse Gross wrote:
> On Tue, Jun 25, 2013 at 8:41 PM, Isaku Yamahata <yamahata at valinux.co.jp> wrote:
> > Added Pravin to Cc as I think this also affects upstream kernel tunnel stuff
> > he works on.
> >
> > On Tue, Jun 25, 2013 at 03:28:59PM -0700, Jesse Gross wrote:
> >> On Fri, Jun 14, 2013 at 5:46 PM, Murphy McCauley
> >> <murphy.mccauley at gmail.com> wrote:
> >> >
> >> > On Jun 12, 2013, at 7:30 AM, Murphy McCauley wrote:
> >> >
> >> >> On Jun 10, 2013, at 1:41 PM, Jesse Gross wrote:
> >> >>
> >> >>> On Fri, Jun 7, 2013 at 7:07 PM, Murphy McCauley
> >> >>> <murphy.mccauley at gmail.com> wrote:
> >> >>>> So I'm doing something that's probably a bit strange and (perhaps unsurprisingly) getting results that seem a bit strange.
> >> >>>>
> >> >>>> What I've got is two bridges on a single machine, and I'd like to have GRE/VXLAN tunnels between them.  The reason for this is that while ultimately the controller code is meant to control bridges across multiple physical machines or VMs, I'd like to be able to test it in a single Mininet VM.  In this case, the ports attached to the bridges are all veth pairs which run into separate network namespaces, but I need the bridges to communicate with each other in the root namespace.  I realize that a more common way to link the bridges would be with patch interfaces, but that's not applicable to the configuration this will be in when running "for real", and moreover, it won't work with the NXM_NX_TUN_IPV4_DST approach I'm taking.
> >> >>>>
> >> >>>> .. which is all to say that it'd be nice if I could get this working.
> >> >>>>
> >> >>>> So what I'm doing is setting up two interfaces with IPs of, say, 172.16.0.1 and 172.16.0.2.  I'm then adding tunnels along the lines of:
> >> >>>> ovs-vsctl add-port s1 tun0 -- set interface tun0 type=gre options:remote_ip=172.16.0.1 options:local_ip=172.16.0.2
> >> >>>> ovs-vsctl add-port s2 tun1 -- set interface tun1 type=gre options:remote_ip=172.16.0.2 options:local_ip=172.16.0.1
> >> >>>>
> >> >>>> For very loose definitions of "works", this works.  If I try to ping across the tunnel, I get *one* successful ping.  Snooping the traffic, I see a successful ARP, the first echo request and reply, and then… lots of requests with no replies.  If I kill ping and try to ping again immediately, I get nothing.  If I kill ping and wait a while or try pinging another address -- it works.
> >> >>>>
> >> >>>> Investigating a bit further, I find there's something along the lines of a five or six second flow timeout at play here.  If I keep up activity, further packets never go through (neither ARP nor ICMP).  But after 5 (or 6?) seconds of silence, the whole thing is repeatable.  So a ping -i 6 will appear to work just fine.  It seems weird to me that the initial ARP and ping go through, but subsequent ones don't until the 5/6 seconds elapse, but there it is.
> >> >>>
> >> >>> This is likely because the packets that work are being sent up to
> >> >>> userspace as part of a flow setup. When they are sent back down, they
> >> >>> are essentially new packets, cleaned of previous metadata. Anything
> >> >>> that matches an existing flow will be carried through the kernel
> >> >>> directly.
> >> >>>
> >> >>> My guess is that there is some information from the sending IP stack
> >> >>> that is causing problems when it is received as a tunnel. You could
> >> >>> look through the skb since nothing immediately comes to mind (we
> >> >>> already clear out the obvious fields).
> >> >>
> >> >> Thanks, Jesse, this is exactly the direction-pointing I needed, and it was exactly right.  The problem is pkt_type.
> >> >>
> >> >> I believe what's happening is since the packets originally came in via promiscuous capture to the wrong MAC, the pkt_type is set to PACKET_OTHERHOST.  This is fine when they're just being shoved out another interface, but when actually trying to deliver them locally after all, I believe they're getting thrown away by ip_rcv() in net/ipv4/ip_input.c, which specifically checks for PACKET_OTHERHOST.
> >> >
> >> > So I've confirmed that this is what's happening in master (I haven't checked if Pravin Shelar's recent work on the tunneling may change things).
> >> >
> >> > My current solution is to switch to PACKET_HOST when rt_flags has RTCF_LOCAL set.  This makes my local setup work for both GRE and VXLAN and seems to me like it shouldn't change anything under other cases (and, indeed, didn't do so in my limited testing).
> >> >
> >> > Is there anything wrong with this?
> >>
> >> Sorry about the delay on this, we've been having some instability in
> >> master that needed to be addressed. I'm also CCing Isaku Yamahata
> >> since he recently proposed a similar patch. I had a couple of
> >> questions:
> >>  * On Murphy's patch, is it necessary to check for the preconditions
> >> or can we just always set the packet type?
> >>  * For both patches, would PACKET_OUTGOING be more appropriate than PACKET_HOST?
> >
> > To be honest I'm not sure what PACKET_OUTGOING means exactly and how
> > PACKET_OUTGOING is different from PACKET_HOST.
> > My observations are:
> > - Packets should behave in the same manner with or without a kernel
> >   datapath rule.
> >   (If we choose PACKET_OUTGOING, do_output() or the functions it calls
> >    would have to be changed somehow.)
> > - PACKET_HOST=0 is the default value set by alloc_skb(),
> >   so without setting skb->pkt_type explicitly, PACKET_HOST is used.
> >   In the kernel source tree, nothing explicitly sets PACKET_OUTGOING
> >   except dev_queue_xmit_nit() and decnet.
> > - Although ip_rcv() only checks whether pkt_type == PACKET_OTHERHOST,
> >   other IP functions (ip_forward(), icmp_send(), tcp_v4_rcv()...) check
> >   whether pkt_type != PACKET_HOST and drop the packet if so.
> >
> > Other ways I thought of are:
> > - modify the loopback device (loopback_xmit() in drivers/net/loopback.c)
> >   to clear pkt_type
> > - modify ovs_vport_receive() to clear pkt_type
> 
> Modifying the loopback device might be the cleanest and most generic.
> It already calls eth_type_trans(), which sets pkt_type for all values
> except for PACKET_HOST so initializing it to that seems correct in any
> case.

I sent out the patch to fix the loopback device to netdev. We will see
the result soon.

Although a future kernel will fix the issue, currently deployed kernels
will remain in use for a while, so it is desirable to work around the
issue somehow. Which way do you recommend, Jesse?
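
For reference, here is a rough userspace model of the options discussed
in this thread. It is only a sketch and not kernel code: struct
model_skb, the PKT_* names, and every function below are hypothetical
stand-ins. Only the numeric values follow PACKET_* in
<linux/if_packet.h>, and the two "accepts" checks simply restate the
behaviour described above for ip_rcv() and for the stricter IP paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Same numeric values as PACKET_* in <linux/if_packet.h>, redefined
 * here so the sketch builds without kernel headers. */
enum pkt_type {
    PKT_HOST      = 0, /* addressed to us; also the zeroed default */
    PKT_BROADCAST = 1,
    PKT_MULTICAST = 2,
    PKT_OTHERHOST = 3, /* seen promiscuously, addressed to another MAC */
    PKT_OUTGOING  = 4
};

/* Hypothetical stand-in for the two pieces of state that matter here. */
struct model_skb {
    enum pkt_type pkt_type;
    bool route_is_local; /* models RTCF_LOCAL being set in rt_flags */
};

/* Models the ip_rcv() check: only PACKET_OTHERHOST is dropped. */
bool model_ip_rcv_accepts(const struct model_skb *skb)
{
    return skb->pkt_type != PKT_OTHERHOST;
}

/* Models the stricter paths mentioned above, which drop anything
 * that is not PACKET_HOST. */
bool model_strict_accepts(const struct model_skb *skb)
{
    return skb->pkt_type == PKT_HOST;
}

/* Murphy's approach: switch to PACKET_HOST only for local routes. */
void fixup_if_local(struct model_skb *skb)
{
    if (skb->route_is_local)
        skb->pkt_type = PKT_HOST;
}

/* The unconditional variant (clearing pkt_type on receive/xmit). */
void fixup_always(struct model_skb *skb)
{
    skb->pkt_type = PKT_HOST;
}

/* Walks one packet through the failure and both candidate markings. */
int run_demo(void)
{
    struct model_skb skb = { PKT_OTHERHOST, true };

    assert(!model_ip_rcv_accepts(&skb));  /* the drop seen in the thread */

    skb.pkt_type = PKT_OUTGOING;
    assert(model_ip_rcv_accepts(&skb));   /* passes the ip_rcv() check... */
    assert(!model_strict_accepts(&skb));  /* ...but not the strict ones */

    skb.pkt_type = PKT_OTHERHOST;
    fixup_if_local(&skb);
    assert(model_strict_accepts(&skb));   /* PACKET_HOST passes both */
    return 0;
}
```

In this model, PACKET_OUTGOING gets past the ip_rcv()-style check but
not the stricter ones, which is why PACKET_HOST looks like the safer
value to me. Whether the change should be conditional or unconditional
is still the open question.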
-- 
yamahata