[ovs-dev] [PATCH 2/2] datapath: Add support for LISP tunneling

Lori Jakab lojakab at cisco.com
Wed Jan 23 17:20:59 UTC 2013


Hi Jarno,

Thanks for reviewing! Replies inline:

On 01/23/13 16:32, Jarno Rajahalme wrote:
> Please find my comments below,
> 
>   Jarno
> 
> On Jan 22, 2013, at 20:36 , ext Kyle Mestery wrote:
> 
>> From: Lorand Jakab <lojakab at cisco.com>
>>
> ...
>> +Flows on br0 are configured as follows:
>> +
>> +    priority=3,dl_dst=02:00:00:00:00:00,action=mod_dl_dst:<VMx_MAC>,NORMAL
>> +    priority=2,in_port=1,dl_type=0x0806,action=NORMAL
>> +    priority=1,in_port=1,dl_type=0x0800,vlan_tci=0,nw_src=<EID_prefix>,action=output:3
>> +    priority=0,action=NORMAL
> 
> I'll be referring to this example below.
> 
>> diff --git a/datapath/vport-lisp.c b/datapath/vport-lisp.c
>> new file mode 100644
>> index 0000000..558e5aa
>> --- /dev/null
>> +++ b/datapath/vport-lisp.c
>> @@ -0,0 +1,351 @@
>> +/*
>> + * Copyright (c) 2011 Nicira, Inc.
>> + * Copyright (c) 2013 Cisco Systems, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of version 2 of the GNU General Public
>> + * License as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful, but
>> + * WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
>> + * General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
>> + * 02110-1301, USA
>> + */
>> +
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/version.h>
>> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,26)
>> +
>> +#include <linux/in.h>
>> +#include <linux/ip.h>
>> +#include <linux/list.h>
>> +#include <linux/net.h>
>> +#include <linux/udp.h>
>> +
>> +#include <net/icmp.h>
>> +#include <net/ip.h>
>> +#include <net/udp.h>
>> +
>> +#include "datapath.h"
>> +#include "tunnel.h"
>> +#include "vport.h"
>> +
>> +#define LISP_DST_PORT 4341  /* Well known UDP port for LISP data packets. */
>> +
>> +struct lisp_net {
>> +	struct socket *lisp_rcv_socket;
>> +	int n_tunnels;
>> +};
>> +static struct lisp_net lisp_net;
> 
> How does this one shared global instance of lisp_net work with multiple networking
> namespaces?  As it is, all namespaces seem to be sharing the same socket?

Yes they are. Would it be better to use the same approach as the capwap
tunnel, which uses the struct ovs_net->vport_net to store struct capwap_net?

> 
>> +
>> +
>> +/*
>> + *  LISP encapsulation header:
>> + *
>> + *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> + *  |N|L|E|V|I|flags|            Nonce/Map-Version                  |
>> + *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> + *  |                 Instance ID/Locator Status Bits               |
>> + *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>> + *
>> + */
>> +
>> +/**
>> + * struct lisphdr - LISP header
>> + * @nonce_present: Flag indicating the presence of a 24 bit nonce value.
>> + * @lsb: Flag indicating the presence of Locator Status Bits (LSB).
>> + * @echo_nonce: Flag indicating the use of the echo noncing mechanism.
>> + * @map_version: Flag indicating the use of mapping versioning.
>> + * @instance_id: Flag indicating the presence of a 24 bit Instance ID (IID).
>> + * @rflags: 3 bits reserved for future flags.
>> + * @nonce: 24 bit nonce value.
>> + * @lsb_bits: 32 bit Locator Status Bits
>> + */
>> +struct lisphdr {
>> +#ifdef __LITTLE_ENDIAN_BITFIELD
>> +	__u8 rflags:3;
>> +	__u8 instance_id:1;
>> +	__u8 map_version:1;
>> +	__u8 echo_nonce:1;
>> +	__u8 lsb:1;
>> +	__u8 nonce_present:1;
>> +#else
>> +	__u8 nonce_present:1;
>> +	__u8 lsb:1;
>> +	__u8 echo_nonce:1;
>> +	__u8 map_version:1;
>> +	__u8 instance_id:1;
>> +	__u8 rflags:3;
>> +#endif
>> +	union {
>> +		__u8 nonce[3];
>> +		__u8 map_version[3];
>> +	} u1;
>> +	union {
>> +		__be32 lsb_bits;
>> +		__be32 iid;
>> +	} u2;
>> +};
> 
> The code below would be more readable if the flags names could not be
> confused with the value. E.g., "instance_id" could be called "have_instance_id",
> and "u2.iid" could be called "instance_id"?

Agreed, poor naming choice on my part. Will fix.

> 
>> +
>> +#define LISP_HLEN (sizeof(struct udphdr) + sizeof(struct lisphdr))
>> +
>> +static inline int lisp_hdr_len(const struct tnl_mutable_config *mutable,
>> +			       const struct ovs_key_ipv4_tunnel *tun_key)
>> +{
>> +	return LISP_HLEN;
>> +}
>> +
>> +static inline struct lisphdr *lisp_hdr(const struct sk_buff *skb)
>> +{
>> +	return (struct lisphdr *)(udp_hdr(skb) + 1);
>> +}
>> +
>> +/* Compute source port for outgoing packet.
>> + * Currently we use the flow hash.
>> + */
>> +static u16 get_src_port(struct sk_buff *skb)
>> +{
>> +	int low;
>> +	int high;
>> +	unsigned int range;
>> +	u32 hash = OVS_CB(skb)->flow->hash;
>> +
>> +	inet_get_local_port_range(&low, &high);
>> +	range = (high - low) + 1;
>> +	return (((u64) hash * range) >> 32) + low;
>> +}
>> +
>> +static struct sk_buff *lisp_pre_tunnel(const struct vport *vport,
>> +				       const struct tnl_mutable_config *mutable,
>> +				       struct sk_buff *skb)
>> +{
>> +	/* Pop off "inner" Ethernet header */
>> +	skb_pull(skb, ETH_HLEN);
>> +	return skb;
>> +}
> 
> Here it would be better to pull "skb_network_offset(skb)" instead of "ETH_HLEN".

Agreed.

> That way this would work correctly for packets that may have VLAN tags or
> MPLS labels. The skb->protocol may need to be updated in this case.
> If the above should not be allowed, then there should be a mechanism
> by which this could fail (e.g., return NULL, but see my comment on patch 1/2).
> 
> Maybe ARP packets should be filtered out as well?

How about using a whitelist instead? For the time being, the LISP
protocol only specifies the encapsulation of IPv4 and IPv6 packets,
although others may be added by a future revision of the specification.

> 
> Also, the reduction in the packet size will show up in the port stats. This may or
> may not be desired. Currently, tunnel output will not report the tunnel headers
> as bytes sent (see send_frags() in tunnel.c). In the same spirit, maybe the
> ETH_HLEN (or skb_network_offset()) should NOT be removed from the reported
> bytes, i.e., the stats reflect the bytes given for the vport for transport, not what the
> vport actually sends out.  This can be done with your own .send hook as well as
> you would define what it returns.

Ok, will do.

> 
>> +
>> +/* Returns the least-significant 32 bits of a __be64. */
>> +static __be32 be64_get_low32(__be64 x)
>> +{
>> +#ifdef __BIG_ENDIAN
>> +	return (__force __be32)x;
>> +#else
>> +	return (__force __be32)((__force u64)x >> 32);
>> +#endif
>> +}
>> +
>> +static struct sk_buff *lisp_build_header(const struct vport *vport,
>> +					 const struct tnl_mutable_config *mutable,
>> +					 struct dst_entry *dst,
>> +					 struct sk_buff *skb,
>> +					 int tunnel_hlen)
>> +{
>> +	struct udphdr *udph = udp_hdr(skb);
>> +	struct lisphdr *lisph = (struct lisphdr *)(udph + 1);
>> +	const struct ovs_key_ipv4_tunnel *tun_key = OVS_CB(skb)->tun_key;
>> +	__be64 out_key;
>> +	u32 flags;
>> +
>> +	tnl_get_param(mutable, tun_key, &flags, &out_key);
>> +
>> +	udph->dest = htons(LISP_DST_PORT);
>> +	udph->source = htons(get_src_port(skb));
>> +	udph->check = 0;
>> +	udph->len = htons(skb->len - skb_transport_offset(skb));
>> +
>> +	lisph->nonce_present = 1;   /* We add a nonce instead of map version */
>> +	lisph->lsb = 0;		    /* No reason to set LSBs, just one RLOC */
>> +	lisph->echo_nonce = 0;	    /* No echo noncing */
>> +	lisph->map_version = 0;	    /* No mapping versioning, nonce instead */
>> +	lisph->instance_id = 1;	    /* Store the tun_id as Instance ID  */
>> +	lisph->rflags = 1;	    /* Reserved flags, set to 0  */
>> +
>> +	lisph->u1.nonce[0] = net_random() & 0xFF;
>> +	lisph->u1.nonce[1] = net_random() & 0xFF;
>> +	lisph->u1.nonce[2] = net_random() & 0xFF;
>> +
>> +	lisph->u2.iid = htonl(be64_get_low32(tun_key->tun_id));
> 
> 
> This seems wrong. Isn't __be32 already in the network byte order?

Indeed. Will remove the htonl() call.

> 
>> +
>> +	/*
>> +	 * Allow our local IP stack to fragment the outer packet even if the
>> +	 * DF bit is set as a last resort.  We also need to force selection of
>> +	 * an IP ID here because Linux will otherwise leave it at 0 if the
>> +	 * packet originally had DF set.
>> +	 */
>> +	skb->local_df = 1;
>> +	__ip_select_ident(ip_hdr(skb), dst, 0);
>> +
>> +	return skb;
>> +}
>> +
>> +/* Called with rcu_read_lock and BH disabled. */
>> +static int lisp_rcv(struct sock *sk, struct sk_buff *skb)
>> +{
>> +	struct vport *vport;
>> +	struct lisphdr *lisph;
>> +	const struct tnl_mutable_config *mutable;
>> +	struct iphdr *iph;
>> +	struct ovs_key_ipv4_tunnel tun_key;
>> +	__be64 key;
>> +	u32 tunnel_flags = 0;
>> +	struct ethhdr *ethh;
>> +
>> +	if (unlikely(!pskb_may_pull(skb, LISP_HLEN)))
>> +		goto error;
>> +
>> +	lisph = lisp_hdr(skb);
>> +	if (unlikely(lisph->instance_id != 1))
>> +		goto error;
>> +
>> +	__skb_pull(skb, LISP_HLEN);
>> +	skb_postpull_rcsum(skb, skb_transport_header(skb), LISP_HLEN);
>> +
>> +	key = cpu_to_be64(ntohl(lisph->u2.iid));
> 
> You could add a be32_to_be64() and use that instead?

Sure, will do.

> 
>> +
>> +	iph = ip_hdr(skb);
>> +	vport = ovs_tnl_find_port(dev_net(skb->dev), iph->daddr, iph->saddr,
>> +				  key, TNL_T_PROTO_LISP, &mutable);
>> +	if (unlikely(!vport)) {
>> +		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
>> +		goto error;
>> +	}
>> +
>> +	if (mutable->flags & TNL_F_IN_KEY_MATCH || !mutable->key.daddr)
>> +		tunnel_flags = OVS_TNL_F_KEY;
>> +	else
>> +		key = 0;
>> +
>> +	/* Save outer tunnel values */
>> +	tnl_tun_key_init(&tun_key, iph, key, tunnel_flags);
>> +	OVS_CB(skb)->tun_key = &tun_key;
>> +
>> +	/* Add Ethernet header */
>> +	skb_push(skb, ETH_HLEN);
>> +
>> +	ethh = (struct ethhdr *)skb->data;
>> +	memset(ethh, 0, ETH_HLEN);
>> +	ethh->h_dest[0] = 0x02;
>> +	ethh->h_source[0] = 0x02;
>> +	ethh->h_proto = htons(ETH_P_IP);
> 
> What if the inner packet is IPv6?

Right, I only tested v4 for now, will fix.

> 
> Also, it might be nice if the added MAC addresses were configurable. This
> may help get rid of the top priority flow entry in your example in some cases?

Should it be done per vport, as an option like remote_ip? Or should we
wait until all bits and pieces are there for configuring null ports, and
add MAC address configurability only then?

> 
>> +
>> +	ovs_tnl_rcv(vport, skb);
>> +	goto out;
>> +
>> +error:
>> +	kfree_skb(skb);
>> +out:
>> +	return 0;
>> +}
>> +
>> +/* Arbitrary value.  Irrelevant as long as it's not 0 since we set the handler. */
>> +#define UDP_ENCAP_LISP 7
>> +static int lisp_socket_init(struct net *net)
>> +{
>> +	int err;
>> +	struct sockaddr_in sin;
>> +
>> +	if (lisp_net.n_tunnels) {
>> +		lisp_net.n_tunnels++;
>> +		return 0;
>> +	}
>> +
>> +	err = sock_create_kern(AF_INET, SOCK_DGRAM, 0,
>> +			       &lisp_net.lisp_rcv_socket);
>> +	if (err)
>> +		goto error;
>> +
>> +	/* release net ref. */
>> +	sk_change_net(lisp_net.lisp_rcv_socket->sk, net);
>> +
>> +	sin.sin_family = AF_INET;
>> +	sin.sin_addr.s_addr = htonl(INADDR_ANY);
>> +	sin.sin_port = htons(LISP_DST_PORT);
>> +
>> +	err = kernel_bind(lisp_net.lisp_rcv_socket,
>> +			  (struct sockaddr *)&sin,
>> +			  sizeof(struct sockaddr_in));
>> +	if (err)
>> +		goto error_sock;
>> +
>> +	udp_sk(lisp_net.lisp_rcv_socket->sk)->encap_type = UDP_ENCAP_LISP;
>> +	udp_sk(lisp_net.lisp_rcv_socket->sk)->encap_rcv = lisp_rcv;
>> +
>> +	udp_encap_enable();
>> +	lisp_net.n_tunnels++;
>> +
>> +	return 0;
>> +
>> +error_sock:
>> +	sk_release_kernel(lisp_net.lisp_rcv_socket->sk);
>> +error:
>> +	pr_warn("cannot register lisp protocol handler: %d\n", err);
>> +	return err;
>> +}
>> +
>> +static const struct tnl_ops ovs_lisp_tnl_ops = {
>> +	.tunnel_type	= TNL_T_PROTO_LISP,
>> +	.ipproto	= IPPROTO_UDP,
>> +	.hdr_len	= lisp_hdr_len,
>> +	.pre_tunnel	= lisp_pre_tunnel,
>> +	.build_header	= lisp_build_header,
>> +};
>> +
>> +static void release_socket(struct net *net)
>> +{
>> +	lisp_net.n_tunnels--;
>> +	if (lisp_net.n_tunnels)
>> +		return;
>> +
>> +	sk_release_kernel(lisp_net.lisp_rcv_socket->sk);
>> +}
>> +
>> +static void lisp_tnl_destroy(struct vport *vport)
>> +{
>> +	ovs_tnl_destroy(vport);
>> +	release_socket(ovs_dp_get_net(vport->dp));
>> +}
>> +
>> +static struct vport *lisp_tnl_create(const struct vport_parms *parms)
>> +{
>> +	int err;
>> +	struct vport *vport;
>> +
>> +	err = lisp_socket_init(ovs_dp_get_net(parms->dp));
>> +	if (err)
>> +		return ERR_PTR(err);
>> +
>> +	vport = ovs_tnl_create(parms, &ovs_lisp_vport_ops, &ovs_lisp_tnl_ops);
>> +	if (IS_ERR(vport))
>> +		release_socket(ovs_dp_get_net(parms->dp));
>> +
>> +	return vport;
>> +}
>> +
>> +static int lisp_tnl_init(void)
>> +{
>> +	lisp_net.n_tunnels = 0;
>> +	return 0;
>> +}
>> +
>> +const struct vport_ops ovs_lisp_vport_ops = {
>> +	.type		= OVS_VPORT_TYPE_LISP,
>> +	.flags		= VPORT_F_TUN_ID,
>> +	.init		= lisp_tnl_init,
>> +	.create		= lisp_tnl_create,
>> +	.destroy	= lisp_tnl_destroy,
>> +	.set_addr	= ovs_tnl_set_addr,
>> +	.get_name	= ovs_tnl_get_name,
>> +	.get_addr	= ovs_tnl_get_addr,
>> +	.get_options	= ovs_tnl_get_options,
>> +	.set_options	= ovs_tnl_set_options,
>> +	.send		= ovs_tnl_send,
> 
> This .send hook could be used to implement the "pre_tunnel" functionality I
> referred to in my comment in patch 1/2.

We missed this, thanks for pointing it out. Will use this in the next
version.

Thanks again,
-Lori

> 
>> +};
>> +#else
>> +#warning LISP tunneling will not be available on kernels before 2.6.26
>> +#endif /* Linux kernel < 2.6.26 */
>> diff --git a/datapath/vport.c b/datapath/vport.c
>> index a78ebfa..bf8c763 100644
>> --- a/datapath/vport.c
>> +++ b/datapath/vport.c
>> @@ -46,6 +46,7 @@ static const struct vport_ops *base_vport_ops_list[] = {
>> #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,26)
>> 	&ovs_capwap_vport_ops,
>> 	&ovs_vxlan_vport_ops,
>> +	&ovs_lisp_vport_ops,
>> #endif
>> };
>>
>> diff --git a/datapath/vport.h b/datapath/vport.h
>> index 91f8836..24f12ae 100644
>> --- a/datapath/vport.h
>> +++ b/datapath/vport.h
>> @@ -240,5 +240,6 @@ extern const struct vport_ops ovs_gre_ft_vport_ops;
>> extern const struct vport_ops ovs_gre64_vport_ops;
>> extern const struct vport_ops ovs_capwap_vport_ops;
>> extern const struct vport_ops ovs_vxlan_vport_ops;
>> +extern const struct vport_ops ovs_lisp_vport_ops;
>>
>> #endif /* vport.h */
>> diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
>> index f471fbc..b007b35 100644
>> --- a/include/linux/openvswitch.h
>> +++ b/include/linux/openvswitch.h
>> @@ -184,6 +184,7 @@ enum ovs_vport_type {
>> 	OVS_VPORT_TYPE_INTERNAL, /* network device implemented by datapath */
>> 	OVS_VPORT_TYPE_FT_GRE,	 /* Flow based GRE tunnel. */
>> 	OVS_VPORT_TYPE_VXLAN,    /* VXLAN tunnel */
>> +	OVS_VPORT_TYPE_LISP,     /* LISP tunnel */
>> 	OVS_VPORT_TYPE_PATCH = 100, /* virtual tunnel connecting two vports */
>> 	OVS_VPORT_TYPE_GRE,      /* GRE tunnel */
>> 	OVS_VPORT_TYPE_CAPWAP,   /* CAPWAP tunnel */
>> diff --git a/include/openflow/nicira-ext.h b/include/openflow/nicira-ext.h
>> index 91c96b3..f98ab89 100644
>> --- a/include/openflow/nicira-ext.h
>> +++ b/include/openflow/nicira-ext.h
>> @@ -1579,9 +1579,9 @@ OFP_ASSERT(sizeof(struct nx_action_output_reg) == 24);
>>
>> /* Tunnel ID.
>>  *
>> - * For a packet received via a GRE or VXLAN tunnel including a (32-bit) key, the
>> - * key is stored in the low 32-bits and the high bits are zeroed.  For other
>> - * packets, the value is 0.
>> + * For a packet received via a GRE, VXLAN or LISP tunnel including a (32-bit)
>> + * key, the key is stored in the low 32-bits and the high bits are zeroed.  For
>> + * other packets, the value is 0.
>>  *
>>  * All zero bits, for packets not received via a keyed tunnel.
>>  *
>> diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
>> index 60437b9..8020412 100644
>> --- a/lib/netdev-vport.c
>> +++ b/lib/netdev-vport.c
>> @@ -186,6 +186,9 @@ netdev_vport_get_netdev_type(const struct dpif_linux_vport *vport)
>>     case OVS_VPORT_TYPE_VXLAN:
>>         return "vxlan";
>>
>> +    case OVS_VPORT_TYPE_LISP:
>> +        return "lisp";
>> +
>>     case OVS_VPORT_TYPE_FT_GRE:
>>     case __OVS_VPORT_TYPE_MAX:
>>         break;
>> @@ -915,6 +918,7 @@ netdev_vport_register(void)
>>         TUNNEL_CLASS("ipsec_gre64", OVS_VPORT_TYPE_GRE64),
>>         TUNNEL_CLASS("capwap", OVS_VPORT_TYPE_CAPWAP),
>>         TUNNEL_CLASS("vxlan", OVS_VPORT_TYPE_VXLAN),
>> +        TUNNEL_CLASS("lisp", OVS_VPORT_TYPE_LISP),
>>
>>         { OVS_VPORT_TYPE_PATCH,
>>           { "patch", VPORT_FUNCTIONS(NULL, NULL) },
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index 18643c2..295e41c 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -1278,6 +1278,15 @@
>> 	    </p>
>>           </dd>
>>
>> +          <dt><code>lisp</code></dt>
>> +          <dd>
>> +            A layer 3 tunnel over the experimental, UDP-based LISP
>> +            protocol described at
>> +            <code>http://tools.ietf.org/html/draft-ietf-lisp-24</code>.
>> +            LISP is currently supported only with the Linux kernel datapath
>> +            with kernel version 2.6.26 or later.
>> +          </dd>
>> +
>>           <dt><code>patch</code></dt>
>>           <dd>
>>             A pair of virtual devices that act as a patch cable.
>> -- 
>> 1.7.11.7
>>
>> _______________________________________________
>> dev mailing list
>> dev at openvswitch.org
>> http://openvswitch.org/mailman/listinfo/dev
> 
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev
> 



More information about the dev mailing list