[ovs-dev] [PATCH v4] datapath: Add support for VXLAN tunnels to Open vSwitch
Jesse Gross
jesse at nicira.com
Tue Nov 27 23:13:09 UTC 2012
On Tue, Nov 27, 2012 at 12:19 PM, Kyle Mestery <kmestery at cisco.com> wrote:
> Note: v4 of this patch removes VXLAN over IPSEC support,
> per an offline conversation with Jesse.
>
> Add support for VXLAN tunnels to Open vSwitch. Add support
> for setting the destination UDP port on a per-port basis.
> This is done by adding a "dst_port" parameter to the port
> configuration. This is only applicable currently to VXLAN
> tunnels.
>
> Please note this currently does not implement any sort of multicast
> learning. With this patch, VXLAN tunnels must be configured similar
> to GRE tunnels (e.g. point to point). A subsequent patch will implement
> a VXLAN control plane in userspace to handle multicast learning.
>
> This patch set is based on one posted by Ben Pfaff on Oct. 12, 2011
> to the ovs-dev mailing list:
>
> http://openvswitch.org/pipermail/dev/2011-October/012051.html
>
> The patch has been maintained, updated, and freshened by me and a
> version of it is available at the following github repository:
>
> https://github.com/mestery/ovs-vxlan/tree/vxlan
>
> I've tested this patch with multiple VXLAN tunnels between hosts
> using different UDP port numbers. Performance is on par (though
> slightly faster) than comparable GRE tunnels.
>
> See the following IETF draft for additional information about VXLAN:
> http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
>
> Signed-off-by: Kyle Mestery <kmestery at cisco.com>
Thanks Kyle.
I got a warning from sparse:
CHECK /home/jgross/openvswitch/datapath/linux/vport-vxlan.c
/home/jgross/openvswitch/datapath/linux/vport-vxlan.c:372:6: warning:
symbol 'vxlan_tnl_destroy' was not declared. Should it be static
> diff --git a/NEWS b/NEWS
> index bb80beb..c343f3a 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -4,6 +4,8 @@ post-v1.9.0
>
> v1.9.0 - xx xxx xxxx
> --------------------
> + - New support for the VXLAN tunnel protocol (see the IETF draft here:
> + http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02).
I think we probably won't want to put any new features into 1.9 (it's
already been branched for stabilization for a while) so let's put it
into the post-1.9 category.
> diff --git a/datapath/vport-vxlan.c b/datapath/vport-vxlan.c
> new file mode 100644
> index 0000000..88e03d5
> --- /dev/null
> +++ b/datapath/vport-vxlan.c
> @@ -0,0 +1,459 @@
> + /*
> + * Copyright (c) 2011 Nicira, Inc.
> + * Copyright (c) 2012 Cisco Systems, Inc.
> + * Distributed under the terms of the GNU GPL version 2.
> + *
> + * Significant portions of this file may be copied from parts of the Linux
> + * kernel, by Linus Torvalds and others.
> + */
We've since switched to a more standard GPL header for the kernel
files (for example, in datapath.c). Can you make it match?
> +/* Default to the OTV port, per the VXLAN IETF draft. */
> +#define VXLAN_DST_PORT 8472
I'd prefer if we pushed the default to userspace and not have the
kernel hardcode a particular port. In other words, userspace would
always request a particular port (using this default if none was
specified in the database) and have the kernel reject it if none was
specified.
> +static struct hlist_head *vxlan_hash_bucket(struct net *net, u16 port)
> +{
> + unsigned int hash = jhash(&port, sizeof(port), (unsigned long) net);
Since it's just one item, you can use jhash_1word() here.
> +/* The below used as the min/max for the UDP port range */
> +#define VXLAN_SRC_PORT_MIN 32768
> +#define VXLAN_SRC_PORT_MAX 61000
Rather than statically defining these we can use
inet_get_local_port_range() to get the ephemeral range.
> +/* Compute source port for outgoing packet.
> + * Currently we use the flow hash.
> + */
> +static u16 get_src_port(struct sk_buff *skb)
> +{
> + unsigned int range = (VXLAN_SRC_PORT_MAX - VXLAN_SRC_PORT_MIN) + 1;
> + u32 hash = OVS_CB(skb)->flow->hash;
> +
> + return (__force u16)(((u64) hash * range) >> 32) + VXLAN_SRC_PORT_MIN;
Do we actually need the __force here? It's a sparse construct and all
of these look like normal C operations.
> +static struct sk_buff *vxlan_build_header(const struct vport *vport,
> + const struct tnl_mutable_config *mutable,
> + struct dst_entry *dst,
> + struct sk_buff *skb,
> + int tunnel_hlen)
> +{
> + struct udphdr *udph = udp_hdr(skb);
> + struct vxlanhdr *vxh = (struct vxlanhdr *)(udph + 1);
> + const struct ovs_key_ipv4_tunnel *tun_key = OVS_CB(skb)->tun_key;
> + __be64 out_key;
> +
> + if (tun_key->ipv4_dst)
> + out_key = tun_key->tun_id;
> + else
> + out_key = mutable->out_key;
I don't think this works in the case of port-based tunneling with flow
keys. I would just use tnl_get_param() to make everything consistent.
> +/* Called with rcu_read_lock and BH disabled. */
> +static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
> +{
> + struct vport *vport;
> + struct vxlanhdr *vxh;
> + const struct tnl_mutable_config *mutable;
> + struct iphdr *iph;
> + struct ovs_key_ipv4_tunnel tun_key;
> + int tunnel_type;
> + __be64 key;
> + u32 tunnel_flags = 0;
> +
> + if (unlikely(!pskb_may_pull(skb, VXLAN_HLEN + ETH_HLEN)))
> + goto error;
> +
> + vxh = vxlan_hdr(skb);
> + if (unlikely(vxh->vx_flags != htonl(VXLAN_FLAGS) ||
> + vxh->vx_vni & htonl(0xff)))
> + goto error;
> +
> + __skb_pull(skb, VXLAN_HLEN);
> + skb_postpull_rcsum(skb, skb_transport_header(skb), VXLAN_HLEN + ETH_HLEN);
> +
> + key = cpu_to_be64(ntohl(vxh->vx_vni) >> 8);
> +
> + tunnel_type = TNL_T_PROTO_VXLAN;
I'm not sure that this actually needs to be stored in a variable since
it is only used once.
> +static int vxlan_socket_init(struct vxlan_port *vxlan_port)
> +{
> + int err;
> + struct sockaddr_in sin;
> +
> + err = sock_create(AF_INET, SOCK_DGRAM, 0, &vxlan_port->vxlan_rcv_socket);
> + if (err)
> + goto error;
It doesn't look like this handles namespaces. I think we need to use
sock_create_kern(0 and sk_change_net() like in CAPWAP.
> +static int vxlan_tunnel_setup(struct net *net, const char *linkname,
> + struct nlattr *options)
> +{
> + struct nlattr *a[OVS_TUNNEL_ATTR_MAX + 1];
> + int err;
> + u16 dst_port;
> + struct vxlan_port *vxlan_port;
> + struct vxlan_if *vxlan_if;
> +
> + if (!options) {
> + err = -EINVAL;
> + goto out;
> + }
> +
> + err = nla_parse_nested(a, OVS_TUNNEL_ATTR_MAX, options, vxlan_policy);
> + if (err)
> + goto out;
You could use nla_find_nested(), it seems to match what you are
looking for here.
> +
> + if (a[OVS_TUNNEL_ATTR_DST_PORT])
> + dst_port = nla_get_u16(a[OVS_TUNNEL_ATTR_DST_PORT]);
> + else
> + dst_port = VXLAN_DST_PORT;
> +
> + /* Verify if we already have a socket created for this port */
> + vxlan_port = vxlan_port_exists(net, dst_port);
> + if (vxlan_port) {
> + vxlan_port->count++;
> + err = 0;
> + goto out;
> + }
> +
> + /* Add a new socket for this port */
> + vxlan_port = kmalloc(sizeof(struct vxlan_port), GFP_KERNEL);
> + if (!vxlan_port) {
> + err = -ENOMEM;
> + goto out;
> + }
> + memset (vxlan_port, 0, sizeof(struct vxlan_port));
You could combine the kmalloc() and memset() using kzalloc() instead.
> + vxlan_port->port = dst_port;
> + vxlan_port->count++;
I think this would be clearer if you initialized it to 1 (since I
think this will always be the first user).
> +static int vxlan_set_options(struct vport *vport, struct nlattr *options)
> +{
> + int err;
> + const char *vname = vport->ops->get_name(vport);
> +
> + err = vxlan_tunnel_setup(ovs_dp_get_net(vport->dp), vname, options);
> + if (err)
> + goto out;
If the dest port is changed and the refcount drops to zero, shouldn't
we remove that socket? I think the best approach may be to add the
new socket, call set_options, and then drop the old socket.
> +void vxlan_tnl_destroy(struct vport *vport)
> +{
> + struct vxlan_if *vxlan_if;
> + struct vxlan_port *vxlan_port;
> + const char *vname = vport->ops->get_name(vport);
> +
> + vxlan_if = vxlan_if_by_name(ovs_dp_get_net(vport->dp), vname);
> + if (!vxlan_if)
> + goto out;
I think we can use tnl_vport_priv() convert the vport to a tnl_vport
and pull the dst_port out of the mutable directly. Then we wouldn't
have to bother with the if_name hash at all.
> +static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
> +{
> + int err;
> +
> + err = vxlan_tunnel_setup(ovs_dp_get_net(parms->dp), parms->name,
> + parms->options);
> + return ovs_tnl_create(parms, &ovs_vxlan_vport_ops, &ovs_vxlan_tnl_ops);
If ovs_tnl_create() fails won't this result in an incorrect refcount?
> +static int vxlan_init(void)
> +{
[...]
> + vxlan_ports = kzalloc(VXLAN_SOCK_HASH_BUCKETS * sizeof(struct hlist_head),
> + GFP_KERNEL);
We probably could get away with just making this fairly small and then
statically allocating it. Realistically, even just a linked list
would be fine.
> diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
> index e7d4b49..2ae5681 100644
> --- a/include/linux/openvswitch.h
> +++ b/include/linux/openvswitch.h
> @@ -186,6 +186,7 @@ enum ovs_vport_type {
> OVS_VPORT_TYPE_PATCH = 100, /* virtual tunnel connecting two vports */
> OVS_VPORT_TYPE_GRE, /* GRE tunnel */
> OVS_VPORT_TYPE_CAPWAP, /* CAPWAP tunnel */
> + OVS_VPORT_TYPE_VXLAN, /* VXLAN tunnel */
Since the plan is to get this upstream, let's put it after FT_GRE.
That way we can avoid having to do any switchover.
> diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
> index 5171171..3dbb798 100644
> --- a/lib/netdev-vport.c
> +++ b/lib/netdev-vport.c
> @@ -173,6 +173,13 @@ netdev_vport_get_netdev_type(const struct dpif_linux_vport *vport)
> case OVS_VPORT_TYPE_CAPWAP:
> return "capwap";
>
> + case OVS_VPORT_TYPE_VXLAN:
> + if (tnl_port_config_from_nlattr(vport->options, vport->options_len,
> + a)) {
> + break;
> + }
We only need to retrieve the attributes in order to check if IPsec is
set, so we should be able to drop the if block here.
> @@ -584,6 +591,7 @@ parse_tunnel_config(const char *name, const char *type,
> const struct smap *args, struct ofpbuf *options)
> {
> bool is_gre = false;
> + bool is_vxlan = false;
I think it's probably better to move to more generic names like
"has_dst_port" rather than directly tying it to the protocol (the
effect would still be the same though).
> @@ -835,6 +848,11 @@ unparse_tunnel_config(const char *name OVS_UNUSED, const char *type OVS_UNUSED,
> smap_add_format(args, "tos", "0x%x", tos);
> }
>
> + if (a[OVS_TUNNEL_ATTR_DST_PORT]) {
> + ovs_be16 dst_port = nl_attr_get_be16(a[OVS_TUNNEL_ATTR_DST_PORT]);
I believe that everywhere else the type for dst_port is a u16 instead of be16.
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index c2786a5..65ead8c 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -1393,9 +1412,10 @@
> deprecated and will be removed soon.
> </column>
>
> - <group title="Tunnel Options: gre only">
> + <group title="Tunnel Options: gre and vxlan only">
> <p>
> - Only <code>gre</code> interfaces support these options.
> + Only <code>gre</code> and <code>vxlan</code> interfaces support these
> + options.
> </p>
> </group>
I think we can just delete this empty group entirely now.
> @@ -1427,11 +1447,19 @@
> </column>
> </group>
>
> - <group title="Tunnel Options: ipsec_gre only">
> + <group title="Tunnel Options: ipsec_gre and ipsec_vxlan only">
> <p>
> - Only <code>ipsec_gre</code> interfaces support these options.
> + Only <code>ipsec_gre</code> and <code>ipsec_vxlan</code> interfaces
> + support these options.
> </p>
>
> + <p>
> + These options are implemented through a separate daemon named
> + <code>ovs-monitor-ipsec</code> that so far has only been ported to
> + and packaged for Debian (including derivative distributions such as
> + Ubuntu).
> + </p>
Since IPsec has been removed, I believe these changes are no longer relevant.
More information about the dev
mailing list