[ovs-dev] [PATCH v4] datapath: Add support for VXLAN tunnels to Open vSwitch

Tue Nov 27 23:13:09 UTC 2012

On Tue, Nov 27, 2012 at 12:19 PM, Kyle Mestery <kmestery at cisco.com> wrote:
> Note: v4 of this patch removes VXLAN over IPSEC support,
> per an offline conversation with Jesse.
>
> Add support for VXLAN tunnels to Open vSwitch. Add support
> for setting the destination UDP port on a per-port basis.
> This is done by adding a "dst_port" parameter to the port
> configuration. This is only applicable currently to VXLAN
> tunnels.
>
> Please note this currently does not implement any sort of multicast
> learning. With this patch, VXLAN tunnels must be configured similar
> to GRE tunnels (e.g. point to point). A subsequent patch will implement
> a VXLAN control plane in userspace to handle multicast learning.
>
> This patch set is based on one posted by Ben Pfaff on Oct. 12, 2011
> to the ovs-dev mailing list:
>
> http://openvswitch.org/pipermail/dev/2011-October/012051.html
>
> The patch has been maintained, updated, and freshened by me and a
> version of it is available at the following github repository:
>
> https://github.com/mestery/ovs-vxlan/tree/vxlan
>
> I've tested this patch with multiple VXLAN tunnels between hosts
> using different UDP port numbers. Performance is on par (though
> slightly faster) than comparable GRE tunnels.
>
> See the following IETF draft for additional information about VXLAN:
> http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
>
> Signed-off-by: Kyle Mestery <kmestery at cisco.com>

Thanks Kyle.

I got a warning from sparse:
  CHECK   /home/jgross/openvswitch/datapath/linux/vport-vxlan.c
/home/jgross/openvswitch/datapath/linux/vport-vxlan.c:372:6: warning:
symbol 'vxlan_tnl_destroy' was not declared. Should it be static

> diff --git a/NEWS b/NEWS
> index bb80beb..c343f3a 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -4,6 +4,8 @@ post-v1.9.0
>
>  v1.9.0 - xx xxx xxxx
>  --------------------
> +    - New support for the VXLAN tunnel protocol (see the IETF draft here:
> +      http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02).

I think we probably won't want to put any new features into 1.9 (it's
already been branched for stabilization for a while) so let's put it
into the post-1.9 category.

> diff --git a/datapath/vport-vxlan.c b/datapath/vport-vxlan.c
> new file mode 100644
> index 0000000..88e03d5
> --- /dev/null
> +++ b/datapath/vport-vxlan.c
> @@ -0,0 +1,459 @@
> + /*
> + * Copyright (c) 2011 Nicira, Inc.
> + * Copyright (c) 2012 Cisco Systems, Inc.
> + * Distributed under the terms of the GNU GPL version 2.
> + *
> + * Significant portions of this file may be copied from parts of the Linux
> + * kernel, by Linus Torvalds and others.
> + */

We've since switched to a more standard GPL header for the kernel
files (for example, in datapath.c).  Can you make it match?

> +/* Default to the OTV port, per the VXLAN IETF draft. */
> +#define VXLAN_DST_PORT 8472

I'd prefer if we pushed the default to userspace and not have the
kernel hardcode a particular port.  In other words, userspace would
always request a particular port (using this default if none was
specified in the database) and have the kernel reject it if none was
specified.

> +static struct hlist_head *vxlan_hash_bucket(struct net *net, u16 port)
> +{
> +       unsigned int hash = jhash(&port, sizeof(port), (unsigned long) net);

Since it's just one item, you can use jhash_1word() here.

> +/* The below used as the min/max for the UDP port range */
> +#define VXLAN_SRC_PORT_MIN      32768
> +#define VXLAN_SRC_PORT_MAX      61000

Rather than statically defining these we can use
inet_get_local_port_range() to get the ephemeral range.

> +/* Compute source port for outgoing packet.
> + * Currently we use the flow hash.
> + */
> +static u16 get_src_port(struct sk_buff *skb)
> +{
> +       unsigned int range = (VXLAN_SRC_PORT_MAX - VXLAN_SRC_PORT_MIN) + 1;
> +       u32 hash = OVS_CB(skb)->flow->hash;
> +
> +       return (__force u16)(((u64) hash * range) >> 32) + VXLAN_SRC_PORT_MIN;

Do we actually need the __force here?  It's a sparse construct and all
of these look like normal C operations.

> +static struct sk_buff *vxlan_build_header(const struct vport *vport,
> +                                         const struct tnl_mutable_config *mutable,
> +                                         struct dst_entry *dst,
> +                                         struct sk_buff *skb,
> +                                         int tunnel_hlen)
> +{
> +       struct udphdr *udph = udp_hdr(skb);
> +       struct vxlanhdr *vxh = (struct vxlanhdr *)(udph + 1);
> +       const struct ovs_key_ipv4_tunnel *tun_key = OVS_CB(skb)->tun_key;
> +       __be64 out_key;
> +
> +       if (tun_key->ipv4_dst)
> +               out_key = tun_key->tun_id;
> +       else
> +               out_key = mutable->out_key;

I don't think this works in the case of port-based tunneling with flow
keys.  I would just use tnl_get_param() to make everything consistent.

> +/* Called with rcu_read_lock and BH disabled. */
> +static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
> +{
> +       struct vport *vport;
> +       struct vxlanhdr *vxh;
> +       const struct tnl_mutable_config *mutable;
> +       struct iphdr *iph;
> +       struct ovs_key_ipv4_tunnel tun_key;
> +       int tunnel_type;
> +       __be64 key;
> +       u32 tunnel_flags = 0;
> +
> +       if (unlikely(!pskb_may_pull(skb, VXLAN_HLEN + ETH_HLEN)))
> +               goto error;
> +
> +       vxh = vxlan_hdr(skb);
> +       if (unlikely(vxh->vx_flags != htonl(VXLAN_FLAGS) ||
> +                    vxh->vx_vni & htonl(0xff)))
> +               goto error;
> +
> +       __skb_pull(skb, VXLAN_HLEN);
> +       skb_postpull_rcsum(skb, skb_transport_header(skb), VXLAN_HLEN + ETH_HLEN);
> +
> +       key = cpu_to_be64(ntohl(vxh->vx_vni) >> 8);
> +
> +       tunnel_type = TNL_T_PROTO_VXLAN;

I'm not sure that this actually needs to be stored in a variable since
it is only used once.

> +static int vxlan_socket_init(struct vxlan_port *vxlan_port)
> +{
> +       int err;
> +       struct sockaddr_in sin;
> +
> +       err = sock_create(AF_INET, SOCK_DGRAM, 0, &vxlan_port->vxlan_rcv_socket);
> +       if (err)
> +               goto error;

It doesn't look like this handles namespaces.  I think we need to use
sock_create_kern(0 and sk_change_net() like in CAPWAP.

> +static int vxlan_tunnel_setup(struct net *net, const char *linkname,
> +                            struct nlattr *options)
> +{
> +       struct nlattr *a[OVS_TUNNEL_ATTR_MAX + 1];
> +       int err;
> +       u16 dst_port;
> +       struct vxlan_port *vxlan_port;
> +       struct vxlan_if *vxlan_if;
> +
> +       if (!options) {
> +               err = -EINVAL;
> +               goto out;
> +       }
> +
> +       err = nla_parse_nested(a, OVS_TUNNEL_ATTR_MAX, options, vxlan_policy);
> +       if (err)
> +               goto out;

You could use nla_find_nested(), it seems to match what you are
looking for here.

> +
> +       if (a[OVS_TUNNEL_ATTR_DST_PORT])
> +               dst_port = nla_get_u16(a[OVS_TUNNEL_ATTR_DST_PORT]);
> +       else
> +               dst_port = VXLAN_DST_PORT;
> +
> +       /* Verify if we already have a socket created for this port */
> +       vxlan_port = vxlan_port_exists(net, dst_port);
> +       if (vxlan_port) {
> +               vxlan_port->count++;
> +               err = 0;
> +               goto out;
> +       }
> +
> +       /* Add a new socket for this port */
> +       vxlan_port = kmalloc(sizeof(struct vxlan_port), GFP_KERNEL);
> +       if (!vxlan_port) {
> +               err = -ENOMEM;
> +               goto out;
> +       }
> +       memset (vxlan_port, 0, sizeof(struct vxlan_port));

You could combine the kmalloc() and memset() using kzalloc() instead.

> +       vxlan_port->port = dst_port;
> +       vxlan_port->count++;

I think this would be clearer if you initialized it to 1 (since I
think this will always be the first user).

> +static int vxlan_set_options(struct vport *vport, struct nlattr *options)
> +{
> +       int err;
> +       const char *vname = vport->ops->get_name(vport);
> +
> +       err = vxlan_tunnel_setup(ovs_dp_get_net(vport->dp), vname, options);
> +       if (err)
> +               goto out;

If the dest port is changed and the refcount drops to zero, shouldn't
we remove that socket?  I think the best approach may be to add the
new socket, call set_options, and then drop the old socket.

> +void vxlan_tnl_destroy(struct vport *vport)
> +{
> +       struct vxlan_if *vxlan_if;
> +       struct vxlan_port *vxlan_port;
> +       const char *vname = vport->ops->get_name(vport);
> +
> +       vxlan_if = vxlan_if_by_name(ovs_dp_get_net(vport->dp), vname);
> +       if (!vxlan_if)
> +               goto out;

I think we can use tnl_vport_priv() convert the vport to a tnl_vport
and pull the dst_port out of the mutable directly.  Then we wouldn't
have to bother with the if_name hash at all.

> +static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
> +{
> +       int err;
> +
> +       err = vxlan_tunnel_setup(ovs_dp_get_net(parms->dp), parms->name,
> +                                               parms->options);
> +       return ovs_tnl_create(parms, &ovs_vxlan_vport_ops, &ovs_vxlan_tnl_ops);

If ovs_tnl_create() fails won't this result in an incorrect refcount?

> +static int vxlan_init(void)
> +{
[...]
> +       vxlan_ports = kzalloc(VXLAN_SOCK_HASH_BUCKETS * sizeof(struct hlist_head),
> +                               GFP_KERNEL);

We probably could get away with just making this fairly small and then
statically allocating it.  Realistically, even just a linked list
would be fine.

> diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
> index e7d4b49..2ae5681 100644
> --- a/include/linux/openvswitch.h
> +++ b/include/linux/openvswitch.h
> @@ -186,6 +186,7 @@ enum ovs_vport_type {
>         OVS_VPORT_TYPE_PATCH = 100, /* virtual tunnel connecting two vports */
>         OVS_VPORT_TYPE_GRE,      /* GRE tunnel */
>         OVS_VPORT_TYPE_CAPWAP,   /* CAPWAP tunnel */
> +       OVS_VPORT_TYPE_VXLAN,    /* VXLAN tunnel */

Since the plan is to get this upstream, let's put it after FT_GRE.
That way we can avoid having to do any switchover.

> diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
> index 5171171..3dbb798 100644
> --- a/lib/netdev-vport.c
> +++ b/lib/netdev-vport.c
> @@ -173,6 +173,13 @@ netdev_vport_get_netdev_type(const struct dpif_linux_vport *vport)
>      case OVS_VPORT_TYPE_CAPWAP:
>          return "capwap";
>
> +    case OVS_VPORT_TYPE_VXLAN:
> +        if (tnl_port_config_from_nlattr(vport->options, vport->options_len,
> +                                        a)) {
> +            break;
> +        }

We only need to retrieve the attributes in order to check if IPsec is
set, so we should be able to drop the if block here.

> @@ -584,6 +591,7 @@ parse_tunnel_config(const char *name, const char *type,
>                      const struct smap *args, struct ofpbuf *options)
>  {
>      bool is_gre = false;
> +    bool is_vxlan = false;

I think it's probably better to move to more generic names like
"has_dst_port" rather than directly tying it to the protocol (the
effect would still be the same though).

> @@ -835,6 +848,11 @@ unparse_tunnel_config(const char *name OVS_UNUSED, const char *type OVS_UNUSED,
>          smap_add_format(args, "tos", "0x%x", tos);
>      }
>
> +    if (a[OVS_TUNNEL_ATTR_DST_PORT]) {
> +        ovs_be16 dst_port = nl_attr_get_be16(a[OVS_TUNNEL_ATTR_DST_PORT]);

I believe that everywhere else the type for dst_port is a u16 instead of be16.

> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index c2786a5..65ead8c 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -1393,9 +1412,10 @@
>          deprecated and will be removed soon.
>        </column>
>
> -      <group title="Tunnel Options: gre only">
> +      <group title="Tunnel Options: gre and vxlan only">
>          <p>
> -          Only <code>gre</code> interfaces support these options.
> +          Only <code>gre</code> and <code>vxlan</code> interfaces support these
> +          options.
>          </p>
>        </group>

I think we can just delete this empty group entirely now.

> @@ -1427,11 +1447,19 @@
>          </column>
>        </group>
>
> -      <group title="Tunnel Options: ipsec_gre only">
> +      <group title="Tunnel Options: ipsec_gre and ipsec_vxlan only">
>          <p>
> -          Only <code>ipsec_gre</code> interfaces support these options.
> +          Only <code>ipsec_gre</code> and <code>ipsec_vxlan</code> interfaces
> +          support these options.
>          </p>
>
> +       <p>
> +         These options are implemented through a separate daemon named
> +         <code>ovs-monitor-ipsec</code> that so far has only been ported to
> +         and packaged for Debian (including derivative distributions such as
> +         Ubuntu).
> +       </p>

Since IPsec has been removed, I believe these changes are no longer relevant.