[ovs-dev] [PATCH v5] datapath: Add support for VXLAN tunnels to Open vSwitch

Kyle Mestery (kmestery) kmestery at cisco.com
Thu Nov 29 20:23:15 UTC 2012


On Nov 28, 2012, at 8:43 PM, Jesse Gross <jesse at nicira.com> wrote:
> On Wed, Nov 28, 2012 at 2:53 PM, Kyle Mestery <kmestery at cisco.com> wrote:
>> Note: v5 of this patch addresses comments from Jesse
>> and Chris.
>> 
>> Note: v4 of this patch removes VXLAN over IPSEC support,
>> per an offline conversation with Jesse.
>> 
>> Add support for VXLAN tunnels to Open vSwitch. Add support
>> for setting the destination UDP port on a per-port basis.
>> This is done by adding a "dst_port" parameter to the port
>> configuration. This is only applicable currently to VXLAN
>> tunnels.
>> 
>> Please note this currently does not implement any sort of multicast
>> learning. With this patch, VXLAN tunnels must be configured similar
>> to GRE tunnels (e.g. point to point). A subsequent patch will implement
>> a VXLAN control plane in userspace to handle multicast learning.
>> 
>> This patch set is based on one posted by Ben Pfaff on Oct. 12, 2011
>> to the ovs-dev mailing list:
>> 
>> http://openvswitch.org/pipermail/dev/2011-October/012051.html
>> 
>> The patch has been maintained, updated, and freshened by me and a
>> version of it is available at the following github repository:
>> 
>> https://github.com/mestery/ovs-vxlan/tree/vxlan
>> 
>> I've tested this patch with multiple VXLAN tunnels between hosts
>> using different UDP port numbers. Performance is on par (though
>> slightly faster) than comparable GRE tunnels.
>> 
>> See the following IETF draft for additional information about VXLAN:
>> http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02
>> 
>> Signed-off-by: Kyle Mestery <kmestery at cisco.com>
> 
> Thanks Kyle.  I noticed a couple of small things, otherwise looks good.
> 
Thanks for the review Jesse. I'm addressing these now, will hopefully
send out a new patch today or tomorrow morning.

Kyle

> For the commit message, can you use --- at the bottom to separate out
> the comments about the revision?  Otherwise, if I applied this as-is
> it would include the revision history, which doesn't really need to be
> in the commit log since the earlier version won't be there either.
> 
>> diff --git a/NEWS b/NEWS
>> index bb80beb..f6b2f0d 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -1,6 +1,7 @@
>> post-v1.9.0
>> --------------------
>> -
>> +    - New support for the VXLAN tunnel protocol (see the IETF draft here:
>> +      http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02).
> 
> Can you keep the double blank lines that separate the versions?
> 
>> diff --git a/datapath/vport-vxlan.c b/datapath/vport-vxlan.c
>> new file mode 100644
>> index 0000000..8f79991
>> --- /dev/null
>> +++ b/datapath/vport-vxlan.c
>> +/* Protected by RTNL lock. */
>> +#define VXLAN_SOCK_HASH_BUCKETS 32
>> +static struct hlist_head vxlan_ports[VXLAN_SOCK_HASH_BUCKETS];
> 
> Given that in most cases we'll only have a single destination port and
> this isn't part of any fast path, 32 buckets still seems overkill to
> me.  Is there a reason to keep it as a hash table at all?
> 
>> +static struct vxlan_port *vxlan_port_exists(struct net *net, u16 port)
>> +{
>> +       struct hlist_head *bucket = vxlan_hash_bucket(net, port);
>> +       struct vxlan_port *vxlan_port;
>> +       struct hlist_node *node;
>> +
>> +       hlist_for_each_entry(vxlan_port, node, bucket, hash_node) {
>> +               if (vxlan_port->port == port)
> 
> We also should check the net.
> 
>> +static void vxlan_tunnel_release(struct vxlan_port *vxlan_port)
>> +{
>> +       if (vxlan_port->count-- == 0) {
> 
> Shouldn't this be predecrement?
> 
>> +               /* Release old socket */
>> +               sock_release(vxlan_port->vxlan_rcv_socket);
>> +               hlist_del(&vxlan_port->hash_node);
>> +               kfree(vxlan_port);
>> +       }
>> +}
> 
> It would be nice to make the the functions that deal with creation and
> deletion of sockets symmetric to each other in regards to allocation
> of memory, refcounting, and socket creation.
> 
>> +static int vxlan_tunnel_setup(struct net *net, const char *linkname,
>> +                             struct nlattr *options,
>> +                             struct vxlan_port **vxport)
>> +{
>> +       struct nlattr *a;
>> +       int err;
>> +       u16 dst_port;
>> +       struct vxlan_port *vxlan_port;
>> +
>> +       if (!options) {
>> +               err = -EINVAL;
>> +               goto out;
>> +       }
>> +
>> +       a = nla_find_nested(options, OVS_TUNNEL_ATTR_DST_PORT);
>> +       if (a)
> 
> We also should check that the size is u16 here.
> 
>> +       /* Verify if we already have a socket created for this port */
>> +       vxlan_port = vxlan_port_exists(net, dst_port);
>> +       if (vxlan_port) {
>> +               if (vxlan_port->port == dst_port) {
>> +                       vxlan_port->count++;
>> +                       err = 0;
>> +                       goto out;
>> +               }
>> +
>> +               /* Release old socket */
>> +               vxlan_tunnel_release(vxlan_port);
> 
> I don't understand the second check for dst_port and the possibility
> of releasing the old port.  If the result of vxlan_port_exists is
> non-NULL shouldn't that be enough?
> 
>> +       }
>> +
>> +       /* Add a new socket for this port */
>> +       vxlan_port = kzalloc(sizeof(struct vxlan_port), GFP_KERNEL);
>> +       if (!vxlan_port) {
>> +               err = -ENOMEM;
>> +               goto out;
>> +       }
>> +
>> +       vxlan_port->port = dst_port;
>> +       vxlan_port->count = 1;
>> +       hlist_add_head(&vxlan_port->hash_node,
>> +                      vxlan_hash_bucket(net, dst_port));
>> +
>> +       err = vxlan_socket_init(vxlan_port, net);
>> +       if (err)
>> +               goto error;
>> +
>> +       vxport = &vxlan_port;
>> +out:
>> +       return err;
>> +error:
>> +       vxport = NULL;
> 
> These two references to vxport look wrong to me - I think they should
> both be *vxport.
> 
>> +       hlist_del(&vxlan_port->hash_node);
>> +       kfree(vxlan_port);
>> +       goto out;
> 
> The exit path is somewhat confusing in the way that it jumps around.
> Usually we have the error handler(s) first and eventually get to the
> success path, like we were unwinding the stack.  You also could start
> with vxlan_port being NULL and then assign it as part of the success
> path to vxport.
> 
>> +static void vxlan_tnl_destroy(struct vport *vport)
>> +{
>> +       struct vxlan_port *vxlan_port;
>> +       struct tnl_vport *tnl_vport = tnl_vport_priv(vport);
>> +       struct tnl_mutable_config *config;
>> +
>> +       config = rtnl_dereference(tnl_vport->mutable);
>> +
>> +       vxlan_port = vxlan_port_exists(ovs_dp_get_net(vport->dp),
>> +                                        config->dst_port);
>> +       if (!vxlan_port)
>> +               goto out;
>> +
>> +       if (!--vxlan_port->count) {
>> +               vxlan_tunnel_release(vxlan_port);
>> +       }
> 
> I think the refcount gets decremented both here and in vxlan_tunnel_release().
> 
>> +static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
>> +{
>> +       int err;
>> +       struct vport *vport;
>> +       struct vxlan_port *vxlan_port = NULL;
>> +
>> +       err = vxlan_tunnel_setup(ovs_dp_get_net(parms->dp), parms->name,
>> +                                               parms->options, &vxlan_port);
>> +       vport = ovs_tnl_create(parms, &ovs_vxlan_vport_ops, &ovs_vxlan_tnl_ops);
>> +
>> +       if (!vport)
>> +               vxlan_tunnel_release(vxlan_port);
> 
> Shouldn't we return the error and stop immediately if vxlan_tunnel_setup fails?
> 
>> diff --git a/include/openvswitch/tunnel.h b/include/openvswitch/tunnel.h
>> index 42c3621..b4e57e2 100644
>> --- a/include/openvswitch/tunnel.h
>> +++ b/include/openvswitch/tunnel.h
> [...]
>> +/* Default to the OTV port, per the VXLAN IETF draft. */
>> +#define VXLAN_DST_PORT 8472
> 
> I would put this constant at the top of lib/vport-netdev.c.  This file
> is part of the user/kernel interface and I'd like to keep the port
> number entirely in userspace.
> 
>> diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
>> index 5171171..9452d60 100644
>> --- a/lib/netdev-vport.c
>> +++ b/lib/netdev-vport.c
>> @@ -835,6 +852,11 @@ unparse_tunnel_config(const char *name OVS_UNUSED, const char *type OVS_UNUSED,
>>         smap_add_format(args, "tos", "0x%x", tos);
>>     }
>> 
>> +    if (a[OVS_TUNNEL_ATTR_DST_PORT]) {
>> +        uint16_t dst_port = nl_attr_get_be16(a[OVS_TUNNEL_ATTR_DST_PORT]);
> 
> On the right side where we get the attribute we'll want to use u16 as well.
> 
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index c2786a5..9278e42 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -1247,6 +1246,23 @@
>>             February 2013.
>>           </dd>
>> 
>> +          <dt><code>vxlan</code></dt>
>> +          <dd>
>> +           <p>
>> +             An Ethernet tunnel over the experimental, UDP-based VXLAN
>> +             protocol described at
>> +             <code>http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02</code>.
>> +             VXLAN is currently supported only with the Linux kernel datapath
>> +             with kernel version 2.6.26 or later.
>> +           </p>
>> +           <p>
>> +             As an experimental protocol, VXLAN has no officially assigned UDP
>> +             port.  Open vSwitch currently uses UDP destination port 8472.
>> +             The source port used for VXLAN traffic varies on a per-flow basis
>> +             between 32768 and 65535 to allow load balancing.
> 
> It's no longer quite true that it's the upper 15-bits since we're
> using the ephemeral ports now.





More information about the dev mailing list