[ovs-dev] [PATCH] datapath: Integration with upstream kernel tunneling.

Pravin Shelar pshelar at nicira.com
Thu Apr 4 18:01:12 UTC 2013


On Tue, Apr 2, 2013 at 3:26 PM, Jesse Gross <jesse at nicira.com> wrote:

> On Sat, Mar 30, 2013 at 9:20 AM, Pravin B Shelar <pshelar at nicira.com>
> wrote:
> > Following patch restructure ovs tunneling to make use of kernel
> > api. Doing this tunneling code is simplified as most of protocol
> > processing on send and recv is pushed to kernel tunneling. This
> > way we can share most protocol related code between openvswitch
> > tunneling and linux tunnel devices.
> >
> > Signed-off-by: Pravin B Shelar <pshelar at nicira.com>
>
> I'm going to wait to look at the upstream patches for now since they
> are mostly just exporting existing code.  I hope that we can keep more
> of the dataplane where it is and not have to export those functions.
>
> I got some sparse warnings:
>   CHECK   /home/jgross/openvswitch/datapath/linux/vport-lisp.c
> /home/jgross/openvswitch/datapath/linux/vport-lisp.c:331:22: warning:
> incorrect type in assignment (different base types)
> /home/jgross/openvswitch/datapath/linux/vport-lisp.c:331:22:
> expected restricted __be16 [assigned] [usertype] sin_port
> /home/jgross/openvswitch/datapath/linux/vport-lisp.c:331:22:    got
> restricted __be32 [usertype] dst_port
> /home/jgross/openvswitch/datapath/linux/vport-lisp.c:386:60: warning:
> incorrect type in argument 2 (different base types)
> /home/jgross/openvswitch/datapath/linux/vport-lisp.c:386:60:
> expected restricted __be32 [usertype] dst_port
> /home/jgross/openvswitch/datapath/linux/vport-lisp.c:386:60:    got
> restricted __be16 [assigned] [usertype] dst_port
>   CHECK   /home/jgross/openvswitch/datapath/linux/vport-netdev.c
>   CHECK   /home/jgross/openvswitch/datapath/linux/vport-vxlan.c
> /home/jgross/openvswitch/datapath/linux/vport-vxlan.c:166:5: warning:
> symbol 'vxlan_tnl_send' was not declared. Should it be static?
>
> ok.


> > diff --git a/acinclude.m4 b/acinclude.m4
> > index 19a47dd..c30855f 100644
> > --- a/acinclude.m4
> > +++ b/acinclude.m4
> > @@ -254,6 +254,8 @@ AC_DEFUN([OVS_CHECK_LINUX_COMPAT], [
> >    OVS_GREP_IFELSE([$KSRC/include/linux/skbuff.h], [consume_skb])
> >    OVS_GREP_IFELSE([$KSRC/include/linux/skbuff.h], [skb_frag_page])
> >    OVS_GREP_IFELSE([$KSRC/include/linux/skbuff.h], [skb_reset_mac_len])
> > +  OVS_GREP_IFELSE([$KSRC/include/linux/skbuff.h], [SKB_GSO_GRE])
> > +  OVS_GREP_IFELSE([$KSRC/include/linux/skbuff.h], [SKB_GSO_UDP_TUNNEL])
> >
> >    OVS_GREP_IFELSE([$KSRC/include/linux/string.h], [kmemdup], [],
> >                    [OVS_GREP_IFELSE([$KSRC/include/linux/slab.h],
> [kmemdup])])
> > @@ -262,6 +264,8 @@ AC_DEFUN([OVS_CHECK_LINUX_COMPAT], [
> >                    [OVS_DEFINE([HAVE_BOOL_TYPE])])
> >    OVS_GREP_IFELSE([$KSRC/include/linux/types.h], [__wsum],
> >                    [OVS_DEFINE([HAVE_CSUM_TYPES])])
> > +  OVS_GREP_IFELSE([$KSRC/include/uapi/linux/types.h], [__wsum],
> > +                  [OVS_DEFINE([HAVE_CSUM_TYPES])])
>
> Is there a reason to think these have been backported?  They are used
> mixed with version checks so it seems to make sense to just use
> versions in the absence of another reason.
>
> ok.


> > diff --git a/datapath/datapath.c b/datapath/datapath.c
> > index 9cd4b0e..4d9a3b0 100644
> > --- a/datapath/datapath.c
> > +++ b/datapath/datapath.c
> > @@ -49,8 +49,10 @@
> >  #include <linux/rculist.h>
> >  #include <linux/dmi.h>
> >  #include <net/genetlink.h>
> > +#include <net/gre.h>
> >  #include <net/net_namespace.h>
> >  #include <net/netns/generic.h>
> > +#include <net/vxlan.h>
> >
> >  #include "checksum.h"
> >  #include "datapath.h"
> > @@ -61,7 +63,7 @@
> >  #include "vport-internal_dev.h"
> >
> >  #if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,18) || \
> > -    LINUX_VERSION_CODE >= KERNEL_VERSION(3,9,0)
> > +    LINUX_VERSION_CODE >= KERNEL_VERSION(3,10,0)
> >  #error Kernels before 2.6.18 or after 3.8 are not supported by this
> version of Open vSwitch.
> >  #endif
>
> Can we separate out support for 3.9 into a separate change?  They seem
> like they should be independent.
>
> ok, I will send separate patch for 3.9.


> > @@ -2349,10 +2351,18 @@ static int __init dp_init(void)
> >         if (err)
> >                 goto error_genl_exec;
> >
> > -       err = ovs_flow_init();
> > +       err = gre_compat_init();
> >         if (err)
> >                 goto error_wq;
> >
> > +       err = vxlan_compat_init();
> > +       if (err)
> > +               goto error_gre_compat;
>
> Can we do this initialization when we add the first port?  It will
> help isolate the compatibility code and not use resources that we
> don't need as the number of protocols grows.
>
>
I was going to do it but there is following dead lock possibility.

      CPU0(OVS)                    CPU1(create_new_namespaces)
       ----                    ----
  lock(rtnl_mutex);
                               lock(net_mutex);
                               lock(rtnl_mutex);
  lock(net_mutex);



> > diff --git a/datapath/tunnel.c b/datapath/tunnel.c
> > index 057aaed..bd0ee7e 100644
> > --- a/datapath/tunnel.c
> > +++ b/datapath/tunnel.c
> > +static int ovs_skb_checksum(struct sk_buff *skb)
> >  {
> [...]
> > +static int tnl_send_out(struct vport *vport, struct sk_buff *skb,
>
> These seem very fragile to me and it would be best to not have them in
> this file at all.  Ideally we would have something closer to a real
> GSO backport.  It's going to be a pain but this is the third place
> this has come up (the first two were vlan and MPLS) so it might be
> time.
>
> Let me check if I can make it more like GSO.

> -int ovs_tnl_send(struct vport *vport, struct sk_buff *skb)
> > +int ovs_tnl_send(struct vport *vport, struct sk_buff *skb, u8 ipproto,
> > +                int tunnel_hlen,
> > +                struct sk_buff *(*build_header)(const struct vport *,
> > +                                                struct sk_buff *,
> > +                                                int tunnel_hlen))
> >  {
> > -       struct tnl_vport *tnl_vport = tnl_vport_priv(vport);
> > +       struct iphdr *inner_iph;
> > +       struct iphdr *iph;
> >         struct rtable *rt;
> >         __be32 saddr;
> >         int sent_len = 0;
> > -       int tunnel_hlen;
> > +       int min_headroom;
> >
> >         if (unlikely(!OVS_CB(skb)->tun_key))
> >                 goto error_free;
> >
> > +       if (unlikely(vlan_deaccel_tag(skb)))
> > +               goto error;
> > +
> >         /* Route lookup */
> >         saddr = OVS_CB(skb)->tun_key->ipv4_src;
> >         rt = find_route(ovs_dp_get_net(vport->dp),
> >                         &saddr,
> >                         OVS_CB(skb)->tun_key->ipv4_dst,
> > -                       tnl_vport->tnl_ops->ipproto,
> > +                       ipproto,
> >                         OVS_CB(skb)->tun_key->ipv4_tos,
> >                         skb_get_mark(skb));
> >         if (IS_ERR(rt))
> >                 goto error_free;
> >
> > +       skb_dst_drop(skb);
> > +       skb_dst_set(skb, &rt_dst(rt));
>
> I think everything below this point are dataplane components that
> should be handled outside of OVS.  Why do we need to handle offloads,
> unclone, reserve header space, reset the skb, select identities, etc.?
>
> ok.


> > diff --git a/datapath/vport-gre.c b/datapath/vport-gre.c
> > index 40b96cf..3a722a1 100644
> > --- a/datapath/vport-gre.c
> > +++ b/datapath/vport-gre.c
> > +static struct sk_buff *__gre_build_header(struct sk_buff *skb,
> >                                int tunnel_hlen,
> >                                bool is_gre64)
> > +       skb = gre_handle_offloads(skb, (tun_key->tun_flags &
> OVS_TNL_F_CSUM));
> > +       if (IS_ERR(skb))
> > +               return NULL;
>
> Can't we do this in the GRE code?
>
> I was going to move it to build_heade like vxlan, But I could not due to
the way IP_GRE uses this APIs.
I can not change skb in build header. ref dev->header_ops case
in ipgre_xmit().

> @@ -267,27 +188,23 @@ static int gre_rcv(struct sk_buff *skb)
> >         if (unlikely(!pskb_may_pull(skb, hdr_len + ETH_HLEN)))
> >                 goto error;
> >
> > -       iph = ip_hdr(skb);
> > -       tnl_flags = gre_flags_to_tunnel_flags(gre_flags, is_gre64);
> > -       tnl_tun_key_init(&tun_key, iph, key, tnl_flags);
> > +       tnl_tun_key_init(&tun_key, ip_hdr(skb), key, tnl_flags);
> >         OVS_CB(skb)->tun_key = &tun_key;
> >
> >         __skb_pull(skb, hdr_len);
> >         skb_postpull_rcsum(skb, skb_transport_header(skb), hdr_len +
> ETH_HLEN);
>
> Things like pulling the headers off should happen in shared code.
>
> ok.


> > @@ -334,8 +245,11 @@ static struct vport *gre_create(const struct
> vport_parms *parms)
> >         if (rtnl_dereference(ovs_net->vport_net.gre_vport))
> >                 return ERR_PTR(-EEXIST);
> >
> > -       vport = ovs_tnl_create(parms, &ovs_gre_vport_ops, &gre_tnl_ops);
> > +       vport = ovs_vport_alloc(0, &ovs_gre_vport_ops, parms);
> > +       if (IS_ERR(vport))
> > +               return vport;
> >
> > +       strncpy(ovs_net->vport_net.gre_name, parms->name, IFNAMSIZ);
> >         rcu_assign_pointer(ovs_net->vport_net.gre_vport, vport);
> >         return vport;
>
> I think we need to move protocol registration in here so that it
> happens on demand.  Ideally we should also figure out the right way to
> make the modules load dynamically and not enforce either compile or
> load time dependencies on the tunnel protocols.
>
> right, but we need to sort out OVS locking first as mentioned above.


> > diff --git a/datapath/vport-vxlan.c b/datapath/vport-vxlan.c
> > index 1850fc2..cf3064b 100644
> > --- a/datapath/vport-vxlan.c
> > +++ b/datapath/vport-vxlan.c
> > +static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
> >  {
> > -       err = vxlan_socket_init(vxlan_port, net);
> > +       net = ovs_dp_get_net(parms->dp);
> > +       ovs_vxlan_port = vxlan_vport_priv(vport);
> > +       strncpy(ovs_vxlan_port->name, parms->name, IFNAMSIZ);
> > +       vxlan_port = &ovs_vxlan_port->port;
> > +       vxlan_port->portno = htons(dst_port);
> > +       vxlan_port->vx_rcv = vxlan_rcv;
> > +       vxlan_port->user_data = vport;
> > +       err = vxlan_add_handler(net, vxlan_port);
>
> Shouldn't we make this handler more closely resemble GRE in terms of
> having priorities, etc.?  Having a limit of 8 ports also seems
> somewhat arbitrary and could possibly lead to confusion.
>
> Priority is not add since it is part of vxlan module, and initialized as
part of vxlan module init. It is always going to be first in array
therefore higher priority.
Fixed array is used to make per packet lookup efficient. These choices make
code fast and simple.
I can change if you want.

> diff --git a/datapath/vport.c b/datapath/vport.c
> > index d458a95..bfaf410 100644
> > --- a/datapath/vport.c
> > +++ b/datapath/vport.c
> > @@ -208,6 +208,21 @@ void ovs_vport_free(struct vport *vport)
> >         kfree(vport);
> >  }
> >
> > +static void free_vport_rcu(struct rcu_head *rcu)
> > +{
> > +       struct vport *vport = container_of(rcu, struct vport, rcu);
> > +
> > +       ovs_vport_free(vport);
> > +}
> > +
> > +void ovs_vport_deferred_destroy(struct vport *vport)
> > +{
> > +       if (!vport)
> > +               return;
> > +
> > +       call_rcu(&vport->rcu, free_vport_rcu);
> > +}
>
> It seems like these changes could be broken out separately.
>

ok.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openvswitch.org/pipermail/ovs-dev/attachments/20130404/32fb2501/attachment-0003.html>


More information about the dev mailing list