[ovs-dev] [PATCH 1/1 V4] Add support for tun_key to OVS datapath

Kyle Mestery (kmestery) kmestery at cisco.com
Fri Sep 7 19:56:45 UTC 2012


On Sep 6, 2012, at 10:24 PM, Jesse Gross wrote:
> On Wed, Sep 5, 2012 at 2:58 PM, Kyle Mestery <kmestery at cisco.com> wrote:
>> This is a first pass at providing a tun_key which can be
>> used as the basis for flow-based tunnelling. The tun_key
>> includes and replaces the tun_id in both struct ovs_skb_cb
>> and struct sw_tun_key.
>> 
>> This patch allows all existing tun_id behaviour to still work. Existing
>> users of tun_id are redirected to tun_key->tun_id to retain compatibility.
>> However, when the userspace code is updated to make use of the new tun_key,
>> the old behaviour will be deprecated and removed.
>> 
>> NOTE: With these changes, the tunneling code no longer assumes input and
>> output keys are symmetric.  If they are not, PMTUD needs to be disabled
>> for tunneling to work.
>> 
>> Signed-off-by: Kyle Mestery <kmestery at cisco.com>
>> Cc: Simon Horman <horms at verge.net.au>
>> Cc: Jesse Gross <jesse at nicira.com>
> 
> A couple high level Linux kernel style comments:
> * I noticed a couple places that use spaces instead of tabs.
> * The multiline comment style used in networking (although not
> currently respected everywhere) is:
> /* Comment
>  * comment
>  */
OK, cleaned those up.

> 
>> diff --git a/NEWS b/NEWS
>> index cbc5c58..0abf2fa 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -1,5 +1,7 @@
>> post-v1.8.0
>> ------------------------
>> +    - The tunneling code no longer assumes input and output keys are symmetric.
>> +      If they are not, PMTUD needs to be disabled for tunneling to work.
> 
> I would probably note that this only applies in the case of flow-based keys.
> 
Fixed.

>> diff --git a/datapath/actions.c b/datapath/actions.c
>> index ec9b595..31f6bd6 100644
>> --- a/datapath/actions.c
>> +++ b/datapath/actions.c
>> @@ -377,6 +381,16 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
>>        int prev_port = -1;
>>        const struct nlattr *a;
>>        int rem;
>> +       struct ovs_key_ipv4_tunnel tun_key;
>> +
>> +       /*
>> +        * If tun_key is NULL for this skb, point it at one on the stack for
>> +        * action processing and output. This can disappear once we drop support
>> +        * for setting tun_id outside of tun_key.
>> +        */
>> +        memset(&tun_key, 0, sizeof(struct ovs_key_ipv4_tunnel));
>> +        if (OVS_CB(skb)->tun_key == NULL)
>> +               OVS_CB(skb)->tun_key = &tun_key;
> 
> I think ideally we want to push this down as far as possible.  struct
> ovs_key_ipv4_tunnel needs to be on the stack in do_execute_actions()
> so that it is in scope when we output but it's best to do the
> memsetting and assignment only when (and if) we actually set the
> tun_id.  We can just pass a pointer into execute_set_action().
> 
OK, fixed.

>> diff --git a/datapath/flow.c b/datapath/flow.c
>> index d07337c..3b85c62 100644
>> --- a/datapath/flow.c
>> +++ b/datapath/flow.c
>> @@ -1023,10 +1026,28 @@ int ovs_flow_from_nlattrs(struct sw_flow_key *swkey, int *key_lenp,
>>        }
>> 
>>        if (attrs & (1ULL << OVS_KEY_ATTR_TUN_ID)) {
>> -               swkey->phy.tun_id = nla_get_be64(a[OVS_KEY_ATTR_TUN_ID]);
>> +               tun_id = nla_get_be64(a[OVS_KEY_ATTR_TUN_ID]);
>>                attrs &= ~(1ULL << OVS_KEY_ATTR_TUN_ID);
>>        }
>> 
>> +       if (attrs & (1ULL << OVS_KEY_ATTR_IPV4_TUNNEL)) {
>> +               struct ovs_key_ipv4_tunnel *tun_key;
>> +               tun_key = nla_data(a[OVS_KEY_ATTR_IPV4_TUNNEL]);
>> +
>> +               /*
>> +                * If both OVS_KEY_ATTR_TUN_ID and OVS_KEY_ATTR_IPV4_TUNNEL are
>> +                * present, ensure they are consistent or error out.
>> +                */
>> +               if (tun_id != 0) {
>> +                       /* Verify they match */
>> +                       if (tun_id != tun_key->tun_id)
>> +                               return -EINVAL;
>> +               }
>> +
>> +               swkey->tun.tun_key = *tun_key;
>> +               attrs &= ~(1ULL << OVS_KEY_ATTR_IPV4_TUNNEL);
>> +       }
> 
> I think if only OVS_KEY_ATTR_TUN_ID is sent then this will result in
> tun_key never getting set at all.  This shouldn't happen in practice
> because on flow setup we always will send both together and they
> should be echoed back together.  It doesn't seem right to silently
> ignore one though.
> 
> Here's a different option - we could enforce that both must come
> together and where there is overlap the information is consistent.
> This should be easier to check and we can avoid ambiguities around the
> value zero.  It also treats them as one, which they effectively are.
> On both the sent and receive sides I would combine the if blocks to
> really make it explicit that they go together.
> 
Done. I checked for both being set, and compared the values if so. I
then checked for only one being set, and returned an error if so. 

>> +int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port,
>> +                                  struct ovs_key_ipv4_tunnel *tun_key,
>>                                   const struct nlattr *attr)
>> {
>>        const struct nlattr *nla;
>>        int rem;
>> 
>>        *in_port = DP_MAX_PORTS;
>> -       *tun_id = 0;
>> +       memset(tun_key, 0, sizeof(struct ovs_key_ipv4_tunnel));
> 
> Can you use sizeof(*tun_key) here?
> 
Fixed.

>> @@ -1185,7 +1207,12 @@ int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port, __be64 *tun_id,
>>                                break;
>> 
>>                        case OVS_KEY_ATTR_TUN_ID:
>> -                               *tun_id = nla_get_be64(nla);
>> +                               tun_key->tun_id = nla_get_be64(nla);
>> +                               break;
>> +
>> +                       case OVS_KEY_ATTR_IPV4_TUNNEL:
>> +                               memcpy(tun_key, nla_data(nla),
>> +                                      sizeof(*tun_key));
> 
> We should enforce consistency here as well (this one is probably more
> likely to be broken anyways because flow setups just echo back what
> the kernel supplied but this is generated "by hand").  In this case we
> can't enforce that both come together.
> 
I don't think there is an implicit order to the netlink attributes here, so what I did
was something like this:

diff --git a/datapath/flow.c b/datapath/flow.c
index 1c4eb99..5be492e 100644
--- a/datapath/flow.c
+++ b/datapath/flow.c
@@ -1187,9 +1187,10 @@ int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port,
 {
        const struct nlattr *nla;
        int rem;
+       __be64 tun_id = 0;
 
        *in_port = DP_MAX_PORTS;
-       memset(tun_key, 0, sizeof(struct ovs_key_ipv4_tunnel));
+       memset(tun_key, 0, sizeof(*tun_key));
        *priority = 0;
 
        nla_for_each_nested(nla, attr, rem) {
@@ -1205,12 +1206,19 @@ int ovs_flow_metadata_from_nlattrs(u32 *priority, u16 *in_port,
                                break;
 
                        case OVS_KEY_ATTR_TUN_ID:
-                               tun_key->tun_id = nla_get_be64(nla);
+                               tun_id = nla_get_be64(nla);
+                               if (tun_key->tun_id != 0 &&
+                                   tun_key->tun_id != tun_id)
+                                       return -EINVAL;
                                break;
 
                        case OVS_KEY_ATTR_IPV4_TUNNEL:
+                               if (tun_key->tun_id != 0)
+                                       tun_id = tun_key->tun_id;
                                memcpy(tun_key, nla_data(nla),
                                       sizeof(*tun_key));
+                               if (tun_id != 0 && tun_id != tun_key->tun_id)
+                                       return -EINVAL;
                                break;
 


>> diff --git a/datapath/flow.h b/datapath/flow.h
>> index 02c563a..5d4fcde 100644
>> --- a/datapath/flow.h
>> +++ b/datapath/flow.h
>> @@ -42,7 +42,6 @@ struct sw_flow_actions {
>> 
>> struct sw_flow_key {
>>        struct {
>> -               __be64  tun_id;         /* Encapsulating tunnel ID. */
>>                u32     priority;       /* Packet QoS priority. */
>>                u16     in_port;        /* Input switch port (or DP_MAX_PORTS). */
>>        } phy;
>> @@ -92,6 +91,9 @@ struct sw_flow_key {
>>                        } nd;
>>                } ipv6;
>>        };
>> +        struct {
>> +               struct ovs_key_ipv4_tunnel tun_key;  /* Encapsulating tunnel key. */
>> +       } tun;
>> };
> 
> The solution for this is a little more complicated because for
> efficiency we only match on as much of the flow struct as is
> necessary.  Since we currently calculate how far we need to search
> (i.e. TCP/IPv4) without taking into account a tunnel header at the end
> this will never match on the tunnel.  The other less serious problem
> is that this reduces the effectiveness of the partial length match
> when tunneling because IPv4 packets will have to match over a series
> of zeros where the IPv6 header exists.  Basically we want to stick
> tunnel headers at the end of the current match, where ever that might
> be.
> 
> One way of doing this is to add a copy of the tun_key to each union
> that could possibly be the end of the struct.  We essentially do that
> with the TCP/UDP ports which could follow either a IPv4 or IPv6
> header.  However in the case of tunnels there are many more possible
> end locations (I count 7) and whereas L4 ports logically follows the
> L3 header that's not really the case with tunnels, we're just moving
> them around because we can and the benefit is large.
> 
> I think what I would do is to make struct tun completely independent
> of struct sw_flow_key and then have it "float" over where ever we
> decide is the end of the flow struct.  It's a little messy if you want
> random access to it because you have to compute the size of the
> earlier members but in practice don't think we ever do so it shouldn't
> be too bad.
> 
I think I see what you're looking for here. I'm going to make this a separate patch
built on top of the other stuff, since it seems this may be a little more complicated
as you indicate. I'll send it out with the other patch, though.

>> diff --git a/datapath/tunnel.c b/datapath/tunnel.c
>> index d651c11..0687f80 100644
>> --- a/datapath/tunnel.c
>> +++ b/datapath/tunnel.c
>> @@ -799,10 +786,10 @@ static void create_tunnel_header(const struct vport *vport,
>>        iph->ihl        = sizeof(struct iphdr) >> 2;
>>        iph->frag_off   = htons(IP_DF);
>>        iph->protocol   = tnl_vport->tnl_ops->ipproto;
>> -       iph->tos        = mutable->tos;
>> +       iph->tos        = mutable->tos;
>>        iph->daddr      = rt->rt_dst;
>>        iph->saddr      = rt->rt_src;
>> -       iph->ttl        = mutable->ttl;
>> +       iph->ttl        = mutable->ttl;
> 
> It looks like these might have been converted from tabs to spaces.
> 
Fixed.

>> @@ -1029,7 +1017,8 @@ static struct rtable *__find_route(const struct tnl_mutable_config *mutable,
>> 
>> static struct rtable *find_route(struct vport *vport,
>>                                 const struct tnl_mutable_config *mutable,
>> -                                u8 tos, struct tnl_cache **cache)
>> +                                u8 tos, __be32 daddr, __be32 saddr,
>> +                                struct tnl_cache **cache)
> 
> Can we make the argument list here match __find_route (where
> possible)?  In particular, it's confusing that daddr and saddr are in
> opposite order and no static checker will be able to catch those
> mistakes.
> 
Corrected.

>> @@ -1219,11 +1213,23 @@ int ovs_tnl_send(struct vport *vport, struct sk_buff *skb)
>> 
>>        if (mutable->flags & TNL_F_TOS_INHERIT)
>>                tos = inner_tos;
>> +       else if (OVS_CB(skb)->tun_key && OVS_CB(skb)->tun_key->ipv4_tos != 0)
>> +               tos = OVS_CB(skb)->tun_key->ipv4_tos;
>>        else
>>                tos = mutable->tos;
>> 
>> +       if (OVS_CB(skb)->tun_key && OVS_CB(skb)->tun_key->ipv4_dst != 0)
>> +               daddr = OVS_CB(skb)->tun_key->ipv4_dst;
>> +       else
>> +               daddr = mutable->key.daddr;
>> +
>> +       if (OVS_CB(skb)->tun_key && OVS_CB(skb)->tun_key->ipv4_src != 0)
>> +               saddr = OVS_CB(skb)->tun_key->ipv4_src;
>> +       else
>> +               saddr = mutable->key.saddr;
> 
> For these checks I think we actually want to always check for
> tun_key->ipv4_dst != 0 as indicator for which type of tunnel key we
> are using rather than to check for the actual value.  For example, in
> the case of ipv4_tos it's quite possible that 0 is the actual value
> that is intended to be set.  If it is set then I would use it,
> overriding any other settings.
> 
Corrected.

>> diff --git a/datapath/tunnel.h b/datapath/tunnel.h
>> index 1924017..78a4d14 100644
>> --- a/datapath/tunnel.h
>> +++ b/datapath/tunnel.h
>> @@ -286,4 +286,14 @@ static inline struct tnl_vport *tnl_vport_priv(const struct vport *vport)
>>        return vport_priv(vport);
>> }
>> 
>> +static inline void tun_key_init(struct ovs_key_ipv4_tunnel *tun_key,
>> +                                const struct iphdr *iph, __be64 tun_id)
> 
> Can you use tnl_ as the prefix here like the other functions in this file?
> 
Fixed.

>> diff --git a/datapath/vport-capwap.c b/datapath/vport-capwap.c
>> index 05a099d..1e08d5a 100644
>> --- a/datapath/vport-capwap.c
>> +++ b/datapath/vport-capwap.c
>> @@ -220,7 +220,7 @@ static struct sk_buff *capwap_update_header(const struct vport *vport,
>>                struct capwaphdr_wsi *wsi = (struct capwaphdr_wsi *)(cwh + 1);
>>                struct capwaphdr_wsi_key *opt = (struct capwaphdr_wsi_key *)(wsi + 1);
>> 
>> -               opt->key = OVS_CB(skb)->tun_id;
>> +               opt->key = OVS_CB(skb)->tun_key->tun_id;
> 
> In theory it's possible that tun_key is NULL (at the moment we always
> initialize to something but that's not the long term goal), which
> means that this will crash.  While we could add a check here I think
> it's probably better to do it in the generic tunneling code since once
> we switch over to flow based tunneling it will be required that you
> set the tun_key (for the IP addresses if for no other reason).  What I
> would do is add a check for tun_key == NULL in tnl_send().  If it is
> NULL we can add a zeroed out dummy since there isn't a requirement
> that the user call set tun_id yet.
> 
Checked added in ovs_tnl_send().

>> diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
>> index f5c9cca..f9f0c66 100644
>> --- a/include/linux/openvswitch.h
>> +++ b/include/linux/openvswitch.h
>> @@ -278,7 +278,8 @@ enum ovs_key_attr {
>>        OVS_KEY_ATTR_ICMPV6,    /* struct ovs_key_icmpv6 */
>>        OVS_KEY_ATTR_ARP,       /* struct ovs_key_arp */
>>        OVS_KEY_ATTR_ND,        /* struct ovs_key_nd */
>> -       OVS_KEY_ATTR_TUN_ID = 63, /* be64 tunnel ID */
>> +       OVS_KEY_ATTR_IPV4_TUNNEL,  /* struct ovs_key_ipv4_tunnel */
>> +       OVS_KEY_ATTR_TUN_ID = 63,  /* be64 tunnel ID */
> 
> I would assign OVS_KEY_ATTR_IPV4_TUNNEL the value of 62 for the time
> being until we are ready to lock things down.
> 
Got it, done.

>> diff --git a/lib/odp-util.c b/lib/odp-util.c
>> index 901dac3..5a17e11 100644
>> --- a/lib/odp-util.c
>> +++ b/lib/odp-util.c
>> @@ -99,13 +99,14 @@ ovs_key_attr_to_string(enum ovs_key_attr attr)
>>     case OVS_KEY_ATTR_ETHERTYPE: return "eth_type";
>>     case OVS_KEY_ATTR_IPV4: return "ipv4";
>>     case OVS_KEY_ATTR_IPV6: return "ipv6";
>> +    case OVS_KEY_ATTR_TUN_ID: return "tun_id";
>> +    case OVS_KEY_ATTR_IPV4_TUNNEL: return "ipv4_tunnel";
>>     case OVS_KEY_ATTR_TCP: return "tcp";
>>     case OVS_KEY_ATTR_UDP: return "udp";
>>     case OVS_KEY_ATTR_ICMP: return "icmp";
>>     case OVS_KEY_ATTR_ICMPV6: return "icmpv6";
>>     case OVS_KEY_ATTR_ARP: return "arp";
>>     case OVS_KEY_ATTR_ND: return "nd";
>> -    case OVS_KEY_ATTR_TUN_ID: return "tun_id";
>> 
>>     case __OVS_KEY_ATTR_MAX:
>>     default:
>> @@ -601,13 +602,14 @@ odp_flow_key_attr_len(uint16_t type)
>>     switch ((enum ovs_key_attr) type) {
>>     case OVS_KEY_ATTR_ENCAP: return -2;
>>     case OVS_KEY_ATTR_PRIORITY: return 4;
>> -    case OVS_KEY_ATTR_TUN_ID: return 8;
>>     case OVS_KEY_ATTR_IN_PORT: return 4;
>>     case OVS_KEY_ATTR_ETHERNET: return sizeof(struct ovs_key_ethernet);
>>     case OVS_KEY_ATTR_VLAN: return sizeof(ovs_be16);
>>     case OVS_KEY_ATTR_ETHERTYPE: return 2;
>>     case OVS_KEY_ATTR_IPV4: return sizeof(struct ovs_key_ipv4);
>>     case OVS_KEY_ATTR_IPV6: return sizeof(struct ovs_key_ipv6);
>> +    case OVS_KEY_ATTR_TUN_ID: return 8;
>> +    case OVS_KEY_ATTR_IPV4_TUNNEL: return sizeof(struct ovs_key_ipv4_tunnel);
> 	
> Since the IP keys actually refer to the inner header I would logically
> group the tunnel attributes together where they are in the second
> block - after priority.
> 
OK, makes sense. Corrected.

>> diff --git a/lib/odp-util.h b/lib/odp-util.h
>> index 16f2b15..250302e 100644
>> --- a/lib/odp-util.h
>> +++ b/lib/odp-util.h
>> @@ -90,12 +91,12 @@ int odp_actions_from_string(const char *, const struct simap *port_names,
>>  *  OVS_KEY_ATTR_ICMPV6        2     2     4      8
>>  *  OVS_KEY_ATTR_ND           28    --     4     32
>>  *  -------------------------------------------------
>> - *  total                                       156
>> + *  total                                       184
>>  *
>>  * We include some slack space in case the calculation isn't quite right or we
>>  * add another field and forget to adjust this value.
>>  */
>> -#define ODPUTIL_FLOW_KEY_BYTES 200
>> +#define ODPUTIL_FLOW_KEY_BYTES 204
> 
> Since this is just an arbitrary amount of padding we can probably keep
> it at 200 for the time being as a nice round number.


Fixed.




More information about the dev mailing list