[ovs-dev] [PATCH 10/15] datapath: Add basic MPLS support to kernel

Wed Feb 27 07:10:01 UTC 2013

On Tue, Feb 26, 2013 at 02:49:47PM -0800, Jesse Gross wrote:
> On Fri, Feb 15, 2013 at 1:56 AM, Simon Horman <horms at verge.net.au> wrote:
> > Allow datapath to recognize and extract MPLS labels into flow keys
> > and execute actions which push, pop, and set labels on packets.
> >
> > Based heavily on work by Leo Alterman and Ravi K.
> >
> > Cc: Ravi K <rkerur at gmail.com>
> > Cc: Leo Alterman <lalterman at nicira.com>
> > Reviewed-by: Isaku Yamahata <yamahata at valinux.co.jp>
> > Signed-off-by: Simon Horman <horms at verge.net.au>
> 
> I received some sparse errors:
>   CHECK   /home/jgross/openvswitch/datapath/linux/datapath.c
> /home/jgross/openvswitch/datapath/linux/datapath.c:74:5: warning:
> symbol 'ovs_dp_ioctl_hook' was not declared. Should it be static?
> /home/jgross/openvswitch/datapath/linux/datapath.c:87:13: warning:
> restricted __be16 degrades to integer
> /home/jgross/openvswitch/datapath/linux/datapath.c:87:13: warning:
> restricted __be16 degrades to integer
> 
> The first one looks like it is due to a rebase error and the second
> one seems like a real endian problem.

I'm unsure how I missed those. I'll fix them up.

> A couple high level comments:
>  * I'd like to nail down the format of the userspace/kernel interface
> for multiple levels of tags.  We don't need to implement multiple
> levels but we should have a good plan on how we would cleanly extend
> the interface to do that in the future if we want.  Ben and I had
> talked about using an array of tags in a single Netlink attribute,
> which seems fairly clean and avoids needing lots of encap attributes.

Is the implication here that the size of the array and thus depth of the
tags is fixed?

>  * I'm not really excited about enabling userspace to work blind, such
> as popping off multiple levels of tags or manipulating protocols
> beyond what it can see.  This is currently a problem for vlans and the
> combination of MPLS and vlans makes things worse.  Not to sound like a
> broken record but recirculation would solve this problem once and for
> all.  Also, in the case of vlans, even though OpenFlow and OVS do
> support this behavior currently, I don't know of anyone using it since
> they don't have any visibility into the packet either.

Thanks. We spoke about this off-line.

I'm not entirely sure how the complexity of recirculation will play out.
But I intend to work on an prototype and post it to this list. No doubt
I will have questions along the way.

> I'll make a few comments on the code below but the second one impacts
> a number of places so it's important to get resolved first.
> 
> > diff --git a/datapath/actions.c b/datapath/actions.c
> > index f638ffc..60522be 100644
> > --- a/datapath/actions.c
> > +++ b/datapath/actions.c
> > +static int push_mpls(struct sk_buff *skb, const struct ovs_action_push_mpls *mpls)
> > +{
> > +       u32 l2_size;
> > +       __be32 *new_mpls_lse;
> > +
> > +       if (skb_cow_head(skb, MPLS_HLEN) < 0) {
> > +               kfree_skb(skb);
> > +               return -ENOMEM;
> > +       }
> 
> I don't think there is a reason to free the skb on error here since it
> will get handled later on like other actions.  vlans require a special
> case but that doesn't mean that we have to propagate it.

Thanks, I will fix that.

> > +       l2_size = skb_cb_l2_size(skb);
> > +       skb_push(skb, MPLS_HLEN);
> > +       memmove(skb_mac_header(skb) - MPLS_HLEN, skb_mac_header(skb), l2_size);
> > +       skb_reset_mac_header(skb);
> 
> Should we set the mac len as well?

Yes, I think adding call to skb_reset_mac_len() is appropriate both here
in push_mpls() and in pop_mpls().

> > +       new_mpls_lse = (__be32 *)(skb_mac_header(skb) + l2_size);
> > +       *new_mpls_lse = mpls->mpls_lse;
> > +
> > +       set_ethertype(skb, mpls->mpls_ethertype);
> > +       return 0;
> > +}
> 
> There are potentially a number of issues with offloading here: if we
> have a packet that requires some form of offloading (say TSO) and we
> push an MPLS header on the front, then it's quite likely that the NIC
> can no longer handle the packet and we must emulate in software first.
>  That would ideally mean GSO support for MPLS (as was done for vlan).

I take it that would be work in the core kernel code.
Could you give me a few pointers as to what I should be looking at?

> 
> In addition, as MPLS labels are modified we need to update skb->csum
> in the CHECKSUM_COMPLETE case (it looks like we have a similar issue
> for some of the other modifications as well).

Thanks. I agree that seems to be a problem.

> > +static int set_mpls(struct sk_buff *skb, const __be32 *mpls_lse)
> > +{
> > +       __be16 current_ethertype = get_ethertype(skb);
> > +       if (eth_p_mpls(current_ethertype))
> > +               memcpy(skb_cb_mpls_bos(skb), mpls_lse, sizeof(__be32));
> 
> You could use MPLS_HLEN here.

Sure.

> > diff --git a/datapath/datapath.c b/datapath/datapath.c
> > index 87c96ae..1386917 100644
> > --- a/datapath/datapath.c
> > +++ b/datapath/datapath.c
> > +void skb_cb_set_l2_size(struct sk_buff *skb)
> > +{
> > +       struct ethhdr *eth;
> > +       int nh_ofs;
> > +       __be16 dl_type = 0;
> > +
> > +       skb_reset_mac_header(skb);
> > +
> > +       eth = eth_hdr(skb);
> > +       nh_ofs = sizeof(struct ethhdr);
> > +       if (likely(eth->h_proto >= htons(ETH_TYPE_MIN))) {
> > +               dl_type = eth->h_proto;
> > +
> > +               while (dl_type == htons(ETH_P_8021Q) &&
> > +                               skb->len >= nh_ofs + sizeof(struct vlan_hdr)) {
> > +                       struct vlan_hdr *vh = (struct vlan_hdr*)(skb->data + nh_ofs);
> > +                       dl_type = vh->h_vlan_encapsulated_proto;
> > +                       nh_ofs += sizeof(struct vlan_hdr);
> > +               }
> > +
> > +               OVS_CB(skb)->l2_size = nh_ofs;
> > +       } else {
> > +               OVS_CB(skb)->l2_size = 0;
> > +       }
> > +}
> 
> It really seems like this should happen as part of flow extraction,
> similar to the initialization of the other layer pointers, rather than
> separately in every place that we get a packet.

Yes, I think that should work.

Thinking out aloud:

I think that it should be possible to implement your idea by
having ovs_flow_extract() set OVS_CB(skb)->l2_size.

Currently skb_cb_set_l2_size() is called in two locations,
ovs_packet_cmd_execute() and ovs_vport_receive().

In the case of ovs_packet_cmd_execute() there is a previously a
call to ovs_flow_extract().

And in the case of ovs_vport_receive() there is subsequently a
call to ovs_dp_process_received_packet() which in turn calls
ovs_flow_extract() if OVS_CB(skb)->flow is not set.

> However, I don't
> think that we should be skipping over more vlan headers than we can
> parse anyways so if we follow the model that I mentioned above then
> the initialization should just happen naturally.

I'm not quite sure that I follow this but as ovs_flow_extract() parses
VLAN headers it seems that moving the logic there should resolve the
problem you are describing.

> > +unsigned char *skb_cb_mpls_bos(const struct sk_buff *skb)
> > +{
> > +       return skb_mac_header(skb) + OVS_CB(skb)->l2_size;
> > +}
> 
> Isn't this actually supposed to point to the top of the stack? (I
> think it does but the name seems wrong.)

Yes, I agree it should be the top of the stack and the name seems
entirely inappropriate.

Perhaps a better name would be skb_cb_mpls_stack().
Or skb_cb_mpls_outermost_lse() or just skb_cb_mpls().

> > +ptrdiff_t skb_cb_l2_size(const struct sk_buff *skb)
> > +{
> > +       return OVS_CB(skb)->l2_size;
> > +}
> 
> It seems like in some places we use this accessor function and in
> other places just retrieve the size directly.  Does it provide much
> value?

I don't see any at this stage, I will remove it.

> > @@ -667,6 +706,11 @@ static int validate_set(const struct nlattr *a,
> >
> >                 return validate_tp_port(flow_key);
> >
> > +       case OVS_KEY_ATTR_MPLS:
> > +               if (!eth_p_mpls(flow_key->eth.type))
> > +                       return -EINVAL;
> 
> I think this will fail if you try to first push and then set an MPLS header.

That does seem likely. I'll verify that and fix it as necessary.

> > diff --git a/datapath/flow.c b/datapath/flow.c
> > index fad9e19..27e1920 100644
> > --- a/datapath/flow.c
> > +++ b/datapath/flow.c
> > @@ -728,6 +728,17 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
> >                         memcpy(key->ipv4.arp.tha, arp->ar_tha, ETH_ALEN);
> >                         key_len = SW_FLOW_KEY_OFFSET(ipv4.arp);
> >                 }
> > +       } else if (eth_p_mpls(key->eth.type)) {
> > +               error = check_header(skb, MPLS_HLEN);
> > +               if (unlikely(error))
> > +                       goto out;
> > +
> > +               key_len = SW_FLOW_KEY_OFFSET(mpls.top_lse);
> > +               memcpy(&key->mpls.top_lse, skb_network_header(skb), MPLS_HLEN);
> > +
> > +               /* Update network header */
> > +               skb_set_network_header(skb, skb_network_header(skb) -
> > +                                      skb->data + MPLS_HLEN);
> 
> Is it possible to actually use this network header pointer?

It is possible to use it in ovs_flow_extract_l3_onwards() in conjunction
with the inner/outer flow changes I have in other patches (and you have
asked me to drop).

It does seem to me that it may be incorrect if there is more than one MPLS
LSE present in the frame.

But I'm not sure if either of those answers relate to the point you are
making. Are there use-cases which concern you?

> > @@ -838,6 +849,7 @@ const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
> >         [OVS_KEY_ATTR_ETHERNET] = sizeof(struct ovs_key_ethernet),
> >         [OVS_KEY_ATTR_VLAN] = sizeof(__be16),
> >         [OVS_KEY_ATTR_ETHERTYPE] = sizeof(__be16),
> > +       [OVS_KEY_ATTR_MPLS] = sizeof(struct ovs_key_mpls),
> 
> Can you put this under the non-upstream group?

Yes, sure. Sorry for not doing so earlier.

> > @@ -1473,6 +1495,15 @@ int ovs_flow_to_nlattrs(const struct sw_flow_key *swkey, struct sk_buff *skb)
> >                 arp_key->arp_op = htons(swkey->ip.proto);
> >                 memcpy(arp_key->arp_sha, swkey->ipv4.arp.sha, ETH_ALEN);
> >                 memcpy(arp_key->arp_tha, swkey->ipv4.arp.tha, ETH_ALEN);
> > +       } else if (eth_p_mpls(swkey->eth.type)) {
> > +               struct ovs_key_mpls *mpls_key;
> > +
> > +               nla = nla_reserve(skb, OVS_KEY_ATTR_MPLS, sizeof(*mpls_key));
> > +               if (!nla)
> > +                       goto nla_put_failure;
> > +               mpls_key = nla_data(nla);
> > +               memset(mpls_key, 0, sizeof(struct ovs_key_mpls));
> 
> Is there any reason for this memset here?

None. I assume that it is a legacy of a now distant version of this code.
I will remove it.

> > diff --git a/datapath/flow.h b/datapath/flow.h
> > index 6949640..d8e350c 100644
> > --- a/datapath/flow.h
> > +++ b/datapath/flow.h
> > @@ -73,6 +73,9 @@ struct sw_flow_key {
> >                 __be16 type;            /* Ethernet frame type. */
> >         } eth;
> >         struct {
> > +               __be32 top_lse;         /* top label stack entry */
> > +       } mpls;
> 
> We should be able to make this a union with the IP headers: since we
> stop parsing after the MPLS header we can never have both at the same
> time so we can avoid expanding the size of the struct in the non-MPLS
> case.

Sure.

In the context of my inner/outer flow work that was not true, but
as you have asked me to look at implementing recirculation it should be
fine. It can always be moved if that turns out not to be the case.

> 
> > +#define ETH_TYPE_MIN 0x600
> 
> This really seems like it belongs in the upstream kernel someplace
> since there's a million uses of this constant.

It seems to be used about 1/2 dozen times.
I'll see about making a patch.