[ovs-dev] [PATCH 10/15] datapath: Add basic MPLS support to kernel

Wed Feb 27 15:39:02 UTC 2013

On Tue, Feb 26, 2013 at 11:10 PM, Simon Horman <horms at verge.net.au> wrote:
> On Tue, Feb 26, 2013 at 02:49:47PM -0800, Jesse Gross wrote:
>> A couple high level comments:
>>  * I'd like to nail down the format of the userspace/kernel interface
>> for multiple levels of tags.  We don't need to implement multiple
>> levels but we should have a good plan on how we would cleanly extend
>> the interface to do that in the future if we want.  Ben and I had
>> talked about using an array of tags in a single Netlink attribute,
>> which seems fairly clean and avoids needing lots of encap attributes.
>
> Is the implication here that the size of the array and thus depth of the
> tags is fixed?

I think we could use a variable sized array and then use the size from
the Netlink attribute to get the number of elements.  That way we
could easily expand it over time.

>> > diff --git a/datapath/actions.c b/datapath/actions.c
>> > index f638ffc..60522be 100644
>> > --- a/datapath/actions.c
>> > +++ b/datapath/actions.c
>> > +       new_mpls_lse = (__be32 *)(skb_mac_header(skb) + l2_size);
>> > +       *new_mpls_lse = mpls->mpls_lse;
>> > +
>> > +       set_ethertype(skb, mpls->mpls_ethertype);
>> > +       return 0;
>> > +}
>>
>> There are potentially a number of issues with offloading here: if we
>> have a packet that requires some form of offloading (say TSO) and we
>> push an MPLS header on the front, then it's quite likely that the NIC
>> can no longer handle the packet and we must emulate in software first.
>>  That would ideally mean GSO support for MPLS (as was done for vlan).
>
> I take it that would be work in the core kernel code.
> Could you give me a few pointers as to what I should be looking at?

It would mostly be the code around skb_gso_segment() that would need
to skip over and replicate the MPLS tags.  dev_hard_start_xmit() also
likely needs some work to make sure that we take the software path
instead of using hardware that can't offload.

>> > diff --git a/datapath/datapath.c b/datapath/datapath.c
>> > index 87c96ae..1386917 100644
>> > --- a/datapath/datapath.c
>> > +++ b/datapath/datapath.c
>> > +void skb_cb_set_l2_size(struct sk_buff *skb)
>> > +{
>> > +       struct ethhdr *eth;
>> > +       int nh_ofs;
>> > +       __be16 dl_type = 0;
>> > +
>> > +       skb_reset_mac_header(skb);
>> > +
>> > +       eth = eth_hdr(skb);
>> > +       nh_ofs = sizeof(struct ethhdr);
>> > +       if (likely(eth->h_proto >= htons(ETH_TYPE_MIN))) {
>> > +               dl_type = eth->h_proto;
>> > +
>> > +               while (dl_type == htons(ETH_P_8021Q) &&
>> > +                               skb->len >= nh_ofs + sizeof(struct vlan_hdr)) {
>> > +                       struct vlan_hdr *vh = (struct vlan_hdr*)(skb->data + nh_ofs);
>> > +                       dl_type = vh->h_vlan_encapsulated_proto;
>> > +                       nh_ofs += sizeof(struct vlan_hdr);
>> > +               }
>> > +
>> > +               OVS_CB(skb)->l2_size = nh_ofs;
>> > +       } else {
>> > +               OVS_CB(skb)->l2_size = 0;
>> > +       }
>> > +}
>>
>> It really seems like this should happen as part of flow extraction,
>> similar to the initialization of the other layer pointers, rather than
>> separately in every place that we get a packet.
>
> Yes, I think that should work.
>
> Thinking out aloud:
>
> I think that it should be possible to implement your idea by
> having ovs_flow_extract() set OVS_CB(skb)->l2_size.
>
> Currently skb_cb_set_l2_size() is called in two locations,
> ovs_packet_cmd_execute() and ovs_vport_receive().
>
> In the case of ovs_packet_cmd_execute() there is a previously a
> call to ovs_flow_extract().
>
> And in the case of ovs_vport_receive() there is subsequently a
> call to ovs_dp_process_received_packet() which in turn calls
> ovs_flow_extract() if OVS_CB(skb)->flow is not set.

All of that sounds right to me.

>> > diff --git a/datapath/flow.c b/datapath/flow.c
>> > index fad9e19..27e1920 100644
>> > --- a/datapath/flow.c
>> > +++ b/datapath/flow.c
>> > @@ -728,6 +728,17 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key,
>> >                         memcpy(key->ipv4.arp.tha, arp->ar_tha, ETH_ALEN);
>> >                         key_len = SW_FLOW_KEY_OFFSET(ipv4.arp);
>> >                 }
>> > +       } else if (eth_p_mpls(key->eth.type)) {
>> > +               error = check_header(skb, MPLS_HLEN);
>> > +               if (unlikely(error))
>> > +                       goto out;
>> > +
>> > +               key_len = SW_FLOW_KEY_OFFSET(mpls.top_lse);
>> > +               memcpy(&key->mpls.top_lse, skb_network_header(skb), MPLS_HLEN);
>> > +
>> > +               /* Update network header */
>> > +               skb_set_network_header(skb, skb_network_header(skb) -
>> > +                                      skb->data + MPLS_HLEN);
>>
>> Is it possible to actually use this network header pointer?
>
> It is possible to use it in ovs_flow_extract_l3_onwards() in conjunction
> with the inner/outer flow changes I have in other patches (and you have
> asked me to drop).
>
> It does seem to me that it may be incorrect if there is more than one MPLS
> LSE present in the frame.
>
> But I'm not sure if either of those answers relate to the point you are
> making. Are there use-cases which concern you?

I was just unsure of when it would be useful.  It seems like if we're
not going to try to act beyond the MPLS labels that we can parse then
it makes the most sense to just leave the network layer unset (maybe
we can use it to point to the start of the MPLS stack instead of
needing a special l2 length then?).