[ovs-dev] [PATCH 2/2] datapath: Add Stateless TCP Tunneling protocol.

Jesse Gross jesse at nicira.com
Mon Mar 23 19:23:43 UTC 2015


On Mon, Mar 9, 2015 at 3:12 PM, Pravin B Shelar <pshelar at nicira.com> wrote:
>  openvswitch_headers = \
> diff --git a/datapath/linux/.gitignore b/datapath/linux/.gitignore
> index 69d6658..1d1aae7 100644
> --- a/datapath/linux/.gitignore
> +++ b/datapath/linux/.gitignore
> @@ -50,6 +50,7 @@
>  /vport-lisp.c
>  /vport-netdev.c
>  /vport-patch.c
> +/vport-stt.c
>  /vport-vxlan.c
>  /vport.c
>  /vxlan.c

Should we add stt.c as well?

> diff --git a/datapath/linux/compat/stt.c b/datapath/linux/compat/stt.c
> new file mode 100644
> index 0000000..4cb4ea6
> --- /dev/null
> +++ b/datapath/linux/compat/stt.c
> +static int skb_list_linearize(struct sk_buff *head, gfp_t gfp_mask)
> +{
> +       struct sk_buff *skb;
> +       int tlen = 0;
> +       int err;
> +
> +       err = skb_linearize(head);
> +       if (err)
> +               return err;
> +
> +       skb = head->next;
> +       while (skb) {
> +               tlen += skb->len;
> +               skb = skb->next;
> +       }
> +       err = pskb_expand_head(head, 0, tlen, gfp_mask);
> +       if (err)
> +               return err;

Were you going to try to avoid linearizing and then expanding as Alex
had mentioned in his comments?

> +static int __build_segments(struct sk_buff **headp, bool ipv4, int l4_offset)
> +{
> +       struct sk_buff *head = *headp;
> +       struct sk_buff *nskb = NULL;
> +       struct sk_buff *rskb, *skb;
> +       struct tcphdr *tcph;
> +       int seg_len = 0;
> +       int hdr_len;
> +       int tcp_len;
> +       u32 seq;
> +
> +       /* GRO can produce skbs with only the headers, which we've
> +        * already pulled off.  We can just dump them.
> +        */
> +       while (head->len == 0) {
> +               nskb = head->next;
> +               copy_skb_metadata(nskb, head);
> +               consume_skb(head);
> +               head = nskb;
> +       }
> +       *headp = head;

> +       tcph = (struct tcphdr *)(head->data + l4_offset);
> +       tcp_len = tcph->doff * 4;
> +       hdr_len = l4_offset + tcp_len;

I think there is something wrong with the ordering of the calls in
stt_rcv(). reassemble() will just produce a linked list of skbs that
are expected to be joined here but we've already called functions that
do pskb_may_pull(), which won't traverse that list. In particular, it
doesn't make sense to both remove zero length skbs above and expect
that there is a TCP header available below.

> +
> +       if (unlikely((tcp_len < sizeof(struct tcphdr)) ||
> +                    (head->len < hdr_len)))
> +               return -EINVAL;
> +
> +       if (unlikely(!pskb_may_pull(head, hdr_len)))
> +               return -ENOMEM;

It seems risky to me to combine skb coalescing with TCP header
updating as we would need to very carefully check boundary conditions.
I feel like there are two halves of the loop anyways, so it seems
better to just break them apart.

> +static int __segment_skb(struct sk_buff **headp, bool ipv4, bool tcp, bool csum_partial, int l4_offset)
> +{
> +       int err;
> +
> +       err = straighten_frag_list(headp);
> +       if (unlikely(err))
> +               return err;
> +
> +       if (__linearize(*headp, ipv4, tcp, csum_partial)) {
> +               return skb_list_linearize(*headp, GFP_ATOMIC);
> +       }
> +
> +       if ((*headp)->next) {

I think we can push this check up above linearization. If there is no
list of skbs, we don't need to linearize either.

> +static int segment_skb(struct sk_buff **headp)
> +{

Isn't there a check for frag_list for the receive path missing
somewhere? It seems like linearization happens somewhat
unconditionally.

> +static int skb_list_xmit(struct rtable *rt, struct sk_buff *skb, __be32 src,
> +                        __be32 dst, __u8 tos, __u8 ttl, __be16 df)
> +{
> +       int len = 0;
> +
> +       while (skb) {
> +               struct sk_buff *next = skb->next;
> +
> +               if (next)
> +                       dst_clone(&rt->dst);
> +
> +               skb->next = NULL;
> +               len += iptunnel_xmit(NULL, rt, skb, src, dst, IPPROTO_TCP,
> +                                    tos, ttl, df, false);

I think there may be a problem in our version of ip_local_out() if it
thinks that fix_segment is set in the CB, so we need to find a way to
make sure that it is cleared.

> +int stt_xmit_skb(struct sk_buff *skb, struct rtable *rt,
> +                __be32 src, __be32 dst, __u8 tos,
> +                __u8 ttl, __be16 df, __be16 src_port, __be16 dst_port,
> +                __be64 tun_id)
[...]
> +       while (skb) {
> +               struct sk_buff *next_skb = skb->next;
> +
> +               skb->next = NULL;
> +
> +               if (next_skb)
> +                       dst_clone(&rt->dst);
> +
> +               /* Push STT and TCP header. */
> +               skb = push_stt_header(skb, tun_id, src_port, dst_port, src,
> +                                     dst, inner_h_proto, inner_nw_proto,
> +                                     dst_mtu(&rt->dst));
> +               if (unlikely(!skb))
> +                       goto next;

I think it's somewhat surprising for a function called
push_stt_header() to free the skb in the event of an error. I actually
think that there is a double free already between __push_stt_header()
and push_stt_header() so it seems like it is better to just propagate
the error back here and then free.

> +static bool set_offloads(struct sk_buff *skb)
> +{
[...]
> +       if (proto_type == (STT_PROTO_IPV4 | STT_PROTO_TCP)) {

Should we convert this to a switch statement? It seems mildly cleaner.

> +void stt_sock_release(struct net *net, __be16 port)

Is there a reason to not just pass in stt_sock here? It seems cleaner
and more consistent.

> +       struct stt_sock *stt_sock;
> +
> +       mutex_lock(&stt_mutex);
> +       rcu_read_lock();
> +       stt_sock = stt_find_sock(net, port);
> +       rcu_read_unlock();

If we can't pass in stt_sock then I don't think we need rcu_read_lock
since we're holding the mutex.

> diff --git a/datapath/vport-stt.c b/datapath/vport-stt.c
> new file mode 100644
> index 0000000..ccd4c71
> --- /dev/null
> +++ b/datapath/vport-stt.c
> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,5,0)

I think we need a check for CONFIG_NETFILTER someplace, otherwise the
registration mechanism won't compile. I wonder if both of these checks
should be pushed down to stt.c though since that's where they actually
matter.

> diff --git a/datapath/vport.h b/datapath/vport.h
> index e256fc0..50e6289 100644
> --- a/datapath/vport.h
> +++ b/datapath/vport.h
> @@ -35,6 +35,7 @@ struct vport_parms;
>  struct vport_net {
>         struct vport __rcu *gre_vport;
>         struct vport __rcu *gre64_vport;
> +       struct vport __rcu *stt_vport;

I think this is no longer used now that the port number is configurable.

> diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
> index 8e1b542..bd7f146 100644
> --- a/lib/netdev-vport.c
> +++ b/lib/netdev-vport.c
> @@ -153,7 +154,7 @@ netdev_vport_needs_dst_port(const struct netdev *dev)
>
>      return (class->get_config == get_tunnel_config &&
>              (!strcmp("geneve", type) || !strcmp("vxlan", type) ||
> -             !strcmp("lisp", type)));
> +             !strcmp("lisp", type) || !strcmp("stt", type)) );
>  }

I think we need some additional support for configurable STT ports in
netdev_vport_construct() and get_tunnel_config().

> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index dac3756..1b9da1d 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> +          <dt><code>stt</code></dt>
> +          <dd>
> +             The Stateless TCP Tunnel (STT) protocol encapsulates traffic in
> +             IPv4/TCP packets.  All traffic uses a destination port of 7471.
> +             The STT protocol does not engage in the usual TCP 3-way handshake,
> +             so it will have difficulty traversing stateful firewalls.
> +             STT is only available in kernel datapath on kernel 3.5 or newer.
> +          </dd>

Can you make this a little more descriptive? I think right now it is
hard to understand when this should or should not be used.



More information about the dev mailing list