[ovs-dev] [PATCH 2/2] datapath: Add Stateless TCP Tunneling protocol.

Jesse Gross jesse at nicira.com
Tue Mar 24 19:58:08 UTC 2015


On Tue, Mar 24, 2015 at 1:21 PM, Pravin Shelar <pshelar at nicira.com> wrote:
> On Mon, Mar 23, 2015 at 12:23 PM, Jesse Gross <jesse at nicira.com> wrote:
>> On Mon, Mar 9, 2015 at 3:12 PM, Pravin B Shelar <pshelar at nicira.com> wrote:
>>> diff --git a/datapath/linux/compat/stt.c b/datapath/linux/compat/stt.c
>>> new file mode 100644
>>> index 0000000..4cb4ea6
>>> --- /dev/null
>>> +++ b/datapath/linux/compat/stt.c
>>> +static int __build_segments(struct sk_buff **headp, bool ipv4, int l4_offset)
>>> +{
>>> +       struct sk_buff *head = *headp;
>>> +       struct sk_buff *nskb = NULL;
>>> +       struct sk_buff *rskb, *skb;
>>> +       struct tcphdr *tcph;
>>> +       int seg_len = 0;
>>> +       int hdr_len;
>>> +       int tcp_len;
>>> +       u32 seq;
>>> +
>>> +       /* GRO can produce skbs with only the headers, which we've
>>> +        * already pulled off.  We can just dump them.
>>> +        */
>>> +       while (head->len == 0) {
>>> +               nskb = head->next;
>>> +               copy_skb_metadata(nskb, head);
>>> +               consume_skb(head);
>>> +               head = nskb;
>>> +       }
>>> +       *headp = head;
>>
>>> +       tcph = (struct tcphdr *)(head->data + l4_offset);
>>> +       tcp_len = tcph->doff * 4;
>>> +       hdr_len = l4_offset + tcp_len;
>>
>> I think there is something wrong with the ordering of the calls in
>> stt_rcv(). reassemble() will just produce a linked list of skbs that
>> are expected to be joined here but we've already called functions that
>> do pskb_may_pull(), which won't traverse that list. In particular, it
>> doesn't make sense to both remove zero length skbs above and expect
>> that there is a TCP header available below.
>>
> I can remove skb zero len check that is not required. skb should
> always have TCP header since reassemble() used it to build the list.

What I mean is that the linked list created by reassemble() isn't
considered to be a single skb - it's a list of separate skbs.
Therefore, when we call pskb_may_pull() or similar functions that are
expected to bring data from different memory regions into the linear
data area, they won't do that. All of the skb stitching code should
happen before we start accessing the data region, otherwise it will
have to deal with the problem itself.

>>> +
>>> +       if (unlikely((tcp_len < sizeof(struct tcphdr)) ||
>>> +                    (head->len < hdr_len)))
>>> +               return -EINVAL;
>>> +
>>> +       if (unlikely(!pskb_may_pull(head, hdr_len)))
>>> +               return -ENOMEM;
>>
>> It seems risky to me to combine skb coalescing with TCP header
>> updating as we would need to very carefully check boundary conditions.
>> I feel like there are two halves of the loop anyways, so it seems
>> better to just break them apart.
>>
> Do you mean convert it into two separate loops?

Yes, that's what I was thinking.

>>> +static int skb_list_xmit(struct rtable *rt, struct sk_buff *skb, __be32 src,
>>> +                        __be32 dst, __u8 tos, __u8 ttl, __be16 df)
>>> +{
>>> +       int len = 0;
>>> +
>>> +       while (skb) {
>>> +               struct sk_buff *next = skb->next;
>>> +
>>> +               if (next)
>>> +                       dst_clone(&rt->dst);
>>> +
>>> +               skb->next = NULL;
>>> +               len += iptunnel_xmit(NULL, rt, skb, src, dst, IPPROTO_TCP,
>>> +                                    tos, ttl, df, false);
>>
>> I think there may be a problem in our version of ip_local_out() if it
>> thinks that fix_segment is set in the CB, so we need to find a way to
>> make sure that it is cleared.
>>
>
> right, I will fix STT and I will send another patch for other vport types.

I think the existing vports should be OK because their handle offloads
compat code already knows how to initialize this.

>>> +int stt_xmit_skb(struct sk_buff *skb, struct rtable *rt,
>>> +                __be32 src, __be32 dst, __u8 tos,
>>> +                __u8 ttl, __be16 df, __be16 src_port, __be16 dst_port,
>>> +                __be64 tun_id)
>> [...]
>>> +       while (skb) {
>>> +               struct sk_buff *next_skb = skb->next;
>>> +
>>> +               skb->next = NULL;
>>> +
>>> +               if (next_skb)
>>> +                       dst_clone(&rt->dst);
>>> +
>>> +               /* Push STT and TCP header. */
>>> +               skb = push_stt_header(skb, tun_id, src_port, dst_port, src,
>>> +                                     dst, inner_h_proto, inner_nw_proto,
>>> +                                     dst_mtu(&rt->dst));
>>> +               if (unlikely(!skb))
>>> +                       goto next;
>>
>> I think it's somewhat surprising for a function called
>> push_stt_header() to free the skb in the event of an error. I actually
>> think that there is a double free already between __push_stt_header()
>> and push_stt_header() so it seems like it is better to just propagate
>> the error back here and then free.
>>
>
> I have to free skb in push_stt header since it can change skb, so I
> will fix the double free bug.

Hmm, ok.

>>> +void stt_sock_release(struct net *net, __be16 port)
>>
>> Is there a reason to not just pass in stt_sock here? It seems cleaner
>> and more consistent.
>>
> I need to take the mutex to do lookup so I can not do it from vport-stt.c.

I don't think that we need to a look it up at all. In
stt_tnl_destroy() we already have stt_port->stt_sock so I think we can
pass this in directly.

As an aside, I don't believe that stt_mutex() is actually required
because the only user is OVS and the critical section is also
protected by ovs_mutex. However, I understand if you want to keep
stt_mutex to more closely mirror the other upstream tunnel types.

>>> +       struct stt_sock *stt_sock;
>>> +
>>> +       mutex_lock(&stt_mutex);
>>> +       rcu_read_lock();
>>> +       stt_sock = stt_find_sock(net, port);
>>> +       rcu_read_unlock();
>>
>> If we can't pass in stt_sock then I don't think we need rcu_read_lock
>> since we're holding the mutex.
>>
> It is just to keep RCU checker happy.

I think it should be OK without it - list_for_each_entry_rcu() doesn't
purposely doesn't trigger RCU warnings.

>>> diff --git a/datapath/vport-stt.c b/datapath/vport-stt.c
>>> new file mode 100644
>>> index 0000000..ccd4c71
>>> --- /dev/null
>>> +++ b/datapath/vport-stt.c
>>> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(3,5,0)
>>
>> I think we need a check for CONFIG_NETFILTER someplace, otherwise the
>> registration mechanism won't compile. I wonder if both of these checks
>> should be pushed down to stt.c though since that's where they actually
>> matter.
>>
> ok, I will add stt stub function to handle case without netfilter.

I wonder if it is better to just make the whole thing not compile like
we do if the kernel version is too low. After all, there's not much
point in a STT port that can't receive packets so it could be better
to just not allow it's creation.



More information about the dev mailing list