[ovs-dev] [PATCH 8/8] gre: Add support for path MTU discovery.
Jesse Gross
jesse at nicira.com
Fri Mar 5 18:02:33 UTC 2010
Thanks for the quick review! Some responses below.
On Thu, Mar 4, 2010 at 6:50 PM, Ben Pfaff <blp at nicira.com> wrote:
> On Thu, Mar 04, 2010 at 01:22:13PM -0500, Jesse Gross wrote:
> > + encap_proto = *(__be16 *)((skb->data +
> > + (ip_hdr(skb)->ihl << 2)) + 2);
>
> Can 'ip_hdr(skb)->ihl << 2' with a malicious ihl == 15 value cause us to
> read past the end of the header?
>
Not here since we already checked the GRE header size but I added a check
for the flags, which are used to compute the header size.
>
> The "+ 2" here looks like magic. I guess there must be a gre_hdr
> somewhere (I see one in nf_conntrack_proto_gre, for example) to reduce
> the magic?
>
>
I cleaned this up a little bit but since most of the fields in the GRE
header are optional they move around and there isn't really a structure to
make it all look nice. The one in nf_conntrack_proto_gre really just covers
the flags, so it isn't too useful.
> + if (skb->protocol == htons(ETH_P_IP)) {
> > + if (mtu < 68) {
> > + if (ntohs(ip_hdr(skb)->tot_len) >= 68)
> > + mtu = 68;
>
> It took me a few minutes to figure out where this number came from. I
> guess it is from RFC 791: "Every internet module must be able to forward
> a datagram of 68 octets without further fragmentation. This is because
> an internet header may be up to 60 octets, and the minimum fragment is 8
> octets." I don't know whether that is obvious to everyone else.
>
> I wonder whether the kernel upstream would be open to adding an
> IP_MIN_MTU macro.
>
I think the reason why there is no IP_MIN_MTU is that it isn't very clearly
defined. 68 is the absolute lowest MTU, 576 is the smallest packet size but
it can be fragmented, various RFCs say that the MTU should be
"reasonable"... IPv6 is much better defined because all of these values are
1280. I added a local macro with a comment but I think the situation is too
confused for it to be generally applicable.
>
> I don't see a guarantee that ->tot_len doesn't read past the end of the
> header (and the same for the later IPv6 case).
>
We're not actually reading data from the entire packet here, merely checking
the length of the original packet. When we actually build the ICMP response
it checks the length of whatever piece it needs. However, the size of the
IPv6 header wasn't being checked, so actually reading ->payload_len could be
a problem. I added a check for that case.
>
> > + else
> > + goto out;
> > + }
> > +
> > + header_len += sizeof(struct iphdr);
>
> Should this be "header_len += ip_hdrlen(ip_hdr(skb))?"
>
No, this is just used to check that we can access the basic fields in the IP
header. I don't read any of the options.
>
> > + } else if (skb->protocol == htons(ETH_P_IPV6)) {
> > + if (mtu < IPV6_MIN_MTU) {
> > + unsigned int packet_length = sizeof(struct ipv6hdr)
> +
> > +
> ntohs(ipv6_hdr(skb)->payload_len);
> > +
> > + if (packet_length >= IPV6_MIN_MTU
> > + || ntohs(ipv6_hdr(skb)->payload_len) == 0)
> > + mtu = IPV6_MIN_MTU;
> > + else
> > + goto out;
> > + }
> > +
> > + header_len += sizeof(struct ipv6hdr);
>
> I don't know whether there is a need to skip past IPv6 options.
>
Same comment as with IPv4.
> > @@ -491,9 +567,11 @@ static void ipgre_err(struct sk_buff *skb, u32
> info)
> > }
> >
> > /* If only 8 bytes returned, keyed message will be dropped here */
> > - if (skb_headlen(skb) < grehlen)
> > + if (!pskb_may_pull(skb, grehlen))
> > return;
>
> Wow, it's odd that pskb_may_pull() wasn't already used here. I wonder
> why.
>
I think it is just laziness: this prevents you from having to reset
pointers. I suspect the data is in the linear data portion pretty much all
the time anyways.
> > + if (*icmp_typep > NR_ICMP_TYPES
> > + || (*icmp_typep <= ICMP_PARAMETERPROB
> > + && *icmp_typep != ICMP_ECHOREPLY
> > + && *icmp_typep != ICMP_ECHO))
> > + return false;
>
> RFC 792 just says "no ICMP messages are sent about ICMP messages." This
> seems to violate that. Is it for a good reason?
>
The great thing about standards is there are so many to choose from... RFC
1122 clarifies that ICMP messages are not to be sent in response to an ICMP
error message. This is also what Linux implements.
> > + get_random_bytes(&iph->id, sizeof iph->id);
>
> This selection of 'id' seems poor. There is the ip_select_ident()
> function, but we don't have a dst_entry here. Maybe we could get one.
> Or we could use 0 and set the DF bit, since the id is only used for
> reassembly.
I don't think there is a really good way to come up with an id. Getting a
dst_entry is not really appropriate because packets are just passing through
this machine, it doesn't really have any knowledge or state of the overlay
network that is useful. We are essentially spoofing packets from the
destination of the original packet so we are trying to avoid collisions with
its id field. ip_select_ident falls back to random data in some
circumstances anyways, so it isn't so different. Plus, I'm not sure that
the consequences of a collision are that bad anyways: incorrect fragments
will be reassembled, the checksum will fail, and the packet will be
discarded. Hopefully we will be luckier next time.
Conversely, setting the DF bit could be bad if someone needs to fragment
because the packet will be dropped. We don't maintain any state, so we
can't resend the fragmentation needed message, which will cause packets to
be blackholed. If fragmentation is not needed, then the id field isn't used
anyways.
>
> > + nskb->csum = skb_copy_and_csum_bits(skb, (u8 *)old_iph - skb->data,
> > + payload, payload_length,
> > + nskb->csum);
> > + icmph->checksum = csum_fold(nskb->csum);
>
> Is there value in setting nskb->csum here? Should we set
> nskb->ip_summed to CHECKSUM_NONE? That would match what
> icmp_push_reply() in net/ipv4/icmp.c does in seemingly similar
> circumstances.
>
No, it's just being used as a temporary location. ip_summed is initialized
to CHECKSUM_NONE on allocation, so it is the same behavior as
icmp_push_reply.
> > + nskb = netdev_alloc_skb(dev, NET_IP_ALIGN + eth_hdr_len
> > + + header_length + payload_length);
>
> I don't know whether netdev_alloc_skb() is considered OK to use for
> allocating an sk_buff to *send*. The uses within net/ seem to be
> oriented around *receiving*, and the comment on netdev_alloc_skb() says
> that it is for allocating a buffer for *receiving* on a given device.
>
We're pretending that this packet was received in response to something that
was transmitted - it gets handed up by a call to netif_rx.
>
> > + if (!nskb)
> > + return false;
> > +
> > + skb_reserve(nskb, NET_IP_ALIGN);
>
> But if netdev_alloc_skb() is OK, then you could use
> netdev_alloc_skb_ip_align() to avoid writing the skb_reserve()
> explicitly.
>
I didn't use it because it was just added in 2.6.33 but I added it to our
compatibility layer and called that instead.
>
> I would be tempted to write this as something like
>
> eh->h_proto = old_eh->h_proto;
> if (eh->h_proto == htons(ETH_P_8021Q)) {
> struct vlan_ethhdr *vh = (struct vlan_ethhdr *)eh;
>
> vh->h_vlan_TCI = vlan_eth_hdr(skb)->h_vlan_TCI;
> vh->h_vlan_encapsulated_proto = skb->protocol;
> }
>
That's a good idea.
>
> (or would it even be possible to memcpy the vlan_eth_hdr from the old
> skb into the new one?)
>
No - we still have to swap the addresses.
>
> > @@ -758,6 +1143,9 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff
> *skb, struct net_device *dev
> > skb->protocol = 0;
> > }
> >
> > + if (dev->br_port)
> > + skb_dst_drop(skb);
> > +
>
> I'm curious about why skb_dst_drop(skb) is conditional on the device
> being a bridge port.
>
It was originally a sanity check since bridged traffic should not have a dst
set and I was trying to ensure that non-local traffic was not affected in
any way by the local routing table. dst is dropped for all traffic later
on, so it's not leaking references. However, I took it out because any
non-local traffic won't have a dst and any packet with dst must be local and
there for OK to use the routing table.
>
> > @@ -898,7 +1298,7 @@ static netdev_tx_t ipgre_tunnel_xmit(struct sk_buff
> *skb, struct net_device *dev
> >
> > max_headroom = LL_RESERVED_SPACE(tdev) + gre_hlen;
> >
> > - if (skb_headroom(skb) < max_headroom || skb_shared(skb)||
> > + if (skb_headroom(skb) < max_headroom ||
> > (skb_cloned(skb) && !skb_clone_writable(skb, 0))) {
> > struct sk_buff *new_skb = skb_realloc_headroom(skb,
> max_headroom);
> > if (!new_skb) {
>
> Why is the above change valid?
>
>
I believe that we should always get our own private clone and if that isn't
the case other things will break. If this is incorrect, I'd prefer that
things break more obviously rather than subtly so I took out this condition
and added a warning if it is shared.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openvswitch.org/pipermail/ovs-dev/attachments/20100305/44577c0c/attachment-0003.html>
More information about the dev
mailing list