[ovs-dev] [PATCH] Implement new fragment handling policy.

Thu Oct 20 01:09:30 UTC 2011

On Wed, Oct 19, 2011 at 3:19 PM, Ben Pfaff <blp at nicira.com> wrote:
> On Mon, Oct 17, 2011 at 09:19:44PM -0700, Jesse Gross wrote:
>> On Tue, Oct 11, 2011 at 4:33 PM, Ben Pfaff <blp at nicira.com> wrote:
>> > diff --git a/datapath/flow.c b/datapath/flow.c
>> > index 7322295..9a9b0aa 100644
>> > --- a/datapath/flow.c
>> > +++ b/datapath/flow.c
>> > + * Correct behavior when there's more than one fragment header is anybody's
>> > + * guess.  This version reports whether the final fragment header is a first
>> > + * fragment.
>>
>> I thought about this a bit more and I think what you have here is correct.
>>
>> According to RFC 2460, this type of parsing is illegal because you're
>> supposed to fully process each extension header as a protocol layer
>> before moving onto the next instead of skipping through them.  This
>> means that end hosts (which are all that should matter here since
>> routers do not do fragmentation and reassembly) will actually do
>> iterative reassembly in the case of multiple fragment headers.
>
> Wow.  I never considered that possibility.  Looking at the Linux IPv6
> implementation, I think that's what really happens in practice, too.
>
> Your analysis (snipped) is very complete.  Do you want any of it in
> the comment?  For now I just changed this to:
>
>  * When there is more than one fragment header, this version reports whether
>  * the final fragment header that it examines is a first fragment.

I think that's fine.

>> > +static int skip_exthdr(const struct sk_buff *skb, int start, u8 *nexthdrp,
>> > +                      u8 *tos_frag)
>> [...]
>> > +                       if (ntohs(*fp) & ~0x7) {
>> > +                               *tos_frag |= OVS_FRAG_TYPE_FIRST;
>> > +                               break;
>> > +                       }
>> > +                       *tos_frag |= OVS_FRAG_TYPE_LATER;
>>
>> Aren't these two cases (FIRST and LATER) reversed?
>
> Oops.  Yes, you're right.

Looks good.

>> > @@ -140,11 +202,11 @@ static int parse_ipv6hdr(struct sk_buff *skb, struct sw_flow_key *key,
>> >        payload_ofs = (u8 *)(nh + 1) - skb->data;
>> >
>> >        key->ip.proto = NEXTHDR_NONE;
>> > -       key->ip.tos = ipv6_get_dsfield(nh) & ~INET_ECN_MASK;
>> > +       key->ip.tos_frag = ipv6_get_dsfield(nh) & ~INET_ECN_MASK;
>> >        ipv6_addr_copy(&key->ipv6.addr.src, &nh->saddr);
>> >        ipv6_addr_copy(&key->ipv6.addr.dst, &nh->daddr);
>> >
>> > -       payload_ofs = ipv6_skip_exthdr(skb, payload_ofs, &nexthdr);
>> > +       payload_ofs = skip_exthdr(skb, payload_ofs, &nexthdr, &key->ip.tos_frag);
>>
>> I think we're missing the UDP GSO case here to mark it as OVS_FRAG_TYPE_FIRST.
>
> TCP GSO is broken into SKB_GSO_TCPV4 and SKB_GSO_TCPV6.  UDP GSO is
> SKB_GSO_UDP for both v4 and v6, right?

Yes, that's right.

>> > diff --git a/datapath/flow.h b/datapath/flow.h
>> > index ae12fe4..31a02fa 100644
>> > --- a/datapath/flow.h
>> > +++ b/datapath/flow.h
>> > @@ -29,6 +29,10 @@ struct sw_flow_actions {
>> >        struct nlattr actions[];
>> >  };
>> >
>> > +/* Mask for the OVS_FRAG_TYPE_* value in the low 2 bits of ip.tos_frag in
>> > + * struct sw_flow_key. */
>> > +#define OVS_FRAG_TYPE_MASK 3
>>
>> What if we defined this in terms of INET_ECN_MASK?  That seems a
>> little clearer.
>
> OK.  (I didn't do it to start off because they aren't actually ECN
> bits.)

Yeah, I know they're not actually ECN bits, which is why I think it
makes sense to redefine it but I think this better shows that they are
in the same place.

>> > diff --git a/lib/nx-match.c b/lib/nx-match.c
>> > index beaed3d..49aaef9 100644
>> > --- a/lib/nx-match.c
>> > +++ b/lib/nx-match.c
>> > +static void
>> > +nxm_put_tos_frag(struct ofpbuf *b, const struct cls_rule *cr)
>> > +{
>> > +    uint8_t tos_frag = cr->flow.tos_frag;
>> > +    uint8_t tos_frag_mask = cr->wc.tos_frag_mask;
>> > +
>> > +    if (tos_frag_mask & IP_DSCP_MASK) {
>> > +        nxm_put_8(b, NXM_OF_IP_TOS, tos_frag & IP_DSCP_MASK);
>> > +    }
>> > +    if (tos_frag_mask & FLOW_FRAG_MASK) {
>> > +        uint8_t value, mask;
>> > +
>> > +        value = mask = 0;
>> > +        if (tos_frag_mask & FLOW_FRAG_ANY) {
>> > +            mask |= 1;
>> > +            if (tos_frag & FLOW_FRAG_ANY) {
>> > +                value |= 1;
>> > +            }
>> > +        }
>> > +        if (tos_frag_mask & FLOW_FRAG_FIRST) {
>> > +            mask |= 2;
>> > +            if (tos_frag & FLOW_FRAG_FIRST) {
>> > +                value |= 2;
>> > +            }
>> > +        }
>> > +
>> > +        if (mask == 3) {
>> > +            mask = UINT8_MAX;
>> > +        }
>>
>> I'm not sure why we use explicit values here instead of symbolic
>> constants.  I believe that they have same value and we use the
>> constants on the parse side.
>
> I was trying to keep the FLOW_* constants independent of the values in
> nicira-ext.h, but as you point out they leaked anyway.
>
> OK, now I've declared symbolic values for those bits in nicira-ext,
> asserted that they're the same as the FLOW_FRAG_ bits, and just use
> them directly in nx-match.c too:

Thanks, that looks nicer.

>> One general thing that occurs to me is that the not_first option is
>> not particularly useful and that a not fragmented/first fragment
>> combination would be vastly more interesting (by flipping the first
>> bit to be later).  I'm not sure how much it matters though because you
>> can get pretty much the same information by looking at the L4 fields
>> and exactly the same thing by using multiple flows.
>
> OK, good point.  I applied the following:

> diff --git a/lib/classifier.c b/lib/classifier.c
> index d3632a3..0335f13 100644
> --- a/lib/classifier.c
> +++ b/lib/classifier.c
> @@ -614,21 +614,21 @@ cls_rule_format(const struct cls_rule *rule, struct ds *s)
>         ds_put_format(s, "nw_tos=%"PRIu8",", f->tos_frag & IP_DSCP_MASK);
>     }
>     switch (wc->tos_frag_mask & FLOW_FRAG_MASK) {
> -    case FLOW_FRAG_ANY | FLOW_FRAG_FIRST:
> +    case FLOW_FRAG_ANY | FLOW_FRAG_LATER:
>         ds_put_format(s, "frag=%s,",
>                       f->tos_frag & FLOW_FRAG_ANY
> -                      ? (f->tos_frag & FLOW_FRAG_FIRST ? "first" : "later")
> -                      : (f->tos_frag & FLOW_FRAG_FIRST ? "<error>" : "no"));
> +                      ? (f->tos_frag & FLOW_FRAG_LATER ? "later" : "first")
> +                      : (f->tos_frag & FLOW_FRAG_LATER ? "no" : "<error>"));

I think the last condition is reversed because !FLOW_FRAG_ANY and
FLOW_FRAG_LATER is still an error.

> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index 58fc18e..536a213 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -2954,7 +2954,7 @@ rule_dpif_lookup(struct ofproto_dpif *ofproto, const struct flow *flow,
>     }
>
>     cls = &ofproto->up.tables[table_id];
> -    if (flow->tos_frag & FLOW_FRAG_FIRST
> +    if ((flow->tos_frag & FLOW_FRAG_MASK) == FLOW_FRAG_ANY

Is there a reason for this to not be (flow->tos_frag & FLOW_FRAG_ANY)?
 I realize that it does the masking out for last frags as well, which
is not strictly necessary although I'm not sure that it matters all
that much.

Otherwise, the incremental looks good.  However, I realized that there
is one more issue: when we pass up the flow key for UDP GSO packets,
they will all reflect the first fragment and not the later packets.