[ovs-dev] [IPv6 Redux 5/6] nicira-ext: Support matching IPv6 traffic.

Jesse Gross jesse at nicira.com
Sun Jan 23 19:11:27 UTC 2011


On Sun, Jan 23, 2011 at 3:31 AM, Justin Pettit <jpettit at nicira.com> wrote:
> On Jan 22, 2011, at 5:41 PM, Jesse Gross wrote:
>>> +
>>> +       /* We don't process jumbograms. */
>>> +       if (!ntohs(nh->payload_len))
>>> +               return -EINVAL;
>>
>> Since this is testing against zero, the byte swap is a waste of time.
>> Zero is always zero.
>
> Right.  In quite a few places throughout the code, we use this for documentation/warning-others-who-use-this-field purposes; plus, I figured the compiler might optimize it out.  Regardless, I removed it, since you seem to have a preference.

It's usually best to assume the compiler is pretty dumb because most
of the time you will be right.  In this case, for a 16-bit swap on x86
with the Linux implementation it actually is optimized out.  However,
if it were a 32-bit value then it does make a difference here.

On the fast path at least, documentation in the form of a comment is
better if necessary.  Alternately, you could do:
if (nh->payload_len == htons(0))

This can be handled at compile time for all sizes, architectures, and
values since the compiler will do constant folding.

>>> +       while (1) {
>>> +               if ((nexthdr != NEXTHDR_HOP)
>>> +                               && (nexthdr != NEXTHDR_ROUTING)
>>> +                               && (nexthdr != NEXTHDR_DEST)
>>> +                               && (nexthdr != NEXTHDR_MOBILITY)
>>> +                               && (nexthdr != NEXTHDR_AUTH)
>>> +                               && (nexthdr != NEXTHDR_FRAGMENT)) {
>>> +                       /* It's either a terminal header (e.g., TCP, UDP) or one we
>>> +                        * don't understand.  In either case, we're done with the
>>> +                        * packet, so use it to fill in 'nw_proto'. */
>>> +                       break;
>>> +               }
>>
>> There's an existing Linux function that is almost exactly the same as
>> this loop: ipv6_skip_exthdr().  It would be much better if we could
>> use it instead, especially since I see some protocol handling bugs
>> below.  I see that it handles NEXTHDR_MOBILITY and NEXTHDR_NONE
>> differently but otherwise the logic is very similar.
>
> I agree it would be great to reuse the kernel's code, but do you think it's worth using?  It doesn't handle Mobility headers, doesn't complain about nested fragments (I'm not positive this is a protocol violation based on my reading (I think it's silent on the issue), but it seems unlikely to be legitimate), and its handling of the None header is odd.  If any new headers are introduced, we'll have to rebuild this functionality, if we want compatibility with older kernels.  Any issues with the code I wrote to parse extenders should be ferreted out quickly.

Yes, I think we should use it.  It allows for reduced code, fewer
bugs, is a slightly more efficient implementation, and would be
necessary for upstreaming.  Here are the differences:

Mobility headers:
I thought that this was odd, so I started reading the RFCs.  It now
appears that the kernel's implementation is the right one and this is
wrong.  Mobility headers are, for all intents and purposes, terminal
headers.  They contain information for updating bindings, not protocol
data.  There is a next header field but this is what RFC 3775 has to
say about it:

Identifies the protocol following this header. This field is intended
to be used by a future extension. Currently, this field SHOULD be set
to 59 (IPPROTO_NONE).

So parsing this as we do currently would result in NONE for protocol,
instead of MOBILITY which is much more interesting.

Fragments:
I don't think it is wrong that we check for duplicate fragment headers
but I'm not sure that it is incredibly critical that we do so.  For
one thing, there are plenty of other parameters in an IPv6 packet that
could be inconsistent (and even others within the fragment header)
that we don't check, so people shouldn't really rely on this level of
validation.

Linux, at least, will happily process multiple fragment headers since
it handles extension header in order and once it has reassembled a
packet it will move onto the next header, even if it is a fragment.
This behavior is required by RFC 2460:

The contents and semantics of each extension header determine whether
or not to proceed to the next header.  Therefore, extension headers
must be processed strictly in the order they appear in the packet;

On the other hand, only end hosts are supposed to do fragmentation in
IPv6, so it is not clear how this could arise in a legitimate
situation.  Overall, though it seems like there are so many ways for
fragments to be inconsistent that we don't check for that this one
seems fairly minor.

None:
ipv6_skip_exthdr() returns an error if it either encounters a NONE
header or runs off the end of the packet when parsing it.  Currently
we distinguish the latter by using zero as the protocol.  However,
further reading shows that hop-by-hop options are protocol 0.  We skip
over hop-by-hop options, so we should never actually return 0 for the
protocol but the potential for confusion would make it seem a poor
choice.  Returning NONE is probably the safe course for the error
case.

Any backwards compatibility issues can be solved easily by putting a
copy of the most recent kernel's function in the compat directory.
It's self-contained so it should not be difficult at all.  In
addition, if there are pieces that you feel are missing from the
function, then you should propose upstream changes as a sanity check.
If they are accepted then it is easy to include the new version;
otherwise we may find out that there are good reasons for it being
done the way it currently is (as with some of the issues here).
ipv6_skip_exthdr() is used in a variety of places, including
ip6tables, so the use cases should be well aligned.

>>> +
>>> +       skb_set_transport_header(skb, nh_ofs + nh_len);
>>> +       key->nw_proto = nexthdr;
>>
>> The IPv4 code pulls enough into the linear data area for the L4
>> protocols as well, which means that those checks only test skb->len.
>> We either need to do something similar here or modify the other
>> checks.
>
> I just did the pull directly in parse_ipv6hdr().  It seems like it may be cleaner to do it in the transport headers in the long run.

The reason why it was done the way it is currently is to reduce the
number of calls to pskb_may_pull().  That doesn't really apply to IPv6
right now because of the extension header parsing but it might make
more sense if we use ipv6_skip_exthdr() because we won't be calling
pskb_may_pull() repeatedly then.

>>> --- a/include/openvswitch/datapath-protocol.h
>>> +++ b/include/openvswitch/datapath-protocol.h
>>> @@ -235,6 +235,8 @@ struct odp_flow_key {
>>>     uint8_t  nw_tos;            /* IP ToS (DSCP field, 6 bits). */
>>>     uint8_t  arp_sha[6];        /* ARP source hardware address. */
>>>     uint8_t  arp_tha[6];        /* ARP target hardware address. */
>>> +    uint32_t ipv6_src[4];       /* IPv6 source address. */
>>> +    uint32_t ipv6_dst[4];       /* IPv6 destination address. */
>>
>> This structure is really getting unwieldy.  After the ND additions,
>> it's almost triple the previous size.  I guess it is OK as a temporary
>> measure but it needs to be addressed soon.
>
> I agree.  I was anticipating that this would go in after the netlink changes, so I didn't attempt to optimize it.  This will hopefully get replaced in a few days.

That's fine but the optimizations are independent of the Netlink
changes and doesn't come automatically with the switchover.

>
>>> diff --git a/lib/classifier.c b/lib/classifier.c
>>> index a4d53c6..b606f6f 100644
>>> --- a/lib/classifier.c
>>> +++ b/lib/classifier.c
>>> @@ -472,8 +545,13 @@ cls_rule_format(const struct cls_rule *rule, struct ds *s)
>>>     if (!skip_type && !(w & FWW_DL_TYPE)) {
>>>         ds_put_format(s, "dl_type=0x%04"PRIx16",", ntohs(f->dl_type));
>>>     }
>>> -    format_ip_netmask(s, "nw_src", f->nw_src, wc->nw_src_mask);
>>> -    format_ip_netmask(s, "nw_dst", f->nw_dst, wc->nw_dst_mask);
>>> +    if (f->dl_type == htons(ETH_TYPE_IPV6)) {
>>> +        format_ipv6_netmask(s, "ipv6_src", &f->ipv6_src, &wc->ipv6_src_mask);
>>> +        format_ipv6_netmask(s, "ipv6_dst", &f->ipv6_dst, &wc->ipv6_dst_mask);
>>> +    } else {
>>> +        format_ip_netmask(s, "nw_src", f->nw_src, wc->nw_src_mask);
>>> +        format_ip_netmask(s, "nw_dst", f->nw_dst, wc->nw_dst_mask);
>>> +    }
>>
>> Specifying IPv6 in the src/dst fields is a break from the convention
>> that we've had in the past of generic names (certainly nw_src applies
>> to IPv6 as much as IPv4).  I think it's probably nicer to do it this
>> way but then we should make everything consistent.
>
> Changing "nw_src" to accept IPv6 addresses would be a pretty big undertaking and have repercussions throughout the code base, since many IPv4-specific functions have "nw_src" in their name.  Similarly, changing all references of "nw_src" to "ipv4_src" would be a big undertaking, and it would break everyone's scripts who uses "nw_src".  I suppose we could alias "nw_src" to "ipv4_src", but you still have the function name issue.  While I agree with you that it's inconsistent, I don't think it's a project we should tackle right now.  (Just look at how IP is almost always used to refer to IPv4.  ;-) )

I was just talking about the naming, not converting everything to
handle both.  While I don't think that maintaining compatibility with
scripts is a great reason to keep things as they are, it can wait for
the future.

>> I don't really think this function is that important but there's an
>> even faster way to do this if you already know that it is a valid
>> CIDR:  __builtin_popcount.  New processors have a POPCNT instruction
>> in hardware.  You would also be able drop the byte swapping.
>
> I agree, but I followed the ofputil_netmask_to_wcbits() example, which does something similar.  I actually used the __builtin_popcount function in my initial series of patches, but Ben said that it can be horrendously inefficient sometimes, and __builtin_ctz was better more often.

Yes, Ben is right that since distributions need to support processors
older than Nehalem, people won't be able to take advantage of POPCNT,
even if their CPU supports it.  It's really not that important anyway.

>
>> Most of these comments also apply to the related functions in this file.
>
> Okay, I cleaned them up.  I'm not sure the performance benefit from having both 32-bit and 8-bit versions is worth the code bloat.  Someplace like "ipv6_addr_bitand() is fairly compact, but the others are huge.  Thoughts?

Unless it's performance critical, it's probably not worth it.
However, depending on what you did for alignment/strict aliasing (like
if you are making a copy) you might be able to put it in an array of
longs to automatically get the native word size without any duplicate
code.




More information about the dev mailing list