[ovs-dev] [IPv6 Redux 5/6] nicira-ext: Support matching IPv6 traffic.

Justin Pettit jpettit at nicira.com
Sun Jan 23 11:31:45 UTC 2011


On Jan 22, 2011, at 5:41 PM, Jesse Gross wrote:

> On Sat, Jan 22, 2011 at 3:11 AM, Justin Pettit <jpettit at nicira.com> wrote:
>> 
>> +       if (skb->len < nh_ofs + nh_len)
>> +               return -EINVAL;
> 
> Please add likely()/unlikely() to all of these error checks as appropriate.

Okay.  Done

>> +       key->nw_tos = ipv6_get_dsfield(nh);
> 
> What about masking out the ECN bits?

Thanks.  Fixed.

>> +
>> +       /* We don't process jumbograms. */
>> +       if (!ntohs(nh->payload_len))
>> +               return -EINVAL;
> 
> Since this is testing against zero, the byte swap is a waste of time.
> Zero is always zero.

Right.  In quite a few places throughout the code, we use this for documentation/warning-others-who-use-this-field purposes; plus, I figured the compiler might optimize it out.  Regardless, I removed it, since you seem to have a preference.

> As I said before, I think this is a somewhat artificial limitation on
> processing of jumbograms (at least at this point in the processing).
> However, that can wait for the future to be lifted.

I agree, and we can address it later.

>> +       while (1) {
>> +               if ((nexthdr != NEXTHDR_HOP)
>> +                               && (nexthdr != NEXTHDR_ROUTING)
>> +                               && (nexthdr != NEXTHDR_DEST)
>> +                               && (nexthdr != NEXTHDR_MOBILITY)
>> +                               && (nexthdr != NEXTHDR_AUTH)
>> +                               && (nexthdr != NEXTHDR_FRAGMENT)) {
>> +                       /* It's either a terminal header (e.g., TCP, UDP) or one we
>> +                        * don't understand.  In either case, we're done with the
>> +                        * packet, so use it to fill in 'nw_proto'. */
>> +                       break;
>> +               }
> 
> There's an existing Linux function that is almost exactly the same as
> this loop: ipv6_skip_exthdr().  It would be much better if we could
> use it instead, especially since I see some protocol handling bugs
> below.  I see that it handles NEXTHDR_MOBILITY and NEXTHDR_NONE
> differently but otherwise the logic is very similar.

I agree it would be great to reuse the kernel's code, but do you think it's worth using?  It doesn't handle Mobility headers, doesn't complain about nested fragments (I'm not positive this is a protocol violation based on my reading (I think it's silent on the issue), but it seems unlikely to be legitimate), and its handling of the None header is odd.  If any new headers are introduced, we'll have to rebuild this functionality, if we want compatibility with older kernels.  Any issues with the code I wrote to parse extenders should be ferreted out quickly.

>> +                       nh_len += (auth_hdr->hdrlen * 4) + 2;
> 
> The order of operations is reversed here.  It should be
> (auth_hdr->hdrlen + 2) * 4.

Heh.  The document I read stated: "Size of the AH header in 32 bit words - 2."  I had asked QA to test this header specifically, but they were still working on developing a test.  Thanks for catching this.

>> +                       if ((frag_hdr->frag_off & ntohs(0xfff8)) != 0) {
> 
> This should be htons() instead.

Thanks.  Fixed.

>> +
>> +       /* The payload length claims to be smaller than the size of the
>> +        * headers we've already processed. */
>> +       if (ntohs(nh->payload_len) < nh_len - sizeof *nh)
>> +               return -EINVAL;
> 
> 'nh' is no longer valid after a call to pskb_may_pull().  Also, sizeof
> should have parentheses.

Great catch.

>> +
>> +       skb_set_transport_header(skb, nh_ofs + nh_len);
>> +       key->nw_proto = nexthdr;
> 
> The IPv4 code pulls enough into the linear data area for the L4
> protocols as well, which means that those checks only test skb->len.
> We either need to do something similar here or modify the other
> checks.

I just did the pull directly in parse_ipv6hdr().  It seems like it may be cleaner to do it in the transport headers in the long run.

>> +static inline bool icmp6hdr_ok(struct sk_buff *skb)
> 
> You should drop the inline from here as well.

Done.

>> +               nh_len = parse_ipv6hdr(skb, key);
>> +               if (unlikely(nh_len < 0)) {
>> +                       if (nh_len == -EINVAL) {
>> +                               key->nw_proto = 0;
> 
> Shouldn't this already be 0?

I think there used to be a slightly different code path in parse_ipv6hdr() previously.  You are correct, though, so I removed that.

>> --- a/include/openvswitch/datapath-protocol.h
>> +++ b/include/openvswitch/datapath-protocol.h
>> @@ -235,6 +235,8 @@ struct odp_flow_key {
>>     uint8_t  nw_tos;            /* IP ToS (DSCP field, 6 bits). */
>>     uint8_t  arp_sha[6];        /* ARP source hardware address. */
>>     uint8_t  arp_tha[6];        /* ARP target hardware address. */
>> +    uint32_t ipv6_src[4];       /* IPv6 source address. */
>> +    uint32_t ipv6_dst[4];       /* IPv6 destination address. */
> 
> This structure is really getting unwieldy.  After the ND additions,
> it's almost triple the previous size.  I guess it is OK as a temporary
> measure but it needs to be addressed soon.

I agree.  I was anticipating that this would go in after the netlink changes, so I didn't attempt to optimize it.  This will hopefully get replaced in a few days.

>> diff --git a/lib/classifier.c b/lib/classifier.c
>> index a4d53c6..b606f6f 100644
>> --- a/lib/classifier.c
>> +++ b/lib/classifier.c
>> @@ -472,8 +545,13 @@ cls_rule_format(const struct cls_rule *rule, struct ds *s)
>>     if (!skip_type && !(w & FWW_DL_TYPE)) {
>>         ds_put_format(s, "dl_type=0x%04"PRIx16",", ntohs(f->dl_type));
>>     }
>> -    format_ip_netmask(s, "nw_src", f->nw_src, wc->nw_src_mask);
>> -    format_ip_netmask(s, "nw_dst", f->nw_dst, wc->nw_dst_mask);
>> +    if (f->dl_type == htons(ETH_TYPE_IPV6)) {
>> +        format_ipv6_netmask(s, "ipv6_src", &f->ipv6_src, &wc->ipv6_src_mask);
>> +        format_ipv6_netmask(s, "ipv6_dst", &f->ipv6_dst, &wc->ipv6_dst_mask);
>> +    } else {
>> +        format_ip_netmask(s, "nw_src", f->nw_src, wc->nw_src_mask);
>> +        format_ip_netmask(s, "nw_dst", f->nw_dst, wc->nw_dst_mask);
>> +    }
> 
> Specifying IPv6 in the src/dst fields is a break from the convention
> that we've had in the past of generic names (certainly nw_src applies
> to IPv6 as much as IPv4).  I think it's probably nicer to do it this
> way but then we should make everything consistent.

Changing "nw_src" to accept IPv6 addresses would be a pretty big undertaking and have repercussions throughout the code base, since many IPv4-specific functions have "nw_src" in their name.  Similarly, changing all references of "nw_src" to "ipv4_src" would be a big undertaking, and it would break everyone's scripts who uses "nw_src".  I suppose we could alias "nw_src" to "ipv4_src", but you still have the function name issue.  While I agree with you that it's inconsistent, I don't think it's a project we should tackle right now.  (Just look at how IP is almost always used to refer to IPv4.  ;-) )

>> diff --git a/lib/flow.c b/lib/flow.c
>> index 13c236d..d990673 100644
>> --- a/lib/flow.c
>> +++ b/lib/flow.c
>> +static int
>> +parse_ipv6(struct ofpbuf *packet, struct flow *flow)
> [...]
>> +    while (1) {
>> +        if ((nexthdr != IPPROTO_HOPOPTS)
>> +                && (nexthdr != IPPROTO_ROUTING)
>> +                && (nexthdr != IPPROTO_DSTOPTS)
>> +                && (nexthdr != IPPROTO_MOBILITY)
>> +                && (nexthdr != IPPROTO_AH)
>> +                && (nexthdr != IPPROTO_FRAGMENT)) {
>> +                       /* It's either a terminal header (e.g., TCP, UDP) or one we
>> +                        * don't understand.  In either case, we're done with the
>> +             * packet, so use it to fill in 'nw_proto'. */
> 
> Missing indentation?

Actual tabs in the two earlier lines.  I must have switched vim's tab processing mode thinking I was in the kernel, since the code is very similar.

> This function looks semantically equivalent to the kernel version, so
> some of the comments apply here as well.

Yep, I'll take a pass through it.

>> diff --git a/lib/packets.c b/lib/packets.c
>> index 2dc82fe..52fa65e 100644
>> --- a/lib/packets.c
>> +++ b/lib/packets.c
>> +int
>> +ipv6_count_cidr_bits(const struct in6_addr *netmask)
>> +{
>> +    const uint32_t *netmaskp;
>> +    int i;
>> +    int count = 0;
>> +
>> +    assert(ipv6_is_cidr(netmask));
>> +    netmaskp = (const uint32_t *)&netmask->s6_addr[0];
> 
> Is this guaranteed to be 32-bit aligned?  If you have sufficient
> alignment guarantees you could use unsigned long instead and halve the
> iterations on a 64-bit machine.

On Linux with GCC, I believe it will always be 32-bit aligned, but not necessarily 64-bit aligned.  Since we're not just targeting Linux, it's not a great assumption, so I'll clean it up.

> What about strict aliasing rules?

As before, I don't think it's a problem on Linux with GCC, but it's a bad assumption to make.

>> +
>> +    for (i=0; i<4; i++) {
>> +#if __GNUC__ >= 4
>> +        if (ntohl(netmaskp[i])) {
> 
> You don't need the ntohl here.

Same thing about documenting/optimization as earlier, but I went ahead and changed it.

>> +            count += 32 - __builtin_ctz(ntohl(netmaskp[i]));
> 
> I don't really think this function is that important but there's an
> even faster way to do this if you already know that it is a valid
> CIDR:  __builtin_popcount.  New processors have a POPCNT instruction
> in hardware.  You would also be able drop the byte swapping.

I agree, but I followed the ofputil_netmask_to_wcbits() example, which does something similar.  I actually used the __builtin_popcount function in my initial series of patches, but Ben said that it can be horrendously inefficient sometimes, and __builtin_ctz was better more often.

> Most of these comments also apply to the related functions in this file.

Okay, I cleaned them up.  I'm not sure the performance benefit from having both 32-bit and 8-bit versions is worth the code bloat.  Someplace like "ipv6_addr_bitand() is fairly compact, but the others are huge.  Thoughts?

--Justin






More information about the dev mailing list