[ovs-dev] [PATCH v4 6/6] Classifier: Track address prefixes.

Jarno Rajahalme jrajahalme at nicira.com
Wed Dec 4 18:28:10 UTC 2013


Ben,

Thanks for your review!

On Dec 3, 2013, at 2:58 PM, Ben Pfaff <blp at nicira.com> wrote:

> Here are some preliminary comments.  I think I'd like a v5, if you're
> OK with that.
> 

No problem :-)

> 
> Bugs (?)
> ========
> 
> I think the intent is that the 'prefix' member of struct trie_node is
> stored "left justified", so that the first bit of the prefix is
> shifted into the MSB position of the uint32_t and the LSB position of
> the uint32_t is only nonzero if 'nbits' is 32.  But...  I don't
> understand this line of code in trie_equal_bits() and
> trie_prefix_equal_bits():
>    uint32_t diff = node->prefix ^ ntohl(value[ofs / 32]) << ofs % 32;
> and similarly in trie_get_prefix():
>    prefix = ntohl(*pr) << ofs % 32;
> because both of these seem to shift by the wrong amount for this
> interpretation.  If I've got 10 bits of prefix with ofs % 32 == 0, I
> need to shift left by 22 bits, not by 0 bits.  So... explain?
> 

‘ofs’ is the bit offset into the network byte order prefix at ‘value’ from which the comparison should start. For example, in a 128-bit IPv6 address the most significant bit is at offset 0 and the last bit is at offset 127. For a 32-bit IPv4 address the MSB is at offset 0 and the LSB at offset 31 (which may seem odd, as the LSB is conventionally considered bit number 0). This way the code works the same for any prefix length. Also, numbering the bits from the MSB matches the bit numbering used for IP addresses in the relevant RFCs.

So, if you have a “/10” IPv4 prefix and you are at offset 0 into the prefix (i.e. comparison starting at the beginning of the prefix), the bits of interest are already at the MSB position and the right amount to shift is 0 bits.

> 
> Style
> =====
> 
> The comment in trie_lookup() is missing a word or two ("Check that can
> skip based...").
> 
> I think that the check for !mf in trie_lookup() is unnecessary.  (Is
> there any way to have a trie without a match field?)
> 
> The indentation in trie_init() here made this statement really
> confusing at first glance:
>        /* Initialize subtable's prefix length on this field. */
>        subtable->trie_plen[trie_idx] = field
>            ? minimask_get_prefix_len(&subtable->mask, field)
>            : 0;
> I see why you did it that way but maybe a larger rewrite would make
> the whole loop body easier to understand, e.g.:
>        uint8_t *plen = &subtable->trie_plen[trie_idx];
>        *plen = field ? minimask_get_prefix_len(&subtable->mask, field) : 0;
>        if (*plen) {
>            struct cls_rule *head;
> 
>            HMAP_FOR_EACH (head, hmap_node, &subtable->rules) {
>                struct cls_rule *rule;
> 
>                FOR_EACH_RULE_IN_LIST (rule, head) {
>                    trie_insert(trie, rule, *plen);
>                }
>            }
>        }
> but who knows...
> 

I like this better: 

        uint8_t plen;

        plen = field ? minimask_get_prefix_len(&subtable->mask, field) : 0;
        /* Initialize subtable's prefix length on this field. */
        subtable->trie_plen[trie_idx] = plen;

        if (plen) {
            struct cls_rule *head;

            HMAP_FOR_EACH (head, hmap_node, &subtable->rules) {
                struct cls_rule *rule;

                FOR_EACH_RULE_IN_LIST (rule, head) {
                    trie_insert(trie, rule, plen);
                }
            }
        }


> The 'tries' parameter to check_tries() is kind of funny.  This
> parameter is a function of the cls_subtable, that is, it could be
> stored in the cls_subtable at trie creation time and never touched
> again.  But we recalculate it on every call to find_match_wc().  I
> came up with various other ways to do it (e.g. move to struct
> cls_subtable), but I'm not sure in the end that it's actually a
> valuable optimization worth keeping.

Right, that was more of an experiment on my part. I simplified the code and removed the ‘tries’ range.

> 
> I'm also not sure that the be32ofs member of struct trie_ctx is
> valuable and worth keeping.  I think that it can be trivially
> eliminated:
> 
> diff --git a/lib/classifier.c b/lib/classifier.c
> index 338872b..2c34cf2 100644
> --- a/lib/classifier.c
> +++ b/lib/classifier.c
> @@ -471,7 +471,6 @@ classifier_remove(struct classifier *cls, struct cls_rule *rule)
> struct trie_ctx {
>     const struct cls_trie *trie;
>     bool lookup_done;     /* Status of the lookup. */
> -    uint8_t be32ofs;      /* U32 offset of the field in question. */
>     uint8_t match_plen;   /* Longest prefix than could possibly match. */
>     uint8_t maskbits;     /* Prefix length needed to avoid false matches. */
> };
> @@ -480,7 +479,6 @@ static void
> trie_ctx_init(struct trie_ctx *ctx, const struct cls_trie *trie)
> {
>     ctx->trie = trie;
> -    ctx->be32ofs = trie->field->flow_be32ofs;
>     ctx->lookup_done = false;
> }
> 
> @@ -989,10 +987,11 @@ check_tries(const struct flow *flow, struct trie_ctx trie_ctx[CLS_MAX_TRIES],
>      * needed to avoid folding in additional bits to the wildcards mask. */
>     for (j = tries->start; j < tries->end; j++) {
>         struct trie_ctx *ctx = &trie_ctx[j];
> +        int be32ofs = ctx->trie->field->flow_be32ofs;
> 

I guess I wanted to avoid resolving that many indirections this deep in the classifier, especially since we typically do this multiple times for each subtable (in the typical use case we find that the trie field is not relevant for the first two ranges in the following if statement). Since we already have the context for keeping track of the on-demand lookup, we might as well use it to precompute values that would otherwise be resolved many times over.

>         /* Is the trie field relevant for this subtable and range of fields? */
>         if (field_plen[j] &&
> -            ctx->be32ofs >= ofs->start && ctx->be32ofs < ofs->end) {
> +            be32ofs >= ofs->start && be32ofs < ofs->end) {
>             /* On-demand trie lookup. */
>             if (!ctx->lookup_done) {
>                 ctx->match_plen = trie_lookup(ctx->trie, flow, &ctx->maskbits);
> @@ -1010,14 +1009,14 @@ check_tries(const struct flow *flow, struct trie_ctx trie_ctx[CLS_MAX_TRIES],
>                  * relevant for the current flow. */
> 
>                 /* Can skip if the field is already unwildcarded. */
> -                if (mask_prefix_bits_set(wc, ctx->be32ofs, ctx->maskbits)) {
> +                if (mask_prefix_bits_set(wc, be32ofs, ctx->maskbits)) {
>                     return true;
>                 }
>                 /* Check that the trie result will not unwildcard more bits
>                  * than this stage will. */
>                 if (ctx->maskbits <= field_plen[j]) {
>                     /* Unwildcard the bits and skip the rest. */
> -                    mask_set_prefix_bits(wc, ctx->be32ofs, ctx->maskbits);
> +                    mask_set_prefix_bits(wc, be32ofs, ctx->maskbits);
>                     /* Note: Prerequisite already unwildcarded, as the only
>                      * prerequisite of the supported trie lookup fields is
>                      * the ethertype, which is currently always unwildcarded.
> 

More than happy to do this if you do not buy my rationale above :-)

> In the trie_lookup_value() prototype, I'd appreciate it if 'value'
> were declared as "const ovs_be32 value[]".  Even though that's
> semantically the same as "const ovs_be32 *value", it makes it clear
> why it's a pointer instead of a single ovs_be32.
> 

OK.

> 
> Documentation
> =============
> 
> I wrote some comments for classifier.h:
> 

I have incorporated these with the minor corrections below, thanks!

> diff --git a/lib/classifier.h b/lib/classifier.h
> index f5fc323..032c573 100644
> --- a/lib/classifier.h
> +++ b/lib/classifier.h
> @@ -24,15 +24,79 @@
>  * =====
>  *
>  * A flow classifier holds any number of "rules", each of which specifies
> - * values to match for some fields or subfields and a priority.  The primary
> - * design goal for the classifier is that, given a packet, it can as quickly as
> - * possible find the highest-priority rule that matches the packet.
> - *
> - * Each OpenFlow table is implemented as a flow classifier.
> - *
> - *
> - * Basic Design
> - * ============
> + * values to match for some fields or subfields and a priority.  Each OpenFlow
> + * table is implemented as a flow classifier.
> + *
> + * The classifier has two primary design goals.  The first is obvious: given a
> + * set of packet headers, as quickly as possible find the highest-priority rule
> + * that matches those headers.  The following section describes the second
> + * goal.
> + *
> + *
> + * "Un-wildcarding"
> + * ================
> + *
> + * A primary goal of the flow classifier is to produce, as a side effect of a
> + * packet lookup, a wildcard mask that indicates which bits of the packet
> + * headers were essential to the classification result.  Ideally, a 1-bit in
> + * any position of this mask means that, if the corresponding bit in the packet
> + * header were flipped, then the classification result might change.  A 0-bit
> + * means that changing the packet header bit would have no effect.  Thus, the
> + * wildcarded bits are the ones that played no role in the classification
> + * decision.
> + *
> + * Such a wildcard mask is useful with datapaths that support installing flows
> + * that wildcard fields or subfields.  If an OpenFlow lookup for a TCP flow
> + * does not actually look at the TCP source or destination ports, for example,
> + * then the switch may install into the datapath a flow that wildcards the port
> + * numbers, which in turn allows the datapath to handle packets that arrive for
> + * other TCP source or destination ports without additional help from
> + * ovs-vswitchd.  This is useful for the Open vSwitch software and,
> + * potentially, for ASIC-based switches as well.
> + *
> + * Some properties of the wildcard mask:
> + *
> + *     - "False 1-bits" are acceptable, that is, setting a bit in the wildcard
> + *       mask to a 1 will never cause a packet to be forwarded the wrong way.

-> “to 1 will”

> + *       As a corollary, a wildcard mask composed of all 1-bits will always
> + *       yield correct (but often needlessly inefficient) behavior.
> + *
> + *     - "False 0-bits" can cause problems, so they must be avoided.  In the
> + *       extreme case, a mask of all 0-bits is only correct if the classifier
> + *       contains only a single flow that matches all packets.
> + *
> + *     - 0-bits are desirable because they allow the datapath to act more
> + *       autonomously, relying on ovs-vswitchd to process fewer flow setups,

-> “relying less on ovs-vswitchd to process flow setups."

> + *       thereby improving performance.
> + *
> + *     - We don't know a good way to generate wildcard masks with the maximum
> + *       (correct) number of 0-bits.  We use various approximations, described
> + *       in later sections.
> + *
> + *     - Wildcard masks for lookups in a given classifier yield a
> + *       non-overlapping set of rules.  More specifically:
> + *
> + *       Consider a classifier C1 filled with an arbitrary collection of rules
> + *       and an empty classifier C2.  Now take a set of packet headers H and
> + *       look it up in C1, yielding a highest-priority matching rule R1 and
> + *       wildcard mask M.  Form a new classifier rule R2 out of packet headers
> + *       H and mask M, and add R2 to C2 with a fixed priority.  If one were to
> + *       do this for every possible set of packet headers H, then this
> + *       process would not attempt to add any overlapping rules to C2, that is,
> + *       any packet lookup using the rules generated by this process matches at
> + *       most one rule in C2.
> + *
> + * During the lookup process, the classifier starts out with a wildcard mask
> + * that is all 0-bits, that is, fully wildcarded.  As lookup proceeds, each
> + * step tends to add constraints to the wildcard mask, that is, change
> + * wildcarded 0-bits into exact-match 1-bits.  We call this "un-wildcarding".
> + * A lookup step that examines a particular field must unwildcard that field.
> + * In general, un-wildcarding is necessary for correctness but undesirable for
> + * performance.
> + *
> + *
> + * Basic Classifier Design
> + * =======================
>  *
>  * Suppose that all the rules in a classifier had the same form.  For example,
>  * suppose that they all matched on the source and destination Ethernet address
> @@ -51,8 +115,10 @@
>  * cls_subtable" in the classifier and tracks the highest-priority match that
>  * it finds.  The subtables are kept in a descending priority order according
>  * to the highest priority rule in each subtable, which allows lookup to skip
> - * over subtables that can't possibly have a higher-priority match than
> - * already found.
> + * over subtables that can't possibly have a higher-priority match than already
> + * found.  Eliminating lookups through priority ordering aids both classifier
> + * primary design goals: skipping lookups saves time and avoids un-wildcarding
> + * fields that those lookups would have examined.
>  *
>  * One detail: a classifier can contain multiple rules that are identical other
>  * than their priority.  When this happens, only the highest priority rule out
> @@ -61,24 +127,28 @@
>  * list inside that highest-priority rule.
>  *
>  *
> - * Staged Lookup
> - * =============
> + * Staged Lookup (Wildcard Optimization)
> + * =====================================
>  *
>  * Subtable lookup is performed in ranges defined for struct flow, starting
>  * from metadata (registers, in_port, etc.), then L2 header, L3, and finally
>  * L4 ports.  Whenever it is found that there are no matches in the current
> - * subtable, the rest of the subtable can be skipped.  The rationale of this
> - * logic is that as many fields as possible can remain wildcarded.
> + * subtable, the rest of the subtable can be skipped.
> + *
> + * Staged lookup does not reduce lookup time, and it may increase it, because
> + * it changes a single hash table lookup into multiple hash table lookup.  It

-> “into multiple hash table lookups.”

> + * reduces un-wildcarding significantly in important use cases.
>  *
>  *
> - * Prefix Tracking
> - * ===============
> + * Prefix Tracking (Wildcard Optimization)
> + * =======================================
>  *
>  * Classifier uses prefix trees ("tries") for tracking the used
>  * address space, enabling skipping classifier tables containing
> - * longer masks than necessary for the given address.  This enables
> - * less unwildcarding for datapath flows in parts of the address space
> - * without host routes.
> + * longer masks than necessary for the given address.  This reduces
> + * unwildcarding for datapath flows in parts of the address space 

-> “un-wildcarding” (for consistency)

> + * without host routes, but consulting extra data structures (the
> + * tries) may slightly increase lookup time.
>  *
>  * Trie lookup is interwoven with staged lookup, so that a trie is
>  * searched only when the configured trie field becomes relevant for
> @@ -100,8 +170,8 @@
>  * flow table.  Currently this limit is 3.
>  *
>  *
> - * Partitioning
> - * ============
> + * Partitioning (Lookup Time and Wildcard Optimization)
> + * ====================================================
>  *
>  * Suppose that a given classifier is being used to handle multiple stages in a
>  * pipeline using "resubmit", with metadata (that is, the OpenFlow 1.1+ field
> @@ -133,6 +203,9 @@
>  * any cls_subtable that doesn't match on metadata.  We handle that by giving
>  * any such cls_subtable TAG_ALL as its 'tags' so that it matches any tag.)
>  *
> + * Partitioning saves lookup time by reducing the number of subtable lookups.
> + * Each eliminated subtable lookup also reduces the amount of un-wildcarding.
> + *
>  *
>  * Thread-safety
>  * =============
