[ovs-dev] [PATCH v4 6/6] Classifier: Track address prefixes.

Ben Pfaff blp at nicira.com
Tue Dec 3 22:58:47 UTC 2013


On Tue, Nov 26, 2013 at 11:24:52AM -0800, Jarno Rajahalme wrote:
> Add a prefix tree (trie) structure for tracking the used address
> space, enabling skipping classifier tables containing longer masks
> than necessary for an address field value in a flow being classified.
> This enables less unwildcarding for datapath flows in parts of the
> address space without host routes.
> 
> Trie lookup is interwoven to the staged lookup, so that a trie is
> searched only when the configured trie field field becomes relevant
> for the lookup.  The trie lookup results are retained so that each
> trie is checked at most once for each classifier lookup.
> 
> This implementation tracks the number of rules at each address prefix
> for the whole classifier.  More aggressive table skipping would be
> possible by maintaining lists of tables that have prefixes at the
> lengths encountered on tree traversal, or by maintaining separate
> tries for subsets of rules separated by metadata fields.
> 
> Prefix tracking is configured via OVSDB.  A new column "fieldspec" is
> added to the database table "Flow_Table".  "fieldspec" is a string map
> with only the "prefix" key defined so far (other keys can be added in
> future):
> 
> "prefix"   A comma separated list of field names for which prefix
>            lookup should be used.
> 
> As of now, the fields for which prefix lookup can be enabled are:
> - tun_id, tun_src, tun_dst
> - metadata
> - nw_src, nw_dst (or aliases ip_src and ip_dst)
> - ipv6_src, ipv6_dst
> 
> There is a maximum number of fields that can be enabled for any one
> flow table.  Currently this limit is 3.
> 
> Examples:
> 
> ovs-vsctl set Bridge br0 flow_tables:0=@N1 -- \
>  --id=@N1 create Flow_Table name=table0
> ovs-vsctl set Bridge br0 flow_tables:1=@N1 -- \
>  --id=@N1 create Flow_Table name=table1
> 
> ovs-vsctl set Flow_Table table0 fieldspec:prefix=tun_id
> ovs-vsctl set Flow_Table table1 fieldspec:prefix=ip_dst,ip_src
> 
> ovs-vsctl remove Flow_Table table1 fieldspec prefix
> 
> Signed-off-by: Jarno Rajahalme <jrajahalme at nicira.com>

Here are some preliminary comments.  I think I'd like a v5, if you're
OK with that.


Bugs (?)
========

I think the intent is that the 'prefix' member of struct trie_node is
stored "left justified", so that the first bit of the prefix is
shifted into the MSB position of the uint32_t and the LSB position of
the uint32_t is only nonzero if 'nbits' is 32.  But...  I don't
understand this line of code in trie_equal_bits() and
trie_prefix_equal_bits():
    uint32_t diff = node->prefix ^ ntohl(value[ofs / 32]) << ofs % 32;
and similarly in trie_get_prefix():
    prefix = ntohl(*pr) << ofs % 32;
because both of these seem to shift by the wrong amount for this
interpretation.  If I've got 10 bits of prefix with ofs % 32 == 0, I
need to shift left by 22 bits, not by 0 bits.  So... explain?


Style
=====

The comment in trie_lookup() is missing a word or two ("Check that can
skip based...").

I think that the check for !mf in trie_lookup() is unnecessary.  (Is
there any way to have a trie without a match field?)

The indentation in trie_init() here made this statement really
confusing at first glance:
        /* Initialize subtable's prefix length on this field. */
        subtable->trie_plen[trie_idx] = field
            ? minimask_get_prefix_len(&subtable->mask, field)
            : 0;
I see why you did it that way but maybe a larger rewrite would make
the whole loop body easier to understand, e.g.:
        uint8_t *plen = &subtable->trie_plen[trie_idx];
        *plen = field ? minimask_get_prefix_len(&subtable->mask, field) : 0;
        if (*plen) {
            struct cls_rule *head;

            HMAP_FOR_EACH (head, hmap_node, &subtable->rules) {
                struct cls_rule *rule;

                FOR_EACH_RULE_IN_LIST (rule, head) {
                    trie_insert(trie, rule, *plen);
                }
            }
        }
but who knows...

The 'tries' parameter to check_tries() is kind of funny.  This
parameter is a function of the cls_subtable, that is, it could be
stored in the cls_subtable at trie creation time and never touched
again.  But we recalculate it on every call to find_match_wc().  I
came up with various other ways to do it (e.g. move to struct
cls_subtable), but I'm not sure in the end that it's actually a
valuable optimization worth keeping.

I'm also not sure that the be32ofs member of struct trie_ctx is
valuable and worth keeping.  I think that it can be trivially
eliminated:

diff --git a/lib/classifier.c b/lib/classifier.c
index 338872b..2c34cf2 100644
--- a/lib/classifier.c
+++ b/lib/classifier.c
@@ -471,7 +471,6 @@ classifier_remove(struct classifier *cls, struct cls_rule *rule)
 struct trie_ctx {
     const struct cls_trie *trie;
     bool lookup_done;     /* Status of the lookup. */
-    uint8_t be32ofs;      /* U32 offset of the field in question. */
     uint8_t match_plen;   /* Longest prefix than could possibly match. */
     uint8_t maskbits;     /* Prefix length needed to avoid false matches. */
 };
@@ -480,7 +479,6 @@ static void
 trie_ctx_init(struct trie_ctx *ctx, const struct cls_trie *trie)
 {
     ctx->trie = trie;
-    ctx->be32ofs = trie->field->flow_be32ofs;
     ctx->lookup_done = false;
 }
 
@@ -989,10 +987,11 @@ check_tries(const struct flow *flow, struct trie_ctx trie_ctx[CLS_MAX_TRIES],
      * needed to avoid folding in additional bits to the wildcards mask. */
     for (j = tries->start; j < tries->end; j++) {
         struct trie_ctx *ctx = &trie_ctx[j];
+        int be32ofs = ctx->trie->field->flow_be32ofs;
 
         /* Is the trie field relevant for this subtable and range of fields? */
         if (field_plen[j] &&
-            ctx->be32ofs >= ofs->start && ctx->be32ofs < ofs->end) {
+            be32ofs >= ofs->start && be32ofs < ofs->end) {
             /* On-demand trie lookup. */
             if (!ctx->lookup_done) {
                 ctx->match_plen = trie_lookup(ctx->trie, flow, &ctx->maskbits);
@@ -1010,14 +1009,14 @@ check_tries(const struct flow *flow, struct trie_ctx trie_ctx[CLS_MAX_TRIES],
                  * relevant for the current flow. */
 
                 /* Can skip if the field is already unwildcarded. */
-                if (mask_prefix_bits_set(wc, ctx->be32ofs, ctx->maskbits)) {
+                if (mask_prefix_bits_set(wc, be32ofs, ctx->maskbits)) {
                     return true;
                 }
                 /* Check that the trie result will not unwildcard more bits
                  * than this stage will. */
                 if (ctx->maskbits <= field_plen[j]) {
                     /* Unwildcard the bits and skip the rest. */
-                    mask_set_prefix_bits(wc, ctx->be32ofs, ctx->maskbits);
+                    mask_set_prefix_bits(wc, be32ofs, ctx->maskbits);
                     /* Note: Prerequisite already unwildcarded, as the only
                      * prerequisite of the supported trie lookup fields is
                      * the ethertype, which is currently always unwildcarded.

In the trie_lookup_value() prototype, I'd appreciate it if 'value'
were declared as "const ovs_be32 value[]".  Even though that's
semantically the same as "const ovs_be32 *value", it makes it clear
why it's a pointer instead of a single ovs_be32.


Documentation
=============

I wrote some comments for classifier.h:

diff --git a/lib/classifier.h b/lib/classifier.h
index f5fc323..032c573 100644
--- a/lib/classifier.h
+++ b/lib/classifier.h
@@ -24,15 +24,79 @@
  * =====
  *
  * A flow classifier holds any number of "rules", each of which specifies
- * values to match for some fields or subfields and a priority.  The primary
- * design goal for the classifier is that, given a packet, it can as quickly as
- * possible find the highest-priority rule that matches the packet.
- *
- * Each OpenFlow table is implemented as a flow classifier.
- *
- *
- * Basic Design
- * ============
+ * values to match for some fields or subfields and a priority.  Each OpenFlow
+ * table is implemented as a flow classifier.
+ *
+ * The classifier has two primary design goals.  The first is obvious: given a
+ * set of packet headers, as quickly as possible find the highest-priority rule
+ * that matches those headers.  The following section describes the second
+ * goal.
+ *
+ *
+ * "Un-wildcarding"
+ * ================
+ *
+ * A primary goal of the flow classifier is to produce, as a side effect of a
+ * packet lookup, a wildcard mask that indicates which bits of the packet
+ * headers were essential to the classification result.  Ideally, a 1-bit in
+ * any position of this mask means that, if the corresponding bit in the packet
+ * header were flipped, then the classification result might change.  A 0-bit
+ * means that changing the packet header bit would have no effect.  Thus, the
+ * wildcarded bits are the ones that played no role in the classification
+ * decision.
+ *
+ * Such a wildcard mask is useful with datapaths that support installing flows
+ * that wildcard fields or subfields.  If an OpenFlow lookup for a TCP flow
+ * does not actually look at the TCP source or destination ports, for example,
+ * then the switch may install into the datapath a flow that wildcards the port
+ * numbers, which in turn allows the datapath to handle packets that arrive for
+ * other TCP source or destination ports without additional help from
+ * ovs-vswitchd.  This is useful for the Open vSwitch software and,
+ * potentially, for ASIC-based switches as well.
+ *
+ * Some properties of the wildcard mask:
+ *
+ *     - "False 1-bits" are acceptable, that is, setting a bit in the wildcard
+ *       mask to a 1 will never cause a packet to be forwarded the wrong way.
+ *       As a corollary, a wildcard mask composed of all 1-bits will always
+ *       yield correct (but often needlessly inefficient) behavior.
+ *
+ *     - "False 0-bits" can cause problems, so they must be avoided.  In the
+ *       extreme case, a mask of all 0-bits is only correct if the classifier
+ *       contains only a single flow that matches all packets.
+ *
+ *     - 0-bits are desirable because they allow the datapath to act more
+ *       autonomously, relying on ovs-vswitchd to process fewer flow setups,
+ *       thereby improving performance.
+ *
+ *     - We don't know a good way to generate wildcard masks with the maximum
+ *       (correct) number of 0-bits.  We use various approximations, described
+ *       in later sections.
+ *
+ *     - Wildcard masks for lookups in a given classifier yield a
+ *       non-overlapping set of rules.  More specifically:
+ *
+ *       Consider an classifier C1 filled with an arbitrary collection of rules
+ *       and an empty classifier C2.  Now take a set of packet headers H and
+ *       look it up in C1, yielding a highest-priority matching rule R1 and
+ *       wildcard mask M.  Form a new classifier rule R2 out of packet headers
+ *       H and mask M, and add R2 to C2 with a fixed priority.  If one were to
+ *       do this for every possible set of packet headers H, then this
+ *       process would not attempt to add any overlapping rules to C2, that is,
+ *       any packet lookup using the rules generated by this process matches at
+ *       most one rule in C2.
+ *
+ * During the lookup process, the classifier starts out with a wildcard mask
+ * that is all 0-bits, that is, fully wildcarded.  As lookup proceeds, each
+ * step tends to add constraints to the wildcard mask, that is, change
+ * wildcarded 0-bits into exact-match 1-bits.  We call this "un-wildcarding".
+ * A lookup step that examines a particular field must unwildcard that field.
+ * In general, un-wildcarding is necessary for correctness but undesirable for
+ * performance.
+ *
+ *
+ * Basic Classifier Design
+ * =======================
  *
  * Suppose that all the rules in a classifier had the same form.  For example,
  * suppose that they all matched on the source and destination Ethernet address
@@ -51,8 +115,10 @@
  * cls_subtable" in the classifier and tracks the highest-priority match that
  * it finds.  The subtables are kept in a descending priority order according
  * to the highest priority rule in each subtable, which allows lookup to skip
- * over subtables that can't possibly have a higher-priority match than
- * already found.
+ * over subtables that can't possibly have a higher-priority match than already
+ * found.  Eliminating lookups through priority ordering aids both classifier
+ * primary design goals: skipping lookups saves time and avoids un-wildcarding
+ * fields that those lookups would have examined.
  *
  * One detail: a classifier can contain multiple rules that are identical other
  * than their priority.  When this happens, only the highest priority rule out
@@ -61,24 +127,28 @@
  * list inside that highest-priority rule.
  *
  *
- * Staged Lookup
- * =============
+ * Staged Lookup (Wildcard Optimization)
+ * =====================================
  *
  * Subtable lookup is performed in ranges defined for struct flow, starting
  * from metadata (registers, in_port, etc.), then L2 header, L3, and finally
  * L4 ports.  Whenever it is found that there are no matches in the current
- * subtable, the rest of the subtable can be skipped.  The rationale of this
- * logic is that as many fields as possible can remain wildcarded.
+ * subtable, the rest of the subtable can be skipped.
+ *
+ * Staged lookup does not reduce lookup time, and it may increase it, because
+ * it changes a single hash table lookup into multiple hash table lookup.  It
+ * reduces un-wildcarding significantly in important use cases.
  *
  *
- * Prefix Tracking
- * ===============
+ * Prefix Tracking (Wildcard Optimization)
+ * =======================================
  *
  * Classifier uses prefix trees ("tries") for tracking the used
  * address space, enabling skipping classifier tables containing
- * longer masks than necessary for the given address.  This enables
- * less unwildcarding for datapath flows in parts of the address space
- * without host routes.
+ * longer masks than necessary for the given address.  This reduces
+ * unwildcarding for datapath flows in parts of the address space
+ * without host routes, but consulting extra data structures (the
+ * tries) may slightly increase lookup time.
  *
  * Trie lookup is interwoven with staged lookup, so that a trie is
  * searched only when the configured trie field becomes relevant for
@@ -100,8 +170,8 @@
  * flow table.  Currently this limit is 3.
  *
  *
- * Partitioning
- * ============
+ * Partitioning (Lookup Time and Wildcard Optimization)
+ * ====================================================
  *
  * Suppose that a given classifier is being used to handle multiple stages in a
  * pipeline using "resubmit", with metadata (that is, the OpenFlow 1.1+ field
@@ -133,6 +203,9 @@
  * any cls_subtable that doesn't match on metadata.  We handle that by giving
  * any such cls_subtable TAG_ALL as its 'tags' so that it matches any tag.)
  *
+ * Partitioning saves lookup time by reducing the number of subtable lookups.
+ * Each eliminated subtable lookup also reduces the amount of un-wildcarding.
+ *
  *
  * Thread-safety
  * =============



More information about the dev mailing list