[ovs-dev] [PATCH v4 1/2] dpif-netdev: Add SMC cache after EMC cache

Wang, Yipeng1 yipeng1.wang at intel.com
Mon Jul 9 23:55:37 UTC 2018


Thanks for the comments; please see my replies inline. I made all the changes you suggested and included them in v5, which I will send out soon.

>> diff --git a/lib/cmap.c b/lib/cmap.c
>> index 07719a8..db1c806 100644
>> --- a/lib/cmap.c
>> +++ b/lib/cmap.c
>> @@ -373,6 +373,79 @@ cmap_find(const struct cmap *cmap, uint32_t hash)
>>                         hash);
>>  }
>>
>> +/* Find a node by the index of the entry of cmap. For example, index 7 means
>> + * the second bucket and the third item.
>> + * Notice that it is not protected by the optimistic lock (versioning) because
>> + * of performance reasons. Currently it is only used by the datapath DFC cache.
>> + *
>> + * Return node for the entry of index or NULL if the index beyond boundary */
>> +const struct cmap_node *
>> +cmap_find_by_index(const struct cmap *cmap, uint16_t index)
>> +{
>> +    const struct cmap_impl *impl = cmap_get_impl(cmap);
>> +
>> +    uint32_t b = index / CMAP_K;
>> +    uint32_t e = index % CMAP_K;
>> +
>> +    if (b > impl->mask) {
>> +        return NULL;
>> +    }
>> +
>> +    const struct cmap_bucket *bucket = &impl->buckets[b];
>> +
>> +    return cmap_node_next(&bucket->nodes[e]);
>> +}
>> +
>> +/* Find the index of certain hash value. Currently only used by the datapath
>> + * DFC cache.
>> + *
>> + * Return the index of the entry if found, or UINT32_MAX if not found
>[[BO'M]]  An intro to the concept of index would be useful here, especially as it does not currently exist in cmap. Something like: "The
>'index' of a cmap entry is a way to combine the specific bucket and item occupied by an entry into a convenient single integer value. It
>is not used internally by cmap." Unless of course that is actually wrong :)
[Wang, Yipeng] I will add the comments you suggested; they indeed make the code more understandable.
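For illustration, with CMAP_K == 5 (its value on a typical 64-byte cache-line build), index 7 decomposes like this (sketch only):

    uint32_t index = 7;
    uint32_t b = index / CMAP_K;   /* == 1: the second bucket */
    uint32_t e = index % CMAP_K;   /* == 2: the third entry in that bucket */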

>If a cmap's capacity exceeds the range of UINT32_MAX what happens? Does something different happen if the entry is in a bucket
>that can be expressed in a uint32_t versus a bucket that is outside of that range?
[Wang, Yipeng] I currently assume the cmap cannot be larger than UINT32_MAX. That is a very large number, and I do not expect OVS to deal with a table
that big. But you are right that I should make this clear;
I added comments in the cmap header file and other places to explain this restriction for the newly added APIs.
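Roughly, the wording I have in mind for cmap.h (a sketch only; the exact comment ships in v5):

    /* The 'index' of a cmap entry is bucket_number * CMAP_K + entry_within_bucket.
     * The index-based functions assume the cmap holds fewer than UINT32_MAX
     * entries, since UINT32_MAX is reserved as the "not found" return value of
     * cmap_find_index(). */
    uint32_t cmap_find_index(const struct cmap *cmap, uint32_t hash);
    const struct cmap_node *cmap_find_by_index(const struct cmap *cmap,
                                               uint16_t index);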
>
>> +*/
>> @@ -155,6 +166,11 @@ struct netdev_flow_key {
>>  #define EM_FLOW_HASH_MASK (EM_FLOW_HASH_ENTRIES - 1)
>>  #define EM_FLOW_HASH_SEGS 2
>>
>> +#define SMC_ENTRY_PER_BUCKET 4
>[[BO'M]] SMC_ENTRY_PER_BUCKET -> SMC_ENTRIES_PER_BUCKET.
>Also a comment something like "A bucket forms the set of possible entries that a flow hash can occupy. Therefore
>SMC_ENTRIES_PER_BUCKET for SMC is analogous to EM_FLOW_HASH_SEGS for EMC." Might help people familiar with the current
>EMC to grok the SMC a little faster.
>
[Wang, Yipeng] I added more explanations as you suggested. SMC_ENTRIES_PER_BUCKET is actually slightly different from
EM_FLOW_HASH_SEGS. For EMC, two hash functions are used to index two locations; for SMC, currently, I just use one hash
function to index a bucket and then iterate over the entries in that bucket. It is more like a middle ground between EMC and CMAP.
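To make the difference concrete, here is a rough sketch of the bucket shape and lookup, based on the fields the patch uses (illustration only, not the exact v5 code):

    struct smc_bucket {
        uint16_t sig[SMC_ENTRY_PER_BUCKET];      /* top 16 bits of the flow hash */
        uint16_t flow_idx[SMC_ENTRY_PER_BUCKET]; /* index into pmd->flow_table */
    };

    /* One hash selects one bucket; all entries in it are scanned linearly. */
    struct smc_bucket *b = &smc_cache->buckets[hash & SMC_MASK];
    uint16_t sig = hash >> 16;
    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
        if (b->sig[i] == sig) {
            /* Candidate hit, resolved through cmap_find_by_index(). */
        }
    }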

>> @@ -2297,6 +2373,76 @@ emc_lookup(struct emc_cache *cache, const struct
>> netdev_flow_key *key)
>>      return NULL;
>>  }
>>
>> +static inline const struct cmap_node *
>> +smc_entry_get(struct dp_netdev_pmd_thread *pmd, struct smc_cache *cache,
>> +                const uint32_t hash)
>> +{
>> +    struct smc_bucket *bucket = &cache->buckets[hash & SMC_MASK];
>> +    uint16_t sig = hash >> 16;
>> +    uint16_t index = UINT16_MAX;
>> +
>> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
>> +        if (bucket->sig[i] == sig) {
>> +            index = bucket->flow_idx[i];
>> +            break;
>> +        }
>> +    }
>> +    if (index != UINT16_MAX) {
>> +        return cmap_find_by_index(&pmd->flow_table, index);
>> +    }
>> +    return NULL;
>> +}
>> +
>> +static void
>> +smc_clear_entry(struct smc_bucket *b, int idx)
>> +{
>> +    b->flow_idx[idx] = UINT16_MAX;
>> +}
>> +
>> +static inline int
>[[BO'M]] As return value seems to be 1 for ok and 0 for failure suggest using a bool for the return value. Also a comment describing
>when an insert may fail. Describe insertion strategy which seems to be 'if entry already exists update it, otherwise insert in a free
>space, if no free space is available randomly pick an entry from the bucket'
>
[Wang, Yipeng] Thanks for pointing it out. I eventually changed it to a void function, since I realized that I do not actually need a return value.
I will include this change and more comments explaining the logic in v5.
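Roughly, the v5 shape will be the following (a sketch of the signature and the comment describing the strategy; the exact body is in v5):

    /* Insert a flow into the SMC. The strategy is:
     *  - if an entry with the same signature already exists, update its flow index;
     *  - otherwise take the first free slot in the bucket;
     *  - if the bucket is full, evict a randomly chosen entry.
     * Insertion is silently skipped when SMC is disabled or when the flow's cmap
     * index does not fit into 16 bits, so no return value is needed. */
    static void
    smc_insert(struct dp_netdev_pmd_thread *pmd,
               const struct netdev_flow_key *key,
               uint32_t hash);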

>> +smc_insert(struct dp_netdev_pmd_thread *pmd,
>> +           const struct netdev_flow_key *key,
>> +           uint32_t hash)
>> +{
>> +    struct smc_cache *smc_cache = &pmd->flow_cache.smc_cache;
>> +    struct smc_bucket *bucket = &smc_cache->buckets[key->hash & SMC_MASK];
>> +    uint16_t index;
>> +    uint32_t cmap_index;
>> +    bool smc_enable_db;
>> +
>> +    atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
>> +    if (!smc_enable_db) {
>> +        return 0;
>> +    }
>> +
>> +    cmap_index = cmap_find_index(&pmd->flow_table, hash);
>> +    index = (cmap_index >= UINT16_MAX) ? UINT16_MAX : (uint16_t)cmap_index;
>> +
>> +    /* If the index is larger than SMC can handel */
>[[BO'M]]     handel -> handle.
>Also should state explicitly that while cmap_find_index can generate an index up to UINT32_MAX we only handle entry values up to
>UINT16_MAX in order to reduce SMC memory footprint.
[Wang, Yipeng] I will add the suggested comments.
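Something along these lines (a sketch of the comment, shown against the v5 void-returning smc_insert):

    /* cmap_find_index() can return an index up to UINT32_MAX, but the SMC
     * stores only 16-bit flow indexes to keep its memory footprint small, so
     * flows whose index does not fit into 16 bits are simply not cached. */
    if (index == UINT16_MAX) {
        return;
    }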

>> +    if (index == UINT16_MAX) {
>> +        return 0;
>> +    }
>> +
>> +    uint16_t sig = key->hash >> 16;
>> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
>> +        if(bucket->sig[i] == sig) {
>> +            bucket->flow_idx[i] = index;
>> +            return 1;
>> +        }
>> +    }
>> +    for (int i = 0; i < SMC_ENTRY_PER_BUCKET; i++) {
>> +        if(bucket->flow_idx[i] == UINT16_MAX) {
>> +            bucket->sig[i] = sig;
>> +            bucket->flow_idx[i] = index;
>> +            return 1;
>> +        }
>> +    }
>> +    int idx = random_uint32() & (SMC_ENTRY_PER_BUCKET - 1);
>[[BO'M]] This is assuming SMC_ENTRY_PER_BUCKET is a power of 2? But there is no BUILD_ASSERT_DECL for that. Would using mod
>to avoid the ^2 constraint be too slow?
>Also 'idx' makes me think of the cmap 'index' concept which is not relevant here. Re-using 'i' above as the array index would be more
>consistent.
[Wang, Yipeng] You are right; I should use mod. If SMC_ENTRY_PER_BUCKET is a power of 2, I believe the compiler will optimize the mod into an &.
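For reference, the replacement I have in mind (also renaming 'idx' to avoid confusion with the cmap index):

    /* Pick a random victim slot; mod works for any bucket size, and for a
     * power-of-2 SMC_ENTRY_PER_BUCKET the compiler turns it into an AND. */
    int i = random_uint32() % SMC_ENTRY_PER_BUCKET;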
>
>> @@ -5132,10 +5289,67 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>>      packet_batch_per_flow_update(batch, pkt, mf);
>>  }
>>
>> -/* Try to process all ('cnt') the 'packets' using only the exact match cache
>> +/* SMC lookup function for a batch of packets.
>> + * By doing batching SMC lookup, we can use prefetch
>> + * to hide memory access latency.
>> + */
>> +static inline void
>> +smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>> +            struct netdev_flow_key *keys,
>> +            struct netdev_flow_key **missed_keys,
>> +            struct dp_packet_batch *packets_,
>> +            struct packet_batch_per_flow batches[],
>> +            size_t *n_batches, const int cnt)
>> +{
>> +    int i;
>> +    struct dp_packet *packet;
>> +    size_t n_smc_hit = 0, n_missed = 0;
>> +    struct dfc_cache *cache = &pmd->flow_cache;
>> +    struct smc_cache *smc_cache = &cache->smc_cache;
>> +    const struct cmap_node *flow_node;
>> +
>> +    /* Prefetch buckets for all packets */
>> +    for (i = 0; i < cnt; i++) {
>> +        OVS_PREFETCH(&smc_cache->buckets[keys[i].hash & SMC_MASK]);
>> +    }
>> +
>> +    DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>> +        struct dp_netdev_flow *flow = NULL;
>> +        flow_node = smc_entry_get(pmd, smc_cache, keys[i].hash);
>> +
>> +        if (OVS_LIKELY(flow_node != NULL)) {
>> +            /* Optimistically we only check the head of the node linked list */
>[[BO'M]] So this is only an issue for dpcls flows that share the same hash for pmd->flow_table (one time in 2^32 flows)? In that case
>the dp_netdev_flows hidden behind first one will not be hit in DFC ever but it would be found in dpcls_lookup? i.e. functionally still
>correct just a very rare performance hit. Would a loop to follow the possible node chain hurt the usual case performance much?
[Wang, Yipeng] You understand it correctly. I was worried about the overhead of the loop (and branches), and I thought it was a corner case anyway.
But I tested again with the loop added back, and it did not seem to harm performance in my new test run. So I will add it back.
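A sketch of the loop I plan to add back (using CMAP_NODE_FOR_EACH and dpcls_rule_matches_key as they exist in the tree; the exact v5 code may differ slightly):

        if (OVS_LIKELY(flow_node != NULL)) {
            struct dp_netdev_flow *flow;
            /* Walk the (normally single-entry) chain of flows sharing this
             * flow_table hash instead of checking only the head node. */
            CMAP_NODE_FOR_EACH (flow, node, flow_node) {
                if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, &keys[i]))) {
                    /* SMC hit: queue the packet for this flow's batch. */
                    break;
                }
            }
        }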
>> @@ -5417,17 +5652,18 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>>  #endif
>>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
>>          struct netdev_flow_key keys[PKT_ARRAY_SIZE];
>> +    struct netdev_flow_key *missed_keys[PKT_ARRAY_SIZE];
>>      struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
>>      size_t n_batches;
>>      odp_port_t in_port;
>>
>>      n_batches = 0;
>> -    emc_processing(pmd, packets, keys, batches, &n_batches,
>> +    dfc_processing(pmd, packets, keys, missed_keys, batches, &n_batches,
>>                              md_is_valid, port_no);
>>      if (!dp_packet_batch_is_empty(packets)) {
>>          /* Get ingress port from first packet's metadata. */
>>          in_port = packets->packets[0]->md.in_port.odp_port;
>> -        fast_path_processing(pmd, packets, keys,
>> +        fast_path_processing(pmd, packets, missed_keys,
>>                               batches, &n_batches, in_port);
>[[BO'M]] Does the introduction of 'missed_keys' (array of pointers to keys) add anything to the existing 'keys' array populated by
>dfc_processing? It's still the same information passed to fast_path_processing just now as an array of pointers as opposed to a pointer
>to an array.
[Wang, Yipeng] It is a good question, and there is no major functional difference; it is just a pointer array now instead of the key array.
Here is the reason I made this change. At the beginning, the SMC lookup was not batched: every EMC miss went straight to an SMC lookup,
and a key that missed both the EMC and the SMC was left in the key array for later fast_path_processing.
I found that suboptimal: batching the SMC lookup lets me use prefetching to improve performance (or even software pipelining).
With batched SMC lookup, the EMC lookup stays the same as before, and keys that miss the EMC are left in the key array. I then pass that key array into the SMC for the batched lookup.
During the SMC lookup, I build the key pointer array pointing to the keys that also miss the SMC, and eventually pass that pointer array to fast_path_processing.
I use the new pointer array because I do not want to move the keys around to form a new key array during the SMC lookup; that would be costly. Another way to do it
is to keep a "hit mask" showing which keys in the key array already hit the SMC, but then I would still need to modify many functions to pass this mask around.
Functionally it should not change anything relative to the master head.
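In other words, inside the batched SMC lookup the misses are recorded roughly like this (a sketch; variable names follow the patch):

        if (flow_node == NULL) {
            /* The key stays where it is in keys[]; only a pointer to it is
             * recorded, so no netdev_flow_key is copied or moved. */
            missed_keys[n_missed++] = &keys[i];
        } else {
            /* SMC hit: the packet is batched here and never reaches
             * fast_path_processing. */
            n_smc_hit++;
        }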
>>
>> @@ -6377,7 +6613,7 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule
>> *rule)
>>
>>  /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
>>   * in 'mask' the values in 'key' and 'target' are the same. */
>> -static inline bool
>> +static bool
>>  dpcls_rule_matches_key(const struct dpcls_rule *rule,
>>                         const struct netdev_flow_key *target)
>>  {
>> @@ -6404,7 +6640,7 @@ dpcls_rule_matches_key(const struct dpcls_rule *rule,
>>   *
>>   * Returns true if all miniflows found a corresponding rule. */
>>  static bool
>> -dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
>> +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
>[[BO'M]] Is there a benefit to changing from keys[] to *keys[] ? struct netdev_flow_key **keys as used elsewhere is easier to
>understand.
[Wang, Yipeng] Similar to what I explained above, it is just because the batched SMC lookup now forms a pointer array instead
of the original key array, and I pass the pointer array around after the SMC lookup instead of the key array as before.
Functionally it is the same as the master head; there is no other specific reason.


