[ovs-dev] Proposal: EMC load-shedding

O Mahony, Billy billy.o.mahony at intel.com
Tue Aug 1 13:31:52 UTC 2017


Hi All,

This proposal is an attempt at a more general solution to the EMC-thrashing
issue addressed by
https://mail.openvswitch.org/pipermail/ovs-dev/2017-July/335940.html. That patch
proposes that when the EMC is overloaded, recirculated packets are neither
inserted into the EMC nor looked up in it.

However, while it is a cheap decision, picking 'recirculated' as the category
of packet to exclude from the EMC is not very granular: depending on what
proportion of the traffic is recirculated, the resulting EMC may end up
under-utilized or still overloaded.

Once there are too many entries contending for any cache, the cache becomes a
liability: the average per-packet cost, lookup_cost + (miss_rate *
cost_of_miss), grows to exceed cost_of_miss itself. In that overloaded case,
any scheme that makes a very cheap decision not to insert a certain category
of packets, and also not to check the EMC for that same category, will reduce
the cache miss rate.
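To make the break-even point concrete, here is a minimal C sketch of that
inequality (the function and its arguments are illustrative only, not anything
in the OVS tree):

#include <stdbool.h>

/* The EMC pays off only while its average per-packet cost beats going
 * straight to the dpcls.  Costs are in CPU cycles per packet. */
static bool
emc_is_worthwhile(double lookup_cost, double miss_rate, double cost_of_miss)
{
    /* With the EMC, every packet pays the lookup cost, and misses
     * additionally pay the full classifier cost. */
    double avg_cost_with_emc = lookup_cost + miss_rate * cost_of_miss;

    /* Without the EMC, every packet pays the classifier cost directly. */
    return avg_cost_with_emc < cost_of_miss;
}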

Looking at a packet's RSS hash can also provide such a "very cheap decision
that can be made to not insert a certain category of packets and to also not
check the EMC for that same category".

I don't want to propose an actual patch right now, but if the general principle
is agreed I would be happy to write at least a first iteration of the patch -
or maybe the authors of the patch above would like to do so?

So I suggest that when the EMC becomes "overloaded", load is shed based on the
RSS/5-tuple hash. If a packet's hash is above a certain threshold, it is
eligible to be inserted into the EMC; packets with a hash at or below the
threshold are neither inserted nor looked up in the EMC. For example, with a
uniformly distributed 32-bit hash, a threshold of UINT32_MAX/4 would shed
roughly a quarter of the flows. The threshold can be adjusted in an ongoing
and granular way to adapt to current traffic conditions and maintain an
optimum EMC utilization.

So in simple, non-optimized (and non-compiling) terms:

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 47a9fa0..df5b9db 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -4599,11 +4599,15 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
         miniflow_extract(packet, &key->mf);
         key->len = 0; /* Not computed yet. */
         key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);

         /* If EMC is disabled skip emc_lookup */
-        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);
+        /* Shed load: skip the EMC for hashes at or below the threshold. */
+        flow = NULL;
+        if (key->hash > flow_cache->shed_threshold) {
+            flow = (cur_min == 0) ? NULL : emc_lookup(flow_cache, key);
+        }
         if (OVS_LIKELY(flow)) {
             dp_netdev_queue_batches(packet, flow, &key->mf, batches,
                                     n_batches);
         } else {
             /* Exact match cache missed. Group missed packets together at
@@ -4777,11 +4781,15 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
             continue;
         }

         flow = dp_netdev_flow_cast(rules[i]);

-        emc_probabilistic_insert(pmd, &keys[i], flow);
+        /* Shed load: skip insertion for hashes at or below the threshold;
+         * emc_processing() skipped the lookup for these too. */
+        if (keys[i].hash > pmd->flow_cache.shed_threshold) {
+            emc_probabilistic_insert(pmd, &keys[i], flow);
+        }
         dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
     }

     dp_netdev_count_packet(pmd, DP_STAT_MASKED_HIT, cnt - miss_cnt);
     dp_netdev_count_packet(pmd, DP_STAT_LOOKUP_HIT, lookup_cnt);
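For completeness, the diff above assumes a new shed_threshold field on the
EMC. A minimal sketch of that addition (the existing fields are as in
lib/dpif-netdev.c; the new field is the assumption):

struct emc_cache {
    struct emc_entry entries[EM_FLOW_HASH_ENTRIES];
    int sweep_idx;                /* For emc_cache_slow_sweep(). */
    uint32_t shed_threshold;      /* Packets with an RSS hash at or below
                                   * this value bypass the EMC; 0 disables
                                   * shedding. */
};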

The other question is how to set the shed_threshold value. Generally:

 shed_threshold = some_function_of(emc_load)

Some suggestions for calculating emc_load (a sketch of one possible
threshold-update loop follows the list):
* num_alive_entries / num_entries
* num_evictions / num_insertions (where num_evictions is the number of
  insertions that overwrote an existing alive entry).
* something more adaptable:
  emc_load = some_function_of (cost_of_emc_miss,
                               cost_of_emc_lookup,
                               probability_of_emc_miss)
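
As an illustration of the "ongoing and granular" adjustment, here is a
hypothetical sketch of a threshold-update step, assuming a per-PMD emc_load
in [0, 1] is sampled periodically (the constants and the function name are
invented for illustration):

#include <stdint.h>

#define EMC_LOAD_TARGET 0.90                 /* Aim for ~90% utilization. */
#define SHED_STEP       (UINT32_MAX / 64)    /* Granularity of adjustment. */

static void
emc_update_shed_threshold(uint32_t *shed_threshold, double emc_load)
{
    if (emc_load > EMC_LOAD_TARGET
        && *shed_threshold <= UINT32_MAX - SHED_STEP) {
        /* Overloaded: raise the bar so fewer hash values qualify. */
        *shed_threshold += SHED_STEP;
    } else if (emc_load < EMC_LOAD_TARGET && *shed_threshold >= SHED_STEP) {
        /* Headroom available: admit more flows again. */
        *shed_threshold -= SHED_STEP;
    }
}

With a uniformly distributed hash, each step of UINT32_MAX/64 admits or sheds
roughly 1/64th of the flow space, so the threshold can converge on the target
load without large swings.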


I'd be interested to know what people think of doing something like this, and
in fleshing out further details, corner cases and so on.

Regards,
Billy


