[ovs-dev] [RFC Patch] dpif-netdev: Sorted subtable vectors per in_port in dpcls
Fischetti, Antonio
antonio.fischetti at intel.com
Mon Jun 20 15:42:37 UTC 2016
Hi Jan,
that's an interesting approach.
I added a few comments and questions below.
Thanks,
Antonio
> -----Original Message-----
> From: dev [mailto:dev-bounces at openvswitch.org] On Behalf Of Jan
> Scheurich
> Sent: Thursday, June 16, 2016 2:56 PM
> To: dev at openvswitch.org
> Subject: [ovs-dev] [RFC Patch] dpif-netdev: Sorted subtable vectors
> per in_port in dpcls
>
> The user-space datapath (dpif-netdev) consists of a first level
> "exact match cache" (EMC) matching on 5-tuples and the normal
> megaflow classifier. With many parallel packet flows (e.g. TCP
> connections) the EMC becomes inefficient and the OVS forwarding
> performance is determined by the megaflow classifier.
>
> The megaflow classifier (dpcls) consists of a variable number of hash
> tables (aka subtables), each containing megaflow entries with the
> same mask of packet header and metadata fields to match upon. A dpcls
> lookup matches a given packet against all subtables in sequence until
> it hits a match. As megaflow cache entries are by construction non-
> overlapping, the first match is the only match.
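[Editor's note] The sequential lookup described above can be sketched with a toy model (hypothetical simplified structures, not the actual OVS code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of the dpcls described above: each "subtable" holds one mask
 * and one masked value. Because megaflow entries are non-overlapping, the
 * first subtable that matches is the only possible match. */
struct toy_subtable {
    uint32_t mask;
    uint32_t value;             /* key matches iff (key & mask) == value */
};

static const struct toy_subtable example_tables[2] = {
    { 0xff00, 0x1200 },         /* "match on the high byte" */
    { 0x00ff, 0x0034 },         /* "match on the low byte" */
};

/* Probe subtables in sequence; return the index of the first hit,
 * or -1 on a miss (which would trigger an upcall in OVS). */
static int
toy_dpcls_lookup(const struct toy_subtable *st, size_t n, uint32_t key)
{
    for (size_t i = 0; i < n; i++) {
        if ((key & st[i].mask) == st[i].value) {
            return (int) i;
        }
    }
    return -1;
}
```

When the subtable order is unrelated to the traffic, a hit visits half the subtables on average, which is the N/2 behaviour described next.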
>
> Today the order of the subtables in the dpcls is essentially random,
> so that on average a dpcls lookup has to visit N/2 subtables for a
> hit, where N is the total number of subtables. Even though every
> single hash-table lookup is fast, the performance of the current
> dpcls degrades when there are many subtables.
>
> How does the patch address this issue:
>
> In reality there is often a strong correlation between the ingress
> port and a small subset of subtables that have hits. The entire
> megaflow cache typically decomposes nicely into partitions that are
> hit only by packets entering from a range of similar ports (e.g.
> traffic from Phy -> VM vs. traffic from VM -> Phy).
>
> Therefore, keeping a separate list of subtables per ingress port,
> sorted by frequency of hits, reduces the average number of subtables
> lookups in the dpcls to a minimum, even if the total number of
> subtables gets large.
>
> The patch introduces 32 subtable vectors per dpcls and hashes the
> ingress port to select the subtable vector. The patch also counts
> matches per 32 slots in each vector (hashing the subtable pointer to
> obtain the slot) and sorts the vectors according to match frequency
> every second.
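[Editor's note] The two hashing steps just described can be sketched as follows; toy_hash() is only a stand-in for OVS's hash_port_no() and hash_pointer(), so just the modulo structure mirrors the patch:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SUBTABLE_VECTORS 32
#define NUM_SUBTABLE_SLOTS   32

/* Stand-in integer hash; the patch uses hash_port_no() and
 * hash_pointer(..., 210365) from OVS's hash library. */
static uint32_t
toy_hash(uint32_t x)
{
    x ^= x >> 16;
    x *= 0x45d9f3bU;
    x ^= x >> 16;
    return x;
}

/* Select the per-in_port subtable vector, as dpcls_lookup() does. */
static uint32_t
vector_for_port(uint32_t port_no)
{
    return toy_hash(port_no) % NUM_SUBTABLE_VECTORS;
}

/* Select the hit-counter slot for a subtable, as both dpcls_lookup()
 * and dpcls_try_optimize() do via hash_pointer(). */
static uint32_t
slot_for_subtable(const void *subtable)
{
    return toy_hash((uint32_t) (uintptr_t) subtable) % NUM_SUBTABLE_SLOTS;
}
```

Because both steps hash into fixed-size arrays, distinct ports (or subtables) can collide on the same vector (or slot); the benchmark below shows exactly such a port collision.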
>
> To monitor the effectiveness of the patch we have enhanced the
> ovs-appctl dpif-netdev/pmd-stats-show command with an extra line
> "avg. subtable lookups per hit" to report the average number of
> subtable lookups needed for a megaflow match. Ideally, this should be close to
> 1 and much smaller than N/2.
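[Editor's note] The new statistic is simply the ratio of accumulated subtable probes to megaflow hits; a minimal sketch of the computation, mirroring the division-by-zero guard in the patch's pmd_info_show_stats hunk:

```c
#include <assert.h>
#include <stdint.h>

/* Average number of subtable lookups per megaflow hit, guarding
 * against division by zero when there have been no hits yet. */
static double
avg_lookups_per_hit(uint64_t lookup_cnt, uint64_t masked_hits)
{
    return masked_hits > 0 ? (double) lookup_cnt / masked_hits : 0.0;
}
```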
>
> I have benchmarked a cloud L3 overlay pipeline with a VXLAN overlay
> mesh. With pure L3 tenant traffic between VMs on different nodes the
> resulting netdev dpcls contains N=4 subtables.
>
> Disabling the EMC, I have measured a baseline performance (in+out) of
> ~1.32 Mpps (64 bytes, 1000 L4 flows). The average number of subtable
> lookups per dpcls match is 2.5.
>
> With the patch the average number of subtable lookups per dpcls match
> goes down to 1.25 (apparently there are still two ports of different
> nature hashed to the same vector, otherwise it should be exactly
> one). Even so, the forwarding performance grows by ~30% to 1.72 Mpps.
[Antonio F]
What if the range of the hash values were partitioned per port type?
I mean, if the first range - say the first 10 values - were reserved
for dpdk ports:
if <is a dpdk port> // range is [0..9]
    uint32_t port_idx = 0 + hash_port_no(port_no) % 10;
else // range is [10..NUM_SUBTABLE_VECTORS-1]
    uint32_t port_idx = 10 + hash_port_no(port_no) % (NUM_SUBTABLE_VECTORS - 10);
Could this help to avoid collisions between ports of different types,
e.g. a dpdk port with a vhostuser port?
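[Editor's note] A compilable sketch of the partitioning idea above; the split point 10 and the is_dpdk_port flag are illustrative assumptions, and toy_hash_port() is only a stand-in for hash_port_no():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_SUBTABLE_VECTORS 32
#define DPDK_RANGE           10   /* illustrative split point */

/* Stand-in for OVS's hash_port_no(). */
static uint32_t
toy_hash_port(uint32_t port_no)
{
    return port_no * 2654435761U;   /* Knuth multiplicative hash */
}

/* Partition the vector index space by port type: dpdk ports map into
 * [0..DPDK_RANGE-1] and all other ports into
 * [DPDK_RANGE..NUM_SUBTABLE_VECTORS-1], so the two types can never
 * share a subtable vector. */
static uint32_t
partitioned_port_idx(uint32_t port_no, bool is_dpdk_port)
{
    if (is_dpdk_port) {
        return toy_hash_port(port_no) % DPDK_RANGE;
    } else {
        return DPDK_RANGE
               + toy_hash_port(port_no) % (NUM_SUBTABLE_VECTORS - DPDK_RANGE);
    }
}
```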
>
> As the number of subtables will often be higher in reality, we can
[Antonio F] Do you know how many subtables can be used in a real scenario?
> assume that this is at the lower end of the speed-up one can expect
> from this optimization. Just running a parallel ping between the
> VXLAN tunnel endpoints increases the number of subtables and hence
> the average number of subtable lookups from 2.5 to 3.5 with a
> corresponding decrease of throughput to 1.14 Mpps. With the patch the
> parallel ping has no impact on average number of subtable lookups and
> performance. The performance gain is then ~50%.
>
> Signed-off-by: Jan Scheurich <jan.scheurich at ericsson.com>
>
> -----
>
> lib/dpif-netdev.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 92 insertions(+), 18 deletions(-)
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 8d39d9e..c491110 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -149,9 +149,15 @@ struct emc_cache {
>
> /* Simple non-wildcarding single-priority classifier. */
>
> +#define NUM_SUBTABLE_VECTORS 32
> +#define NUM_SUBTABLE_SLOTS 32
> +#define DPCLS_OPTIMIATION_INTERVAL 1000
> +
> struct dpcls {
> struct cmap subtables_map;
> - struct pvector subtables;
> + struct pvector subtables[NUM_SUBTABLE_VECTORS];
> + uint64_t hit_cnt[NUM_SUBTABLE_VECTORS][NUM_SUBTABLE_SLOTS];
> + long long int next_optimization;
> };
>
> /* A rule to be inserted to the classifier. */
> @@ -164,12 +170,14 @@ struct dpcls_rule {
>
> static void dpcls_init(struct dpcls *);
> static void dpcls_destroy(struct dpcls *);
> +static void dpcls_try_optimize(struct dpcls *);
> static void dpcls_insert(struct dpcls *, struct dpcls_rule *,
> const struct netdev_flow_key *mask);
> static void dpcls_remove(struct dpcls *, struct dpcls_rule *);
> -static bool dpcls_lookup(const struct dpcls *cls,
> +static bool dpcls_lookup(struct dpcls *cls,
> const struct netdev_flow_key keys[],
> - struct dpcls_rule **rules, size_t cnt);
> + struct dpcls_rule **rules, size_t cnt,
> const odp_port_t port_no, int *num_lookups_p);
>
> /* Datapath based on the network device interface from netdev.h.
> *
> @@ -239,6 +247,7 @@ enum dp_stat_type {
> DP_STAT_MASKED_HIT, /* Packets that matched in the flow table. */
> DP_STAT_MISS, /* Packets that did not match. */
> DP_STAT_LOST, /* Packets not passed up to the client. */
> + DP_STATS_LOOKUP_HIT, /* Number of subtable lookups for flow table hits */
> DP_N_STATS
> };
>
> @@ -657,8 +666,10 @@ pmd_info_show_stats(struct ds *reply,
>
> ds_put_format(reply,
> "\temc hits:%llu\n\tmegaflow hits:%llu\n"
> + "\tavg. subtable lookups per hit:%5.2f\n"
> "\tmiss:%llu\n\tlost:%llu\n",
> stats[DP_STAT_EXACT_HIT],
> stats[DP_STAT_MASKED_HIT],
> + stats[DP_STAT_MASKED_HIT] > 0 ? (1.0*stats[DP_STATS_LOOKUP_HIT])/stats[DP_STAT_MASKED_HIT] : 0,
> stats[DP_STAT_MISS], stats[DP_STAT_LOST]);
>
> if (total_cycles == 0) {
> @@ -1855,13 +1866,13 @@ emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
> }
>
> static struct dp_netdev_flow *
> -dp_netdev_pmd_lookup_flow(const struct dp_netdev_pmd_thread *pmd,
> +dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
> const struct netdev_flow_key *key)
> {
> struct dp_netdev_flow *netdev_flow;
> struct dpcls_rule *rule;
>
> - dpcls_lookup(&pmd->cls, key, &rule, 1);
> + dpcls_lookup(&pmd->cls, key, &rule, 1, 0, 0); /* Use 0 as dummy in_port */
> netdev_flow = dp_netdev_flow_cast(rule);
>
> return netdev_flow;
> @@ -2874,6 +2885,7 @@ reload:
>
> emc_cache_slow_sweep(&pmd->flow_cache);
> coverage_try_clear();
> + dpcls_try_optimize(&pmd->cls);
> ovsrcu_quiesce();
>
> atomic_read_relaxed(&pmd->change_seq, &seq);
> @@ -3814,7 +3826,8 @@ static inline void
> fast_path_processing(struct dp_netdev_pmd_thread *pmd,
> struct dp_packet_batch *packets_,
> struct netdev_flow_key *keys,
> - struct packet_batch_per_flow batches[], size_t *n_batches)
> + struct packet_batch_per_flow batches[], size_t *n_batches,
> + odp_port_t port_no)
> {
> int cnt = packets_->count;
> #if !defined(__CHECKER__) && !defined(_WIN32)
> @@ -3827,7 +3840,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
> struct dpcls_rule *rules[PKT_ARRAY_SIZE];
> struct dp_netdev *dp = pmd->dp;
> struct emc_cache *flow_cache = &pmd->flow_cache;
> - int miss_cnt = 0, lost_cnt = 0;
> + int miss_cnt = 0, lost_cnt = 0, lookup_cnt;
> bool any_miss;
> size_t i;
>
> @@ -3835,7 +3848,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
> /* Key length is needed in all the cases, hash computed on demand. */
> keys[i].len = netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
> }
> - any_miss = !dpcls_lookup(&pmd->cls, keys, rules, cnt);
> + any_miss = !dpcls_lookup(&pmd->cls, keys, rules, cnt, port_no, &lookup_cnt);
> if (OVS_UNLIKELY(any_miss) && !fat_rwlock_tryrdlock(&dp->upcall_rwlock)) {
> uint64_t actions_stub[512 / 8], slow_stub[512 / 8];
> struct ofpbuf actions, put_actions;
> @@ -3893,6 +3906,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
> }
>
> dp_netdev_count_packet(pmd, DP_STAT_MASKED_HIT, cnt - miss_cnt);
> + dp_netdev_count_packet(pmd, DP_STATS_LOOKUP_HIT, lookup_cnt);
> dp_netdev_count_packet(pmd, DP_STAT_MISS, miss_cnt);
> dp_netdev_count_packet(pmd, DP_STAT_LOST, lost_cnt);
> }
> @@ -3925,7 +3939,7 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
> md_is_valid, port_no);
> if (OVS_UNLIKELY(newcnt)) {
> packets->count = newcnt;
> - fast_path_processing(pmd, packets, keys, batches, &n_batches);
> + fast_path_processing(pmd, packets, keys, batches, &n_batches, port_no);
> }
>
> for (i = 0; i < n_batches; i++) {
> @@ -4354,14 +4368,24 @@ struct dpcls_subtable {
> static void
> dpcls_init(struct dpcls *cls)
> {
> + int i;
> +
> cmap_init(&cls->subtables_map);
> - pvector_init(&cls->subtables);
> + for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {
> + pvector_init(&cls->subtables[i]);
> + }
> + memset(cls->hit_cnt, 0, sizeof cls->hit_cnt);
> + cls->next_optimization = time_msec() + DPCLS_OPTIMIATION_INTERVAL;
> }
>
> static void
> dpcls_destroy_subtable(struct dpcls *cls, struct dpcls_subtable *subtable)
> {
> - pvector_remove(&cls->subtables, subtable);
> + int i;
> +
> + for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {
> + pvector_remove(&cls->subtables[i], subtable);
> + }
> cmap_remove(&cls->subtables_map, &subtable->cmap_node,
> subtable->mask.hash);
> cmap_destroy(&subtable->rules);
> @@ -4376,13 +4400,16 @@ dpcls_destroy(struct dpcls *cls)
> {
> if (cls) {
> struct dpcls_subtable *subtable;
> + int i;
>
> CMAP_FOR_EACH (subtable, cmap_node, &cls->subtables_map) {
> ovs_assert(cmap_count(&subtable->rules) == 0);
> dpcls_destroy_subtable(cls, subtable);
> }
> cmap_destroy(&cls->subtables_map);
> - pvector_destroy(&cls->subtables);
> + for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {
> + pvector_destroy(&cls->subtables[i]);
> + }
> }
> }
>
> @@ -4390,6 +4417,7 @@ static struct dpcls_subtable *
> dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
> {
> struct dpcls_subtable *subtable;
> + int i;
>
> /* Need to add one. */
> subtable = xmalloc(sizeof *subtable
> @@ -4397,8 +4425,11 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
> cmap_init(&subtable->rules);
> netdev_flow_key_clone(&subtable->mask, mask);
> cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
> - pvector_insert(&cls->subtables, subtable, 0);
> - pvector_publish(&cls->subtables);
> + for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {
> + /* Insert new subtable with priority zero (no hits) */
> + pvector_insert(&cls->subtables[i], subtable, 0);
> + pvector_publish(&cls->subtables[i]);
> + }
>
> return subtable;
> }
> @@ -4417,6 +4448,33 @@ dpcls_find_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
> return dpcls_create_subtable(cls, mask);
> }
>
> +
> +/* Periodically sort the subtable vectors according to hit counts */
> +static inline void
> +dpcls_try_optimize(struct dpcls *cls)
> +{
> + long long int now = time_msec();
> +
> + if (now > cls->next_optimization) {
> + struct dpcls_subtable *subtable;
> + int port_idx, subtable_idx;
> +
> + for (port_idx=0; port_idx<NUM_SUBTABLE_VECTORS; port_idx++) {
> + struct pvector *pvec = &cls->subtables[port_idx];
> + PVECTOR_FOR_EACH (subtable, pvec) {
> + subtable_idx = hash_pointer(subtable, 210365) % NUM_SUBTABLE_SLOTS;
> + pvector_change_priority(pvec, subtable,
> + cls->hit_cnt[port_idx][subtable_idx]);
> + }
> + pvector_publish(pvec);
> + }
> +
> + /* Start new measuring interval */
> + memset(cls->hit_cnt, 0, sizeof cls->hit_cnt);
> + cls->next_optimization = now + DPCLS_OPTIMIATION_INTERVAL;
> + }
> +}
> +
> /* Insert 'rule' into 'cls'. */
> static void
> dpcls_insert(struct dpcls *cls, struct dpcls_rule *rule,
> @@ -4433,6 +4491,7 @@ static void
> dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
> {
> struct dpcls_subtable *subtable;
> + int i;
>
> ovs_assert(rule->mask);
>
> @@ -4441,7 +4500,9 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
> if (cmap_remove(&subtable->rules, &rule->cmap_node, rule->flow.hash)
> == 0) {
> dpcls_destroy_subtable(cls, subtable);
> - pvector_publish(&cls->subtables);
> + for (i=0; i<NUM_SUBTABLE_VECTORS; i++) {
> + pvector_publish(&cls->subtables[i]);
> + }
> }
> }
>
> @@ -4474,8 +4535,9 @@ dpcls_rule_matches_key(const struct dpcls_rule *rule,
> *
> * Returns true if all flows found a corresponding rule. */
> static bool
> -dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
> - struct dpcls_rule **rules, const size_t cnt)
> +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
> + struct dpcls_rule **rules, const size_t cnt,
> + const odp_port_t port_no, int *num_lookups_p)
> {
> /* The batch size 16 was experimentally found faster than 8 or 32. */
> typedef uint16_t map_type;
> @@ -4495,7 +4557,12 @@ dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
> }
> memset(rules, 0, cnt * sizeof *rules);
>
> - PVECTOR_FOR_EACH (subtable, &cls->subtables) {
> + /* Select the subtable vector for the in_port */
> + uint32_t port_idx = hash_port_no(port_no) % NUM_SUBTABLE_VECTORS;
[Antonio F] May we simply use the port_no instead of its hashed value?
Otherwise, could we compute port_idx at the beginning, or at port
creation, so as to avoid computing it on every batch of keys?
> + uint64_t *hit_cnt = cls->hit_cnt[port_idx];
> + int lookups_match = 0, subtable_pos = 1;
> +
> + PVECTOR_FOR_EACH (subtable, &cls->subtables[port_idx]) {
> const struct netdev_flow_key *mkeys = keys;
> struct dpcls_rule **mrules = rules;
> map_type remains = 0;
> @@ -4527,6 +4594,10 @@ dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
> CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
> if (OVS_LIKELY(dpcls_rule_matches_key(rule, &mkeys[i]))) {
> mrules[i] = rule;
> + /* Update the subtable hit statistics for the in_port vector */
> + int subtable_idx = hash_pointer(subtable, 210365) % NUM_SUBTABLE_SLOTS;
> + hit_cnt[subtable_idx]++;
> + lookups_match += subtable_pos;
> goto next;
> }
> }
> @@ -4538,8 +4609,11 @@ dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
> remains |= maps[m];
> }
> if (!remains) {
> + if (num_lookups_p) *num_lookups_p = lookups_match;
> return true; /* All found. */
> }
> + subtable_pos++;
> }
> + if (num_lookups_p) *num_lookups_p = lookups_match;
> return false; /* Some misses. */
> }
>
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev