[ovs-dev] [PATCH v2] dpif-netdev: dpcls per in_port with sorted subtables

Fischetti, Antonio antonio.fischetti at intel.com
Tue Jul 19 15:22:32 UTC 2016


Hi Jan,
I added some comments prefixed with [Antonio F].

Thanks,
Antonio

> -----Original Message-----
> From: dev [mailto:dev-bounces at openvswitch.org] On Behalf Of Jan
> Scheurich
> Sent: Friday, July 15, 2016 5:35 PM
> To: dev at openvswitch.org
> Subject: [ovs-dev] [PATCH v2] dpif-netdev: dpcls per in_port with
> sorted subtables
> 
> This turns the previous RFC patch "dpif-netdev: dpcls per in_port with
> sorted subtables" into a non-RFC patch v2.
> 
> The user-space datapath (dpif-netdev) consists of a first-level "exact
> match cache" (EMC) matching on 5-tuples and the normal megaflow
> classifier. With many parallel packet flows (e.g. TCP connections) the
> EMC becomes inefficient and the OVS forwarding performance is
> determined by the megaflow classifier.
> 
> The megaflow classifier (dpcls) consists of a variable number of hash
> tables (aka subtables), each containing megaflow entries with the same
> mask of packet header and metadata fields to match upon. A dpcls
> lookup matches a given packet against all subtables in sequence until
> it hits a match. As megaflow cache entries are by construction
> non-overlapping, the first match is the only match.
> 
> Today the order of the subtables in the dpcls is essentially random,
> so that on average a dpcls lookup has to visit N/2 subtables for a
> hit, where N is the total number of subtables. Even though every
> single hash-table lookup is fast, the performance of the current
> dpcls degrades when there are many subtables.
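> 
> For example, with N = 10 subtables and the matching subtable equally
> likely to sit at any position in the vector, a hit costs on average
> (1 + 2 + ... + 10) / 10 = 5.5 subtable lookups, i.e. roughly N/2.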
> 
> How the patch addresses this issue:
> 
> In reality there is often a strong correlation between the ingress
> port and a small subset of subtables that have hits. The entire
> megaflow cache typically decomposes nicely into partitions that are
> hit only by packets entering from a range of similar ports (e.g.
> traffic from Phy -> VM vs. traffic from VM -> Phy).
> 
> Therefore, maintaining a separate dpcls instance per ingress port with
> its subtable vector sorted by frequency of hits reduces the average
> number of subtable lookups in the dpcls to a minimum, even if the
> total number of subtables gets large. This is possible because
> megaflows always have an exact match on in_port, so every megaflow
> belongs to a unique dpcls instance.
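> 
> The periodic re-sorting itself is cheap. Each subtable carries a hit
> counter for the current interval, and the pvector is re-ranked with it
> (this is essentially what dpcls_optimize() in the diff below does):
> 
>     struct dpcls_subtable *subtable;
> 
>     PVECTOR_FOR_EACH (subtable, &cls->subtables) {
>         pvector_change_priority(&cls->subtables, subtable,
>                                 subtable->hit_cnt);
>         subtable->hit_cnt = 0;   /* Start a new measuring interval. */
>     }
>     pvector_publish(&cls->subtables);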
> 
> For thread safety, the PMD thread needs to block out revalidators
> during the periodic optimization. We use ovs_mutex_trylock() to avoid
> blocking the PMD.
> 
> To monitor the effectiveness of the patch we have enhanced the
> ovs-appctl dpif-netdev/pmd-stats-show command with an extra line
> "avg. subtable lookups per hit" to report the average number of
> subtable lookups needed for a megaflow match. Ideally this should be
> close to 1 and in almost all cases much smaller than N/2.
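> 
> The reported value is simply the ratio of two counters maintained in
> the datapath statistics (see pmd_info_show_stats() in the diff):
> 
>     avg. subtable lookups per hit =
>         stats[DP_STAT_LOOKUP_HIT] / stats[DP_STAT_MASKED_HIT]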
> 
> I have benchmarked a cloud L3 overlay pipeline with a VXLAN overlay
> mesh. With pure L3 tenant traffic between VMs on different nodes the
> resulting netdev dpcls contains N=4 subtables.
> 
> Disabling the EMC, I have measured a baseline performance (in+out) of
> ~1.32 Mpps (64 bytes, 1000 L4 flows). The average number of subtable
> lookups per dpcls match is 2.5.
> 
> With the patch the average number of subtable lookups per dpcls match
> is reduced to 1 and the forwarding performance grows by ~30% to
> 1.72 Mpps.
> 
> As the actual number of subtables will often be higher in reality, we
> can assume that this is at the lower end of the speed-up one can
> expect from this optimization. Just running a parallel ping between
> the VXLAN tunnel endpoints increases the number of subtables and hence
> the average number of subtable lookups from 2.5 to 3.5, with a
> corresponding decrease of throughput to 1.14 Mpps. With the patch the
> parallel ping has no impact on the average number of subtable lookups
> and performance. The performance gain is then ~50%.
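> 
> (For reference, the quoted gains follow directly from the measured
> rates: 1.72 / 1.32 = ~1.30, i.e. ~30%, and 1.72 / 1.14 = ~1.51, i.e.
> ~50%.)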
> 
> The main change compared to the previous patch is that instead of
> having a subtable vector per in_port in a single dpcls instance, we
> now have one dpcls instance, with its own subtable vector, per ingress
> port. This is better aligned with the design of the base code and also
> improves the number of subtable lookups in a miss case.
> 
> The PMD tests have been adjusted to the additional line in
> pmd-stats-show.
> 
> Signed-off-by: Jan Scheurich <jan.scheurich at ericsson.com>
> 
> 
> Changes in v2:
> - Rebased to master (commit 3041e1fc9638)
> - Take the pmd->flow_mutex during optimization to block out revalidators.
>   Use trylock in order to not block the PMD thread.
> - Made in_port an explicit input parameter to fast_path_processing()
> - Fixed coding style issues
> 
> 
> ---
> 
>  lib/dpif-netdev.c | 195 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----------------
>  tests/pmd.at      |   6 +++--
>  2 files changed, 173 insertions(+), 28 deletions(-)
> 
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index e0107b7..e7eaab0 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -149,7 +149,11 @@ struct emc_cache {
>  ^L
>  /* Simple non-wildcarding single-priority classifier. */
> 
> +#define DPCLS_OPTIMIZATION_INTERVAL 1000

[Antonio F] I'd prefer to add a comment specifying whether this interval is in sec, msec, ...


> +
>  struct dpcls {
> +    struct cmap_node node;
> +    odp_port_t in_port;
>      struct cmap subtables_map;
>      struct pvector subtables;
>  };
> @@ -164,12 +168,14 @@ struct dpcls_rule {
> 
>  static void dpcls_init(struct dpcls *);
>  static void dpcls_destroy(struct dpcls *);
> +static void dpcls_optimize(struct dpcls *);
>  static void dpcls_insert(struct dpcls *, struct dpcls_rule *,
>                           const struct netdev_flow_key *mask);
>  static void dpcls_remove(struct dpcls *, struct dpcls_rule *);
> -static bool dpcls_lookup(const struct dpcls *cls,
> +static bool dpcls_lookup(struct dpcls *cls,
>                           const struct netdev_flow_key keys[],
> -                         struct dpcls_rule **rules, size_t cnt);
> +                         struct dpcls_rule **rules, size_t cnt,
> +                         int *num_lookups_p);
>  ^L
>  /* Datapath based on the network device interface from netdev.h.
>   *
> @@ -239,6 +245,7 @@ enum dp_stat_type {
>      DP_STAT_MASKED_HIT,         /* Packets that matched in the flow table. */
>      DP_STAT_MISS,               /* Packets that did not match. */
>      DP_STAT_LOST,               /* Packets not passed up to the client. */
> +    DP_STAT_LOOKUP_HIT,        /* Number of subtable lookups for flow table hits */

[Antonio F] Line longer than 80 chars.


>      DP_N_STATS
>  };
> 
> @@ -419,15 +426,19 @@ struct dp_netdev_pmd_thread {
>       * will only be accessed by its own pmd thread. */
>      struct emc_cache flow_cache;
> 
> -    /* Classifier and Flow-Table.
> +    /* Flow-Table and classifiers
>       *
>       * Writers of 'flow_table' must take the 'flow_mutex'. Corresponding
> -     * changes to 'cls' must be made while still holding the 'flow_mutex'.
> +     * changes to 'classifiesr' must be made while still holding the 'flow_mutex'.

[Antonio F] Line longer than 80 chars. 

[Antonio F] Trivial typo: 'classifiers'


>       */
>      struct ovs_mutex flow_mutex;
> -    struct dpcls cls;
>      struct cmap flow_table OVS_GUARDED; /* Flow table. */
> 
> +    /* One classifier per in_port polled by the pmd */
> +    struct cmap classifiers;
> +    /* Periodically sort subtable vectors according to hit frequencies */
> +    long long int next_optimization;
> +
>      /* Statistics. */
>      struct dp_netdev_pmd_stats stats;
> 
> @@ -540,6 +551,8 @@ static void dp_netdev_pmd_unref(struct dp_netdev_pmd_thread *pmd);
>  static void dp_netdev_pmd_flow_flush(struct dp_netdev_pmd_thread *pmd);
>  static void pmd_load_cached_ports(struct dp_netdev_pmd_thread *pmd)
>      OVS_REQUIRES(pmd->port_mutex);
> +static inline void
> +dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd);
> 
>  static inline bool emc_entry_alive(struct emc_entry *ce);
>  static void emc_clear_entry(struct emc_entry *ce);
> @@ -659,8 +672,10 @@ pmd_info_show_stats(struct ds *reply,
> 
>      ds_put_format(reply,
>                    "\temc hits:%llu\n\tmegaflow hits:%llu\n"
> +                  "\tavg. subtable lookups per hit:%.2f\n"
>                    "\tmiss:%llu\n\tlost:%llu\n",
>                    stats[DP_STAT_EXACT_HIT], stats[DP_STAT_MASKED_HIT],
> +                  stats[DP_STAT_MASKED_HIT] > 0 ? (1.0*stats[DP_STAT_LOOKUP_HIT])/stats[DP_STAT_MASKED_HIT] : 0,

[Antonio F] Line longer than 80 chars.


>                    stats[DP_STAT_MISS], stats[DP_STAT_LOST]);
> 
>      if (total_cycles == 0) {
> @@ -1503,14 +1518,51 @@ dp_netdev_flow_hash(const ovs_u128 *ufid)
>      return ufid->u32[0];
>  }
> 
> +static inline struct dpcls *
> +dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
> +                           odp_port_t in_port)
> +{
> +    struct dpcls *cls;
> +    uint32_t hash = hash_port_no(in_port);
> +    CMAP_FOR_EACH_WITH_HASH (cls, node, hash, &pmd->classifiers) {
> +        if (cls->in_port == in_port) {
> +            /* Port classifier exists already */
> +            return cls;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +static inline struct dpcls *
> +dp_netdev_pmd_find_dpcls(struct dp_netdev_pmd_thread *pmd,
> +                         odp_port_t in_port)
> +{
> +    struct dpcls *cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
> +    uint32_t hash = hash_port_no(in_port);
> +
> +    if (!cls) {
> +        /* Create new classifier for in_port */
> +        cls = xmalloc(sizeof(*cls));
> +        dpcls_init(cls);
> +        cls->in_port = in_port;
> +        cmap_insert(&pmd->classifiers, &cls->node, hash);
> +        VLOG_DBG("Creating dpcls %p for in_port %d", cls, in_port);
> +    }
> +    return cls;
> +}
> +
>  static void
>  dp_netdev_pmd_remove_flow(struct dp_netdev_pmd_thread *pmd,
>                            struct dp_netdev_flow *flow)
>      OVS_REQUIRES(pmd->flow_mutex)
>  {
>      struct cmap_node *node = CONST_CAST(struct cmap_node *, &flow->node);
> +    struct dpcls *cls;
> +    odp_port_t in_port = flow->flow.in_port.odp_port;
> 
> -    dpcls_remove(&pmd->cls, &flow->cr);
> +    cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
> +    ovs_assert(cls != NULL);
> +    dpcls_remove(cls, &flow->cr);
>      cmap_remove(&pmd->flow_table, node, dp_netdev_flow_hash(&flow->ufid));
>      flow->dead = true;
> 
> @@ -1857,15 +1909,20 @@ emc_lookup(struct emc_cache *cache, const struct netdev_flow_key *key)
>  }
> 
>  static struct dp_netdev_flow *
> -dp_netdev_pmd_lookup_flow(const struct dp_netdev_pmd_thread *pmd,
> -                          const struct netdev_flow_key *key)
> +dp_netdev_pmd_lookup_flow(struct dp_netdev_pmd_thread *pmd,
> +                          const struct netdev_flow_key *key,
> +                          int *lookup_num_p)
>  {
> -    struct dp_netdev_flow *netdev_flow;
> +    struct dpcls *cls;
>      struct dpcls_rule *rule;
> +    odp_port_t in_port = MINIFLOW_GET_U32(&key->mf, in_port);
> +    struct dp_netdev_flow *netdev_flow = NULL;
> 
> -    dpcls_lookup(&pmd->cls, key, &rule, 1);
> -    netdev_flow = dp_netdev_flow_cast(rule);
> -
> +    cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
> +    if (OVS_LIKELY(cls)) {
> +        dpcls_lookup(cls, key, &rule, 1, lookup_num_p);
> +        netdev_flow = dp_netdev_flow_cast(rule);
> +    }
>      return netdev_flow;
>  }
> 
> @@ -2098,6 +2155,8 @@ dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd,
>  {
>      struct dp_netdev_flow *flow;
>      struct netdev_flow_key mask;
> +    struct dpcls *cls;
> +    odp_port_t in_port = match->flow.in_port.odp_port;
> 
>      netdev_flow_mask_init(&mask, match);
>      /* Make sure wc does not have metadata. */
> @@ -2116,7 +2175,11 @@ dp_netdev_flow_add(struct dp_netdev_pmd_thread *pmd,
>      ovsrcu_set(&flow->actions, dp_netdev_actions_create(actions, actions_len));
> 
>      netdev_flow_key_init_masked(&flow->cr.flow, &match->flow, &mask);
> -    dpcls_insert(&pmd->cls, &flow->cr, &mask);
> +
> +    /* Select dpcls for in_port. Relies on in_port to be exact match */
> +    ovs_assert(match->wc.masks.in_port.odp_port == UINT32_MAX);
> +    cls = dp_netdev_pmd_find_dpcls(pmd, in_port);
> +    dpcls_insert(cls, &flow->cr, &mask);
> 
>      cmap_insert(&pmd->flow_table, CONST_CAST(struct cmap_node *, &flow->node),
>                  dp_netdev_flow_hash(&flow->ufid));
> @@ -2197,7 +2260,7 @@ dpif_netdev_flow_put(struct dpif *dpif, const struct dpif_flow_put *put)
>      }
> 
>      ovs_mutex_lock(&pmd->flow_mutex);
> -    netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &key);
> +    netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &key, NULL);
>      if (!netdev_flow) {
>          if (put->flags & DPIF_FP_CREATE) {
>              if (cmap_count(&pmd->flow_table) < MAX_FLOWS) {
> @@ -2872,6 +2935,7 @@ reload:
>              lc = 0;
> 
>              coverage_try_clear();
> +            dp_netdev_pmd_try_optimize(pmd);
>              if (!ovsrcu_try_quiesce()) {
>                  emc_cache_slow_sweep(&pmd->flow_cache);
>              }
> @@ -3033,8 +3097,9 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>      ovs_mutex_init(&pmd->cond_mutex);
>      ovs_mutex_init(&pmd->flow_mutex);
>      ovs_mutex_init(&pmd->port_mutex);
> -    dpcls_init(&pmd->cls);
>      cmap_init(&pmd->flow_table);
> +    cmap_init(&pmd->classifiers);
> +    pmd->next_optimization = time_msec() + DPCLS_OPTIMIZATION_INTERVAL;
>      ovs_list_init(&pmd->poll_list);
>      hmap_init(&pmd->tx_ports);
>      hmap_init(&pmd->port_cache);
> @@ -3050,10 +3115,16 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
>  static void
>  dp_netdev_destroy_pmd(struct dp_netdev_pmd_thread *pmd)
>  {
> +    struct dpcls *cls;
> +
>      dp_netdev_pmd_flow_flush(pmd);
> -    dpcls_destroy(&pmd->cls);
>      hmap_destroy(&pmd->port_cache);
>      hmap_destroy(&pmd->tx_ports);
> +    /* All flows (including their dpcls_rules) have been deleted already */
> +    CMAP_FOR_EACH(cls, node, &pmd->classifiers) {
> +        dpcls_destroy(cls);
> +    }
> +    cmap_destroy(&pmd->classifiers);
>      cmap_destroy(&pmd->flow_table);
>      ovs_mutex_destroy(&pmd->flow_mutex);
>      latch_destroy(&pmd->exit_latch);
> @@ -3798,7 +3869,7 @@ handle_packet_upcall(struct dp_netdev_pmd_thread *pmd, struct dp_packet *packet,
>           * to be locking everyone out of making flow installs.  If we
>           * move to a per-core classifier, it would be reasonable. */
>          ovs_mutex_lock(&pmd->flow_mutex);
> -        netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key);
> +        netdev_flow = dp_netdev_pmd_lookup_flow(pmd, key, NULL);
>          if (OVS_LIKELY(!netdev_flow)) {
>              netdev_flow = dp_netdev_flow_add(pmd, &match, &ufid,
>                                               add_actions->data,
> @@ -3814,7 +3885,8 @@ static inline void
>  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                       struct dp_packet_batch *packets_,
>                       struct netdev_flow_key *keys,
> -                     struct packet_batch_per_flow batches[], size_t *n_batches)
> +                     struct packet_batch_per_flow batches[], size_t *n_batches,
> +                     odp_port_t in_port)
>  {
>      int cnt = packets_->count;
>  #if !defined(__CHECKER__) && !defined(_WIN32)
> @@ -3824,10 +3896,12 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>      enum { PKT_ARRAY_SIZE = NETDEV_MAX_BURST };
>  #endif
>      struct dp_packet **packets = packets_->packets;
> +    struct dpcls *cls;
>      struct dpcls_rule *rules[PKT_ARRAY_SIZE];
>      struct dp_netdev *dp = pmd->dp;
>      struct emc_cache *flow_cache = &pmd->flow_cache;
>      int miss_cnt = 0, lost_cnt = 0;
> +    int lookup_cnt = 0, add_lookup_cnt;
>      bool any_miss;
>      size_t i;
> 
> @@ -3835,7 +3909,14 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>          /* Key length is needed in all the cases, hash computed on demand. */
>          keys[i].len = netdev_flow_key_size(miniflow_n_values(&keys[i].mf));
>      }
> -    any_miss = !dpcls_lookup(&pmd->cls, keys, rules, cnt);
> +    /* Get the classifier for the in_port */
> +    cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
> +    if (OVS_LIKELY(cls)) {
> +        any_miss = !dpcls_lookup(cls, keys, rules, cnt, &lookup_cnt);
> +    } else {
> +        any_miss = true;
> +        memset(rules, 0, sizeof(rules));
> +    }
>      if (OVS_UNLIKELY(any_miss) && !fat_rwlock_tryrdlock(&dp->upcall_rwlock)) {
>          uint64_t actions_stub[512 / 8], slow_stub[512 / 8];
>          struct ofpbuf actions, put_actions;
> @@ -3853,8 +3934,9 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>              /* It's possible that an earlier slow path execution installed
>               * a rule covering this flow.  In this case, it's a lot cheaper
>               * to catch it here than execute a miss. */
> -            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &keys[i]);
> +            netdev_flow = dp_netdev_pmd_lookup_flow(pmd, &keys[i], &add_lookup_cnt);

[Antonio F] Line longer than 80 chars.


>              if (netdev_flow) {
> +                lookup_cnt += add_lookup_cnt;
>                  rules[i] = &netdev_flow->cr;
>                  continue;
>              }
> @@ -3893,6 +3975,7 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>      }
> 
>      dp_netdev_count_packet(pmd, DP_STAT_MASKED_HIT, cnt - miss_cnt);
> +    dp_netdev_count_packet(pmd, DP_STAT_LOOKUP_HIT, lookup_cnt);

[Antonio F] Should we divide lookup_cnt by the number of processed
packets, i.e.
       dp_netdev_count_packet(pmd, DP_STAT_LOOKUP_HIT, lookup_cnt/cnt);
?


>      dp_netdev_count_packet(pmd, DP_STAT_MISS, miss_cnt);
>      dp_netdev_count_packet(pmd, DP_STAT_LOST, lost_cnt);
>  }
> @@ -3919,13 +4002,16 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>      struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
>      long long now = time_msec();
>      size_t newcnt, n_batches, i;
> +    odp_port_t in_port;
> 
>      n_batches = 0;
>      newcnt = emc_processing(pmd, packets, keys, batches, &n_batches,
>                              md_is_valid, port_no);
>      if (OVS_UNLIKELY(newcnt)) {
>          packets->count = newcnt;
> -        fast_path_processing(pmd, packets, keys, batches, &n_batches);
> +        /* Get ingress port from first packet's metadata. */
> +        in_port = packets->packets[0]->md.in_port.odp_port;
> +        fast_path_processing(pmd, packets, keys, batches, &n_batches, in_port);
>      }
> 
>      for (i = 0; i < n_batches; i++) {
> @@ -3942,14 +4028,14 @@ dp_netdev_input(struct dp_netdev_pmd_thread *pmd,
>                  struct dp_packet_batch *packets,
>                  odp_port_t port_no)
>  {
> -     dp_netdev_input__(pmd, packets, false, port_no);
> +    dp_netdev_input__(pmd, packets, false, port_no);
>  }
> 
>  static void
>  dp_netdev_recirculate(struct dp_netdev_pmd_thread *pmd,
>                        struct dp_packet_batch *packets)
>  {
> -     dp_netdev_input__(pmd, packets, true, 0);
> +    dp_netdev_input__(pmd, packets, true, 0);
>  }
> 
>  struct dp_netdev_execute_aux {
> @@ -4374,6 +4460,8 @@ struct dpcls_subtable {
> 
>      /* These fields are accessed by readers. */
>      struct cmap rules;           /* Contains "struct dpcls_rule"s. */
> +    uint32_t hit_cnt;            /* Number of match hits in subtable in current
> +                                    optimization interval. */
>      struct netdev_flow_key mask; /* Wildcards for fields (const). */
>      /* 'mask' must be the last field, additional space is allocated here. */
>  };
> @@ -4390,6 +4478,7 @@ dpcls_init(struct dpcls *cls)
>  static void
>  dpcls_destroy_subtable(struct dpcls *cls, struct dpcls_subtable *subtable)
>  {
> +    VLOG_DBG("Destroying subtable %p for in_port %d", subtable, cls->in_port);
>      pvector_remove(&cls->subtables, subtable);
>      cmap_remove(&cls->subtables_map, &subtable->cmap_node,
>                  subtable->mask.hash);
> @@ -4424,10 +4513,13 @@ dpcls_create_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
>      subtable = xmalloc(sizeof *subtable
>                         - sizeof subtable->mask.mf + mask->len);
>      cmap_init(&subtable->rules);
> +    subtable->hit_cnt = 0;
>      netdev_flow_key_clone(&subtable->mask, mask);
>      cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash);
> +    /* Insert new subtable with priority zero (no hits yet) */
>      pvector_insert(&cls->subtables, subtable, 0);
>      pvector_publish(&cls->subtables);
> +    VLOG_DBG("Creating %lu. subtable %p for in_port %d", cmap_count(&cls->subtables_map), subtable, cls->in_port);

[Antonio F] Line too long.


> 
>      return subtable;
>  }
> @@ -4446,6 +4538,43 @@ dpcls_find_subtable(struct dpcls *cls, const struct netdev_flow_key *mask)
>      return dpcls_create_subtable(cls, mask);
>  }
> 
> +
> +/* Periodically sort the dpcls subtable vectors according to hit counts */
> +static inline void
> +dpcls_optimize(struct dpcls *cls)

[Antonio F] Why not call this function 'dpcls_sort_subtables'
or 'dpcls_upd_subtbl_priority'?


> +    OVS_REQUIRES(pmd->flow_mutex)
> +{
> +    struct dpcls_subtable *subtable;
> +
> +    struct pvector *pvec = &cls->subtables;
> +    PVECTOR_FOR_EACH (subtable, pvec) {
> +        pvector_change_priority(pvec, subtable, subtable->hit_cnt);
> +        subtable->hit_cnt = 0;
> +    }
> +    pvector_publish(pvec);
> +}
> +
> +static inline void
> +dp_netdev_pmd_try_optimize(struct dp_netdev_pmd_thread *pmd)
> +{
> +    struct dpcls *cls;
> +    long long int now = time_msec();
> +
> +    if (now > pmd->next_optimization) {
> +        /* Try to obtain the flow lock to block out revalidator threads.
> +         * If not possible, just try next time. */
> +        if (!ovs_mutex_trylock(&pmd->flow_mutex)) {
> +            /* Optimize each classifier */
> +            CMAP_FOR_EACH(cls, node, &pmd->classifiers) {
> +                dpcls_optimize(cls);
> +            }
> +            ovs_mutex_unlock(&pmd->flow_mutex);
> +            /* Start new measuring interval */
> +            pmd->next_optimization = now + DPCLS_OPTIMIZATION_INTERVAL;
> +        }
> +    }
> +}
> +
>  /* Insert 'rule' into 'cls'. */
>  static void
>  dpcls_insert(struct dpcls *cls, struct dpcls_rule *rule,
> @@ -4453,6 +4582,7 @@ dpcls_insert(struct dpcls *cls, struct dpcls_rule *rule,
>  {
>      struct dpcls_subtable *subtable = dpcls_find_subtable(cls, mask);
> 
> +    /* Refer to subtable's mask, also for later removal. */
>      rule->mask = &subtable->mask;
>      cmap_insert(&subtable->rules, &rule->cmap_node, rule->flow.hash);
>  }
> @@ -4465,10 +4595,11 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule *rule)
> 
>      ovs_assert(rule->mask);
> 
> +    /* Get subtable from reference in rule->mask. */
>      INIT_CONTAINER(subtable, rule->mask, mask);
> -
>      if (cmap_remove(&subtable->rules, &rule->cmap_node, rule->flow.hash)
>          == 0) {
> +        /* Delete empty subtable. */
>          dpcls_destroy_subtable(cls, subtable);
>          pvector_publish(&cls->subtables);
>      }
> @@ -4503,8 +4634,9 @@ dpcls_rule_matches_key(const struct dpcls_rule *rule,
>   *
>   * Returns true if all flows found a corresponding rule. */
>  static bool
> -dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
> -             struct dpcls_rule **rules, const size_t cnt)
> +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key keys[],
> +             struct dpcls_rule **rules, const size_t cnt,
> +             int *num_lookups_p)
>  {
>      /* The batch size 16 was experimentally found faster than 8 or 32. */
>      typedef uint16_t map_type;
> @@ -4524,6 +4656,8 @@ dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
>      }
>      memset(rules, 0, cnt * sizeof *rules);
> 
> +    int lookups_match = 0, subtable_pos = 1;
> +
>      PVECTOR_FOR_EACH (subtable, &cls->subtables) {
>          const struct netdev_flow_key *mkeys = keys;
>          struct dpcls_rule **mrules = rules;
> @@ -4556,6 +4690,8 @@ dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
>                  CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
>                      if (OVS_LIKELY(dpcls_rule_matches_key(rule, &mkeys[i]))) {
>                          mrules[i] = rule;
> +                        subtable->hit_cnt++;

[Antonio F] I think we can be confident enough that hit_cnt does not overflow.
It's a uint32_t var and the ranking is refreshed every 1000 msec.
Even if a port were hitting the same subtable at line rate - say 14 Mpps -
that would be only ~1.4*10^7 hits per interval, well below 2^32-1 (~4.3*10^9).
So we don't need to check whether hit_cnt overflows, or am I missing something?
If that's the case, maybe a comment here would help?


> +                        lookups_match += subtable_pos;
>                          goto next;
>                      }
>                  }
> @@ -4567,8 +4703,15 @@ dpcls_lookup(const struct dpcls *cls, const struct netdev_flow_key keys[],
>              remains |= maps[m];
>          }
>          if (!remains) {
> +            if (num_lookups_p) {
> +                *num_lookups_p = lookups_match;
> +            }
>              return true;              /* All found. */
>          }
> +        subtable_pos++;
> +    }
> +    if (num_lookups_p) {
> +        *num_lookups_p = lookups_match;
>      }
>      return false;                     /* Some misses. */
>  }
> diff --git a/tests/pmd.at b/tests/pmd.at
> index 3216762..c033047 100644
> --- a/tests/pmd.at
> +++ b/tests/pmd.at
> @@ -162,10 +162,11 @@ dummy@ovs-dummy: hit:0 missed:0
>                 p0 7/1: (dummy-pmd: configured_rx_queues=4, configured_tx_queues=<cleared>, requested_rx_queues=4, requested_tx_queues=<cleared>)
>  ])
> 
> -AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 4], [0], [dnl
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 5], [0], [dnl
>  pmd thread numa_id <cleared> core_id <cleared>:
>         emc hits:0
>         megaflow hits:0
> +       avg. subtable lookups per hit:0.00
>         miss:0
>         lost:0
>  ])
> @@ -188,10 +189,11 @@ AT_CHECK([cat ovs-vswitchd.log | filter_flow_install | strip_xout], [0], [dnl
> 
>  recirc_id(0),in_port(1),eth(src=50:54:00:00:00:77,dst=50:54:00:00:01:78),eth_type(0x0800),ipv4(frag=no), actions: <del>
>  ])
> 
> -AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 4], [0], [dnl
> +AT_CHECK([ovs-appctl dpif-netdev/pmd-stats-show | sed SED_NUMA_CORE_PATTERN | sed '/cycles/d' | grep pmd -A 5], [0], [dnl
>  pmd thread numa_id <cleared> core_id <cleared>:
>         emc hits:19
>         megaflow hits:0
> +       avg. subtable lookups per hit:0.00
>         miss:1
>         lost:0
>  ])
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev

