[ovs-dev] [PATCH] Improve kernel hash table

Jesse Gross jesse at nicira.com
Sat Aug 20 11:31:47 UTC 2011


On Sat, Aug 20, 2011 at 4:49 AM, Pravin Shelar <pshelar at nicira.com> wrote:
> Currently OVS uses its own hashing implementation for hash tables, which
> has some problems, e.g. an error case in the deletion code.  The following
> patch replaces it with an hlist-based hash table that is consistent with
> other kernel hash tables.
>
> Signed-off-by: Pravin Shelar <pshelar at nicira.com>

I get a whole pile of compiler errors with this on 2.6.38:

/home/jesse/openvswitch/datapath/linux/datapath.c: In function
‘get_table_protected’:
/home/jesse/openvswitch/datapath/linux/datapath.c:112:9: error:
dereferencing pointer to incomplete type
/home/jesse/openvswitch/datapath/linux/datapath.c:112:2: warning: type
defaults to ‘int’ in type name
/home/jesse/openvswitch/datapath/linux/datapath.c:112:2: warning:
return from incompatible pointer type
/home/jesse/openvswitch/datapath/linux/datapath.c: In function
‘dp_process_received_packet’:
/home/jesse/openvswitch/datapath/linux/datapath.c:298:22: error:
dereferencing pointer to incomplete type
/home/jesse/openvswitch/datapath/linux/datapath.c:298:22: warning:
type defaults to ‘int’ in declaration of ‘_________p1’
/home/jesse/openvswitch/datapath/linux/datapath.c:298:22: error:
dereferencing pointer to incomplete type
/home/jesse/openvswitch/datapath/linux/datapath.c:298:3: warning: type
defaults to ‘int’ in type name
/home/jesse/openvswitch/datapath/linux/datapath.c:298:22: error:
dereferencing pointer to incomplete type
/home/jesse/openvswitch/datapath/linux/datapath.c:298:3: warning: type
defaults to ‘int’ in type name
/home/jesse/openvswitch/datapath/linux/datapath.c:298:3: warning:
passing argument 1 of ‘flow_lookup’ from incompatible pointer type
/home/jesse/openvswitch/datapath/linux/../flow.h:165:17: note:
expected ‘struct flow_table *’ but argument is of type ‘int *’
/home/jesse/openvswitch/datapath/linux/datapath.c: In function ‘flush_flows’:
/home/jesse/openvswitch/datapath/linux/datapath.c:539:2: error:
dereferencing pointer to incomplete type
/home/jesse/openvswitch/datapath/linux/datapath.c:539:2: warning: type
defaults to ‘int’ in type name
/home/jesse/openvswitch/datapath/linux/datapath.c:539:2: warning:
assignment from incompatible pointer type
/home/jesse/openvswitch/datapath/linux/datapath.c: In function ‘expand_table’:
/home/jesse/openvswitch/datapath/linux/datapath.c:633:3: error:
dereferencing pointer to incomplete type
/home/jesse/openvswitch/datapath/linux/datapath.c:633:3: warning: type
defaults to ‘int’ in type name
/home/jesse/openvswitch/datapath/linux/datapath.c:633:3: warning:
assignment from incompatible pointer type
/home/jesse/openvswitch/datapath/linux/datapath.c: In function ‘odp_dp_cmd_new’:
/home/jesse/openvswitch/datapath/linux/datapath.c:1357:2: error:
dereferencing pointer to incomplete type
/home/jesse/openvswitch/datapath/linux/datapath.c:1357:2: warning:
type defaults to ‘int’ in type name
/home/jesse/openvswitch/datapath/linux/datapath.c:1357:2: warning:
assignment from incompatible pointer type

I'm guessing that you didn't try this on a kernel that has real
definitions for the RCU sparse/lockdep checks.

Also, it doesn't look like you actually deleted table.[ch].

> diff --git a/datapath/datapath.c b/datapath/datapath.c
> index b191239..e198357 100644
> --- a/datapath/datapath.c
> +++ b/datapath/datapath.c
> @@ -225,7 +225,7 @@ static void destroy_dp_rcu(struct rcu_head *rcu)
>  {
>        struct datapath *dp = container_of(rcu, struct datapath, rcu);
>
> -       tbl_destroy((struct tbl __force *)dp->table, flow_free_tbl);
> +       flow_deferred_destroy(dp->table);

It's not necessary to wait for an extra RCU grace period before
deleting the table.  Previously the table was freed immediately here;
now it's a deferred free, even though destroy_dp_rcu() is itself an
RCU callback, so a grace period has already elapsed by the time it
runs.
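
Something along these lines would avoid the extra deferral (just a
sketch; flow_tbl_destroy() is a hypothetical non-deferred variant of
the destroy function):

static void destroy_dp_rcu(struct rcu_head *rcu)
{
	struct datapath *dp = container_of(rcu, struct datapath, rcu);

	/* This callback already runs after a grace period, so the table
	 * can be torn down immediately; no second call_rcu() is needed. */
	flow_tbl_destroy(dp->table);
}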

> @@ -2070,6 +2050,7 @@ static int __init dp_init(void)
>        if (err < 0)
>                goto error_unreg_notifier;
>
> +

Extra blank line here.

> diff --git a/datapath/flow.c b/datapath/flow.c
> index 27038ec..24db221 100644
> --- a/datapath/flow.c
> +++ b/datapath/flow.c
> +struct hlist_head  __rcu * alloc_buckets(unsigned int n_buckets)
>  {
> -       struct sw_flow *flow = flow_cast(node);
> +       struct hlist_head  __rcu * l1;

I think we could use a more descriptive name than l1 because there is
now only one level in the table.

> +       unsigned int i;
> +
> +       l1 = kmalloc((n_buckets) * sizeof(struct hlist_head *), GFP_KERNEL);

You can't just directly allocate the array of buckets because it will
quickly become too large to allocate reliably (on 64-bit machines even
the initial 1024 bucket allocation is already 2 pages).  The previous
implementation went to a lot of trouble to avoid this by using a
hierarchy of single-page allocations.  The much easier solution that
is now available, which we talked about in the past, is to use
flex_arrays to handle the hierarchy automatically.  I already
backported the flex_array code from the latest kernel (and improved
it) so it should be easy to use.
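
For reference, here's roughly what bucket allocation could look like
on top of flex_array (a sketch, not tested; note that the third
argument of flex_array_prealloc() changed from an end index to an
element count in recent kernels, so check which version the backport
matches):

#include <linux/flex_array.h>

static struct flex_array *alloc_buckets(unsigned int n_buckets)
{
	struct flex_array *buckets;
	unsigned int i;
	int err;

	/* flex_array stores the buckets in a tree of single-page
	 * allocations, so large tables never need high-order pages. */
	buckets = flex_array_alloc(sizeof(struct hlist_head),
				   n_buckets, GFP_KERNEL);
	if (!buckets)
		return NULL;

	err = flex_array_prealloc(buckets, 0, n_buckets, GFP_KERNEL);
	if (err) {
		flex_array_free(buckets);
		return NULL;
	}

	for (i = 0; i < n_buckets; i++)
		INIT_HLIST_HEAD((struct hlist_head *)
				flex_array_get(buckets, i));

	return buckets;
}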

> +void flow_free_tbl(struct sw_flow *flow)
> +{
>        flow->dead = true;
>        flow_put(flow);

This seems like a strange name for this function now.

> +static void flow_destroy_table_rcu_cb(struct rcu_head *rcu)
> +{
> +       struct flow_table *table = container_of(rcu, struct flow_table, rcu);
> +       int i;
> +
> +       for (i = 0; i < table->buckets; i++) {
> +               struct sw_flow *flow;
> +               struct hlist_head *head = &table->hash_buckets[i];
> +               struct hlist_node *node, *n;
> +
> +               hlist_for_each_entry_safe(flow, node, n, head, hash_node) {
> +                       hlist_del_init_rcu(&flow->hash_node);
> +                       flow_free_tbl(flow);
> +               }
> +       }

Do we actually want to unconditionally free the flows?  What about
when we expand the table?
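
One way to handle that, assuming a hypothetical keep_flows flag that
the expansion path sets after moving the flows into the new table:

static void flow_destroy_table_rcu_cb(struct rcu_head *rcu)
{
	struct flow_table *table = container_of(rcu, struct flow_table, rcu);
	int i;

	/* keep_flows is set by expansion: the flows now live in the
	 * new table, so only the bucket array itself should be freed. */
	if (table->keep_flows)
		goto skip_flows;

	for (i = 0; i < table->buckets; i++) {
		struct sw_flow *flow;
		struct hlist_head *head = &table->hash_buckets[i];
		struct hlist_node *node, *n;

		hlist_for_each_entry_safe(flow, node, n, head, hash_node) {
			hlist_del_init_rcu(&flow->hash_node);
			flow_free_tbl(flow);
		}
	}

skip_flows:
	free_buckets(table->hash_buckets);
	kfree(table);
}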

> +struct sw_flow *flow_tbl_next(struct flow_table *table, u32 *bucket, u32 *last)
> +{
> +       struct sw_flow *flow;
> +       struct hlist_head *head;
> +       struct hlist_node *n;
> +       int i;
> +start:
> +       if (*bucket >= table->buckets)
> +               return NULL;

Can't this be written as a simple loop?
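
For example, using the field names from the patch:

struct sw_flow *flow_tbl_next(struct flow_table *table, u32 *bucket, u32 *last)
{
	struct sw_flow *flow;
	struct hlist_head *head;
	struct hlist_node *n;
	int i;

	while (*bucket < table->buckets) {
		i = 0;
		head = &table->hash_buckets[*bucket];
		hlist_for_each_entry_rcu(flow, n, head, hash_node) {
			/* Skip the entries already returned from this
			 * bucket on previous calls. */
			if (i < *last) {
				i++;
				continue;
			}
			*last = i + 1;
			return flow;
		}
		(*bucket)++;
		*last = 0;
	}

	return NULL;
}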

> +void flow_insert(struct flow_table *table, struct sw_flow *flow)
> +{
> +       struct hlist_head *head;
> +       struct hlist_node *n;
> +       struct sw_flow *cflow;
> +       u32 hash;
> +
> +       hash = flow_hash(&flow->key, flow->key_len);
> +       head = find_bucket(table, hash);
> +
> +       hlist_for_each_entry_rcu(cflow, n, head, hash_node) {
> +               if (!memcmp(&cflow->key, &flow->key, flow->key_len))
> +                       return;
> +       }

We should never add duplicate flow entries, so the above code isn't
necessary.  If it does happen, then it will cause a memory leak
because the new flow will silently disappear.
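
With the duplicate scan dropped, the insert reduces to something like:

void flow_insert(struct flow_table *table, struct sw_flow *flow)
{
	struct hlist_head *head;

	/* Callers guarantee the flow is not already in the table, so a
	 * plain RCU list insertion is all that's needed. */
	head = find_bucket(table, flow_hash(&flow->key, flow->key_len));
	hlist_add_head_rcu(&flow->hash_node, head);
}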

> diff --git a/datapath/flow.h b/datapath/flow.h
> index 6a3c539..6bfe893 100644
> --- a/datapath/flow.h
> +++ b/datapath/flow.h
>  struct sw_flow {
>        struct rcu_head rcu;
> -       struct tbl_node tbl_node;
> +       struct hlist_node  hash_node;
>
>        struct sw_flow_key key;
>        struct sw_flow_actions __rcu *sf_acts;
> +       int key_len;

It looks like key_len was only added so that we can rehash the flows
when the table is resized.  I think perhaps a better use of memory
would be to store the hash value itself.  This way we wouldn't have to
recompute the hashes on resize and we can also use it to speed up
lookup by skipping comparisons of flows that may land in the same
bucket but actually have different hash values.
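
Sketching the idea (the hash field is the proposed addition):

static struct sw_flow *flow_tbl_lookup(struct flow_table *table,
				       const struct sw_flow_key *key,
				       int key_len)
{
	struct sw_flow *flow;
	struct hlist_node *n;
	struct hlist_head *head;
	u32 hash;

	hash = flow_hash(key, key_len);
	head = find_bucket(table, hash);

	hlist_for_each_entry_rcu(flow, n, head, hash_node) {
		/* Compare the cached hash first to skip memcmp() for
		 * flows that merely collided into the same bucket. */
		if (flow->hash == hash &&
		    !memcmp(&flow->key, key, key_len))
			return flow;
	}

	return NULL;
}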

> +#define TBL_L1_BITS    (PAGE_SHIFT - ilog2(sizeof(struct hlist_head *)))
> +#define TBL_L1_SIZE    (1 << TBL_L1_BITS)
> +#define TBL_MIN_BUCKETS (TBL_L1_SIZE << 2)
> +#define TBL_MAX_BUCKETS (TBL_L1_SIZE * TBL_L1_SIZE)

These constants only made sense in the context of the old
implementation due to its use of hierarchical pages of memory.  I
think the only one that still makes sense is the default size of the
table.  However, you can just directly pick some value that is used
for initial creation, no need to compute it.  1024 is probably still a
fine starting point.
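
i.e. just something like:

/* Initial number of buckets; no page-size arithmetic needed. */
#define TBL_MIN_BUCKETS	1024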

> +struct hlist_head  __rcu * alloc_buckets(unsigned int n_buckets);
> +void free_buckets(struct hlist_head * head);

It seems a little strange for these internal functions to be exposed,
especially since they are so simple.

> +void flow_deferred_destroy(struct flow_table *table);

I think the name of this function should reflect that it is the flow
table, not the flow, that is being destroyed, especially since there
is another function with a very similar name.

> diff --git a/datapath/tunnel.c b/datapath/tunnel.c
> index 1ef81ab..1df4813 100644
> --- a/datapath/tunnel.c
> +++ b/datapath/tunnel.c
> -static struct tbl __rcu *port_table __read_mostly;
> +static unsigned int port_table_size __read_mostly = 1024;
> +
> +module_param(port_table_size, uint, 0644);

Please don't add module parameters like this.  We really don't need
endless knobs that people will never know how to tweak.  Currently the
maximum number of ports is 1024, so I think we can reasonably pick
some value for hash table size and stick with it (I think 1024 is
fine).  It also needs to be a power of two, but that's not enforced
anywhere.  In addition, the 0644 permissions make it writable at
runtime, which would cause problems here: changing the parameter
won't resize the table, yet the new value would be used to traverse
the data structure.
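
A fixed, power-of-two constant would look something like this
(sketch; the names are illustrative):

#define PORT_TABLE_SIZE	1024	/* must be a power of two */

static struct hlist_head *port_table __read_mostly;

static struct hlist_head *find_port_bucket(u32 hash)
{
	/* The mask only works because PORT_TABLE_SIZE is a power of
	 * two; a BUILD_BUG_ON() in the init path could enforce that. */
	return &port_table[hash & (PORT_TABLE_SIZE - 1)];
}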

> -static int cache_cleaner_cb(struct tbl_node *tbl_node, void *aux)
> +static void vport_cache_cleaner(struct tnl_vport *tnl_vport)

I'm not sure that the vport prefix makes much sense here.  You could
also probably just put this code inline now.

> @@ -860,15 +799,27 @@ static int cache_cleaner_cb(struct tbl_node *tbl_node, void *aux)
>                spin_unlock_bh(&tnl_vport->cache_lock);
>        }
>
> -       return 0;
> +       return;

The return statement is unnecessary now.

> +int tnl_init()
> +{
> +       port_table = alloc_buckets(port_table_size);
> +       if (!port_table)
> +               return -ENOMEM;

Previously we would allocate the hash table when the first tunnel port
was created but now we're doing it as soon as the module is loaded.
Why the change?  Short of a compelling reason, I think it's better to
do it on demand, otherwise we are consuming resources for something
that will potentially never get used.
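
On-demand allocation could look roughly like this (hypothetical
helper, called from tunnel port creation; assuming port creation is
serialized by the existing locking, so no extra locking is shown):

static int port_table_init(void)
{
	if (port_table)
		return 0;

	port_table = alloc_buckets(PORT_TABLE_SIZE);
	if (!port_table)
		return -ENOMEM;

	return 0;
}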

> +void tnl_exit()
> +{
> +       struct hlist_node *n, *pos;
> +       struct hlist_head *hash_head;
> +       struct tnl_vport * tnl_vport;
> +       int i;
> +
> +       if (!port_table)
> +               return;
> +
> +       for (i = 0; i < port_table_size; i++) {
> +               hash_head = &port_table[i];
> +               hlist_for_each_entry_safe(tnl_vport, pos, n, hash_head,
> +                                         hash_node) {
> +                       hlist_del_init_rcu(&tnl_vport->hash_node);
> +                       __free_port_rcu(tnl_vport);
> +               }
> +       }

When a datapath is deleted, it calls delete on all of its ports
automatically, so it shouldn't be necessary to do it here as well.

In tnl_destroy(), there's a check to see whether the port exists
before deleting it.  This should no longer be necessary because it was
only needed in the case where modifying a port failed after deletion
but before insertion.  That can no longer happen.

In addition, I don't think that deletion of any of the types of vports
can fail now, which means these delete functions can return void.  I
would like us to go through all the functions that could previously
fail due to hash table memory allocation, trace them back, and remove
the error paths that are no longer reachable.
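
For instance (sketch; the names mirror the patch but the signature
change is hypothetical):

static void tnl_destroy(struct vport *vport)
{
	struct tnl_vport *tnl_vport = tnl_vport_priv(vport);

	/* hlist removal cannot fail, so there is no error to return. */
	hlist_del_init_rcu(&tnl_vport->hash_node);
	__free_port_rcu(tnl_vport);
}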


