[ovs-dev] [PATCH 5/7] datapath: Change listing flows to use an iterator concept.

Jesse Gross jesse at nicira.com
Mon Dec 27 22:00:28 UTC 2010


On Thu, Dec 23, 2010 at 8:15 PM, Ben Pfaff <blp at nicira.com> wrote:
> One of the goals for Open vSwitch is to decouple kernel and userspace
> software, so that either one can be upgraded or rolled back independent of
> the other.  To do this in full generality, it must be possible to change
> the kernel's idea of the flow key separately from the userspace version.
> In turn, that means that flow keys must become variable-length.  This does
> not, however, fit in well with the ODP_FLOW_LIST ioctl in its current form,
> because that would require userspace to know how much space to allocate
> for each flow's key in advance, or to allocate as much space as could
> possibly be needed.  Neither choice is very attractive.
>
> This commit prepares for a different solution, by replacing ODP_FLOW_LIST
> by a new ioctl ODP_FLOW_DUMP that retrieves a single flow from the datapath
> on each call.  It is much cleaner to allocate the maximum amount of space
> for a single flow key than to do so for possibly a very large number of
> flow keys.
>
> As a side effect, this patch also fixes a race condition that sometimes
> made "ovs-dpctl dump-flows" print an error: previously, flows were listed
> and then their actions were retrieved, which left a window in which
> ovs-vswitchd could delete the flow.  Now dumping a flow and its actions is
> a single step, closing that window.
>
> Dumping all of the flows in a datapath is no longer an atomic step, so now
> it is possible to miss some flows or see a single flow twice during
> iteration, if the flow table is modified by another process.  It doesn't
> look like this should be a problem for ovs-vswitchd.
>
> It would be faster to retrieve a number of flows in batch instead of just
> one at a time, but that will naturally happen later when the kernel
> datapath interface is changed to use Netlink, so this patch does not bother
> with it.

This whole concept made me gag a little bit but I don't see a
particularly better way, so I guess it is OK for now.  Let's just make
sure that it goes away soon though, since it is inefficient, racy,
and ugly.

> diff --git a/datapath/datapath.c b/datapath/datapath.c
> index b95c8f2..7b55698 100644
> --- a/datapath/datapath.c
> +++ b/datapath/datapath.c
> +static struct sw_flow *do_dump_flow(struct datapath *dp, u32 __user *state)
> +{
> +       struct tbl *table = get_table_protected(dp);
> +       struct tbl_node *tbl_node;
> +       u32 bucket, obj;
> +
> +       if (get_user(bucket, &state[0]) || get_user(obj, &state[1]))
> +               return ERR_PTR(-EFAULT);
> +
> +       tbl_node = tbl_next(table, bucket, obj, &bucket, &obj);
> +
> +       if (__put_user(bucket, &state[0]) || __put_user(obj, &state[1]))
> +               return ERR_PTR(-EFAULT);

I think this should use put_user().  Just because we could do the read
doesn't mean the write is OK (in practice, I don't know of any
architecture that cares about the type in access_ok() but I don't see
any point in cutting corners here).
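
That is, simply (a sketch of the substitution):

	if (put_user(bucket, &state[0]) || put_user(obj, &state[1]))
		return ERR_PTR(-EFAULT);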

> +static int dump_flow(struct datapath *dp, struct odp_flow_dump __user *ufdp)
> +{
> +       struct odp_flow __user *ufp;

Can we avoid these types of abbreviations for names?  After a while
they start looking like random collections of letters that are hard to
read (ufdp is particularly bad).

> +       struct sw_flow *flow;
> +
> +       flow = do_dump_flow(dp, ufdp->state);
> +       if (IS_ERR(flow))
> +               return PTR_ERR(flow);
> +
> +       if (get_user(ufp, &ufdp->flow))

This provokes a warning from sparse since the type of ufdp->flow does
not have the __user annotation, so we should add a cast.
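
Something like this should quiet sparse (assuming 'flow' keeps its
current declaration without the annotation):

	if (get_user(ufp, (struct odp_flow __user * __user *)&ufdp->flow))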

> diff --git a/datapath/table.c b/datapath/table.c
> index 2806308..ccdf45b 100644
> --- a/datapath/table.c
> +++ b/datapath/table.c
> @@ -246,6 +246,49 @@ int tbl_foreach(struct tbl *table,
>        return 0;
>  }
>
> +/**
> + * tbl_next - find next node in hash table
> + * @table: table to iterate
> + * @s_bucket: hash value of bucket to start from
> + * @s_obj: index to start from within first bucket
> + * @e_bucket: returns bucket to pass as @s_bucket on next call
> + * @e_obj: returns index to pass as @s_obj on next call

Is it actually that interesting to break apart the input and output
arguments?  All callers currently use the same values for each and it
seems likely to stay that way.
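
In other words, a collapsed signature along these lines (just a
sketch), with each argument serving as both input and output:

	struct tbl_node *tbl_next(struct tbl *table, u32 *bucket, u32 *obj);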

> + *
> + * Returns the next node in @table in hash order, using @s_bucket and @s_obj to
> + * determine where to begin iteration and storing the values to pass on the
> + * next iteration into @e_bucket and @e_obj.  Returns %NULL when no nodes
> + * remain in the hash table.
> + *
> + * Pass 0 for @s_bucket and @s_obj to begin a new iteration.
> + */
> +struct tbl_node *tbl_next(struct tbl *table,
> +                         u32 s_bucket, u32 s_obj,
> +                         u32 *e_bucket, u32 *e_obj)
> +{

The constants here look fundamentally wrong to me.  I realize that the
logic mirrors tbl_foreach() but that looks wrong to me as well.  I
think it will work (as tbl_foreach() evidently has been working) but
that's only because the L1 and L2 sizes are the same and most of the
values line up.  For example, why are there no references to the L2
size?  The logic of find_bucket(), which is essentially doing the same
thing by linearly mapping a hash value to a bucket, looks much more
correct to me.  Am I missing something?  Regardless, I see lots of
mixing of the L1 and L2 constants (e.g. find_bucket() does not match
alloc_buckets()).

> +       unsigned int l2_size = table->n_buckets >> TBL_L1_BITS;

Shouldn't this be l1_size?  And shouldn't the constant be TBL_L1_SHIFT
(which is defined as TBL_L2_BITS, which happens to have the same value
as TBL_L1_BITS).

> +       unsigned int i;
> +
> +       for (i = s_bucket >> TBL_L1_BITS; i < l2_size; i++) {

I think this should also be TBL_L1_SHIFT and l1_size.

> +               struct tbl_bucket __rcu **l2 = table->buckets[i];
> +               unsigned int j;
> +
> +               for (j = s_bucket & (TBL_L1_SIZE - 1); j < TBL_L1_SIZE; j++) {

This is iterating over L2, so why not TBL_L2_SIZE?

> +                       struct tbl_bucket *bucket = rcu_dereference(l2[j]);
> +
> +                       if (bucket && s_obj < bucket->n_objs) {
> +                               *e_bucket = (i << TBL_L1_BITS) + j;

TBL_L1_SHIFT?
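
Putting those together, I'd expect the constants to line up more like
this (just a sketch of the substitutions, assuming TBL_L1_SHIFT and
TBL_L2_SIZE in table.h mean what I think they do):

	unsigned int l1_size = table->n_buckets >> TBL_L1_SHIFT;
	...
	for (i = s_bucket >> TBL_L1_SHIFT; i < l1_size; i++) {
		...
		for (j = s_bucket & (TBL_L2_SIZE - 1); j < TBL_L2_SIZE; j++) {
			...
			*e_bucket = (i << TBL_L1_SHIFT) + j;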

> diff --git a/include/openvswitch/datapath-protocol.h b/include/openvswitch/datapath-protocol.h
> index b8f93de..2dc3f83 100644
> --- a/include/openvswitch/datapath-protocol.h
> +++ b/include/openvswitch/datapath-protocol.h
> +/* ODP_FLOW_DUMP argument.
> + *
> + * This is used to iterate through the flow table flow-by-flow.  Each
> + * ODP_FLOW_DUMP call either stores a new odp_flow into 'flow' or stores
> + * ODPFF_EOF into flow->flags to indicate that the end of the table has been
> + * reached, and updates 'state' in-place.
> + *
> + * Before the first call, zero 'state'.  The format of 'state' is otherwise
> + * unspecified.
> + */
> +struct odp_flow_dump {
> +       struct odp_flow *flow;
> +       uint32_t state[4];

Why are there 4 state variables?  Is it just future proofing?  (I'd
really like this to be gone before the future comes...)
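
For reference, I read the intended userspace iteration as something
like this (a hypothetical sketch based on the comment above; 'dp_fd'
and the error handling are placeholders):

    struct odp_flow_dump dump;
    struct odp_flow flow;

    memset(&dump, 0, sizeof dump);
    memset(&flow, 0, sizeof flow);
    dump.flow = &flow;
    for (;;) {
        if (ioctl(dp_fd, ODP_FLOW_DUMP, &dump) == -1)
            break;          /* Error. */
        if (flow.flags & ODPFF_EOF)
            break;          /* End of table. */
        /* ...process 'flow'... */
    }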

> diff --git a/lib/hmap.c b/lib/hmap.c
> index 6b850fd..f4ae786 100644
> --- a/lib/hmap.c
> +++ b/lib/hmap.c
> @@ -218,3 +218,43 @@ hmap_random_node(const struct hmap *hmap)
>     }
>     return node;
>  }
> +
> +/* Returns the next node in 'hmap' in hash order, or NULL if no nodes remain in
> + * 'hmap'.  Uses '*bucketp' and '*offsetp' to determine where to begin
> + * iteration, and stores new values to pass on the next iteration into them
> + * before returning.
> + *
> + * Before beginning iteration, store 0 into '*bucketp' and '*offsetp'.
> + */
> +struct hmap_node *
> +hmap_at_position(const struct hmap *hmap,
> +                 uint32_t *bucketp, uint32_t *offsetp)
> +{

Is this worth breaking out into a separate patch?

> +    size_t offset;
> +    size_t b_idx;
> +
> +    offset = *offsetp;
> +    for (b_idx = *bucketp & hmap->mask; b_idx <= hmap->mask; b_idx++) {
> +        struct hmap_node *node;
> +        size_t n_idx;
> +
> +        for (n_idx = 0, node = hmap->buckets[b_idx]; node != NULL;
> +             n_idx++, node = node->next) {
> +            if (n_idx == offset) {
> +                if (node->next) {
> +                    *bucketp = node->hash;
> +                    *offsetp = offset + 1;
> +                } else {
> +                    *bucketp = node->hash + 1;
> +                    *offsetp = 0;
> +                }

I think it is interesting that you are storing the hash as the bucket,
instead of the index, as is done in the kernel.  My immediate thought
was that this was done to survive resizes, though in that case you
would also need some extra logic to only count the object offset for
nodes with the same hash value.  On the other hand, resizes usually
are due to insertion and deletion, which is always going to cause
problems here.  So is there a reason for using the hash?
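
For comparison, the index-based form I would have expected (a sketch
mirroring what the kernel side does) is:

    if (node->next) {
        *bucketp = b_idx;
        *offsetp = offset + 1;
    } else {
        *bucketp = b_idx + 1;
        *offsetp = 0;
    }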

> diff --git a/ofproto/ofproto.c b/ofproto/ofproto.c
> index e4bc199..f435690 100644
> --- a/ofproto/ofproto.c
> +++ b/ofproto/ofproto.c
> @@ -4470,36 +4470,29 @@ ofproto_expire(struct ofproto *ofproto)
>  static void
>  ofproto_update_used(struct ofproto *p)
>  {
> -    struct odp_flow *flows;
> -    size_t n_flows;
> -    size_t i;
> -    int error;
> +    struct dpif_dump dump;
> +    struct odp_flow f;
>
> -    error = dpif_flow_list_all(p->dpif, &flows, &n_flows);
> -    if (error) {
> -        return;
> -    }
> +    memset(&f, 0, sizeof f);

On subsequent calls we will have a NULL actions pointer and a nonzero
actions_len (since the kernel will initialize it with the amount of
space needed).  That combination seems wrong, and it violates the
documentation for dpif_dump_next(), which says that both should be
zeroed if the actions are not of interest.
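
One possible fix is to clear those fields again before each
subsequent call, along these lines (a sketch; I haven't checked how
the rest of the loop is structured):

    /* Reset so a stale actions_len is not paired with a NULL
     * actions pointer on the next dump call. */
    f.actions = NULL;
    f.actions_len = 0;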



