[ovs-dev] [PATCH 5/7] datapath: Change listing flows to use an iterator concept.

Ben Pfaff blp at nicira.com
Tue Dec 28 17:49:51 UTC 2010


On Mon, Dec 27, 2010 at 05:00:28PM -0500, Jesse Gross wrote:
> On Thu, Dec 23, 2010 at 8:15 PM, Ben Pfaff <blp at nicira.com> wrote:
> > One of the goals for Open vSwitch is to decouple kernel and userspace
> > software, so that either one can be upgraded or rolled back independent of
> > the other.  To do this in full generality, it must be possible to change
> > the kernel's idea of the flow key separately from the userspace version.
> > In turn, that means that flow keys must become variable-length.  This does
> > not, however, fit in well with the ODP_FLOW_LIST ioctl in its current form,
> > because that would require userspace to know how much space to allocate
> > for each flow's key in advance, or to allocate as much space as could
> > possibly be needed.  Neither choice is very attractive.
> >
> > This commit prepares for a different solution, by replacing ODP_FLOW_LIST
> > by a new ioctl ODP_FLOW_DUMP that retrieves a single flow from the datapath
> > on each call.  It is much cleaner to allocate the maximum amount of space
> > for a single flow key than to do so for possibly a very large number of
> > flow keys.
> >
> > As a side effect, this patch also fixes a race condition that sometimes
> > made "ovs-dpctl dump-flows" print an error: previously, flows were listed
> > and then their actions were retrieved, which left a window in which
> > ovs-vswitchd could delete the flow.  Now dumping a flow and its actions is
> > a single step, closing that window.
> >
> > Dumping all of the flows in a datapath is no longer an atomic step, so now
> > it is possible to miss some flows or see a single flow twice during
> > iteration, if the flow table is modified by another process.  It doesn't
> > look like this should be a problem for ovs-vswitchd.
> >
> > It would be faster to retrieve a number of flows in batch instead of just
> > one at a time, but that will naturally happen later when the kernel
> > datapath interface is changed to use Netlink, so this patch does not bother
> > with it.
> 
> This whole concept made me gag a little bit but I don't see a
> particularly better way, so I guess it is OK for now.  Let's just make
> sure that it goes away soon though, since it is inefficient, racy,
> and ugly.

For what it's worth, batching using Netlink will not make the races go
away.  Given a sufficient number of flows (enough to overflow
NLMSG_GOODSIZE bytes), the kernel "dump" function will still be called
repeatedly with locks being released between calls.
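To make the hazard concrete, here is a toy model of a resumable dump that
keeps its position in caller-held state, the same shape as ODP_FLOW_DUMP's
'state' words.  Nothing below is OVS code; every name is invented for
illustration.  A deletion between calls shifts later objects down and one
of them is silently skipped:

```c
#include <assert.h>
#include <string.h>

/* Toy model of a resumable dump over a flat object array. */
struct toy_table {
    int objs[8];
    unsigned int n;
};

/* Returns the next object at or after position *pos, advancing *pos past
 * it, or -1 when the table is exhausted.  Begin with *pos == 0. */
static int
toy_dump_next(const struct toy_table *t, unsigned int *pos)
{
    return *pos < t->n ? t->objs[(*pos)++] : -1;
}

/* Deletes objs[i], shifting later objects down, just as deletion from a
 * hash bucket's object array does. */
static void
toy_delete(struct toy_table *t, unsigned int i)
{
    memmove(&t->objs[i], &t->objs[i + 1],
            (t->n - i - 1) * sizeof t->objs[0]);
    t->n--;
}
```

Batching only changes how often the window opens; it cannot close it,
because the saved position is still interpreted against a table that may
have changed in the meantime.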

> > diff --git a/datapath/datapath.c b/datapath/datapath.c
> > index b95c8f2..7b55698 100644
> > --- a/datapath/datapath.c
> > +++ b/datapath/datapath.c
> > +static struct sw_flow *do_dump_flow(struct datapath *dp, u32 __user *state)
> > +{
> > +       struct tbl *table = get_table_protected(dp);
> > +       struct tbl_node *tbl_node;
> > +       u32 bucket, obj;
> > +
> > +       if (get_user(bucket, &state[0]) || get_user(obj, &state[1]))
> > +               return ERR_PTR(-EFAULT);
> > +
> > +       tbl_node = tbl_next(table, bucket, obj, &bucket, &obj);
> > +
> > +       if (__put_user(bucket, &state[0]) || __put_user(obj, &state[1]))
> > +               return ERR_PTR(-EFAULT);
> 
> I think this should use put_user().  Just because we could do the read
> doesn't mean the write is OK (in practice, I don't know of any
> architecture that cares about the type in access_ok() but I don't see
> any point in cutting corners here).

OK, done.

> > +static int dump_flow(struct datapath *dp, struct odp_flow_dump __user *ufdp)
> > +{
> > +       struct odp_flow __user *ufp;
> 
> Can we avoid these types of abbreviations for names?  After a while
> they start looking like random collections of letters that are hard to
> read (ufdp is particularly bad).

OK, I changed ufdp to udumpp and ufp to uflowp.  Hope that's better (or
suggest names you like).

> > +       struct sw_flow *flow;
> > +
> > +       flow = do_dump_flow(dp, ufdp->state);
> > +       if (IS_ERR(flow))
> > +               return PTR_ERR(flow);
> > +
> > +       if (get_user(ufp, &ufdp->flow))
> 
> This provokes a warning from sparse since the type of ufdp->flow does
> not have the __user annotation, so we should add a cast.

Yeah, I was pretty careless about that.  I've fixed this one now.  Maybe
I should start to run sparse routinely.

(How do I figure out what version of 'sparse' I have?  --version is a
no-op.)

> > diff --git a/datapath/table.c b/datapath/table.c
> > index 2806308..ccdf45b 100644
> > --- a/datapath/table.c
> > +++ b/datapath/table.c
> > @@ -246,6 +246,49 @@ int tbl_foreach(struct tbl *table,
> >        return 0;
> >  }
> >
> > +/**
> > + * tbl_next - find next node in hash table
> > + * @table: table to iterate
> > + * @s_bucket: hash value of bucket to start from
> > + * @s_obj: index to start from within first bucket
> > + * @e_bucket: returns bucket to pass as @s_bucket on next call
> > + * @e_obj: returns index to pass as @s_obj on next call
> 
> Is it actually that interesting to break apart the input and output
> arguments?  All callers currently use the same values for each and it
> seems likely to stay that way.

OK, changed.

> > + *
> > + * Returns the next node in @table in hash order, using @s_bucket and @s_obj to
> > + * determine where to begin iteration and storing the values to pass on the
> > + * next iteration into @e_bucket and @e_obj.  Returns %NULL when no nodes
> > + * remain in the hash table.
> > + *
> > + * Pass 0 for @s_bucket and @s_obj to begin a new iteration.
> > + */
> > +struct tbl_node *tbl_next(struct tbl *table,
> > +                         u32 s_bucket, u32 s_obj,
> > +                         u32 *e_bucket, u32 *e_obj)
> > +{
> 
> The constants here look fundamentally wrong to me.  I realize that the
> logic mirrors tbl_foreach() but that looks wrong to me as well.  I
> think it will work (as tbl_foreach() obviously has been) but that's
> only because the L1 and L2 sizes are the same and most of the values
> line up.  For example, why are there no references to the L2 size?
> The logic of find_bucket(), which is essentially doing the same thing
> by linearly mapping a hash value to a bucket, looks much more correct
> to me.  Am I missing something?  Regardless, I see lots of mixing
> (i.e. find_bucket() does not match alloc_buckets()).

You're right, there's a lot wrong here.

I sent out a separate patch series to clean up the existing code in this
file and I've applied the same fixes to the new code in this commit.

> > +       unsigned int l2_size = table->n_buckets >> TBL_L1_BITS;
> 
> Shouldn't this be l1_size?  And shouldn't the constant be TBL_L1_SHIFT
> (which is defined as TBL_L2_BITS, which happens to have the same value
> as TBL_L1_BITS).

Yes, I changed it to "unsigned int n_l1 = table->n_buckets >>
TBL_L1_SHIFT".

> > +       unsigned int i;
> > +
> > +       for (i = s_bucket >> TBL_L1_BITS; i < l2_size; i++) {
> 
> I think this should also be TBL_L1_SHIFT and l1_size.

Yes, thanks.

> > +               struct tbl_bucket __rcu **l2 = table->buckets[i];
> > +               unsigned int j;
> > +
> > +               for (j = s_bucket & (TBL_L1_SIZE - 1); j < TBL_L1_SIZE; j++) {
> 
> This is iterating over L2, so why not TBL_L2_SIZE?

Yes, thanks.

> > +                       struct tbl_bucket *bucket = rcu_dereference(l2[j]);
> > +
> > +                       if (bucket && s_obj < bucket->n_objs) {
> > +                               *e_bucket = (i << TBL_L1_BITS) + j;
> 
> TBL_L1_SHIFT?

Yes, thanks.
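For reference, the corrected walk can be modeled in userspace like this.
This is a sketch with invented names and tiny sizes, not the kernel code:
the outer loop covers n_buckets >> L1_SHIFT first-level entries, the inner
loop covers L2_SIZE second-level buckets, and the bucket number is
reassembled as (i << L1_SHIFT) + j, with L1_SHIFT == log2(L2_SIZE) so a
bucket number splits cleanly into its two indexes:

```c
#include <assert.h>
#include <stddef.h>

#define L1_SHIFT 2
#define L2_SIZE  (1u << L1_SHIFT)

struct bucket {
    const int *objs;
    unsigned int n_objs;
};

/* Returns the object at position (*bucketp, *objp) or later, updating
 * both counters to point just past it, or NULL when iteration is done.
 * Begin with both counters zero. */
static const int *
model_next(struct bucket ***l1, unsigned int n_buckets,
           unsigned int *bucketp, unsigned int *objp)
{
    unsigned int n_l1 = n_buckets >> L1_SHIFT;
    unsigned int s_bucket = *bucketp;
    unsigned int obj = *objp;
    unsigned int i;

    for (i = s_bucket >> L1_SHIFT; i < n_l1; i++) {
        struct bucket **l2 = l1[i];
        unsigned int j;

        for (j = s_bucket & (L2_SIZE - 1); j < L2_SIZE; j++) {
            const struct bucket *bucket = l2[j];

            if (bucket && obj < bucket->n_objs) {
                *bucketp = (i << L1_SHIFT) + j;
                *objp = obj + 1;
                return &bucket->objs[obj];
            }
            obj = 0;
        }
        s_bucket = 0;   /* Later L1 entries start at their first bucket. */
    }
    return NULL;
}
```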

> > diff --git a/include/openvswitch/datapath-protocol.h b/include/openvswitch/datapath-protocol.h
> > index b8f93de..2dc3f83 100644
> > --- a/include/openvswitch/datapath-protocol.h
> > +++ b/include/openvswitch/datapath-protocol.h
> > +/* ODP_FLOW_DUMP argument.
> > + *
> > + * This is used to iterate through the flow table flow-by-flow.  Each
> > + * ODP_FLOW_DUMP call either stores a new odp_flow into 'flow' or stores
> > + * ODPFF_EOF into flow->flags to indicate that the end of the table has been
> > + * reached, and updates 'state' in-place.
> > + *
> > + * Before the first call, zero 'state'.  The format of 'state' is otherwise
> > + * unspecified.
> > + */
> > +struct odp_flow_dump {
> > +       struct odp_flow *flow;
> > +       uint32_t state[4];
> 
> Why are there 4 state variables?  Is it just future proofing (I'd
> really like this to be gone before the future comes...).

Just future proofing.  I've changed it to 2 now.

I agree that I want this to be gone soon (ideally before I even push
this commit, by getting the followups reviewed and pushing it all at
once, but that may be idle dreaming).

> > diff --git a/lib/hmap.c b/lib/hmap.c
> > index 6b850fd..f4ae786 100644
> > --- a/lib/hmap.c
> > +++ b/lib/hmap.c
> > @@ -218,3 +218,43 @@ hmap_random_node(const struct hmap *hmap)
> >     }
> >     return node;
> >  }
> > +
> > +/* Returns the next node in 'hmap' in hash order, or NULL if no nodes remain in
> > + * 'hmap'.  Uses '*bucketp' and '*offsetp' to determine where to begin
> > + * iteration, and stores new values to pass on the next iteration into them
> > + * before returning.
> > + *
> > + * Before beginning iteration, store 0 into '*bucketp' and '*offsetp'.
> > + */
> > +struct hmap_node *
> > +hmap_at_position(const struct hmap *hmap,
> > +                 uint32_t *bucketp, uint32_t *offsetp)
> > +{
> 
> Is this worth breaking out into a separate patch?

Why not.  Done.

> > +    size_t offset;
> > +    size_t b_idx;
> > +
> > +    offset = *offsetp;
> > +    for (b_idx = *bucketp & hmap->mask; b_idx <= hmap->mask; b_idx++) {
> > +        struct hmap_node *node;
> > +        size_t n_idx;
> > +
> > +        for (n_idx = 0, node = hmap->buckets[b_idx]; node != NULL;
> > +             n_idx++, node = node->next) {
> > +            if (n_idx == offset) {
> > +                if (node->next) {
> > +                    *bucketp = node->hash;
> > +                    *offsetp = offset + 1;
> > +                } else {
> > +                    *bucketp = node->hash + 1;
> > +                    *offsetp = 0;
> > +                }
> 
> I think it is interesting that you are storing the hash as the bucket,
> instead of the index, as is done in the kernel.  My immediate thought
> was that this was done to survive resizes, though in that case you
> would also need some extra logic to only count the object offset for
> nodes with the same hash value.  On the other hand, resizes usually
> are due to insertion and deletion, which is always going to cause
> problems here.  So is there a reason for using the hash?

No, using the bucket number works just as well here.  I'll change it.
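For the record, the agreed-upon variant might look like this over a toy
chained hash map.  This is a sketch with invented names, not lib/hmap.c
itself: *bucketp holds the bucket index rather than the node's hash, so
the saved position no longer depends on the hash values of the nodes it
passes over:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_node {
    struct toy_node *next;
};

struct toy_hmap {
    struct toy_node **buckets;  /* Array of 'mask + 1' chains. */
    size_t mask;                /* Number of buckets minus 1 (power of 2). */
};

/* Returns the next node in hash order, or NULL when none remain.  Begin
 * with *bucketp == *offsetp == 0; both are updated in place. */
static struct toy_node *
toy_hmap_at_position(const struct toy_hmap *hmap,
                     uint32_t *bucketp, uint32_t *offsetp)
{
    uint32_t offset = *offsetp;
    size_t b_idx;

    for (b_idx = *bucketp; b_idx <= hmap->mask; b_idx++) {
        struct toy_node *node;
        uint32_t n_idx;

        for (n_idx = 0, node = hmap->buckets[b_idx]; node != NULL;
             n_idx++, node = node->next) {
            if (n_idx == offset) {
                if (node->next) {
                    *bucketp = b_idx;       /* Bucket index, not hash. */
                    *offsetp = offset + 1;
                } else {
                    *bucketp = b_idx + 1;
                    *offsetp = 0;
                }
                return node;
            }
        }
        offset = 0;   /* Later buckets start from their first node. */
    }
    *bucketp = 0;
    *offsetp = 0;
    return NULL;
}
```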

> > diff --git a/ofproto/ofproto.c b/ofproto/ofproto.c
> > index e4bc199..f435690 100644
> > --- a/ofproto/ofproto.c
> > +++ b/ofproto/ofproto.c
> > @@ -4470,36 +4470,29 @@ ofproto_expire(struct ofproto *ofproto)
> >  static void
> >  ofproto_update_used(struct ofproto *p)
> >  {
> > -    struct odp_flow *flows;
> > -    size_t n_flows;
> > -    size_t i;
> > -    int error;
> > +    struct dpif_dump dump;
> > +    struct odp_flow f;
> >
> > -    error = dpif_flow_list_all(p->dpif, &flows, &n_flows);
> > -    if (error) {
> > -        return;
> > -    }
> > +    memset(&f, 0, sizeof f);
> 
> On subsequent calls we will have a NULL actions pointer and a non-zero
> actions_len (since the kernel will initialize it with the amount of
> space needed).  This seems not great and violates the documentation
> for dpif_dump_next(), which says both should be zeroed if actions are
> not of interest.

Good point.

Thinking about the interface again, I think that it just isn't very
good, so I changed it to only update actions_len if it was initially
supplied as nonzero.  To my surprise, the kernel (and userspace)
datapaths already did that; I thought that I had made them implement the
interface that I had specified.  So this turned out to be a change only
to comments.
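A sketch of that revised contract, with invented names rather than the
real datapath entry points: actions are copied only when the caller
supplied a buffer, and actions_len is rewritten to the space actually
needed only when it was nonzero to begin with, so a zeroed request stays
zeroed across repeated dumps:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_query {
    void *actions;       /* Buffer for actions, or NULL. */
    size_t actions_len;  /* Size of 'actions' in bytes; 0 = not wanted. */
};

/* Copies at most query->actions_len bytes of the flow's actions into
 * query->actions and records the number of bytes actually needed.  When
 * actions_len is 0, the caller is not interested in actions, and the
 * query is left untouched. */
static void
toy_answer_query(const void *flow_actions, size_t flow_actions_len,
                 struct toy_query *query)
{
    if (query->actions_len) {
        size_t n = flow_actions_len < query->actions_len
                   ? flow_actions_len : query->actions_len;

        memcpy(query->actions, flow_actions, n);
        query->actions_len = flow_actions_len;  /* Space needed. */
    }
}
```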

When I was working ahead on further patches in the Netlink conversion, I
discovered that the flow iteration interface was not general enough, so
I also reworked this patch to make it more complete.

I'll send out the entire revised patch series later.

Thanks for the review,

Ben.
