[ovs-dev] [PATCH 2/3] datapath: Mega flow implementation

Andy Zhou azhou at nicira.com
Fri Jun 7 10:35:34 UTC 2013


Thanks for the review.

On Thu, Jun 6, 2013 at 7:01 PM, Jesse Gross <jesse at nicira.com> wrote:

> On Wed, Jun 5, 2013 at 10:46 PM, Andy Zhou <azhou at nicira.com> wrote:
> > Add mega flow support in kernel datapath.
> >
> > Pravin has made significant contributions to this patch, including
> > the mega flow id lookup scheme, API cleanups, and bug fixes.
> >
> > Co-authored-by: Pravin B Shelar <pshelar at nicira.com>
> > Signed-off-by: Pravin B Shelar <pshelar at nicira.com>
> > Signed-off-by: Andy Zhou <azhou at nicira.com>
>
> I would probably just squash the previous patch and this one together
> since they really come together and the first one is so small.
>
> I received a sparse error after applying this patch:
>   CHECK   /home/jgross/openvswitch/datapath/linux/flow.c
> /home/jgross/openvswitch/datapath/linux/flow.c:179:40: warning:
> restricted __be16 degrades to integer
>
>
> It looks like a legitimate problem, although one that will not
> actually affect things in practice because of the value.
>

Agreed, will fix.
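For reference, the usual fix pattern, shown here as a userspace sketch since the exact expression at flow.c:179 isn't quoted above. The `is_ip_frame()` helper and its argument are hypothetical; the point is that a `__be16` should only be compared against another `__be16`, e.g. one produced by htons():

```c
#include <arpa/inet.h>
#include <stdint.h>

typedef uint16_t __be16;   /* userspace stand-in for the kernel type */

#define ETH_P_IP 0x0800    /* host-order protocol constant */

/* Comparing eth_type directly with ETH_P_IP would make sparse warn
 * "restricted __be16 degrades to integer"; converting the constant
 * with htons() keeps both sides in network byte order. */
static int is_ip_frame(__be16 eth_type)
{
    return eth_type == htons(ETH_P_IP);
}
```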

>
> > diff --git a/datapath/datapath.c b/datapath/datapath.c
> > index 42af315..98d78a8 100644
> > --- a/datapath/datapath.c
> > +++ b/datapath/datapath.c
> > @@ -372,13 +372,13 @@ static size_t key_attr_size(void)
> >  {
> >         return    nla_total_size(4)   /* OVS_KEY_ATTR_PRIORITY */
> >                 + nla_total_size(0)   /* OVS_KEY_ATTR_TUNNEL */
> > -                 + nla_total_size(8)   /* OVS_TUNNEL_KEY_ATTR_ID */
> > -                 + nla_total_size(4)   /* OVS_TUNNEL_KEY_ATTR_IPV4_SRC */
> > -                 + nla_total_size(4)   /* OVS_TUNNEL_KEY_ATTR_IPV4_DST */
> > -                 + nla_total_size(1)   /* OVS_TUNNEL_KEY_ATTR_TOS */
> > -                 + nla_total_size(1)   /* OVS_TUNNEL_KEY_ATTR_TTL */
> > -                 + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT */
> > -                 + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_CSUM */
> > +               + nla_total_size(8)   /* OVS_TUNNEL_KEY_ATTR_ID */
> > +               + nla_total_size(4)   /* OVS_TUNNEL_KEY_ATTR_IPV4_SRC */
> > +               + nla_total_size(4)   /* OVS_TUNNEL_KEY_ATTR_IPV4_DST */
> > +               + nla_total_size(1)   /* OVS_TUNNEL_KEY_ATTR_TOS */
> > +               + nla_total_size(1)   /* OVS_TUNNEL_KEY_ATTR_TTL */
> > +               + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT */
> > +               + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_CSUM */
> >                 + nla_total_size(4)   /* OVS_KEY_ATTR_IN_PORT */
> >                 + nla_total_size(4)   /* OVS_KEY_ATTR_SKB_MARK */
> >                 + nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
>
> These are actually intentionally indented to show a grouping of the
> tunnel sub-attributes. Regardless, we should also try to keep
> unrelated changes out of a patch.
>
I did not know this was intentional. I will roll back the change.

>
> > @@ -1138,14 +1142,27 @@ static int ovs_flow_cmd_fill_info(struct sw_flow *flow, struct datapath *dp,
> [...]
> > +    nla = nla_nest_start(skb, OVS_FLOW_ATTR_MASK);
> > +    if (!nla)
> > +        goto nla_put_failure;
> > +
> > +    err = ovs_flow_to_nlattrs(&flow->key,
> > +                              &flow->mfm->key, skb);
> > +    if (err)
> > +        goto error;
> > +
> > +    nla_nest_end(skb, nla);
>
> It looks to me like we don't allocate any extra space in the buffer
> for the mask when calling ovs_flow_cmd_build_info().
>
Do you mean ovs_dp_cmd_msg_size() does not include the size of mask (and
key) attributes?
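If so, the fix would budget the mask alongside the key when sizing the reply. The sketch below is userspace C mirroring the netlink alignment rules (`nla_total_size()` behaves like the kernel helper); `key_attr_size()` is stubbed with a placeholder value, and the function name is illustrative:

```c
#include <stddef.h>

#define NLA_ALIGNTO   4
#define NLA_ALIGN(n)  (((n) + NLA_ALIGNTO - 1) & ~(size_t)(NLA_ALIGNTO - 1))
#define NLA_HDRLEN    4   /* aligned size of struct nlattr */

/* Total space one attribute occupies: header plus aligned payload. */
static size_t nla_total_size(size_t payload)
{
    return NLA_ALIGN(NLA_HDRLEN + payload);
}

/* Placeholder; the real value comes from key_attr_size() in datapath.c. */
static size_t key_attr_size(void)
{
    return 128;
}

/* The mask is serialized with the same attribute layout as the key,
 * so the reply buffer must budget key_attr_size() twice. */
static size_t flow_msg_payload_size(void)
{
    return nla_total_size(key_attr_size())   /* OVS_FLOW_ATTR_KEY */
         + nla_total_size(key_attr_size());  /* OVS_FLOW_ATTR_MASK */
}
```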

>
> The indentation here also uses spaces instead of tabs.
>
Will fix.

>
> > @@ -1241,23 +1261,33 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
> >         error = -EINVAL;
> >         if (!a[OVS_FLOW_ATTR_KEY])
> >                 goto error;
> > -       error = ovs_flow_from_nlattrs(&key, &key_len,
> a[OVS_FLOW_ATTR_KEY]);
> > -       if (error)
> > -               goto error;
> > +
> > +       mfm = sw_flow_mask_alloc();
>
> We should make sure that all symbols that are shared across
> compilation units are prefixed with ovs_ so that we don't pollute the
> global kernel namespace.
>
Will fix.

>
> I suspect that doing the allocation for the mask up front will result
> in a lot of extra mallocs and frees on the flow setup path since most
> of the time we will be hitting an existing match. Is there a way to
> eliminate that?
>
Good point, I will implement this change.

>
> > +       if (mfm == NULL)
> > +               return -ENOMEM;
>
> Just checking !mfm is probably more canonical.
>
Will fix.

>
> > +       ovs_match_init(&match, &key, mfm);
> > +       error = ovs_match_from_nlattrs(&match,
> > +                       a[OVS_FLOW_ATTR_KEY], a[OVS_FLOW_ATTR_MASK]);
> > +       if (error) {
> > +               goto err_kfree;
> > +       }
>
> Kernel style is to not use braces around single-line if statements.
>
Will fix.

>
> > @@ -1267,7 +1297,23 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
> >                 goto err_unlock_ovs;
> >
> >         table = ovsl_dereference(dp->table);
> > -       flow = ovs_flow_tbl_lookup(table, &key, key_len);
> > +
> > +       /* Deduplicate the mega flow mask. Note ovs_mutex is held. */
> > +       existing_mfm = sw_flow_mask_find(mfm);
> > +       if (existing_mfm) {
> > +               flow = ovs_flow_tbl_lookup(table, &key, existing_mfm, key_len);
> > +               if (flow) {
> > +                       if (!ovs_flow_cmp_id(flow, &key, key_len)) {
> > +                               error = -EINVAL;
> > +                               goto err_unlock_ovs;
> > +                       }
>
> I believe this allows for inserting the same flow with multiple masks
> but this isn't really good because the order of the masks (and
> therefore priority) is somewhat arbitrary. If we just do a lookup
> similar to how a packet is processed on the exact flow then it seems a
> little more robust and means that we can eliminate the need to have a
> second hash table.
>
So we should just look up any existing flow by key only, and the mask
attribute is only used when creating a new flow?


>
> > @@ -1294,14 +1340,17 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
> >                 }
> >                 clear_stats(flow);
> >
> > +               rcu_assign_pointer(flow->mfm, mfm);
> >                 rcu_assign_pointer(flow->sf_acts, acts);
> >
> >                 /* Put flow in bucket. */
> >                 ovs_flow_tbl_insert(table, flow, &key, key_len);
> >
> > +               if (existing_mfm == NULL)
> > +                       sw_flow_mask_insert(mfm);
>
> Same (minor) comment here about the NULL check.
>
Will fix.

>
> > diff --git a/datapath/flow.h b/datapath/flow.h
> > index dba66cf..9fb5a1c 100644
> > --- a/datapath/flow.h
> > +++ b/datapath/flow.h
> > @@ -117,9 +119,13 @@ struct sw_flow_key {
> >  struct sw_flow {
> >         struct rcu_head rcu;
> >         struct hlist_node hash_node[2];
> > +       struct hlist_node hash_node_id[2];
>
> I don't think there is a need to have two instances of hash_node_id
> because it is only accessed under lock. The reason for having two in
> packet lookup is to avoid a hiccup in traffic when we move a flow from
> one table to another. However, there are no such concurrent accesses
> in the lookup table.
>
It seems we are going down the path of removing hash_node_id
entirely, right?

>
> > +struct sw_flow_desc {
> > +       size_t start;
> > +       size_t end;
> > +};
>
> I found the name "desc" to not be very intuitive - is there another name
> that is more self-explanatory?
>
How about sw_flow_key_range? Any other suggestions?

>
> > +static inline u16 sw_flow_desc_roundup_size(const struct sw_flow_desc *desc)
> > +{
> > +       u16 actual = sw_flow_desc_actual_size(desc);
> > +       u16 n32s = actual / sizeof(u32);
> > +       u16 roundup;
> > +
> > +       roundup = (actual & 0x3) ? (n32s + 1) * sizeof(u32) : actual;
> > +
> > +       return roundup;
> > +}
>
> Is this the same as DIV_ROUND_UP?
>
I did not use it because DIV_ROUND_UP gives a count of u32s, while I
needed the rounded-up size in bytes. However, I could rewrite this
function using DIV_ROUND_UP.
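Something like the following sketch (userspace C, with DIV_ROUND_UP defined as in the kernel) keeps the byte semantics while dropping the hand-rolled masking; the helper name is illustrative:

```c
#include <stdint.h>

typedef uint16_t u16;
typedef uint32_t u32;

/* Same definition as the kernel's DIV_ROUND_UP macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Round a byte count up to the next multiple of sizeof(u32):
 * DIV_ROUND_UP yields a count of u32s, and multiplying back by
 * sizeof(u32) converts that count into bytes. */
static u16 roundup_u32_bytes(u16 actual)
{
    return DIV_ROUND_UP(actual, sizeof(u32)) * sizeof(u32);
}
```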

>
> > +bool ovs_match_validate(const struct sw_flow_match *match,
> > +               u64 key_attrs, u64 mask_attrs);
> > +
>
> It looks like we don't use this function outside of flow.c, in which
> case I think it could be made static.
>
Agreed.

>
> >  struct flow_table {
> >         struct flex_array *buckets;
> > +       struct flex_array *buckets_ids;
>
> As I look at this further, I've become more and more convinced that we
> really only want a single hash table. Otherwise, there are a lot of
> questions about sizing, the need to expand, rehash, etc.
>
I agree a single hash table will work.

> > +struct sw_flow_mask {
> > +       struct sw_flow_desc desc;
> > +       struct sw_flow_key key;
> > +       struct list_head list;
> > +       struct rcu_head rcu;
> > +       struct kref kref;
>
> I probably wouldn't use kref here since the use of atomics makes it
> look like there is some extra locking going on. In our case, the lock
> is provided by ovs_mutex, so we could just use an int.
>
Agreed, as long as the packet path does not change the reference count.
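To make the locking assumption explicit, a plain-int version could look like this userspace sketch. Field and function names follow the patch but are illustrative; in the kernel, every caller would hold ovs_mutex, so no atomics are needed:

```c
#include <assert.h>

struct sw_flow_mask {
    int ref_count;   /* protected by ovs_mutex; never touched on the packet path */
    /* ... key, range, and list linkage elided ... */
};

/* Caller must hold ovs_mutex. */
static void mask_add_ref(struct sw_flow_mask *mask)
{
    mask->ref_count++;
}

/* Caller must hold ovs_mutex.  Returns the remaining count; the
 * caller frees the mask (RCU-deferred in the kernel) at zero. */
static int mask_del_ref(struct sw_flow_mask *mask)
{
    assert(mask->ref_count > 0);
    return --mask->ref_count;
}
```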

>
> > +struct sw_flow_mask *sw_flow_mask_alloc(void);
> > +void sw_flow_mask_add_ref(struct sw_flow_mask *);
> > +int sw_flow_mask_del_ref(struct sw_flow_mask *);
> > +void sw_flow_mask_insert(struct sw_flow_mask *);
> > +void sw_flow_mask_set(struct sw_flow_mask *, struct sw_flow_desc *, u8 val);
> > +struct sw_flow_mask *sw_flow_mask_find(const struct sw_flow_mask *);
>
> All of these should be prefixed with "ovs_".
>
Sure.

> I'm still working on flow.c - in particular, I want to go through the
> usage of key length and encap attributes carefully. I should have the
> rest of the comments tomorrow.
>