[ovs-dev] [netlink v4 27/52] datapath: Change userspace vport interface to use Netlink attributes.

Jesse Gross jesse at nicira.com
Mon Jan 17 23:21:45 UTC 2011


On Mon, Jan 17, 2011 at 1:47 PM, Ben Pfaff <blp at nicira.com> wrote:
> Jesse, I've noticed that your followups to my patches are now in
> quoted-printable and tabs are getting rendered as sequences of
> alternating spaces and hard spaces.  This looks OK in the gmail view but
> "mutt" doesn't like it.  This didn't happen before.  I don't think I've
> changed any of my local settings, so I wonder whether anything has
> changed on your end?

I wrote this on the plane and had to copy things around before I send
it off, so my guess is that caused the formatting changes.

> In copy_vport_to_user():
>> > + ? ? ? nla = nla_reserve(skb, ODPPT_STATS, sizeof(struct rtnl_link_stats64));
>> > + ? ? ? if (!nla)
>> > + ? ? ? ? ? ? ? goto nla_put_failure;
>> > + ? ? ? if (vport_get_stats(vport, nla_data(nla)))
>> > + ? ? ? ? ? ? ? __skb_trim(skb, skb->len - nla->nla_len);
>>
>> Everywhere else if we run into a problem we abort the operation but
>> here we try to keep going.  Is there a reason for this?
>
> I was making a distinction between a failure to add data to the skb and
> a failure to fetch stats.  The former shouldn't fail in the normal case.
> If it does, it probably means that we are dumping out all the vports in
> a datapath and that this one that we're at now just won't fit, so that
> we have to back up and try again on the next callback (this can't even
> happen until the vport layer has been changed to use AF_NETLINK, but
> adding the error checking now instead of later makes later patches
> smaller and easier to read).  The latter means that there's something
> wrong with an underlying vport and it seems better to me to just leave
> out the stats, since we can't get them.

OK, that make sense.  I was primarily comparing it to the
get_options() block of code, which will abort if get_options() returns
an error.  I guess that get_options() should only return an error if
it can't add the data to the skb and not if the option fetching itself
fails (which should be documented).

Separately, I see the documentation for get/set_options both in
vport.c and vport.h seems to be out of date.  I also see that nothing
actually calls vport_get_options().  Should we move that whole block
of code relating to getting the options in copy_vport_to_user() to
vport_get_options() and call that?

>> > +static int change_vport(struct vport *vport, struct nlattr *a[ODPPT_MAX + 1])
>> > +{
>> > + ? ? ? int err = 0;
>> > + ? ? ? if (a[ODPPT_STATS])
>> > + ? ? ? ? ? ? ? err = vport_set_stats(vport, nla_data(a[ODPPT_STATS]));
>> > + ? ? ? if (!err && a[ODPPT_ADDRESS])
>> > + ? ? ? ? ? ? ? err = vport_set_addr(vport, nla_data(a[ODPPT_ADDRESS]));
>> > + ? ? ? if (!err && a[ODPPT_MTU])
>> > + ? ? ? ? ? ? ? err = vport_set_mtu(vport, nla_get_u32(a[ODPPT_MTU]));
>>
>> The fact we can fail and return with a transaction half complete concerns me.
>
> This is a pervasive issue in kernel Netlink implementations.  Take a
> look at the implementation of RTM_SETLINK in do_setlink() in
> net/core/rtnetlink.c.
>
> My understanding is that this is considered unavoidable behavior and so
> the "solution" is for userspace to change just one thing at a time if it
> cares.

Yeah, I more or less figured that would be the solution.

>
>> > +static int attach_vport(struct odp_vport __user *uodp_vport)
>> > ?{
>> [...]
>> > +       port_no = odp_vport->port_no;
>> > +       if (port_no > 0) {
>> > +               vport = get_vport_protected(dp, odp_vport->port_no);
>> > +               err = -EBUSY;
>> > +               if (vport)
>> > +                       goto exit_unlock_dp;
>>
>> Is there actually a good reason to allow userspace to request a
>> particular port?  It seems cleaner to just assign a port number
>> however we see fit.
>
> Yes, I see two reasons for userspace to request a port number (obviously
> we don't take advantage of either of these yet):
>
>        - To intentionally reuse a port number.  If I'm taking down a
>          vport and replacing it with a different one then I might want
>          to make sure that the new one has the same number as the old
>          so that flows stay valid.  This might be useful for some
>          bonding scenarios, for example.
>
>        - To intentionally avoid reusing a port number.  For example,
>          suppose that one VIF disappears and a new VIF comes up
>          quickly.  It is a bit of a security risk for the new VIF to
>          have the same port number, since there might be OpenFlow flows
>          for the old VIF's port number.  So we should try to avoid
>          reusing port numbers quickly.  (This has been bothering me for
>          years but somehow has never bubbled to the top of the pile.)

OK, those sound like good reasons.

>> > diff --git a/include/openvswitch/datapath-protocol.h b/include/openvswitch/datapath-protocol.h
>> > index b6ed979..ae82531 100644
>> > --- a/include/openvswitch/datapath-protocol.h
>> > +++ b/include/openvswitch/datapath-protocol.h
>> > +struct odp_vport {
>> > + ? ? ? int32_t dp_idx;
>> > + ? ? ? int32_t port_no;
>>
>> Why are these signed?
>
> They are signed because we need to have values that mean "none
> specified".  At this point in the series, the value 0 is meaningful for
> both dp_idx and port_no.  Much later on (in one of the supplementary
> series I sent out), dp_idx changes to an ifindex, so at that point I
> start using 0 for a "none specified" value (and could potentially switch
> to an unsigned type, except that the proper POSIX type for an ifindex is
> "int" so actually I switch to that).
>
> We could switch to uint32_t and use UINT32_MAX as "none" if you prefer.

I think where it makes sense we should use unsigned types, if only
because signed types makes me wonder about the consequences of a
negative index.  Obviously, for things like ifindexes we should use
the proper type.

>
> This is related to the next issue, see below:
>
>> > + ? ? ? uint32_t type;
>> > + ? ? ? uint32_t len;
>> > + ? ? ? uint32_t total_len;
>> > +};
>> > +
>> > +enum {
>> > + ? ? ? ?ODPPT_UNSPEC,
>> > + ? ? ? ?ODPPT_NAME, ? ? ? ? ? ?/* string name, up to IFNAMSIZ bytes long */
>> > + ? ? ? ?ODPPT_STATS, ? ? ? ? ? /* struct rtnl_link_stats64 */
>> > + ? ? ? ?ODPPT_ADDRESS, ? ? ? ? /* hardware address */
>> > + ? ? ? ?ODPPT_MTU, ? ? ? ? ? ? /* 32-bit maximum transmission unit */
>> > + ? ? ? ?ODPPT_OPTIONS, ? ? ? ? /* nested attributes, varies by vport type */
>> > + ? ? ? ?ODPPT_IFINDEX, ? ? ? ? /* 32-bit ifindex of backing netdev */
>> > + ? ? ? ?ODPPT_IFLINK, ? ? ? ? ?/* 32-bit ifindex on which packets are sent */
>> > + ? ? ? ?__ODPPT_MAX
>> > ?};
>>
>> The rationale for putting fields in the header vs attributes is not
>> very clear to me (with the obvious exception of the length fields).
>> It's perhaps not such a big deal here because we do need the fixed
>> header at the beginning to hold the lengths.  However, in the later
>> patches where we actually use Netlink, this seems to lead unnecessary
>> complexity.  Having these fields in the header means that different
>> operations have different headers, which means that we have different
>> families.  The end result is that we have a proliferation of families
>> and structs.  I would expect it to have one family with maybe a fixed
>> header consisting of just the DP index.
>
> The difference between approaches is a matter of philosophy.  I had
> viewed each of the four families (datapath, vport, flow, upcall) as a
> separate entity and thus given each of them a separate header and
> family.  But I can see that using a single family with a shared header
> or perhaps no header at all would simplify other matters, as you say.
>
> There is no technical reason we cannot do it the way you suggest.  It
> had not occurred to me before to do it that way.  If you prefer this
> approach, then I will rework things to work that way instead.

I guess I would prefer to make things a single family - I have a hard
time logically justifying having so many different families and it
seems that other parts of the kernel generally only have one.
Combining them would shrink the interface, which is a pretty big win
to me.

>
>> > diff --git a/include/openvswitch/tunnel.h b/include/openvswitch/tunnel.h
>> > index 44facfa..1e7d2b4 100644
>> > --- a/include/openvswitch/tunnel.h
>> > +++ b/include/openvswitch/tunnel.h
>> > @@ -43,6 +43,15 @@
>> > ?#include <linux/types.h>
>> > ?#include "openvswitch/datapath-protocol.h"
>> >
>> > +/* ODPPT_OPTIONS attributes for tunnels. */
>> > +enum {
>> > + ? ? ? TNLAT_UNSPEC,
>> > + ? ? ? TNLAT_CONFIG, ? ? ? ? ? /* struct tnl_port_config */
>>
>> I think this really needs to be broken apart into separate attributes.
>>  I already have plans to change this struct, so not it's not realistic
>> to expect it to stay the constant.
>
> OK, that's fine.  Here's a proposal for attributes for you to review
> before I go further along in implementation.  I've retained the flags as
> a 32-bit bitmask member; another way to do this with Netlink is to use
> attributes that do not carry any data as flags:

The attributes look good, I don't think there is much benefit in
breaking apart the flags into separate attributes.

>> > ?static int
>> > ?dpif_linux_port_add(struct dpif *dpif, struct netdev *netdev,
>> > ? ? ? ? ? ? ? ? ? ? uint16_t *port_nop)
>>
>> The kernel can return 32-bit port numbers but userspace (and OpenFlow
>> 1.0) only use 16-bit ports.  Is there anything to prevent overflow
>> other than the fact that we're unlikely to have both that 64k ports?
>> Right now we use linear allocation but I'm not sure that it's
>> something we want to guarantee.
>
> No, there's nothing that prevents overflow.  My approach here has been
> lazy, updating code as I make a pass over it to change from 16-bit to
> 32-bit port numbers, so there are undoubtedly many places with this kind
> of problem.  I'd been planning eventually to make a proper sweep over
> the code and fix up these problems.  Do you think that I should do
> something earlier?

No, it should be fine for now.  With the current allocation method we
shouldn't run into a problem.  Changing the userspace datatype for
ports doesn't require touching the kernel interface so it shouldn't be
disruptive in the future.

>> > diff --git a/lib/dpif-linux.h b/lib/dpif-linux.h
>> > index 1dff4b7..552f432 100644
>> > --- a/lib/dpif-linux.h
>> > +++ b/lib/dpif-linux.h
>> > @@ -1,5 +1,5 @@
>> > ?/*
>> > - * Copyright (c) 2010 Nicira Networks.
>> > + * Copyright (c) 2010, 2011 Nicira Networks.
>> > ?*
>> > ?* Licensed under the Apache License, Version 2.0 (the "License");
>> > ?* you may not use this file except in compliance with the License.
>> > @@ -18,6 +18,37 @@
>> > ?#define DPIF_LINUX_H 1
>> >
>> > ?#include <stdbool.h>
>> > +#include "openvswitch/datapath-protocol.h"
>> > +
>> > +struct ofpbuf;
>> > +
>> > +struct dpif_linux_vport {
>> > + ? ?/* ioctl command argument. */
>> > + ? ?int cmd;
>>
>> 'cmd' feels a little out of place to me because it isn't actually part
>> of the port is often left uninitialized (in replies for example).
>> Would it make sense to just have it be a separate argument to
>> dpif_linux_vport_transact()?
>
> That is in fact how an earlier version of the series worked.  I changed
> it to be part of the structure for two reasons, both of which are
> related to how Netlink works.  First, Netlink includes the command as
> part of the request message, not as a separate argument, so later on it
> makes logical sense to keep everything together.  Second, Netlink
> includes a command as part of replies, too, and this command can be
> important, especially for broadcast notifications of changes (e.g. is
> this a "new" or "delete" notification?).  It seemed that those were good
> enough reasons to put it in the structure, and after that it seemed like
> a good enough reason to do it that way from the beginning.

OK, that makes sense.

>> > ?static int
>> > ?netdev_vport_get_mtu(const struct netdev *netdev, int *mtup)
>> > ?{
>> > - ? ?struct odp_vport_mtu vport_mtu;
>> > - ? ?int err;
>> > -
>> > - ? ?ovs_strlcpy(vport_mtu.devname, netdev_get_name(netdev),
>> > - ? ? ? ? ? ? ? ?sizeof vport_mtu.devname);
>> > + ? ?struct dpif_linux_vport reply;
>> > + ? ?struct ofpbuf *buf;
>> > + ? ?int error;
>> >
>> > - ? ?err = netdev_vport_do_ioctl(ODP_VPORT_MTU_GET, &vport_mtu);
>> > - ? ?if (err) {
>> > - ? ? ? ?return err;
>> > + ? ?error = dpif_linux_vport_get(netdev_get_name(netdev), &reply, &buf);
>> > + ? ?if (!error) {
>> > + ? ? ? ?*mtup = reply.mtu;
>>
>> Hmm, we probably need some way of detecting that the port doesn't have an MTU.
>
> I think that comes back to the patch "datapath: Consider tunnels to have
> no MTU, ..." that I sent out last Thursday.  If you haven't read that
> yet then maybe you should.

OK, I guess if we have an MTU of zero at this point we assume there is no MTU.

>
>> > +static const struct tnl_port_config *
>> > +tnl_port_config_from_nlattr(const struct nlattr *options, size_t options_len)
>> > +{
>> > + ? ?static const struct nl_policy odp_tunnel_policy[] = {
>> > + ? ? ? ?[TNLAT_CONFIG] = { .type = NL_A_UNSPEC,
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? .min_len = sizeof(struct tnl_port_config),
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? .max_len = sizeof(struct tnl_port_config),
>> > + ? ? ? ? ? ? ? ? ? ? ? ? ? .optional = false }
>> > + ? ?};
>> > + ? ?struct nlattr *a[ARRAY_SIZE(odp_tunnel_policy)];
>> > + ? ?struct ofpbuf buf;
>> > +
>> > + ? ?ofpbuf_use_const(&buf, options, options_len);
>> > + ? ?if (!nl_policy_parse(&buf, 0, odp_tunnel_policy,
>> > + ? ? ? ? ? ? ? ? ? ? ? ? a, ARRAY_SIZE(odp_tunnel_policy))) {
>>
>> Why do we use parse in one place and find in another?  Aren't we
>> trying to solve essentially the same problem in both places?
>
> There's no good reason.  I didn't invent nl_attr_find() until after I'd
> written that function, and forgot to go back and fix it up.  There's no
> point in fixing it up now since we're breaking up TNLAT_CONFIG into
> multiple attributes anyway, so I'll leave it for that.

OK.

>
> Here's the diff versus (the mostly recently rebased version of) the
> version before your comments.  I'm omitting the tunnel.h changes since
> I'm expecting a design review from you first:

Looks good.




More information about the dev mailing list