[ovs-dev] [PATCH 11/16] tunneling: Add datapath GRE support.

Jesse Gross jesse at nicira.com
Thu Apr 15 23:44:25 UTC 2010


On Thu, Apr 15, 2010 at 1:37 PM, Ben Pfaff <blp at nicira.com> wrote:

> On Tue, Apr 13, 2010 at 10:41:13AM -0400, Jesse Gross wrote:
> > Add a new vport type that implements GRE support inside of the
> > datapath instead of relying on Linux devices.  This provides
> > greater scalability, performance, and control.
>
> It might be worth noting more specifically in the commit log the
> problems that we saw with the netdev-based GRE.  If we ever want to
> convince anyone upstream that the vport abstraction makes sense then we
> will need details.
>

Sure, I expanded the commit message a bit.


>
> I would appreciate a brief comment on the GRE_F_IN_* constants and maybe
> on the struct gre_port_config members also.  I don't know the design
> well enough for their meanings to be obvious to me.
>

I documented those flags.


>
> I think that new_config_rcu() could be simplified into a call to
> kmemdup().
>

Yes, thanks.


>
> I think that a temporary could be eliminated from assign_config_rcu() by
> calling call_rcu() before assigning the new configuration.
>

This seems a little overzealous to me.  Is this actually safe?  Couldn't the
old config be freed and then someone else enter a rcu_read_lock section
after the free but before the new assignment?  In any case it seems like it
adds complexity for very little benefit.


>
> In add_port, I think that adding a tbl_count() or tbl_should_expand()
> function would be more transparent than summing all of the *_ports
> counts.
>

I added a tbl_count() function.


>
> I doubt that the *_ports variables really need to be "atomic_t"s.  They
> are only modified under the vport_lock.  They are only read outside that
> lock to determine whether they are zero, in situations where races don't
> matter.  I think that "unsigned int" would work just as well.  (Not that
> atomic_t is a big deal, but it prompted me to look around to see if
> anything tricky was going on.  Nothing was.)
>

Yes, you're right.


>
> I was surprised to find port_cmp() modifying its argument.  I didn't
> expect that.  I guess that this is because the vport's "mutable" member
> might change between the rcu_dereference in port_cmp() and any
> opportunity the caller would have to dereference it.  (This could be
> made more transparent by having the tbl code just return a bucket for
> the caller to search, avoiding a callback function.)  I am kind of
> surprised that "vport_gre"s matching criteria can change at all--I would
> have guessed that a new vport would have to be created to replace the
> old one.
>

The Linux GRE implementation had some parameters that couldn't be changed
after the port was created - actually not the ones affecting matching - and
it drove me crazy (especially since it varied based on kernel version) so I
wanted to avoid that.  I added some comments though to clarify.


>
> I assume that find_port() is a fast-path?  I wonder whether there is a
> way to avoid potentially four separate hashes and searches.  In practice
> do you think it is likely that all four types of port matches will be in
> use?
>

My guess is that it is exceedingly rare that all four would be in use.  The
majority of the time I would expect only one to be used and occasionally
two.


>
> I skipped over all the code from check_ipv4_address() to gre_init()
> because it looked very familiar.  I remember reviewing something like it
> before.  If there are bits you want reviewed in detail, just let me
> know.
>

All the path MTU discovery stuff is largely the same.  The actual transmit,
receive, and ICMP functions are new but they have the same basic structure
as the Linux implementation so I don't think there is anything too crazy or
controversial.


>
> I found myself wondering whether the minimum size of a "tbl" should be
> smaller than a page.  I guess that there will often only be a few GRE
> ports.
>

Potentially it might be a good idea, though I'm not sure that it matters
that much in practice.  One change I did make is to only allocate the table
if someone actually creates a GRE port.


>
> Is there anything that keeps the module loaded while gre ports exist?
> It's not obvious to me.  (But 2.6.x kernels have some cleverness here
> for which I've forgotten the details.)
>

The module won't unload if any datapaths are currently configured (the same
behavior as before).  Deleting a datapath detaches all of its ports.
 vport_exit() will delete any ports not currently attached to a datapath, so
we shouldn't have any dangling ports.


>
> Is gre_modify() only called when the lookup criteria change?  If not,
> then it would seem better to avoid del_port()/add_port() if only, say,
> the tos or ttl were to change, so that there is no time at which packets
> would miss in the table during the change.
>

No, it can be called for any type of change.  I changed it to only delete
and add from the hash table if it needs to.


>
> Also in gre_modify(), do we end up with an orphaned vport if del_port()
> succeeds but add_port() fails?
>

The port won't be able to receive any packets if it isn't in the hash table
but it is otherwise still a normal port.  It is accessible from the
datapath, so it isn't orphaned in the sense that there is no reference to it
(importantly, it can be deleted).  I did add check when the port is
destroyed to not try to remove it from the hash table if it couldn't be
added earlier.


>
> It's pretty weird for tbl_remove() and thus del_port() to be able to
> fail due to lack of memory.  Maybe we should fix that by allowing null
> pointers in buckets.  Dunno.
>

I agree that it is weird behavior.  I'm not sure that we can fully remove
the possibility of remove/delete/destroy calls failing due to out of memory
since a bunch of things use RCU and might have this problem.  I'll take a
look at it later but it's probably not a high risk.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.openvswitch.org/pipermail/ovs-dev/attachments/20100415/44f6051f/attachment-0003.html>


More information about the dev mailing list