[ovs-dev] [PATCHv2] dpif-linux: Give each port its own userspace-kernel channel.

Ben Pfaff blp at nicira.com
Wed Jan 9 19:48:28 UTC 2013


Thanks for the revision.  I have some further comments.

On Mon, Jan 07, 2013 at 05:32:29PM -0800, Justin Pettit wrote:
> +    - The Linux datapath implementation creates a different kernel-
> +      userspace channel for each port instead of sharing a static 16
> +      channels to provide better performance isolation.

Previously I commented on this but I didn't see a response in your
follow-up.  The first time, I said that I think that's going to prompt
more questions than it's going to answer.  How about:

    - With the Linux datapath, packets for new flows are now queued
      separately on a per-port basis, so it should no longer be
      possible for a large number of new flows arriving on one port to
      prevent new flows from being processed on other ports.

It's still very technical, but I think it is likely to help more
readers.

I think that part of the problem I'm having with 'channels' and
'n_channels' is that I tend to read the 'n_channels' name as meaning
the number of channels actually present in the (possibly sparse)
arrays 'channels' and 'epoll_events', instead of the allocated size of
the array itself.  That could be cured at least partially with a
comment on n_channels (currently it has none), but I think that
another name would make it more obvious.  I like the 'array_size' name
that you used in the add_channel() function.  So the result would be
something like this:

    /* Upcall messages. */
    int array_size;             /* Size of 'channels' and 'epoll_events'. */
    struct dpif_channel *channels;
    struct epoll_event *epoll_events;
    int epoll_fd;               /* epoll fd that includes channel socks. */
    int n_events;               /* Num events returned by epoll_wait(). */
    int event_offset;           /* Offset into 'epoll_events'. */

In add_channel(), then, a comparison like
    if (port_no >= dpif->array_size) {
would then make perfect sense to me, without the need for a local
variable.

I'd ordinarily write "sizeof *dpif->channels" and "sizeof
*dpif->epoll_events" in the allocations in add_channel(), since that's
the typical OVS style:

        dpif->channels = xrealloc(dpif->channels,
                                  array_size * sizeof(struct dpif_channel));
...
        dpif->epoll_events = xrealloc(dpif->epoll_events,
                                      array_size * sizeof(struct epoll_event));

It might be worth growing these two dynamic arrays by doubling, rather
than one at a time, since we could add a large number of elements and
it's nice to avoid N**2 behavior in the reallocation.

In add_channel(), I think that the assignment
    dpif->n_channels = array_size;
is wrong in the case where the new port_no is less than some existing
port number (which can happen if a port is deleted and then a new one
created).  I think this assignment should go inside the "if" block.

del_channel() attempts to use EPOLL_CTL_ADD to remove a channel, but I
do not think that this can work.  Also, it's not necessary to specify
an event to remove a fd from an epoll set, unless we want to support
Linux earlier than 2.6.9, which we don't.

Should the "if" test at the top of del_channel() check for port_no >=
dpif->n_channels (rather than ">")?

add_channel() returns success without storing or destroying the
nl_sock if epoll_fd < 0.  I think this leaks memory and a fd if recv
isn't turned on.  (It might make more sense not to create a socket at
all if recv isn't turned on.)

Also, if recv isn't turned on, then dpif_linux_port_add() still
creates a Netlink socket and tells the kernel to use that socket.  I'd
suggest that it should use 0, indicating that the kernel should not
send upcalls at all, since no one is going to want to receive them.

In dpif_linux_port_get_pid(), I think that "port_no == UINT32_MAX ||
port_no >= dpif->n_channels" can be simplified to just the second
clause.

Arguably, destroy_channels() should turn off upcalls for all the
vports where we've enabled them, to save kernel time and buffer space
when ovs-vswitchd exits gracefully.



More information about the dev mailing list