[ovs-dev] [PATCHv2] dpif-linux: Give each port its own userspace-kernel channel.

Justin Pettit jpettit at nicira.com
Thu Jan 10 17:30:31 UTC 2013


On Jan 9, 2013, at 11:48 AM, Ben Pfaff <blp at nicira.com> wrote:

> Previously I commented on this but I didn't see a response in your
> follow-up.  The first time, I said that I think that's going to prompt
> more questions than it's going to answer.  How about:
> 
>    - With the Linux datapath, packets for new flows are now queued
>      separately on a per-port basis, so it should no longer be
>      possible for a large number of new flows arriving on one port to
>      prevent new flows from being processed on other ports.

That's better.  Thanks.

> I think that part of the problem I'm having with 'channels' and
> 'n_channels' is that I tend to read the 'n_channels' name as meaning
> the number of channels actually present in the (possibly sparse)
> arrays 'channels' and 'epoll_events', instead of the allocated size of
> the array itself.  That could be cured at least partially with a
> comment on n_channels (currently it has none), but I think that
> another name would make it more obvious.  I like the 'array_size' name
> that you used in the add_channel() function.  So the result would be
> something like this:
> 
>    /* Upcall messages. */
>    int array_size;             /* Size of 'channels' and 'epoll_events'. */
>    struct dpif_channel *channels;
>    struct epoll_event *epoll_events;
>    int epoll_fd;               /* epoll fd that includes channel socks. */
>    int n_events;               /* Num events returned by epoll_wait(). */
>    int event_offset;           /* Offset into 'epoll_events'. */

That is better.  I called it "uc_array_size", because it looked overly generic having an element called "array_size" in dpif_linux struct.

> I'd ordinarily write "sizeof *dpif->channels" and "sizeof
> *dpif->epoll_events" in the allocations in add_channel(), since that's
> the typical OVS style:
> 
>        dpif->channels = xrealloc(dpif->channels,
>                                  array_size * sizeof(struct dpif_channel));
> ...
>        dpif->epoll_events = xrealloc(dpif->epoll_events,
>                                      array_size * sizeof(struct epoll_event));

Okay.

> It might be worth growing these two dynamic arrays by doubling, rather
> than one at a time, since we could add a large number of elements and
> it's nice to avoid N**2 behavior in the reallocation.

As we discussed in person, this should be a pretty infrequent operation with a lot of other overhead associated with adding a port, so we decided it was okay to leave it as is.

> In add_channel(), I think that the assignment
>    dpif->n_channels = array_size;
> is wrong in the case where the new port_no is less than some existing
> port number (which can happen if a port is deleted and then a new one
> created).  I think this assignment should go inside the "if" block.

Fixed.

> del_channel() attempts to use EPOLL_CTL_ADD to remove a channel, but I
> do not think that this can work.  Also, it's not necessary to specify
> an event to remove a fd from an epoll set, unless we want to support
> Linux earlier than 2.6.9, which we don't.

Damn you, copy and paste!!

> Should the "if" test at the top of del_channel() check for port_no >=
> dpif->n_channels (rather than ">")?

Whoops.  Thanks.

> add_channel() returns success without storing or destroying the
> nl_sock if epoll_fd < 0.  I think this leaks memory and a fd if recv
> isn't turned on.  (It might make more sense not to create a socket at
> all if recv isn't turned on.)

Good point.  Fixed.

> Also, if recv isn't turned on, then dpif_linux_port_add() still
> creates a Netlink socket and tells the kernel to use that socket.  I'd
> suggest that it should use 0, indicating that the kernel should not
> send upcalls at all, since no one is going to want to receive them.

Good point.  Changed.

> In dpif_linux_port_get_pid(), I think that "port_no == UINT32_MAX ||
> port_no >= dpif->n_channels" can be simplified to just the second
> clause.

Yeah, I did that on purpose to be explicit that it was a special value, but I went ahead and changed it.

> Arguably, destroy_channels() should turn off upcalls for all the
> vports where we've enabled them, to save kernel time and buffer space
> when ovs-vswitchd exits gracefully.

That's a good suggestion.  Added.

I've attached an incremental to the end.  Let me know what you think.

Thanks,

--Justin


-=-=-=-=-=-=-=-=-


diff --git a/NEWS b/NEWS
index c845579..3f06a7a 100644
--- a/NEWS
+++ b/NEWS
@@ -3,6 +3,10 @@ post-v1.9.0
     - The maximum size of the MAC learning table is now configurable.
     - New support for the VXLAN tunnel protocol (see the IETF draft here:
       http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02).
+    - With the Linux datapath, packets for new flows are now queued
+      separately on a per-port basis, so it should no longer be
+      possible for a large number of new flows arriving on one port to
+      prevent new flows from being processed on other ports.
     - New "vlog/disable-rate-limit" and "vlog/enable-rate-limit" commands
       available through ovs-appctl allow control over logging rate limits.
     - The OpenFlow "dp_desc" may now be configured by setting the value of 
@@ -17,9 +21,6 @@ post-v1.9.0
 
 v1.9.0 - xx xxx xxxx
 --------------------
-    - The Linux datapath implementation creates a different kernel-
-      userspace channel for each port instead of sharing a static 16
-      channels to provide better performance isolation.
     - The tunneling code no longer assumes input and output keys are symmetric.
       If they are not, PMTUD needs to be disabled for tunneling to work. Note
       this only applies to flow-based keys.
diff --git a/lib/dpif-linux.c b/lib/dpif-linux.c
index 512d90a..45a2d38 100644
--- a/lib/dpif-linux.c
+++ b/lib/dpif-linux.c
@@ -140,10 +140,10 @@ struct dpif_linux {
     int dp_ifindex;
 
     /* Upcall messages. */
+    int uc_array_size;          /* Size of 'channels' and 'epoll_events'. */
     struct dpif_channel *channels;
-    int n_channels;
+    struct epoll_event *epoll_events;
     int epoll_fd;               /* epoll fd that includes channel socks. */
-    struct epoll_event *epoll_events; /* Array count equal to 'n_channels'. */
     int n_events;               /* Num events returned by epoll_wait(). */
     int event_offset;           /* Offset into 'epoll_events'. */
 
@@ -261,18 +261,35 @@ open_dpif(const struct dpif_linux_dp *dp, struct dpif **dp
 static void
 destroy_channels(struct dpif_linux *dpif)
 {
-    struct dpif_channel *ch;
+    int i;
 
     if (dpif->epoll_fd < 0) {
         return;
     }
 
-    for (ch = dpif->channels; ch < &dpif->channels[dpif->n_channels]; ch++) {
+    for (i = 0; i < dpif->uc_array_size; i++ ) {
+        struct dpif_linux_vport vport_request;
+        struct dpif_channel *ch = &dpif->channels[i];
+        uint32_t upcall_pid = 0;
+
+        if (!ch->sock) {
+            continue;
+        }
+
+        /* Turn off upcalls. */
+        dpif_linux_vport_init(&vport_request);
+        vport_request.cmd = OVS_VPORT_CMD_SET;
+        vport_request.dp_ifindex = dpif->dp_ifindex;
+        vport_request.port_no = i;
+        vport_request.upcall_pid = &upcall_pid;
+        dpif_linux_vport_transact(&vport_request, NULL, NULL);
+
         nl_sock_destroy(ch->sock);
     }
+
     free(dpif->channels);
     dpif->channels = NULL;
-    dpif->n_channels = 0;
+    dpif->uc_array_size = 0;
 
     free(dpif->epoll_events);
     dpif->epoll_events = NULL;
@@ -285,7 +302,6 @@ destroy_channels(struct dpif_linux *dpif)
 static int
 add_channel(struct dpif_linux *dpif, uint32_t port_no, struct nl_sock *sock)
 {
-    int array_size = port_no + 1;
     struct epoll_event event;
 
     if (dpif->epoll_fd < 0) {
@@ -294,19 +310,26 @@ add_channel(struct dpif_linux *dpif, uint32_t port_no, str
 
     /* We assume that the datapath densely chooses port numbers, which
      * can therefore be used as an index into an array of channels. */
-    if (array_size > dpif->n_channels) {
+    if (port_no >= dpif->uc_array_size) {
+        int new_size = port_no + 1;
         int i;
 
+        if (new_size > 65535) {
+            VLOG_WARN_RL(&error_rl, "%s: datapath port %"PRIu32" too big",
+                     dpif_name(&dpif->dpif), port_no);
+            return EFBIG;
+        }
+
         dpif->channels = xrealloc(dpif->channels,
-                                  array_size * sizeof(struct dpif_channel));
-        for (i = dpif->n_channels; i < array_size; i++) {
+                                  new_size * sizeof *dpif->channels);
+        for (i = dpif->uc_array_size; i < port_no; i++) {
             dpif->channels[i].sock = NULL;
         }
 
         dpif->epoll_events = xrealloc(dpif->epoll_events,
-                                      array_size * sizeof(struct epoll_event));
+                                      new_size * sizeof *dpif->epoll_events);
+        dpif->uc_array_size = new_size;
     }
-    dpif->n_channels = array_size;
 
     memset(&event, 0, sizeof event);
     event.events = EPOLLIN;
@@ -326,9 +349,8 @@ static void
 del_channel(struct dpif_linux *dpif, uint32_t port_no)
 {
     struct dpif_channel *ch;
-    struct epoll_event event;
 
-    if (dpif->epoll_fd < 0 || port_no > dpif->n_channels) {
+    if (dpif->epoll_fd < 0 || port_no >= dpif->uc_array_size) {
         return;
     }
 
@@ -337,10 +359,7 @@ del_channel(struct dpif_linux *dpif, uint32_t port_no)
         return;
     }
 
-    memset(&event, 0, sizeof event);
-    event.events = EPOLLIN;
-    event.data.u32 = port_no;
-    epoll_ctl(dpif->epoll_fd, EPOLL_CTL_ADD, nl_sock_fd(ch->sock), &event);
+    epoll_ctl(dpif->epoll_fd, EPOLL_CTL_DEL, nl_sock_fd(ch->sock), NULL);
 
     nl_sock_destroy(ch->sock);
     ch->sock = NULL;
@@ -412,14 +431,16 @@ dpif_linux_port_add(struct dpif *dpif_, struct netdev *net
     const char *type = netdev_get_type(netdev);
     struct dpif_linux_vport request, reply;
     const struct ofpbuf *options;
-    struct nl_sock *sock;
+    struct nl_sock *sock = NULL;
     uint32_t upcall_pid;
     struct ofpbuf *buf;
     int error;
 
-    error = nl_sock_create(NETLINK_GENERIC, &sock);
-    if (error) {
-        return error;
+    if (dpif->epoll_fd >= 0) {
+        error = nl_sock_create(NETLINK_GENERIC, &sock);
+        if (error) {
+            return error;
+        }
     }
 
     dpif_linux_vport_init(&request);
@@ -446,7 +467,7 @@ dpif_linux_port_add(struct dpif *dpif_, struct netdev *netde
     }
 
     request.port_no = *port_nop;
-    upcall_pid = nl_sock_pid(sock);
+    upcall_pid = sock ? nl_sock_pid(sock) : 0;
     request.upcall_pid = &upcall_pid;
 
     error = dpif_linux_vport_transact(&request, &reply, &buf);
@@ -463,12 +484,14 @@ dpif_linux_port_add(struct dpif *dpif_, struct netdev *net
     }
     ofpbuf_delete(buf);
 
-    error = add_channel(dpif, *port_nop, sock);
-    if (error) {
-        VLOG_INFO("%s: could not add channel for port %s",
-                  dpif_name(dpif_), name);
-        nl_sock_destroy(sock);
-        return error;
+    if (sock) {
+        error = add_channel(dpif, *port_nop, sock);
+        if (error) {
+            VLOG_INFO("%s: could not add channel for port %s",
+                      dpif_name(dpif_), name);
+            nl_sock_destroy(sock);
+            return error;
+        }
     }
 
     return 0;
@@ -553,8 +576,7 @@ dpif_linux_port_get_pid(const struct dpif *dpif_, uint32_t p
     } else {
         /* The UINT32_MAX "reserved" port number uses the "ovs-system"'s
          * channel, since it is not heavily loaded. */
-        int idx = (port_no == UINT32_MAX || port_no >= dpif->n_channels)
-                  ? 0 : port_no;
+        int idx = (port_no >= dpif->uc_array_size) ? 0 : port_no;
         return nl_sock_pid(dpif->channels[idx].sock);
     }
 }
@@ -1203,7 +1225,7 @@ dpif_linux_recv(struct dpif *dpif_, struct dpif_upcall *up
 
         do {
             retval = epoll_wait(dpif->epoll_fd, dpif->epoll_events,
-                                dpif->n_channels, 0);
+                                dpif->uc_array_size, 0);
         } while (retval < 0 && errno == EINTR);
         if (retval < 0) {
             static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 1);
@@ -1279,7 +1301,7 @@ dpif_linux_recv_purge(struct dpif *dpif_)
        return;
     }
 
-    for (ch = dpif->channels; ch < &dpif->channels[dpif->n_channels]; ch++) {
+    for (ch = dpif->channels; ch < &dpif->channels[dpif->uc_array_size]; ch++) 
         if (ch->sock) {
             nl_sock_drain(ch->sock);
         }








More information about the dev mailing list