[ovs-dev] [PATCH 1/5] netlink: Expose method to get Netlink pid of a socket.

Ben Pfaff blp at nicira.com
Thu Sep 22 17:38:47 UTC 2011

On Wed, Sep 21, 2011 at 11:02:03PM -0700, Jesse Gross wrote:
> On Wed, Sep 21, 2011 at 9:37 PM, Ben Pfaff <blp at nicira.com> wrote:
> > On Mon, Sep 19, 2011 at 03:00:04PM -0700, Jesse Gross wrote:
> >> In the future, the kernel will use unicast messages instead of
> >> multicast to send upcalls. ??As a result, we need to be able to
> >> tell it where to direct the traffic. ??This adds a function to expose
> >> the Netlink pid of a socket so it can be included in messages to the
> >> kernel.
> >
> > The Netlink socket library in OVS has logic to keep fds used for Netlink
> > dump operations separate from fds used for multicast reception, because
> > we don't want reading dump output to discard an arbitrary number of
> > possibly interleaved multicast messages.
> >
> > Even though it's not technically multicast, this new way of receiving
> > upcalls is still asynchronous and could still have the same issue as a
> > socket subscribed to a multicast group.
> That's a good point although you'll also have similar issues if you
> execute transactions on the same socket so it's still necessary to
> segregate operations on a single socket (as is done here).  We could
> try to deal with that too although I'm not sure it's worth it and all
> these games with swapping FDs behind the scenes makes me somewhat
> nervous.

The biggest reason for the fd-swapping games is to overcome an awkward
limitation of Netlink dump operations, which is that you can only have
one of them going on at a time on a single Netlink socket.  So either
the Netlink clients like dpif-linux have to deal with this themselves,
by keeping track of when a dump is going on and opening new sockets
when another one starts, or the netlink library has to do it.  I think
it's less error-prone to have the logic in just one place, instead of
spreading it out among all the clients.

Note that dpif-linux doesn't read all of a dump's output in one go.
It exposes an iterator interface to its clients and reads additional
Netlink packets as the client consumes the existing ones.  It would
waste a lot of memory to consume the whole flow dump at once.

This problem is admittedly a bit of a corner case.  It's not too
common to want to do two dumps at the same time.  (In my opinion,
that's another good reason to centralize the logic; if it's
distributed and redundant across clients that rarely encounter the
problem, then if they ever do we'll be running untested code.)

The next biggest reason is much less of a corner case: if there's an
ongoing dump operation on a netlink fd, then simple request-reply
transactions on the same fd will screw up the dump, by skipping past
some (perhaps all) of the dump replies.  We could store those dump
replies, but that requires an unbounded amount of memory.  Using a
separate fd avoids the problem.

Keeping multicast traffic separate from other traffic is way down the
importance scale, since it's easy for clients to segregate the
nl_socks they're using for transactions and dumps from the nl_socks
they're using for receiving multicast traffic.  In fact, it looks like
all the existing clients in OVS already do that.  I think that it is
necessary for correctness, so I guess that we should simply declare
that a socket that receives asynchronous messages must not be used for
transactions or dumps.  Here's a patch that does that.  After this is
applied, I agree that your original patch is fine.

--8<--------------------------cut here-------------------------->8--

From: Ben Pfaff <blp at nicira.com>
Date: Thu, 22 Sep 2011 10:37:59 -0700
Subject: [PATCH] netlink-socket: Async notifications are incompatible with
 other operations.

A Netlink socket that receives asynchronous notifications (e.g. from a
multicast group) cannot be used for transactions or dumps, because those
operations would discard asynchronous messages that arrive while waiting
for replies.

This commit documents this issue in a comment on nl_sock_join_mcgroup().
It also removes an internal attempt to avoid mixing multicast reception
with other operations.  The attempt was incomplete, because it only
handled dumps even though ordinary transactions are also problematic.  It
seems better to remove it than to fix it because, first, all of the
existing users in OVS already separate multicast reception from other
operations and, second, an upcoming commit will start using unicast
Netlink for asynchronous notifications, which has the same issues but
doesn't use nl_sock_join_mcgroup().
 lib/netlink-socket.c |   15 ++++++++-------
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
index f4635c8..085918d 100644
--- a/lib/netlink-socket.c
+++ b/lib/netlink-socket.c
@@ -62,7 +62,6 @@ struct nl_sock
     int fd;
     uint32_t pid;
     int protocol;
-    bool any_groups;
     struct nl_dump *dump;
@@ -92,7 +91,6 @@ nl_sock_create(int protocol, struct nl_sock **sockp)
         goto error;
     sock->protocol = protocol;
-    sock->any_groups = false;
     sock->dump = NULL;
     retval = alloc_pid(&sock->pid);
@@ -164,6 +162,10 @@ nl_sock_destroy(struct nl_sock *sock)
 /* Tries to add 'sock' as a listener for 'multicast_group'.  Returns 0 if
  * successful, otherwise a positive errno value.
+ * A socket that is subscribed to a multicast group that receives asynchronous
+ * notifications must not be used for Netlink transactions or dumps, because
+ * transactions and dumps can cause notifications to be lost.
+ *
  * Multicast group numbers are always positive.
  * It is not an error to attempt to join a multicast group to which a socket
@@ -181,7 +183,6 @@ nl_sock_join_mcgroup(struct nl_sock *sock, unsigned int multicast_group)
                   multicast_group, strerror(errno));
         return errno;
-    sock->any_groups = true;
     return 0;
@@ -536,10 +537,10 @@ nl_dump_start(struct nl_dump *dump,
     nlmsghdr->nlmsg_flags |= NLM_F_DUMP | NLM_F_ACK;
     dump->seq = nlmsghdr->nlmsg_seq;
     dump->buffer = NULL;
-    if (sock->any_groups || sock->dump) {
-        /* 'sock' might belong to some multicast group, or it already has an
-         * ongoing dump.  Clone the socket to avoid possibly intermixing
-         * multicast messages or previous dump results with our results. */
+    if (sock->dump) {
+        /* 'sock' already has an ongoing dump.  Clone the socket to avoid
+         * possibly intermixing multicast messages or previous dump results
+         * with our results. */
         dump->status = nl_sock_clone(sock, &dump->sock);
         if (dump->status) {

More information about the dev mailing list