[ovs-dev] [PATCH] netlink-socket: Work around kernel Netlink dump thread races.
Ben Pfaff
blp at nicira.com
Thu Jul 10 23:58:03 UTC 2014
Thanks, applied to all relevant branches.
On Thu, Jul 10, 2014 at 04:42:31PM -0700, Alex Wang wrote:
> Thanks for the explanation, looks good to me.
>
> And I confirm that it solves the packet duplication issue.
>
> Acked-by: Alex Wang <alexw at nicira.com>
>
>
> On Thu, Jul 10, 2014 at 4:03 PM, Ben Pfaff <blp at nicira.com> wrote:
>
> > The Linux kernel Netlink implementation has two races that cause problems
> > for processes that attempt to dump a table in a multithreaded manner.
> >
> > The first race is in the structure of the kernel netlink_recv() function.
> > This function pulls a message from the socket queue and, if there is none,
> > reports EAGAIN:
> >         skb = skb_recv_datagram(sk, flags, noblock, &err);
> >         if (skb == NULL)
> >                 goto out;
> > Only if a message is successfully read from the socket queue does the
> > function, toward the end, try to queue up a new message to be dumped:
> >         if (nlk->cb && atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf / 2) {
> >                 ret = netlink_dump(sk);
> >                 if (ret) {
> >                         sk->sk_err = ret;
> >                         sk->sk_error_report(sk);
> >                 }
> >         }
> > This means that if thread A reads a message from a dump and thread B
> > attempts to read one before A queues up the next, B will get EAGAIN.
> > Thus, following EAGAIN, B needs to wait until A returns to userspace
> > before it tries to read the socket again.  nl_dump_next() already does
> > this, using 'dump->status_seq' (although the need for it has never been
> > clearly explained, to my knowledge).
> >
> > The second race is more serious. Suppose thread X and thread Y both
> > simultaneously attempt to queue up a new message to be dumped, using the
> > call to netlink_dump() quoted above. netlink_dump() begins with:
> >         mutex_lock(nlk->cb_mutex);
> >
> >         cb = nlk->cb;
> >         if (cb == NULL) {
> >                 err = -EINVAL;
> >                 goto errout_skb;
> >         }
> > Suppose that X gets cb_mutex first and finds that the dump is complete. It
> > will therefore, toward the end of netlink_dump(), clear nlk->cb to NULL to
> > indicate that no dump is in progress and release the mutex:
> >         nlk->cb = NULL;
> >         mutex_unlock(nlk->cb_mutex);
> > When Y grabs cb_mutex afterward, it will see that nlk->cb is NULL and
> > return -EINVAL as quoted above. netlink_recv() stuffs -EINVAL in sk_err,
> > but that error is not reported immediately; instead, it is saved for the
> > next read from the socket. Since Open vSwitch maintains a pool of Netlink
> > sockets, that next failure can crop up pretty much anywhere. One of the
> > worst places for it to crop up is in the execution of a later transaction
> > (e.g. in nl_sock_transact_multiple__()), because userspace treats Netlink
> > transactions as idempotent and will re-execute them when socket errors
> > occur. For a transaction that sends a packet, this causes packet
> > duplication, which we actually observed in practice. (ENOBUFS should
> > actually cause transactions to be re-executed in many cases, but EINVAL
> > should not; this is a separate bug in the userspace netlink code.)
> >
> > VMware-BZ: #1283188
> > Reported-and-tested-by: Alex Wang <alexw at nicira.com>
> > Signed-off-by: Ben Pfaff <blp at nicira.com>
> > ---
> >  lib/netlink-socket.c | 10 ++++++++++
> >  lib/netlink-socket.h |  2 ++
> >  2 files changed, 12 insertions(+)
> >
> > diff --git a/lib/netlink-socket.c b/lib/netlink-socket.c
> > index 42ba7e1..8f8d603 100644
> > --- a/lib/netlink-socket.c
> > +++ b/lib/netlink-socket.c
> > @@ -714,6 +714,7 @@ nl_dump_start(struct nl_dump *dump, int protocol, const struct ofpbuf *request)
> >      atomic_init(&dump->status, status << 1);
> >      dump->nl_seq = nl_msg_nlmsghdr(request)->nlmsg_seq;
> >      dump->status_seq = seq_create();
> > +    ovs_mutex_init(&dump->mutex);
> >  }
> >
> >  /* Attempts to retrieve another reply from 'dump' into 'buffer'.  'dump' must
> > @@ -756,7 +757,15 @@ nl_dump_next(struct nl_dump *dump, struct ofpbuf *reply, struct ofpbuf *buffer)
> >          return false;
> >      }
> >
> > +    /* Take the mutex here to avoid an in-kernel race.  If two threads try
> > +     * to read from a Netlink dump socket at once, then the socket error
> > +     * can be set to EINVAL, which will be encountered on the next recv on
> > +     * that socket, which could be anywhere due to the way that we pool
> > +     * Netlink sockets.  Serializing the recv calls avoids the issue. */
> > +    ovs_mutex_lock(&dump->mutex);
> >      retval = nl_sock_recv__(dump->sock, buffer, false);
> > +    ovs_mutex_unlock(&dump->mutex);
> > +
> >      if (retval) {
> >          ofpbuf_clear(buffer);
> >          if (retval == EAGAIN) {
> > @@ -835,6 +844,7 @@ nl_dump_done(struct nl_dump *dump)
> >      }
> >      nl_pool_release(dump->sock);
> >      seq_destroy(dump->status_seq);
> > +    ovs_mutex_destroy(&dump->mutex);
> >      return status >> 1;
> >  }
> >
> > diff --git a/lib/netlink-socket.h b/lib/netlink-socket.h
> > index dd32409..8ac201a 100644
> > --- a/lib/netlink-socket.h
> > +++ b/lib/netlink-socket.h
> > @@ -52,6 +52,7 @@
> >  #include <stdint.h>
> >  #include "ofpbuf.h"
> >  #include "ovs-atomic.h"
> > +#include "ovs-thread.h"
> >
> >  struct nl_sock;
> >
> > @@ -114,6 +115,7 @@ struct nl_dump {
> >      atomic_uint status;        /* Low bit set if we read final message.
> >                                  * Other bits hold an errno (0 for success). */
> >      struct seq *status_seq;    /* Tracks changes to the above 'status'. */
> > +    struct ovs_mutex mutex;
> >  };
> >
> >  void nl_dump_start(struct nl_dump *, int protocol,
> > --
> > 1.7.10.4
> >
> >