[ovs-dev] [PATCH v3 3/3] dpif-netlink: Introduce per-cpu upcall dispatch
Flavio Leitner
fbl at sysclose.org
Tue Jul 6 00:41:20 UTC 2021
Hi Mark,
On Mon, Jul 05, 2021 at 09:38:37AM -0400, Mark Gray wrote:
> The Open vSwitch kernel module uses the upcall mechanism to send
> packets from kernel space to user space when it misses in the kernel
> space flow table. The upcall sends packets via a Netlink socket.
> Currently, a Netlink socket is created for every vport. In this way,
> there is a 1:1 mapping between a vport and a Netlink socket.
> When a packet is received by a vport, if it needs to be sent to
> user space, it is sent via the corresponding Netlink socket.
>
> This mechanism, with various iterations of the corresponding user
> space code, has seen some limitations and issues:
>
> * On systems with a large number of vports, there is correspondingly
> a large number of Netlink sockets which can limit scaling.
> (https://bugzilla.redhat.com/show_bug.cgi?id=1526306)
> * Packet reordering on upcalls.
> (https://bugzilla.redhat.com/show_bug.cgi?id=1844576)
> * A thundering herd issue.
> (https://bugzilla.redhat.com/show_bug.cgi?id=1834444)
>
> This patch introduces an alternative, feature-negotiated, upcall
> mode using a per-cpu dispatch rather than a per-vport dispatch.
>
> In this mode, the Netlink socket to be used for the upcall is
> selected based on the CPU of the thread that is executing the upcall.
> In this way, it resolves the issues above as:
>
> a) The number of Netlink sockets scales with the number of CPUs
> rather than the number of vports.
> b) Ordering per-flow is maintained as packets are distributed to
> CPUs based on mechanisms such as RSS and flows are distributed
> to a single user space thread.
> c) Packets from a flow can only wake up one user space thread.
>
> Reported-at: https://bugzilla.redhat.com/1844576
> Signed-off-by: Mark Gray <mark.d.gray at redhat.com>
> ---
>
> Notes:
> v1 - Reworked based on Flavio's comments:
> * change DISPATCH_MODE_PER_CPU() to inline function
> * add `ovs-appctl` command to check dispatch mode for datapaths
> * fixed issue with userspace actions (tested using `ovs-ofctl monitor br0 65534 -P nxt_packet_in`)
> * update documentation as requested
> v2 - Reworked based on Flavio's comments:
> * Used dpif_netlink_upcall_per_cpu() for check in dpif_netlink_set_handler_pids()
> * Added macro for (ignored) Netlink PID
> * Fixed indentation issue
> * Added NEWS entry
> * Added section to ovs-vswitchd.8 man page
>
> NEWS | 7 +-
> .../linux/compat/include/linux/openvswitch.h | 7 +
> lib/automake.mk | 1 +
> lib/dpif-netdev.c | 1 +
> lib/dpif-netlink-unixctl.man | 6 +
> lib/dpif-netlink.c | 460 ++++++++++++++++--
> lib/dpif-provider.h | 32 +-
> lib/dpif.c | 17 +
> lib/dpif.h | 1 +
> ofproto/ofproto-dpif-upcall.c | 51 +-
> ofproto/ofproto.c | 12 -
> vswitchd/ovs-vswitchd.8.in | 1 +
> vswitchd/vswitch.xml | 23 +-
> 13 files changed, 522 insertions(+), 97 deletions(-)
> create mode 100644 lib/dpif-netlink-unixctl.man
>
> diff --git a/NEWS b/NEWS
> index a2a2dcf95d7d..80b13e358685 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -29,7 +29,12 @@ Post-v2.15.0
> - ovsdb-tool:
> * New option '--election-timer' to the 'create-cluster' command to set the
> leader election timer during cluster creation.
> -
> + - Per-cpu upcall dispatching:
> + * ovs-vswitchd will configure the kernel module using per-cpu dispatch
> + mode (if available). This changes the way upcalls are delivered to user
> + space in order to resolve a number of issues with per-vport dispatch.
> + The new debug appctl command `dpif-netlink/dispatch-mode`
> + will return the current dispatch mode for each datapath.
>
> v2.15.0 - 15 Feb 2021
> ---------------------
> diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
> index 875de20250ce..f29265df055e 100644
> --- a/datapath/linux/compat/include/linux/openvswitch.h
> +++ b/datapath/linux/compat/include/linux/openvswitch.h
> @@ -89,6 +89,8 @@ enum ovs_datapath_cmd {
> * set on the datapath port (for OVS_ACTION_ATTR_MISS). Only valid on
> * %OVS_DP_CMD_NEW requests. A value of zero indicates that upcalls should
> * not be sent.
> + * OVS_DP_ATTR_PER_CPU_PIDS: Per-cpu array of PIDs for upcalls when
> + * OVS_DP_F_DISPATCH_UPCALL_PER_CPU feature is set.
> * @OVS_DP_ATTR_STATS: Statistics about packets that have passed through the
> * datapath. Always present in notifications.
> * @OVS_DP_ATTR_MEGAFLOW_STATS: Statistics about mega flow masks usage for the
> @@ -105,6 +107,8 @@ enum ovs_datapath_attr {
> OVS_DP_ATTR_MEGAFLOW_STATS, /* struct ovs_dp_megaflow_stats */
> OVS_DP_ATTR_USER_FEATURES, /* OVS_DP_F_* */
> OVS_DP_ATTR_PAD,
> + OVS_DP_ATTR_PAD2,
> + OVS_DP_ATTR_PER_CPU_PIDS, /* Netlink PIDS to receive upcalls */
> __OVS_DP_ATTR_MAX
> };
>
> @@ -146,6 +150,9 @@ struct ovs_vport_stats {
> /* Allow tc offload recirc sharing */
> #define OVS_DP_F_TC_RECIRC_SHARING (1 << 2)
>
> +/* Allow per-cpu dispatch of upcalls */
> +#define OVS_DP_F_DISPATCH_UPCALL_PER_CPU (1 << 3)
> +
> /* Fixed logical ports. */
> #define OVSP_LOCAL ((__u32)0)
>
> diff --git a/lib/automake.mk b/lib/automake.mk
> index db90175918fc..192db5d87ce3 100644
> --- a/lib/automake.mk
> +++ b/lib/automake.mk
> @@ -556,6 +556,7 @@ MAN_FRAGMENTS += \
> lib/memory-unixctl.man \
> lib/netdev-dpdk-unixctl.man \
> lib/dpif-netdev-unixctl.man \
> + lib/dpif-netlink-unixctl.man \
> lib/ofp-version.man \
> lib/ovs.tmac \
> lib/ovs-replay.man \
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 026b52d27d45..b9f7bb86055a 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -8548,6 +8548,7 @@ const struct dpif_class dpif_netdev_class = {
> dpif_netdev_operate,
> NULL, /* recv_set */
> NULL, /* handlers_set */
> + NULL, /* number_handlers_required */
> dpif_netdev_set_config,
> dpif_netdev_queue_to_priority,
> NULL, /* recv */
> diff --git a/lib/dpif-netlink-unixctl.man b/lib/dpif-netlink-unixctl.man
> new file mode 100644
> index 000000000000..ce5c47e85c80
> --- /dev/null
> +++ b/lib/dpif-netlink-unixctl.man
> @@ -0,0 +1,6 @@
> +.SS "DPIF-NETLINK COMMANDS"
> +These commands are used to expose internal information of the "dpif-netlink"
> +kernel space datapath.
> +.
> +.IP "\fBdpif-netlink/dispatch-mode\fR"
> +Displays the "dispatch-mode" for all datapaths.
> diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
> index f92905dd83fd..68dade8943d9 100644
> --- a/lib/dpif-netlink.c
> +++ b/lib/dpif-netlink.c
> @@ -84,6 +84,9 @@ enum { MAX_PORTS = USHRT_MAX };
> #define EPOLLEXCLUSIVE (1u << 28)
> #endif
>
> +/* This PID is not used by the kernel datapath when using dispatch per CPU,
> + * but it is required to be set (not zero). */
> +#define DPIF_NETLINK_PER_CPU_PID UINT32_MAX
> struct dpif_netlink_dp {
> /* Generic Netlink header. */
> uint8_t cmd;
> @@ -98,6 +101,8 @@ struct dpif_netlink_dp {
> const struct ovs_dp_stats *stats; /* OVS_DP_ATTR_STATS. */
> const struct ovs_dp_megaflow_stats *megaflow_stats;
> /* OVS_DP_ATTR_MEGAFLOW_STATS.*/
> + const uint32_t *upcall_pids; /* OVS_DP_ATTR_PER_CPU_PIDS */
> + uint32_t n_upcall_pids;
> };
>
> static void dpif_netlink_dp_init(struct dpif_netlink_dp *);
> @@ -113,6 +118,10 @@ static int dpif_netlink_dp_get(const struct dpif *,
> static int
> dpif_netlink_set_features(struct dpif *dpif_, uint32_t new_features);
>
> +static void
> +dpif_netlink_unixctl_dispatch_mode(struct unixctl_conn *conn, int argc,
> + const char *argv[], void *aux);
> +
> struct dpif_netlink_flow {
> /* Generic Netlink header. */
> uint8_t cmd;
> @@ -178,11 +187,16 @@ struct dpif_windows_vport_sock {
> #endif
>
> struct dpif_handler {
> + /* per-vport dispatch mode */
> struct epoll_event *epoll_events;
> int epoll_fd; /* epoll fd that includes channel socks. */
> int n_events; /* Num events returned by epoll_wait(). */
> int event_offset; /* Offset into 'epoll_events'. */
>
> + /* per-cpu dispatch mode */
> + struct nl_sock *sock; /* Each handler thread holds one netlink
> + socket. */
> +
> #ifdef _WIN32
> /* Pool of sockets. */
> struct dpif_windows_vport_sock *vport_sock_pool;
> @@ -201,6 +215,8 @@ struct dpif_netlink {
> struct fat_rwlock upcall_lock;
> struct dpif_handler *handlers;
> uint32_t n_handlers; /* Num of upcall handlers. */
> +
> + /* Per-vport dispatch mode */
> struct dpif_channel *channels; /* Array of channels for each port. */
> int uc_array_size; /* Size of 'handler->channels' and */
> /* 'handler->epoll_events'. */
> @@ -241,8 +257,12 @@ static int open_dpif(const struct dpif_netlink_dp *, struct dpif **);
> static uint32_t dpif_netlink_port_get_pid(const struct dpif *,
> odp_port_t port_no);
> static void dpif_netlink_handler_uninit(struct dpif_handler *handler);
> -static int dpif_netlink_refresh_channels(struct dpif_netlink *,
> - uint32_t n_handlers);
> +static int dpif_netlink_refresh_handlers_vport_dispatch(struct dpif_netlink *,
> + uint32_t n_handlers);
> +static void destroy_all_channels(struct dpif_netlink *);
> +static int dpif_netlink_refresh_handlers_cpu_dispatch(struct dpif_netlink *);
> +static void destroy_all_handlers(struct dpif_netlink *);
> +
> static void dpif_netlink_vport_to_ofpbuf(const struct dpif_netlink_vport *,
> struct ofpbuf *);
> static int dpif_netlink_vport_from_ofpbuf(struct dpif_netlink_vport *,
> @@ -294,6 +314,11 @@ dpif_netlink_cast(const struct dpif *dpif)
> return CONTAINER_OF(dpif, struct dpif_netlink, dpif);
> }
>
> +static inline bool
> +dpif_netlink_upcall_per_cpu(const struct dpif_netlink *dpif) {
> + return !!((dpif)->user_features & OVS_DP_F_DISPATCH_UPCALL_PER_CPU);
> +}
> +
> static int
> dpif_netlink_enumerate(struct sset *all_dps,
> const struct dpif_class *dpif_class OVS_UNUSED)
> @@ -357,11 +382,35 @@ dpif_netlink_open(const struct dpif_class *class OVS_UNUSED, const char *name,
> dp_request.cmd = OVS_DP_CMD_SET;
> }
>
> + /* The Open vSwitch kernel module has two modes for dispatching upcalls:
> + * per-vport and per-cpu.
> + *
> + * When dispatching upcalls per-vport, the kernel will
> + * send the upcall via a Netlink socket that has been selected based on the
> + * vport that received the packet that is causing the upcall.
> + *
> + * When dispatching upcall per-cpu, the kernel will send the upcall via
> + * a Netlink socket that has been selected based on the cpu that received
> + * the packet that is causing the upcall.
> + *
> + * First we test to see if the kernel module supports per-cpu dispatching
> + * (the preferred method). If it does not support per-cpu dispatching, we
> + * fall back to the per-vport dispatch mode.
> + */
> dp_request.user_features |= OVS_DP_F_UNALIGNED;
> - dp_request.user_features |= OVS_DP_F_VPORT_PIDS;
> + dp_request.user_features &= ~OVS_DP_F_VPORT_PIDS;
> + dp_request.user_features |= OVS_DP_F_DISPATCH_UPCALL_PER_CPU;
> error = dpif_netlink_dp_transact(&dp_request, &dp, &buf);
> if (error) {
> - return error;
> + dp_request.user_features &= ~OVS_DP_F_DISPATCH_UPCALL_PER_CPU;
> + dp_request.user_features |= OVS_DP_F_VPORT_PIDS;
> + error = dpif_netlink_dp_transact(&dp_request, &dp, &buf);
> + if (error) {
> + return error;
> + }
> + VLOG_INFO("Dispatch mode(per-vport)");
> + } else {
> + VLOG_INFO("Dispatch mode:(per-cpu)");
> }
The messages are using different formats and extra parenthesis.
I think it is useful to log if OVS is using the dispatch mode, but
this is also used by ovs-dpctl and then the output mixes that log
with its normal output:
# ovs-dpctl show
2021-07-05T23:50:08Z|00001|dpif_netlink|INFO|Dispatch mode(per-vport)
system at ovs-system:
lookups: hit:1 missed:6 lost:0
flows: 6
masks: hit:12 total:5 hit/pkt:1.71
port 0: ovs-system (internal)
port 1: br0 (internal)
port 2: ovs-p0
port 3: ovs-p1
What do you think about this diff on top of your patch?
diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index 68dade894..00b9a2e81 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -408,9 +408,14 @@ dpif_netlink_open(const struct dpif_class *class OVS_UNUSED, const char *name,
if (error) {
return error;
}
- VLOG_INFO("Dispatch mode(per-vport)");
+
+ if (create) {
+ VLOG_INFO("Datapath dispatch mode: per-vport");
+ }
} else {
- VLOG_INFO("Dispatch mode:(per-cpu)");
+ if (create) {
+ VLOG_INFO("Datapath dispatch mode: per-cpu");
+ }
}
error = open_dpif(&dp, dpifp);
More information about the dev
mailing list