[ovs-dev] [RFC PATCH v1] dpdk: Support both shared and per port mempools.
Lam, Tiago
tiago.lam at intel.com
Fri May 25 16:53:56 UTC 2018
Hi Ian,
Thanks for bringing this forward.
I've tested this on current master by re-configuring MTUs of existing
ports, adding new ports, deleting existing ones, etc. It seems to be
behaving as expected:
- On the shared model, it allocates a new mempool only if the MTU
changes, otherwise reuses the already existing one. And it frees a
mempool only if it isn't being used by any port;
- On the per-port model, it allocates a new mempool for every port. And
it frees the mempools when the ports are destroyed / MTU is changed.
There's a catch on how mempools are freed, though. I have a comment on
that in-line below*.
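In case it's useful, this is roughly the sequence I used to exercise both
models (the bridge name, device name and devargs below are just examples
from my setup):

```
$ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk \
      options:dpdk-devargs=0000:01:00.0
$ ovs-vsctl set Interface dpdk0 mtu_request=9000   # forces a mempool change
$ ovs-vsctl del-port br0 dpdk0                     # releases the mempool
```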
(I see Kevin has sent a review as well, so I'll refrain from touching on
the same points, except where it's worth re-iterating.)
On 18/05/2018 15:30, Ian Stokes wrote:
> This commit re-introduces the concept of shared mempools as the default
> memory model for DPDK devices. Per port mempools are still available but
> must be enabled explicitly by a user.
>
> OVS previously used a shared mempool model for ports with the same MTU
> and socket configuration. This was replaced by a per port mempool model
> to address issues flagged by users such as:
>
> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
>
> However the per port model potentially requires an increase in memory
> resource requirements to support the same number of ports and configuration
> as the shared port model.
>
> This is considered a blocking factor for current deployments of OVS
> when upgrading to future OVS releases as a user may have to redimension
> memory for the same deployment configuration. This may not be possible for
> users.
>
> This commit resolves the issue by re-introducing shared mempools as
> the default memory behaviour in OVS DPDK but also refactors the memory
> configuration code to allow for per port mempools.
>
> This patch adds a new global config option, per-port-mp-enabled, that
> controls the enablement of per port mempools for DPDK devices.
>
> ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
>
> This value defaults to false; to enable per port mempool support,
> this field should be set to true when setting other global parameters
> on init (such as "dpdk-socket-mem", for example). Changing the value at
> runtime is not supported, and requires restarting the vswitch
> daemon.
>
> The mempool sweep functionality is also replaced with the
> sweep functionality from OVS 2.9 found in commits
>
> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
>
> As this patch is RFC there are a number of TO-DOs including adding a
> memory calculation section to the documentation for both models. This is
> expected to be completed in the v1 after RFC.
>
> Signed-off-by: Ian Stokes <ian.stokes at intel.com>
> ---
> Documentation/automake.mk | 1 +
> Documentation/topics/dpdk/index.rst | 1 +
> Documentation/topics/dpdk/memory.rst | 67 +++++++
> NEWS | 1 +
> lib/dpdk-stub.c | 6 +
> lib/dpdk.c | 13 ++
> lib/dpdk.h | 1 +
> lib/netdev-dpdk.c | 326 +++++++++++++++++++++--------------
> vswitchd/vswitch.xml | 16 ++
> 9 files changed, 305 insertions(+), 127 deletions(-)
> create mode 100644 Documentation/topics/dpdk/memory.rst
>
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 683ca14..14c2189 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -36,6 +36,7 @@ DOC_SOURCE = \
> Documentation/topics/dpdk/index.rst \
> Documentation/topics/dpdk/bridge.rst \
> Documentation/topics/dpdk/jumbo-frames.rst \
> + Documentation/topics/dpdk/memory.rst \
> Documentation/topics/dpdk/pdump.rst \
> Documentation/topics/dpdk/phy.rst \
> Documentation/topics/dpdk/pmd.rst \
> diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> index 181f61a..cf24a7b 100644
> --- a/Documentation/topics/dpdk/index.rst
> +++ b/Documentation/topics/dpdk/index.rst
> @@ -40,3 +40,4 @@ The DPDK Datapath
> /topics/dpdk/qos
> /topics/dpdk/pdump
> /topics/dpdk/jumbo-frames
> + /topics/dpdk/memory
> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
> new file mode 100644
> index 0000000..1198067
> --- /dev/null
> +++ b/Documentation/topics/dpdk/memory.rst
> @@ -0,0 +1,67 @@
> +..
> + Copyright 2018, Intel, Inc.
> +
> + Licensed under the Apache License, Version 2.0 (the "License"); you may
> + not use this file except in compliance with the License. You may obtain
> + a copy of the License at
> +
> + http://www.apache.org/licenses/LICENSE-2.0
> +
> + Unless required by applicable law or agreed to in writing, software
> + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> + License for the specific language governing permissions and limitations
> + under the License.
> +
> + Convention for heading levels in Open vSwitch documentation:
> +
> + ======= Heading 0 (reserved for the title in a document)
> + ------- Heading 1
> + ~~~~~~~ Heading 2
> + +++++++ Heading 3
> + ''''''' Heading 4
> +
> + Avoid deeper levels because they do not render well.
> +
> +=========================
> +DPDK Device Memory Models
> +=========================
> +
> +DPDK device memory can be allocated in one of two ways in OVS DPDK,
> +shared mempools or per port mempools. The specifics of both are detailed
> +below.
> +
> +Shared Mempools
> +---------------
> +
> +By default OVS DPDK uses a shared mempool model. This means that multiple
> +ports can share the same mempool. For example when a port is added it will
> +have a given MTU and socket ID associated with it. If a mempool has been
> +created previously for an existing port that has the same MTU and socket ID,
> +that mempool is used for both ports. If there is no existing mempool
> +supporting these parameters then a new mempool is created.
> +
> +
> +Per Port Mempools
> +-----------------
> +
> +In the per port mempool model, mempools are created per device and are not
> +shared. The benefit of this is a more transparent memory model where mempools
> +will not be exhausted by other DPDK devices. However this comes at a potential
> +increase in cost for memory dimensioning for a given deployment. Users should
> +be aware of the memory requirements for their deployment before using this
> +model and allocate the required hugepage memory.
> +
> +Per port mempool support may be enabled via a global config value,
> +``per-port-mp-enabled``. Setting this to true enables the per port mempool
> +model for all DPDK devices in OVS::
> +
> + $ ovs-vsctl set Open_vSwitch . other_config:per-port-mp-enabled=true
> +
> +.. important::
> +
> + Changing this value requires restarting the daemon.
> +
> +.. todo::
> +
> + Add memory calculation section for both mempool models.
> diff --git a/NEWS b/NEWS
> index ec548b0..c9991cf 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -30,6 +30,7 @@ Post-v2.9.0
> * New 'check-dpdk' Makefile target to run a new system testsuite.
> See Testing topic for the details.
> * Add LSC interrupt support for DPDK physical devices.
> + * Support both shared and per port mempools for DPDK devices.
> - Userspace datapath:
> * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
> * Detailed PMD performance metrics available with new command
> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
> index 041cd0c..d2a9718 100644
> --- a/lib/dpdk-stub.c
> +++ b/lib/dpdk-stub.c
> @@ -55,6 +55,12 @@ dpdk_vhost_iommu_enabled(void)
> return false;
> }
>
> +bool
> +dpdk_per_port_mempool_enabled(void)
> +{
> + return false;
> +}
> +
> void
> print_dpdk_version(void)
> {
> diff --git a/lib/dpdk.c b/lib/dpdk.c
> index 00dd974..e251970 100644
> --- a/lib/dpdk.c
> +++ b/lib/dpdk.c
> @@ -43,6 +43,8 @@ static FILE *log_stream = NULL; /* Stream for DPDK log redirection */
>
> static char *vhost_sock_dir = NULL; /* Location of vhost-user sockets */
> static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
> +static bool per_port_mp_enabled = false; /* Status of per port mempool
> + * support */
>
> static int
> process_vhost_flags(char *flag, const char *default_val, int size,
> @@ -354,6 +356,11 @@ dpdk_init__(const struct smap *ovs_other_config)
> VLOG_INFO("IOMMU support for vhost-user-client %s.",
> vhost_iommu_enabled ? "enabled" : "disabled");
>
> + per_port_mp_enabled = smap_get_bool(ovs_other_config,
> + "per-port-mp-enabled", false);
> + VLOG_INFO("Per port mempool for DPDK devices %s.",
> + per_port_mp_enabled ? "enabled" : "disabled");
> +
> argv = grow_argv(&argv, 0, 1);
> argc = 1;
> argv[0] = xstrdup(ovs_get_program_name());
> @@ -498,6 +505,12 @@ dpdk_vhost_iommu_enabled(void)
> return vhost_iommu_enabled;
> }
>
> +bool
> +dpdk_per_port_mempool_enabled(void)
> +{
> + return per_port_mp_enabled;
> +}
> +
> void
> dpdk_set_lcore_id(unsigned cpu)
> {
> diff --git a/lib/dpdk.h b/lib/dpdk.h
> index b041535..243385c 100644
> --- a/lib/dpdk.h
> +++ b/lib/dpdk.h
> @@ -38,6 +38,7 @@ void dpdk_init(const struct smap *ovs_other_config);
> void dpdk_set_lcore_id(unsigned cpu);
> const char *dpdk_get_vhost_sock_dir(void);
> bool dpdk_vhost_iommu_enabled(void);
> +bool dpdk_per_port_mempool_enabled(void);
> void print_dpdk_version(void);
>
> #endif /* dpdk.h */
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 87152a7..cda2ace 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
> #define NETDEV_DPDK_MBUF_ALIGN 1024
> #define NETDEV_DPDK_MAX_PKT_LEN 9728
>
> -/* Min number of packets in the mempool. OVS tries to allocate a mempool with
> - * roughly estimated number of mbufs: if this fails (because the system doesn't
> - * have enough hugepages) we keep halving the number until the allocation
> - * succeeds or we reach MIN_NB_MBUF */
> +/* Max and min number of packets in the mempool. OVS tries to allocate a
> + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
> + * enough hugepages) we keep halving the number until the allocation succeeds
> + * or we reach MIN_NB_MBUF */
> +
> +#define MAX_NB_MBUF (4096 * 64)
> #define MIN_NB_MBUF (4096 * 4)
> #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE
>
> +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
> + == 0);
> +
> +/* The smallest possible NB_MBUF that we're going to try should be a multiple
> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
> + % MP_CACHE_SZ == 0);
> +
> /*
> * DPDK XSTATS Counter names definition
> */
> @@ -296,13 +307,14 @@ static struct ovs_list dpdk_list OVS_GUARDED_BY(dpdk_mutex)
> static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
> = OVS_MUTEX_INITIALIZER;
>
> -/* Contains all 'struct dpdk_mp's. */
> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
> - = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
> + = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>
> -/* Wrapper for a mempool released but not yet freed. */
> struct dpdk_mp {
> struct rte_mempool *mp;
> + int mtu;
> + int socket_id;
> + int refcount;
> struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
> };
>
> @@ -381,7 +393,7 @@ struct netdev_dpdk {
>
> PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
> struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
> - struct rte_mempool *mp;
> + struct dpdk_mp *dpdk_mp;
>
> /* virtio identifier for vhost devices */
> ovsrcu_index vid;
> @@ -526,88 +538,70 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp OVS_UNUSED,
> dp_packet_init_dpdk((struct dp_packet *) pkt, pkt->buf_len);
> }
>
> -static int
> -dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> -{
> - unsigned ring_count;
> - /* This logic is needed because rte_mempool_full() is not guaranteed to
> - * be atomic and mbufs could be moved from mempool cache --> mempool ring
> - * during the call. However, as no mbufs will be taken from the mempool
> - * at this time, we can work around it by also checking the ring entries
> - * separately and ensuring that they have not changed.
> - */
> - ring_count = rte_mempool_ops_get_count(mp);
> - if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
> - return 1;
> - }
> -
> - return 0;
> -}
> -
> -/* Free unused mempools. */
> -static void
> -dpdk_mp_sweep(void)
> +/* Calculating the required number of mbufs differs depending on the
> + * mempool model being used. Check if per port mempools are in use before
> + * calculating.
> + */
> +static uint32_t
> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> {
> - struct dpdk_mp *dmp, *next;
> + uint32_t n_mbufs;
>
> - ovs_mutex_lock(&dpdk_mp_mutex);
> - LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> - if (dpdk_mp_full(dmp->mp)) {
> - VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> - ovs_list_remove(&dmp->list_node);
> - rte_mempool_free(dmp->mp);
> - rte_free(dmp);
> + if (!per_port_mp) {
> + /* Shared mempools are being used.
> + * XXX: this is a really rough method of provisioning memory.
> + * It's impossible to determine what the exact memory requirements are
> + * when the number of ports and rxqs that utilize a particular mempool
> + * can change dynamically at runtime. For now, use this rough
> + * heuristic.
> + */
> + if (mtu >= ETHER_MTU) {
> + n_mbufs = MAX_NB_MBUF;
Is this going to be configurable in the future?
> + } else {
> + n_mbufs = MIN_NB_MBUF;
> }
> + } else {
> + /* Per port mempools are being used.
> + * XXX: rough estimation of number of mbufs required for this port:
> + * <packets required to fill the device rxqs>
> + * + <packets that could be stuck on other ports txqs>
> + * + <packets in the pmd threads>
> + * + <additional memory for corner cases>
> + */
> + n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> + + dev->requested_n_txq * dev->requested_txq_size
> + + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> + + MIN_NB_MBUF;
> }
> - ovs_mutex_unlock(&dpdk_mp_mutex);
> -}
> -
> -/* Ensure a mempool will not be freed. */
> -static void
> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> -{
> - struct dpdk_mp *dmp, *next;
>
> - LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> - if (dmp->mp == mp) {
> - VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
> - ovs_list_remove(&dmp->list_node);
> - rte_free(dmp);
> - break;
> - }
> - }
> + return n_mbufs;
> }
>
> -/* Returns a valid pointer when either of the following is true:
> - * - a new mempool was just created;
> - * - a matching mempool already exists. */
> -static struct rte_mempool *
> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> +static struct dpdk_mp *
> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> {
> char mp_name[RTE_MEMPOOL_NAMESIZE];
> const char *netdev_name = netdev_get_name(&dev->up);
> int socket_id = dev->requested_socket_id;
> uint32_t n_mbufs;
> uint32_t hash = hash_string(netdev_name, 0);
> - struct rte_mempool *mp = NULL;
> -
> - /*
> - * XXX: rough estimation of number of mbufs required for this port:
> - * <packets required to fill the device rxqs>
> - * + <packets that could be stuck on other ports txqs>
> - * + <packets in the pmd threads>
> - * + <additional memory for corner cases>
> - */
> - n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> - + dev->requested_n_txq * dev->requested_txq_size
> - + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> - + MIN_NB_MBUF;
> + struct dpdk_mp *dmp = NULL;
> + int ret;
> +
> + dmp = dpdk_rte_mzalloc(sizeof *dmp);
> + if (!dmp) {
> + return NULL;
> + }
> + dmp->socket_id = socket_id;
> + dmp->mtu = mtu;
> + dmp->refcount = 1;
> +
> + n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>
> - ovs_mutex_lock(&dpdk_mp_mutex);
> do {
> /* Full DPDK memory pool name must be unique and cannot be
> * longer than RTE_MEMPOOL_NAMESIZE. */
> - int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> + ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> "ovs%08x%02d%05d%07u",
> hash, socket_id, mtu, n_mbufs);
> if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
> @@ -623,102 +617,179 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> netdev_name, n_mbufs, socket_id,
> dev->requested_n_rxq, dev->requested_n_txq);
>
> - mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
> - sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
> - MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
> + dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
> + MP_CACHE_SZ,
> + sizeof (struct dp_packet)
> + - sizeof (struct rte_mbuf),
> + MBUF_SIZE(mtu)
> + - sizeof(struct dp_packet),
> + socket_id);
>
> - if (mp) {
> + if (dmp->mp) {
> VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
> mp_name, n_mbufs);
> /* rte_pktmbuf_pool_create has done some initialization of the
> - * rte_mbuf part of each dp_packet. Some OvS specific fields
> - * of the packet still need to be initialized by
> - * ovs_rte_pktmbuf_init. */
> - rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
> + * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
> + * initializes some OVS specific fields of dp_packet.
> + */
> + rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
> + return dmp;
> } else if (rte_errno == EEXIST) {
> /* A mempool with the same name already exists. We just
> * retrieve its pointer to be returned to the caller. */
> - mp = rte_mempool_lookup(mp_name);
> + dmp->mp = rte_mempool_lookup(mp_name);
> /* As the mempool create returned EEXIST we can expect the
> * lookup has returned a valid pointer. If for some reason
> * that's not the case we keep track of it. */
> VLOG_DBG("A mempool with name \"%s\" already exists at %p.",
> - mp_name, mp);
> - /* Ensure this reused mempool will not be freed. */
> - dpdk_mp_do_not_free(mp);
> + mp_name, dmp->mp);
> } else {
> VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
> mp_name, n_mbufs);
> }
Just a thought, but now with shared memory, where n_mbufs initially starts
at 4096 * 64, one can see this error log printed a few times before the
memory gets allocated. We could potentially demote this to a WARN with a
friendlier message, and only print the error below, right before returning
to the caller (at that point we know for sure we couldn't allocate the
mempool).
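Something along these lines, just as a sketch (not compile-tested against
the tree):

```
} else {
    /* Allocation failed at this size; the loop below will retry
     * with half the mbufs, so a warning is enough here. */
    VLOG_WARN("Failed mempool \"%s\" create request of %u mbufs, "
              "retrying with fewer mbufs", mp_name, n_mbufs);
}
```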
> - } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
> + } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
>
> - ovs_mutex_unlock(&dpdk_mp_mutex);
> - return mp;
> + rte_free(dmp);
> + return NULL;
> }
>
> -/* Release an existing mempool. */
> +static int
> +dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> +{
> + unsigned ring_count;
> + /* This logic is needed because rte_mempool_full() is not guaranteed to
> + * be atomic and mbufs could be moved from mempool cache --> mempool ring
> + * during the call. However, as no mbufs will be taken from the mempool
> + * at this time, we can work around it by also checking the ring entries
> + * separately and ensuring that they have not changed.
> + */
> + ring_count = rte_mempool_ops_get_count(mp);
> + if (rte_mempool_full(mp) && rte_mempool_ops_get_count(mp) == ring_count) {
> + return 1;
> + }
> +
> + return 0;
> +}
> +
> +/* Free unused mempools. */
> static void
> -dpdk_mp_release(struct rte_mempool *mp)
> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
> {
> - if (!mp) {
> - return;
> + struct dpdk_mp *dmp, *next;
> +
> + LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
> + if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
> + VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
> + ovs_list_remove(&dmp->list_node);
> + rte_mempool_free(dmp->mp);
> + rte_free(dmp);
> + }
> }
> +}
> +
> +static struct dpdk_mp *
> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
> +{
> + struct dpdk_mp *dmp;
> + bool reuse = false;
>
> ovs_mutex_lock(&dpdk_mp_mutex);
> - if (dpdk_mp_full(mp)) {
> - VLOG_DBG("Freeing mempool \"%s\"", mp->name);
> - rte_mempool_free(mp);
> - } else {
> - struct dpdk_mp *dmp;
> + /* Check if shared mempools are being used, if so check existing mempools
> + * to see if reuse is possible. */
> + if (!per_port_mp) {
> + LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
> + if (dmp->socket_id == dev->requested_socket_id
> + && dmp->mtu == mtu) {
> + VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
> + dmp->refcount++;
> + reuse = true;
> + break;
> + }
> + }
> + }
> + /* Sweep mempools after reuse or before create. */
> + dpdk_mp_sweep();
*Should dpdk_mp_sweep() be called when destroying an interface as well,
in common_destruct()? While testing, I've noticed that if I add one port
and then delete that same port, the mempool will stay allocated until
another port is added, since dpdk_mp_sweep() will only be called on the
next call to dpdk_mp_get(). This could potentially create problems in the
per-port model, where one deletes a certain number of ports and then can't
add any other ports because (hanging) mempools are still holding memory.
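i.e., something like this in common_destruct(), just as a sketch (not
compile-tested against the tree):

```
rte_free(dev->tx_q);
dpdk_mp_put(dev->dpdk_mp);

ovs_mutex_lock(&dpdk_mp_mutex);
dpdk_mp_sweep();    /* Reclaim this port's mempool right away. */
ovs_mutex_unlock(&dpdk_mp_mutex);
```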
>
> - dmp = dpdk_rte_mzalloc(sizeof *dmp);
> + if (!reuse) {
> + dmp = dpdk_mp_create(dev, mtu, per_port_mp);
> if (dmp) {
> - dmp->mp = mp;
> - ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
> + ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
> }
> }
> +
> ovs_mutex_unlock(&dpdk_mp_mutex);
> +
> + return dmp;
> }
>
> -/* Tries to allocate a new mempool - or re-use an existing one where
> - * appropriate - on requested_socket_id with a size determined by
> - * requested_mtu and requested Rx/Tx queues.
> - * On success - or when re-using an existing mempool - the new configuration
> - * will be applied.
> +/* Decrement reference to a mempool. */
> +static void
> +dpdk_mp_put(struct dpdk_mp *dmp)
> +{
> + if (!dmp) {
> + return;
> + }
> +
> + ovs_mutex_lock(&dpdk_mp_mutex);
> + ovs_assert(dmp->refcount);
> + dmp->refcount--;
> + ovs_mutex_unlock(&dpdk_mp_mutex);
> +}
> +
> +/* Depending on the memory model being used, this function tries to
> + * identify and reuse an existing mempool or tries to allocate a new
> + * mempool on requested_socket_id with mbuf size corresponding to
> + * requested_mtu. On success new configuration will be applied.
> * On error, device will be left unchanged. */
> static int
> netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
> OVS_REQUIRES(dev->mutex)
> {
> uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
> - struct rte_mempool *mp;
> + struct dpdk_mp *mp;
> int ret = 0;
> + bool per_port_mp = dpdk_per_port_mempool_enabled();
>
> - dpdk_mp_sweep();
> + /* With shared mempools we do not need to configure a mempool if the MTU
> + * and socket ID have not changed, the previous configuration is still
> + * valid so return 0 */
> + if (dev->mtu == dev->requested_mtu
> + && dev->socket_id == dev->requested_socket_id
> + && (!per_port_mp)) {
I also find that moving the `(!per_port_mp)` condition to the beginning
improves readability here. It goes hand in hand with your comment just
above - "With shared mempools we do not ...".
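i.e., something like (just a sketch):

```
if (!per_port_mp
    && dev->mtu == dev->requested_mtu
    && dev->socket_id == dev->requested_socket_id) {
    return ret;
}
```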
> + return ret;
> + }
>
> - mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
> + mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
> if (!mp) {
> VLOG_ERR("Failed to create memory pool for netdev "
> "%s, with MTU %d on socket %d: %s\n",
> dev->up.name, dev->requested_mtu, dev->requested_socket_id,
> - rte_strerror(rte_errno));
> - ret = rte_errno;
> + rte_strerror(rte_errno));
Missing indentation here.
> + return rte_errno;
> } else {
> - /* If a new MTU was requested and its rounded value equals the one
> - * that is currently used, then the existing mempool is returned. */
> - if (dev->mp != mp) {
> - /* A new mempool was created, release the previous one. */
> - dpdk_mp_release(dev->mp);
> - } else {
> - ret = EEXIST;
> + /* Check for any pre-existing dpdk_mp for the device */
> + if (dev->dpdk_mp != NULL) {
> + /* If a new MTU was requested and its rounded value equals the
> + * one that is currently used, then the existing mempool is
> + * returned. */
> + if (dev->dpdk_mp->mp != mp->mp) {
> + /* A new mempool was created, release the previous one. */
> + dpdk_mp_put(dev->dpdk_mp);
> + } else {
> + /* If the mempool already exists in the current dpdk_mp then
> + * we need to ensure dpdk_mp that was just created is freed in
> + * the next sweep as it will not be used. */
The code path around the `else` block here will only be called when
`!per_port_mp`. This is because `dpdk_mp_get()` will only return an
already existing mempool when using the shared model. Otherwise a new
one is always returned, and thus the `if (dev->dpdk_mp->mp != mp->mp)`
will be true. Am I reading this right? If so then refactoring this a bit
to differentiate on `per_port_mp` might help on readability - this goes
in-line with Kevin's comment about making this piece a bit more readable.
On the same note, the comment above misled me into thinking that the
allocated `mp` is being freed, which would be an error since the
same `mp` is then assigned below. Instead, what it actually does is
decrement the refcount in struct dpdk_mp, which might cause the mempool to
be freed on the next dpdk_mp_sweep() if `refcount == 0`. But that won't
happen in the shared model unless no port is using the mempool.
Thanks,
Tiago.
> + dpdk_mp_put(mp);
> + ret = EEXIST;
> + }
> }
> - dev->mp = mp;
> + dev->dpdk_mp = mp;
> dev->mtu = dev->requested_mtu;
> dev->socket_id = dev->requested_socket_id;
> dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> }
>
> - return ret;
> + return 0;
> }
>
> static void
> @@ -835,7 +906,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>
> for (i = 0; i < n_rxq; i++) {
> diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
> - dev->socket_id, NULL, dev->mp);
> + dev->socket_id, NULL,
> + dev->dpdk_mp->mp);
> if (diag) {
> VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
> dev->up.name, i, rte_strerror(-diag));
> @@ -922,7 +994,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
> memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
> rte_eth_link_get_nowait(dev->port_id, &dev->link);
>
> - mbp_priv = rte_mempool_get_priv(dev->mp);
> + mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
> dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
>
> /* Get the Flow control configuration for DPDK-ETH */
> @@ -1176,7 +1248,7 @@ common_destruct(struct netdev_dpdk *dev)
> OVS_EXCLUDED(dev->mutex)
> {
> rte_free(dev->tx_q);
> - dpdk_mp_release(dev->mp);
> + dpdk_mp_put(dev->dpdk_mp);
>
> ovs_list_remove(&dev->list_node);
> free(ovsrcu_get_protected(struct ingress_policer *,
> @@ -1928,7 +2000,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
> return EAGAIN;
> }
>
> - nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
> + nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
> (struct rte_mbuf **) batch->packets,
> NETDEV_MAX_BURST);
> if (!nb_rx) {
> @@ -2167,7 +2239,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
> continue;
> }
>
> - pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
> + pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
> if (OVS_UNLIKELY(!pkts[txcnt])) {
> dropped += cnt - i;
> break;
> @@ -3047,7 +3119,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
> ovs_mutex_lock(&dev->mutex);
> ovs_mutex_lock(&dpdk_mp_mutex);
>
> - rte_mempool_dump(stream, dev->mp);
> + rte_mempool_dump(stream, dev->dpdk_mp->mp);
>
> ovs_mutex_unlock(&dpdk_mp_mutex);
> ovs_mutex_unlock(&dev->mutex);
> @@ -3742,7 +3814,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>
> err = netdev_dpdk_mempool_configure(dev);
> if (!err) {
> - /* A new mempool was created. */
> + /* A new mempool was created or re-used. */
> netdev_change_seq_changed(&dev->up);
> } else if (err != EEXIST){
> return err;
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 7ab90d5..95520af 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -359,6 +359,22 @@
> </p>
> </column>
>
> + <column name="other_config" key="per-port-mp-enabled"
> + type='{"type": "boolean"}'>
> + <p>
> + By default OVS DPDK uses a shared mempool memory model wherein
> + devices that have the same MTU and socket values can share the same
> + mempool. Setting this value to <code>true</code> changes this
> + behaviour. Per port mempools allow DPDK devices to use a private
> + mempool per device. This can provide greater transparency as
> + regards memory usage but potentially at the cost of greater memory
> + requirements.
> + </p>
> + <p>
> + Changing this value requires restarting the daemon.
> + </p>
> + </column>
> +
> <column name="other_config" key="tx-flush-interval"
> type='{"type": "integer",
> "minInteger": 0, "maxInteger": 1000000}'>
>