[ovs-dev] [RFC PATCH v2 1/1] dpdk: Support both shared and per port mempools.
Kevin Traynor
ktraynor at redhat.com
Wed Jun 20 17:24:46 UTC 2018
On 06/20/2018 01:34 PM, Ian Stokes wrote:
> On 6/19/2018 12:11 PM, Kevin Traynor wrote:
>> On 06/11/2018 05:37 PM, Ian Stokes wrote:
>>> This commit re-introduces the concept of shared mempools as the default
>>> memory model for DPDK devices. Per port mempools are still available but
>>> must be enabled explicitly by a user.
>>>
>>
>> Hi Ian, thanks for v2, comments below
>
> Thanks for the review Kevin, comments inline.
>
>>
>>> OVS previously used a shared mempool model for ports with the same MTU
>>> and socket configuration. This was replaced by a per port mempool model
>>> to address issues flagged by users such as:
>>>
>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
>>>
>>>
>>> However the per port model potentially requires an increase in memory
>>> resource requirements to support the same number of ports and
>>> configuration
>>> as the shared port model.
>>>
>>> This is considered a blocking factor for current deployments of OVS
>>> when upgrading to future OVS releases as a user may have to redimension
>>> memory for the same deployment configuration. This may not be
>>> possible for
>>> users.
>>>
>>> This commit resolves the issue by re-introducing shared mempools as
>>> the default memory behaviour in OVS DPDK but also refactors the memory
>>> configuration code to allow for per port mempools.
>>>
>>> This patch adds a new global config option, per-port-mp, that
>>
>> s/per-port-mp/per-port-memory/
>
> Apologies, I spotted a few of these 'per-port-mp' myself, I believe it's
> in the documentation as well. I have them fixed for the next revision.
>
>>
>>> controls the enablement of per port mempools for DPDK devices.
>>>
>>> ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>>>
>>> This value defaults to false; to enable per port memory support,
>>> this field should be set to true when setting other global parameters
>>> on init (such as "dpdk-socket-mem", for example). Changing the value at
>>> runtime is not supported, and requires restarting the vswitch
>>> daemon.
>>>
>>> The mempool sweep functionality is also replaced with the
>>> sweep functionality from OVS 2.9 found in commits
>>>
>>> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
>>> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
>>>
>>> As this patch is RFC there are a number of TO-DOs including adding a
>>> memory calculation section to the documentation for both models. This is
>>> expected to be completed in the v1 after RFC.
>>>
>>> Signed-off-by: Ian Stokes <ian.stokes at intel.com>
>>>
>>> ---
>>> v1 -> v2
>>> * Rebase to head of master.
>>> * Change global config option 'per-port-mp-enabled' to
>>> 'per-port-memory'.
>>> in commit message & documentation and code.
>>> * Specify in documentation that restart of daemon is only required if
>>> per
>>> port-port-memory is set after dpdk-init=true.
>>> * Return comment 'Contains all 'struct dpdk_mp's.' which was lost in
>>> previous refactor.
>>> * Improve comments regarding unique mempool names in the shared mempool
>>> usecase.
>>> * Check per_port_mp condition first when verifying if new mempool is
>>> required in netdev_dpdk_mempool_configure() for the shared mempool
>>> usecase.
>>> * Correctly return ret variable instead of 0 for function
>>> netdev_dpdk_mempool_configure().
>>> * Move VLOG_ERR regarding failure to create mempool for a device
>>> prior to
>>> dpdk_mp_create() returns.
>>> * Add VLOG_DBG message flagging that the number of mbufs could not be
>>> allocated and that it will be retried with half that amount.
>>> * Fix indentation of VLOG_ERR in netdev_dpdk_mempool_configure().
>>> * Handle EEXIST case for per port memory in function dpdk_mp_get() to
>>> avoid duplication of dpdk_mps, correctly set refcount and return
>>> correct dpdk_mp for the device to use.
>>> ---
>>> Documentation/automake.mk | 1 +
>>> Documentation/topics/dpdk/index.rst | 1 +
>>> Documentation/topics/dpdk/memory.rst | 67 ++++++++
>>> NEWS | 1 +
>>> lib/dpdk-stub.c | 6 +
>>> lib/dpdk.c | 12 ++
>>> lib/dpdk.h | 1 +
>>> lib/netdev-dpdk.c | 298
>>> +++++++++++++++++++++++------------
>>> vswitchd/vswitch.xml | 17 ++
>>> 9 files changed, 304 insertions(+), 100 deletions(-)
>>> create mode 100644 Documentation/topics/dpdk/memory.rst
>>>
>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>> index 2202df4..141b33d 100644
>>> --- a/Documentation/automake.mk
>>> +++ b/Documentation/automake.mk
>>> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>>> Documentation/topics/dpdk/index.rst \
>>> Documentation/topics/dpdk/bridge.rst \
>>> Documentation/topics/dpdk/jumbo-frames.rst \
>>> + Documentation/topics/dpdk/memory.rst \
>>> Documentation/topics/dpdk/pdump.rst \
>>> Documentation/topics/dpdk/phy.rst \
>>> Documentation/topics/dpdk/pmd.rst \
>>> diff --git a/Documentation/topics/dpdk/index.rst
>>> b/Documentation/topics/dpdk/index.rst
>>> index 181f61a..cf24a7b 100644
>>> --- a/Documentation/topics/dpdk/index.rst
>>> +++ b/Documentation/topics/dpdk/index.rst
>>> @@ -40,3 +40,4 @@ The DPDK Datapath
>>> /topics/dpdk/qos
>>> /topics/dpdk/pdump
>>> /topics/dpdk/jumbo-frames
>>> + /topics/dpdk/memory
>>> diff --git a/Documentation/topics/dpdk/memory.rst
>>> b/Documentation/topics/dpdk/memory.rst
>>> new file mode 100644
>>> index 0000000..7c00f0f
>>> --- /dev/null
>>> +++ b/Documentation/topics/dpdk/memory.rst
>>> @@ -0,0 +1,67 @@
>>> +..
>>> + Copyright 2018, Intel, Inc.
>>> +
>>> + Licensed under the Apache License, Version 2.0 (the
>>> "License"); you may
>>> + not use this file except in compliance with the License. You
>>> may obtain
>>> + a copy of the License at
>>> +
>>> + http://www.apache.org/licenses/LICENSE-2.0
>>> +
>>> + Unless required by applicable law or agreed to in writing,
>>> software
>>> + distributed under the License is distributed on an "AS IS"
>>> BASIS, WITHOUT
>>> + WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>> implied. See the
>>> + License for the specific language governing permissions and
>>> limitations
>>> + under the License.
>>> +
>>> + Convention for heading levels in Open vSwitch documentation:
>>> +
>>> + ======= Heading 0 (reserved for the title in a document)
>>> + ------- Heading 1
>>> + ~~~~~~~ Heading 2
>>> + +++++++ Heading 3
>>> + ''''''' Heading 4
>>> +
>>> + Avoid deeper levels because they do not render well.
>>> +
>>> +=========================
>>> +DPDK Device Memory Models
>>> +=========================
>>> +
>>> +DPDK device memory can be allocated in one of two ways in OVS DPDK,
>>> +shared mempools or per port mempools. The specifics of both are
>>> detailed
>>> +below.
>>> +
>>> +Shared Mempools
>>> +---------------
>>> +
>>> +By Default OVS DPDK uses a shared mempool model. This means that
>>> multiple
>>> +ports can share the same mempool. For example when a port is added
>>> it will
>>> +have a given MTU and socket ID associated with it. If a mempool has
>>> been
>>> +created previously for an existing port that has the same MTU and
>>> socket ID,
>>> +that mempool is used for both ports. If there is no existing mempool
>>> +supporting these parameters then a new mempool is created.
>>> +
>>> +Per Port Mempools
>>> +-----------------
>>> +
>>> +In the per port mempool model, mempools are created per device and
>>> are not
>>> +shared. The benefit of this is a more transparent memory model where
>>> mempools
>>> +will not be exhausted by other DPDK devices. However this comes at a
>>> potential
>>> +increase in cost for memory dimensioning for a given deployment.
>>> Users should
>>> +be aware of the memory requirements for their deployment before
>>> using this
>>> +model and allocate the required hugepage memory.
>>> +
>>> +Per port mempool support may be enabled via a global config value,
>>> +```per-port-mp-enabled```. Setting this to true enables the per port
>>> mempool
>>> +model for all DPDK devices in OVS::
>>> +
>>> + $ ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>>> +
>>> +.. important::
>>> +
>>> + This value should be set before setting dpdk-init=true. If set
>>> after
>>> + dpdk-init=true then the daemon must be restarted to use
>>> per-port-memory.
>>> +
>>> +.. todo::
>>> +
>>> + Add memory calculation section for both mempool models.
>>> diff --git a/NEWS b/NEWS
>>> index 484c6dc..eff6cf1 100644
>>> --- a/NEWS
>>> +++ b/NEWS
>>> @@ -33,6 +33,7 @@ Post-v2.9.0
>>> See Testing topic for the details.
>>> * Add LSC interrupt support for DPDK physical devices.
>>> * Allow init to fail and record DPDK status/version in OVS
>>> database.
>>> + * Support both shared and per port mempools for DPDK devices.
>>> - Userspace datapath:
>>> * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a
>>> single PMD
>>> * Detailed PMD performance metrics available with new command
>>> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
>>> index 1df1c58..fd3fe21 100644
>>> --- a/lib/dpdk-stub.c
>>> +++ b/lib/dpdk-stub.c
>>> @@ -56,6 +56,12 @@ dpdk_vhost_iommu_enabled(void)
>>> return false;
>>> }
>>> +bool
>>> +dpdk_per_port_mempool_enabled(void)
>>> +{
>>> + return false;
>>> +}
>>> +
>>> void
>>> print_dpdk_version(void)
>>> {
>>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>>> index 09afd8c..6a02a17 100644
>>> --- a/lib/dpdk.c
>>> +++ b/lib/dpdk.c
>>> @@ -47,6 +47,7 @@ static char *vhost_sock_dir = NULL; /* Location
>>> of vhost-user sockets */
>>> static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU
>>> support */
>>> static bool dpdk_initialized = false; /* Indicates successful
>>> initialization
>>> * of DPDK. */
>>> +static bool per_port_memory = false; /* Status of per port memory
>>> support */
>>> static int
>>> process_vhost_flags(char *flag, const char *default_val, int size,
>>> @@ -358,6 +359,11 @@ dpdk_init__(const struct smap *ovs_other_config)
>>> VLOG_INFO("IOMMU support for vhost-user-client %s.",
>>> vhost_iommu_enabled ? "enabled" : "disabled");
>>> + per_port_memory = smap_get_bool(ovs_other_config,
>>> + "per-port-memory", false);
>>> + VLOG_INFO("Per port mempool for DPDK devices %s.",
>>> + per_port_memory ? "enabled" : "disabled");
>>> +
>>> argv = grow_argv(&argv, 0, 1);
>>> argc = 1;
>>> argv[0] = xstrdup(ovs_get_program_name());
>>> @@ -515,6 +521,12 @@ dpdk_vhost_iommu_enabled(void)
>>> return vhost_iommu_enabled;
>>> }
>>> +bool
>>> +dpdk_per_port_memory(void)
>>> +{
>>> + return per_port_memory;
>>> +}
>>> +
>>> void
>>> dpdk_set_lcore_id(unsigned cpu)
>>> {
>>> diff --git a/lib/dpdk.h b/lib/dpdk.h
>>> index efdaa63..bbb89d4 100644
>>> --- a/lib/dpdk.h
>>> +++ b/lib/dpdk.h
>>> @@ -39,6 +39,7 @@ void dpdk_init(const struct smap *ovs_other_config);
>>> void dpdk_set_lcore_id(unsigned cpu);
>>> const char *dpdk_get_vhost_sock_dir(void);
>>> bool dpdk_vhost_iommu_enabled(void);
>>> +bool dpdk_per_port_memory(void);
>>> void print_dpdk_version(void);
>>> void dpdk_status(const struct ovsrec_open_vswitch *);
>>> #endif /* dpdk.h */
>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>> index 2e2f568..2c7ff81 100644
>>> --- a/lib/netdev-dpdk.c
>>> +++ b/lib/netdev-dpdk.c
>>> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl =
>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>> #define NETDEV_DPDK_MBUF_ALIGN 1024
>>> #define NETDEV_DPDK_MAX_PKT_LEN 9728
>>> -/* Min number of packets in the mempool. OVS tries to allocate a
>>> mempool with
>>> - * roughly estimated number of mbufs: if this fails (because the
>>> system doesn't
>>> - * have enough hugepages) we keep halving the number until the
>>> allocation
>>> - * succeeds or we reach MIN_NB_MBUF */
>>> +/* Max and min number of packets in the mempool. OVS tries to
>>> allocate a
>>> + * mempool with MAX_NB_MBUF: if this fails (because the system
>>> doesn't have
>>> + * enough hugepages) we keep halving the number until the allocation
>>> succeeds
>>> + * or we reach MIN_NB_MBUF */
>>> +
>>> +#define MAX_NB_MBUF (4096 * 64)
>>> #define MIN_NB_MBUF (4096 * 4)
>>> #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE
>>> +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF /
>>> MIN_NB_MBUF)
>>> + == 0);
>>> +
>>> +/* The smallest possible NB_MBUF that we're going to try should be a
>>> multiple
>>> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
>>> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF /
>>> MIN_NB_MBUF))
>>> + % MP_CACHE_SZ == 0);
>>> +
>>> /*
>>> * DPDK XSTATS Counter names definition
>>> */
>>> @@ -297,12 +308,14 @@ static struct ovs_mutex dpdk_mp_mutex
>>> OVS_ACQ_AFTER(dpdk_mutex)
>>> = OVS_MUTEX_INITIALIZER;
>>> /* Contains all 'struct dpdk_mp's. */
>>> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
>>> - = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
>>> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
>>> + = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>>> -/* Wrapper for a mempool released but not yet freed. */
>>> struct dpdk_mp {
>>> struct rte_mempool *mp;
>>> + int mtu;
>>> + int socket_id;
>>> + int refcount;
>>> struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>>> };
>>> @@ -383,7 +396,7 @@ struct netdev_dpdk {
>>> PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>> struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
>>> - struct rte_mempool *mp;
>>> + struct dpdk_mp *dpdk_mp;
>>> /* virtio identifier for vhost devices */
>>> ovsrcu_index vid;
>>> @@ -544,73 +557,95 @@ dpdk_mp_full(const struct rte_mempool *mp)
>>> OVS_REQUIRES(dpdk_mp_mutex)
>>> * likely be ok, as there are additional checks during mempool
>>> * freeing but it would make things racey.
>>> */
>>> - return rte_mempool_full(mp);
>>> + rte_mempool_full(mp);
>>> + return 0;
>>
>> I don't think you intended this? unless you are in the business of
>> selling memory ;-)
>
> Ooops :-/ , This is left over from testing when mbufs are still in use,
> not sure how I missed it. Will correct for v3.
>
Maybe PATCH v1 ? or you are still testing out?
>>
>>> }
>>> /* Free unused mempools. */
>>> static void
>>> -dpdk_mp_sweep(void)
>>> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>>> {
>>> struct dpdk_mp *dmp, *next;
>>> - ovs_mutex_lock(&dpdk_mp_mutex);
>>> - LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>>> - if (dpdk_mp_full(dmp->mp)) {
>>> + LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
>>> + if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
>>> VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
>>> ovs_list_remove(&dmp->list_node);
>>> rte_mempool_free(dmp->mp);
>>> rte_free(dmp);
>>> }
>>> }
>>> - ovs_mutex_unlock(&dpdk_mp_mutex);
>>> }
>>> -/* Ensure a mempool will not be freed. */
>>> -static void
>>> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>>> +/* Calculating the required number of mbufs differs depending on the
>>> + * mempool model being used. Check if per port mempools are in use
>>> before
>>> + * calculating.
>>> + */
>>> +static uint32_t
>>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool
>>> per_port_mp)
>>> {
>>> - struct dpdk_mp *dmp, *next;
>>> + uint32_t n_mbufs;
>>> - LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>>> - if (dmp->mp == mp) {
>>> - VLOG_DBG("Removing mempool \"%s\" from free list",
>>> dmp->mp->name);
>>> - ovs_list_remove(&dmp->list_node);
>>> - rte_free(dmp);
>>> - break;
>>> + if (!per_port_mp) {
>>> + /* Shared mempools are being used.
>>> + * XXX: this is a really rough method of provisioning memory.
>>> + * It's impossible to determine what the exact memory
>>> requirements are
>>> + * when the number of ports and rxqs that utilize a
>>> particular mempool
>>> + * can change dynamically at runtime. For now, use this rough
>>> + * heurisitic.
>>> + */
>>> + if (mtu >= ETHER_MTU) {
>>> + n_mbufs = MAX_NB_MBUF;
>>> + } else {
>>> + n_mbufs = MIN_NB_MBUF;
>>> }
>>> + } else {
>>> + /* Per port mempools are being used.
>>> + * XXX: rough estimation of number of mbufs required for
>>> this port:
>>> + * <packets required to fill the device rxqs>
>>> + * + <packets that could be stuck on other ports txqs>
>>> + * + <packets in the pmd threads>
>>> + * + <additional memory for corner cases>
>>> + */
>>> + n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>>> + + dev->requested_n_txq * dev->requested_txq_size
>>> + + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
>>> NETDEV_MAX_BURST
>>> + + MIN_NB_MBUF;
>>> }
>>> +
>>> + return n_mbufs;
>>> }
>>> -/* Returns a valid pointer when either of the following is true:
>>> - * - a new mempool was just created;
>>> - * - a matching mempool already exists. */
>>> -static struct rte_mempool *
>>> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>> +static struct dpdk_mp *
>>> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>> {
>>> char mp_name[RTE_MEMPOOL_NAMESIZE];
>>> const char *netdev_name = netdev_get_name(&dev->up);
>>> int socket_id = dev->requested_socket_id;
>>> uint32_t n_mbufs;
>>> uint32_t hash = hash_string(netdev_name, 0);
>>> - struct rte_mempool *mp = NULL;
>>> -
>>> - /*
>>> - * XXX: rough estimation of number of mbufs required for this port:
>>> - * <packets required to fill the device rxqs>
>>> - * + <packets that could be stuck on other ports txqs>
>>> - * + <packets in the pmd threads>
>>> - * + <additional memory for corner cases>
>>> - */
>>> - n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>>> - + dev->requested_n_txq * dev->requested_txq_size
>>> - + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
>>> NETDEV_MAX_BURST
>>> - + MIN_NB_MBUF;
>>> + struct dpdk_mp *dmp = NULL;
>>> + int ret;
>>> +
>>> + dmp = dpdk_rte_mzalloc(sizeof *dmp);
>>> + if (!dmp) {
>>> + return NULL;
>>> + }
>>> + dmp->socket_id = socket_id;
>>> + dmp->mtu = mtu;
>>> + dmp->refcount = 1;
>>> +
>>> + n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>>> - ovs_mutex_lock(&dpdk_mp_mutex);
>>> do {
>>> /* Full DPDK memory pool name must be unique and cannot be
>>> - * longer than RTE_MEMPOOL_NAMESIZE. */
>>> - int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>>> + * longer than RTE_MEMPOOL_NAMESIZE. Note that for the shared
>>> + * mempool case this can result in one device using a mempool
>>> + * which references a different device in it's name. However as
>>> + * mempool names are hashed, the device name will not be
>>> readable
>>> + * so this is not an issue for tasks such as debugging.
>>> + */
>>> + ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>>> "ovs%08x%02d%05d%07u",
>>> hash, socket_id, mtu, n_mbufs);
>>> if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>>> @@ -626,96 +661,158 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>> netdev_name, n_mbufs, socket_id,
>>> dev->requested_n_rxq, dev->requested_n_txq);
>>> - mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
>>> - sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
>>> - MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
>>> + dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
>>> + MP_CACHE_SZ,
>>> + sizeof (struct dp_packet)
>>> + - sizeof (struct rte_mbuf),
>>> + MBUF_SIZE(mtu)
>>> + - sizeof(struct dp_packet),
>>> + socket_id);
>>> - if (mp) {
>>> + if (dmp->mp) {
>>> VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
>>> mp_name, n_mbufs);
>>> /* rte_pktmbuf_pool_create has done some initialization
>>> of the
>>> - * rte_mbuf part of each dp_packet. Some OvS specific
>>> fields
>>> - * of the packet still need to be initialized by
>>> - * ovs_rte_pktmbuf_init. */
>>> - rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
>>> + * rte_mbuf part of each dp_packet, while
>>> ovs_rte_pktmbuf_init
>>> + * initializes some OVS specific fields of dp_packet.
>>> + */
>>> + rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
>>> + return dmp;
>>> } else if (rte_errno == EEXIST) {
>>> /* A mempool with the same name already exists. We just
>>> * retrieve its pointer to be returned to the caller. */
>>> - mp = rte_mempool_lookup(mp_name);
>>> + dmp->mp = rte_mempool_lookup(mp_name);
>>> /* As the mempool create returned EEXIST we can expect the
>>> * lookup has returned a valid pointer. If for some
>>> reason
>>> * that's not the case we keep track of it. */
>>> VLOG_DBG("A mempool with name \"%s\" already exists at
>>> %p.",
>>> - mp_name, mp);
>>> - /* Ensure this reused mempool will not be freed. */
>>> - dpdk_mp_do_not_free(mp);
>>> + mp_name, dmp->mp);
>>> + return dmp;
>>> } else {
>>> - VLOG_ERR("Failed mempool \"%s\" create request of %u
>>> mbufs",
>>> - mp_name, n_mbufs);
>>> + VLOG_DBG("Failed to create mempool \"%s\" with a request
>>> of "
>>> + "%u mbufs, retrying with %u mbufs",
>>> + mp_name, n_mbufs, n_mbufs/2);
>>> }
>>> - } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >=
>>> MIN_NB_MBUF);
>>> + } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >=
>>> MIN_NB_MBUF);
>>> - ovs_mutex_unlock(&dpdk_mp_mutex);
>>> - return mp;
>>> + VLOG_ERR("Failed to create mempool \"%s\" with a request of %u
>>> mbufs",
>>> + mp_name, n_mbufs);
>>> +
>>> + rte_free(dmp);
>>> + return NULL;
>>> }
>>> -/* Release an existing mempool. */
>>> -static void
>>> -dpdk_mp_release(struct rte_mempool *mp)
>>> +static struct dpdk_mp *
>>> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>> {
>>> - if (!mp) {
>>> - return;
>>> - }
>>> + struct dpdk_mp *dmp, *next;
>>> + bool reuse = false;
>>> ovs_mutex_lock(&dpdk_mp_mutex);
>>> - if (dpdk_mp_full(mp)) {
>>> - VLOG_DBG("Freeing mempool \"%s\"", mp->name);
>>> - rte_mempool_free(mp);
>>> - } else {
>>> - struct dpdk_mp *dmp;p
>>> + /* Check if shared mempools are being used, if so check existing
>>> mempools
>>> + * to see if reuse is possible. */
>>> + if (!per_port_mp) {
>>> + LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>> + if (dmp->socket_id == dev->requested_socket_id
>>> + && dmp->mtu == mtu) {
>>> + VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
>>> + dmp->refcount++;
>>> + reuse = true;
>>> + break;
>>> + }
>>> + }
>>> + }
>>> + /* Sweep mempools after reuse or before create. */
>>> + dpdk_mp_sweep();
>>> - dmp = dpdk_rte_mzalloc(sizeof *dmp);
>>> + if (!reuse) {
>>> + dmp = dpdk_mp_create(dev, mtu, per_port_mp);
>>> if (dmp) {
>>> - dmp->mp = mp;
>>> - ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
>>> + /* We need to check for the EEXIST case for per port
>>> memory.
>>
>> I'm pretty sure this also applies to the shared mempool case, because
>> you have rounded the mtu with dpdk_buf_size() so the requested_mtu might
>> be different but the rounded value that is used for the mempool name
>> might be the same, meaning that you can get an EEXISTS.
>>
>
> Maybe I misunderstood but I would think in that case it will be caught
> by the 'reuse' code above? If the rounded value of a new ports
> requested_mtu is equal to that of an existing mempool mtu and its on the
> same socket it will be reused, you would not get to dpdk_mp_create() as
> the reuse condition will be true.
>
Yes - you are right, it won't occur. Maybe you could add a small comment
to say something like "Shared mempools will have reused and not
requested a mempool that already exists."
> I had tested this with 2 ports, one with a requested mtu of 1500 and a
> second with a requested mtu of 1800. Both are rounded to 2030 and the
> reuse case was hit. Do you think this wont always be the case though?
>
No, it's right - also a good test to make sure that the mtu is updated
correctly in that case :-)
>>> + * Compare the mempool returned by dmp to each entry in
>>> + * dpdk_mp_list. If a match is found, free dmp as a new
>>> entry
>>> + * is not required, set dmp to point to the existing
>>> + * entry and set refcount to 1 to avoid being freed at a
>>> later
>>> + * stage.
>>> + */
>>> + if (per_port_mp && rte_errno == EEXIST) {
>
> If it's the case that shared_mempool will always reuse we may not need
> the per_port_mp in the condition above.
>
Probably it's ok either way, but might be better to leave it there,
otherwise you'll have to think about where rte_errno is being set.
>>> + LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
>>> + if (dmp->mp == next->mp) {
>>> + rte_free(dmp);
>>> + dmp = next;
>>> + dmp->refcount = 1;
>>> + }
>>> + }
>>> + }
>>> + else {
>>> + ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>> + }
>>
>> I think you need to increment refcount and use the safe list option. How
>> about
>
> I spotted the other mails and responded there also, my plan is to change
> to increment rather than = 1. but will keep the rte_free(dmp) and dmp =
> next if that sounds ok?
>
sounds good, thanks.
>>
>> if (rte_errno == EEXIST) {
>> LIST_FOR_EACH_SAFE (next, list_node, &dpdk_mp_list) {
>> if (dmp->mp == next->mp) {
>> next->refcount++;
>> rte_free(dmp);
>> break;
>> }
>> }
>> } else {
>> dmp->refcount++;
>> ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>> }
>>
>>> }
>>> }
>>> +
>>> ovs_mutex_unlock(&dpdk_mp_mutex);
>>> +
>>> + return dmp;
>>> }
>>> -/* Tries to allocate a new mempool - or re-use an existing one where
>>> - * appropriate - on requested_socket_id with a size determined by
>>> - * requested_mtu and requested Rx/Tx queues.
>>> - * On success - or when re-using an existing mempool - the new
>>> configuration
>>> - * will be applied.
>>> +/* Decrement reference to a mempool. */
>>> +static void
>>> +dpdk_mp_put(struct dpdk_mp *dmp)
>>> +{
>>> + if (!dmp) {
>>> + return;
>>> + }
>>> +
>>> + ovs_mutex_lock(&dpdk_mp_mutex);
>>> + ovs_assert(dmp->refcount);
>>> + dmp->refcount--;
>>> + ovs_mutex_unlock(&dpdk_mp_mutex);
>>> +}
>>> +
>>> +/* Depending on the memory model being used this function tries
>>> + * identify and reuse an existing mempool or tries to allocate new
>>> + * mempool on requested_socket_id with mbuf size corresponding to
>>> + * requested_mtu. On success new configuration will be applied.
>>> * On error, device will be left unchanged. */
>>> static int
>>> netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>> OVS_REQUIRES(dev->mutex)
>>> {
>>> uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>>> - struct rte_mempool *mp;
>>> + struct dpdk_mp *mp;
>>
>> I think it should have a different name now, as it leads to mp->mp
>> later. It would be good if you could make is consistent with the other
>> functions (dmp?)
>
> Sure, will do.
>>
>>> int ret = 0;
>>> + bool per_port_mp = dpdk_per_port_memory();
>>> - dpdk_mp_sweep();
>>> + /* With shared mempools we do not need to configure a mempool if
>>> the MTU
>>> + * and socket ID have not changed, the previous configuration is
>>> still
>>> + * valid so return 0 */
>>> + if (!per_port_mp && dev->mtu == dev->requested_mtu
>>> + && dev->socket_id == dev->requested_socket_id) {
>>> + return ret;
>>> + }
>>> - mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
>>> + mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>>> if (!mp) {
>>> VLOG_ERR("Failed to create memory pool for netdev "
>>> "%s, with MTU %d on socket %d: %s\n",
>>> dev->up.name, dev->requested_mtu,
>>> dev->requested_socket_id,
>>> rte_strerror(rte_errno));
>>> - ret = rte_errno;
>>> + return rte_errno;
>>> } else {
>>> - /* If a new MTU was requested and its rounded value equals
>>> the one
>>> - * that is currently used, then the existing mempool is
>>> returned. */
>>> - if (dev->mp != mp) {
>>> - /* A new mempool was created, release the previous one. */
>>> - dpdk_mp_release(dev->mp);
>>> - } else {
>>> - ret = EEXIST;
>>> + /* Check for any pre-existing dpdk_mp for the device before
>>> accessing
>>> + * the associated mempool.
>>> + */
>>> + if (dev->dpdk_mp != NULL) {
>>> + /* If a new MTU was requested and its rounded value does
>>> not
>>> + * equal the rounded value of the current mempool,
>>> decrement
>>> + * the reference count to devices current dpdk_mp as its
>>> + * associated mempool will not be used for this device.
>>> + */
>>> + if (dev->dpdk_mp->mp != mp->mp) {
>>
>> Not sure if check is needed. If it were to be false, wouldn't we have
>> already reused or got an EEXISTS and handled that case?
>>
>
> Yes, your right, it can be removed.
>
>>> + dpdk_mp_put(dev->dpdk_mp);
>>> + }
>>> }
>>> - dev->mp = mp;
>>> + dev->dpdk_mp = mp;
>>> dev->mtu = dev->requested_mtu;
>>> dev->socket_id = dev->requested_socket_id;
>>> dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>> @@ -838,7 +935,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev,
>>> int n_rxq, int n_txq)
>>> for (i = 0; i < n_rxq; i++) {
>>> diag = rte_eth_rx_queue_setup(dev->port_id, i,
>>> dev->rxq_size,
>>> - dev->socket_id, NULL,
>>> dev->mp);
>>> + dev->socket_id, NULL,
>>> + dev->dpdk_mp->mp);
>>> if (diag) {
>>> VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>>> dev->up.name, i, rte_strerror(-diag));
>>> @@ -926,7 +1024,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>> memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>>> rte_eth_link_get_nowait(dev->port_id, &dev->link);
>>> - mbp_priv = rte_mempool_get_priv(dev->mp);
>>> + mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>>> dev->buf_size = mbp_priv->mbuf_data_room_size -
>>> RTE_PKTMBUF_HEADROOM;
>>> /* Get the Flow control configuration for DPDK-ETH */
>>> @@ -1180,7 +1278,7 @@ common_destruct(struct netdev_dpdk *dev)
>>> OVS_EXCLUDED(dev->mutex)
>>> {
>>> rte_free(dev->tx_q);
>>> - dpdk_mp_release(dev->mp);
>>> + dpdk_mp_put(dev->dpdk_mp);
>>> ovs_list_remove(&dev->list_node);
>>> free(ovsrcu_get_protected(struct ingress_policer *,
>>> @@ -1933,7 +2031,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>>> return EAGAIN;
>>> }
>>> - nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
>>> + nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>>> (struct rte_mbuf **)
>>> batch->packets,
>>> NETDEV_MAX_BURST);
>>> if (!nb_rx) {
>>> @@ -2172,7 +2270,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
>>> struct dp_packet_batch *batch)
>>> continue;
>>> }
>>> - pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
>>> + pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>>> if (OVS_UNLIKELY(!pkts[txcnt])) {
>>> dropped += cnt - i;
>>> break;
>>> @@ -3052,7 +3150,7 @@ netdev_dpdk_get_mempool_info(struct
>>> unixctl_conn *conn,
>>> ovs_mutex_lock(&dev->mutex);
>>> ovs_mutex_lock(&dpdk_mp_mutex);
>>> - rte_mempool_dump(stream, dev->mp);
>>> + rte_mempool_dump(stream, dev->dpdk_mp->mp);
>>> ovs_mutex_unlock(&dpdk_mp_mutex);
>>> ovs_mutex_unlock(&dev->mutex);
>>> @@ -3749,7 +3847,7 @@ dpdk_vhost_reconfigure_helper(struct
>>> netdev_dpdk *dev)
>>> err = netdev_dpdk_mempool_configure(dev);
>>> if (!err) {
>>> - /* A new mempool was created. */
>>> + /* A new mempool was created or re-used. */
>>> netdev_change_seq_changed(&dev->up);
>>> } else if (err != EEXIST){
>>> return err;
>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>> index 1e27a02..dbf70ac 100644
>>> --- a/vswitchd/vswitch.xml
>>> +++ b/vswitchd/vswitch.xml
>>> @@ -359,6 +359,23 @@
>>> </p>
>>> </column>
>>> + <column name="other_config" key="per-port-memory"
>>> + type='{"type": "boolean"}'>
>>> + <p>
>>> + By default OVS DPDK uses a shared mempool memory model
>>> wherein
>>> + devices that have the same MTU and socket values can share
>>> the same
>>> + mempool. Setting this value to <code>true</code> changes this
>>> + behaviour. Per port memory allow DPDK devices to use private
>>> + memory per device. This can provide greater transapraency as
>>> + regards memory usage but potentially at the cost of
>>> greater memory
>>> + requirements.
>>> + </p>
>>> + <p>
>>> + Changing this value requires restarting the daemon if
>>> dpdk-init has
>>> + already been set to true.
>>> + </p>
>>> + </column>
>>> +
>>> <column name="other_config" key="tx-flush-interval"
>>> type='{"type": "integer",
>>> "minInteger": 0, "maxInteger": 1000000}'>
>>>
>>
>
More information about the dev
mailing list