[ovs-dev] [RFC PATCH v2 1/1] dpdk: Support both shared and per port mempools.

Ian Stokes ian.stokes at intel.com
Wed Jun 20 20:23:42 UTC 2018


On 6/20/2018 6:24 PM, Kevin Traynor wrote:
> On 06/20/2018 01:34 PM, Ian Stokes wrote:
>> On 6/19/2018 12:11 PM, Kevin Traynor wrote:
>>> On 06/11/2018 05:37 PM, Ian Stokes wrote:
>>>> This commit re-introduces the concept of shared mempools as the default
>>>> memory model for DPDK devices. Per port mempools are still available but
>>>> must be enabled explicitly by a user.
>>>>
>>>
>>> Hi Ian, thanks for v2, comments below
>>
>> Thanks for the review Kevin, comments inline.
>>
>>>
>>>> OVS previously used a shared mempool model for ports with the same MTU
>>>> and socket configuration. This was replaced by a per port mempool model
>>>> to address issues flagged by users such as:
>>>>
>>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
>>>>
>>>>
>>>> However the per port model potentially requires an increase in memory
>>>> resource requirements to support the same number of ports and
>>>> configuration
>>>> as the shared port model.
>>>>
>>>> This is considered a blocking factor for current deployments of OVS
>>>> when upgrading to future OVS releases as a user may have to redimension
>>>> memory for the same deployment configuration. This may not be
>>>> possible for
>>>> users.
>>>>
>>>> This commit resolves the issue by re-introducing shared mempools as
>>>> the default memory behaviour in OVS DPDK but also refactors the memory
>>>> configuration code to allow for per port mempools.
>>>>
>>>> This patch adds a new global config option, per-port-mp, that
>>>
>>> s/per-port-mp/per-port-memory/
>>
>> Apologies, I spotted a few of these 'per-port-mp' myself, I believe it's
>> in the documentation as well. I have them fixed for the next revision.
>>
>>>
>>>> controls the enablement of per port mempools for DPDK devices.
>>>>
>>>>       ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>>>>
>>>> This value defaults to false; to enable per port memory support,
>>>> this field should be set to true when setting other global parameters
>>>> on init (such as "dpdk-socket-mem", for example). Changing the value at
>>>> runtime is not supported, and requires restarting the vswitch
>>>> daemon.
>>>>
>>>> The mempool sweep functionality is also replaced with the
>>>> sweep functionality from OVS 2.9 found in commits
>>>>
>>>> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
>>>> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
>>>>
>>>> As this patch is RFC there are a number of TO-DOs including adding a
>>>> memory calculation section to the documentation for both models. This is
>>>> expected to be completed in the v1 after RFC.
>>>>
>>>> Signed-off-by: Ian Stokes <ian.stokes at intel.com>
>>>>
>>>> ---
>>>> v1 -> v2
>>>> * Rebase to head of master.
>>>> * Change global config option 'per-port-mp-enabled' to
>>>> 'per-port-memory'.
>>>>     in commit message & documentation and code.
>>>> * Specify in documentation that restart of daemon is only required if
>>>> per
>>>>     port-port-memory is set after dpdk-init=true.
>>>> * Return comment 'Contains all 'struct dpdk_mp's.' which was lost in
>>>>     previous refactor.
>>>> * Improve comments regarding unique mempool names in the shared mempool
>>>>     usecase.
>>>> * Check per_port_mp condition first when verifying if new mempool is
>>>>     required in netdev_dpdk_mempool_configure() for the shared mempool
>>>>     usecase.
>>>> * Correctly return ret variable instead of 0 for function
>>>>     netdev_dpdk_mempool_configure().
>>>> * Move VLOG_ERR regarding failure to create mempool for a device
>>>> prior to
>>>>     dpdk_mp_create() returns.
>>>> * Add VLOG_DBG message flagging that the number of mbufs could not be
>>>>     allocated and that it will be retried with half that amount.
>>>> * Fix indentation of VLOG_ERR in netdev_dpdk_mempool_configure().
>>>> * Handle EEXIST case for per port memory in function dpdk_mp_get() to
>>>>     avoid duplication of dpdk_mps, correctly set refcount and return
>>>>     correct dpdk_mp for the device to use.
>>>> ---
>>>>    Documentation/automake.mk            |   1 +
>>>>    Documentation/topics/dpdk/index.rst  |   1 +
>>>>    Documentation/topics/dpdk/memory.rst |  67 ++++++++
>>>>    NEWS                                 |   1 +
>>>>    lib/dpdk-stub.c                      |   6 +
>>>>    lib/dpdk.c                           |  12 ++
>>>>    lib/dpdk.h                           |   1 +
>>>>    lib/netdev-dpdk.c                    | 298
>>>> +++++++++++++++++++++++------------
>>>>    vswitchd/vswitch.xml                 |  17 ++
>>>>    9 files changed, 304 insertions(+), 100 deletions(-)
>>>>    create mode 100644 Documentation/topics/dpdk/memory.rst
>>>>
>>>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>>>> index 2202df4..141b33d 100644
>>>> --- a/Documentation/automake.mk
>>>> +++ b/Documentation/automake.mk
>>>> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>>>>        Documentation/topics/dpdk/index.rst \
>>>>        Documentation/topics/dpdk/bridge.rst \
>>>>        Documentation/topics/dpdk/jumbo-frames.rst \
>>>> +    Documentation/topics/dpdk/memory.rst \
>>>>        Documentation/topics/dpdk/pdump.rst \
>>>>        Documentation/topics/dpdk/phy.rst \
>>>>        Documentation/topics/dpdk/pmd.rst \
>>>> diff --git a/Documentation/topics/dpdk/index.rst
>>>> b/Documentation/topics/dpdk/index.rst
>>>> index 181f61a..cf24a7b 100644
>>>> --- a/Documentation/topics/dpdk/index.rst
>>>> +++ b/Documentation/topics/dpdk/index.rst
>>>> @@ -40,3 +40,4 @@ The DPDK Datapath
>>>>       /topics/dpdk/qos
>>>>       /topics/dpdk/pdump
>>>>       /topics/dpdk/jumbo-frames
>>>> +   /topics/dpdk/memory
>>>> diff --git a/Documentation/topics/dpdk/memory.rst
>>>> b/Documentation/topics/dpdk/memory.rst
>>>> new file mode 100644
>>>> index 0000000..7c00f0f
>>>> --- /dev/null
>>>> +++ b/Documentation/topics/dpdk/memory.rst
>>>> @@ -0,0 +1,67 @@
>>>> +..
>>>> +        Copyright 2018, Intel, Inc.
>>>> +
>>>> +      Licensed under the Apache License, Version 2.0 (the
>>>> "License"); you may
>>>> +      not use this file except in compliance with the License. You
>>>> may obtain
>>>> +      a copy of the License at
>>>> +
>>>> +          http://www.apache.org/licenses/LICENSE-2.0
>>>> +
>>>> +      Unless required by applicable law or agreed to in writing,
>>>> software
>>>> +      distributed under the License is distributed on an "AS IS"
>>>> BASIS, WITHOUT
>>>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>>> implied. See the
>>>> +      License for the specific language governing permissions and
>>>> limitations
>>>> +      under the License.
>>>> +
>>>> +      Convention for heading levels in Open vSwitch documentation:
>>>> +
>>>> +      =======  Heading 0 (reserved for the title in a document)
>>>> +      -------  Heading 1
>>>> +      ~~~~~~~  Heading 2
>>>> +      +++++++  Heading 3
>>>> +      '''''''  Heading 4
>>>> +
>>>> +      Avoid deeper levels because they do not render well.
>>>> +
>>>> +=========================
>>>> +DPDK Device Memory Models
>>>> +=========================
>>>> +
>>>> +DPDK device memory can be allocated in one of two ways in OVS DPDK,
>>>> +shared mempools or per port mempools. The specifics of both are
>>>> detailed
>>>> +below.
>>>> +
>>>> +Shared Mempools
>>>> +---------------
>>>> +
>>>> +By Default OVS DPDK uses a shared mempool model. This means that
>>>> multiple
>>>> +ports can share the same mempool. For example when a port is added
>>>> it will
>>>> +have a given MTU and socket ID associated with it. If a mempool has
>>>> been
>>>> +created previously for an existing port that has the same MTU and
>>>> socket ID,
>>>> +that mempool is used for both ports. If there is no existing mempool
>>>> +supporting these parameters then a new mempool is created.
>>>> +
>>>> +Per Port Mempools
>>>> +-----------------
>>>> +
>>>> +In the per port mempool model, mempools are created per device and
>>>> are not
>>>> +shared. The benefit of this is a more transparent memory model where
>>>> mempools
>>>> +will not be exhausted by other DPDK devices. However this comes at a
>>>> potential
>>>> +increase in cost for memory dimensioning for a given deployment.
>>>> Users should
>>>> +be aware of the memory requirements for their deployment before
>>>> using this
>>>> +model and allocate the required hugepage memory.
>>>> +
>>>> +Per port mempool support may be enabled via a global config value,
>>>> +```per-port-mp-enabled```. Setting this to true enables the per port
>>>> mempool
>>>> +model for all DPDK devices in OVS::
>>>> +
>>>> +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>>>> +
>>>> +.. important::
>>>> +
>>>> +    This value should be set before setting dpdk-init=true. If set
>>>> after
>>>> +    dpdk-init=true then the daemon must be restarted to use
>>>> per-port-memory.
>>>> +
>>>> +.. todo::
>>>> +
>>>> +   Add memory calculation section for both mempool models.
>>>> diff --git a/NEWS b/NEWS
>>>> index 484c6dc..eff6cf1 100644
>>>> --- a/NEWS
>>>> +++ b/NEWS
>>>> @@ -33,6 +33,7 @@ Post-v2.9.0
>>>>           See Testing topic for the details.
>>>>         * Add LSC interrupt support for DPDK physical devices.
>>>>         * Allow init to fail and record DPDK status/version in OVS
>>>> database.
>>>> +     * Support both shared and per port mempools for DPDK devices.
>>>>       - Userspace datapath:
>>>>         * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a
>>>> single PMD
>>>>         * Detailed PMD performance metrics available with new command
>>>> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
>>>> index 1df1c58..fd3fe21 100644
>>>> --- a/lib/dpdk-stub.c
>>>> +++ b/lib/dpdk-stub.c
>>>> @@ -56,6 +56,12 @@ dpdk_vhost_iommu_enabled(void)
>>>>        return false;
>>>>    }
>>>>    +bool
>>>> +dpdk_per_port_mempool_enabled(void)
>>>> +{
>>>> +    return false;
>>>> +}
>>>> +
>>>>    void
>>>>    print_dpdk_version(void)
>>>>    {
>>>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>>>> index 09afd8c..6a02a17 100644
>>>> --- a/lib/dpdk.c
>>>> +++ b/lib/dpdk.c
>>>> @@ -47,6 +47,7 @@ static char *vhost_sock_dir = NULL;   /* Location
>>>> of vhost-user sockets */
>>>>    static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU
>>>> support */
>>>>    static bool dpdk_initialized = false; /* Indicates successful
>>>> initialization
>>>>                                           * of DPDK. */
>>>> +static bool per_port_memory = false; /* Status of per port memory
>>>> support */
>>>>      static int
>>>>    process_vhost_flags(char *flag, const char *default_val, int size,
>>>> @@ -358,6 +359,11 @@ dpdk_init__(const struct smap *ovs_other_config)
>>>>        VLOG_INFO("IOMMU support for vhost-user-client %s.",
>>>>                   vhost_iommu_enabled ? "enabled" : "disabled");
>>>>    +    per_port_memory = smap_get_bool(ovs_other_config,
>>>> +                                    "per-port-memory", false);
>>>> +    VLOG_INFO("Per port mempool for DPDK devices %s.",
>>>> +              per_port_memory ? "enabled" : "disabled");
>>>> +
>>>>        argv = grow_argv(&argv, 0, 1);
>>>>        argc = 1;
>>>>        argv[0] = xstrdup(ovs_get_program_name());
>>>> @@ -515,6 +521,12 @@ dpdk_vhost_iommu_enabled(void)
>>>>        return vhost_iommu_enabled;
>>>>    }
>>>>    +bool
>>>> +dpdk_per_port_memory(void)
>>>> +{
>>>> +    return per_port_memory;
>>>> +}
>>>> +
>>>>    void
>>>>    dpdk_set_lcore_id(unsigned cpu)
>>>>    {
>>>> diff --git a/lib/dpdk.h b/lib/dpdk.h
>>>> index efdaa63..bbb89d4 100644
>>>> --- a/lib/dpdk.h
>>>> +++ b/lib/dpdk.h
>>>> @@ -39,6 +39,7 @@ void dpdk_init(const struct smap *ovs_other_config);
>>>>    void dpdk_set_lcore_id(unsigned cpu);
>>>>    const char *dpdk_get_vhost_sock_dir(void);
>>>>    bool dpdk_vhost_iommu_enabled(void);
>>>> +bool dpdk_per_port_memory(void);
>>>>    void print_dpdk_version(void);
>>>>    void dpdk_status(const struct ovsrec_open_vswitch *);
>>>>    #endif /* dpdk.h */
>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>> index 2e2f568..2c7ff81 100644
>>>> --- a/lib/netdev-dpdk.c
>>>> +++ b/lib/netdev-dpdk.c
>>>> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl =
>>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>    #define NETDEV_DPDK_MBUF_ALIGN      1024
>>>>    #define NETDEV_DPDK_MAX_PKT_LEN     9728
>>>>    -/* Min number of packets in the mempool.  OVS tries to allocate a
>>>> mempool with
>>>> - * roughly estimated number of mbufs: if this fails (because the
>>>> system doesn't
>>>> - * have enough hugepages) we keep halving the number until the
>>>> allocation
>>>> - * succeeds or we reach MIN_NB_MBUF */
>>>> +/* Max and min number of packets in the mempool.  OVS tries to
>>>> allocate a
>>>> + * mempool with MAX_NB_MBUF: if this fails (because the system
>>>> doesn't have
>>>> + * enough hugepages) we keep halving the number until the allocation
>>>> succeeds
>>>> + * or we reach MIN_NB_MBUF */
>>>> +
>>>> +#define MAX_NB_MBUF          (4096 * 64)
>>>>    #define MIN_NB_MBUF          (4096 * 4)
>>>>    #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>    +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF /
>>>> MIN_NB_MBUF)
>>>> +                  == 0);
>>>> +
>>>> +/* The smallest possible NB_MBUF that we're going to try should be a
>>>> multiple
>>>> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
>>>> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF /
>>>> MIN_NB_MBUF))
>>>> +                  % MP_CACHE_SZ == 0);
>>>> +
>>>>    /*
>>>>     * DPDK XSTATS Counter names definition
>>>>     */
>>>> @@ -297,12 +308,14 @@ static struct ovs_mutex dpdk_mp_mutex
>>>> OVS_ACQ_AFTER(dpdk_mutex)
>>>>        = OVS_MUTEX_INITIALIZER;
>>>>      /* Contains all 'struct dpdk_mp's. */
>>>> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
>>>> -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
>>>> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
>>>> +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>>>>    -/* Wrapper for a mempool released but not yet freed. */
>>>>    struct dpdk_mp {
>>>>         struct rte_mempool *mp;
>>>> +     int mtu;
>>>> +     int socket_id;
>>>> +     int refcount;
>>>>         struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>>>>     };
>>>>    @@ -383,7 +396,7 @@ struct netdev_dpdk {
>>>>          PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>>>            struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
>>>> -        struct rte_mempool *mp;
>>>> +        struct dpdk_mp *dpdk_mp;
>>>>              /* virtio identifier for vhost devices */
>>>>            ovsrcu_index vid;
>>>> @@ -544,73 +557,95 @@ dpdk_mp_full(const struct rte_mempool *mp)
>>>> OVS_REQUIRES(dpdk_mp_mutex)
>>>>         * likely be ok, as there are additional checks during mempool
>>>>         * freeing but it would make things racey.
>>>>         */
>>>> -    return rte_mempool_full(mp);
>>>> +    rte_mempool_full(mp);
>>>> +    return 0;
>>>
>>> I don't think you intended this? unless you are in the business of
>>> selling memory ;-)
>>
>> Ooops :-/ , This is left over from testing when mbufs are still in use,
>> not sure how I missed it. Will correct for v3.
>>
> 
> Maybe PATCH v1 ? or you are still testing out?

I was holding back in case there was any major gotchas on the v2, I also 
have the documentation completed for memory calculation so make sense no 
to go to v1.
> 
>>>
>>>>    }
>>>>      /* Free unused mempools. */
>>>>    static void
>>>> -dpdk_mp_sweep(void)
>>>> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>>>>    {
>>>>        struct dpdk_mp *dmp, *next;
>>>>    -    ovs_mutex_lock(&dpdk_mp_mutex);
>>>> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>>>> -        if (dpdk_mp_full(dmp->mp)) {
>>>> +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
>>>> +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
>>>>                VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
>>>>                ovs_list_remove(&dmp->list_node);
>>>>                rte_mempool_free(dmp->mp);
>>>>                rte_free(dmp);
>>>>            }
>>>>        }
>>>> -    ovs_mutex_unlock(&dpdk_mp_mutex);
>>>>    }
>>>>    -/* Ensure a mempool will not be freed. */
>>>> -static void
>>>> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>>>> +/* Calculating the required number of mbufs differs depending on the
>>>> + * mempool model being used. Check if per port mempools are in use
>>>> before
>>>> + * calculating.
>>>> + */
>>>> +static uint32_t
>>>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool
>>>> per_port_mp)
>>>>    {
>>>> -    struct dpdk_mp *dmp, *next;
>>>> +    uint32_t n_mbufs;
>>>>    -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>>>> -        if (dmp->mp == mp) {
>>>> -            VLOG_DBG("Removing mempool \"%s\" from free list",
>>>> dmp->mp->name);
>>>> -            ovs_list_remove(&dmp->list_node);
>>>> -            rte_free(dmp);
>>>> -            break;
>>>> +    if (!per_port_mp) {
>>>> +        /* Shared mempools are being used.
>>>> +         * XXX: this is a really rough method of provisioning memory.
>>>> +         * It's impossible to determine what the exact memory
>>>> requirements are
>>>> +         * when the number of ports and rxqs that utilize a
>>>> particular mempool
>>>> +         * can change dynamically at runtime. For now, use this rough
>>>> +         * heurisitic.
>>>> +         */
>>>> +        if (mtu >= ETHER_MTU) {
>>>> +            n_mbufs = MAX_NB_MBUF;
>>>> +        } else {
>>>> +            n_mbufs = MIN_NB_MBUF;
>>>>            }
>>>> +    } else {
>>>> +        /* Per port mempools are being used.
>>>> +         * XXX: rough estimation of number of mbufs required for
>>>> this port:
>>>> +         * <packets required to fill the device rxqs>
>>>> +         * + <packets that could be stuck on other ports txqs>
>>>> +         * + <packets in the pmd threads>
>>>> +         * + <additional memory for corner cases>
>>>> +         */
>>>> +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>>>> +                  + dev->requested_n_txq * dev->requested_txq_size
>>>> +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
>>>> NETDEV_MAX_BURST
>>>> +                  + MIN_NB_MBUF;
>>>>        }
>>>> +
>>>> +    return n_mbufs;
>>>>    }
>>>>    -/* Returns a valid pointer when either of the following is true:
>>>> - *  - a new mempool was just created;
>>>> - *  - a matching mempool already exists. */
>>>> -static struct rte_mempool *
>>>> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>>> +static struct dpdk_mp *
>>>> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>>>    {
>>>>        char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>>        const char *netdev_name = netdev_get_name(&dev->up);
>>>>        int socket_id = dev->requested_socket_id;
>>>>        uint32_t n_mbufs;
>>>>        uint32_t hash = hash_string(netdev_name, 0);
>>>> -    struct rte_mempool *mp = NULL;
>>>> -
>>>> -    /*
>>>> -     * XXX: rough estimation of number of mbufs required for this port:
>>>> -     * <packets required to fill the device rxqs>
>>>> -     * + <packets that could be stuck on other ports txqs>
>>>> -     * + <packets in the pmd threads>
>>>> -     * + <additional memory for corner cases>
>>>> -     */
>>>> -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>>>> -              + dev->requested_n_txq * dev->requested_txq_size
>>>> -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) *
>>>> NETDEV_MAX_BURST
>>>> -              + MIN_NB_MBUF;
>>>> +    struct dpdk_mp *dmp = NULL;
>>>> +    int ret;
>>>> +
>>>> +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
>>>> +    if (!dmp) {
>>>> +        return NULL;
>>>> +    }
>>>> +    dmp->socket_id = socket_id;
>>>> +    dmp->mtu = mtu;
>>>> +    dmp->refcount = 1;
>>>> +
>>>> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>>>>    -    ovs_mutex_lock(&dpdk_mp_mutex);
>>>>        do {
>>>>            /* Full DPDK memory pool name must be unique and cannot be
>>>> -         * longer than RTE_MEMPOOL_NAMESIZE. */
>>>> -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>>>> +         * longer than RTE_MEMPOOL_NAMESIZE. Note that for the shared
>>>> +         * mempool case this can result in one device using a mempool
>>>> +         * which references a different device in it's name. However as
>>>> +         * mempool names are hashed, the device name will not be
>>>> readable
>>>> +         * so this is not an issue for tasks such as debugging.
>>>> +         */
>>>> +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>>>>                               "ovs%08x%02d%05d%07u",
>>>>                               hash, socket_id, mtu, n_mbufs);
>>>>            if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>>>> @@ -626,96 +661,158 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>>>                      netdev_name, n_mbufs, socket_id,
>>>>                      dev->requested_n_rxq, dev->requested_n_txq);
>>>>    -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
>>>> -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
>>>> -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
>>>> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
>>>> +                                          MP_CACHE_SZ,
>>>> +                                          sizeof (struct dp_packet)
>>>> +                                          - sizeof (struct rte_mbuf),
>>>> +                                          MBUF_SIZE(mtu)
>>>> +                                          - sizeof(struct dp_packet),
>>>> +                                          socket_id);
>>>>    -        if (mp) {
>>>> +        if (dmp->mp) {
>>>>                VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
>>>>                         mp_name, n_mbufs);
>>>>                /* rte_pktmbuf_pool_create has done some initialization
>>>> of the
>>>> -             * rte_mbuf part of each dp_packet. Some OvS specific
>>>> fields
>>>> -             * of the packet still need to be initialized by
>>>> -             * ovs_rte_pktmbuf_init. */
>>>> -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
>>>> +             * rte_mbuf part of each dp_packet, while
>>>> ovs_rte_pktmbuf_init
>>>> +             * initializes some OVS specific fields of dp_packet.
>>>> +             */
>>>> +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
>>>> +            return dmp;
>>>>            } else if (rte_errno == EEXIST) {
>>>>                /* A mempool with the same name already exists.  We just
>>>>                 * retrieve its pointer to be returned to the caller. */
>>>> -            mp = rte_mempool_lookup(mp_name);
>>>> +            dmp->mp = rte_mempool_lookup(mp_name);
>>>>                /* As the mempool create returned EEXIST we can expect the
>>>>                 * lookup has returned a valid pointer.  If for some
>>>> reason
>>>>                 * that's not the case we keep track of it. */
>>>>                VLOG_DBG("A mempool with name \"%s\" already exists at
>>>> %p.",
>>>> -                     mp_name, mp);
>>>> -            /* Ensure this reused mempool will not be freed. */
>>>> -            dpdk_mp_do_not_free(mp);
>>>> +                     mp_name, dmp->mp);
>>>> +            return dmp;
>>>>            } else {
>>>> -            VLOG_ERR("Failed mempool \"%s\" create request of %u
>>>> mbufs",
>>>> -                     mp_name, n_mbufs);
>>>> +            VLOG_DBG("Failed to create mempool \"%s\" with a request
>>>> of "
>>>> +                     "%u mbufs, retrying with %u mbufs",
>>>> +                     mp_name, n_mbufs, n_mbufs/2);
>>>>            }
>>>> -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >=
>>>> MIN_NB_MBUF);
>>>> +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >=
>>>> MIN_NB_MBUF);
>>>>    -    ovs_mutex_unlock(&dpdk_mp_mutex);
>>>> -    return mp;
>>>> +    VLOG_ERR("Failed to create mempool \"%s\" with a request of %u
>>>> mbufs",
>>>> +             mp_name, n_mbufs);
>>>> +
>>>> +    rte_free(dmp);
>>>> +    return NULL;
>>>>    }
>>>>    -/* Release an existing mempool. */
>>>> -static void
>>>> -dpdk_mp_release(struct rte_mempool *mp)
>>>> +static struct dpdk_mp *
>>>> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>>>    {
>>>> -    if (!mp) {
>>>> -        return;
>>>> -    }
>>>> +    struct dpdk_mp *dmp, *next;
>>>> +    bool reuse = false;
>>>>          ovs_mutex_lock(&dpdk_mp_mutex);
>>>> -    if (dpdk_mp_full(mp)) {
>>>> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
>>>> -        rte_mempool_free(mp);
>>>> -    } else {
>>>> -        struct dpdk_mp *dmp;p
>>>> +    /* Check if shared mempools are being used, if so check existing
>>>> mempools
>>>> +     * to see if reuse is possible. */
>>>> +    if (!per_port_mp) {
>>>> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>> +            if (dmp->socket_id == dev->requested_socket_id
>>>> +                && dmp->mtu == mtu) {
>>>> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
>>>> +                dmp->refcount++;
>>>> +                reuse = true;
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +    /* Sweep mempools after reuse or before create. */
>>>> +    dpdk_mp_sweep();
>>>>    -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
>>>> +    if (!reuse) {
>>>> +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
>>>>            if (dmp) {
>>>> -            dmp->mp = mp;
>>>> -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
>>>> +            /* We need to check for the EEXIST case for per port
>>>> memory.
>>>
>>> I'm pretty sure this also applies to the shared mempool case, because
>>> you have rounded the mtu with dpdk_buf_size() so the requested_mtu might
>>> be different but the rounded value that is used for the mempool name
>>> might be the same, meaning that you can get an EEXISTS.
>>>
>>
>> Maybe I misunderstood but I would think in that case it will be caught
>> by the 'reuse' code above? If the rounded value of a new ports
>> requested_mtu is equal to that of an existing mempool mtu and its on the
>> same socket it will be reused, you would not get to dpdk_mp_create() as
>> the reuse condition will be true.
>>
> 
> Yes - you are right, it won't occur. Maybe you could add a small comment
> to say something like "Shared mempools will have reused and not
> requested a mempool that already exists."

Can do.

Ian
> 
>> I had tested this with 2 ports, one with a requested mtu of 1500 and a
>> second with a requested mtu of 1800. Both are rounded to 2030 and the
>> reuse case was hit. Do you think this wont always be the case though?
>>
> 
> No, it's right - also a good test to make sure that the mtu is updated
> correctly in that case :-)
> 
>>>> +             * Compare the mempool returned by dmp to each entry in
>>>> +             * dpdk_mp_list. If a match is found, free dmp as a new
>>>> entry
>>>> +             * is not required, set dmp to point to the existing
>>>> +             * entry and set refcount to 1 to avoid being freed at a
>>>> later
>>>> +             * stage.
>>>> +             */
>>>> +            if (per_port_mp && rte_errno == EEXIST) {
>>
>> If it's the case that shared_mempool will always reuse we may not need
>> the per_port_mp in the condition above.
>>
> 
> Probably it's ok either way, but might be better to leave it there,
> otherwise you'll have to think about where rte_errno is being set.
>  >>>> +                LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
>>>> +                    if (dmp->mp == next->mp) {
>>>> +                        rte_free(dmp);
>>>> +                        dmp = next;
>>>> +                        dmp->refcount = 1;
>>>> +                    }
>>>> +                }
>>>> +            }
>>>> +            else {
>>>> +                ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>>> +            }
>>>
>>> I think you need to increment refcount and use the safe list option. How
>>> about
>>
>> I spotted the other mails and responded there also, my plan is to change
>> to increment rather than = 1. but will keep the rte_free(dmp) and dmp =
>> next if that sounds ok?
>>
> 
> sounds good, thanks.
> 
>>>
>>> if (rte_errno == EEXIST) {
>>>       LIST_FOR_EACH_SAFE (next, list_node, &dpdk_mp_list) {
>>>           if (dmp->mp == next->mp) {
>>>               next->refcount++;
>>>               rte_free(dmp);
>>>               break;
>>>           }
>>>       }
>>> } else {
>>>       dmp->refcount++;
>>>       ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>>> }
>>>
>>>>            }
>>>>        }
>>>> +
>>>>        ovs_mutex_unlock(&dpdk_mp_mutex);
>>>> +
>>>> +    return dmp;
>>>>    }
>>>>    -/* Tries to allocate a new mempool - or re-use an existing one where
>>>> - * appropriate - on requested_socket_id with a size determined by
>>>> - * requested_mtu and requested Rx/Tx queues.
>>>> - * On success - or when re-using an existing mempool - the new
>>>> configuration
>>>> - * will be applied.
>>>> +/* Decrement reference to a mempool. */
>>>> +static void
>>>> +dpdk_mp_put(struct dpdk_mp *dmp)
>>>> +{
>>>> +    if (!dmp) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    ovs_mutex_lock(&dpdk_mp_mutex);
>>>> +    ovs_assert(dmp->refcount);
>>>> +    dmp->refcount--;
>>>> +    ovs_mutex_unlock(&dpdk_mp_mutex);
>>>> +}
>>>> +
>>>> +/* Depending on the memory model being used this function tries
>>>> + * identify and reuse an existing mempool or tries to allocate new
>>>> + * mempool on requested_socket_id with mbuf size corresponding to
>>>> + * requested_mtu. On success new configuration will be applied.
>>>>     * On error, device will be left unchanged. */
>>>>    static int
>>>>    netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>>>        OVS_REQUIRES(dev->mutex)
>>>>    {
>>>>        uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>>>> -    struct rte_mempool *mp;
>>>> +    struct dpdk_mp *mp;
>>>
>>> I think it should have a different name now, as it leads to mp->mp
>>> later. It would be good if you could make is consistent with the other
>>> functions (dmp?)
>>
>> Sure,  will do.
>>>
>>>>        int ret = 0;
>>>> +    bool per_port_mp = dpdk_per_port_memory();
>>>>    -    dpdk_mp_sweep();
>>>> +    /* With shared mempools we do not need to configure a mempool if
>>>> the MTU
>>>> +     * and socket ID have not changed, the previous configuration is
>>>> still
>>>> +     * valid so return 0 */
>>>> +    if (!per_port_mp && dev->mtu == dev->requested_mtu
>>>> +        && dev->socket_id == dev->requested_socket_id) {
>>>> +        return ret;
>>>> +    }
>>>>    -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
>>>> +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>>>>        if (!mp) {
>>>>            VLOG_ERR("Failed to create memory pool for netdev "
>>>>                     "%s, with MTU %d on socket %d: %s\n",
>>>>                     dev->up.name, dev->requested_mtu,
>>>> dev->requested_socket_id,
>>>>                     rte_strerror(rte_errno));
>>>> -        ret = rte_errno;
>>>> +        return rte_errno;
>>>>        } else {
>>>> -        /* If a new MTU was requested and its rounded value equals
>>>> the one
>>>> -         * that is currently used, then the existing mempool is
>>>> returned. */
>>>> -        if (dev->mp != mp) {
>>>> -            /* A new mempool was created, release the previous one. */
>>>> -            dpdk_mp_release(dev->mp);
>>>> -        } else {
>>>> -            ret = EEXIST;
>>>> +        /* Check for any pre-existing dpdk_mp for the device before
>>>> accessing
>>>> +         * the associated mempool.
>>>> +         */
>>>> +        if (dev->dpdk_mp != NULL) {
>>>> +            /* If a new MTU was requested and its rounded value does
>>>> not
>>>> +             * equal the rounded value of the current mempool,
>>>> decrement
>>>> +             * the reference count to devices current dpdk_mp as its
>>>> +             * associated mempool will not be used for this device.
>>>> +             */
>>>> +            if (dev->dpdk_mp->mp != mp->mp) {
>>>      
>>> Not sure if check is needed. If it were to be false, wouldn't we have
>>> already reused or got an EEXISTS and handled that case?
>>>
>>
>> Yes, your right, it can be removed.
>>
>>>> +                dpdk_mp_put(dev->dpdk_mp);
>>>> +            }
>>>>            }
>>>> -        dev->mp = mp;
>>>> +        dev->dpdk_mp = mp;
>>>>            dev->mtu = dev->requested_mtu;
>>>>            dev->socket_id = dev->requested_socket_id;
>>>>            dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>> @@ -838,7 +935,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev,
>>>> int n_rxq, int n_txq)
>>>>              for (i = 0; i < n_rxq; i++) {
>>>>                diag = rte_eth_rx_queue_setup(dev->port_id, i,
>>>> dev->rxq_size,
>>>> -                                          dev->socket_id, NULL,
>>>> dev->mp);
>>>> +                                          dev->socket_id, NULL,
>>>> +                                          dev->dpdk_mp->mp);
>>>>                if (diag) {
>>>>                    VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>>>>                              dev->up.name, i, rte_strerror(-diag));
>>>> @@ -926,7 +1024,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>>>        memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>>>>        rte_eth_link_get_nowait(dev->port_id, &dev->link);
>>>>    -    mbp_priv = rte_mempool_get_priv(dev->mp);
>>>> +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>>>>        dev->buf_size = mbp_priv->mbuf_data_room_size -
>>>> RTE_PKTMBUF_HEADROOM;
>>>>          /* Get the Flow control configuration for DPDK-ETH */
>>>> @@ -1180,7 +1278,7 @@ common_destruct(struct netdev_dpdk *dev)
>>>>        OVS_EXCLUDED(dev->mutex)
>>>>    {
>>>>        rte_free(dev->tx_q);
>>>> -    dpdk_mp_release(dev->mp);
>>>> +    dpdk_mp_put(dev->dpdk_mp);
>>>>          ovs_list_remove(&dev->list_node);
>>>>        free(ovsrcu_get_protected(struct ingress_policer *,
>>>> @@ -1933,7 +2031,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>>>>            return EAGAIN;
>>>>        }
>>>>    -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
>>>> +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>>>>                                        (struct rte_mbuf **)
>>>> batch->packets,
>>>>                                        NETDEV_MAX_BURST);
>>>>        if (!nb_rx) {
>>>> @@ -2172,7 +2270,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid,
>>>> struct dp_packet_batch *batch)
>>>>                continue;
>>>>            }
>>>>    -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
>>>> +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>>>>            if (OVS_UNLIKELY(!pkts[txcnt])) {
>>>>                dropped += cnt - i;
>>>>                break;
>>>> @@ -3052,7 +3150,7 @@ netdev_dpdk_get_mempool_info(struct
>>>> unixctl_conn *conn,
>>>>            ovs_mutex_lock(&dev->mutex);
>>>>            ovs_mutex_lock(&dpdk_mp_mutex);
>>>>    -        rte_mempool_dump(stream, dev->mp);
>>>> +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
>>>>              ovs_mutex_unlock(&dpdk_mp_mutex);
>>>>            ovs_mutex_unlock(&dev->mutex);
>>>> @@ -3749,7 +3847,7 @@ dpdk_vhost_reconfigure_helper(struct
>>>> netdev_dpdk *dev)
>>>>          err = netdev_dpdk_mempool_configure(dev);
>>>>        if (!err) {
>>>> -        /* A new mempool was created. */
>>>> +        /* A new mempool was created or re-used. */
>>>>            netdev_change_seq_changed(&dev->up);
>>>>        } else if (err != EEXIST){
>>>>            return err;
>>>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>>>> index 1e27a02..dbf70ac 100644
>>>> --- a/vswitchd/vswitch.xml
>>>> +++ b/vswitchd/vswitch.xml
>>>> @@ -359,6 +359,23 @@
>>>>            </p>
>>>>          </column>
>>>>    +      <column name="other_config" key="per-port-memory"
>>>> +              type='{"type": "boolean"}'>
>>>> +        <p>
>>>> +          By default OVS DPDK uses a shared mempool memory model
>>>> wherein
>>>> +          devices that have the same MTU and socket values can share
>>>> the same
>>>> +          mempool. Setting this value to <code>true</code> changes this
>>>> +          behaviour. Per port memory allow DPDK devices to use private
>>>> +          memory per device. This can provide greater transapraency as
>>>> +          regards memory usage but potentially at the cost of
>>>> greater memory
>>>> +          requirements.
>>>> +        </p>
>>>> +        <p>
>>>> +          Changing this value requires restarting the daemon if
>>>> dpdk-init has
>>>> +          already been set to true.
>>>> +        </p>
>>>> +      </column>
>>>> +
>>>>          <column name="other_config" key="tx-flush-interval"
>>>>                  type='{"type": "integer",
>>>>                         "minInteger": 0, "maxInteger": 1000000}'>
>>>>
>>>
>>
> 



More information about the dev mailing list