[ovs-dev] [PATCH v1] dpdk: Support both shared and per port mempools.

Ian Stokes ian.stokes at intel.com
Wed Jun 27 09:23:54 UTC 2018


On 6/26/2018 9:30 PM, Aaron Conole wrote:
> Hi Ian,
> 
> Ian Stokes <ian.stokes at intel.com> writes:
> 
>> This commit re-introduces the concept of shared mempools as the default
>> memory model for DPDK devices. Per port mempools are still available but
>> must be enabled explicitly by a user.
>>
>> OVS previously used a shared mempool model for ports with the same MTU
>> and socket configuration. This was replaced by a per port mempool model
>> to address issues flagged by users such as:
>>
>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
>>
>> However, the per port model can require an increase in memory resources to
>> support the same number of ports and the same configuration as the shared
>> mempool model.
>>
>> This is considered a blocking factor for current deployments of OVS when
>> upgrading to future OVS releases, as a user may have to redimension memory
>> for the same deployment configuration, which may not be possible for all
>> users.
>>
>> This commit resolves the issue by re-introducing shared mempools as
>> the default memory behaviour in OVS DPDK but also refactors the memory
>> configuration code to allow for per port mempools.
>>
>> This patch adds a new global config option, per-port-memory, that
>> controls whether per port mempools are used for DPDK devices.
>>
>>      ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>>
>> This value defaults to false; to enable per port memory support, this
>> field should be set to true when setting other global parameters on init
>> (such as "dpdk-socket-mem"). Changing the value at runtime is not
>> supported; the vswitch daemon must be restarted for a new value to take
>> effect.
>>
>> The mempool sweep functionality is also replaced with the
>> sweep functionality from OVS 2.9 found in commits
>>
>> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
>> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
>>
>> A new document to discuss the specifics of the memory models and example
>> memory requirement calculations is also added.
>>
>> Signed-off-by: Ian Stokes <ian.stokes at intel.com>
>> ---
>>   Documentation/automake.mk            |   1 +
>>   Documentation/intro/install/dpdk.rst |   6 +
>>   Documentation/topics/dpdk/index.rst  |   1 +
>>   Documentation/topics/dpdk/memory.rst | 216 +++++++++++++++++++++++++
>>   NEWS                                 |   1 +
>>   lib/dpdk-stub.c                      |   6 +
>>   lib/dpdk.c                           |  12 ++
>>   lib/dpdk.h                           |   1 +
>>   lib/netdev-dpdk.c                    | 296 +++++++++++++++++++++++------------
>>   vswitchd/vswitch.xml                 |  17 ++
>>   10 files changed, 457 insertions(+), 100 deletions(-)
>>   create mode 100644 Documentation/topics/dpdk/memory.rst
>>
>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>> index bc728df..2444794 100644
>> --- a/Documentation/automake.mk
>> +++ b/Documentation/automake.mk
>> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>>   	Documentation/topics/dpdk/index.rst \
>>   	Documentation/topics/dpdk/bridge.rst \
>>   	Documentation/topics/dpdk/jumbo-frames.rst \
>> +	Documentation/topics/dpdk/memory.rst \
>>   	Documentation/topics/dpdk/pdump.rst \
>>   	Documentation/topics/dpdk/phy.rst \
>>   	Documentation/topics/dpdk/pmd.rst \
>> diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
>> index 085e479..7dc75ef 100644
>> --- a/Documentation/intro/install/dpdk.rst
>> +++ b/Documentation/intro/install/dpdk.rst
>> @@ -170,6 +170,12 @@ Mount the hugepages, if not already mounted by default::
>>   
>>       $ mount -t hugetlbfs none /dev/hugepages``
>>   
>> +.. note::
>> +
>> +   The amount of hugepage memory required can be affected by various
>> +   aspects of the datapath and device configuration. Refer to
>> +   :doc:`/topics/dpdk/memory` for more details.
>> +
>>   .. _dpdk-vfio:
>>   
>>   Setup DPDK devices using VFIO
>> diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
>> index 181f61a..cf24a7b 100644
>> --- a/Documentation/topics/dpdk/index.rst
>> +++ b/Documentation/topics/dpdk/index.rst
>> @@ -40,3 +40,4 @@ The DPDK Datapath
>>      /topics/dpdk/qos
>>      /topics/dpdk/pdump
>>      /topics/dpdk/jumbo-frames
>> +   /topics/dpdk/memory
>> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
>> new file mode 100644
>> index 0000000..2e0b3f5
>> --- /dev/null
>> +++ b/Documentation/topics/dpdk/memory.rst
>> @@ -0,0 +1,216 @@
>> +..
>> +      Copyright (c) 2018 Intel Corporation
>> +
>> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
>> +      not use this file except in compliance with the License. You may obtain
>> +      a copy of the License at
>> +
>> +          http://www.apache.org/licenses/LICENSE-2.0
>> +
>> +      Unless required by applicable law or agreed to in writing, software
>> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
>> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
>> +      License for the specific language governing permissions and limitations
>> +      under the License.
>> +
>> +      Convention for heading levels in Open vSwitch documentation:
>> +
>> +      =======  Heading 0 (reserved for the title in a document)
>> +      -------  Heading 1
>> +      ~~~~~~~  Heading 2
>> +      +++++++  Heading 3
>> +      '''''''  Heading 4
>> +
>> +      Avoid deeper levels because they do not render well.
>> +
>> +=========================
>> +DPDK Device Memory Models
>> +=========================
>> +
>> +DPDK device memory can be allocated in one of two ways in OVS DPDK:
>> +**shared memory** or **per port memory**. The specifics of both are
>> +detailed below.
>> +
>> +Shared Memory
>> +-------------
>> +
>> +By default OVS DPDK uses a shared memory model. This means that multiple
>> +ports can share the same mempool. For example when a port is added it will
>> +have a given MTU and socket ID associated with it. If a mempool has been
>> +created previously for an existing port that has the same MTU and socket ID,
>> +that mempool is used for both ports. If there is no existing mempool
>> +supporting these parameters then a new mempool is created.
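>> +
>> +For example, two physical ports added with the same ``mtu_request`` value and
>> +attached to the same NUMA socket will be served by a single mempool. The
>> +interface names and PCI addresses below are illustrative only::
>> +
>> +   $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk \
>> +        options:dpdk-devargs=0000:01:00.0 mtu_request=9000
>> +   $ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk \
>> +        options:dpdk-devargs=0000:01:00.1 mtu_request=9000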
>> +
>> +Per Port Memory
>> +---------------
>> +
>> +In the per port memory model, mempools are created per device and are not
>> +shared. The benefit of this is a more transparent memory model where mempools
>> +will not be exhausted by other DPDK devices. However this can come at the
>> +cost of increased memory requirements for a given deployment. Users should
>> +be aware of the memory requirements for their deployment before using this
>> +model and allocate the required hugepage memory.
>> +
>> +Per port mempool support may be enabled via a global config value,
>> +``per-port-memory``. Setting this to true enables the per port memory
>> +model for all DPDK devices in OVS::
>> +
>> +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>> +
>> +.. important::
>> +
>> +    This value should be set before setting dpdk-init=true. If set after
>> +    dpdk-init=true then the daemon must be restarted to use per-port-memory.
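>> +
>> +For example, in a new deployment the option can be set alongside the other
>> +DPDK global parameters before DPDK initialization is requested::
>> +
>> +   $ ovs-vsctl --no-wait set Open_vSwitch . other_config:per-port-memory=true
>> +   $ ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true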
>> +
>> +Calculating Memory Requirements
>> +-------------------------------
>> +
>> +The amount of memory required for a given mempool can be calculated as the
>> +**number of mbufs in the mempool \* mbuf size**.
>> +
>> +Users should be aware of the following:
>> +
>> +* The **number of mbufs** per mempool will differ between memory models.
>> +
>> +* The **size of each mbuf** will be affected by the requested **MTU** size.
>> +
>> +.. important::
>> +
>> +   An mbuf size in bytes can be larger than the requested MTU size due to
>> +   alignment and rounding needed in OVS DPDK.
>> +
>> +Below are a number of examples of memory requirement calculations for both
>> +shared and per port memory models.
>> +
>> +Shared Memory Calculations
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +In the shared memory model the number of mbufs requested is directly
>> +affected by the requested MTU size as described in the table below.
>> +
>> ++--------------------+-------------+
>> +|      MTU Size      |  Num MBUFS  |
>> ++====================+=============+
>> +| 1500 or greater    |   262144    |
>> ++--------------------+-------------+
>> +| Less than 1500     |   16384     |
>> ++--------------------+-------------+
>> +
>> +.. important::
>> +
>> +   If a deployment does not have enough memory to provide 262144 mbufs, the
>> +   requested amount is halved repeatedly until the allocation succeeds or
>> +   16384 is reached.
>> +
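>> +For example, if each allocation attempt fails due to insufficient memory,
>> +the mbuf counts requested in turn would be::
>> +
>> + 262144, 131072, 65536, 32768, 16384
>> +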
>> +Example 1
>> ++++++++++
>> +::
>> +
>> + MTU = 1500 Bytes
>> + Number of mbufs = 262144
>> + Mbuf size = 3008 Bytes
>> + Memory required = 262144 * 3008 = 788 MB
>> +
>> +Example 2
>> ++++++++++
>> +::
>> +
>> + MTU = 1800 Bytes
>> + Number of mbufs = 262144
>> + Mbuf size = 3008 Bytes
>> + Memory required = 262144 * 3008 = 788 MB
>> +
>> +.. note::
>> +
>> +   Assuming the same socket is in use for example 1 and 2 the same mempool
>> +   would be shared.
>> +
>> +Example 3
>> ++++++++++
>> +::
>> +
>> + MTU = 6000 Bytes
>> + Number of mbufs = 262144
>> + Mbuf size = 8128 Bytes
>> + Memory required = 262144 * 8128 = 2130 MB
>> +
>> +Example 4
>> ++++++++++
>> +::
>> +
>> + MTU = 9000 Bytes
>> + Number of mbufs = 262144
>> + Mbuf size = 10176 Bytes
>> + Memory required = 262144 * 10176 = 2667 MB
>> +
>> +Per Port Memory Calculations
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Calculating the number of mbufs required in the per port model is more
>> +involved, as it accounts for multiple dynamic factors in the datapath and
>> +device configuration.
>> +
>> +A rough estimation of the number of mbufs required for a port is:
>> +::
>> +
>> + packets required to fill the device rxqs +
>> + packets that could be stuck on other ports txqs +
>> + packets on the pmd threads +
>> + additional corner case memory.
>> +
>> +The algorithm in OVS used to calculate this is as follows:
>> +::
>> +
>> + requested number of rxqs * requested rxq size +
>> + requested number of txqs * requested txq size +
>> + min(RTE_MAX_LCORE, requested number of rxqs) * netdev_max_burst +
>> + MIN_NB_MBUF.
>> +
>> +where:
>> +
>> +* **requested number of rxqs**: Number of requested receive queues for a
>> +  device.
>> +* **requested rxq size**: The number of descriptors requested for a rx queue.
>> +* **requested number of txqs**: Number of requested transmit queues for a
>> +  device. Calculated as the number of PMDs configured + 1.
>> +* **requested txq size**: The number of descriptors requested for a tx queue.
>> +* **min(RTE_MAX_LCORE, requested number of rxqs)**: Compare the maximum
>> +  number of lcores supported by DPDK to the number of requested receive
>> +  queues for the device and use the lesser of the two values.
>> +* **NETDEV_MAX_BURST**: Maximum number of packets in a burst, defined as
>> +  32.
>> +* **MIN_NB_MBUF**: Additional memory for corner cases, defined as 16384.
>> +
>> +For all examples below assume the following values:
>> +
>> +* requested_rxq_size = 2048
>> +* requested_txq_size = 2048
>> +* RTE_MAX_LCORE = 128
>> +* netdev_max_burst = 32
>> +* MIN_NB_MBUF = 16384
>> +
>> +Example 1: (1 rxq, 1 PMD, 1500 MTU)
>> ++++++++++++++++++++++++++++++++++++
>> +::
>> +
>> + MTU = 1500
>> + Number of mbufs = (1 * 2048) + (2 * 2048) + (1 * 32) + (16384) = 22560
>> + Mbuf size = 3008 Bytes
>> + Memory required = 22560 * 3008 = 67 MB
>> +
>> +Example 2: (1 rxq, 2 PMD, 6000 MTU)
>> ++++++++++++++++++++++++++++++++++++
>> +::
>> +
>> + MTU = 6000
>> + Number of mbufs = (1 * 2048) + (3 * 2048) + (1 * 32) + (16384) = 24608
>> + Mbuf size = 8128 Bytes
>> + Memory required = 24608 * 8128 = 200 MB
>> +
>> +Example 3: (2 rxq, 2 PMD, 9000 MTU)
>> ++++++++++++++++++++++++++++++++++++
>> +::
>> +
>> + MTU = 9000
>> + Number of mbufs = (2 * 2048) + (3 * 2048) + (2 * 32) + (16384) = 26688
>> + Mbuf size = 10176 Bytes
>> + Memory required = 26688 * 10176 = 271 MB
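>> +
>> +The per port examples above can be reproduced with a short, self-contained C
>> +sketch. It is purely illustrative and not part of OVS; the authoritative
>> +logic is dpdk_calculate_mbufs() in lib/netdev-dpdk.c, and the constants
>> +below are simply the example values assumed earlier::
>> +
>> +   #include <inttypes.h>
>> +   #include <stdio.h>
>> +
>> +   #define EXAMPLE_RTE_MAX_LCORE    128
>> +   #define EXAMPLE_NETDEV_MAX_BURST  32
>> +   #define EXAMPLE_MIN_NB_MBUF    16384
>> +
>> +   /* Rough per port mbuf estimate, mirroring the formula above. */
>> +   static uint32_t
>> +   example_per_port_mbufs(uint32_t n_rxq, uint32_t rxq_size,
>> +                          uint32_t n_txq, uint32_t txq_size)
>> +   {
>> +       uint32_t rxq_lcores = n_rxq < EXAMPLE_RTE_MAX_LCORE
>> +                             ? n_rxq : EXAMPLE_RTE_MAX_LCORE;
>> +
>> +       return n_rxq * rxq_size + n_txq * txq_size
>> +              + rxq_lcores * EXAMPLE_NETDEV_MAX_BURST
>> +              + EXAMPLE_MIN_NB_MBUF;
>> +   }
>> +
>> +   int
>> +   main(void)
>> +   {
>> +       /* Example 1: 1 rxq, 1 PMD (2 txqs), 2048 descriptors per queue. */
>> +       printf("%" PRIu32 "\n", example_per_port_mbufs(1, 2048, 2, 2048));
>> +       return 0;
>> +   }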
>> diff --git a/NEWS b/NEWS
>> index cd15a33..8428b44 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -34,6 +34,7 @@ Post-v2.9.0
>>          See Testing topic for the details.
>>        * Add LSC interrupt support for DPDK physical devices.
>>        * Allow init to fail and record DPDK status/version in OVS database.
>> +     * Support both shared and per port mempools for DPDK devices.
>>      - Userspace datapath:
>>        * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>>        * Detailed PMD performance metrics available with new command
>> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
>> index 1df1c58..1e0f461 100644
>> --- a/lib/dpdk-stub.c
>> +++ b/lib/dpdk-stub.c
>> @@ -56,6 +56,12 @@ dpdk_vhost_iommu_enabled(void)
>>       return false;
>>   }
>>   
>> +bool
>> +dpdk_per_port_memory(void)
>> +{
>> +    return false;
>> +}
>> +
>>   void
>>   print_dpdk_version(void)
>>   {
>> diff --git a/lib/dpdk.c b/lib/dpdk.c
>> index 5c68ce4..ca072a1 100644
>> --- a/lib/dpdk.c
>> +++ b/lib/dpdk.c
>> @@ -48,6 +48,7 @@ static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
>>   static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
>>   static bool dpdk_initialized = false; /* Indicates successful initialization
>>                                          * of DPDK. */
>> +static bool per_port_memory = false; /* Status of per port memory support */
>>   
>>   static int
>>   process_vhost_flags(char *flag, const char *default_val, int size,
>> @@ -384,6 +385,11 @@ dpdk_init__(const struct smap *ovs_other_config)
>>       VLOG_INFO("IOMMU support for vhost-user-client %s.",
>>                  vhost_iommu_enabled ? "enabled" : "disabled");
>>   
>> +    per_port_memory = smap_get_bool(ovs_other_config,
>> +                                    "per-port-memory", false);
>> +    VLOG_INFO("Per port mempool for DPDK devices %s.",
>> +              per_port_memory ? "enabled" : "disabled");
>> +
>>       argv = grow_argv(&argv, 0, 1);
>>       argc = 1;
>>       argv[0] = xstrdup(ovs_get_program_name());
>> @@ -541,6 +547,12 @@ dpdk_vhost_iommu_enabled(void)
>>       return vhost_iommu_enabled;
>>   }
>>   
>> +bool
>> +dpdk_per_port_memory(void)
>> +{
>> +    return per_port_memory;
>> +}
>> +
>>   void
>>   dpdk_set_lcore_id(unsigned cpu)
>>   {
>> diff --git a/lib/dpdk.h b/lib/dpdk.h
>> index efdaa63..bbb89d4 100644
>> --- a/lib/dpdk.h
>> +++ b/lib/dpdk.h
>> @@ -39,6 +39,7 @@ void dpdk_init(const struct smap *ovs_other_config);
>>   void dpdk_set_lcore_id(unsigned cpu);
>>   const char *dpdk_get_vhost_sock_dir(void);
>>   bool dpdk_vhost_iommu_enabled(void);
>> +bool dpdk_per_port_memory(void);
>>   void print_dpdk_version(void);
>>   void dpdk_status(const struct ovsrec_open_vswitch *);
>>   #endif /* dpdk.h */
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index 1bde9cf..c08b4f2 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>   #define NETDEV_DPDK_MBUF_ALIGN      1024
>>   #define NETDEV_DPDK_MAX_PKT_LEN     9728
>>   
>> -/* Min number of packets in the mempool.  OVS tries to allocate a mempool with
>> - * roughly estimated number of mbufs: if this fails (because the system doesn't
>> - * have enough hugepages) we keep halving the number until the allocation
>> - * succeeds or we reach MIN_NB_MBUF */
>> +/* Max and min number of packets in the mempool.  OVS tries to allocate a
>> + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
>> + * enough hugepages) we keep halving the number until the allocation succeeds
>> + * or we reach MIN_NB_MBUF */
>> +
>> +#define MAX_NB_MBUF          (4096 * 64)
>>   #define MIN_NB_MBUF          (4096 * 4)
>>   #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>   
>> +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
>> +                  == 0);
>> +
>> +/* The smallest possible NB_MBUF that we're going to try should be a multiple
>> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
>> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
>> +                  % MP_CACHE_SZ == 0);
>> +
>>   /*
>>    * DPDK XSTATS Counter names definition
>>    */
>> @@ -297,12 +308,14 @@ static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
>>       = OVS_MUTEX_INITIALIZER;
>>   
>>   /* Contains all 'struct dpdk_mp's. */
>> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
>> -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
>> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
>> +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>>   
>> -/* Wrapper for a mempool released but not yet freed. */
>>   struct dpdk_mp {
>>        struct rte_mempool *mp;
>> +     int mtu;
>> +     int socket_id;
>> +     int refcount;
>>        struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>>    };
>>   
>> @@ -384,7 +397,7 @@ struct netdev_dpdk {
>>   
>>       PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>           struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
>> -        struct rte_mempool *mp;
>> +        struct dpdk_mp *dpdk_mp;
>>   
>>           /* virtio identifier for vhost devices */
>>           ovsrcu_index vid;
>> @@ -550,68 +563,89 @@ dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>>   
>>   /* Free unused mempools. */
>>   static void
>> -dpdk_mp_sweep(void)
>> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>>   {
>>       struct dpdk_mp *dmp, *next;
>>   
>> -    ovs_mutex_lock(&dpdk_mp_mutex);
>> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>> -        if (dpdk_mp_full(dmp->mp)) {
>> +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
>> +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
>>               VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
>>               ovs_list_remove(&dmp->list_node);
>>               rte_mempool_free(dmp->mp);
>>               rte_free(dmp);
>>           }
>>       }
>> -    ovs_mutex_unlock(&dpdk_mp_mutex);
>>   }
>>   
>> -/* Ensure a mempool will not be freed. */
>> -static void
>> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>> +/* Calculating the required number of mbufs differs depending on the
>> + * mempool model being used. Check if per port mempools are in use before
>> + * calculating.
>> + */
>> +static uint32_t
>> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>   {
>> -    struct dpdk_mp *dmp, *next;
>> +    uint32_t n_mbufs;
>>   
>> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
>> -        if (dmp->mp == mp) {
>> -            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
>> -            ovs_list_remove(&dmp->list_node);
>> -            rte_free(dmp);
>> -            break;
>> +    if (!per_port_mp) {
>> +        /* Shared mempools are being used.
>> +         * XXX: this is a really rough method of provisioning memory.
>> +         * It's impossible to determine what the exact memory requirements are
>> +         * when the number of ports and rxqs that utilize a particular mempool
>> +         * can change dynamically at runtime. For now, use this rough
>> +         * heuristic.
>> +         */
>> +        if (mtu >= ETHER_MTU) {
>> +            n_mbufs = MAX_NB_MBUF;
>> +        } else {
>> +            n_mbufs = MIN_NB_MBUF;
>>           }
>> +    } else {
>> +        /* Per port mempools are being used.
>> +         * XXX: rough estimation of number of mbufs required for this port:
>> +         * <packets required to fill the device rxqs>
>> +         * + <packets that could be stuck on other ports txqs>
>> +         * + <packets in the pmd threads>
>> +         * + <additional memory for corner cases>
>> +         */
>> +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>> +                  + dev->requested_n_txq * dev->requested_txq_size
>> +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>> +                  + MIN_NB_MBUF;
>>       }
>> +
>> +    return n_mbufs;
>>   }
>>   
>> -/* Returns a valid pointer when either of the following is true:
>> - *  - a new mempool was just created;
>> - *  - a matching mempool already exists. */
>> -static struct rte_mempool *
>> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>> +static struct dpdk_mp *
>> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>   {
>>       char mp_name[RTE_MEMPOOL_NAMESIZE];
>>       const char *netdev_name = netdev_get_name(&dev->up);
>>       int socket_id = dev->requested_socket_id;
>>       uint32_t n_mbufs;
>>       uint32_t hash = hash_string(netdev_name, 0);
>> -    struct rte_mempool *mp = NULL;
>> -
>> -    /*
>> -     * XXX: rough estimation of number of mbufs required for this port:
>> -     * <packets required to fill the device rxqs>
>> -     * + <packets that could be stuck on other ports txqs>
>> -     * + <packets in the pmd threads>
>> -     * + <additional memory for corner cases>
>> -     */
>> -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
>> -              + dev->requested_n_txq * dev->requested_txq_size
>> -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
>> -              + MIN_NB_MBUF;
>> +    struct dpdk_mp *dmp = NULL;
>> +    int ret;
>> +
>> +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
>> +    if (!dmp) {
>> +        return NULL;
>> +    }
>> +    dmp->socket_id = socket_id;
>> +    dmp->mtu = mtu;
>> +    dmp->refcount = 1;
>> +
>> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>>   
>> -    ovs_mutex_lock(&dpdk_mp_mutex);
>>       do {
>>           /* Full DPDK memory pool name must be unique and cannot be
>> -         * longer than RTE_MEMPOOL_NAMESIZE. */
>> -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>> +         * longer than RTE_MEMPOOL_NAMESIZE. Note that for the shared
>> +         * mempool case this can result in one device using a mempool
>> +         * which references a different device in its name. However as
>> +         * mempool names are hashed, the device name will not be readable
>> +         * so this is not an issue for tasks such as debugging.
>> +         */
>> +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>>                              "ovs%08x%02d%05d%07u",
>>                              hash, socket_id, mtu, n_mbufs);
>>           if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
>> @@ -627,96 +661,157 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>>                     netdev_name, n_mbufs, socket_id,
>>                     dev->requested_n_rxq, dev->requested_n_txq);
>>   
>> -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
>> -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
>> -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
>> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
>> +                                          MP_CACHE_SZ,
>> +                                          sizeof (struct dp_packet)
>> +                                          - sizeof (struct rte_mbuf),
>> +                                          MBUF_SIZE(mtu)
>> +                                          - sizeof(struct dp_packet),
>> +                                          socket_id);
>>   
>> -        if (mp) {
>> +        if (dmp->mp) {
>>               VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
>>                        mp_name, n_mbufs);
>>               /* rte_pktmbuf_pool_create has done some initialization of the
>> -             * rte_mbuf part of each dp_packet. Some OvS specific fields
>> -             * of the packet still need to be initialized by
>> -             * ovs_rte_pktmbuf_init. */
>> -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
>> +             * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
>> +             * initializes some OVS specific fields of dp_packet.
>> +             */
>> +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
>> +            return dmp;
>>           } else if (rte_errno == EEXIST) {
>>               /* A mempool with the same name already exists.  We just
>>                * retrieve its pointer to be returned to the caller. */
>> -            mp = rte_mempool_lookup(mp_name);
>> +            dmp->mp = rte_mempool_lookup(mp_name);
>>               /* As the mempool create returned EEXIST we can expect the
>>                * lookup has returned a valid pointer.  If for some reason
>>                * that's not the case we keep track of it. */
>>               VLOG_DBG("A mempool with name \"%s\" already exists at %p.",
>> -                     mp_name, mp);
>> -            /* Ensure this reused mempool will not be freed. */
>> -            dpdk_mp_do_not_free(mp);
>> +                     mp_name, dmp->mp);
>> +            return dmp;
>>           } else {
>> -            VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
>> -                     mp_name, n_mbufs);
>> +            VLOG_DBG("Failed to create mempool \"%s\" with a request of "
>> +                     "%u mbufs, retrying with %u mbufs",
>> +                     mp_name, n_mbufs, n_mbufs / 2);
>>           }
>> -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
>> +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
>>   
>> -    ovs_mutex_unlock(&dpdk_mp_mutex);
>> -    return mp;
>> +    VLOG_ERR("Failed to create mempool \"%s\" with a request of %u mbufs",
>> +             mp_name, n_mbufs);
>> +
>> +    rte_free(dmp);
>> +    return NULL;
>>   }
>>   
>> -/* Release an existing mempool. */
>> -static void
>> -dpdk_mp_release(struct rte_mempool *mp)
>> +static struct dpdk_mp *
>> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>>   {
>> -    if (!mp) {
>> -        return;
>> -    }
>> +    struct dpdk_mp *dmp, *next;
>> +    bool reuse = false;
>>   
>>       ovs_mutex_lock(&dpdk_mp_mutex);
>> -    if (dpdk_mp_full(mp)) {
>> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
>> -        rte_mempool_free(mp);
>> -    } else {
>> -        struct dpdk_mp *dmp;
>> +    /* Check if shared mempools are being used, if so check existing mempools
>> +     * to see if reuse is possible. */
>> +    if (!per_port_mp) {
>> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>> +            if (dmp->socket_id == dev->requested_socket_id
>> +                && dmp->mtu == mtu) {
>> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
>> +                dmp->refcount++;
>> +                reuse = true;
>> +                break;
>> +            }
>> +        }
>> +    }
>> +    /* Sweep mempools after reuse or before create. */
>> +    dpdk_mp_sweep();
>>   
>> -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
>> +    if (!reuse) {
>> +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
>>           if (dmp) {
>> -            dmp->mp = mp;
>> -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
>> +            /* Shared mempools will hit the reuse case above so will not
>> +             * request a mempool that already exists but we need to check
>> +             * for the EEXIST case when per port memory is in use. Compare the
>> +             * mempool returned by dmp to each entry in dpdk_mp_list. If a
>> +             * match is found, free dmp as a new entry is not required, set
>> +             * dmp to point to the existing entry and increment the refcount
>> +             * to avoid being freed at a later stage.
>> +             */
>> +            if (per_port_mp && rte_errno == EEXIST) {
>> +                LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
>> +                    if (dmp->mp == next->mp) {
>> +                        rte_free(dmp);
>> +                        dmp = next;
>> +                        dmp->refcount++;
>> +                    }
>> +                }
>> +            } else {
>> +                ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
>> +            }
>>           }
>>       }
>> +
>> +
>> +    ovs_mutex_unlock(&dpdk_mp_mutex);
>> +
>> +    return dmp;
>> +}
>> +
>> +/* Decrement reference to a mempool. */
>> +static void
>> +dpdk_mp_put(struct dpdk_mp *dmp)
>> +{
>> +    if (!dmp) {
>> +        return;
>> +    }
>> +
>> +    ovs_mutex_lock(&dpdk_mp_mutex);
>> +    ovs_assert(dmp->refcount);
>> +    dmp->refcount--;
>>       ovs_mutex_unlock(&dpdk_mp_mutex);
>>   }
>>   
>> -/* Tries to allocate a new mempool - or re-use an existing one where
>> - * appropriate - on requested_socket_id with a size determined by
>> - * requested_mtu and requested Rx/Tx queues.
>> - * On success - or when re-using an existing mempool - the new configuration
>> - * will be applied.
>> +/* Depending on the memory model being used this function tries to
>> + * identify and reuse an existing mempool or tries to allocate a new
>> + * mempool on requested_socket_id with a mbuf size corresponding to
>> + * requested_mtu. On success the new configuration will be applied.
>>    * On error, device will be left unchanged. */
>>   static int
>>   netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>>       OVS_REQUIRES(dev->mutex)
>>   {
>>       uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>> -    struct rte_mempool *mp;
>> +    struct dpdk_mp *dmp;
>>       int ret = 0;
>> +    bool per_port_mp = dpdk_per_port_memory();
>>   
>> -    dpdk_mp_sweep();
>> +    /* With shared mempools we do not need to configure a mempool if the MTU
>> +     * and socket ID have not changed; the previous configuration is still
>> +     * valid, so return 0. */
>> +    if (!per_port_mp && dev->mtu == dev->requested_mtu
>> +        && dev->socket_id == dev->requested_socket_id) {
>> +        return ret;
>> +    }
>>   
>> -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
>> -    if (!mp) {
>> +    dmp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>> +    if (!dmp) {
>>           VLOG_ERR("Failed to create memory pool for netdev "
>>                    "%s, with MTU %d on socket %d: %s\n",
>>                    dev->up.name, dev->requested_mtu, dev->requested_socket_id,
>>                    rte_strerror(rte_errno));
>> -        ret = rte_errno;
>> +        return rte_errno;
>>       } else {
>> -        /* If a new MTU was requested and its rounded value equals the one
>> -         * that is currently used, then the existing mempool is returned. */
>> -        if (dev->mp != mp) {
>> -            /* A new mempool was created, release the previous one. */
>> -            dpdk_mp_release(dev->mp);
>> -        } else {
>> -            ret = EEXIST;
>> +        /* Check for any pre-existing dpdk_mp for the device before accessing
>> +         * the associated mempool.
>> +         */
>> +        if (dev->dpdk_mp != NULL) {
>> +            /* A new MTU was requested and its rounded value does not
>> +             * equal the rounded value of the current mempool; decrement
>> +             * the reference count of the device's current dpdk_mp as its
>> +             * associated mempool will not be used for this device.
>> +             */
>> +            dpdk_mp_put(dev->dpdk_mp);
>>           }
>> -        dev->mp = mp;
>> +        dev->dpdk_mp = dmp;
>>           dev->mtu = dev->requested_mtu;
>>           dev->socket_id = dev->requested_socket_id;
>>           dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>> @@ -855,7 +950,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>   
>>           for (i = 0; i < n_rxq; i++) {
>>               diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
>> -                                          dev->socket_id, NULL, dev->mp);
>> +                                          dev->socket_id, NULL,
>> +                                          dev->dpdk_mp->mp);
>>               if (diag) {
>>                   VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>>                             dev->up.name, i, rte_strerror(-diag));
>> @@ -950,7 +1046,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>>       memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>>       rte_eth_link_get_nowait(dev->port_id, &dev->link);
>>   
>> -    mbp_priv = rte_mempool_get_priv(dev->mp);
>> +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>>       dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
>>   
>>       /* Get the Flow control configuration for DPDK-ETH */
>> @@ -1204,7 +1300,7 @@ common_destruct(struct netdev_dpdk *dev)
>>       OVS_EXCLUDED(dev->mutex)
>>   {
>>       rte_free(dev->tx_q);
>> -    dpdk_mp_release(dev->mp);
>> +    dpdk_mp_put(dev->dpdk_mp);
>>   
>>       ovs_list_remove(&dev->list_node);
>>       free(ovsrcu_get_protected(struct ingress_policer *,
>> @@ -1957,7 +2053,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>>           return EAGAIN;
>>       }
>>   
>> -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
>> +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>>                                       (struct rte_mbuf **) batch->packets,
>>                                       NETDEV_MAX_BURST);
>>       if (!nb_rx) {
>> @@ -2196,7 +2292,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>>               continue;
>>           }
>>   
>> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
>> +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>>           if (OVS_UNLIKELY(!pkts[txcnt])) {
>>               dropped += cnt - i;
>>               break;
>> @@ -3075,7 +3171,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
>>           ovs_mutex_lock(&dev->mutex);
>>           ovs_mutex_lock(&dpdk_mp_mutex);
>>   
>> -        rte_mempool_dump(stream, dev->mp);
>> +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
>>   
>>           ovs_mutex_unlock(&dpdk_mp_mutex);
>>           ovs_mutex_unlock(&dev->mutex);
>> @@ -3772,7 +3868,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>>   
>>       err = netdev_dpdk_mempool_configure(dev);
>>       if (!err) {
>> -        /* A new mempool was created. */
>> +        /* A new mempool was created or re-used. */
>>           netdev_change_seq_changed(&dev->up);
>>       } else if (err != EEXIST){
>>           return err;
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index b93da69..b9aba8c 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -360,6 +360,23 @@
>>           </p>
>>         </column>
>>   
>> +      <column name="other_config" key="per-port-memory"
>> +              type='{"type": "boolean"}'>
>> +        <p>
>> +          By default OVS DPDK uses a shared memory model wherein devices
>> +          that have the same MTU and socket values can share the same
>> +          mempool. Setting this value to <code>true</code> changes this
>> +          behaviour. Per port memory allows DPDK devices to use private
>> +          memory per device. This can provide greater transapraency as
>                                                           ^ transparency
> 
> Just one additional spelling mistake :)
> 

Thanks for the catch, my dependency on vim's spellcheck has failed me :).

Ian
>> +          regards memory usage but potentially at the cost of greater memory
>> +          requirements.
>> +        </p>
>> +        <p>
>> +          Changing this value requires restarting the daemon if dpdk-init has
>> +          already been set to true.
>> +        </p>
>> +      </column>
>> +
>>         <column name="other_config" key="tx-flush-interval"
>>                 type='{"type": "integer",
>>                        "minInteger": 0, "maxInteger": 1000000}'>


