[ovs-dev] [RFC PATCH v2 1/1] dpdk: Support both shared and per port mempools.

Kevin Traynor ktraynor at redhat.com
Tue Jun 19 11:11:37 UTC 2018


On 06/11/2018 05:37 PM, Ian Stokes wrote:
> This commit re-introduces the concept of shared mempools as the default
> memory model for DPDK devices. Per port mempools are still available but
> must be enabled explicitly by a user.
> 

Hi Ian, thanks for v2, comments below

> OVS previously used a shared mempool model for ports with the same MTU
> and socket configuration. This was replaced by a per port mempool model
> to address issues flagged by users such as:
> 
> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
> 
> However the per port model potentially increases the memory resources
> required to support the same number of ports and configuration as the
> shared mempool model.
> 
> This is considered a blocking factor for current deployments of OVS
> when upgrading to future OVS releases as a user may have to redimension
> memory for the same deployment configuration. This may not be possible for
> users.
> 
> This commit resolves the issue by re-introducing shared mempools as
> the default memory behaviour in OVS DPDK, while also refactoring the
> memory configuration code to allow for per port mempools.
> 
> This patch adds a new global config option, per-port-mp, that

s/per-port-mp/per-port-memory/

> controls the enablement of per port mempools for DPDK devices.
> 
>     ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
> 
> This value defaults to false; to enable per port memory support,
> this field should be set to true when setting other global parameters
> on init (such as "dpdk-socket-mem"). Changing the value at
> runtime is not supported, and requires restarting the vswitch
> daemon.
> 
> The mempool sweep functionality is also replaced with the
> sweep functionality from OVS 2.9 found in commits
> 
> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
> 
> As this patch is RFC there are a number of TO-DOs including adding a
> memory calculation section to the documentation for both models. This is
> expected to be completed in the v1 after RFC.
> 
> Signed-off-by: Ian Stokes <ian.stokes at intel.com>
> 
> ---
> v1 -> v2
> * Rebase to head of master.
> * Change global config option 'per-port-mp-enabled' to 'per-port-memory'
>   in commit message, documentation and code.
> * Specify in documentation that a restart of the daemon is only required if
>   per-port-memory is set after dpdk-init=true.
> * Return comment 'Contains all 'struct dpdk_mp's.' which was lost in
>   previous refactor.
> * Improve comments regarding unique mempool names in the shared mempool
>   usecase.
> * Check per_port_mp condition first when verifying if new mempool is
>   required in netdev_dpdk_mempool_configure() for the shared mempool
>   usecase.
> * Correctly return ret variable instead of 0 for function
>   netdev_dpdk_mempool_configure().
> * Move VLOG_ERR regarding failure to create mempool for a device to before
>   dpdk_mp_create() returns.
> * Add VLOG_DBG message flagging that the number of mbufs could not be
>   allocated and that it will be retried with half that amount.
> * Fix indentation of VLOG_ERR in netdev_dpdk_mempool_configure().
> * Handle EEXIST case for per port memory in function dpdk_mp_get() to
>   avoid duplication of dpdk_mps, correctly set refcount and return
>   correct dpdk_mp for the device to use.
> ---
>  Documentation/automake.mk            |   1 +
>  Documentation/topics/dpdk/index.rst  |   1 +
>  Documentation/topics/dpdk/memory.rst |  67 ++++++++
>  NEWS                                 |   1 +
>  lib/dpdk-stub.c                      |   6 +
>  lib/dpdk.c                           |  12 ++
>  lib/dpdk.h                           |   1 +
>  lib/netdev-dpdk.c                    | 298 +++++++++++++++++++++++------------
>  vswitchd/vswitch.xml                 |  17 ++
>  9 files changed, 304 insertions(+), 100 deletions(-)
>  create mode 100644 Documentation/topics/dpdk/memory.rst
> 
> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
> index 2202df4..141b33d 100644
> --- a/Documentation/automake.mk
> +++ b/Documentation/automake.mk
> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>  	Documentation/topics/dpdk/index.rst \
>  	Documentation/topics/dpdk/bridge.rst \
>  	Documentation/topics/dpdk/jumbo-frames.rst \
> +	Documentation/topics/dpdk/memory.rst \
>  	Documentation/topics/dpdk/pdump.rst \
>  	Documentation/topics/dpdk/phy.rst \
>  	Documentation/topics/dpdk/pmd.rst \
> diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
> index 181f61a..cf24a7b 100644
> --- a/Documentation/topics/dpdk/index.rst
> +++ b/Documentation/topics/dpdk/index.rst
> @@ -40,3 +40,4 @@ The DPDK Datapath
>     /topics/dpdk/qos
>     /topics/dpdk/pdump
>     /topics/dpdk/jumbo-frames
> +   /topics/dpdk/memory
> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
> new file mode 100644
> index 0000000..7c00f0f
> --- /dev/null
> +++ b/Documentation/topics/dpdk/memory.rst
> @@ -0,0 +1,67 @@
> +..
> +        Copyright 2018, Intel, Inc.
> +
> +      Licensed under the Apache License, Version 2.0 (the "License"); you may
> +      not use this file except in compliance with the License. You may obtain
> +      a copy of the License at
> +
> +          http://www.apache.org/licenses/LICENSE-2.0
> +
> +      Unless required by applicable law or agreed to in writing, software
> +      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
> +      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
> +      License for the specific language governing permissions and limitations
> +      under the License.
> +
> +      Convention for heading levels in Open vSwitch documentation:
> +
> +      =======  Heading 0 (reserved for the title in a document)
> +      -------  Heading 1
> +      ~~~~~~~  Heading 2
> +      +++++++  Heading 3
> +      '''''''  Heading 4
> +
> +      Avoid deeper levels because they do not render well.
> +
> +=========================
> +DPDK Device Memory Models
> +=========================
> +
> +DPDK device memory can be allocated in one of two ways in OVS DPDK:
> +shared mempools or per port mempools. The specifics of both are detailed
> +below.
> +
> +Shared Mempools
> +---------------
> +
> +By default OVS DPDK uses a shared mempool model. This means that multiple
> +ports can share the same mempool. For example, when a port is added it will
> +have a given MTU and socket ID associated with it. If a mempool has been
> +created previously for an existing port that has the same MTU and socket ID,
> +that mempool is used for both ports. If there is no existing mempool
> +supporting these parameters then a new mempool is created.
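Maybe worth a short example here, something along the lines of (interface
names and PCI addresses made up):

    $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk \
          options:dpdk-devargs=0000:01:00.0 mtu_request=9000
    $ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk \
          options:dpdk-devargs=0000:01:00.1 mtu_request=9000

i.e. two ports on the same socket with the same MTU that would end up sharing
a single mempool.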
> +
> +Per Port Mempools
> +-----------------
> +
> +In the per port mempool model, mempools are created per device and are not
> +shared. The benefit of this is a more transparent memory model where mempools
> +will not be exhausted by other DPDK devices. However this can come at the
> +cost of increased memory dimensioning for a given deployment. Users should
> +be aware of the memory requirements for their deployment before using this
> +model and allocate the required hugepage memory.
> +
> +Per port mempool support may be enabled via a global config value,
> +``per-port-memory``. Setting this to true enables the per port mempool
> +model for all DPDK devices in OVS::
> +
> +   $ ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
> +
> +.. important::
> +
> +    This value should be set before setting dpdk-init=true. If set after
> +    dpdk-init=true then the daemon must be restarted to use per-port-memory.
> +
> +.. todo::
> +
> +   Add memory calculation section for both mempool models.
> diff --git a/NEWS b/NEWS
> index 484c6dc..eff6cf1 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -33,6 +33,7 @@ Post-v2.9.0
>         See Testing topic for the details.
>       * Add LSC interrupt support for DPDK physical devices.
>       * Allow init to fail and record DPDK status/version in OVS database.
> +     * Support both shared and per port mempools for DPDK devices.
>     - Userspace datapath:
>       * Commands ovs-appctl dpif-netdev/pmd-*-show can now work on a single PMD
>       * Detailed PMD performance metrics available with new command
> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
> index 1df1c58..fd3fe21 100644
> --- a/lib/dpdk-stub.c
> +++ b/lib/dpdk-stub.c
> @@ -56,6 +56,12 @@ dpdk_vhost_iommu_enabled(void)
>      return false;
>  }
>  
> +bool
> +dpdk_per_port_mempool_enabled(void)
> +{
> +    return false;
> +}
> +
>  void
>  print_dpdk_version(void)
>  {
> diff --git a/lib/dpdk.c b/lib/dpdk.c
> index 09afd8c..6a02a17 100644
> --- a/lib/dpdk.c
> +++ b/lib/dpdk.c
> @@ -47,6 +47,7 @@ static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
>  static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
>  static bool dpdk_initialized = false; /* Indicates successful initialization
>                                         * of DPDK. */
> +static bool per_port_memory = false; /* Status of per port memory support */
>  
>  static int
>  process_vhost_flags(char *flag, const char *default_val, int size,
> @@ -358,6 +359,11 @@ dpdk_init__(const struct smap *ovs_other_config)
>      VLOG_INFO("IOMMU support for vhost-user-client %s.",
>                 vhost_iommu_enabled ? "enabled" : "disabled");
>  
> +    per_port_memory = smap_get_bool(ovs_other_config,
> +                                    "per-port-memory", false);
> +    VLOG_INFO("Per port mempool for DPDK devices %s.",
> +              per_port_memory ? "enabled" : "disabled");
> +
>      argv = grow_argv(&argv, 0, 1);
>      argc = 1;
>      argv[0] = xstrdup(ovs_get_program_name());
> @@ -515,6 +521,12 @@ dpdk_vhost_iommu_enabled(void)
>      return vhost_iommu_enabled;
>  }
>  
> +bool
> +dpdk_per_port_memory(void)
> +{
> +    return per_port_memory;
> +}
> +
>  void
>  dpdk_set_lcore_id(unsigned cpu)
>  {
> diff --git a/lib/dpdk.h b/lib/dpdk.h
> index efdaa63..bbb89d4 100644
> --- a/lib/dpdk.h
> +++ b/lib/dpdk.h
> @@ -39,6 +39,7 @@ void dpdk_init(const struct smap *ovs_other_config);
>  void dpdk_set_lcore_id(unsigned cpu);
>  const char *dpdk_get_vhost_sock_dir(void);
>  bool dpdk_vhost_iommu_enabled(void);
> +bool dpdk_per_port_memory(void);
>  void print_dpdk_version(void);
>  void dpdk_status(const struct ovsrec_open_vswitch *);
>  #endif /* dpdk.h */
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index 2e2f568..2c7ff81 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -91,13 +91,24 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>  #define NETDEV_DPDK_MBUF_ALIGN      1024
>  #define NETDEV_DPDK_MAX_PKT_LEN     9728
>  
> -/* Min number of packets in the mempool.  OVS tries to allocate a mempool with
> - * roughly estimated number of mbufs: if this fails (because the system doesn't
> - * have enough hugepages) we keep halving the number until the allocation
> - * succeeds or we reach MIN_NB_MBUF */
> +/* Max and min number of packets in the mempool.  OVS tries to allocate a
> + * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
> + * enough hugepages) we keep halving the number until the allocation succeeds
> + * or we reach MIN_NB_MBUF */
> +
> +#define MAX_NB_MBUF          (4096 * 64)
>  #define MIN_NB_MBUF          (4096 * 4)
>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>  
> +/* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
> +BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF)
> +                  == 0);
> +
> +/* The smallest possible NB_MBUF that we're going to try should be a multiple
> + * of MP_CACHE_SZ. This is advised by DPDK documentation. */
> +BUILD_ASSERT_DECL((MAX_NB_MBUF / ROUND_DOWN_POW2(MAX_NB_MBUF / MIN_NB_MBUF))
> +                  % MP_CACHE_SZ == 0);
> +
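FWIW, sanity checking the numbers here (assuming RTE_MEMPOOL_CACHE_MAX_SIZE
is still 512):

    MAX_NB_MBUF / MIN_NB_MBUF = 262144 / 16384 = 16 (already a power of two)
    smallest attempt = 262144 / 16 = 16384, and 16384 % 512 == 0

so both asserts hold.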
>  /*
>   * DPDK XSTATS Counter names definition
>   */
> @@ -297,12 +308,14 @@ static struct ovs_mutex dpdk_mp_mutex OVS_ACQ_AFTER(dpdk_mutex)
>      = OVS_MUTEX_INITIALIZER;
>  
>  /* Contains all 'struct dpdk_mp's. */
> -static struct ovs_list dpdk_mp_free_list OVS_GUARDED_BY(dpdk_mp_mutex)
> -    = OVS_LIST_INITIALIZER(&dpdk_mp_free_list);
> +static struct ovs_list dpdk_mp_list OVS_GUARDED_BY(dpdk_mp_mutex)
> +    = OVS_LIST_INITIALIZER(&dpdk_mp_list);
>  
> -/* Wrapper for a mempool released but not yet freed. */
>  struct dpdk_mp {
>       struct rte_mempool *mp;
> +     int mtu;
> +     int socket_id;
> +     int refcount;
>       struct ovs_list list_node OVS_GUARDED_BY(dpdk_mp_mutex);
>   };
>  
> @@ -383,7 +396,7 @@ struct netdev_dpdk {
>  
>      PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>          struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex);
> -        struct rte_mempool *mp;
> +        struct dpdk_mp *dpdk_mp;
>  
>          /* virtio identifier for vhost devices */
>          ovsrcu_index vid;
> @@ -544,73 +557,95 @@ dpdk_mp_full(const struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
>       * likely be ok, as there are additional checks during mempool
>       * freeing but it would make things racey.
>       */
> -    return rte_mempool_full(mp);
> +    rte_mempool_full(mp);
> +    return 0;

I don't think you intended this? unless you are in the business of
selling memory ;-)
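I'm assuming the intent was just to keep the original behaviour and return
the result directly, i.e.

    return rte_mempool_full(mp);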

>  }
>  
>  /* Free unused mempools. */
>  static void
> -dpdk_mp_sweep(void)
> +dpdk_mp_sweep(void) OVS_REQUIRES(dpdk_mp_mutex)
>  {
>      struct dpdk_mp *dmp, *next;
>  
> -    ovs_mutex_lock(&dpdk_mp_mutex);
> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> -        if (dpdk_mp_full(dmp->mp)) {
> +    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_list) {
> +        if (!dmp->refcount && dpdk_mp_full(dmp->mp)) {
>              VLOG_DBG("Freeing mempool \"%s\"", dmp->mp->name);
>              ovs_list_remove(&dmp->list_node);
>              rte_mempool_free(dmp->mp);
>              rte_free(dmp);
>          }
>      }
> -    ovs_mutex_unlock(&dpdk_mp_mutex);
>  }
>  
> -/* Ensure a mempool will not be freed. */
> -static void
> -dpdk_mp_do_not_free(struct rte_mempool *mp) OVS_REQUIRES(dpdk_mp_mutex)
> +/* Calculating the required number of mbufs differs depending on the
> + * mempool model being used. Check if per port mempools are in use before
> + * calculating.
> + */
> +static uint32_t
> +dpdk_calculate_mbufs(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>  {
> -    struct dpdk_mp *dmp, *next;
> +    uint32_t n_mbufs;
>  
> -    LIST_FOR_EACH_SAFE (dmp, next, list_node, &dpdk_mp_free_list) {
> -        if (dmp->mp == mp) {
> -            VLOG_DBG("Removing mempool \"%s\" from free list", dmp->mp->name);
> -            ovs_list_remove(&dmp->list_node);
> -            rte_free(dmp);
> -            break;
> +    if (!per_port_mp) {
> +        /* Shared mempools are being used.
> +         * XXX: this is a really rough method of provisioning memory.
> +         * It's impossible to determine what the exact memory requirements are
> +         * when the number of ports and rxqs that utilize a particular mempool
> +         * can change dynamically at runtime. For now, use this rough
> +         * heuristic.
> +         */
> +        if (mtu >= ETHER_MTU) {
> +            n_mbufs = MAX_NB_MBUF;
> +        } else {
> +            n_mbufs = MIN_NB_MBUF;
>          }
> +    } else {
> +        /* Per port mempools are being used.
> +         * XXX: rough estimation of number of mbufs required for this port:
> +         * <packets required to fill the device rxqs>
> +         * + <packets that could be stuck on other ports txqs>
> +         * + <packets in the pmd threads>
> +         * + <additional memory for corner cases>
> +         */
> +        n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> +                  + dev->requested_n_txq * dev->requested_txq_size
> +                  + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> +                  + MIN_NB_MBUF;
>      }
> +
> +    return n_mbufs;
>  }
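As a rough worked example of the per port sizing (hypothetical port with 2
rxqs and 2 txqs, assuming the default 2048 descriptor ring sizes and
NETDEV_MAX_BURST of 32):

    2 * 2048 + 2 * 2048 + 2 * 32 + 16384 = 24640 mbufs

Something like this could go into the memory.rst calculation section that is
marked as a todo.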
>  
> -/* Returns a valid pointer when either of the following is true:
> - *  - a new mempool was just created;
> - *  - a matching mempool already exists. */
> -static struct rte_mempool *
> -dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> +static struct dpdk_mp *
> +dpdk_mp_create(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>  {
>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>      const char *netdev_name = netdev_get_name(&dev->up);
>      int socket_id = dev->requested_socket_id;
>      uint32_t n_mbufs;
>      uint32_t hash = hash_string(netdev_name, 0);
> -    struct rte_mempool *mp = NULL;
> -
> -    /*
> -     * XXX: rough estimation of number of mbufs required for this port:
> -     * <packets required to fill the device rxqs>
> -     * + <packets that could be stuck on other ports txqs>
> -     * + <packets in the pmd threads>
> -     * + <additional memory for corner cases>
> -     */
> -    n_mbufs = dev->requested_n_rxq * dev->requested_rxq_size
> -              + dev->requested_n_txq * dev->requested_txq_size
> -              + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> -              + MIN_NB_MBUF;
> +    struct dpdk_mp *dmp = NULL;
> +    int ret;
> +
> +    dmp = dpdk_rte_mzalloc(sizeof *dmp);
> +    if (!dmp) {
> +        return NULL;
> +    }
> +    dmp->socket_id = socket_id;
> +    dmp->mtu = mtu;
> +    dmp->refcount = 1;
> +
> +    n_mbufs = dpdk_calculate_mbufs(dev, mtu, per_port_mp);
>  
> -    ovs_mutex_lock(&dpdk_mp_mutex);
>      do {
>          /* Full DPDK memory pool name must be unique and cannot be
> -         * longer than RTE_MEMPOOL_NAMESIZE. */
> -        int ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
> +         * longer than RTE_MEMPOOL_NAMESIZE. Note that for the shared
> +         * mempool case this can result in one device using a mempool
> +         * which references a different device in its name. However as
> +         * mempool names are hashed, the device name will not be readable
> +         * so this is not an issue for tasks such as debugging.
> +         */
> +        ret = snprintf(mp_name, RTE_MEMPOOL_NAMESIZE,
>                             "ovs%08x%02d%05d%07u",
>                             hash, socket_id, mtu, n_mbufs);
>          if (ret < 0 || ret >= RTE_MEMPOOL_NAMESIZE) {
> @@ -626,96 +661,158 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
>                    netdev_name, n_mbufs, socket_id,
>                    dev->requested_n_rxq, dev->requested_n_txq);
>  
> -        mp = rte_pktmbuf_pool_create(mp_name, n_mbufs, MP_CACHE_SZ,
> -                 sizeof (struct dp_packet) - sizeof (struct rte_mbuf),
> -                 MBUF_SIZE(mtu) - sizeof(struct dp_packet), socket_id);
> +        dmp->mp = rte_pktmbuf_pool_create(mp_name, n_mbufs,
> +                                          MP_CACHE_SZ,
> +                                          sizeof (struct dp_packet)
> +                                          - sizeof (struct rte_mbuf),
> +                                          MBUF_SIZE(mtu)
> +                                          - sizeof(struct dp_packet),
> +                                          socket_id);
>  
> -        if (mp) {
> +        if (dmp->mp) {
>              VLOG_DBG("Allocated \"%s\" mempool with %u mbufs",
>                       mp_name, n_mbufs);
>              /* rte_pktmbuf_pool_create has done some initialization of the
> -             * rte_mbuf part of each dp_packet. Some OvS specific fields
> -             * of the packet still need to be initialized by
> -             * ovs_rte_pktmbuf_init. */
> -            rte_mempool_obj_iter(mp, ovs_rte_pktmbuf_init, NULL);
> +             * rte_mbuf part of each dp_packet, while ovs_rte_pktmbuf_init
> +             * initializes some OVS specific fields of dp_packet.
> +             */
> +            rte_mempool_obj_iter(dmp->mp, ovs_rte_pktmbuf_init, NULL);
> +            return dmp;
>          } else if (rte_errno == EEXIST) {
>              /* A mempool with the same name already exists.  We just
>               * retrieve its pointer to be returned to the caller. */
> -            mp = rte_mempool_lookup(mp_name);
> +            dmp->mp = rte_mempool_lookup(mp_name);
>              /* As the mempool create returned EEXIST we can expect the
>               * lookup has returned a valid pointer.  If for some reason
>               * that's not the case we keep track of it. */
>              VLOG_DBG("A mempool with name \"%s\" already exists at %p.",
> -                     mp_name, mp);
> -            /* Ensure this reused mempool will not be freed. */
> -            dpdk_mp_do_not_free(mp);
> +                     mp_name, dmp->mp);
> +            return dmp;
>          } else {
> -            VLOG_ERR("Failed mempool \"%s\" create request of %u mbufs",
> -                     mp_name, n_mbufs);
> +            VLOG_DBG("Failed to create mempool \"%s\" with a request of "
> +                     "%u mbufs, retrying with %u mbufs",
> +                     mp_name, n_mbufs, n_mbufs/2);
>          }
> -    } while (!mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
> +    } while (!dmp->mp && rte_errno == ENOMEM && (n_mbufs /= 2) >= MIN_NB_MBUF);
>  
> -    ovs_mutex_unlock(&dpdk_mp_mutex);
> -    return mp;
> +    VLOG_ERR("Failed to create mempool \"%s\" with a request of %u mbufs",
> +             mp_name, n_mbufs);
> +
> +    rte_free(dmp);
> +    return NULL;
>  }
>  
> -/* Release an existing mempool. */
> -static void
> -dpdk_mp_release(struct rte_mempool *mp)
> +static struct dpdk_mp *
> +dpdk_mp_get(struct netdev_dpdk *dev, int mtu, bool per_port_mp)
>  {
> -    if (!mp) {
> -        return;
> -    }
> +    struct dpdk_mp *dmp, *next;
> +    bool reuse = false;
>  
>      ovs_mutex_lock(&dpdk_mp_mutex);
> -    if (dpdk_mp_full(mp)) {
> -        VLOG_DBG("Freeing mempool \"%s\"", mp->name);
> -        rte_mempool_free(mp);
> -    } else {
> -        struct dpdk_mp *dmp;
> +    /* Check if shared mempools are being used, if so check existing mempools
> +     * to see if reuse is possible. */
> +    if (!per_port_mp) {
> +        LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
> +            if (dmp->socket_id == dev->requested_socket_id
> +                && dmp->mtu == mtu) {
> +                VLOG_DBG("Reusing mempool \"%s\"", dmp->mp->name);
> +                dmp->refcount++;
> +                reuse = true;
> +                break;
> +            }
> +        }
> +    }
> +    /* Sweep mempools after reuse or before create. */
> +    dpdk_mp_sweep();
>  
> -        dmp = dpdk_rte_mzalloc(sizeof *dmp);
> +    if (!reuse) {
> +        dmp = dpdk_mp_create(dev, mtu, per_port_mp);
>          if (dmp) {
> -            dmp->mp = mp;
> -            ovs_list_push_back(&dpdk_mp_free_list, &dmp->list_node);
> +            /* We need to check for the EEXIST case for per port memory.

I'm pretty sure this also applies to the shared mempool case, because
you have rounded the mtu with dpdk_buf_size(), so the requested_mtu might
be different but the rounded value that is used for the mempool name
might be the same, meaning that you can get an EEXIST.
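e.g. as a rough illustration (with the 1024 byte NETDEV_DPDK_MBUF_ALIGN
rounding in dpdk_buf_size()), requested MTUs of 1500 and 1504 give

    FRAME_LEN_TO_MTU(dpdk_buf_size(1500)) == FRAME_LEN_TO_MTU(dpdk_buf_size(1504))

so two different requested_mtu values can feed the same rounded mtu into the
mempool name.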

> +             * Compare the mempool returned by dmp to each entry in
> +             * dpdk_mp_list. If a match is found, free dmp as a new entry
> +             * is not required, set dmp to point to the existing
> +             * entry and set refcount to 1 to avoid being freed at a later
> +             * stage.
> +             */
> +            if (per_port_mp && rte_errno == EEXIST) {
> +                LIST_FOR_EACH (next, list_node, &dpdk_mp_list) {
> +                    if (dmp->mp == next->mp) {
> +                        rte_free(dmp);
> +                        dmp = next;
> +                        dmp->refcount = 1;
> +                    }
> +                }
> +            }
> +            else {
> +                ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
> +            }

I think you need to increment refcount and use the safe list option. How
about

if (rte_errno == EEXIST) {
    LIST_FOR_EACH_SAFE (next, list_node, &dpdk_mp_list) {
        if (dmp->mp == next->mp) {
            next->refcount++;
            rte_free(dmp);
            break;
        }
    }
} else {
    dmp->refcount++;
    ovs_list_push_back(&dpdk_mp_list, &dmp->list_node);
}

>          }
>      }
> +
>      ovs_mutex_unlock(&dpdk_mp_mutex);
> +
> +    return dmp;
>  }
>  
> -/* Tries to allocate a new mempool - or re-use an existing one where
> - * appropriate - on requested_socket_id with a size determined by
> - * requested_mtu and requested Rx/Tx queues.
> - * On success - or when re-using an existing mempool - the new configuration
> - * will be applied.
> +/* Decrement reference to a mempool. */
> +static void
> +dpdk_mp_put(struct dpdk_mp *dmp)
> +{
> +    if (!dmp) {
> +        return;
> +    }
> +
> +    ovs_mutex_lock(&dpdk_mp_mutex);
> +    ovs_assert(dmp->refcount);
> +    dmp->refcount--;
> +    ovs_mutex_unlock(&dpdk_mp_mutex);
> +}
> +
> +/* Depending on the memory model being used this function tries to
> + * identify and reuse an existing mempool or tries to allocate a new
> + * mempool on requested_socket_id with mbuf size corresponding to
> + * requested_mtu. On success the new configuration will be applied.
>   * On error, device will be left unchanged. */
>  static int
>  netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>      OVS_REQUIRES(dev->mutex)
>  {
>      uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
> -    struct rte_mempool *mp;
> +    struct dpdk_mp *mp;

I think it should have a different name now, as it leads to mp->mp
later. It would be good if you could make it consistent with the other
functions (dmp?)

>      int ret = 0;
> +    bool per_port_mp = dpdk_per_port_memory();
>  
> -    dpdk_mp_sweep();
> +    /* With shared mempools we do not need to configure a mempool if the MTU
> +     * and socket ID have not changed, the previous configuration is still
> +     * valid so return 0 */
> +    if (!per_port_mp && dev->mtu == dev->requested_mtu
> +        && dev->socket_id == dev->requested_socket_id) {
> +        return ret;
> +    }
>  
> -    mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
> +    mp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>      if (!mp) {
>          VLOG_ERR("Failed to create memory pool for netdev "
>                   "%s, with MTU %d on socket %d: %s\n",
>                   dev->up.name, dev->requested_mtu, dev->requested_socket_id,
>                   rte_strerror(rte_errno));
> -        ret = rte_errno;
> +        return rte_errno;
>      } else {
> -        /* If a new MTU was requested and its rounded value equals the one
> -         * that is currently used, then the existing mempool is returned. */
> -        if (dev->mp != mp) {
> -            /* A new mempool was created, release the previous one. */
> -            dpdk_mp_release(dev->mp);
> -        } else {
> -            ret = EEXIST;
> +        /* Check for any pre-existing dpdk_mp for the device before accessing
> +         * the associated mempool.
> +         */
> +        if (dev->dpdk_mp != NULL) {
> +            /* If a new MTU was requested and its rounded value does not
> +             * equal the rounded value of the current mempool, decrement
> +             * the reference count to devices current dpdk_mp as its
> +             * associated mempool will not be used for this device.
> +             */
> +            if (dev->dpdk_mp->mp != mp->mp) {
	
Not sure if the check is needed. If it were false, wouldn't we have
already reused or got an EEXIST and handled that case?

> +                dpdk_mp_put(dev->dpdk_mp);
> +            }
>          }
> -        dev->mp = mp;
> +        dev->dpdk_mp = mp;
>          dev->mtu = dev->requested_mtu;
>          dev->socket_id = dev->requested_socket_id;
>          dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> @@ -838,7 +935,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>  
>          for (i = 0; i < n_rxq; i++) {
>              diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
> -                                          dev->socket_id, NULL, dev->mp);
> +                                          dev->socket_id, NULL,
> +                                          dev->dpdk_mp->mp);
>              if (diag) {
>                  VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>                            dev->up.name, i, rte_strerror(-diag));
> @@ -926,7 +1024,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>      memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>      rte_eth_link_get_nowait(dev->port_id, &dev->link);
>  
> -    mbp_priv = rte_mempool_get_priv(dev->mp);
> +    mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>      dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
>  
>      /* Get the Flow control configuration for DPDK-ETH */
> @@ -1180,7 +1278,7 @@ common_destruct(struct netdev_dpdk *dev)
>      OVS_EXCLUDED(dev->mutex)
>  {
>      rte_free(dev->tx_q);
> -    dpdk_mp_release(dev->mp);
> +    dpdk_mp_put(dev->dpdk_mp);
>  
>      ovs_list_remove(&dev->list_node);
>      free(ovsrcu_get_protected(struct ingress_policer *,
> @@ -1933,7 +2031,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>          return EAGAIN;
>      }
>  
> -    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
> +    nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>                                      (struct rte_mbuf **) batch->packets,
>                                      NETDEV_MAX_BURST);
>      if (!nb_rx) {
> @@ -2172,7 +2270,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>              continue;
>          }
>  
> -        pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
> +        pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>          if (OVS_UNLIKELY(!pkts[txcnt])) {
>              dropped += cnt - i;
>              break;
> @@ -3052,7 +3150,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
>          ovs_mutex_lock(&dev->mutex);
>          ovs_mutex_lock(&dpdk_mp_mutex);
>  
> -        rte_mempool_dump(stream, dev->mp);
> +        rte_mempool_dump(stream, dev->dpdk_mp->mp);
>  
>          ovs_mutex_unlock(&dpdk_mp_mutex);
>          ovs_mutex_unlock(&dev->mutex);
> @@ -3749,7 +3847,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>  
>      err = netdev_dpdk_mempool_configure(dev);
>      if (!err) {
> -        /* A new mempool was created. */
> +        /* A new mempool was created or re-used. */
>          netdev_change_seq_changed(&dev->up);
>      } else if (err != EEXIST){
>          return err;
> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
> index 1e27a02..dbf70ac 100644
> --- a/vswitchd/vswitch.xml
> +++ b/vswitchd/vswitch.xml
> @@ -359,6 +359,23 @@
>          </p>
>        </column>
>  
> +      <column name="other_config" key="per-port-memory"
> +              type='{"type": "boolean"}'>
> +        <p>
> +          By default OVS DPDK uses a shared mempool memory model wherein
> +          devices that have the same MTU and socket values can share the same
> +          mempool. Setting this value to <code>true</code> changes this
> +          behaviour. Per port memory allows DPDK devices to use private
> +          memory per device. This can provide greater transparency as
> +          regards memory usage but potentially at the cost of greater memory
> +          requirements.
> +        </p>
> +        <p>
> +          Changing this value requires restarting the daemon if dpdk-init has
> +          already been set to true.
> +        </p>
> +      </column>
> +
>        <column name="other_config" key="tx-flush-interval"
>                type='{"type": "integer",
>                       "minInteger": 0, "maxInteger": 1000000}'>
> 


