[ovs-dev] [PATCH v1] dpdk: Support both shared and per port mempools.
Ian Stokes
ian.stokes at intel.com
Tue Jun 26 18:59:28 UTC 2018
On 6/26/2018 3:33 PM, Kevin Traynor wrote:
> On 06/25/2018 12:56 PM, Ian Stokes wrote:
>> This commit re-introduces the concept of shared mempools as the default
>> memory model for DPDK devices. Per port mempools are still available but
>> must be enabled explicitly by a user.
>>
>> OVS previously used a shared mempool model for ports with the same MTU
>> and socket configuration. This was replaced by a per port mempool model
>> to address issues flagged by users such as:
>>
>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
>>
>> However the per port model potentially requires an increase in memory
>> resource requirements to support the same number of ports and configuration
>> as the shared port model.
>>
>> This is considered a blocking factor for current deployments of OVS
>> when upgrading to future OVS releases as a user may have to redimension
>> memory for the same deployment configuration. This may not be possible for
>> users.
>>
>> This commit resolves the issue by re-introducing shared mempools as
>> the default memory behaviour in OVS DPDK but also refactors the memory
>> configuration code to allow for per port mempools.
>>
>> This patch adds a new global config option, per-port-memory, that
>> controls the enablement of per port mempools for DPDK devices.
>>
>> ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>>
>> This value defaults to false; to enable per port memory support,
>> this field should be set to true when setting other global parameters
>> on init (such as "dpdk-socket-mem", for example). Changing the value at
>> runtime is not supported, and requires restarting the vswitch
>> daemon.
>>
>> The mempool sweep functionality is also replaced with the
>> sweep functionality from OVS 2.9 found in commits
>>
>> c77f692 (netdev-dpdk: Free mempool only when no in-use mbufs.)
>> a7fb0a4 (netdev-dpdk: Add mempool reuse/free debug.)
>>
>> A new document to discuss the specifics of the memory models and example
>> memory requirement calculations is also added.
>>
>> Signed-off-by: Ian Stokes <ian.stokes at intel.com>
>> ---
>> Documentation/automake.mk | 1 +
>> Documentation/intro/install/dpdk.rst | 6 +
>> Documentation/topics/dpdk/index.rst | 1 +
>> Documentation/topics/dpdk/memory.rst | 216 +++++++++++++++++++++++++
>> NEWS | 1 +
>> lib/dpdk-stub.c | 6 +
>> lib/dpdk.c | 12 ++
>> lib/dpdk.h | 1 +
>> lib/netdev-dpdk.c | 296 +++++++++++++++++++++++------------
>> vswitchd/vswitch.xml | 17 ++
>> 10 files changed, 457 insertions(+), 100 deletions(-)
>> create mode 100644 Documentation/topics/dpdk/memory.rst
>>
>> diff --git a/Documentation/automake.mk b/Documentation/automake.mk
>> index bc728df..2444794 100644
>> --- a/Documentation/automake.mk
>> +++ b/Documentation/automake.mk
>> @@ -36,6 +36,7 @@ DOC_SOURCE = \
>> Documentation/topics/dpdk/index.rst \
>> Documentation/topics/dpdk/bridge.rst \
>> Documentation/topics/dpdk/jumbo-frames.rst \
>> + Documentation/topics/dpdk/memory.rst \
>> Documentation/topics/dpdk/pdump.rst \
>> Documentation/topics/dpdk/phy.rst \
>> Documentation/topics/dpdk/pmd.rst \
>> diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst
>> index 085e479..7dc75ef 100644
>> --- a/Documentation/intro/install/dpdk.rst
>> +++ b/Documentation/intro/install/dpdk.rst
>> @@ -170,6 +170,12 @@ Mount the hugepages, if not already mounted by default::
>>
>> $ mount -t hugetlbfs none /dev/hugepages``
>>
>> +.. note::
>> +
>> + The amount of hugepage memory required can be affected by various
>> + aspects of the datapath and device configuration. Refer to
>> + :doc:`/topics/dpdk/memory` for more details.
>> +
>> .. _dpdk-vfio:
>>
>> Setup DPDK devices using VFIO
>> diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst
>> index 181f61a..cf24a7b 100644
>> --- a/Documentation/topics/dpdk/index.rst
>> +++ b/Documentation/topics/dpdk/index.rst
>> @@ -40,3 +40,4 @@ The DPDK Datapath
>> /topics/dpdk/qos
>> /topics/dpdk/pdump
>> /topics/dpdk/jumbo-frames
>> + /topics/dpdk/memory
>> diff --git a/Documentation/topics/dpdk/memory.rst b/Documentation/topics/dpdk/memory.rst
>> new file mode 100644
>> index 0000000..2e0b3f5
>> --- /dev/null
>> +++ b/Documentation/topics/dpdk/memory.rst
>> @@ -0,0 +1,216 @@
>> +..
>> + Copyright (c) 2018 Intel Corporation
>> +
>> + Licensed under the Apache License, Version 2.0 (the "License"); you may
>> + not use this file except in compliance with the License. You may obtain
>> + a copy of the License at
>> +
>> + http://www.apache.org/licenses/LICENSE-2.0
>> +
>> + Unless required by applicable law or agreed to in writing, software
>> + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
>> + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
>> + License for the specific language governing permissions and limitations
>> + under the License.
>> +
>> + Convention for heading levels in Open vSwitch documentation:
>> +
>> + ======= Heading 0 (reserved for the title in a document)
>> + ------- Heading 1
>> + ~~~~~~~ Heading 2
>> + +++++++ Heading 3
>> + ''''''' Heading 4
>> +
>> + Avoid deeper levels because they do not render well.
>> +
>> +=========================
>> +DPDK Device Memory Models
>> +=========================
>> +
>> +DPDK device memory can be allocated in one of two ways in OVS DPDK,
>> +**shared memory** or **per port memory**. The specifics of both are
>> +detailed below.
>> +
>> +Shared Memory
>> +-------------
>> +
>> +By default OVS DPDK uses a shared memory model. This means that multiple
>> +ports can share the same mempool. For example when a port is added it will
>> +have a given MTU and socket ID associated with it. If a mempool has been
>> +created previously for an existing port that has the same MTU and socket ID,
>> +that mempool is used for both ports. If there is no existing mempool
>> +supporting these parameters then a new mempool is created.
>> +
>> +Per Port Memory
>> +---------------
>> +
>> +In the per port memory model, mempools are created per device and are not
>> +shared. The benefit of this is a more transparent memory model where mempools
>> +will not be exhausted by other DPDK devices. However this comes at a potential
>> +increase in cost for memory dimensioning for a given deployment. Users should
>> +be aware of the memory requirements for their deployment before using this
>> +model and allocate the required hugepage memory.
>> +
>> +Per port mempool support may be enabled via a global config value,
>> +```per-port-memory```. Setting this to true enables the per port memory
>> +model for all DPDK devices in OVS::
>> +
>> + $ ovs-vsctl set Open_vSwitch . other_config:per-port-memory=true
>> +
>> +.. important::
>> +
>> + This value should be set before setting dpdk-init=true. If set after
>> + dpdk-init=true then the daemon must be restarted to use per-port-memory.
>> +
>> +Calculating Memory Requirements
>> +-------------------------------
>> +
>> +The amount of memory required for a given mempool can be calculated by the
>> +**number mbufs in the mempool \* mbuf size**.
>> +
>> +Users should be aware of the following:
>> +
>> +* The **number of mbufs** per mempool will differ between memory models.
>> +
>> +* The **size of each mbuf** will be affected by the requested **MTU** size.
>> +
>> +.. important::
>> +
>> + An mbuf size in bytes can be larger than the requested MTU size due to
>
> s/can be/is always/
>
>> + allignment and rounding needed in OVS DPDK.
>
> alignment
>
>> +
>> +Below are a number of examples of memory requirement calculations for both
>> +shared and per port memory models.
>> +
>
> -----8<-----
>
>>
>> -/* Tries to allocate a new mempool - or re-use an existing one where
>> - * appropriate - on requested_socket_id with a size determined by
>> - * requested_mtu and requested Rx/Tx queues.
>> - * On success - or when re-using an existing mempool - the new configuration
>> - * will be applied.
>> +/* Depending on the memory model being used this function tries
>
> tries to
>
>> + * identify and reuse an existing mempool or tries to allocate new
>> + * mempool on requested_socket_id with mbuf size corresponding to
>> + * requested_mtu. On success new configuration will be applied.
>> * On error, device will be left unchanged. */
>> static int
>> netdev_dpdk_mempool_configure(struct netdev_dpdk *dev)
>> OVS_REQUIRES(dev->mutex)
>> {
>> uint32_t buf_size = dpdk_buf_size(dev->requested_mtu);
>> - struct rte_mempool *mp;
>> + struct dpdk_mp *dmp;
>> int ret = 0;
>> + bool per_port_mp = dpdk_per_port_memory();
>>
>> - dpdk_mp_sweep();
>> + /* With shared mempools we do not need to configure a mempool if the MTU
>> + * and socket ID have not changed, the previous configuration is still
>> + * valid so return 0 */
>> + if (!per_port_mp && dev->mtu == dev->requested_mtu
>> + && dev->socket_id == dev->requested_socket_id) {
>> + return ret;
>> + }
>>
>> - mp = dpdk_mp_create(dev, FRAME_LEN_TO_MTU(buf_size));
>> - if (!mp) {
>> + dmp = dpdk_mp_get(dev, FRAME_LEN_TO_MTU(buf_size), per_port_mp);
>> + if (!dmp) {
>> VLOG_ERR("Failed to create memory pool for netdev "
>> "%s, with MTU %d on socket %d: %s\n",
>> dev->up.name, dev->requested_mtu, dev->requested_socket_id,
>> rte_strerror(rte_errno));
>> - ret = rte_errno;
>> + return rte_errno;
>
> No need to return here and have the else below. You could set ret or
> remove the else.
Sure I'll just set ret = rte_errno and return below.
>
>> } else {
>> - /* If a new MTU was requested and its rounded value equals the one
>> - * that is currently used, then the existing mempool is returned. */
>> - if (dev->mp != mp) {
>> - /* A new mempool was created, release the previous one. */
>> - dpdk_mp_release(dev->mp);
>> - } else {
>> - ret = EEXIST;
>> + /* Check for any pre-existing dpdk_mp for the device before accessing
>> + * the associated mempool.
>> + */
>> + if (dev->dpdk_mp != NULL) {
>> + /* A new MTU was requested and its rounded value does not
>> + * equal the rounded value of the current mempool, decrement
>> + * the reference count to the devices current dpdk_mp as its
>> + * associated mempool will not be used for this device.
>> + */
>
> The comment needs updating. You could be reusing the existing mempool,
> in which case it's still ok to mp_put because you have just incremented
> it...
Good catch, will update for the v2.
>
> About that, I know it was my comment to remove an unnecessary check here
> :-/ but i'm wondering if the refcount inc and then dec for the EEXISTS
> case is unintuitive and it would better the way you had it with set
> refcount =1 in mp_get, and then have the check for "if (dev->dpdk_mp !=
> dmp)" before you mp_put. What do you think?
>
I think the approach we have at the moment is better because it handles
the per port model of EEXIST and handles the reuse case for a single
port for the shared_memory model described below.
Add a port 'dpdk0' with default MTU 1500. The associated dpdk mempool
will have a refcount of 1.
Now set the mtu of 'dpdk0' to 1800. The existing dpdk mempool will be
reused and incremented in mp_get. refcount will now be 2 but this is
incorrect as there is only 1 port and it's reusing the same mempool, we
have to decrement the refcount by 1 for it to be correct.
The code as is handles that situation. If you checked for "if
(dev->dpdk_mp != dmp)" you wouldn't decrement in this case.
Maybe it needs a comment to this effect?
What are your thoughts?
Ian
>> + dpdk_mp_put(dev>dpdk_mp);
>
>
>
>> }
>> - dev->mp = mp;
>> + dev->dpdk_mp = dmp;
>> dev->mtu = dev->requested_mtu;
>> dev->socket_id = dev->requested_socket_id;
>> dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>> @@ -855,7 +950,8 @@ dpdk_eth_dev_port_config(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>
>> for (i = 0; i < n_rxq; i++) {
>> diag = rte_eth_rx_queue_setup(dev->port_id, i, dev->rxq_size,
>> - dev->socket_id, NULL, dev->mp);
>> + dev->socket_id, NULL,
>> + dev->dpdk_mp->mp);
>> if (diag) {
>> VLOG_INFO("Interface %s unable to setup rxq(%d): %s",
>> dev->up.name, i, rte_strerror(-diag));
>> @@ -950,7 +1046,7 @@ dpdk_eth_dev_init(struct netdev_dpdk *dev)
>> memcpy(dev->hwaddr.ea, eth_addr.addr_bytes, ETH_ADDR_LEN);
>> rte_eth_link_get_nowait(dev->port_id, &dev->link);
>>
>> - mbp_priv = rte_mempool_get_priv(dev->mp);
>> + mbp_priv = rte_mempool_get_priv(dev->dpdk_mp->mp);
>> dev->buf_size = mbp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM;
>>
>> /* Get the Flow control configuration for DPDK-ETH */
>> @@ -1204,7 +1300,7 @@ common_destruct(struct netdev_dpdk *dev)
>> OVS_EXCLUDED(dev->mutex)
>> {
>> rte_free(dev->tx_q);
>> - dpdk_mp_release(dev->mp);
>> + dpdk_mp_put(dev->dpdk_mp);
>>
>> ovs_list_remove(&dev->list_node);
>> free(ovsrcu_get_protected(struct ingress_policer *,
>> @@ -1957,7 +2053,7 @@ netdev_dpdk_vhost_rxq_recv(struct netdev_rxq *rxq,
>> return EAGAIN;
>> }
>>
>> - nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->mp,
>> + nb_rx = rte_vhost_dequeue_burst(vid, qid, dev->dpdk_mp->mp,
>> (struct rte_mbuf **) batch->packets,
>> NETDEV_MAX_BURST);
>> if (!nb_rx) {
>> @@ -2196,7 +2292,7 @@ dpdk_do_tx_copy(struct netdev *netdev, int qid, struct dp_packet_batch *batch)
>> continue;
>> }
>>
>> - pkts[txcnt] = rte_pktmbuf_alloc(dev->mp);
>> + pkts[txcnt] = rte_pktmbuf_alloc(dev->dpdk_mp->mp);
>> if (OVS_UNLIKELY(!pkts[txcnt])) {
>> dropped += cnt - i;
>> break;
>> @@ -3075,7 +3171,7 @@ netdev_dpdk_get_mempool_info(struct unixctl_conn *conn,
>> ovs_mutex_lock(&dev->mutex);
>> ovs_mutex_lock(&dpdk_mp_mutex);
>>
>> - rte_mempool_dump(stream, dev->mp);
>> + rte_mempool_dump(stream, dev->dpdk_mp->mp);
>>
>> ovs_mutex_unlock(&dpdk_mp_mutex);
>> ovs_mutex_unlock(&dev->mutex);
>> @@ -3772,7 +3868,7 @@ dpdk_vhost_reconfigure_helper(struct netdev_dpdk *dev)
>>
>> err = netdev_dpdk_mempool_configure(dev);
>> if (!err) {
>> - /* A new mempool was created. */
>> + /* A new mempool was created or re-used. */
>> netdev_change_seq_changed(&dev->up);
>> } else if (err != EEXIST){
>> return err;
>> diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
>> index b93da69..b9aba8c 100644
>> --- a/vswitchd/vswitch.xml
>> +++ b/vswitchd/vswitch.xml
>> @@ -360,6 +360,23 @@
>> </p>
>> </column>
>>
>> + <column name="other_config" key="per-port-memory"
>> + type='{"type": "boolean"}'>
>> + <p>
>> + By default OVS DPDK uses a shared memory model wherein devices
>> + that have the same MTU and socket values can share the same
>> + mempool. Setting this value to <code>true</code> changes this
>> + behaviour. Per port memory allow DPDK devices to use private
>> + memory per device. This can provide greater transapraency as
>> + regards memory usage but potentially at the cost of greater memory
>> + requirements.
>> + </p>
>> + <p>
>> + Changing this value requires restarting the daemon if dpdk-init has
>> + already been set to true.
>> + </p>
>> + </column>
>> +
>> <column name="other_config" key="tx-flush-interval"
>> type='{"type": "integer",
>> "minInteger": 0, "maxInteger": 1000000}'>
>>
>
More information about the dev
mailing list