[ovs-dev] Mempool redesign for OVS 2.10
jan.scheurich at ericsson.com
Thu Apr 26 14:25:46 UTC 2018
Thanks, everyone, for re-opening the discussion around the new packet mempool handling for 2.10.
Before we agree on what to actually implement, I’d like to summarize my understanding of the requirements that have been discussed so far. Based on those I want to share some thoughts about how we can best address these requirements.
R1 (Backward compatibility):
The new mempool handling shall function as well as the OVS 2.9 design base given any specific configuration of OVS-DPDK: hugepage memory, PMDs, ports, queues, MTU sizes, traffic flows. This is to ensure that we can upgrade OVS in existing deployments without risk of breaking anything.
R2 (Dimensioning for static deployments):
It shall be possible for an operator to calculate the amount of memory needed for packet mempools in a given static (maximum) configuration (PMDs, ethernet ports and queues, maximum number of vhost ports, MTU sizes) to reserve sufficient hugepages for OVS.
R3 (Safe operation):
If the mempools are dimensioned correctly, it shall not be possible that OVS runs out of mbufs for packet processing.
R4 (Minimal footprint):
The packet mempool size needed for safe operation of OVS should be as small as possible.
R5 (Dynamic mempool allocation):
It should be possible to automatically adjust the size of packet mempools at run-time when changing the OVS configuration, e.g. adding PMDs, adding ports, adding rx/tx queues, changing the port MTU size. (Note: Shrinking the mempools when the OVS configuration is reduced is less important.)
Actual maximum mbuf consumption in OVS DPDK:
1. Phy rx queues: Sum over dpdk dev: (dev->requested_n_rxq * dev->requested_rxq_size)
Note: Normally the number of rx queues should not exceed the number of PMDs.
2. Phy tx queues: Sum over dpdk dev: (#active tx queues (=#PMDs) * dev->requested_txq_size)
Note 1: These are hogged because of DPDK PMD’s lazy release of transmitted mbufs.
Note 2: Stored mbufs in a tx queue are coming from all ports.
3. One rx batch per PMD during processing: #PMDs * 32
4. One batch per active tx queue for time-based batching: 32 * #devs * #PMDs
Assuming rx/tx queue size of 2K for physical ports and #rx queues = #PMDs (RSS), the upper limit for the used mbufs would be
(*1*) #dpdk devs * #PMDs * 4K + (#dpdk devs + #vhost devs) * #PMDs * 32 + #PMDs * 32
* With a typical NFVI deployment (2 DPDK devs, 4 PMDs, 128 vhost devs ) this yields 32K + 17K = 49K mbufs
* For a large NFVI deployment (4 DPDK devs, 8 PMDs, 256 vhost devs ) this would yield 128K + 66K = 194K mbufs
Roughly 1/3rd of the total mbufs are hogged in dpdk dev rx queues. The remaining 2/3rds are populated with an arbitrary mix of mbufs from all sources.
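The bound (*1*) is easy to get wrong by hand, so here is a minimal Python sketch of it. The function and parameter names are illustrative and not taken from the OVS code base. It assumes, as above, rx and tx queues of 2K descriptors each, one rx queue per PMD on every dpdk port, and batches of 32 packets.

```python
BATCH = 32  # packets per rx/tx batch, as used in the formulas above


def max_mbufs_in_use(dpdk_devs, pmds, vhost_devs, queue_size=2048):
    """Upper bound (*1*) on the mbufs OVS-DPDK can hold at any one time."""
    # rx and tx descriptor rings of the physical ports, one of each per PMD
    phy_queues = dpdk_devs * pmds * 2 * queue_size
    # one time-based tx batch per (device, PMD) pair
    tx_batches = (dpdk_devs + vhost_devs) * pmds * BATCH
    # one rx batch being processed per PMD
    rx_batches = pmds * BATCH
    return phy_queues + tx_batches + rx_batches


for name, cfg in {"typical": (2, 4, 128), "large": (4, 8, 256)}.items():
    n = max_mbufs_in_use(*cfg)
    print(f"{name}: {n} mbufs ({n / 1024:.1f}K)")
```

The exact results come out marginally below the rounded 49K and 194K figures, because the batch terms are rounded up to whole K in the text.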
Legacy shared mempool handling up to OVS 2.9:
* One mempool per NUMA node and used MTU size range.
* Each mempool has the maximum of (256K, 128K, 64K, 32K or 16K) mbufs available in DPDK at mempool creation.
* Each mempool is shared among all ports on its NUMA node with an MTU in its range.
* All rx queues of a port share the same mempool
The legacy code trivially satisfies R1. Its good feature is that the mempools are shared so that it avoids the bloating of dedicated mempools per port implied by the handling on master (see below).
Apart from that it does not fulfill any of the requirements.
* It swallows all available hugepage memory to allocate up to 256K mbufs per NUMA node, even though that is far more than typically needed (violating R4).
* The actual size of created mempools depends on the order of creation and the hugepage memory available. Early mempools are over-dimensioned, later mempools might be under-dimensioned. Operation is not at all safe (violating R3)
* It doesn’t provide any help for the operator to dimension and reserve hugepages for OVS (violating R2)
* The only dynamicity is that it creates additional mempools for new MTU size ranges only when they are needed. Due to greedy initial allocation these are likely to fail (violating R5).
My take is that even though the shared mempool concept is good, the legacy mempool handling should not be kept as is.
Mempool per port scheme (currently implemented on master):
From the above mbuf utilization calculation it is clear that only the dpdk rx queues are populated exclusively with mbufs from the port’s mempool. All other places are populated with mbufs from all ports, in the case of tx queues typically not even their own. As it is not possible to predict the assignment of rx queues to PMDs and the flow of packets between ports, safety requirement R3 implies that each port mempool must be dimensioned for the worst case, i.e.
[#PMDs * 2K ] + #dpdk devs * #PMDs * 2K + (#dpdk devs + #vhost devs) * #PMDs * 32 + #PMDs * 32
Even though the first term [#PMDs * 2K] is only needed for physical ports this almost multiplies the total number of mbufs needed (*1*) by the number of ports (dpdk and vhost) in the system.
* With a typical NFVI deployment (2 DPDK devs, 4 PMDs, 128 vhost devs ) this yields
2 * (24K + 17K) + 128 * (16K + 17K) = 4306K mbufs
* For a large NFVI deployment (4 DPDK devs, 8 PMDs, 256 vhost devs ) this would yield
4 * (80K + 66K) + 256 * (64K + 66K) = 33864K mbufs
The total mempool size required for safe operation is ridiculously high. Any attempt to bring the per-port mempool model on par with the memory consumption of a properly dimensioned shared mempool scheme will be inherently unsafe. This clearly indicates that a per-port mempool model is not adequate. The current per-port mempool scheme on master should be removed.
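To make the blow-up concrete, the following Python sketch evaluates the worst-case per-port formula next to the shared bound (*1*) and prints the resulting overhead factor. All names are illustrative. The assumptions are the same as above: 2K descriptors per queue, one rx queue per PMD, batches of 32.

```python
BATCH = 32


def shared_bound(dpdk_devs, pmds, vhost_devs, q=2048):
    """Bound (*1*): mbufs needed if all ports draw from shared pools."""
    return (dpdk_devs * pmds * 2 * q
            + (dpdk_devs + vhost_devs) * pmds * BATCH
            + pmds * BATCH)


def per_port_total(dpdk_devs, pmds, vhost_devs, q=2048):
    """Sum of all per-port pools when each one is sized for the worst case."""
    # Any port's mbufs may end up in any tx ring or batch buffer.
    common = (dpdk_devs * pmds * q
              + (dpdk_devs + vhost_devs) * pmds * BATCH
              + pmds * BATCH)
    phy_pool = pmds * q + common  # physical ports also fill their own rx rings
    vhost_pool = common
    return dpdk_devs * phy_pool + vhost_devs * vhost_pool


for cfg in [(2, 4, 128), (4, 8, 256)]:
    factor = per_port_total(*cfg) / shared_bound(*cfg)
    print(cfg, f"per-port needs {factor:.0f}x the shared bound")
```

The factor grows with the number of ports, which is the "almost multiplies by the number of ports" effect described above.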
One mempool per MTU range:
A similar argument as for the per-port mempool above also holds for the per-MTU range mempools used in the 2.9 design base. As the mbufs received on a port with a given MTU can be sent to any port in the system, each MTU range mempool must be dimensioned to a large fraction of the maximum total number of mbufs in use (*1*): 2/3rds + the number of rx queue descriptors for that MTU range.
Already with 2 different mbuf sizes (e.g. for MTU 9000 on phy ports and MTU 1500 on vhu ports), dimensioning each MTU-mempool safely can require more memory in total than using a single mempool of the maximum needed mbuf size for all ports.
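A small worked example in Python illustrates this for the typical deployment with jumbo frames on the phy ports and MTU 1500 on the vhu ports. The mbuf sizes (10KB and 2KB) are assumptions chosen for illustration, and as in (*1*) only phy rx queues are counted as holding rx descriptors.

```python
BATCH = 32
JUMBO_MBUF = 10 * 1024  # assumed buffer size for MTU 9000
SMALL_MBUF = 2 * 1024   # assumed buffer size for MTU 1500


def mtu_pool_mbufs(rx_descriptors_in_range, dpdk_devs, pmds, vhost_devs, q=2048):
    """Safe size of one MTU-range pool: shared part plus its own rx rings."""
    # tx rings and batch buffers can hold mbufs from any pool
    shared = (dpdk_devs * pmds * q
              + (dpdk_devs + vhost_devs) * pmds * BATCH
              + pmds * BATCH)
    return shared + rx_descriptors_in_range


dpdk_devs, pmds, vhost_devs, q = 2, 4, 128, 2048
phy_rx = dpdk_devs * pmds * q

jumbo_pool = mtu_pool_mbufs(phy_rx, dpdk_devs, pmds, vhost_devs)  # phy ports
small_pool = mtu_pool_mbufs(0, dpdk_devs, pmds, vhost_devs)       # vhu ports

two_pools_bytes = jumbo_pool * JUMBO_MBUF + small_pool * SMALL_MBUF
# A single pool needs bound (*1*), which here equals jumbo_pool mbufs.
single_pool_bytes = jumbo_pool * JUMBO_MBUF

print(f"two MTU pools: {two_pools_bytes / 2**20:.0f} MiB")
print(f"single pool:   {single_pool_bytes / 2**20:.0f} MiB")
```

In this configuration the jumbo pool alone is already as large as the single pool, so the second pool is pure overhead.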
To address R4 (minimal footprint) we could simplify the solution and give up the concept of one mempool per MTU range. There are three options:
* Configure an mbuf size for the single mempool, which then implies an upper limit on the configurable MTU per port.
* Replace the mempool with another mempool of larger mbufs when a port is configured with MTU that would not fit.
* Use the multi-segment mbuf approach (Intel WiP patch) to satisfy MTU sizes that do not fit the fixed mbuf-size.
Per PMD mempools:
The following arguments suggest that a mempool per PMD allocated on the PMD’s NUMA node might make good sense:
* The total mbufs in use by OVS cleanly partitions into subsets per PMD:
  * Packets hogged in dpdk rx queues are naturally owned by the PMD polling the rx queues
  * Each PMD typically has its dedicated dpdk tx queue, so that all mbufs hogged in that tx queue are owned by the PMD.
    (In the unusual case of shared tx queues we still need to assume the worst case that all mbufs belong to a single PMD.)
  * Also the mbufs in flight and in tx batching buffers are owned by the PMD.
With the same assumptions as above, the amount of mbufs in use by a single PMD is bounded by
(*2*) #dpdk devs * 4K + (#dpdk devs + #vhost devs) * 32 + 32
* For best performance mbufs being processed by a PMD thread should be local to the PMD’s NUMA socket. This is especially important for tx to vhostuser due to copying of entire packet content.
Today this is not the case for dpdk rx queues polled by remote PMDs (through rx queue pinning). All rx queues of a dpdk port are tied to a mempool on the NIC’s NUMA node. The “Fujitsu patch” presented at the OVS Conf 2016 showed that the performance of a remote PMD can be significantly improved by assigning a mempool local to the PMD for the pinned dpdk rx queue. In this case the DMA engine of the NIC takes care of the QPI bus transfer and the PMD is not burdened. DPDK supports this model as the mempool for eth devices is configurable per rx queue, not per port.
* Using the above dimensioning formula, requirements R1 to R4 could be fulfilled by a mempool per PMD in the same way as by per-NUMA mempools globally shared by all PMDs on that NUMA node. Requirement R5 (Dynamic allocation) would to some extent also be fulfilled, as mempools could be added or deleted dynamically when PMDs are added to or deleted from OVS.
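As a sanity check on the partitioning argument, this Python sketch evaluates the per-PMD bound (*2*) and confirms that summing it over all PMDs gives exactly the global bound (*1*). Names are illustrative, and the same 2K queue size and batch size of 32 are assumed.

```python
BATCH = 32


def mbufs_per_pmd(dpdk_devs, vhost_devs, q=2048):
    """Bound (*2*): mbufs that a single PMD can own at any one time."""
    return dpdk_devs * 2 * q + (dpdk_devs + vhost_devs) * BATCH + BATCH


def mbufs_total(dpdk_devs, pmds, vhost_devs, q=2048):
    """Bound (*1*): mbufs in use across the whole datapath."""
    return (dpdk_devs * pmds * 2 * q
            + (dpdk_devs + vhost_devs) * pmds * BATCH
            + pmds * BATCH)


for dpdk_devs, pmds, vhost_devs in [(2, 4, 128), (4, 8, 256)]:
    per_pmd = mbufs_per_pmd(dpdk_devs, vhost_devs)
    # The per-PMD pools add up to the global bound, with nothing left over.
    assert per_pmd * pmds == mbufs_total(dpdk_devs, pmds, vhost_devs)
    print(f"{pmds} PMDs x {per_pmd} mbufs = {per_pmd * pmds} mbufs")
```

So splitting into per-PMD pools costs no extra mbufs compared with a correctly dimensioned shared pool.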
I would suggest to aim for a new mempool handling along the following lines:
* Create mempools per PMD based on the above formula (*2*) using reasonable hard-coded default bounds for #dpdk devs (e.g. 8) and #vhost devs (256) such that the total memory remains below the 2.9 legacy.
* Improvement: make these bounds configurable.
* Use the “Fujitsu patch approach” and assign the dpdk rx queue to the mempool of the polling PMD.
* Avoid the complexity and memory waste with multiple mempools per PMD for different MTU sizes.
Use one configurable common mbuf size (default e.g. 3x1024 bytes (3KB), covering most common MTU sizes) and multi-segment mbufs to handle larger port MTUs. For optimal jumbo frame performance, users would configure 10KB mbufs at the price of more memory needed.
Assuming 8 PMDs, 8 dpdk devs, 256 vhost devs, 2K descriptors per dpdk rx/tx queue and 3KB mbuf size, the resulting overall hugepage memory requirements for packet mempools would be:
8 * 4K + 264 * 32 + 32 mbufs = 41K mbufs per PMD
4 PMD * 41K mbufs/PMD * 3KB ~= 512 MB per NUMA node with equal 4:4 PMD distribution
So a typical OVS deployment with 1GB hugepage memory per NUMA socket should be more than sufficient to cover the memory requirement for the proposed default mempool scheme for large NFVI deployments. Assigning 2 GB per NUMA should already cover the memory need for unsegmented 9KB jumbo frames.
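The hugepage estimate can be reproduced with a few lines of Python. This is a rough sketch with illustrative names: it counts only the configured mbuf size and ignores per-mbuf metadata and mempool overhead, so real consumption will be somewhat higher.

```python
BATCH = 32
MIB = 1024 * 1024


def pool_bytes_per_numa(pmds_on_node, dpdk_devs, vhost_devs, mbuf_size, q=2048):
    """Hugepage bytes for the per-PMD pools on one NUMA node (data room only)."""
    per_pmd = dpdk_devs * 2 * q + (dpdk_devs + vhost_devs) * BATCH + BATCH
    return pmds_on_node * per_pmd * mbuf_size


# 8 PMDs split 4:4 over two NUMA nodes, 8 dpdk devs, 256 vhost devs
for label, mbuf_size in [("3KB mbufs", 3 * 1024), ("9KB mbufs", 9 * 1024)]:
    need = pool_bytes_per_numa(4, 8, 256, mbuf_size)
    print(f"{label}: {need / MIB:.0f} MiB per NUMA node")
```

With 3KB mbufs the result stays below 512 MiB per node, and with 9KB mbufs it stays below 2 GiB, in line with the estimates above.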
For better compatibility with OVS 2.9 in small test setups we could consider maintaining a scheme to reduce the above default #mbufs per PMD mempool successively until they fit the available hugepage memory. In that case it would be good to have a WARN message in the log indicating if the created mempools are not sufficient to handle the actual DPDK datapath configuration safely.
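One possible shape for such a fallback is sketched below in Python. It is only an illustration of the idea: halving is an assumption (mirroring the 256K, 128K, ... sequence of the legacy code), and the function name, the minimum pool size and the log texts are all hypothetical.

```python
import logging


def fit_pool(required_mbufs, mbuf_size, available_bytes, min_mbufs=1024):
    """Halve the pool size until it fits into the available memory.

    Returns (n_mbufs, safe). n_mbufs is None if not even a minimal pool
    fits; safe is True only if the full required size could be allocated.
    """
    n = required_mbufs
    while n >= min_mbufs and n * mbuf_size > available_bytes:
        n //= 2
    if n < min_mbufs:
        return None, False
    return n, n >= required_mbufs


required = 41248  # per-PMD bound for 8 dpdk devs and 256 vhost devs
n, safe = fit_pool(required, 3 * 1024, 64 * 1024 * 1024)
if n is None:
    logging.error("not enough hugepage memory for a packet mempool")
elif not safe:
    logging.warning("mempool reduced to %d mbufs, %d needed for safe operation",
                    n, required)
```

Returning the "safe" flag separately keeps the sizing logic free of logging, so the caller can decide how loudly to report an under-dimensioned pool.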
Comments are welcome!
> >> Hi all,
> >> Now seems a good time to kick start this conversation again as there's a few patches floating around for mempools on master and
> >> I'm happy to work on a solution for this but before starting I'd like to agree on the requirements so we're all comfortable with the
> > Thanks for kicking it off Ian. FWIW, the freeing fix code can work with
> > both schemes below. I already have that between the patches for
> > different branches. It should be straightforward to change to cover both
> > in same code. I can help with that if needed.
> Agree, there is not much difference between mempool models for the freeing fix.
> >> I see two use cases above, static and dynamic. Each have their own requirements (I'm keeping OVS 2.10 in mind here as it's an issue we need to resolve).
> >> Static environment
> >> 1. For a given deployment, the 2.10 mempool design should use the same or less memory as the shared mempool design of
> >> 2. Memory pool size can depend on static datapath configurations, but the previous provisioning used in OVS 2.9 is acceptable also.
> >> I think the shared mempool model suits the static environment, it's a rough way of provisioning memory but it works for the majority involved in the discussion to date.
> >> Dynamic environment
> >> 1. Mempool size should not depend on dynamic characteristics (number of PMDs, number of ports etc.), this leads to frequent traffic interrupts.
> > If that is wanted I think you need to distinguish between port related
> > dynamic characteristics and non-port related. At present the per port
> > scheme depends on number of rx/tx queues and the size of rx/tx queues.
> > Also, txq's depends on number of PMDs. All of which can be changed
> > dynamically.
> Changing the mempool size is too heavy an operation. We should
> avoid it somehow as long as possible.
> It'll be cool to have some kind of dynamic mempool resize API from
> DPDK, but there is no such concept right now. Maybe it'll be good if
> the DPDK API allowed adding more than one mempool for a device. Such an API
> could allow us to dynamically increase/decrease the total amount of
> memory available for a single port. We should definitely think about
> something like this in the future.
> >> 2. Due to the dynamic environment, it's preferable for clear visibility of memory usage for ports (Sharing mempools violates this).
> >> The current per port model suits the dynamic environment.
> >> I'd like to propose for 2.10 that we implement a model to allow both:
> >> * When adding a port the shared mempool model would be the default behavior. This would satisfy users moving from previous OVS releases to 2.10 as memory requirements would be in line with what was previously expected and no new options/arguments are needed.
> > +1
> It's OK for me too.
> >> * Per port mempool is available but must be requested by a user, it would require a new option argument when adding a port.
> > I'm not sure there needs to be an option *per port*. The implication is
> > that some mempools would be created exclusively for a single port, while
> > others would be available to share and this would operate at the same time.
> > I think a user would either have an unknown or high number of ports and
> > are ok with provisioning the amount of memory for shared mempools, or
> > they know they will have only a few ports and can benefit from using
> > less memory.
> Unknown/big but limited number of ports could also be a scenario for
> separate mempool model, especially for dynamic case.
> > Although, while it is desirable to reduce memory usage, I've never
> > actually heard anyone complaining about the amount of memory needed for
> > shared mempools and requesting it to be reduced.
> I agree that per-port option looks like more than users could need.
> Maybe global config will be better.
> There is one more thing: Users like OpenStack are definitely "dynamic".
> Addition of the new special parameter will require them to modify their
> code to have more or less manageable memory consumption.
> P.S. Meanwhile, I will be out of office until May 3 and will not be able
> to respond to emails.
> > I don't think it would be particularly difficult to have both schemes
> > operating at the same time because you could use mempool names to
> > differentiate (some with unique port related name, some with a general
> > name) and mostly treat them the same, but just not sure that it's really
> > needed.
> >> This would be an advanced feature as its mempool size can depend on port configuration, users need to understand this & mempool concepts in general before using this. A bit of work to be done here in the docs to make this clear how memory requirements are calculated etc.
> >> Before going into solution details I'd like to get people's opinions. There's a few different ways to implement this, but in general would the above be acceptable? I think with some smart design we could minimize the code impact so that both approaches share as much as possible.
> >> Ian