[ovs-discuss] Mempool issue for OVS 2.9

Stokes, Ian ian.stokes at intel.com
Mon Jan 29 16:59:11 UTC 2018


> -----Original Message-----
> From: Kevin Traynor [mailto:ktraynor at redhat.com]
> Sent: Monday, January 29, 2018 8:03 AM
> To: Stokes, Ian <ian.stokes at intel.com>; Ilya Maximets
> <i.maximets at samsung.com>; ovs-discuss at openvswitch.org
> Cc: Flavio Leitner <fbl at redhat.com>; Loftus, Ciara
> <ciara.loftus at intel.com>; Kavanagh, Mark B <mark.b.kavanagh at intel.com>;
> Jan Scheurich (jan.scheurich at ericsson.com) <jan.scheurich at ericsson.com>;
> Ben Pfaff (blp at ovn.org) <blp at ovn.org>; aconole at redhat.com; Venkatesan
> Pradeep <venkatesan.pradeep at ericsson.com>
> Subject: Re: Mempool issue for OVS 2.9
> 
> On 01/26/2018 05:27 PM, Stokes, Ian wrote:
> >> -----Original Message-----
> >> From: Kevin Traynor [mailto:ktraynor at redhat.com]
> >> Sent: Friday, January 26, 2018 3:48 PM
> >> To: Ilya Maximets <i.maximets at samsung.com>; Stokes, Ian
> >> <ian.stokes at intel.com>; ovs-discuss at openvswitch.org
> >> Cc: Flavio Leitner <fbl at redhat.com>; Loftus, Ciara
> >> <ciara.loftus at intel.com>; Kavanagh, Mark B
> >> <mark.b.kavanagh at intel.com>; Jan Scheurich
> >> (jan.scheurich at ericsson.com) <jan.scheurich at ericsson.com>; Ben Pfaff
> >> (blp at ovn.org) <blp at ovn.org>; aconole at redhat.com; Venkatesan Pradeep
> >> <venkatesan.pradeep at ericsson.com>
> >> Subject: Re: Mempool issue for OVS 2.9
> >>
> >> On 01/26/2018 03:16 PM, Ilya Maximets wrote:
> >>> On 26.01.2018 15:00, Stokes, Ian wrote:
> >>>> Hi All,
> >>>>
> >>>> Recently an issue was raised regarding the move from a single
> >>>> shared
> >> mempool model that was in place up to OVS 2.8, to a mempool per port
> >> model introduced in 2.9.
> >>>>
> >>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2018-January/046021.html
> >>>>
> >>>> The per port mempool model was introduced on September 5th with commit
> >>>> d555d9bded to allow fine-grained control of memory usage on a per-port
> >>>> basis.
> >>>>
> >>>> In the 'common/shared mempool' model, ports with a similar MTU would
> >>>> all share the same buffer mempool (the most common example being that
> >>>> all ports are created with the default 1500B MTU and as such share the
> >>>> same mbuf mempool).
> >>>>
> >>>> This approach had some drawbacks, however. For example, a user could
> >>>> exhaust the shared memory pool (for instance by requesting a large
> >>>> number of RXQs for a port); this would cause the vSwitch to crash, as
> >>>> any remaining ports would not have the required memory to function.
> >>>> This bug was discovered and reported to the community in late 2016:
> >>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2016-September/042560.html
> >>>>
> >>>> The per port mempool patch aimed to avoid such issues by allocating
> >>>> a
> >> separate buffer mempool to each port.
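
As a quick illustration of the difference between the two models described above, here is a minimal, self-contained C sketch. The names and format strings are purely hypothetical and not the actual netdev-dpdk code; the point is only that the shared model keys the pool on (socket, MTU), so same-MTU ports land on one pool, while the per-port model also keys on the port itself, e.g. via a hash of the netdev name as in the code quoted later in this thread.

/* Illustrative sketch only - not the actual netdev-dpdk implementation. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MP_NAME_SZ 64

/* Shared model (up to OVS 2.8): ports with the same (socket, MTU) pair
 * resolve to the same mempool name and therefore reuse one pool. */
static void
shared_mp_name(char *buf, int socket_id, int mtu)
{
    snprintf(buf, MP_NAME_SZ, "ovs_mp_shared_%d_%d", socket_id, mtu);
}

/* Per-port model (OVS 2.9): the port identity is part of the key, so every
 * port gets its own pool, sized for its own rxq/txq configuration. */
static void
per_port_mp_name(char *buf, uint32_t netdev_name_hash, int socket_id, int mtu)
{
    snprintf(buf, MP_NAME_SZ, "ovs_mp_%d_%d_%"PRIu32, mtu, socket_id,
             netdev_name_hash);
}

int
main(void)
{
    char shared[MP_NAME_SZ], per_port[MP_NAME_SZ];

    shared_mp_name(shared, 0, 1500);
    per_port_mp_name(per_port, 0xdeadbeef, 0, 1500); /* e.g. hash of "dpdk0" */
    printf("shared:   %s\nper-port: %s\n", shared, per_port);
    return 0;
}
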
> >>>>
> >> An issue has been flagged on ovs-discuss whereby the memory dimensioned
> >> for a given number of ports on OvS 2.6-2.8 may be insufficient to
> >> support the same number of ports in OvS 2.9, because the per-port
> >> mempool model requires extra memory unless the deployment is
> >> re-dimensioned. The effect of this is use-case dependent (number of
> >> ports, RXQs, MTU settings, number of PMDs, etc.). The previous
> >> 'common-pool' model was rudimentary in estimating the mempool size and
> >> was marked as something to be improved upon. The memory allocation
> >> calculation for the per-port model was modified to take the
> >> configuration factors mentioned above into account.
> >>>>
> >>>> It's unfortunate that this came to light so close to the release code
> >>>> freeze - but better late than never, as it is a valid problem to be
> >>>> resolved.
> >>>>
> >>>> I wanted to highlight some options to the community as I don't
> >>>> think
> >> the next steps should be taken in isolation due to the impact this
> >> feature has.
> >>>>
> >>>> There are a number of possibilities for the 2.9 release.
> >>>>
> >>>> (i) Revert the mempool per port patches and return to the shared
> >>>> mempool model. There are a number of features and refactoring in place
> >>>> on top of the change, so this will not be a simple revert. I'm
> >>>> currently investigating what exactly is involved.
> >>>
> >>> This looks like a bad thing to do. Shared memory pools have their own
> >>> issues and hide the real memory usage of each port. Also, reverting
> >>> seems almost impossible; we'll have to re-implement it from scratch.
> >
> > I would agree; reverting isn't as straightforward an option due to the
> > number of commits that were introduced in relation to the per-port
> > mempool feature over time (listed below for completeness).
> >
> > netdev-dpdk: Create separate memory pool for each port: d555d9b
> > netdev-dpdk: fix management of pre-existing mempools: b6b26021d
> > netdev-dpdk: Fix mempool names to reflect socket id: f06546a
> > netdev-dpdk: skip init for existing mempools: 837c176
> > netdev-dpdk: manage failure in mempool name creation: 65056fd
> > netdev-dpdk: Reword mp_size as n_mbufs: ad9b5b9
> > netdev-dpdk: Rename dpdk_mp_put as dpdk_mp_free: a08a115
> > netdev-dpdk: Fix mp_name leak on snprintf failure: ec6edc8
> > netdev-dpdk: Fix dpdk_mp leak in case of EEXIST: 173ef76
> > netdev-dpdk: Factor out struct dpdk_mp: 24e78f9
> > netdev-dpdk: Remove unused MAX_NB_MBUF: bc57ed9
> > netdev-dpdk: Fix mempool creation with large MTU: af5b0da
> > netdev-dpdk: Add debug appctl to get mempool information: be48173
> >
> > Although a lot of these are fixes/formatting, we would have to introduce
> > a new series and re-introduce the shared model by removing the per-port
> > specifics. This would also require extensive testing.
> >
> > Functionality such as the mempool debug tool would also be impacted
> > (I'm not sure how valuable it would be for a single shared mempool).
> >
> > I can look at putting an RFC for this together so as to keep the option
> on the table if it's preferred within the 2.9 time window.

I've reverted the changes above to re-introduce the shared-mempool model as it was previously:

https://github.com/istokes/ovs/tree/mempool_revert 

I wanted to share it as I intend to use the above to make testing bugs/corner cases with both approaches easier. It's not intended for merging to master at this point, and I haven't validated all features yet.

> >>>> (ii) Leave the per port mempool implementation as is and flag to users
> >>>> that memory requirements have increased. Extra memory may have to be
> >>>> provided on a per-use-case basis.
> >>>
> >>> That's a good point. I prefer this option if possible.
> >>
> >> With 4 PMDs and 4 rxqs per port a user would only be able to add 7
> >> ports per socket now, or 9 ports with 1 rxq. Of course a user may have
> >> some headroom in the memory they have allocated, but that can't be
> >> assumed. It will surely mean some users' setups will not work after
> >> upgrade - then they will have to debug why.
> >>
> >> If we just reduce MIN_NB_MBUF to 8192, it would make it 10 ports/4
> >> rxqs, or 13 ports/1 rxq, without additional memory. It still gives
> >> 24K/18K buffers per port.
> >>
> >> What do you think?
> >
> > This is similar to the RFC you submitted, I think.
> 
> The RFC tried to make the calculation of the number of buffers required
> more accurate, reduce the number of additional buffers by 75%, and allow
> it to go to zero. The suggestion above just reduces the additional buffers
> by 50%. Ilya has shown a common case where it would not fix the issue or
> be a feasible workaround.
> > I take it we still hit the issues flagged by Pradeep, but it increases
> > the number of ports at least.

So, for next steps with regards to the OVS 2.9 release, the trend is to leave the per-port mempool implementation in place for 2.9, but with the addition of documentation flagging the known increase in memory requirements, an example of how to calculate the required memory, and any other known corner cases with the current approach.
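
To make the "example of how to calculate memory" concrete, below is a rough, standalone sketch of the kind of arithmetic such documentation could walk through. It mirrors the shape of the per-port estimate visible in the dpdk_mp_create() hunk quoted further down in Ilya's patch; the 2048-entry rings, 32-packet bursts, 4 Tx queues, the 128 lcore cap and the reduced MIN_NB_MBUF of 8192 are illustrative assumptions rather than settled values:

/* Illustrative only: rough per-port mbuf estimate, mirroring the shape of
 * the dpdk_mp_create() calculation quoted later in this thread. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static uint32_t
estimate_n_mbufs(uint32_t n_rxq, uint32_t n_txq, uint32_t rxq_size,
                 uint32_t txq_size, uint32_t max_lcore, uint32_t max_burst,
                 uint32_t min_nb_mbuf)
{
    return n_rxq * rxq_size                      /* packets held in rx rings */
           + n_txq * txq_size                    /* packets held in tx rings */
           + MIN(max_lcore, n_rxq) * max_burst   /* per-core burst headroom */
           + min_nb_mbuf;                        /* additional slack */
}

int
main(void)
{
    /* Assumed defaults: 2048-entry rings, 32-packet bursts, 4 Tx queues,
     * 128 lcores, MIN_NB_MBUF reduced to 8192 as suggested above. */
    uint32_t four_rxq = estimate_n_mbufs(4, 4, 2048, 2048, 128, 32, 8192);
    uint32_t one_rxq  = estimate_n_mbufs(1, 4, 2048, 2048, 128, 32, 8192);

    /* This lands around the 24K/18K mbufs per port mentioned earlier. */
    printf("4 rxq: %"PRIu32" mbufs, 1 rxq: %"PRIu32" mbufs\n",
           four_rxq, one_rxq);
    return 0;
}

Translating an mbuf count into hugepage memory would additionally need the per-mbuf footprint for the MTU in question, so the documented example should probably spell out both steps.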

This would just be for the 2.9 release.

There's agreement that rework is required for the feature. It's not clear yet what that will look like, but Intel will be happy to contribute, as the current approach was submitted by Intel developers.

I would suggest that, whatever form this takes, it should accommodate the previous lower memory requirements, and that it should be considered for backport to OVS 2.9 once merged to master, to enable upgrades to 2.9 without workarounds in cases where the higher memory requirements are a blocker.

If the above is not acceptable, please let me know.

Thanks
Ian


> >
> > The 24K/18K is in and around the region Ilya had flagged below, so it
> > might be OK. It would be great if we had common config/deployment info
> > to help confirm this.
> >
> >>
> >>>
> >>>> (iii) Reduce the amount of memory allocated per mempool per port. An
> >>>> RFC to this effect was submitted by Kevin, but on follow-up the
> >>>> feeling is that it does not resolve the issue adequately.
> >>>
> >>
> >> Right, we discussed it in the other thread. It fixes some of the
> >> missed factors for the worst case, but Pradeep pointed out there are
> >> more. To account for them all would be difficult and would mean a lot
> >> of extra mbufs per port.
> >>
> >>> IMHO, we should not reduce the mempool size by hardcoding a small
> >>> value. Different installations have their own requirements because of
> >>> different numbers of HW NICs and CPU cores. I believe that 20-30K is a
> >>> normal per-port size for now.
> >>>
> >>
> >> It would not be too far off that range with the suggested change above.
> >>
> >>> How about adding an upper limit for the mempool size, configurable at
> >>> boot time?
> >>
> >> I don't think an OVS user would know how to calculate the number of
> >> mbufs. Even if they did, setting it would amount to them accepting that
> >> they may run out of mbufs on a port.
> >
> > I think the shared memory model as it was before also risks running out
> > of mbufs, as per the original bug where it was killing the switch.
> >
> > Below helps with the debug side for users. Ignoring for the moment
> > whether they have the ability to calculate the required memory (I agree,
> > BTW; I don't think some users would know how to do this accurately),
> > would the new parameter still be an issue for backwards compatibility?
> 
> Yes, it would.
> 
> >
> >>
> >>> See the code below for an example (not tested):
> >>> -----------------------------------------------------------------------------------
> >>> diff --git a/lib/dpdk-stub.c b/lib/dpdk-stub.c
> >>> index 3602180..ed54e06 100644
> >>> --- a/lib/dpdk-stub.c
> >>> +++ b/lib/dpdk-stub.c
> >>> @@ -54,3 +54,9 @@ dpdk_vhost_iommu_enabled(void)  {
> >>>      return false;
> >>>  }
> >>> +
> >>> +uint32_t
> >>> +dpdk_mempool_size_limit(void)
> >>> +{
> >>> +    return UINT32_MAX;
> >>> +}
> >>> diff --git a/lib/dpdk.c b/lib/dpdk.c
> >>> index 6710d10..219e412 100644
> >>> --- a/lib/dpdk.c
> >>> +++ b/lib/dpdk.c
> >>> @@ -43,6 +43,9 @@ static FILE *log_stream = NULL;       /* Stream for DPDK log redirection */
> >>>  static char *vhost_sock_dir = NULL;   /* Location of vhost-user sockets */
> >>>  static bool vhost_iommu_enabled = false; /* Status of vHost IOMMU support */
> >>>
> >>> +/* Maximum number of mbufs in a single memory pool. */
> >>> +static uint32_t per_port_mempool_size_limit = UINT32_MAX;
> >>> +
> >>>  static int
> >>>  process_vhost_flags(char *flag, const char *default_val, int size,
> >>>                      const struct smap *ovs_other_config,
> >>> @@ -313,6 +316,7 @@ dpdk_init__(const struct smap *ovs_other_config)
> >>>      int err = 0;
> >>>      cpu_set_t cpuset;
> >>>      char *sock_dir_subcomponent;
> >>> +    uint64_t size_u64;
> >>>
> >>>      log_stream = fopencookie(NULL, "w+", dpdk_log_func);
> >>>      if (log_stream == NULL) {
> >>> @@ -351,6 +355,14 @@ dpdk_init__(const struct smap *ovs_other_config)
> >>>      VLOG_INFO("IOMMU support for vhost-user-client %s.",
> >>>                 vhost_iommu_enabled ? "enabled" : "disabled");
> >>>
> >>> +    size_u64 = smap_get_ullong(ovs_other_config, "per-port-mempool-size-limit",
> >>> +                               UINT32_MAX);
> >>> +    if (size_u64 < UINT32_MAX) {
> >>> +        per_port_mempool_size_limit = size_u64;
> >>> +        VLOG_INFO("Mempool size for a single port limited to %"PRIu32" mbufs.",
> >>> +                  per_port_mempool_size_limit);
> >>> +    }
> >>> +
> >>>      argv = grow_argv(&argv, 0, 1);
> >>>      argc = 1;
> >>>      argv[0] = xstrdup(ovs_get_program_name());
> >>> @@ -494,6 +506,12 @@ dpdk_vhost_iommu_enabled(void)
> >>>      return vhost_iommu_enabled;
> >>>  }
> >>>
> >>> +uint32_t
> >>> +dpdk_mempool_size_limit(void)
> >>> +{
> >>> +    return per_port_mempool_size_limit;
> >>> +}
> >>> +
> >>>  void
> >>>  dpdk_set_lcore_id(unsigned cpu)
> >>>  {
> >>> diff --git a/lib/dpdk.h b/lib/dpdk.h
> >>> index dc58d96..4d78895 100644
> >>> --- a/lib/dpdk.h
> >>> +++ b/lib/dpdk.h
> >>> @@ -38,5 +38,6 @@ void dpdk_init(const struct smap *ovs_other_config);
> >>>  void dpdk_set_lcore_id(unsigned cpu);
> >>>  const char *dpdk_get_vhost_sock_dir(void);
> >>>  bool dpdk_vhost_iommu_enabled(void);
> >>> +uint32_t dpdk_mempool_size_limit(void);
> >>>
> >>>  #endif /* dpdk.h */
> >>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> >>> index e834c7e..b2f1072 100644
> >>> --- a/lib/netdev-dpdk.c
> >>> +++ b/lib/netdev-dpdk.c
> >>> @@ -518,7 +518,7 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> >>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
> >>>      const char *netdev_name = netdev_get_name(&dev->up);
> >>>      int socket_id = dev->requested_socket_id;
> >>> -    uint32_t n_mbufs;
> >>> +    uint32_t n_mbufs, max_mbufs;
> >>>      uint32_t hash = hash_string(netdev_name, 0);
> >>>      struct rte_mempool *mp = NULL;
> >>>
> >>> @@ -534,6 +534,13 @@ dpdk_mp_create(struct netdev_dpdk *dev, int mtu)
> >>>                + MIN(RTE_MAX_LCORE, dev->requested_n_rxq) * NETDEV_MAX_BURST
> >>>                + MIN_NB_MBUF;
> >>>
> >>> +    max_mbufs = dpdk_mempool_size_limit();
> >>> +    if (max_mbufs < n_mbufs) {
> >>> +        VLOG_WARN("Mempool size for port %s limited to %"PRIu32
> >>> +                  " mbufs. Original estimation was %"PRIu32" mbufs.",
> >>> +                  netdev_name, max_mbufs, n_mbufs);
> >>> +    }
> >>> +
> >>>      ovs_mutex_lock(&dpdk_mp_mutex);
> >>>      do {
> >>>          /* Full DPDK memory pool name must be unique and cannot be
> >>>
> >>> -----------------------------------------------------------------------------------
> >>>
> >>> I think that the above approach is a good enough workaround, because
> >>> only the user knows how much memory he/she is able to allocate for OVS
> >>> and how many ports are expected.
> >>> I'd like the limit to be in megabytes, but I didn't find a good way to
> >>> estimate the total size of a mempool.
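
For what it's worth, if a knob along the lines of this sketch were adopted, it would presumably be consumed from the Open_vSwitch table's other_config column like the existing DPDK options, e.g. something like 'ovs-vsctl set Open_vSwitch . other_config:per-port-mempool-size-limit=32768', with the key name taken from the sketch above and the value being whatever per-port mbuf budget the user has dimensioned for.
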
> >>>
> >>>> (iv) Introduce a feature to allow users to configure the mempool as
> >>>> shared or on a per-port basis: this would be the best of both worlds,
> >>>> but given the proximity to the 2.9 freeze I don't think it's feasible
> >>>> by the end of January.
> >>>
> >>> This may solve some issues, but yes, it's definitely not for 2.9.
> >>> A good implementation and extensive testing will be required.
> >>
> >> +1. There may be other solutions too; Pradeep had a different
> >> suggestion, but it's not for now.
> >
> > OK, we can agree to remove this option for the moment, at least within
> > the period of the OVS code freeze. It will definitely be looked at for
> > 2.10, however.
> >
> > Thanks
> > Ian
> >>
> >> thanks,
> >> Kevin.
> >>
> >>>
> >>>>
> >>>> I'd appreciate people's input on this ASAP so we can reach a
> >>>> consensus on the next steps.
> >>>>
> >>>> Thanks
> >>>> Ian
> >>
> >


