[ovs-dev] [PATCH] ovs: do not allocate memory from offline numa node
jrajahalme at nicira.com
Tue Oct 20 17:58:33 UTC 2015
> On Oct 9, 2015, at 5:02 PM, Jarno Rajahalme <jrajahalme at nicira.com> wrote:
>> On Oct 9, 2015, at 3:11 PM, Jesse Gross <jesse at nicira.com> wrote:
>>> On Fri, Oct 9, 2015 at 8:54 AM, Jarno Rajahalme <jrajahalme at nicira.com> wrote:
>>> On Oct 8, 2015, at 4:03 PM, Jesse Gross <jesse at nicira.com> wrote:
>>> On Wed, Oct 7, 2015 at 10:47 AM, Jarno Rajahalme <jrajahalme at nicira.com> wrote:
>>> On Oct 6, 2015, at 6:01 PM, Jesse Gross <jesse at nicira.com> wrote:
>>> On Mon, Oct 5, 2015 at 1:25 PM, Alexander Duyck <alexander.duyck at gmail.com> wrote:
>>> On 10/05/2015 06:59 AM, Vlastimil Babka wrote:
>>> On 10/02/2015 12:18 PM, Konstantin Khlebnikov wrote:
>>> When openvswitch tries to allocate memory from offline numa node 0:
>>> stats = kmem_cache_alloc_node(flow_stats_cache, GFP_KERNEL | __GFP_ZERO, 0)
>>> It catches VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid))
>>> [ replaced with VM_WARN_ON(!node_online(nid)) recently ] in linux/gfp.h
>>> This patch disables numa affinity in this case.
>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov at yandex-team.ru <mailto:khlebnikov at yandex-team.ru>>
>>> diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
>>> index f2ea83ba4763..c7f74aab34b9 100644
>>> --- a/net/openvswitch/flow_table.c
>>> +++ b/net/openvswitch/flow_table.c
>>> @@ -93,7 +93,8 @@ struct sw_flow *ovs_flow_alloc(void)
>>> /* Initialize the default stat node. */
>>> stats = kmem_cache_alloc_node(flow_stats_cache,
>>> - GFP_KERNEL | __GFP_ZERO, 0);
>>> + GFP_KERNEL | __GFP_ZERO,
>>> + node_online(0) ? 0 : NUMA_NO_NODE);
>>> Stupid question: can node 0 become offline between this check and the
>>> VM_WARN_ON? :) BTW, what kind of system has node 0 offline?
>>> Another question to ask would be: is it possible for node 0 to be online, but
>>> be a memoryless node?
>>> I would say you are better off just making this call kmem_cache_alloc. I
>>> don't see anything that indicates the memory has to come from node 0, so
>>> adding the extra overhead doesn't provide any value.
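(For illustration only, the suggested change would reduce to something like the
following, letting the slab allocator pick whichever node is local; a simplified
sketch, not a tested patch:)

	/* Let the allocator choose the node instead of forcing node 0. */
	stats = kmem_cache_alloc(flow_stats_cache, GFP_KERNEL | __GFP_ZERO);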
>>> I agree that this at least makes me wonder, though I actually have
>>> concerns in the opposite direction - I see assumptions about this
>>> being on node 0 in net/openvswitch/flow.c.
>>> Jarno, since you originally wrote this code, can you take a look to see
>>> if everything still makes sense?
>>> We keep the pre-allocated stats node at array index 0, which is initially
>>> used by all CPUs, but if CPUs from multiple numa nodes start updating the
>>> stats, we allocate additional stats nodes (up to one per numa node), and the
>>> CPUs on node 0 keep using the preallocated entry. If stats cannot be
>>> allocated from a CPU's local node, then those CPUs keep using the entry at
>>> index 0. Currently the code in net/openvswitch/flow.c will try to allocate
>>> the local memory repeatedly, which may not be optimal when there is no
>>> memory at the local node.
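(Roughly, the lookup in ovs_flow_stats_update() in net/openvswitch/flow.c has
this shape; a simplified sketch of the logic described above, not the exact code:)

	struct flow_stats *stats;
	int node = numa_node_id();

	stats = rcu_dereference(flow->stats[node]);
	if (!stats) {
		/* No stats allocated for this node yet: fall back to the
		 * pre-allocated entry at index 0, and possibly try to
		 * allocate a node-local entry if this node keeps writing.
		 */
		stats = rcu_dereference(flow->stats[0]);
	}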
>>> Allocating the memory for index 0 from a node other than node 0, as discussed
>>> here, just means that the CPUs on node 0 will keep on using non-local memory
>>> for stats. In a scenario where there are CPUs on two nodes (0, 1), but only
>>> node 1 has memory, a shared flow entry will still end up having separate
>>> memory allocated for both nodes, but both allocations would be on node 1.
>>> However, there is still a high likelihood that the memory allocations would
>>> not share a cache line, which should prevent the nodes from invalidating
>>> each other’s caches. Based on this I do not see a problem relaxing the
>>> memory allocation for the default stats node. If node 0 has memory, however,
>>> it would be better to allocate the memory from node 0.
>>> Thanks for going through all of that.
>>> It seems like the question that is being raised is whether it actually
>>> makes sense to try to get the initial memory on node 0, especially
>>> since it seems to introduce some corner cases? Is there any reason why
>>> the flow is more likely to hit node 0 than a randomly chosen one?
>>> (Assuming that this is a multinode system, otherwise it's kind of a
>>> moot point.) We could have a separate pointer to the default allocated
>>> memory, so it wouldn't conflict with memory that was intentionally
>>> allocated for node 0.
>>> It would still be preferable to know from which node the default stats node
>>> was allocated, and store it in the appropriate pointer in the array. We
>>> could then add a new “default stats node index” that would be used to locate
>>> the node in the array of pointers we already have. That way we would avoid
>>> extra allocation and processing of the default stats node.
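(A minimal sketch of that idea, using a hypothetical default_stats_node field and
assuming the rest of ovs_flow_alloc() stays as is:)

	/* Hypothetical: allocate the default stats entry from any node,
	 * then record which node it actually landed on so it can be found
	 * through the existing per-node pointer array.
	 */
	stats = kmem_cache_alloc(flow_stats_cache, GFP_KERNEL | __GFP_ZERO);
	if (!stats)
		goto err;

	flow->default_stats_node = page_to_nid(virt_to_page(stats));
	RCU_INIT_POINTER(flow->stats[flow->default_stats_node], stats);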
>> I agree, that sounds reasonable to me. Will you make that change?
>> Besides eliminating corner cases, it might help performance in some
>> cases too by avoiding stressing memory bandwidth on node 0.
According to the comment above kmem_cache_alloc_node(), it should not BUG_ON/WARN_ON in this case:
> * kmem_cache_alloc_node - Allocate an object on the specified node
> * @cachep: The cache to allocate from.
> * @flags: See kmalloc().
> * @nodeid: node number of the target node.
> * Identical to kmem_cache_alloc but it will allocate memory on the given
> * node, which can improve the performance for cpu bound structures.
> * Fallback to other node is possible if __GFP_THISNODE is not set.
See also this from cpuset.c:
> * cpuset_mem_spread_node() - On which node to begin search for a file page
> * cpuset_slab_spread_node() - On which node to begin search for a slab page
> * If a task is marked PF_SPREAD_PAGE or PF_SPREAD_SLAB (as for
> * tasks in a cpuset with is_spread_page or is_spread_slab set),
> * and if the memory allocation used cpuset_mem_spread_node()
> * to determine on which node to start looking, as it will for
> * certain page cache or slab cache pages such as used for file
> * system buffers and inode caches, then instead of starting on the
> * local node to look for a free page, rather spread the starting
> * node around the tasks mems_allowed nodes.
> * We don't have to worry about the returned node being offline
> * because "it can't happen", and even if it did, it would be ok.
> * The routines calling guarantee_online_mems() are careful to
> * only set nodes in task->mems_allowed that are online. So it
> * should not be possible for the following code to return an
> * offline node. But if it did, that would be ok, as this routine
> * is not returning the node where the allocation must be, only
> * the node where the search should start. The zonelist passed to
> * __alloc_pages() will include all nodes. If the slab allocator
> * is passed an offline node, it will fall back to the local node.
> * See kmem_cache_alloc_node().
Based on this, it seems this is a bug in the memory allocator; it probably should not be calling alloc_pages_exact_node() when __GFP_THISNODE is not set?
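(In other words, the fallback these comments describe would amount to something
like the following on the allocator side; illustrative only, not the actual slab
code:)

	/* If the caller passed an offline (or no) node and did not insist
	 * on it with __GFP_THISNODE, fall back to a node that has memory
	 * instead of warning.
	 */
	if (!(flags & __GFP_THISNODE) &&
	    (nodeid == NUMA_NO_NODE || !node_online(nodeid)))
		nodeid = numa_mem_id();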