[ovs-discuss] ovs-vswitchd mlockall and stack size

Anoob Soman anoob.soman at citrix.com
Mon Jul 14 17:06:09 UTC 2014


Hi Ben,

Thanks for the suggestions. I did some quick tests and analysis of the 
stack usage on ovs-2.1.2 (planning to do the same on master later) and 
here are some of the findings.

The list below shows the stack usage (in bytes) of each function. I 
have collected only those functions which use more than 1KB of stack.
dpif_linux_operate             114536
udpif_upcall_handler            70856
nl_sock_recv__                  65656
json_from_stream                8216
udpif_revalidator               7672
revalidator_sweep__             6008
system_stats_thread_func        4872
netdev_linux_run                4408
dpif_linux_port_poll            4328
nln_run                         4208
nl_sock_transact_multiple       3368
xlate_actions                   3112
ofproto_trace                   2760
handle_openflow__               2072
do_xlate_actions                1928
handle_flow_stats_request       1784
ofp_print_nxst_flow_monitor_reply 1712
handle_aggregate_stats_request  1368
netdev_linux_sys_get_stats      1352
append_group_stats &
  ofproto_dpif_execute_actions  1320
parse_odp_key_mask_attr         1304
xlate_group_bucket              1272
dpif_ipfix_cache_expire &
  describe_fd                   1256
dpif_linux_execute              1160
handle_flow_monitor_request     1112
sfl_agent_sysError              1080
handle_meter_mod                1048
sfl_agent_error                 1032

As you can see, a few of these functions are in the packet processing path.

Assuming we run the "AT_SETUP([ofproto-dpif - infinite resubmit])" 
test, which causes do_xlate_actions (and friends) to recurse 64 levels 
deep, roughly 400KB of stack would be used by 
"udpif_upcall_handler". I think the stack usage of udpif_revalidator 
should be the same as that of udpif_upcall_handler (if not less). I 
limited the stack size of all the pthreads to 512KB and was able to run 
both the tests you mentioned.

I tried valgrind (--tool=massif) against ovs-vswitchd and ran the 
"AT_SETUP([ofproto-dpif - infinite resubmit])" test; valgrind 
reported a maximum stack usage of around 400KB, and
"AT_SETUP([ofproto-dpif - exponential resubmit chain])" used around 
700KB. This was with 4 vCPUs (6 pthreads). Note, though, that valgrind 
reports the total stack usage across all threads.

This makes me believe that 1MB of stack size should be enough for each 
pthread, and 512KB would be tight. Let me know your thoughts. I will 
send out a patch which limits the pthread stack size to 1024KB and 
makes it configurable via "other-config".

Thanks,
Anoob.

On 08/07/14 17:47, Ben Pfaff wrote:
> I guess that the biggest effect on stack size would be the flow table
> and in particular how much recursion flow processing causes.  There are
> a few tests that force as-deep-as-possible recursion:
>
>      AT_SETUP([ofproto-dpif - infinite resubmit])
>      
>
> I don't think that forcing all packets to userspace would have much of
> an effect.  (The closest equivalent would be to disable megaflows;
> there's an "ovs-appctl" command for that, look in "ovs-appctl help".)
>
> Another hint toward maximum stack requirement is to look through the
> generated asm for stack usage, e.g.:
>
>          objdump -dr vswitchd/ovs-vswitchd|sed -n 's/^.*sub.*$0x\([0-9a-f]\{1,\}\),%esp/\1/p'|sort|uniq|less
>
> which shows that we have at least one place where we allocate 327,788
> bytes on the stack (!).  I hope that is not in the flow processing path!
>
> On Tue, Jul 08, 2014 at 05:36:07PM +0100, Anoob Soman wrote:
>> I have been running tests with a 1MB stack size and ovs-vswitchd seems
>> to hold up pretty well. I will try to do some more experiments to find
>> out the max depth of the stack, but I am afraid this will totally
>> depend on the test I am running. Any suggestions on what sort of tests
>> I should be running? Moreover, the "force-miss-model" other-config is
>> missing from 2.1.x as there is no concept of facets. Is there a way
>> to force all packets to be processed in userspace, other than
>> running "ovs-dpctl del-flows" periodically?
>>
>> Thanks,
>> Anoob.
>> On 08/07/14 17:15, Ben Pfaff wrote:
>>> On Tue, Jul 08, 2014 at 05:08:43PM +0100, Anoob Soman wrote:
>>>> Since openvswitch moved to a multi-threaded model, the RSS usage of
>>>> ovs-vswitchd has increased quite significantly compared to the last
>>>> release we used (ovs-1.4.x). Part of the problem is the use of mlockall
>>>> (with MCL_CURRENT|MCL_FUTURE) in ovs-vswitchd, which causes every
>>>> pthread's stack and the heap to be locked into RAM.
>>>> ovs-vswitchd (2.1.x) running on an 8 vCPU dom0 (10 pthreads) uses
>>>> around 89MB of RSS (80MB just for stacks), without any VMs running on
>>>> the host. One way to reduce RSS would be to reduce the number of
>>>> "n-handler-threads" and "n-revalidator-threads", but I am not sure
>>>> about the performance impact of reducing these thread counts.
>>>> I am wondering if the stack size of the pthreads can be reduced
>>>> (using pthread_attr_setstack). By default the pthread max stack size is
>>>> 8MB and mlockall locks all of this 8MB into RAM. What would be an
>>>> optimal stack size to use?
>>> I think it would be very reasonable to reduce the stack sizes, but I
>>> don't know the "correct" size off-hand.  Since you're looking at the
>>> problem already, perhaps you should consider some experiments.



