[ovs-discuss] [OVN] ovn-controller Incremental Processing scale testing

Daniel Alvarez Sanchez dalvarez at redhat.com
Tue Jun 11 08:38:59 UTC 2019


Hi Han, all,

Lucas, Numan and I have been doing some 'scale' testing of OpenStack
using OVN and wanted to present some results and issues that we've
found with the Incremental Processing feature in ovn-controller. Below
is the scenario that we executed:

* Setup of 7 baremetal nodes: 3 controllers (running
ovn-northd/ovsdb-servers active/passive with Pacemaker) + 4 compute
nodes. OVS 2.10.
* The test consists of:
  - Create openstack network (OVN LS), subnet and router
  - Attach subnet to the router and set gw to the external network
  - Create an OpenStack port and apply a Security Group (ACLs to allow
UDP, SSH and ICMP).
  - Bind the port to one of the 4 compute nodes (randomly) by
attaching it to a network namespace (roughly as in the sketch after
this list).
  - Wait for the port to be ACTIVE in Neutron ('up == True' in NB)
  - Wait until the test can ping the port
* Running browbeat/rally with 16 concurrent processes to execute the
test above 150 times.
* Once all 150 'fake VMs' are created, browbeat deletes all the
OpenStack/OVN resources.
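
For reference, each 'fake VM' is essentially an OVS internal port on
br-int whose external_ids:iface-id matches the Neutron/OVN logical
port, moved into a network namespace. Below is a minimal Python sketch
of that binding step; the bridge name, port/namespace naming and
addressing are illustrative assumptions, not the exact browbeat/rally
plugin code:

    # Rough sketch of how a 'fake VM' is bound on a compute node.
    # Names and parameters are assumptions for illustration only.
    import subprocess

    def run(cmd):
        subprocess.run(cmd, shell=True, check=True)

    def bind_fake_vm(neutron_port_id, port_name, ns_name, mac, ip_cidr, gw_ip):
        # Create an OVS internal port on br-int and point it at the
        # Neutron/OVN logical port via external_ids:iface-id so that
        # ovn-controller claims it and installs the flows for it.
        run(f"ovs-vsctl add-port br-int {port_name} "
            f"-- set Interface {port_name} type=internal "
            f"external_ids:iface-id={neutron_port_id}")
        # Move the port into a namespace and configure it like a VM NIC.
        run(f"ip netns add {ns_name}")
        run(f"ip link set {port_name} netns {ns_name}")
        run(f"ip netns exec {ns_name} ip link set {port_name} address {mac}")
        run(f"ip netns exec {ns_name} ip addr add {ip_cidr} dev {port_name}")
        run(f"ip netns exec {ns_name} ip link set {port_name} up")
        run(f"ip netns exec {ns_name} ip route add default via {gw_ip}")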

We first ran the test with OVS/OVN 2.10 and collected results showing
100% success, although ovn-controller was quite loaded (as expected) on
all the nodes, especially during the deletion phase:

- Compute node: https://imgur.com/a/tzxfrIR
- Controller node (ovn-northd and ovsdb-servers): https://imgur.com/a/8ffKKYF

After the tests above, we replaced ovn-controller on all 7 nodes with a
build from the current master branch (as of last week). We also
replaced ovn-northd and the ovsdb-servers, but ovs-vswitchd was left
untouched (still on 2.10). We expected lower ovn-controller CPU usage
and better completion times thanks to the recently introduced
Incremental Processing feature. However, the results don't look very
good:

- Compute node: https://imgur.com/a/wuq87F1
- Controller node (ovn-northd and ovsdb-servers): https://imgur.com/a/99kiyDp

One thing we can tell from the ovs-vswitchd CPU consumption is that it
is much lower in the Incremental Processing (IP) case, which doesn't
seem to make sense. This led us to suspect that ovn-controller was not
installing all the necessary flows in the switch, and we confirmed this
hypothesis by looking at the dataplane results: out of the 150 VMs, 10%
were unreachable via ping when using ovn-controller from master.
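
The reachability numbers come from a simple ping sweep over the fake
VM addresses; a minimal sketch of that check is below (the source of
the (name, IP) pairs and the ping count/timeout are assumptions):

    # Sketch of the reachability check: ping each fake VM's IP and
    # report the ones that never respond.
    import subprocess

    def is_reachable(ip, count=3, timeout=2):
        # True if at least one ICMP echo reply comes back.
        result = subprocess.run(
            ["ping", "-c", str(count), "-W", str(timeout), ip],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result.returncode == 0

    def check_ports(ports):
        # ports: list of (port_name, ip) tuples gathered from Neutron/NB.
        unreachable = [(name, ip) for name, ip in ports if not is_reachable(ip)]
        print(f"{len(unreachable)} of {len(ports)} ports unreachable")
        return unreachable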

@Han, others, do you have any ideas as to what could be happening
here? We'll have access to this setup for a few more days, so let me
know if you want us to collect any other data/traces, ...

Some other interesting data points:
On each compute node (with the logical ports bound to them almost
evenly distributed), the maximum number of flows in br-int is ~90K (by
the end of the test, right before deleting the resources).
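
The flow count was taken by dumping the OpenFlow flows installed in
br-int on each compute node; something along these lines (the
per-table grouping is just for extra visibility, the rest is glue):

    # Count the OpenFlow flows installed in br-int, grouped per table.
    import subprocess
    from collections import Counter

    def flow_counts(bridge="br-int"):
        out = subprocess.run(
            ["ovs-ofctl", "dump-flows", bridge, "-O", "OpenFlow13"],
            capture_output=True, text=True, check=True).stdout
        per_table = Counter()
        for line in out.splitlines():
            line = line.strip()
            if not line.startswith("cookie="):
                continue  # skip the reply header line
            # each flow line carries a "table=N" field
            table = next((f for f in line.split(", ")
                          if f.startswith("table=")), "table=0")
            per_table[table] += 1
        return sum(per_table.values()), per_table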

It looks like with the IP version, ovn-controller leaks some memory:
https://imgur.com/a/trQrhWd
With OVS 2.10, by contrast, it remains pretty flat during the test:
https://imgur.com/a/KCkIT4O
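
The memory figures were obtained by periodically sampling
ovn-controller's resident set size; a rough sketch of such a sampler,
assuming a single ovn-controller process per node and a 30s interval:

    # Periodically print ovn-controller's RSS (in kB) from /proc.
    import subprocess
    import time

    def ovn_controller_rss_kb():
        pid = subprocess.run(["pidof", "ovn-controller"],
                             capture_output=True, text=True,
                             check=True).stdout.split()[0]
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # value is in kB
        return None

    if __name__ == "__main__":
        while True:
            print(time.strftime("%H:%M:%S"),
                  ovn_controller_rss_kb(), "kB", flush=True)
            time.sleep(30)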

Looking forward to hearing back :)
Daniel

PS. Sorry for my previous email, I sent it by mistake without a subject line.

