[ovs-discuss] Incremental perf results

Mark Michelson mmichels at redhat.com
Fri May 11 14:58:55 UTC 2018


On 05/10/2018 08:33 PM, Han Zhou wrote:
> 
> 
> On Wed, May 9, 2018 at 12:21 PM, Mark Michelson <mmichels at redhat.com> wrote:
> 
>     On 05/07/2018 08:29 AM, Mark Michelson wrote:
> 
>         On 05/04/2018 04:15 PM, Han Zhou wrote:
> 
> 
> 
>             On Thu, May 3, 2018 at 10:49 AM, Mark Michelson
>             <mmichels at redhat.com> wrote:
> 
>                  On 05/03/2018 12:58 PM, Mark Michelson wrote:
> 
>                      Hi Han,
> 
>                      (cc'ed ovs-discuss)
> 
>                      I have some test results here for your incremental
>                      branch[0].
> 
>                      Browbeat was the orchestrator for the test, and it
>                      uses ovn-scale-test[1] to configure the test
>                      parameters and run the test.
> 
>                      The test uses one central node on which ovn-northd
>                      runs. There are three farm nodes on which sandboxes
>                      are run for ovn-controller. Each farm node runs 24
>                      sandboxes, for a total of 72 sandboxes (and thus 72
>                      ovn-controller processes).
> 
>                      The test uses the datapath context[2] to set up 72
>                      logical switches and one logical router in advance.
>                      Then, during each test iteration, a logical switch
>                      port is added to one of the logical switches and
>                      bound on one of the sandboxes. The next iteration
>                      does not start until the previous iteration is 100%
>                      complete (i.e., we see that the logical switch port
>                      is "up" in the northbound db). The total number of
>                      logical switch ports added during the test is 3312.
> 
>                      During the test, I ran `perf record` on one of the
>                      ovn-controller processes and then created a flame
>                      graph[3] from the results. I have attached the
>                      flame graph to this e-mail. I think this can give
>                      us a good jumping-off point for determining more
>                      optimizations to make to ovn-controller.
> 
>                      [0] https://github.com/hzhou8/ovs/tree/ip7
>                      [1] https://github.com/openvswitch/ovn-scale-test
>                      [2] https://github.com/openvswitch/ovn-scale-test/pull/165
>                      [3] http://www.brendangregg.com/FlameGraphs/cpuflamegraphs
> 
> 
> 
>                   From the IRC meeting, it was requested to see a flame
>                  graph of performance on OVS master. I am attaching that
>                  to this e-mail.
> 
>                  One difference in this test run is that the number of
>                  switch ports was smaller (I'm not 100% sure of the
>                  exact number), so the number of samples in perf record
>                  is lower than in the flame graph I previously sent.
> 
>                  The vast majority of the time is spent in lflow_run().
>                  Based on this flame graph, our initial take on the
>                  matter was that we could get improved performance by
>                  reducing the number of logical flows to process. The
>                  incremental branch seemed like a good testing target to
>                  that end.
> 
> 
>             Thanks Mark for sharing the results!
>             It seems you have sent the wrong attachment perf-master.svg
>             in your second email; it is still the same one as in the
>             first email. Would you mind sending the right one? Also,
>             please share the total CPU cost for incremental vs. master
>             when you have the data.
> 
>               From your text description, it is improved as expected,
>             since the bottleneck moved from lflow_run() to ofctrl_put().
>             The new bottleneck, ofctrl_put(), is a good finding, and I
>             think I have some ideas to improve it, too. Basically, when
>             we are able to do incremental computing, we don't need to go
>             through and compare the installed vs. desired flows every
>             time; instead, we can maintain a list of the changes made in
>             the iteration and then send them to OVS. This would
>             eliminate many of the ovn_flow_lookup() calls, which are
>             currently the hotspot in ofctrl_put(). I will try this as
>             soon as I fix some corner cases and am confident about the
>             correctness.
> 
>             Thanks,
>             Han
> 
> 
>         Apologies, Han. Here is the proper file.
> 
>         Your conclusion is similar to what I thought about when I saw
>         where the new bottleneck was.
> 
>         I don't have a comparison of total CPU cost yet. But I can tell
>         you that from observing CPU usage in real time during the test
>         with the master branch, the graphs appeared to be at 100% across
>         all CPU cores for the duration of the test. I have not observed
>         CPU usage during a test with the incremental branch to see if it
>         improved.
> 
>         Mark
> 
> 
>     Hello Han,
> 
>     Just following up on the previous e-mail. I plan to run a test soon
>     with the incremental branch and observe CPU usage. I won't be able
>     to share a graph, but I can at least tell you if I observe an
>     overall drop in CPU usage compared to using the master branch.
> 
>     I took a look through ofctrl_put() and I think that the incremental
>     processing branch puts it in a good position to be improved
>     dramatically.
> 
>     In master, we currently maintain a hmap of installed flows. When a
>     loop of ovn-controller runs, we use the contents of the southbound
>     database to build up a hmap of our desired flows. We then perform an
>     (expensive) operation where we reconcile the installed flows with
>     the desired flows. The output of all of this is a list of
>     ofputil_flow_mods that we then pass down to vswitchd.
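> 
>     Roughly, that reconciliation looks like this (a simplified sketch
>     from memory, not the literal ofctrl.c code; actions_changed() is a
>     stand-in for the actual comparison):
> 
>         struct ovn_flow *i, *i_next, *d;
> 
>         /* Pass 1: flows that are installed but stale. */
>         HMAP_FOR_EACH_SAFE (i, i_next, hmap_node, &installed_flows) {
>             d = ovn_flow_lookup(&desired_flows, i);
>             if (!d) {
>                 /* Installed but no longer desired: queue OFPFC_DELETE. */
>             } else if (actions_changed(i, d)) {
>                 /* Still desired but actions differ: queue OFPFC_MODIFY. */
>             }
>         }
> 
>         /* Pass 2: flows that are desired but missing. */
>         HMAP_FOR_EACH (d, hmap_node, &desired_flows) {
>             if (!ovn_flow_lookup(&installed_flows, d)) {
>                 /* Desired but not installed: queue OFPFC_ADD. */
>             }
>         }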
> 
>     In the incremental processing branch [1], this is improved somewhat.
>     When we can perform incremental processing, the desired flows are no
>     longer built from scratch on each loop. Instead, the desired flows
>     persist and are modified based on changes observed in the database.
>     ofctrl_put() is mostly unchanged, although it can exit early if it
>     knows that no changes were observed.
> 
> In fact, ofctrl_put() (in the ip7 branch) already takes care of the
> no-change scenario. It returns immediately if it decides that no change
> is needed after some simple evaluation ("flow_changed" is one of the
> conditions it considers). But yes, when a change is needed, it can be
> further optimized, as I mentioned in my last email.
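> 
> As a rough sketch of that optimization (the names here are invented for
> illustration, not actual code from the branch):
> 
>     enum flow_change_type { FLOW_ADDED, FLOW_MODIFIED, FLOW_DELETED };
> 
>     struct flow_change {
>         struct ovs_list node;       /* In the per-iteration change list. */
>         enum flow_change_type type;
>         struct ovn_flow *flow;
>     };
> 
>     /* ofctrl_put() would then walk the change list instead of comparing
>      * the installed and desired hmaps flow by flow.  'tracked_changes'
>      * is the struct ovs_list collecting this iteration's changes. */
>     struct flow_change *ch;
>     LIST_FOR_EACH_POP (ch, node, &tracked_changes) {
>         queue_flow_mod_for(ch);     /* emits OFPFC_ADD/MODIFY/DELETE */
>         free(ch);
>     }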
> 
> 
>     I've thought about this and have two stages in mind for potentially
>     improving it.
> 
>     Stage 1: When performing incremental processing, don't bother
>     dealing with desired flows at all. Instead, when changes are
>     observed in the southbound database, modify the installed flows
>     directly and create ofputil_flow_mods based on the changes. When it
>     is time to run ofctrl_put(), we can pass the generated
>     ofputil_flow_mods directly to vswitchd without having to perform any
>     sort of reconciliation between hmaps. We still have to maintain the
>     installed flows because full runs through the southbound database
>     will need to perform the "classic" version of ofctrl_put() and will
>     need to have an up-to-date table of installed flows.
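> 
>     A sketch of what I mean (hypothetical names; queue_flow_mod() stands
>     in for however the generated flow_mods get buffered until
>     ofctrl_put() runs):
> 
>         static void
>         handle_flow_added(struct ovn_flow *f)
>         {
>             /* Modify the installed flows directly... */
>             hmap_insert(&installed_flows, &f->hmap_node, ovn_flow_hash(f));
> 
>             /* ...and build the ofputil_flow_mod right away, so that
>              * ofctrl_put() can pass it to vswitchd without reconciling
>              * the hmaps. */
>             struct ofputil_flow_mod fm = {
>                 .match = f->match,
>                 .priority = f->priority,
>                 .table_id = f->table_id,
>                 .ofpacts = f->ofpacts,
>                 .ofpacts_len = f->ofpacts_len,
>                 .command = OFPFC_ADD,
>             };
>             queue_flow_mod(&fm);
>         }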
> 
> I was thinking about the same, but then figured out that we still need
> the desired flows because of a corner case: different logical flows can
> generate the same OVS flows (e.g. when 2 identical ACLs are created for
> the same OVN port, due to client behavior, which is real at least in
> Neutron), but duplicated flows can't exist in the installed flows. If we
> don't maintain a desired flows table that stores multiple entries for
> duplicated flows, then when we handle a logical flow deletion we don't
> know whether the corresponding installed flow(s) is/are still being used
> by any other logical flow.

I'm glad I asked because I would not have realized that distinct logical 
flows could result in the same OVS flow.

One idea for addressing this is for each entry in the installed_flows 
hmap to store both the OVS flow and a reference count. The reference 
count is equal to the number of logical flows that resulted in that OVS 
flow. When doing incremental processing, it's easy to add to and 
subtract from the count based on changes in logical flows. If the count 
on an installed OVS flow reaches 0, then the flow can be removed from 
the hmap.
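
Something like this, for illustration (the names are made up for this 
sketch, not the actual ofctrl.c structures, and installed_flow_find(), 
installed_flow_create(), installed_flow_destroy(), and queue_flow_mod() 
are hypothetical helpers):

    struct installed_flow {
        struct hmap_node hmap_node;  /* In the installed_flows hmap. */
        struct ovn_flow flow;
        unsigned int refcount;       /* Logical flows producing this flow. */
    };

    /* A logical flow generated OVS flow 'f'. */
    static void
    installed_flow_ref(struct hmap *installed, const struct ovn_flow *f)
    {
        struct installed_flow *i = installed_flow_find(installed, f);
        if (i) {
            i->refcount++;
        } else {
            i = installed_flow_create(f);            /* refcount = 1 */
            hmap_insert(installed, &i->hmap_node, ovn_flow_hash(f));
            queue_flow_mod(OFPFC_ADD, f);
        }
    }

    /* A logical flow that generated 'f' was deleted. */
    static void
    installed_flow_unref(struct hmap *installed, const struct ovn_flow *f)
    {
        struct installed_flow *i = installed_flow_find(installed, f);
        if (i && --i->refcount == 0) {
            hmap_remove(installed, &i->hmap_node);
            queue_flow_mod(OFPFC_DELETE_STRICT, f);
            installed_flow_destroy(i);
        }
    }

The count makes handling a logical flow deletion local: there is no need 
to scan the other logical flows to find out whether the OVS flow is 
still in use.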

> 
> This can be addressed by changing the installed flow table structure to
> maintain the M:N mapping between logical flows and OVS flows, and then
> the desired flow table may be removed. However, keeping the desired flow
> table doesn't hurt either (aside from the memory consumption), and I am
> not sure whether there are other reasons we will need it (maybe it helps
> when full computation is needed). When incremental processing takes
> effect, we can still avoid the costly full-table comparison by
> maintaining a list of the changes while continuing to update the desired
> flow table.
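> 
> For example, the mapping could be kept with a link node on both sides
> (again just a sketch, with invented names):
> 
>     /* One instance per (logical flow, OVS flow) pair. */
>     struct lflow_oflow_link {
>         struct ovs_list lflow_node; /* In the logical flow's links. */
>         struct ovs_list oflow_node; /* In the OVS flow's links. */
>         struct logical_flow *lflow;
>         struct installed_flow *oflow;
>     };
> 
>     /* Deleting a logical flow walks its links; an installed flow whose
>      * list of links becomes empty can then be deleted from OVS. */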

My thinking behind removing the persistent desired flows table was 
mainly just removing something that didn't need to be there. For full 
computation, we can do what OVS master currently does: build the desired 
flows from scratch and then delete them at the end of the iteration. 
Thinking about it more, I think you're right that the main savings we 
would get would be in memory consumption.

> 
>     Stage 2: If the code ever reaches a point where incremental
>     processing is guaranteed to be performed on every loop iteration of
>     ovn-controller, then ovn-controller could become stateless with
>     regard to installed flows. There would be no need to maintain the
>     installed flows since we could just generate ofputil_flow_mods based
>     on how the southbound database has changed. We could then pass those
>     modifications down to vswitchd.
> 
> This would be the ideal situation, but it seems to me not very realistic
> for ovn-controller, since there are just too many dependencies. Perhaps
> I am just not convinced of the value vs. the effort to achieve it. Maybe
> it is OK to recompute for changes that do not happen very often?

I agree that it is not very realistic, but it would be nice if it were 
possible. And yes, for infrequent changes, performing a recompute is 
probably fine.

> 
>     I think similar changes could be implemented for the group and meter
>     extend tables, but they are not nearly as important as the flow table.
> 
>     Do my suggestions make sense to you? Are they in line with what you
>     had been thinking about?
> 
> Of course! Thanks for sharing your good thoughts. I am glad this work 
> helps and let's improve it further :)
> 
>     Thanks,
>     Mark
> 
>     [1] Currently I'm looking at https://github.com/hzhou8/ovs/tree/ip7 .
>     I noticed that there is an ip8 branch as well, but since I have been
>     using ip7 in tests, I decided to look at it as a reference.
> 
> 
> I will keep the branch ip7 untouched and let you know when I get ip8 ready.

Thanks. Can you give a brief summary of what is different in ip8? Just a 
sentence or two is fine.

> 
> Thanks,
> Han


