[ovs-dev] ovn-controller refactor doc (LONG)

Numan Siddique numans at ovn.org
Thu Feb 11 12:47:09 UTC 2021


On Thu, Feb 11, 2021 at 2:49 PM Han Zhou <hzhou at ovn.org> wrote:
>
> On Tue, Jan 26, 2021 at 1:04 PM Mark Michelson <mmichels at redhat.com> wrote:
> >
> > Warning: Very long email ahead
> >
>
> Hi Mark,
>
> Thanks for writing this up. Please see my comments (also very long :))
> below.
>
> > ovn-controller has seen numerous drastic improvements over the past
> > several years. The focus has been mostly on performance and adding new
> > features. This, however, has resulted in a code base that has become
> > difficult to work with. The problems it has can be summed up as follows:
> > * It is not always obvious how certain pieces of code interact with each
> > other.
> > * The architecture comes across as monolithic. That is, the
> > ovn-controller application does not have individual components or layers
> > that have isolated duties.
> > * Southbound database IDL structures are used directly throughout the
> > code, rather than structures that may better represent the network
> > makeup or that take advantage of types and other benefits of C.
> > * There are many special cases. In other words, particular circumstances
> > can result in different code paths running.
> >
> > These have led to problems, such as:
> > * Unit testing is difficult.
> > * System tests can be flaky. We might see some developers run tests
> > successfully 90% of the time, while other devs run the same tests with
> > 90% failure.
> > * Attempting to replace a section of code with something new is difficult.
> > * Modifying the code can result in unexpected errors in what seems at
> > first glance to be an unrelated section.
> > * The incremental engine sometimes encounters situations that require
> > full recomputes of all southbound data. Many of these situations could
> > be avoided if the focus of the incremental engine were narrowed.
> >
> > To fix these issues, we suggest the following ideas:
> > * Isolate database IDL usage to a data translation layer. That is,
> > database-specific structures from the IDL will be translated into local
> > structures. Any lower layers would act upon the local structures only,
> > never seeing the database data directly.
> >    * This allows lower layers to be more easily unit tested.
> >    * This allows changing database technologies more easily.
> >    * This potentially allows for more natural representations of data.
> > In other words, we are not restricted to the relations and limited types
> > that the databases use.
>
> This sounds reasonable, but we should also keep in mind that an abstraction
> layer would introduce some cost. I think it would be worth the effort &
> cost if we really want to change the database technologies.
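For concreteness, a translation step of the kind proposed above might look roughly like the following sketch. All type, field, and function names here are hypothetical stand-ins (the real IDL type would be something like `struct sbrec_port_binding`), and the point is only the shape: the translation layer is the sole consumer of IDL rows, and lower layers see only the local structure with richer C types.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an OVSDB IDL row; the real type would be
 * something like "struct sbrec_port_binding". */
struct idl_port_binding {
    const char *logical_port;
    const char *type;        /* "" for a VIF, or "patch", "chassisredirect", ... */
    long long tunnel_key;
};

/* Local representation: richer C types instead of raw IDL strings. */
enum lport_type { LPORT_VIF, LPORT_PATCH, LPORT_CHASSISREDIRECT, LPORT_OTHER };

struct local_port_binding {
    char *name;
    enum lport_type type;
    uint32_t tunnel_key;
};

static enum lport_type
lport_type_from_string(const char *s)
{
    if (!s || !s[0]) {
        return LPORT_VIF;
    } else if (!strcmp(s, "patch")) {
        return LPORT_PATCH;
    } else if (!strcmp(s, "chassisredirect")) {
        return LPORT_CHASSISREDIRECT;
    }
    return LPORT_OTHER;
}

/* Translation layer: the only code that touches IDL types.  Lower layers
 * receive struct local_port_binding only. */
static struct local_port_binding *
local_port_binding_create(const struct idl_port_binding *row)
{
    struct local_port_binding *pb = malloc(sizeof *pb);
    pb->name = malloc(strlen(row->logical_port) + 1);
    strcpy(pb->name, row->logical_port);
    pb->type = lport_type_from_string(row->type);
    pb->tunnel_key = (uint32_t) row->tunnel_key;
    return pb;
}

static void
local_port_binding_destroy(struct local_port_binding *pb)
{
    free(pb->name);
    free(pb);
}
```

A unit test can then exercise the translation without any database at all, which is exactly the testability benefit claimed above.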
>
> > * Isolate the incremental engine so that it determines only which data is
> > relevant for processing by lower layers. The lower layers should then
> > take whatever data is relevant and perform any actions that should be
> > performed (e.g. claiming/releasing port bindings, writing openflow
> > flows, etc.)
> >    * This reduces most special-casing of code in lower layers.
> >    * This makes it easier to reason about what effect the incremental
> > engine will have on what will be processed.
> >    * This reduces the overall reach of the incremental engine into the
> > rest of the code.
>
> I think this was the initial design principle of the I-P engine. It should
> only represent data dependencies and invoke the change handlers (if this is
> what the *lower layer* means here) to react to what's changed accordingly.
> The difficult part is the change handler implementation: it needs to
> understand the relationship between all the inputs and take correct actions
> when any of the input changes, examine the change and reconcile by
> processing the other related data.
> Due to this complexity, there were only simple scenarios supported with I-P
> initially. While adding more change handlers for more scenarios to utilize
> I-P, the complexity increased. Specifically, the model was somehow not
> strictly followed when adding those scenarios in I-P, which led to a lot
> of *special* cases (and I am definitely responsible for it as the main
> reviewer). I'd also admit that there were *special* cases even in the
> initial commits of I-P, but since it only takes care of the very limited
> most frequently used scenarios, it was much easier to reason about the
> correctness.
> Now there is an effort to complete one of the big TODOs to make the I-P
> dependency graph more straightforward and easier to follow and maintain (
> https://patchwork.ozlabs.org/project/ovn/patch/20210205094254.425942-1-numans@ovn.org/).
> Thanks Numan for working on this and I am still reviewing it.
>
> Moreover, I think if we continue with the I-P engine approach, we should
> have the important assumption that we are not going to implement all the
> change handlers: we only implement a very limited number of them for the
> most frequent operations that matter for scalability. For other less
> frequent changes we just fall back to full recompute.
> If we are unhappy with this, or if there are too many scenarios (input
> changes) that are considered important for scale, I think we shouldn't
> continue with this approach because it will eventually lead to
> unmaintainable change handler implementations. Instead, we should
> reimplement using DDlog, which is close to success in replacing
> northd. (I will address more of this later at the end of the email)
>
> > * Separate unrelated units from each other. For instance, OpenFlow
> > generation should be separate from packet-in processing, and those
> > should be separate from the code that claims port bindings. [1]
> >    * This allows for easier unit testing of these specific components.
> >    * This allows for potential parallelization (or pipelining) of tasks.
> > * Isolate the writing of OpenFlow to its own unit.
> >
> > Together, these changes have some net positive effects on the code:
> > * There will be fewer conditions that require the incremental engine to
> > perform full recomputes. This should lead to better performance.
> > * Adding additional incremental processing should be easier and less
> > error prone than it currently is.
> > * Changing technologies that ovn-controller interacts with (e.g. ovsdb,
> > OpenFlow) is now at least within the realm of possibility.
> > * Adding new features should be less error-prone. Adding new features
> > should also be easier, since it should hopefully be more obvious how a
> > new feature should slot into ovn-controller.
> > * As has been noted above, several of the changes allow for more unit
> > tests to be written, allowing for better code coverage, and better
> > confidence in the code.
> > * With the isolation of code sections, it should not be as easy to
> > accidentally break an unrelated code section with a bug fix or new
> > feature.
> > * With the layering of code sections, it should be easier to profile the
> > code and find where the problem areas are. Adjusting their performance
> > should be safe as long as the inputs and outputs from those layers
> > remain the same.
> >
> > Hypothetical layering system
> >
> > Below is a possible layering of parts that ovn-controller could use.
> > Note that this chart is not meant to be prescriptive: this is only a
> > suggestion, and it likely is missing important details.
> >
> > +-------------------------------------------------------------+
> > |                                                             |
> > |                  OVSDB (Southbound and OVS)                 |
> > |                                                             |
> > +-------------------------------------------------------------+
> > |                                                             |
> > |               Data Translation/Validation (read)            |
> > |                                                             |
> > +-------------------------------------------------------------+
> > |                                                             |
> > |                         Data Filter                         |
> > |                                                             |
> > +-------------------------------------------------------------+
> > |                                                             |
> > |                     Incremental Engine                      |
> > |                                                             |
> > +---------------+----------------+----------------------------+
> > |               |                |                            |
> > | Logical Flow  | Physical Flow  |      Other Processing      |
> > |               |                |                            |
> > +---------------+---------------------------------------------+
> > |                                |                            |
> > |          Flow Writing          |  Data Translation (write)  |
> > |                                |                            |
> > +-------------------------------------------------------------+
> > |                                |                            |
> > |           vswitchd             | OVSDB (Southbound and OVS) |
> > |                                |                            |
> > +--------------------------------+----------------------------+
> >
> >
> > The topmost and bottommost rows represent entities outside of OVN. The
> > top row represents the source of all data that OVN processes. The bottom
> > row represents the final destinations for data that OVN has processed.
> > The inner blocks are layers within ovn-controller. Each layer reads data
> > from the layer(s) directly above it and writes data to the layer(s)
> > directly below it.
> >
> > Data Translation/Validation (read)
> >
> > This layer is responsible for all reads from the OVSDBs that OVN
> > interacts with. It translates the structures from the OVSDB IDL into
> > local ovn-controller formats, possibly performing validation if
> > necessary. The output of this layer to lower layers is a snapshot of the
> > database. This layer and the Data Translation (write) layer are the only
> > layers that use OVSDB IDL types and transactions.
> >
>
> For incremental processing to work, the output of this layer must include
> the changed data tracking information in addition to a snapshot of the
> whole database.
>
> > Data Filter
> >
> > This layer takes the data from the Data Translation layer and determines
> > what, if any, is relevant to the local chassis. This layer is
> > responsible for filtering out non-local data before it reaches any lower
> > layers.
> >
>
> I assume the conditional monitoring provided by the DB would still be used,
> which doesn't conflict with the filtering layer here, right?
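As a sketch of what the filtering step described above might look like: the structure and field names below are purely illustrative (real ovn-controller locality rules are far richer, involving local datapaths, requested-chassis options, and OVS interface records), but the shape is a pure function from a translated snapshot to its locally relevant subset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative local structure produced by the translation layer; the
 * field names are assumptions, not the real ovn-controller types. */
struct pb_snapshot {
    const char *logical_port;
    const char *requested_chassis;  /* "" if unconstrained. */
    bool has_local_interface;       /* An OVS iface with matching iface-id. */
};

/* Data filter layer: keep only the bindings this chassis must act on.
 * Relevant entries are compacted to the front of the array; returns the
 * number of entries kept. */
static size_t
filter_local_bindings(struct pb_snapshot *pbs, size_t n,
                      const char *local_chassis)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        bool requested_here = pbs[i].requested_chassis[0]
            && !strcmp(pbs[i].requested_chassis, local_chassis);
        if (pbs[i].has_local_interface || requested_here) {
            pbs[kept++] = pbs[i];
        }
    }
    return kept;
}
```

Because the function is side-effect free and operates on local types, it can be unit tested directly, and it composes with conditional monitoring rather than replacing it: the DB-side condition reduces what arrives, and this layer decides what the lower layers actually see.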
>
> > Incremental Engine
> >
> > This portion of the code is responsible for keeping caches of data, and
> > taking the snapshots of data from the layers above to determine what has
> > changed. The output of this layer is the set of changes: data that has
> > been added, updated, or deleted.
> >
> As mentioned above, I hope that in most cases determining what has changed
> can be deduced directly from the Data Translation layer; otherwise I wonder
> if this would become the new bottleneck in a large-scale environment.
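The cache-and-compare approach described for this layer can be sketched as follows. Names are illustrative only, the O(n*m) lookup is deliberately naive (a real engine would use hash tables, or better, take change tracking straight from the IDL as Han suggests), but the output is the essential part: an explicit set of added, updated, and deleted records for lower layers to consume.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A cached record: identified by name, with a version to detect updates.
 * Field names and layout here are illustrative only. */
struct rec {
    const char *name;
    unsigned version;
};

enum change_kind { CH_ADDED, CH_UPDATED, CH_DELETED };

struct change {
    enum change_kind kind;
    const char *name;
};

/* Compare a new snapshot against the cache and emit the change set.
 * "out" must have room for n_cache + n_snap entries in the worst case. */
static size_t
diff_snapshot(const struct rec *cache, size_t n_cache,
              const struct rec *snap, size_t n_snap,
              struct change *out)
{
    size_t n_out = 0;

    /* New and updated records, in snapshot order. */
    for (size_t i = 0; i < n_snap; i++) {
        const struct rec *old = NULL;
        for (size_t j = 0; j < n_cache; j++) {
            if (!strcmp(cache[j].name, snap[i].name)) {
                old = &cache[j];
                break;
            }
        }
        if (!old) {
            out[n_out++] = (struct change) { CH_ADDED, snap[i].name };
        } else if (old->version != snap[i].version) {
            out[n_out++] = (struct change) { CH_UPDATED, snap[i].name };
        }
    }

    /* Deleted records: in the cache but absent from the snapshot. */
    for (size_t j = 0; j < n_cache; j++) {
        bool found = false;
        for (size_t i = 0; i < n_snap; i++) {
            if (!strcmp(cache[j].name, snap[i].name)) {
                found = true;
                break;
            }
        }
        if (!found) {
            out[n_out++] = (struct change) { CH_DELETED, cache[j].name };
        }
    }
    return n_out;
}
```

Note that a "recompute" in this model is just diffing against an empty cache: with n_cache = 0, every snapshot record comes out as CH_ADDED.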
>
> > Logical Flow
> >
> > This layer is responsible for generating the OpenFlow flows based on
> > changed logical flow data. This layer is where all logical flow
> > expression parsing is performed. It takes the changes from the
> > incremental engine and outputs a set of OpenFlow changes.
> >
> > Physical Flow
> >
> > This layer is responsible for generating the OpenFlow flows based on
> > physical data, such as input/output to VIF ports, logical patch ports,
> > etc. Unlike the Logical Flow layer, this creates OpenFlow flows directly
> > without the need for any expression parsing, since the flows originate
> > from the network structure rather than logical rules. Its output is the
> > same as the Logical Flow layer: the set of OpenFlow changes that should
> > be written.
> >
> > Flow Writing
> >
> > This layer writes the flows generated from the Logical Flow and Physical
> > Flow layers. This layer is responsible for writing flow_mod, group_mod,
> > meter_mod, etc. messages from OVN to OVS. It also maintains the state
> > machine between OVN and OVS, freeing the upper layers from having to
> > bother with it.
> >
> > Other processing
> >
> > What a horrible name! This essentially describes any entity in OVN that
> > needs to consume the changes from the incremental engine and write
> > changes to a database as a result.
> >
> > General notes:
> > * Though the layers describe the overall flow of data and subdivide the
> > duties, it does not necessarily represent how code might end up being
> > grouped. For instance, we might write a code unit that takes on the
> > duties of multiple layers by specializing in a particular data type. As
> > an example, consider a port binding unit that does the following:
> >    * Translates southbound port binding related data into local data
> > structures (Data Translation/Validation (read) layer)
> >    * Determines which port bindings are considered local to the
> > hypervisor (data filter layer)
> >    * Maintains a cache of current port binding data to determine what
> > has changed (incremental engine layer)
> >    * Acts upon the changes to port binding data (e.g. "claims" local
> > port bindings) (other processing layer)
> >    * Writes updates to the southbound database (e.g. updates the chassis
> > field on a port binding that it has claimed) (data translation (write)
> > layer)
> > As long as the layering is adhered to, it should not necessarily matter
> > how the code is arranged.
> > * If the layering model is adhered to, it should be possible to replace
> > a layer transparently, so long as the new code provides the same outputs
> > based on the inputs it receives.
> > * If we decided to change database technologies, the two data
> > translation layers (read and write) would need to be replaced, but
> > nothing else would need to be touched.
> > * What previously was called a recompute in the incremental engine is
> > now essentially replaced with a cache clear. By clearing the cache, all
> > data from the upper layers will be seen as new. Based on what was
> > discussed in a higher bullet point, if the data caches are farmed out to
> > separate code units, you can have localized cache clears that do not
> > affect other units. Thus one unit may need a "recompute" while others
> > may be able to continue without a need to clear their caches.
>
> Recomputing in most cases is in fact caused by a missing change handler
> rather than missing change tracking information. The difficulty of
> implementing change handlers for all inputs on every node stems from the
> complex dependency graph. For this reason, clearing a cache can still
> easily trigger recomputing of many components if the cache (e.g.
> port-binding) is shared by them.
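The port-binding unit described in the bullets above, one code unit spanning all five layers, could be wired together roughly like this. Every function below is a stub standing in for one layer, and the counters exist only to make the control flow visible; nothing here reflects real ovn-controller code.

```c
#include <stddef.h>

/* Sketch of a single "port binding unit" whose internal steps follow the
 * layering even though they live in one module.  All names hypothetical. */
struct pb_unit_stats {
    size_t translated;  /* Rows read via the translation (read) layer.  */
    size_t local;       /* Rows kept by the data filter layer.          */
    size_t changed;     /* Rows the incremental engine flagged.         */
    size_t claimed;     /* Rows acted upon (e.g. bindings claimed).     */
    size_t written;     /* Rows written via the translation (write) layer. */
};

/* Stubbed layer steps; a real unit would operate on IDL rows, local
 * structures, and OVSDB transactions instead of plain counters. */
static size_t translate_from_idl(void)      { return 10; }
static size_t filter_local(size_t n)        { return n / 2; }
static size_t diff_against_cache(size_t n)  { return n - 1; }
static size_t claim_changed(size_t n)       { return n; }
static size_t write_to_sb(size_t n)         { return n; }

/* One pass through the unit: each stage consumes only the previous
 * stage's output, which is the layering invariant the document asks for. */
static struct pb_unit_stats
port_binding_unit_run(void)
{
    struct pb_unit_stats s;
    s.translated = translate_from_idl();
    s.local = filter_local(s.translated);
    s.changed = diff_against_cache(s.local);
    s.claimed = claim_changed(s.changed);
    s.written = write_to_sb(s.claimed);
    return s;
}
```

The value of arranging a unit this way is that each stage can be replaced or tested in isolation, and a localized "cache clear" only affects this unit's diff stage rather than forcing unrelated units to recompute.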
>
> > * pinctrl is not represented in this diagram. pinctrl sources its data
> > differently from the rest of ovn-controller. pinctrl does not use the
> > incremental engine. The pinctrl thread does not use OVSDB structures
> > directly. In other words, pinctrl is not in need of the same type of
> > refactoring as the rest of ovn-controller [2]
>
> Sorry, maybe I misunderstood here, but pinctrl does rely on several SB
> OVSDB tables as inputs/outputs. You are absolutely right that it doesn't
> use incremental processing engine. But if the DB layer is refactored then I
> think this module needs to be heavily involved in the consideration as
> well. (There is still an option open for discussion: using an independent SB
> DB connection for pinctrl, with the side effect of doubling the number of SB
> DB connections.)
>
> > * The top three layers (data translation (read), data filter, and
> > incremental engine) can be grouped together as a larger data acquisition
> > layer. The layers below that can be grouped together as a data
> > transformation layer. This was difficult to represent in an ASCII
> > diagram. The reason for making this distinction is that it provides a
> > clean separation point that could be considered when rethinking
> > threading/program control of ovn-controller.
> >
> > Possible Transition Plan
> > If all of the above were submitted as a single patch series it would be
> > nearly impossible to review, and it likely would introduce unforeseen
> > bugs. So let's think about how best to go from current ovn-controller to
> > a more structured ovn-controller.
> >
> > Phase 1:  Add scaffolding for the layers alongside the existing
> > ovn-controller code.
> > 1. Add stub function calls to ovn-controller's main loop that each
> > correspond to one of the data processing layers.
> > 2. Move the current ovn-controller loop logic into one of these stubs.
> > If using the layering system suggested earlier in this document, the
> > "other processing" layer would be a good place to put it.
> >
> > This reformulates ovn-controller to call into the individual layers.
> >  From here, we can fill out the stubs with the appropriate logic, and we
> > can remove "old" code as necessary.
> >
> > Phase 2: Isolate IDL data type usage
> > 1. Determine internal data types to be used.
> > 2. Write data translation layers so that IDL types can be translated
> > into internal types and vice-versa. Call into these functions from the
> > appropriate stub functions created in phase 1.
> > 3. Add unit tests that ensure data translation works as expected between
> > the types.
> > 4. Replace the use of IDL data types in the code with the internal types.
> >
> > At this point, you will have the data translation layers written. Using
> > internal data types will make it easier to create unit tests during
> > later phases.
> >
> > Phase 3: Incremental engine redefinition
> > 1. Determine the data types for which incremental processing is desired.
> > 2. Pick one of those data types and create the following:
> >     a. A data cache for the type.
> >     b. A method that takes the current data as input and compares it to
> > the cache, outputting the new, deleted, and updated items.
> > 3. Write unit tests
> > 4. Write the layers below the incremental engine to use this "diff" data.
> > 5. Ensure the stub from phase 1 that calls out to the incremental engine
> > is filled in for this data type.
> > 6. Delete any old incremental engine code that might conflict with this
> > new data handling.
> > 7. Repeat steps 2 - 6 for each data type that the incremental engine
> > will handle.
> >
> > Phase 3 is by far the hardest phase, since it likely will not be a very
> > clean process. By dividing the work by data type, it at least makes it
> > slightly easier to try to break the work up into more manageable chunks.
> >
> > Phase 4: Add processing between data translation and the incremental
> > engine. If using the proposed layering scheme above, this would be the
> > data filter layer.
> > 1. Determine the data types that need to be filtered for the local
> > environment.
> > 2. Pick one of those data types and create the following:
> >     a. A function that will take in the current data set and will output
> > a subset that includes only data that is pertinent to the local
> > hypervisor.
> > 3. Call the functions from step 2 in the appropriate stub from
> > ovn-controller.
> > 4. Write unit tests.
> > 5. Repeat steps 2 - 4 for each data type from step 1.
> >
> > Phase 4 is a bit simpler, assuming that data filtering is the only
> > processing that occurs between the data translation layer and
> > incremental engine.
> >
> > At this point we essentially have the new code written. There may be
> > some additional work that is needed below the incremental engine layer,
> > but a lot of the incremental engine rework in phase 3 is going to touch
> > those lower layers as well. Whatever is left of the original
> > ovn-controller logic should be examined to determine whether it should be
> > moved into a different layer, removed altogether, or left where it is.
> > For the code that gets left where it is, we can examine what is left and
> > determine if it can or should be cleaned up/refactored.
> >
>
> Thanks for thinking through and writing up so much detail for the transition
> plan. However, I have to admit that it is still quite vague to me, probably
> because I didn't think through all the details just by reading this
> document. Some POC is definitely needed, but even a POC sounds like a very
> big effort for the proposed changes.
>
> Before moving forward, I'd like to share some thoughts which are more about
> re-architecting than refactoring ovn-controller, but aim at the same
> goal of simplicity and efficiency.
> In the past we have learnt that the logical flow translation to open flow
> rules is the performance bottleneck, and most of our effort of scalability
> improvement, including the I-P engine, was focusing on the logical flow
> processing. There are two major challenges for logical flow translation.
> 1. Logical flow translation is mostly string parsing, which is CPU
> intensive.
> 2. Since it is a string (specifically, the "match" field), it lacks
> metadata for preprocessing such as filtering out the flows that are not
> needed on the current chassis. That information is available only after
> parsing the strings.
>
> The initial design of the centralized logical flow generation (by northd)
> is to do the common task in one place and each chassis only takes care of
> the chassis-specific part of computation. This is a reasonable decision
> from the architecture point of view. However, now that we have
> experience with the performance bottleneck at scale, maybe it is time to
> rethink this design. Since logical flow translation is so costly, the cost
> saved by sparing each ovn-controller from repeating the same computation may
> not be worth the cost and complexity it introduced. It may be better to
> combine the two layers of processing, i.e. move the northd logic to
> ovn-controller, and have ovn-controller directly talk to the NB DB,
> consuming the NB model together with the current SB DB data (except that
> the logical-flow table is not needed at all) to compute the OVS OpenFlow
> rules directly, completely skipping the logical-flow to OpenFlow
> translation. The benefits of this approach are:
> 1. Obviously, the most costly logical-flow to OpenFlow translation is
> completely avoided.
> 2. By directly consuming the NB objects (instead of strings of match
> conditions), ovn-controller has first-hand insight of the data and it is
> much easier for optimizations such as filtering out data that is irrelevant
> to the current chassis.
> 3. Since we are moving most of the northd logic to ovn-controller,
> it is likely that we can reuse most of the DDlog effort spent on northd,
> and extend it for ovn-controller. This would be more maintainable than the
> C I-P engine in the long run, and also avoids the I-P problems discussed
> above.
> 4. If DDlog is utilized, the problem of repeating computation for the
> common part that the initial design of the centralized logical flow
> generation was trying to solve doesn't even exist - the cost of computing
> the common part is so small and there is no point to save it by
> implementing the two-layer architecture.
>
> Although this may sound like a bigger change, it seems to me this
> approach doesn't necessarily take more effort than the refactoring
> effort proposed above for ovn-controller, and I think it would benefit OVN
> in the long run.
>
> Thoughts?

I definitely think it's worth pursuing this direction too.  In my opinion we
may not be able to completely remove the SB dependency. Or at least to start
with:
  - ovn-northd can replicate the information required for flow generation
    by each ovn-controller into the SB DB
  - and it can stop generating logical flows.

A while back, I started exploring a similar idea: that each ovn-controller
would generate logical flows internally. As a first step I removed the
logical flow generation in ovn-northd for flows which don't involve logical
ports (i.e., which don't match on the inport or outport fields). I was able
to get it to work [1]. Meanwhile Ilya added logical datapath groups, which
helped reduce the number of logical flows a lot, and I stopped my POC effort
after that.

My main motivation was to reduce the number of logical flows in the SB DB,
as that was causing a lot of issues in our scale testing efforts.

Removing the logical flow generation from ovn-northd will help with the high
CPU and memory consumption seen (in a scaled deployment) in ovn-northd and
the SB ovsdb-servers.

I still think we can make use of the logical flow expr parsing in
ovn-controller, as it will make it easier to generate OpenFlow flows.

[1] - https://github.com/numansiddique/ovn/commits/lflows_in_controller_rebased
       https://github.com/numansiddique/ovn/tree/lflows_in_controller


Whether we take this approach or not, I think it's still worthwhile to put
effort into refactoring the existing ovn-controller code base as per the
suggestions and design put forward by Mark here (wherever it is possible to
follow the design).

Thanks
Numan

>
> Thanks,
> Han
>
> >
> > [1] To some extent this is already done, since code is separated into
> > lflow.c, ofctrl.c, binding.c, physical.c, etc. However, those all have
> > cross-dependencies that can result in unexpected interactions.
> > [2] This is not to imply that pinctrl is perfect. It is potentially in
> > need of its own refactoring. It also doesn't mean pinctrl would not be
> > touched by the ovn-controller refactor, since it makes use of structures
> > like local datapaths, which may be altered by the refactor.
> >
> > _______________________________________________
> > dev mailing list
> > dev at openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev

