[ovs-dev] ovn-controller refactor doc (LONG)

Dumitru Ceara dceara at redhat.com
Wed Feb 24 11:31:35 UTC 2021


On 2/11/21 3:22 PM, Mark Michelson wrote:
> On 2/11/21 7:47 AM, Numan Siddique wrote:
>> On Thu, Feb 11, 2021 at 2:49 PM Han Zhou <hzhou at ovn.org> wrote:
>>>
>>> On Tue, Jan 26, 2021 at 1:04 PM Mark Michelson <mmichels at redhat.com> 
>>> wrote:
>>>>
>>>> Warning: Very long email ahead
>>>>
>>>
>>> Hi Mark,
>>>
>>> Thanks for writing this up. Please see my comments (also very long :))
>>> below.
> 
> And thanks for taking the time to read and respond.

Hi Mark, Numan, Han,

Thanks for the discussion until now!

I'll try to share some thoughts too.

> 
>>>
>>>> ovn-controller has seen numerous drastic improvements over the past
>>>> several years. The focus has been mostly on performance and adding new
>>>> features. This, however, has resulted in a code base that has become
>>>> difficult to work with. The problems it has can be summed up as 
>>>> follows:
>>>> * It is not always obvious how certain pieces of code interact with 
>>>> each
>>>> other.
>>>> * The architecture comes across as monolithic. That is, the
>>>> ovn-controller application does not have individual components or 
>>>> layers
>>>> that have isolated duties.
>>>> * Southbound database IDL structures are used directly throughout the
>>>> code, rather than structures that may better represent the network
>>>> makeup or that take advantage of types and other benefits of C.
>>>> * There are a lot of special-cases. In other words, circumstances can
>>>> result in different code paths running.
>>>>
>>>> These have led to problems, such as:
>>>> * Unit testing is difficult.
>>>> * System tests can be flaky. We might see some developers run tests
>>>> successfully 90% of the time, while other devs run the same tests with
>>>> 90% failure.
>>>> * Attempting to replace a section of code with something new is 
>>>> difficult.
>>>> * Modifying the code can result in unexpected errors in what seems at
>>>> first glance to be an unrelated section.
>>>> * The incremental engine sometimes encounters situations that require
>>>> full recomputes of all southbound data. Many of these situations could
>>>> be avoided if the focus of the incremental engine were narrowed.
>>>>
>>>> To fix these issues, we suggest the following ideas:
>>>> * Isolate database IDL usage to a data translation layer. That is,
>>>> database-specific structures from the IDL will be translated into local
>>>> structures. Any lower layers would act upon the local structures only,
>>>> never seeing the database data directly.
>>>>     * This allows lower layers to be more easily unit tested.
>>>>     * This allows changing database technologies more easily.
>>>>     * This potentially allows for more natural representations of data.

I think another implication is that such a layer would enable further 
optimization in ovn-controller.

For example, the current IDL change tracking doesn't provide information 
about old vs new values of specific columns in a table.  One consequence 
of that is that with logical datapath groups enabled, when a datapath 
group is updated (e.g., a datapath is added), we are forced to remove 
the resulting OF flows for all datapaths and readd them for the new 
contents of the 'dp_group->datapaths' column.

If the data translation layer provided more fine-grained information 
about the changes that happened to a record, this could easily be 
optimized: only the OF flows for deleted/updated/new datapaths in a 
dp_group would have to be recomputed.
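
To make this concrete, here's a rough sketch of the kind of diff a 
column-aware translation layer could hand to the flow-generation code. 
All types and names below are hypothetical, not actual OVN code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a tracked 'dp_group->datapaths' column:
 * the translation layer keeps the old value next to the new one so
 * change handlers can diff them instead of recomputing everything. */
struct dp_set {
    const uint32_t *tunnel_keys;  /* sorted datapath tunnel keys */
    size_t n;
};

/* Walk two sorted key sets and report which datapaths were added or
 * removed; only those would need their OF flows (re)computed.  Returns
 * the number of changed keys written to 'changed' (capacity 'cap'). */
static size_t
dp_set_diff(const struct dp_set *old, const struct dp_set *new,
            uint32_t *changed, size_t cap)
{
    size_t i = 0, j = 0, n = 0;
    while (i < old->n || j < new->n) {
        if (n == cap) {
            break;
        }
        if (j == new->n || (i < old->n
                            && old->tunnel_keys[i] < new->tunnel_keys[j])) {
            changed[n++] = old->tunnel_keys[i++];   /* removed */
        } else if (i == old->n
                   || new->tunnel_keys[j] < old->tunnel_keys[i]) {
            changed[n++] = new->tunnel_keys[j++];   /* added */
        } else {
            i++; j++;                               /* unchanged */
        }
    }
    return n;
}
```

With a diff like this, the dp_group change handler would only touch the 
OF flows of the datapaths in 'changed' instead of all datapaths in the 
group.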

>>>> In other words, we are not restricted to the relations and limited 
>>>> types
>>>> that the databases use.
>>>
>>> This sounds reasonable, but we should also keep in mind that an 
>>> abstraction
>>> layer would introduce some cost. I think it would be worth the effort &
>>> cost if we really want to change the database technologies.
> 
> Yes, I agree. If I recall how DDLog works, then a similar translation 
> layer would have to be written in order to use DDLog, since we would 
> need to translate between OVSDB data and DDLog relations. It means that 
> switching to DDlog would not require locking into a specific DB.
> 
>>>
>>>> * Isolate the incremental engine to determine only which data is
>>>> relevant to be processed by lower layers. The lower layers should then
>>>> take whatever data is relevant and perform any actions that should be
>>>> performed (e.g. claiming/releasing port bindings, writing openflow
>>>> flows, etc.)
>>>>     * This reduces most special-casing of code in lower layers.
>>>>     * This makes it easier to reason about what effect the incremental
>>>> engine will have on what will be processed.
>>>>     * This reduces the overall stretch of the incremental engine on the
>>>> rest of the code.
>>>
>>> I think this was the initial design principle of the I-P engine. It 
>>> should
>>> only represent data dependency and invokes the change handlers (if 
>>> this is
>>> what the *lower layer* means here) to react to what's changed 
>>> accordingly.
>>> The difficult part is the change handler implementation: it needs to
>>> understand the relationship between all the inputs and take correct 
>>> actions
>>> when any of the input changes, examine the change and reconcile by
>>> processing the other related data.
>>> Due to this complexity, there were only simple scenarios supported 
>>> with I-P
>>> initially. While adding more change handlers for more scenarios to 
>>> utilize
>>> I-P, the complexity increased. Specifically, the model was somehow not
>>> strictly followed when adding those scenarios in I-P, which led to a lot
>>> of *special* cases (and I am definitely responsible for it as the main
>>> reviewer). I'd also admit that there were *special* cases even in the
>>> initial commits of I-P, but since it only takes care of the very limited
>>> most frequently used scenarios, it was much easier to reason about the
>>> correctness.
>>> Now there is an effort to complete one of the big TODOs to make the I-P
>>> dependency graph more straightforward and easier to follow and 
>>> maintain (
>>> https://patchwork.ozlabs.org/project/ovn/patch/20210205094254.425942-1-numans@ovn.org/). 
>>>
>>> Thanks Numan for working on this and I am still reviewing it.
> 
> This feedback is very valuable. I'll admit that my proposal is 
> deliberately vague in certain cases, such as defining specific data 
> types to be used internally by ovn-controller. One of the implied ideas 
> I had was that when translating from database formats into internal 
> structs, we might attempt to unify some of the dependent database types 
> into more streamlined C structures. This way, the I-P engine at the 
> lower level doesn't have such a complex dependency graph. Instead, it 
> can (hopefully) operate on independent types as much as possible.
> 
> As an example, think about struct ovn_datapath in ovn-northd. It is a C 
> structure that does not correlate directly with any one particular 
> database table, but rather contains a selection of data from multiple 
> tables. If you think about having ovn_datapath as the incrementally 
> processed datatype, rather than Logical_Switch, Logical_Switch_Port, 
> etc., then you might find that the dependency graph becomes much simpler.
> 
> It is possible that what I'm imagining is not practical, though.
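
For what it's worth, a unified internal type along the lines Mark 
describes could look roughly like this (all field names are 
hypothetical, loosely inspired by northd's struct ovn_datapath):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unified internal type: one C struct assembled from
 * several SB tables, so the I-P engine can depend on a single,
 * independent type instead of a graph of IDL tables. */
struct ctrl_datapath {
    uint32_t tunnel_key;      /* from Datapath_Binding */
    bool is_switch;           /* derived from external_ids (LS vs LR) */
    size_t n_ports;           /* count of related Port_Binding rows */
    bool has_local_ports;     /* derived: any port bound locally? */
};

/* Fold one port binding into the aggregate view of its datapath. */
static void
ctrl_datapath_add_port(struct ctrl_datapath *dp, bool port_is_local)
{
    dp->n_ports++;
    dp->has_local_ports |= port_is_local;
}
```

The point being that the lower layers would only ever see 
struct ctrl_datapath, never the individual IDL tables it was built from.
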
> 
>>>
>>> Moreover, I think if we continue with the I-P engine approach, we should
>>> have the important assumption that we are not going to implement all the
>>> change handlers: we only implement a very limited number of them for the
>>> most frequent operations that matter for scalability. For other less
>>> frequent changes we just fall back to full recompute.
> 
> Yes, this is a very good point. There was one point during an internal 
> RH meeting a few weeks ago where an OVN developer brought up the idea 
> that we should probably re-profile OVN and try to determine if we can 
> actually get rid of some of the I-P and have similar performance (for 
> common scenarios, anyway). The idea is that the tradeoff between 
> performance and maintainability is worth exploring more.
> 
>>> If we are unhappy with this, or if there are too many scenarios (input
>>> changes) that are considered important for scale, I think we shouldn't
>>> continue with this approach because it will eventually lead to
>>> unmaintainable change handler implementations. Instead, we should
>>> reimplement using DDlog, which is approaching success in replacing
>>> northd. (I will address more of this later at the end of the email)
>>>> * Separate unrelated units from each other. For instance, OpenFlow
>>>> generation should be separate from packet-in processing, and those
>>>> should be separate from the code that claims port bindings. [1]
>>>>     * This allows for easier unit testing of these specific components.
>>>>     * This allows for potential parallelization (or pipelining) of 
>>>> tasks.
>>>> * Isolate the writing of OpenFlow to its own unit.
>>>>
>>>> Together, these changes have some net positive effects on the code:
>>>> * There will be fewer conditions that require the incremental engine to
>>>> perform full recomputes. This should lead to better performance.
>>>> * Adding additional incremental processing should be easier and less
>>>> error prone than it currently is.
>>>> * Changing technologies that ovn-controller interacts with (e.g. ovsdb,
>>>> OpenFlow) is now at least within the realm of possibility.
>>>> * Adding new features should be less error-prone. Adding new features
>>>> should also be easier, since it should hopefully be more obvious how a
>>>> new feature should slot into ovn-controller.
>>>> * As has been noted above, several of the changes allow for more unit
>>>> tests to be written, allowing for better code coverage, and better
>>>> confidence in the code.
>>>> * With the isolation of code sections, it should not be as easy to
>>>> accidentally break an unrelated code section with a bug fix or new
>>> feature.
>>>> * With the layering of code sections, it should be easier to profile 
>>>> the
>>>> code and find where the problem areas are. Adjusting their performance
>>>> should be safe as long as the inputs and outputs from those layers
>>>> remain the same.
>>>>
>>>> Hypothetical layering system
>>>>
>>>> Below is a possible layering of parts that ovn-controller could use.
>>>> Note that this chart is not meant to be prescriptive: this is only a
>>>> suggestion, and it likely is missing important details.
>>>>
>>>> +-------------------------------------------------------------+
>>>> |                                                             |
>>>> |                  OVSDB (Southbound and OVS)                 |
>>>> |                                                             |
>>>> +-------------------------------------------------------------+
>>>> |                                                             |
>>>> |               Data Translation/Validation (read)            |
>>>> |                                                             |
>>>> +-------------------------------------------------------------+
>>>> |                                                             |
>>>> |                         Data Filter                         |
>>>> |                                                             |
>>>> +-------------------------------------------------------------+
>>>> |                                                             |
>>>> |                     Incremental Engine                      |
>>>> |                                                             |
>>>> +---------------+----------------+----------------------------+
>>>> |               |                |                            |
>>>> | Logical Flow  | Physical Flow  |      Other Processing      |
>>>> |               |                |                            |
>>>> +---------------+---------------------------------------------+
>>>> |                                |                            |
>>>> |          Flow Writing          |  Data Translation (write)  |
>>>> |                                |                            |
>>>> +-------------------------------------------------------------+
>>>> |                                |                            |
>>>> |           vswitchd             | OVSDB (Southbound and OVS) |
>>>> |                                |                            |
>>>> +--------------------------------+----------------------------+
>>>>
>>>>
>>>> The topmost and bottommost rows represent entities outside of OVN. The
>>>> top row represents the source of all data that OVN processes. The 
>>>> bottom
>>>> row represents the final destinations for data that OVN has processed.
>>>> The inner blocks are layers within ovn-controller. Each layer reads 
>>>> data
>>>> from the layer(s) directly above it and writes data to the layer(s)
>>>> directly below it.
>>>>
>>>> Data Translation/Validation (read)
>>>>
>>>> This layer is responsible for all reads from the OVSDBs that OVN
>>>> interacts with. It translates the structures from the OVSDB IDL into
>>>> local ovn-controller formats, possibly performing validation if
>>>> necessary. The output of this layer to lower layers is a snapshot of 
>>>> the
>>>> database. This layer and the Data Translation (write) layer are the 
>>>> only
>>>> layers that use OVSDB IDL types and transactions.
>>>>
>>>
>>> For incremental processing to work, the output of this layer must 
>>> include
>>> the changed data tracking information in addition to a snapshot of the
>>> whole database.
> 
> I'm curious about this statement. Why is a snapshot of the whole 
> database required?

I think I agree with Han here: along with the local ovn-controller 
records (which I assume are the "snapshot" of the DB), there should also 
be methods/APIs in the translation layer to determine exactly what 
changed in a record, ideally at column-level granularity (see the 
example about dp_groups above).
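
As a strawman for what such an API could look like, here's a sketch of 
per-record, per-column change tracking; the column names and types are 
made up, purely to illustrate the granularity:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shape of the translation-layer change API: alongside
 * each translated record, keep a bitmap of which columns changed in
 * the last IDL run, so change handlers can ask "did column X change?"
 * instead of recomputing based on the whole record. */
enum pb_column {
    PB_COL_CHASSIS,
    PB_COL_MAC,
    PB_COL_TUNNEL_KEY,
    PB_N_COLUMNS,
};

struct tracked_record {
    uint32_t changed_columns;   /* bit i set => column i changed */
};

static inline void
tracked_record_mark_changed(struct tracked_record *rec, enum pb_column col)
{
    rec->changed_columns |= UINT32_C(1) << col;
}

static inline bool
tracked_record_column_changed(const struct tracked_record *rec,
                              enum pb_column col)
{
    return (rec->changed_columns >> col) & 1;
}
```

The translation layer would set the bits while processing IDL updates; 
change handlers would only query them.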

> 
>>>
>>>> Data Filter
>>>>
>>>> This layer takes the data from the Data Translation layer and 
>>>> determines
>>>> what, if any, is relevant to the local chassis. This layer is
>>>> responsible for filtering out non-local data before it reaches any 
>>>> lower
>>>> layers.
>>>>
>>>
>>> I assume the conditional monitoring provided by the DB would still be 
>>> used,
>>> which doesn't conflict with the filtering layer here, right?
> 
> No, actually. I was thinking that until the incremental engine layer was 
> hit, we were operating on whole snapshots of the database. Essentially, 
> we were moving the incremental processing from the database layer down 
> to a lower layer and operating on custom structures instead.
> 
> Now that I think about it some more, having some database tracking might 
> be good just so we can account for simple cases, like when nothing in a 
> database table has changed.

I guess I'm a bit confused here; I initially thought that a first 
implementation of the data translation layer would be a wrapper on top 
of the ovsdb IDL with enhanced change tracking capabilities.

Conditional monitoring is something we definitely need in the long run, 
in my opinion, but that would be an "implementation detail" of the data 
translation layer.

With that in mind, I think the only use case for the Data Filter layer 
would be to perform client-side filtering in case conditional monitoring 
is disabled (i.e., ovn-monitor-all=true).  I might be completely wrong 
though :)
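
For illustration, that client-side filtering could be as simple as a 
predicate applied while translating records. Everything below is a 
hypothetical sketch, not actual ovn-controller types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical simplified view of a port binding after translation. */
struct local_binding {
    const char *logical_port;
    const char *requested_chassis;  /* NULL => no chassis constraint */
};

/* Would this binding be relevant to the local chassis? */
static bool
binding_is_local(const struct local_binding *pb, const char *chassis_name)
{
    return !pb->requested_chassis
           || !strcmp(pb->requested_chassis, chassis_name);
}

/* Keep in 'out' only the bindings relevant to this chassis; returns
 * how many were kept.  'out' must have room for 'n_in' pointers. */
static size_t
filter_local(const struct local_binding *in, size_t n_in,
             const char *chassis_name, const struct local_binding **out)
{
    size_t n_out = 0;
    for (size_t i = 0; i < n_in; i++) {
        if (binding_is_local(&in[i], chassis_name)) {
            out[n_out++] = &in[i];
        }
    }
    return n_out;
}
```

With conditional monitoring enabled, the DB would already have done this 
filtering for us and this layer would be (mostly) a no-op.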

> 
>>>
>>>> Incremental Engine
>>>>
>>>> This portion of the code is responsible for keeping caches of data, and
>>>> taking the snapshots of data from the layers above to determine what 
>>>> has
>>>> changed. The output of this layer is the set of changes: data that has
>>>> been added, updated, or deleted.
>>>>
>>> As mentioned above, I hope in most cases determining what has changed is
>>> deduced directly from the Data Translation layer, otherwise I wonder if
>>> this would become the new bottleneck in a large scale environment.
> 
> Nope, but you could be right that if the data translation layer is 
> having to translate the entire database on every loop, it could become a 
> bottleneck.

I agree with Han; I think it's more useful if the Data Translation layer 
provides information about what changed in the records.

> 
> I'm curious how DDLog works in this case. Is it using change tracking in 
> the database, or does it just rely on very efficient 1-to-1 translations 
> of database data to DDLog relations?

IIUC, DDlog directly processes the jsonrpc updates from ovsdb (which are 
"incremental" by nature) without using the IDL (and its change 
tracking), implementing its own incremental handling of updates in 
ddlog_apply_ovsdb_updates().

> 
>>>
>>>> Logical Flow
>>>>
>>>> This layer is responsible for generating the OpenFlow flows based on
>>>> changed logical flow data. This layer is where all logical flow
>>>> expression parsing is performed. It takes the changes from the
>>>> incremental engine and outputs a set of OpenFlow changes.
>>>>
>>>> Physical Flow
>>>>
>>>> This layer is responsible for generating the OpenFlow flows based on
>>>> physical data, such as input/output to VIF ports, logical patch ports,
>>>> etc. Unlike the Logical Flow layer, this creates OpenFlow flows 
>>>> directly
>>>> without the need for any expression parsing, since the flows originate
>>>> from the network structure rather than logical rules. Its output is the
>>>> same as the Logical Flow layer: the set of OpenFlow changes that should
>>>> be written.
>>>>
>>>> Flow Writing
>>>>
>>>> This layer writes the flows generated from the Logical Flow and 
>>>> Physical
>>>> Flow layers. This layer is responsible for writing flow_mod, group_mod,
>>>> meter_mod, etc. messages from OVN to OVS. It also maintains the state
>>>> machine between OVN and OVS, freeing the upper layers from having to
>>>> bother with it.
>>>>
>>>> Other processing
>>>>
>>>> What a horrible name! This essentially describes any entity in OVN that
>>>> needs to consume the changes from the incremental engine and write
>>>> changes to a database as a result.
>>>>
>>>> General notes:
>>>> * Though the layers describe the overall flow of data and subdivide the
>>>> duties, it does not necessarily represent how code might end up being
>>>> grouped. For instance, we might write a code unit that takes on the
>>>> duties of multiple layers by specializing in a particular data type. As
>>>> an example, consider a port binding unit that does the following:
>>>>     * Translates southbound port binding related data into local data
>>>> structures (Data Translation/Validation (read) layer)
>>>>     * Determines which port bindings are considered local to the
>>>> hypervisor (data filter layer)
>>>>     * Maintains a cache of current port binding data to determine what
>>>> has changed (incremental engine layer)
>>>>     * Acts upon the changes to port binding data (e.g. "claims" local
>>>> port bindings) (other processing layer)
>>>>     * Writes updates to the southbound database (e.g. updates the 
>>>> chassis
>>>> field on a port binding that it has claimed) (data translation (write)
>>>> layer)
>>>> As long as the layering is adhered to, it should not necessarily matter
>>>> how the code is arranged.
>>>> * If the layering model is adhered to, it should be possible to replace
>>>> a layer transparently, so long as the new code provides the same 
>>>> outputs
>>>> based on the inputs it receives.
>>>> * If we decided to change database technologies, the two data
>>>> translation layers (read and write) would need to be replaced, but
>>>> nothing else would need to be touched.
>>>> * What previously was called a recompute in the incremental engine is
>>>> now essentially replaced with a cache clear. By clearing the cache, all
>>>> data from the upper layers will be seen as new. Based on what was
>>>> discussed in a higher bullet point, if the data caches are farmed 
>>>> out to
>>>> separate code units, you can have localized cache clears that do not
>>>> affect other units. Thus one unit may need a "recompute" while others
>>>> may be able to continue without a need to clear their caches.
>>>
>>> Recomputing in most cases is in fact caused by missing a change handler
>>> instead of missing the change tracking information. The difficulty of
>>> implementing change handlers for all inputs on every node is because 
>>> of the
>>> complex dependency graph. For this reason, clearing a cache can still
>>> easily trigger recomputing of many components if the cache (e.g.
>>> port-binding) is shared by them.
> 
> See my comment above about hopefully being able to not have as 
> interdependent types.
> 
>>>
>>>> * pinctrl is not represented in this diagram. pinctrl sources its data
>>>> differently from the rest of ovn-controller. pinctrl does not use the
>>>> incremental engine. The pinctrl thread does not use OVSDB structures
>>>> directly. In other words, pinctrl is not in need of the same type of
>>>> refactoring compared with the rest of ovn-controller [2]
>>>
>>> Sorry, maybe I misunderstood here, but pinctrl does rely on several SB
>>> OVSDB tables as inputs/outputs. You are absolutely right that it doesn't
>>> use the incremental processing engine. But if the DB layer is refactored 
>>> then I
>>> think this module needs to be heavily involved in the consideration as
>>> well. (There is still an option open for discussing of using 
>>> independent SB
>>> DB connection for pinctrl with the side-effect of doubling the number 
>>> of SB
>>> DB connections.)
> 
> I'll clarify: the pinctrl *thread* currently does not use OVSDB 
> structures directly. Instead, it makes use of internal types. The main 
> ovn-controller thread in pinctrl.c then translates this internal data 
> into SB data and syncs with the database (sometimes the operation is 
> reversed: we translate SB data into an internal type and sync with our 
> internal cache). So in a way, pinctrl is following the model proposed in 
> this document: use internal data structures for processing, and save 
> translation/synchronization for a different layer. If it were strictly 
> following the layers I outlined, then translation and synchronization 
> would be defined in more discrete layers. But for the purposes of code 
> cleanup, I'm of the opinion that pinctrl is not in as immediate need of 
> help [1].
> 
>>>
>>>> * The top three layers (Data translation (read), data locality, and
>>>> incremental engine) can be grouped together as a larger data 
>>>> acquisition
>>>> layer. The layers below that can be grouped together as a data
>>>> transformation layer. This was difficult to represent in an ASCII
>>>> diagram. The reason for making this distinction is that it provides a
>>>> clean separation point that could be considered when rethinking
>>>> threading/program control of ovn-controller.
>>>>
>>>> Possible Transition Plan
>>>> If all of the above were submitted as a single patch series it would be
>>>> nearly impossible to review, and it likely would introduce unforeseen
>>>> bugs. So let's think about how best to go from current 
>>>> ovn-controller to
>>>> a more structured ovn-controller.
>>>>
>>>> Phase 1:  Add scaffolding for the layers alongside the existing
>>>> ovn-controller code.
>>>> 1. Add stub function calls to ovn-controller's main loop that each
>>>> correspond to one of the data processing layers.
>>>> 2. Move the current ovn-controller loop logic into one of these stubs.
>>>> If using the layering system suggested earlier in this document, the
>>>> "other processing" layer would be a good place to put it.
>>>>
>>>> This reformulates ovn-controller to call into the individual layers.
>>>>   From here, we can fill out the stubs with the appropriate logic, 
>>>> and we
>>>> can remove "old" code as necessary.
>>>>
>>>> Phase 2: Isolate IDL data type usage
>>>> 1. Determine internal data types to be used.
>>>> 2. Write data translation layers so that IDL types can be translated
>>>> into internal types and vice-versa. Call into these functions from the
>>>> appropriate stub functions created in phase 1.
>>>> 3. Add unit tests that ensure data translation works as expected 
>>>> between
>>>> the types.
>>>> 4. Replace the use of IDL data types in the code with the internal 
>>>> types.
>>>>
>>>> At this point, you will have the data translation layers written. Using
>>>> internal data types will make it easier to create unit tests during
>>>> later phases.
>>>>
>>>> Phase 3: Incremental engine redefinition
>>>> 1. Determine the data types for which incremental processing is 
>>>> desired.
>>>> 2. Pick one of those data types and create the following:
>>>>      a. A data cache for the type.
>>>>      b. A method that takes the current data as input and compares 
>>>> it to
>>>> the cache, outputting the new, deleted, and updated items.
>>>> 3. Write unit tests
>>>> 4. Write the layers below the incremental engine to use this "diff" 
>>>> data.
>>>> 5. Ensure the stub from phase 1 that calls out to the incremental 
>>>> engine
>>>> is filled in for this data type.
>>>> 6. Delete any old incremental engine code that might conflict with this
>>>> new data handling.
>>>> 7. Repeat steps 2 - 6 for each data type that the incremental engine
>>>> will handle.
>>>>
>>>> Phase 3 is by far the hardest phase, since it likely will not be a very
>>>> clean process. By dividing the work by data type, it at least makes it
>>>> slightly easier to try to break the work up into more manageable 
>>>> chunks.
>>>>
>>>> Phase 4: Add processing between data translation and the incremental
>>>> engine. If using the proposed layering scheme above, this would be the
>>>> data filter layer.
>>>> 1. Determine the data types that need to be filtered for the local
>>>> environment.
>>>> 2. Pick one of those data types and create the following:
>>>>      a. A function that will take in the current data set and will 
>>>> output
>>>> a subset that includes only data that is pertinent to the local
>>> hypervisor.
>>>> 3. Call the functions from step 2 in the appropriate stub from
>>>> ovn-controller.
>>>> 4. Write unit tests.
>>>> 5. Repeat steps 2 - 4 for each data type from step 1.
>>>>
>>>> Phase 4 is a bit simpler, assuming that data filtering is the only
>>>> processing that occurs between the data translation layer and
>>>> incremental engine.
>>>>
>>>> At this point we essentially have the new code written. There may be
>>>> some additional work that is needed below the incremental engine layer,
>>>> but a lot of the incremental engine rework in phase 3 is going to touch
>>>> those lower layers as well. Whatever is left of the original
>>>> ovn-controller logic should be examined and determined if it should be
>>>> moved into a different layer, removed altogether, or left where it is.
>>>> For the code that gets left where it is, we can examine what is left 
>>>> and
>>>> determine if it can or should be cleaned up/refactored.
>>>>
>>>
>>> Thanks for thinking through and writing up so many details for the
>>> transition plan.
>>> However, I have to admit that it is still quite vague to me, probably
>>> because I didn't think through all the details just by reading this
>>> document. Some POC is definitely needed, but even the POC sounds like
>>> a very big effort for the proposed changes.

I agree, a PoC that touches all the layers described above is probably a 
big effort.  I wonder if changing the order of the phases a bit and 
focusing only on some of them for the time being would make a PoC more 
feasible, while also providing enough information about the implications 
of the changes proposed here.

It seems to me that most of the concerns are related to the Data 
Translation layer adding overhead.

What if we move "phase 2" up and turn it into "phase 0" with the 
following changes:

Phase 0: "Isolate IDL data type usage (initial Data Translation layer)":
1. For the time being, create (generate code?) 1:1 mappings from IDL 
tables (both Southbound and OVS) to internal types.
2. Add APIs for internal types (generate code?) to provide information 
about what changed in a specific record.  Internally this would use the 
IDL change tracking mechanism where possible, enhanced with custom 
code, e.g., for column-level change tracking.
3. Replace the use of IDL data types in the code with the internal types.

At this point we should be able to evaluate:
a. the impact on performance due to the Data Translation layer.
b. the overhead due to the Data Translation layer (e.g., additional 
memory usage).
c. usefulness of the layer wrt. enabling other optimizations (the 
dp_groups example above).

This phase 0 could be further limited in scope by focusing only on some 
of the IDL tables; maybe Logical_DP_Group, Datapath_Binding, Port_Group 
and Address_Set are a good enough starting point.
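
To illustrate what steps 1 and 2 could produce for, e.g., Address_Set, 
here's a minimal hand-written sketch of the kind of code that would be 
generated (the row layout and all names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the IDL-generated row struct. */
struct idl_address_set_row {
    char name[64];
    uint32_t n_addresses;
};

/* Internal 1:1 mirror of the row, plus change information. */
struct int_address_set {
    char name[64];
    uint32_t n_addresses;
    bool changed;               /* set by the translation layer */
};

/* Translate one row into its cached internal mirror, recording
 * whether anything actually changed since the last run. */
static void
translate_address_set(const struct idl_address_set_row *row,
                      struct int_address_set *cached)
{
    bool changed = strcmp(cached->name, row->name)
                   || cached->n_addresses != row->n_addresses;
    memcpy(cached->name, row->name, sizeof cached->name);
    cached->n_addresses = row->n_addresses;
    cached->changed = changed;
}
```

The real thing would compare (and expose) individual columns rather 
than a single 'changed' flag, but even this much would let us measure 
the translation overhead of step a/b above.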

>>>
>>> Before moving forward, I'd like to share some thoughts which are more
>>> about re-architecting than refactoring ovn-controller, but with the same
>>> goal of simplicity and efficiency.
>>> In the past we have learnt that the logical flow translation to open 
>>> flow
>>> rules is the performance bottleneck, and most of our effort of 
>>> scalability
>>> improvement, including the I-P engine, was focusing on the logical flow
>>> processing. There are two major challenges for logical flow translation.
>>> 1. Logical flow translation is mostly string parsing, which is CPU
>>> intensive.
>>> 2. Since it is a string (specifically, the "match" field), it lacks
>>> metadata for preprocessing such as filtering out the flows that are not
>>> needed on the current chassis. That information is available only after
>>> parsing the strings.
>>>
>>> The initial design of the centralized logical flow generation (by 
>>> northd)
>>> is to do the common task in one place and each chassis only takes 
>>> care of
>>> the chassis-specific part of computation. This is a reasonable decision
>>> from the architecture point of view. However, now that we have the
>>> experience of the performance bottleneck at scale, maybe it is time to
>>> rethink
>>> this design. Since the logical flow translation is so costly, the cost
>>> saved by not having each ovn-controller do the same computation may
>>> not be worth the cost and complexity it introduced. It may be better
>>> to combine the two layers of processing, i.e., move the northd logic
>>> to ovn-controller, and
>>> ovn-controller directly talks to NB DB consuming the NB model 
>>> together with
>>> the current SB DB data (except that the logical-flow table is not 
>>> needed at
>>> all) to compute the OVS open flow rules directly, completely skipping 
>>> the
>>> logical flow to open-flow translation. The benefits of this approach 
>>> are:
>>> 1. Obviously, the most costly logical-flow to open-flow translation is
>>> completely avoided.
>>> 2. By directly consuming the NB objects (instead of strings of match
>>> conditions), ovn-controller has first-hand insight into the data and it is
>>> much easier for optimizations such as filtering out data that is 
>>> irrelevant
>>> to the current chassis.
>>> 3. Since we are moving most of the northd logic to ovn-controller,
>>> it is likely that we can reuse most of the DDlog effort taken for 
>>> northd,
>>> and extend it for ovn-controller. This would be more maintainable 
>>> than the
>>> C I-P engine in the long run, and also avoids the I-P problems discussed
>>> above.
>>> 4. If DDlog is utilized, the problem of repeating computation for the
>>> common part that the initial design of the centralized logical flow
>>> generation was trying to solve doesn't even exist - the cost of 
>>> computing
>>> the common part is so small that there is no point in saving it by
>>> implementing the two-layer architecture.
>>>
>>> Although this may sound like a bigger change, it seems to me this
>>> approach doesn't necessarily take more effort than the above refactor
>>> efforts for ovn-controller, and I think it would benefit OVN in the long
>>> run.
>>>
>>> Thoughts?
>>
>> I definitely think it's worth pursuing this direction too.  In my
>> opinion, we may not be able to completely remove the SB dependency.
>> Or at least, to start with:
>>   - ovn-northd can replicate the information required for flow
>>     generation by each ovn-controller into the SB DB
>>   - and it can stop generating logical flows.
>>
>> A while back, I started exploring a similar idea: ovn-controllers
>> would generate logical flows internally. As a first step, I removed
>> from ovn-northd the generation of logical flows that don't involve
>> logical ports (i.e., don't match on the inport or outport fields).
>> I was able to get it to work [1]. Meanwhile, Ilya added logical
>> datapath groups, which reduced the number of logical flows
>> significantly, and I stopped my POC effort 
>> after that.
>>
>> My main motivation was to reduce the number of logical flows in the SB
>> DB as that
>> was causing a lot of scale issues in our scale testing efforts.
>>
>> Removing the logical flow generation from ovn-northd would reduce the 
>> high CPU and memory consumption seen (in a scaled deployment) in
>> ovn-northd and the SB ovsdb-servers.
>>
>> I still think we can make use of the logical flow expr parsing in
>> ovn-controller, as it will make it
>> easier to generate OF flows.
>>
>> [1] - 
>> https://github.com/numansiddique/ovn/commits/lflows_in_controller_rebased
>>         https://github.com/numansiddique/ovn/tree/lflows_in_controller
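
To illustrate why a reusable match-expression parser helps here, below is a
deliberately tiny, hypothetical translator (the field table and function are
invented for this sketch; real OVN expr parsing in `lib/expr.c` is far more
general): it lowers a logical match like the ones northd emits today straight
into OpenFlow-style match fields, without any SB logical-flow string in
between.

```python
# Hypothetical lowering table: logical match terms -> OF-style match
# fields.  Only conjunctions of simple terms are handled, which is
# enough to show the idea.
OF_FIELDS = {"ip4": "eth_type=0x0800", "tcp.dst": "tp_dst"}

def lower_match(expr: str) -> str:
    """Lower a conjunction of simple terms to an OF-style match string."""
    parts = []
    for term in (t.strip() for t in expr.split("&&")):
        if "==" in term:
            field, value = (s.strip() for s in term.split("=="))
            parts.append(f"{OF_FIELDS[field]}={value}")
        else:
            parts.append(OF_FIELDS[term])
    return ",".join(parts)

print(lower_match("ip4 && tcp.dst == 80"))
# -> eth_type=0x0800,tp_dst=80
```

The point is only that the expr-parsing machinery already understands this
match language, so reusing it inside ovn-controller to emit OF flows
directly is a natural fit.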
>>
>>
>> Whether we take this approach or not, I think it's still worth putting
>> effort into refactoring the existing ovn-controller
>> code base as per the suggestions and
>> design put forward by
>> Mark here (wherever it is possible to follow the design).
>>
>> Thanks
>> Numan
>>
> 
> I think what the two of you have discussed is the eventual future of 
> OVN. In my proposal, I was hoping to refactor ovn-controller, rather 
> than rewrite or rearchitect OVN. Like Numan said, I think some of what I 
> suggested is good no matter how we go:
> 
> * Not having deep reliance on database-specific types allows for more 
> easily changing database technologies, and it makes for easier unit 
> testing.
> * Having a layered approach can help to better understand the code, more 
> safely make changes, and again makes for easier unit testing.
> * While not mentioned directly in my doc, I also think the code's 
> architecture should be better documented. By that I mean adding 
> architecture documentation aimed specifically at people wishing to 
> contribute to the OVN codebase. As Gongming Chen recently brought up on 
> the ovs-discuss list, it is very difficult to figure out how all the 
> parts work together just by reading the code.

I agree on this: a rewrite of OVN is quite a long-term effort and, while 
that can happen in parallel, some (all?) of the points covered by Mark's 
proposal might make OVN more manageable and less bug-prone until we get 
a complete new implementation.

> 
> If we don't use the specific layering approach described in my document, 
> that is perfectly fine. The proposed model was hypothetical. So if it 
> makes sense to describe a different model, and to incorporate some of 
> these ideas into it, I'd be fine with it.
> 
> [1] It arguably could be separated into separate code units or otherwise 
> be better organized.
> 
>>>
>>> Thanks,
>>> Han
>>>
>>>>
>>>> [1] To some extent this is already done, since code is separated into
>>>> lflow.c, ofctrl.c, binding.c, physical.c, etc. However, those all have
>>>> cross-dependencies that can result in unexpected interactions.
>>>> [2] This is not to imply that pinctrl is perfect. It is potentially in
>>>> need of its own refactoring. It also doesn't mean pinctrl would not be
>>>> touched by the ovn-controller refactor, since it makes use of 
>>>> structures
>>>> like local datapaths, which may be altered by the refactor.
>>>>

Thanks again for starting this effort and for the discussion!

Regards,
Dumitru
