[ovs-dev] DDLog after one week

Leonid Ryzhyk lryzhyk at vmware.com
Mon Sep 23 16:54:05 UTC 2019


> Unfortunately, this wasn't possible without exporting some new C  functions.

The approach we've taken so far was to export whatever C functions are needed from either OVS or OVN. Numan has isolated the relevant patches here: https://github.com/numansiddique/ovs/commits/ovs_ddlog_patches

> I'm guessing that the reason for not  using the OVSDB IDL is that it doesn't have the hooks necessary for DDLog.

I think the initial implementation actually used the IDL, but Justin felt it was an overkill.  I will leave it to him to comment in more detail.

> Sure: https://github.com/putnopvut/ovn/commit/0ea3b23d1b6a11f88a870acc4cf15b324c3ffb35
> I imagine you'll take one look and immediately realize what's wrong :)

This was caused by a bounds violation in `ovn_scan_static_dynamic_ip6`.  Here is how I found this (I will add a description of this process to the DDlog debuggin tips doc in the repo):

- Looked at the northd log to find a bunch of errors like this:
  ```
  2019-09-23T16:23:24.549Z|00011|ovn_northd|ERR|ddlog_transaction_commit(): error: failed to receive flush ack message from timely dataflow thread
  ```

I admit this is a poor diagnostics message and we have an issue to fix it: https://github.com/vmware/differential-datalog/issues/342, but what it really means is that a DDLog thread crashed.

- Replayed the recorded scenario using the CLI generated by DDlog:
  ```
  ./northd/ovn_northd_ddlog/target/release/ovn_northd_cli < tests/testsuite.dir/117/northd/replay.dat
 ```
  This generates a bunch of output ending with:

 ```
  thread 'worker thread 2' panicked at 'index out of bounds: the len is 1 but the index is 1', /rustc/eae3437dfe991621e8afdc82734f4a172d7ddf9b/src/libcore/slice/mod.rs:2681:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
  ```

- Ran the CLI again with backtrace enabled (as suggested by the error message):

  ```
  RUST_BACKTRACE=1 ./northd/ovn_northd_ddlog/target/release/ovn_northd_cli < tests/testsuite.dir/117/northd/replay.dat
  ```

This finally yields the following stack trace, which suggests array bound violation in `ovn_scan_static_dynamic_ip6`:

```
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.29  10: core::panicking::panic_bounds_check
             at src/libcore/panicking.rs:61
[SKIPPED]
  11: ovn_northd_ddlog::__ovn::ovn_scan_static_dynamic_ip6
  12: ovn_northd_ddlog::prog::__f
[SKIPPED]
```

- Finally, looking at ovn_scan_static_dynamic_ip6, there are a couple of array indexing operation there, the first one likely being the culprit: `(if (f[0] != "dynamic" || f.len() == 1) )`

> From my POV, what would have made this easiest to figure out was the
> ability to open a debugger and insert break points into the DDlog to
> determine the current values at various steps.

This is very useful feedback.  We can conceivably build such a debugger, but we need to figure out the exact debugging workflow, which is not 100% clear, given the (largely) declarative nature of the language.  There are some things you can already approximate witH printf-style debugging.  For example, as you've already figured out, you can print a message at any point inside a rule. You can even print it conditionally when some predicate over variables in the rule holds.  One thing you cannot easily do at the moment is answer provenance queries: Which rule generated a particular output record and what inputs triggered this rule?

Let's continue this discussion and see if we can come up with a meaningful spec for a DDlog debugger. I don't want to plunge into the implementation straight away, as it is likely to be a lot of work and I want to be sure we end up building somethings that's actually useful for you and others.

> I have not enabled the fast build option, nor have I enabled
> ddlog-northd-cli. I'll give the fast build option a try. Is there a
> reason it is not enabled by default?

Fast build disables some of Rust and LLVM optimization passes, causing the compiled code to run twice slower.

Leonid

________________________________
From: Mark Michelson <mmichels at redhat.com>
Sent: Monday, September 23, 2019 6:04 AM
To: Leonid Ryzhyk <lryzhyk at vmware.com>
Cc: Dumitru Ceara <dceara at redhat.com>; Numan Siddique <nusiddiq at redhat.com>; Lorenzo Bianconi <lbiancon at redhat.com>; Mihai Budiu <mbudiu at vmware.com>; Justin Pettit <jpettit at vmware.com>; Ben Pfaff <bpfaff at vmware.com>; ovs dev <dev at openvswitch.org>
Subject: Re: DDLog after one week

I'm adding the ovs-dev list since this discussion may be useful to the
general development community.

See my responses inline.

On 9/20/19 8:04 PM, Leonid Ryzhyk wrote:
> Hi Mark,
>
> Firstly, many thanks for giving DDlog a try!  I am extremely impressed
> that you managed to get this far without any help from us.  Hopefully,
> we will be able to help make life easier for you going forward.
>
> Happy to hear that you see the pros of using DDlog in OVN. Your summary
> of where it helps to improve things is exactly what we intended it for.
> Let's see if we can help with some of the problems!
>
> * The need for Rust code:
>     It's certainly true that DDlog adds yet another language (Rust) in
> the mix. One of the things on our roadmap is to expand DDlog's libraries
> with a more complete set of string manipulation routines, so that many
> of the functions that currently must be implemented in Rust can be coded
> directly in DDlog. Having said that, it's interesting that you had to
> implement your string manipulation in Rust.  In my experience with OVN,
> most string processing logic is already implemented in C, so in most
> cases all I had to do was to write a Rust wrapper around an existing C
> function.
>

Unfortunately, this wasn't possible without exporting some new C
functions. In this particular case, the string parsing is taking care of
the "addresses" field of logical switch ports. The field can look like
any of the following:

"<MAC>"
"<MAC> <IPv4 address>"
"<MAC> <IPv6 address>"
"dynamic"
"<MAC> dynamic"
"dynamic <IPv4 address>"
"dynamic <IPv6 address>"

The first five of these can be done using an exported C function
(extract_addresses(), I believe). The final two are handled internally
within ovn-northd.c and not exported. Therefore, in the ddlog version,
these two are handled in northd/ovn.rs. The sixth version is handled
with scan_static_dynamic_ip(). The final version was not handled at all,
so I added scan_static_dynamic_ip6() to attempt to cover it.


> * Failure resilience:
>     I think your summary is correct.  I don't really understand
> `ovn-northd-ddlog.c`, but Justin did mention that he plans to cleanup
> and harden it. DDlog itself should be able to handle intermittent
> failures, e.g., if we lose connection to Southbound DB and then restore
> it, DDlog can just pick up the new database state and generate the set
> of changes needed to bring it up to speed with the NB database.

Yes, my impressions were that this wasn't really a fault of DDLog as
much as the program driving it. I'm guessing that the reason for not
using the OVSDB IDL is that it doesn't have the hooks necessary for
DDLog. Rather than modify the OVSDB IDL, ovn-northd-ddlog.c essentially
copies the logic of the IDL and inserts code where necessary for DDLog
usage.

>
> * Debuggability:
>    I agree that debugging is a problem. We keep building new debugging
> tools, and it sounds like you've already tried some of them, but I can
> totally see that these may not always be enough.  I think the best
> strategy is to go case-by-case and figure out how we can make it easier
> to troubleshoot every particular type of error in the future. Can you
> point me to your code that fails?

Sure:
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fputnopvut%2Fovn%2Fcommit%2F0ea3b23d1b6a11f88a870acc4cf15b324c3ffb35&amp;data=02%7C01%7Clryzhyk%40vmware.com%7C8b71c127664a4c09fc5608d740268759%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637048406536970159&amp;sdata=HAd8s%2FgJgtownxR592TdB5mJCtWmRCAvjjdgBUtQR9E%3D&amp;reserved=0

I imagine you'll take one look and immediately realize what's wrong :)

 From my POV, what would have made this easiest to figure out was the
ability to open a debugger and insert break points into the DDlog to
determine the current values at various steps. I think the closest I'd
be able to do right now would be to step through the generated Rust code
instead. It's possible that I could figure out what I'm doing wrong that
way, but it would be so much better if it were possible to debug at the
DDLog level.

>
> * Compile times
>    Yep, those are driving me nuts, and sadly I don't think we will have
> a radical solution in the near future. Have you tried
> `--enable-ddlog-fast-build`? In my experience, this speeds up
> compilation by a factor of 2, at the cost of the compiled code being
> twice slower, which is usually not a problem during development. Also,
> make sure you do _not_ use `--enable-ddlog-northd-cli` unless you need
> to do replay debugging with CLI. I think, with those two improvements I
> get compilation times in the order of 5 minutes on my laptop.

I have not enabled the fast build option, nor have I enabled
ddlog-northd-cli. I'll give the fast build option a try. Is there a
reason it is not enabled by default?

>
> Leonid
> ------------------------------------------------------------------------
> *From:* Mark Michelson <mmichels at redhat.com>
> *Sent:* Friday, September 20, 2019 2:31 PM
> *To:* Leonid Ryzhyk <lryzhyk at vmware.com>; Numan Siddique
> <nusiddiq at redhat.com>; Lorenzo Bianconi <lbiancon at redhat.com>; Dumitru
> Ceara <dceara at redhat.com>
> *Subject:* DDLog after one week
> Hi Leonid,
>
> After learning, reading, and using DDLog for a week, I figure I should
> give my thoughts.
>
> On the positive side, I think that once we get used to writing DDLog, it
> will be easier to maintain and add to than the current incremental
> engine we have in the C code. The language requires different thinking
> than we're used to, but that's not a bad thing.
>
> Regarding language features, I like the memory safety and strong type
> system. It reminds me of using a functional language (e.g. Haskell) in
> that regard. I also like the fact that its rules engine makes for easy
> incremental processing of the code. I only have to think about how a
> relation translates to another relation. I don't need to think about the
> deltas to the existing items or anything like that.
>
> On the not-exactly-positive-but-also-not-exactly-negative side, I was
> surprised about the amount of Rust I might have to know and use when
> using DDLog. For instance, the issue I chose to take on required more
> Rust coding than DDLog coding. That's because my change required
> changing some string parsing code, and that's offloaded to Rust.
>
> I have three main negatives to share:
>
> 1) The current implementation is not resilient in the face of failures.
> For example, if ovsdb transactions don't go properly, there is a chance
> that the code can enter a busy loop and no longer function. I get the
> feeling that ovn-northd-ddlog.c was written with the mindset of getting
> something working quickly. It doesn't use the ovsdb IDL, and as a
> result, much of the state machine is re-copied locally. I don't think
> failure scenarios are properly handled.
>
> 2) Debuggability. While I was making my changes, I found that what I had
> written had resulted in a database commit failing. However, even when
> replaying DDLog, dumping a relevant DDLog relation, and adding printfs,
> I still have no idea why my code is wrong. I'm kind of just stuck at
> this point. Even more of a concern is debugging live deployments.
>
> 3) Compile times. While working on my improvement, I found that making
> changes and recompiling resulted in 10+ minute builds every time. It
> didn't matter how small or large my changes were. When trying to debug
> why a test is failing, and incrementally adding more changes to the
> code, this is excruciating.
>
> I've included Numan, Dumitru, and Lorenzo on this message so they can
> chime in with their thoughts as well. Note that for them it's already
> night time on Friday, so they may not add their thoughts until after the
> weekend :)
>
> Thanks,
> Mark Michelson



More information about the dev mailing list