[ovs-dev] [PATCH ovn] ovn-controller: Fix use of dangling pointers in I-P runtime_data.

Han Zhou hzhou at ovn.org
Thu Oct 31 05:15:49 UTC 2019

On Tue, Oct 29, 2019 at 8:11 AM Dumitru Ceara <dceara at redhat.com> wrote:
> The incremental processing engine might stop a run before the
> en_runtime_data node is processed. In such cases the ed_runtime_data
> fields might contain pointers to already deleted SB records. For
> example, if a port binding corresponding to a patch port is removed from
> the SB database and the incremental processing engine aborts before the
> en_runtime_data node is processed then the corresponding local_datapath
> hashtable entry in ed_runtime_data is stale and will store a pointer to
> the already freed sbrec_port_binding record.
> This will cause invalid memory accesses in various places (e.g.,
> pinctrl_run() -> prepare_ipv6_ras()).
> To fix the issue we need a way to track how each node was processed
> during an engine run. This commit transforms the 'changed' field in
> struct engine_node into a 'state' field. Possible node states are:
> - "Stale": data in the node is not up to date with the DB.
> - "Updated": data in the node is valid but was updated during
>   the last run of the engine.
> - "Valid": data in the node is valid and didn't change during
>   the last run of the engine.
> - "Aborted": during the last run, processing was aborted for
>   this node.
> The commit also simplifies the logic of calling engine_run and
> engine_need_run in order to reduce the number of external variables
> required to track the result of the last engine execution.
> Functions that need to be called from the main loop and depend on
> various data contents of the engine's nodes are now called only if
> the data is up to date.
> CC: Han Zhou <hzhou8 at ebay.com>
> Fixes: a6b7d9f4f047 ("ovn-controller: Fix flow installation latency
> caused by recompute.")
> Signed-off-by: Dumitru Ceara <dceara at redhat.com>

Thanks Dumitru for the great fix! This was my omission.
I realized that the problem should have existed even before commit
a6b7d9f4f047 (so the commit message can be updated), because there are
cases where the engine would not run at all. This mostly happens when a
previous SB update from this node is still in flight and the corresponding
ovsdb notification comes back: since the transaction's reply has not
arrived yet, the txn is NULL and engine_run() is skipped. If that
transaction itself deleted some data which happens to be used by
pinctrl_run() in the same iteration, then it would hit the dangling
pointer problem. I am not sure why I never encountered such a case, though.
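
To illustrate, here is a simplified sketch of the main-loop pattern I am
describing (names follow ovn-controller, but this is illustrative, not the
exact code):

    /* While our own SB transaction is still in flight,
     * ovsdb_idl_loop_run() returns a NULL txn, so the engine does not
     * run at all in this iteration. */
    struct ovsdb_idl_txn *ovnsb_idl_txn =
        ovsdb_idl_loop_run(&ovnsb_idl_loop);

    if (ovnsb_idl_txn) {
        engine_run(&en_flow_output, ++engine_run_id);
    }

    /* pinctrl_run() still executes and may dereference SB records that
     * the incoming notification already freed, because the engine never
     * got a chance to refresh the references it handed out. */
    pinctrl_run(..., &ed_runtime_data.local_datapaths, ...);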

In general, I think one option to fix this is to never store references
to another node's data, e.g. by always replicating the IDL data into the
engine node's own data. In practice, this is both inefficient and complex:
inefficient because of the data copies, and complex because the data would
need to be deep-copied.

The other option is the one implemented by your patch: maintain a state
for each engine node and make sure node data is accessed only when it is
valid. I think this is the more practical approach.

I also like the way you handle the case when engine_run is not invoked, by
passing run_id in the checks.

However, there are still some places where engine node data is used
without checking engine_node_data_valid(), such as ofctrl_run() using
ed_runtime_data.pending_ct_zones, and ofctrl_inject_pkt() using
&ed_addr_sets.addr_sets and &ed_port_groups.port_groups. Those data happen
to be valid all the time - they don't hold any reference pointers into
their inputs - so this may not be a real problem, but it is confusing that
we validate in some places and not in others. So here is an improvement
idea: always validate engine node data before accessing it, and improve
the validation mechanism to reflect each node's real state. We can add an
optional interface is_data_valid() to engine_node, which can override the
default engine_node_data_valid() behavior: engine_node_data_valid() calls
the node's is_data_valid() interface if it is implemented, and otherwise
falls back to the default, which validates based only on the node state
and run_id. For runtime_data, since part of it is always valid while
another part may be invalid, I think we should split pending_ct_zones out
as a separate engine node. This improvement is not urgent, so I think it
is ok to add it as a follow-up patch.
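
A rough sketch of the interface I have in mind (names and details are
illustrative, not a final design):

    /* In struct engine_node: optional method that overrides the default
     * validity check. It may be NULL. */
    bool (*is_data_valid)(struct engine_node *);

    bool
    engine_node_data_valid(struct engine_node *node, uint64_t run_id)
    {
        if (node->is_data_valid) {
            /* Node-specific validation, e.g. a node whose data never
             * holds references into its inputs can simply return true. */
            return node->is_data_valid(node);
        }
        /* Default: data is valid only if the node was processed in the
         * current run and processing was not aborted. */
        return node->run_id == run_id
               && (node->state == EN_UPDATED || node->state == EN_VALID);
    }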

Regarding the current patch, I have some other comments below:

> +                    /* We need to make sure that at least the runtime data
> +                     * (e.g., local datapaths, ct zones) is fresh before
> +                     * calling ofctrl_put and pinctrl_run to avoid using
> +                     * potentially freed references to database records.
> +                     */
> +                    if (engine_node_data_valid(&en_runtime_data,
> +                                               engine_run_id)) {
> +                        ofctrl_put(&ed_flow_output.flow_table,
> +                                   &ed_runtime_data.pending_ct_zones,
> +                                   get_nb_cfg(sbrec_sb_global_table_get(
> +                                                   ovnsb_idl_loop.idl)),

I think ofctrl_put() needs to be invoked every iteration, because it takes
care of reinstalling flows to OVS when ovs-vswitchd is restarted. See
"need_reinstall_flows" in ofctrl.c.
For now, I think it is ok to leave that call outside the
engine_node_data_valid() check, because pending_ct_zones is always valid.
As an improvement, we should move pending_ct_zones into a separate engine
node so that it always passes the check; a sketch follows below.
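
Something along these lines (a sketch only; the separate
en_pending_ct_zones node is hypothetical and argument lists are elided):

    /* Must run every iteration: ofctrl_put() reinstalls flows when
     * ovs-vswitchd restarts (see need_reinstall_flows in ofctrl.c). */
    ofctrl_put(&ed_flow_output.flow_table,
               &ed_pending_ct_zones.pending_ct_zones, /* hypothetical node */
               ...);

    /* Only the data that holds references into SB records is guarded. */
    if (engine_node_data_valid(&en_runtime_data, engine_run_id)) {
        pinctrl_run(...);
    }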

> @@ -2095,17 +2118,16 @@ main(int argc, char *argv[])
>                  }
>              }
> -            if (old_engine_run_id == engine_run_id || !engine_run_done) {
> -                if (!engine_run_done || engine_need_run(&en_flow_output)) {
> -                    VLOG_DBG("engine did not run, force recompute next time: "
> -                             "br_int %p, chassis %p", br_int, chassis);
> -                    engine_set_force_recompute(true);
> -                    poll_immediate_wake();
> -                } else {
> -                    VLOG_DBG("engine did not run, and it was not needed"
> -                             " either: br_int %p, chassis %p",
> -                             br_int, chassis);
> -                }
> +            if (engine_need_run(&en_flow_output, engine_run_id)) {
> +                VLOG_DBG("engine did not run, force recompute next time: "
> +                         "br_int %p, chassis %p", br_int, chassis);
> +                engine_set_force_recompute(true);
> +                poll_immediate_wake();
> +            } else if (engine_aborted(&en_flow_output)) {
> +                VLOG_DBG("engine was aborted, force recompute next time: "
> +                         "br_int %p, chassis %p", br_int, chassis);
> +                engine_set_force_recompute(true);
> +                poll_immediate_wake();

The logic is clearer than before. However, can we keep a DBG log that
tells us when the engine didn't run and was not needed either? It could be
an else if here, like:
                } else if (!engine_has_run(&en_flow_output, engine_run_id))

>              } else {
>                  engine_set_force_recompute(false);
>              }
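
Putting it together with the quoted else branch above, it could look
something like this (reusing the old log message; exact wording is
illustrative):

            } else if (!engine_has_run(&en_flow_output, engine_run_id)) {
                VLOG_DBG("engine did not run, and it was not needed"
                         " either: br_int %p, chassis %p",
                         br_int, chassis);
            } else {
                engine_set_force_recompute(false);
            }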

> @@ -82,6 +82,20 @@ struct engine_node_input {
>      bool (*change_handler)(struct engine_node *node);
>  };
> +enum engine_node_state {
> +    EN_STALE,     /* Data in the node is not up to date with the DB. */
> +    EN_UPDATED,   /* Data in the node is valid but was updated during the
> +                   * last run.
> +                   */
> +    EN_VALID,     /* Data in the node is valid and didn't change during
> +                   * last run.
> +                   */
> +    EN_ABORTED,   /* During the last run, processing was aborted for
> +                   * this node.
> +                   */
> +    EN_STATE_MAX,
> +};
> +

Do we really need the STALE state? I think it is the same as ABORTED - it
seems the node state would never actually be STALE, right? Even so, I am
ok with separating them for clarity; I just want to make sure you thought
about it.

>  struct engine_node {
>      /* A unique id to distinguish each iteration of the engine_run(). */
>      uint64_t run_id;
> @@ -102,8 +116,8 @@ struct engine_node {
>       * node. */
>      void *data;
> -    /* Whether the data changed in the last engine run. */
> -    bool changed;
> +    /* State of the node after the last engine run. */
> +    enum engine_node_state state;
>      /* Method to initialize data. It may be NULL. */
>      void (*init)(struct engine_node *);
> @@ -121,9 +135,9 @@ struct engine_node {
>  void engine_init(struct engine_node *);
>  /* Execute the processing recursively, which should be called in the main
> - * loop. Returns true if the execution is complete, false if it is aborted,
> - * which could happen when engine_abort_recompute is set. */
> -bool engine_run(struct engine_node *, uint64_t run_id);
> + * loop. Updates the engine node's states accordingly.
> + */
> +void engine_run(struct engine_node *, uint64_t run_id);
>  /* Clean up the data for the engine nodes recursively. It calls each
>   * cleanup() method if not NULL. It should be called before the program
> @@ -132,7 +146,7 @@ void engine_cleanup(struct engine_node *);
>  /* Check if engine needs to run, i.e. any change to be processed. */

We need to update this comment, e.g.: "Check if the engine needs to run
but didn't."

>  bool
> -engine_need_run(struct engine_node *);
> +engine_need_run(struct engine_node *, uint64_t run_id);

