[ovs-dev] [PATCH ovn] ovn-controller: Fix use of dangling pointers in I-P runtime_data.

Dumitru Ceara dceara at redhat.com
Mon Nov 4 12:56:48 UTC 2019


On Thu, Oct 31, 2019 at 9:21 AM Dumitru Ceara <dceara at redhat.com> wrote:
>
> On Thu, Oct 31, 2019 at 6:16 AM Han Zhou <hzhou at ovn.org> wrote:
> >
> >
> >
> > On Tue, Oct 29, 2019 at 8:11 AM Dumitru Ceara <dceara at redhat.com> wrote:
> > >
> > > The incremental processing engine might stop a run before the
> > > en_runtime_data node is processed. In such cases, the ed_runtime_data
> > > fields might contain pointers to already deleted SB records. For
> > > example, if a port binding corresponding to a patch port is removed
> > > from the SB database and the incremental processing engine aborts
> > > before the en_runtime_data node is processed, then the corresponding
> > > local_datapath hashtable entry in ed_runtime_data is stale and will
> > > store a pointer to the already freed sbrec_port_binding record.
> > >
> > > This will cause invalid memory accesses in various places (e.g.,
> > > pinctrl_run() -> prepare_ipv6_ras()).
> > >
> > > To fix the issue we need a way to track how each node was processed
> > > during an engine run. This commit transforms the 'changed' field in
> > > struct engine_node into a 'state' field. Possible node states are:
> > > - "Stale": data in the node is not up to date with the DB.
> > > - "Updated": data in the node is valid but was updated during
> > >   the last run of the engine.
> > > - "Valid": data in the node is valid and didn't change during
> > >   the last run of the engine.
> > > - "Aborted": during the last run, processing was aborted for
> > >   this node.
> > >
> > > The commit also simplifies the logic of calling engine_run and
> > > engine_need_run in order to reduce the number of external variables
> > > required to track the result of the last engine execution.
> > >
> > > Functions that need to be called from the main loop and depend on
> > > various data contents of the engine's nodes are now called only if
> > > the data is up to date.
> > >
> > > CC: Han Zhou <hzhou8 at ebay.com>
> > > Fixes: a6b7d9f4f047 ("ovn-controller: Fix flow installation latency
> > > caused by recompute.")
> > > Signed-off-by: Dumitru Ceara <dceara at redhat.com>
> >
> >
> > Thanks Dumitru for the great fix! This was my omission.
> > I realized that the problem must have existed even before commit a6b7d9f4f047 (so the commit message can be updated), because there are cases where the engine does not run at all, typically when a previous SB update from this node is still in flight and we are waiting for the corresponding OVSDB notification to come back. At that moment, since the transaction's reply hasn't arrived yet, the txn is NULL and engine_run() is skipped. If the transaction itself deleted some data that happens to be used by pinctrl_run() in the same iteration, then it would hit the dangling pointer problem. I am not sure why I never encountered such a case, though.
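> >
> > A minimal sketch of the problematic pattern in the main loop
> > (simplified, not the exact ovn-controller code; pinctrl_run()'s
> > arguments are elided):
> >
> >     struct ovsdb_idl_txn *ovnsb_idl_txn =
> >         ovsdb_idl_loop_run(&ovnsb_idl_loop);
> >
> >     if (ovnsb_idl_txn) {
> >         /* Skipped while we wait for the reply to our own previous
> >          * transaction, even if that transaction deleted SB records
> >          * that are still referenced from engine node data. */
> >         engine_run(&en_flow_output, engine_run_id);
> >     }
> >
> >     /* Runs every iteration and may dereference engine node data
> >      * that still points at the freed SB records. */
> >     pinctrl_run(/* ... */);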
> >
>
> Hi Han,
>
> Right, I selected the wrong commit. I'll update the commit message. I
> think it's harder to see the issue because you need to be waiting for
> a transaction reply and at the same time deleting port bindings from
> the SB database.
>
> > In general, I think one option to fix this is to never store references to another node's data, e.g., always replicate the IDL data into the engine node's own data. In practice, this is both inefficient and complex: inefficient because of the data copies, and complex because the data would need to be deep copied.
> >
> > The other option is the one implemented by your patch: maintain the state of each engine node and make sure node data is accessed only when it is valid. I think this is a more practical approach.
> >
>
> I think this will be fixed indirectly when/if pinctrl moves to a
> different process. Until then this seems like the easiest solution.
>
> > I also like the way you handle the case when engine_run is not invoked, by passing run_id in the checks.
> >
> > However, there are still some places where engine node data is used without checking engine_node_data_valid(), such as ofctrl_run() using ed_runtime_data.pending_ct_zones, and ofctrl_inject_pkt() using &ed_addr_sets.addr_sets and &ed_port_groups.port_groups. Those data are actually valid all the time - they don't hold any reference pointers into their inputs - so it may not be a real problem. However, it can be confusing why we validate in some places but not in others.
> >
> > So here is an improvement idea: we should always validate engine node data before accessing it, and the validation mechanism can be improved to reflect each node's real state. We can add an optional interface is_data_valid() to engine_node, which can override the default engine_node_data_valid() behavior. engine_node_data_valid() would call the node's is_data_valid() interface if it is implemented, and otherwise fall back to the default behavior, which validates based only on the node state and run_id. For runtime_data, since part of it is always valid while other parts may be invalid, I think we should split pending_ct_zones out as a separate engine node. This improvement is not urgent, so I think it is ok to add it as a follow-up patch.
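> >
> > Something like this (just a rough sketch to illustrate the idea; the
> > body of the default check below is my assumption of what it would
> > look like, based on the state/run_id validation in this patch):
> >
> >     struct engine_node {
> >         ...
> >         /* Optional. If set, overrides the default data validity
> >          * check in engine_node_data_valid(). */
> >         bool (*is_data_valid)(struct engine_node *);
> >     };
> >
> >     bool
> >     engine_node_data_valid(struct engine_node *node, uint64_t run_id)
> >     {
> >         if (node->is_data_valid) {
> >             return node->is_data_valid(node);
> >         }
> >         /* Default: the node must have been processed in the current
> >          * run and must not have been aborted. */
> >         return node->run_id == run_id && node->state != EN_ABORTED;
> >     }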
>
> True, I missed those cases. It does sound better to have an optional
> is_data_valid() node callback. I'll add it in v2. If it's not too much
> work I'll also move pending_ct_zones to a separate engine node;
> otherwise we can add it as a follow-up.
>
> >
> > Regarding the current patch, I have some other comments below:
> >
> > >
> > >
> > > +                    /* We need to make sure that at least the runtime data
> > > +                     * (e.g., local datapaths, ct zones) are fresh before
> > > +                     * calling ofctrl_put and pinctrl_run to avoid using
> > > +                     * potentially freed references to database records.
> > > +                     */
> > > +                    if (engine_node_data_valid(&en_runtime_data,
> > > +                                               engine_run_id)) {
> > > +                        ofctrl_put(&ed_flow_output.flow_table,
> > > +                                   &ed_runtime_data.pending_ct_zones,
> > > +                                   sbrec_meter_table_get(ovnsb_idl_loop.idl),
> > > +                                   get_nb_cfg(sbrec_sb_global_table_get(
> > > +                                                   ovnsb_idl_loop.idl)),
> > > +                                   engine_node_data_changed(&en_flow_output,
> > > +                                                            engine_run_id));
> >
> >
> > I think ofctrl_put() needs to be invoked every iteration, because it takes care of reinstalling flows to OVS when ovs-vswitchd is restarted. See "need_reinstall_flows" in ofctrl.c.
> > For now, I think it is ok to leave this call outside the if (engine_node_data_valid()) check, because pending_ct_zones is always valid. As an improvement, we should move pending_ct_zones to a separate engine node so that it always passes the check.
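> >
> > I.e., roughly the following, with the call hoisted out of the
> > validity check (a sketch reusing the arguments from the hunk above,
> > not the final code):
> >
> >     /* ofctrl_put() must run every iteration so that flows are
> >      * reinstalled after an ovs-vswitchd restart; pending_ct_zones
> >      * holds no references into other nodes' data, so passing it
> >      * unconditionally is safe. */
> >     ofctrl_put(&ed_flow_output.flow_table,
> >                &ed_runtime_data.pending_ct_zones,
> >                sbrec_meter_table_get(ovnsb_idl_loop.idl),
> >                get_nb_cfg(sbrec_sb_global_table_get(
> >                                ovnsb_idl_loop.idl)),
> >                engine_node_data_changed(&en_flow_output,
> >                                         engine_run_id));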
>
> Ack.
>
> >
> > >
> > >
> > > @@ -2095,17 +2118,16 @@ main(int argc, char *argv[])
> > >                  }
> > >
> > >              }
> > > -            if (old_engine_run_id == engine_run_id || !engine_run_done) {
> > > -                if (!engine_run_done || engine_need_run(&en_flow_output)) {
> > > -                    VLOG_DBG("engine did not run, force recompute next time: "
> > > -                             "br_int %p, chassis %p", br_int, chassis);
> > > -                    engine_set_force_recompute(true);
> > > -                    poll_immediate_wake();
> > > -                } else {
> > > -                    VLOG_DBG("engine did not run, and it was not needed"
> > > -                             " either: br_int %p, chassis %p",
> > > -                             br_int, chassis);
> > > -                }
> > > +            if (engine_need_run(&en_flow_output, engine_run_id)) {
> > > +                VLOG_DBG("engine did not run, force recompute next time: "
> > > +                            "br_int %p, chassis %p", br_int, chassis);
> > > +                engine_set_force_recompute(true);
> > > +                poll_immediate_wake();
> > > +            } else if (engine_aborted(&en_flow_output)) {
> > > +                VLOG_DBG("engine was aborted, force recompute next time: "
> > > +                         "br_int %p, chassis %p", br_int, chassis);
> > > +                engine_set_force_recompute(true);
> > > +                poll_immediate_wake();
> >
> >
> > The logic is clearer than before. However, can we keep the DBG log that tells us when the engine didn't run and wasn't needed either? It could be an 'else if' here, like:
>
> Sure, I'll add it.
>
> >                 } else if (!engine_has_run(&en_flow_output, engine_run_id)) {
> >                      VLOG_DBG(...)
> >
> > >              } else {
> > >                  engine_set_force_recompute(false);
> > >              }
> > >
> >
> >
> > >
> > > @@ -82,6 +82,20 @@ struct engine_node_input {
> > >      bool (*change_handler)(struct engine_node *node);
> > >  };
> > >
> > > +enum engine_node_state {
> > > +    EN_STALE,     /* Data in the node is not up to date with the DB. */
> > > +    EN_UPDATED,   /* Data in the node is valid but was updated during the
> > > +                   * last run.
> > > +                   */
> > > +    EN_VALID,     /* Data in the node is valid and didn't change during the
> > > +                   * last run.
> > > +                   */
> > > +    EN_ABORTED,   /* During the last run, processing was aborted for
> > > +                   * this node.
> > > +                   */
> > > +    EN_STATE_MAX,
> > > +};
> > > +
> >
> >
> > Do we really need the STALE state? I think it is the same as ABORTED. It seems the node state would never end up as STALE, right? Even so, I am ok with keeping them separate for clarity. I just want to make sure you thought about it.
>
> We don't actually end up with STALE as a final state because in
> engine_run() we always move from STALE to VALID if the data didn't
> change and we didn't abort execution. It did help me follow the logic
> of the code, though, so I would keep it.
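>
> To illustrate, the per-node transitions in engine_run() look roughly
> like this (simplified pseudo-code of how I think about it, not the
> exact patch code):
>
>     node->state = EN_STALE;          /* At the start of each run. */
>
>     if (/* processing was aborted */) {
>         node->state = EN_ABORTED;    /* Force recompute next time. */
>     } else if (/* node data changed */) {
>         node->state = EN_UPDATED;    /* Processed, data changed. */
>     } else {
>         node->state = EN_VALID;      /* Processed, nothing changed. */
>     }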
>
> >
> > >
> > >  struct engine_node {
> > >      /* A unique id to distinguish each iteration of the engine_run(). */
> > >      uint64_t run_id;
> > > @@ -102,8 +116,8 @@ struct engine_node {
> > >       * node. */
> > >      void *data;
> > >
> > > -    /* Whether the data changed in the last engine run. */
> > > -    bool changed;
> > > +    /* State of the node after the last engine run. */
> > > +    enum engine_node_state state;
> > >
> > >      /* Method to initialize data. It may be NULL. */
> > >      void (*init)(struct engine_node *);
> > > @@ -121,9 +135,9 @@ struct engine_node {
> > >  void engine_init(struct engine_node *);
> > >
> > >  /* Execute the processing recursively, which should be called in the main
> > > - * loop. Returns true if the execution is compelte, false if it is aborted,
> > > - * which could happen when engine_abort_recompute is set. */
> > > -bool engine_run(struct engine_node *, uint64_t run_id);
> > > + * loop. Updates the engine node's states accordingly.
> > > + */
> > > +void engine_run(struct engine_node *, uint64_t run_id);
> > >
> > >  /* Clean up the data for the engine nodes recursively. It calls each node's
> > >   * cleanup() method if not NULL. It should be called before the program
> > > @@ -132,7 +146,7 @@ void engine_cleanup(struct engine_node *);
> > >
> > >  /* Check if engine needs to run, i.e. any change to be processed. */
> >
> >
> > We need to update the comment: Check if engine needs to run but didn't.
>
> Ack.
>
> >
> > >  bool
> > > -engine_need_run(struct engine_node *);
> > > +engine_need_run(struct engine_node *, uint64_t run_id);
> > >
> >
> > Thanks,
> > Han
>
> I'll send a follow-up v2 soon.

Hi Han,

v2 is available here: https://patchwork.ozlabs.org/patch/1188930/

In the end I decided not to add a custom "is_data_valid" handler per
node and instead to separate out, in ovn-controller, the data that can
be accessed at any time (regardless of whether engine_run succeeded).
I had initially tried the custom handler approach but found the
resulting code a bit harder to follow.

What do you think?

Thanks,
Dumitru

>
> Thanks,
> Dumitru


