[ovs-dev] [PATCHv2] connmgr: Check nullptr inside ofmonitor_report()

Sat Feb 20 02:29:06 UTC 2021

Hi, Ilya

Ilya Maximets <i.maximets at ovn.org> 于2021年2月19日周五 下午7:19写道：
>
> On 2/19/21 3:12 AM, 贺鹏 wrote:
> > Hi,
> >
> > Looks like this bug is caused by violating the fact that if a rule is
> > referenced, the related ofproto should not be destroyed.
> >
> > If so, I have a patch that also fixes the problem, not sure if this helps.
> >
> > http://patchwork.ozlabs.org/project/openvswitch/patch/20200717015041.82746-1-hepeng.0320@bytedance.com/
>
> There is at least one more problem that is not strictly related but
> in more or less the same part of the code:
>   https://mail.openvswitch.org/pipermail/ovs-dev/2021-February/380582.html

So maybe before add it into *ovsrcu_postpone*, we should add refcount
of ofproto also?

>
> >
> > William Tu <u9012063 at gmail.com> 于2021年2月19日周五 上午8:10写道：
> >>
> >> On Wed, Feb 17, 2021 at 1:09 PM Yifeng Sun <pkusunyifeng at gmail.com> wrote:
> >>>
> >>> ovs-vswitchd could crash under these circumstances:
> >>> 1. When one bridge is being destroyed, ofproto_destroy() is called and
> >>> connmgr pointer of its ofproto struct is nullified. This ofproto struct is
> >>> deallocated through 'ovsrcu_postpone(ofproto_destroy_defer__, p);'.
> >>> 2. Before RCU enters quiesce state to actually free this ofproto struct,
> >>> revalidator thread calls udpif_revalidator(), which could handle
> >>> a learn flow and calls ofproto_flow_mod_learn(), it later calls
> >>> ofmonitor_report() and ofproto struct's connmgr pointer is accessed.
> >>>
> >>> The crash stack trace is shown below:
> >>>
> >>> 0  ofmonitor_report (mgr=0x0, rule=rule at entry=0x7fa4ac067c30, event=event at entry=NXFME_ADDED,
> >>>     reason=reason at entry=OFPRR_IDLE_TIMEOUT, abbrev_ofconn=0x0, abbrev_xid=0, old_actions=old_actions at entry=0x0)
> >>>     at ofproto/connmgr.c:2160
> >>> 1  0x00007fa4d6803495 in add_flow_finish (ofproto=0x55d9075d4ab0, ofm=<optimized out>, req=req at entry=0x0)
> >>>     at ofproto/ofproto.c:5221
> >>> 2  0x00007fa4d68036af in modify_flows_finish (req=0x0, ofm=0x7fa4980753f0, ofproto=0x55d9075d4ab0)
> >>>     at ofproto/ofproto.c:5823
> >>> 3  ofproto_flow_mod_finish (ofproto=0x55d9075d4ab0, ofm=ofm at entry=0x7fa4980753f0, req=req at entry=0x0)
> >>>     at ofproto/ofproto.c:8088
> >>> 4  0x00007fa4d680372d in ofproto_flow_mod_learn_finish (ofm=ofm at entry=0x7fa4980753f0,
> >>>     orig_ofproto=orig_ofproto at entry=0x0) at ofproto/ofproto.c:5439
> >>> 5  0x00007fa4d68072f9 in ofproto_flow_mod_learn (ofm=0x7fa4980753f0, keep_ref=keep_ref at entry=true,
> >>>     limit=<optimized out>, below_limitp=below_limitp at entry=0x0) at ofproto/ofproto.c:5499
> >>> 6  0x00007fa4d6835d33 in xlate_push_stats_entry (entry=0x7fa498012448, stats=stats at entry=0x7fa4d2701a10,
> >>>     offloaded=offloaded at entry=false) at ofproto/ofproto-dpif-xlate-cache.c:127
> >>> 7  0x00007fa4d6835e3a in xlate_push_stats (xcache=<optimized out>, stats=stats at entry=0x7fa4d2701a10,
> >>>     offloaded=offloaded at entry=false) at ofproto/ofproto-dpif-xlate-cache.c:181
> >>> 8  0x00007fa4d6822046 in revalidate_ukey (udpif=udpif at entry=0x55d90760b240, ukey=ukey at entry=0x7fa4b0191660,
> >>>     stats=stats at entry=0x7fa4d2705118, odp_actions=odp_actions at entry=0x7fa4d2701b50,
> >>>     reval_seq=reval_seq at entry=5655486242, recircs=recircs at entry=0x7fa4d2701b40, offloaded=false)
> >>>     at ofproto/ofproto-dpif-upcall.c:2294
> >>> 9  0x00007fa4d6825aee in revalidate (revalidator=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:2683
> >>> 10 0x00007fa4d6825cf3 in udpif_revalidator (arg=0x55d90769dd00) at ofproto/ofproto-dpif-upcall.c:936
> >>> 11 0x00007fa4d6259c9f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:423
> >>> 12 0x00007fa4d582cea5 in start_thread () from /usr/lib64/libpthread.so.0
> >>> 13 0x00007fa4d504b96d in clone () from /usr/lib64/libc.so.6
> >>>
> >>> At the time of crash, the involved ofproto was already deallocated:
> >>>
> >>> (gdb) print *ofproto
> >>> $1 = ..., name = 0x55d907602820 "nsx-managed", ..., ports = {...,
> >>>     one = 0x0, mask = 63, n = 0}, ..., connmgr = 0x0, ...
> >>>
> >>> This patch fixes it.
> >>>
> >>> VMware-BZ: #2700626
> >>> Signed-off-by: Yifeng Sun <pkusunyifeng at gmail.com>
> >>> ---
> >>> v1->v2: Add check for ofmonitor_flush, thanks William.
> >>>
> >>
> >> LGTM, thanks.
> >> Acked-by: William Tu < u9012063 at gmail.com>
> >>
> >> CC Ilya and Ben to see if any comments.
> >>
> >> I feel this kind of RCU issue is hard to find out, and
> >> existing tools such as addressSanitizer are usually
> >> not helpful.
> >>
> >> William
> >>
> >>>  ofproto/connmgr.c | 6 +++++-
> >>>  1 file changed, 5 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/ofproto/connmgr.c b/ofproto/connmgr.c
> >>> index 9c5c633b4171..fa8f6cd0e83a 100644
> >>> --- a/ofproto/connmgr.c
> >>> +++ b/ofproto/connmgr.c
> >>> @@ -2140,7 +2140,7 @@ ofmonitor_report(struct connmgr *mgr, struct rule *rule,
> >>>                   const struct rule_actions *old_actions)
> >>>      OVS_REQUIRES(ofproto_mutex)
> >>>  {
> >>> -    if (rule_is_hidden(rule)) {
> >>> +    if (!mgr || rule_is_hidden(rule)) {
> >>>          return;
> >>>      }
> >>>
> >>> @@ -2244,6 +2244,10 @@ ofmonitor_flush(struct connmgr *mgr)
> >>>  {
> >>>      struct ofconn *ofconn;
> >>>
> >>> +    if (!mgr) {
> >>> +        return;
> >>> +    }
> >>> +
> >>>      LIST_FOR_EACH (ofconn, connmgr_node, &mgr->conns) {
> >>>          struct rconn_packet_counter *counter = ofconn->monitor_counter;
> >>>
> >>> --
> >>> 2.7.4
> >>>
> >>> _______________________________________________
> >>> dev mailing list
> >>> dev at openvswitch.org
> >>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >> _______________________________________________
> >> dev mailing list
> >> dev at openvswitch.org
> >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
> >
> >
>

-- 
hepeng