[ovs-dev] [PATCH ovn v2 0/9] Incremental processing for flow installation.

Han Zhou hzhou at ovn.org
Tue Sep 15 08:19:47 UTC 2020


On Mon, Sep 14, 2020 at 5:29 PM Han Zhou <hzhou at ovn.org> wrote:
>
>
>
> On Mon, Sep 14, 2020 at 9:54 AM Ilya Maximets <i.maximets at ovn.org> wrote:
> >
> > On 9/14/20 6:45 PM, Dumitru Ceara wrote:
> > > On 9/14/20 6:36 PM, Han Zhou wrote:
> > >>
> > >>
> > >> On Mon, Sep 14, 2020 at 6:04 AM Dumitru Ceara <dceara at redhat.com
> > >> <mailto:dceara at redhat.com>> wrote:
> > >>>
> > >>> On 9/11/20 9:16 PM, Dumitru Ceara wrote:
> > >>>> On 9/11/20 8:39 PM, Han Zhou wrote:
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Sep 11, 2020 at 6:17 AM Dumitru Ceara <dceara at redhat.com
> > >> <mailto:dceara at redhat.com>
> > >>>>> <mailto:dceara at redhat.com <mailto:dceara at redhat.com>>> wrote:
> > >>>>>>
> > >>>>>> On 9/9/20 11:13 PM, Han Zhou wrote:
> > >>>>>>> On Wed, Sep 9, 2020 at 6:39 AM Mark Michelson
> > >> <mmichels at redhat.com <mailto:mmichels at redhat.com>
> > >>>>> <mailto:mmichels at redhat.com <mailto:mmichels at redhat.com>>> wrote:
> > >>>>>>>>
> > >>>>>>>> Thanks for addressing my concerns on patch 9.
> > >>>>>>>>
> > >>>>>>>> Acked-by: Mark Michelson <mmichels at redhat.com
> > >> <mailto:mmichels at redhat.com>
> > >>>>> <mailto:mmichels at redhat.com <mailto:mmichels at redhat.com>>>
> > >>>>>>>>
> > >>>>>>>> Warning: I'm also ACKing Numan's patch series that also adds
some
> > >>>>>>>> lflow-level caching (but he is caching expressions). So whoever
> > >> loses
> > >>>>>>>> this merge race may have some work to do in order to apply
their
> > >> patch
> > >>>>>>>> cleanly.
> > >>>>>>>>
> > >>>>>>> Thanks Mark for the review, and the warning :).
> > >>>>>>> I applied the series to master, and also the first 4 patches to
> > >>>>>>> branch-20.06 and patch 1,2,4 to branch-20.03, to fix the
conjunction
> > >>>>> flow
> > >>>>>>> bug in those branches.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Han
> > >>>>>>>
> > >>>>>>
> > >>>>>> Hi Han,
> > >>>>>>
> > >>>>>> This seems to cause an assertion failure in ovn-controller
> > >> compiled with
> > >>>>>> today's OVS/OVN master branches. I'm not completely sure about
the
> > >> steps
> > >>>>>> to reproduce because I was testing some other ovn-k8s related
> > >> things but
> > >>>>>> I did get a core dump and the NB/DB/conf DBs.
> > >>>>>>
> > >>>>>> I opened a RedHat BZ [0] (more for my tracking than anything
else)
> > >> and I
> > >>>>>> shared there the OVS/OVN revisions and how to build the exact
same
> > >> image
> > >>>>>> that I was running when I hit the crash.
> > >>>>>>
> > >>>>>> I'll try to investigate what's happening but you're probably more
> > >>>>>> familiar with the code so any hints would be great.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Dumitru
> > >>>>>>
> > >>>>>> [0] https://bugzilla.redhat.com/show_bug.cgi?id=1878139
> > >>>>>>
> > >>>>>
> > >>>>> Thanks Dumitru for reporting the issue. I reviewed the code and
the
> > >> test
> > >>>>> cases again, and didn't figure out any problem yet. The failed
> > >> assertion
> > >>>>> "ovs_list_is_empty(&f->list_node) " is to assert that the
desired_flow
> > >>>>> is not accessed more than once in a single flood_remove call. This
> > >>>>> should be true because the loop at
> > >>>>>
> > >> (
https://github.com/ovn-org/ovn/blob/master/controller/ofctrl.c#L1126-L1134)
> > >>>>> ensures that all flows in to_be_removed (which is the only list
that
> > >>>>> uses the "list_node" member) is detached from the cross reference
> > >> lists,
> > >>>>> so that in the following recursive calls these flows will never be
> > >>>>> accessed again. Now that this assert fails, it could be some
orphaned
> > >>>>> pointer somewhere, leading to the "f" pointing to a flow that's
already
> > >>>>> destroyed.
> > >>>>>
> > >>>>> One thing not clear to me is the version you used for the testing.
> > >> In BZ
> > >>>>> you mentioned revision: 520189bf313054702f5f802acd7944cca3b6baaa.
> > >>>>> However, in this revision, the line number of the assertion
doesn't
> > >>>>> match the one you pasted in BZ. What you pasted was 1135:
> > >>>>
> > >>>> The revision should be good, I think the problem is with the way I
> > >>>> rebuilt the packages in the new container after I retrieved the
core
> > >> dump.
> > >>>>
> > >>>> The assert message clearly points to line 1108. This is from the
old
> > >>>> logs of the run where I saw the crash:
> > >>>>
> > >>>> 2020-09-11T11:53:18.426Z|00037|util|EMER|controller/ofctrl.c:1108:
> > >>>> assertion ovs_list_is_empty(&f->list_node) failed in
> > >>>> flood_remove_flows_for_sb_uuid()
> > >>>>
> > >>>>>
> > >>>>> #6  0x00005607d73502fa in flood_remove_flows_for_sb_uuid
> > >>>>> (flow_table=flow_table at entry=0x5607d8c4b380,
> > >>>>> sb_uuid=sb_uuid at entry=0x5607d8f91430,
> > >>>>> flood_remove_nodes=flood_remove_nodes at entry=0x7ffeccadb9c0) at
> > >>>>> controller/ofctrl.c:1135
> > >>>>>
> > >>>>> What's in the code for revision (520189bf) is 1108:
> > >>>>>
> > >>
https://github.com/ovn-org/ovn/blob/520189bf313054702f5f802acd7944cca3b6baaa/controller/ofctrl.c#L1108
> > >>>>>
> > >>>>> Did you change anything while doing the testing?
> > >>>>>
> > >>>>
> > >>>> I'll try to replicate the issue again next week, maybe we get more
info.
> > >>>>
> > >>>> Thanks,
> > >>>> Dumitru
> > >>>>
> > >>>
> > >>> Hi Han,
> > >>>
> > >>> I managed to minimize the way to replicate the crash (with OVN
master).
> > >>> I do:
> > >>>
> > >>> make sandbox
> > >>> ovn-nbctl ls-add ls
> > >>> ovn-nbctl lsp-add ls vm1
> > >>> ovn-nbctl acl-add lsp-set-addresses vm1 "0a:58:fc:09:1d:4e
> > >> fd00:10:244:1::5"
> > >>
> > >> Thanks Dumitru! Here it seems a copy/paste problem. Could you provide
> > >> the ACLs added?
> >
> > Just skip this line, it's not needed.
> > 100% reproducible for me without it:
> >
> > 2020-09-14T16:52:04.039Z|00020|util|EMER|controller/ofctrl.c:1108:
> >   assertion ovs_list_is_empty(&f->list_node) failed in
flood_remove_flows_for_sb_uuid()
> >
> >
> >
> > >
> > > There are no ACLs involved. This line was a mistake, please ignore it.
> > > The conjunctions come from the IPv6 port security logical flows.
> > >
> > > https://paste.centos.org/view/bb93f7d3
> > >
>
> Thanks Dumitru and Ilya. With these clear steps to reproduce, I was able
to find the root cause. It is related to different problems in the history:
> 1. commit 07e3017 added a check to skip a lflow if lport is not local.
The check is done on every match condition in a loop, but instead I think
it should be done only once per-flow.
> 2. commit ade4e77 added the resource reference for lport checking at the
same location in the loop, which added redundant references for the same
lport - lflow mapping.
> 3. commit 580aea7 added flood removing based on the lflow-resource
references, which assumes that there are no redundant items in the lists.
Since the assumption isn't true because of the above 2 problems, it results
in corrupted pointers.
>
> There are different ways to fix the issue. I am trying to figure out the
best way and also working on debugging some regression test failures. I
will post the fix later today.
>

Here is the patch that fixes the issue:
https://patchwork.ozlabs.org/project/ovn/patch/1600157757-18964-1-git-send-email-hzhou@ovn.org/
I was wrong about 1), which is correct because matches for the same lflow
can have different inport/outport matches, e.g. when a port-group is used
to match inport/outport. I put more details about the fix in the commit
message.

> Thanks,
> Han
>
> > > Regards,
> > > Dumitru
> > >
> > >>
> > >>> ovn-nbctl lsp-set-addresses vm1 "0a:58:fc:09:1d:4e fd00:10:244:1::5"
> > >>> ovn-nbctl lsp-set-port-security vm1 "0a:58:fc:09:1d:4e
fd00:10:244:1::5"
> > >>> ovs-vsctl add-port br-int vm1 -- set interface vm1 type=internal --
set
> > >>> interface vm1 external-ids:iface-id=vm1
> > >>> sleep 1
> > >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo
> > >>> sleep 1
> > >>> ovs-vsctl set interface vm1 external-ids:iface-id=vm1
> > >>> sleep 1
> > >>> ovs-vsctl set interface vm1 external-ids:iface-id=foo
> > >>>
> > >>> This crashes ovn-controller and if I look at the openflows before
the
> > >>> last port unbind, I notice some weird flows with action:
> > >>>
> > >>>
> > >>
actions=conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2),conjunction(2,2/2)
> > >>>
> > >>> This looks broken. I tried to figure out what's happening but
> > >>> unfortunately I didn't manage to.
> > >>>
> > >>> Hope this helps pinpoint the root cause.
> > >>>
> > >>> Regards,
> > >>> Dumitru
> > >>>
> > >
> > > _______________________________________________
> > > dev mailing list
> > > dev at openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> > >
> >


More information about the dev mailing list