[ovs-dev] [PATCH ovn branch-20.03] chassis: Fix the way encaps are updated for a chassis record.

Han Zhou zhouhan at gmail.com
Tue Sep 8 18:45:32 UTC 2020


On Tue, Sep 8, 2020 at 11:33 AM Dumitru Ceara <dceara at redhat.com> wrote:
>
> On 9/8/20 7:43 PM, Han Zhou wrote:
> >
> >
> > On Tue, Sep 8, 2020 at 2:24 AM Dumitru Ceara <dceara at redhat.com
> > <mailto:dceara at redhat.com>> wrote:
> >>
> >> On 9/8/20 9:47 AM, Numan Siddique wrote:
> >> >
> >> >
> >> > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans at ovn.org
> > <mailto:numans at ovn.org>
> >> > <mailto:numans at ovn.org <mailto:numans at ovn.org>>> wrote:
> >> >
> >> >
> >> >
> >> >     On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan at gmail.com
> > <mailto:zhouhan at gmail.com>
> >> >     <mailto:zhouhan at gmail.com <mailto:zhouhan at gmail.com>>> wrote:
> >> >
> >> >         On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara
> > <dceara at redhat.com <mailto:dceara at redhat.com>
> >> >         <mailto:dceara at redhat.com <mailto:dceara at redhat.com>>> wrote:
> >> >         >
> >> >         > ovn-controller always stores the last configured
> >> >         system-id/chassis-id in
> >> >         > memory regardless if the connection to the SB is up or
down.
> >> >         This is OK
> >> >         > as long as the change can be committed successfully when
> > the SB DB
> >> >         > connection comes back up.
> >> >         >
> >> >         > Without this change, if the chassis-id changes while the SB
> >> >         connection is
> >> >         > down, ovn-controller will fail to create the new record but
> >> >         nevertheless
> >> >         > update its in-memory chassis-id. When the SB connection is
> >> >         restored
> >> >         > ovn-controller tries to find the record corresponding to
the
> >> >         chassis-id
> >> >         > it stored in memory. This fails causing ovn-controller to
try
> >> >         to insert
> >> >         > a new record. But at this point a constraint violation is
hit
> >> >         in the SB
> >> >         > because the Encap records of the "stale" chassis still
exist
> >> >         in the DB,
> >> >         > along with the old chassis record.
> >> >         >
> >> >         > This commit changes the way we search for a "stale" chassis
> >> >         record in the
> >> >         > SB to cover the above mentioned case. Also, in such cases
> >> >         there's no need
> >> >         > to recreate the Encaps, it's sufficient to update the
> >> >         chassis_name field.
> >> >         >
> >> >         > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to
> >> >         abstract the
> >> >         string parsing")
> >> >         > Signed-off-by: Dumitru Ceara <dceara at redhat.com
> > <mailto:dceara at redhat.com>
> >> >         <mailto:dceara at redhat.com <mailto:dceara at redhat.com>>>
> >> >         >
> >> >         > (cherry-picked from master commit
> >> >         94a32fca2d2b825fece0ef5b1873459bd9857dd3)
> >> >
> >> >         Hi Dumitru,
> >> >
> >>
> >> Hi Han, Numan,
> >>
> >> >         Sorry that I missed the original review, but this patch seems
> >> >         causing
> >> >         problem in master already. When RBAC is enabled, it will
result
> >> >         in below
> >> >         error:
> >> >         2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport
lsp1
> >> >         for this
> >> >         chassis.
> >> >         2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction
error:
> >> >         {"details":"RBAC rules for client \"chassis-1\" role
> >> >         \"ovn-controller\"
> >> >         prohibit modification of table
\"Encap\".","error":"permission
> >> >         error"}
> >> >
> >> >         You can simply reproduce by: make sandbox, and then run:
> > ovn-nbctl
> >> >         --wait=hv sync, which will hang forever because of the
problem.
> >> >         This patch seems to be updating the "name" field of Encap
table,
> >> >         which
> >> >         violates the design of RBAC, which uses "name" as the
identity
> >> >         of the
> >> >         client. We shouldn't allow an user to change
> >> >         system-id/chassis-id directly.
> >>
> >> How can we enforce this from within OVN?
> >>
> > My suggestion is don't enforce this from OVN. We can simply document
> > this clearly. It would be good to prevent misuse as much as we can, but
> > there are so many ways to break OVN if users don't follow the correct
> > procedure, and we can't prevent all of them. Changing system-id is
> > something I think very abnormal and if someone does it purposely they
> > must know what they are doing. The logic is becoming more and more
> > complex in OVN just for this misuse case, so I am thinking about
> > simplifying this from the requirement point of view, unless there is a
> > decent way of solving the problem.
> >
> >> >         I think the proper way is firstly destroying the old chassis
and
> >> >         then
> >> >         creating a new chassis, which would also avoid the complex
> > logic in
> >> >         ovn-controller regarding the "stale record" handling. What do
> >> >         you think?
> >> >
> >> >
> >>
> >> I'm afraid this would be a behavior change that might introduce bugs in
> >> CMSs that allow changing system-ids (without RBAC enabled) without
> >> explicitly deleting the "stale" Chassis *and* Chassis_private record.
> >>
> >> Also, forcing CMS to do this housekeeping will probably add complex
code
> >> in the CMS. For example, if the "old" Chassis record is deleted before
> >> ovn-controller receives the update about the system-id change in OVS DB
> >> then ovn-controller will just recreate it.
> >>
> > Is changing system-id a common operation? I couldn't think about a real
> > use case when we need to change system-id. Also, it doesn't make much
> > sense to me if the solution supports changing it only when RBAC is
> > disabled. In such case we will at least document more clearly what's
> > supported in RBAC case and non-RBAC case, which could be more confusing
> > than just don't support changing system-id. For corner cases, would it
> > be sufficient to just document the proper steps of destroying and
> > recreating a chassis whenever necessary?
> >
>
> Right, it shouldn't happen very often but it may happen.
>
> >> >     Hi Han,
> >> >
> >> >     Yes. Dumitru submitted the patch to fix this issue in master and
I
> >> >     applied it just now.
> >> >     Can you please test again with the latest master.
> >> >
> >> >
> >
https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
> >> >
> >> >     We see the same issue with branch-20.06 too.
> >> >
> >> >
> >> >
> >> > Hi Dumitru,
> >> >
> >> > Can you please make a 2 patch series for branch-20.03 which includes
> >> > this patch and the other one
> >> > which fixes the RBAC issue.
> >> >
> >> > Thanks
> >> > Numan
> >> >
> >>
> >> Like you said, we have two options:
> >>
> >> 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when
> >> system-id changes and completely ignore in ovn-controller any stale
> >> chassis records that might exist in the SB DB. In this case we'd just
> >> let transactions fail due to constraint violations if the CMS cleanup
of
> >> the Chassis/Chassis_private records doesn't happen or is incorrect.
> >>
> >> 2. Make it more user friendly for the case when RBAC is _not used_ and
> >> try to cleanup after ourselves in ovn-controller and reuse the stale
> >> records. This is already happening with the code on the master branch
> >> and if we cherry pick [0] as Numan suggested and add it to this series
> >> it would be the case for branch-20.03 too. [0] would have to go to
> >> branch-20.06 as well.
> >>
> >> I would say, at least for now, the least intrusive change, with the
> >> least chance of breaking CMSs, is #2 above but I'll wait for
> >> confirmation before sending a new version of this patch.
> >
> > Unfortunately [0] is still broken. The steps to reproduce:
> >     $ make sandbox
> >     $ ovs-vsctl set open . external_ids:system-id=new-chassis
> >     $ ovs-vsctl set open . external_ids:system-id=new-chassis-2
> >
> > The problem is, the "name" column is the RBAC identity of a
> > ovn-controller client. RBAC defines what operations are allowed for an
> > authenticated client, but the identity itself should never be changed.
> > Otherwise, there is no point for RBAC.
> >
>
> OK but you're changing the system-id with RBAC enabled which we decided
> already that is not working and can't be fixed.
>
> In any case, you're right, this is getting too complicated and we
> clearly can't handle all cases without operator help so I'll send a new
> patch (on master first) to:
> - document that changing system-id is something that can't be handled
> completely in ovn-controller on its own and will need operator help to
> cleanup stale records in SB.

Thanks Dumitru. The plan sounds good to me. Let's be clear what will be
supported in OVN and what will not. IMHO, I'd say we don't support directly
changing system-id. If it needs to be changed, follow the procedure of
deleting and recreating a new chassis, which can be described in detail for
RBAC and non-RBAC respectively. Is this the same as what you are about to
address?

> - document how changing the system-id can be performed with RBAC enabled
> (I think we need to change the SSL certificates too to use the new
> system-id).
> - remove the parts of the code in chassis.c that reuse stale
> Chassis/Chassis_private records.
>
> What do you think?
>
> Thanks,
> Dumitru
>
>


More information about the dev mailing list