[ovs-dev] [PATCH ovn branch-20.03] chassis: Fix the way encaps are updated for a chassis record.

Dumitru Ceara dceara at redhat.com
Thu Sep 10 07:00:58 UTC 2020


On 9/8/20 8:45 PM, Han Zhou wrote:
> 
> 
> 
> On Tue, Sep 8, 2020 at 11:33 AM Dumitru Ceara <dceara at redhat.com
> <mailto:dceara at redhat.com>> wrote:
>>
>> On 9/8/20 7:43 PM, Han Zhou wrote:
>> >
>> >
>> > On Tue, Sep 8, 2020 at 2:24 AM Dumitru Ceara <dceara at redhat.com
> <mailto:dceara at redhat.com>
>> > <mailto:dceara at redhat.com <mailto:dceara at redhat.com>>> wrote:
>> >>
>> >> On 9/8/20 9:47 AM, Numan Siddique wrote:
>> >> >
>> >> >
>> >> > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans at ovn.org
> <mailto:numans at ovn.org>
>> > <mailto:numans at ovn.org <mailto:numans at ovn.org>>
>> >> > <mailto:numans at ovn.org <mailto:numans at ovn.org>
> <mailto:numans at ovn.org <mailto:numans at ovn.org>>>> wrote:
>> >> >
>> >> >
>> >> >
>> >> >     On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan at gmail.com
> <mailto:zhouhan at gmail.com>
>> > <mailto:zhouhan at gmail.com <mailto:zhouhan at gmail.com>>
>> >> >     <mailto:zhouhan at gmail.com <mailto:zhouhan at gmail.com>
> <mailto:zhouhan at gmail.com <mailto:zhouhan at gmail.com>>>> wrote:
>> >> >
>> >> >         On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara
>> > <dceara at redhat.com <mailto:dceara at redhat.com>
> <mailto:dceara at redhat.com <mailto:dceara at redhat.com>>
>> >> >         <mailto:dceara at redhat.com <mailto:dceara at redhat.com>
> <mailto:dceara at redhat.com <mailto:dceara at redhat.com>>>> wrote:
>> >> >         >
>> >> >         > ovn-controller always stores the last configured
>> >> >         system-id/chassis-id in
>> >> >         > memory regardless if the connection to the SB is up or
> down.
>> >> >         This is OK
>> >> >         > as long as the change can be committed successfully when
>> > the SB DB
>> >> >         > connection comes back up.
>> >> >         >
>> >> >         > Without this change, if the chassis-id changes while the SB
>> >> >         connection is
>> >> >         > down, ovn-controller will fail to create the new record but
>> >> >         nevertheless
>> >> >         > update its in-memory chassis-id. When the SB connection is
>> >> >         restored
>> >> >         > ovn-controller tries to find the record corresponding
> to the
>> >> >         chassis-id
>> >> >         > it stored in memory. This fails causing ovn-controller
> to try
>> >> >         to insert
>> >> >         > a new record. But at this point a constraint violation
> is hit
>> >> >         in the SB
>> >> >         > because the Encap records of the "stale" chassis still
> exist
>> >> >         in the DB,
>> >> >         > along with the old chassis record.
>> >> >         >
>> >> >         > This commit changes the way we search for a "stale" chassis
>> >> >         record in the
>> >> >         > SB to cover the above mentioned case. Also, in such cases
>> >> >         there's no need
>> >> >         > to recreate the Encaps, it's sufficient to update the
>> >> >         chassis_name field.
>> >> >         >
>> >> >         > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to
>> >> >         abstract the
>> >> >         string parsing")
>> >> >         > Signed-off-by: Dumitru Ceara <dceara at redhat.com
> <mailto:dceara at redhat.com>
>> > <mailto:dceara at redhat.com <mailto:dceara at redhat.com>>
>> >> >         <mailto:dceara at redhat.com <mailto:dceara at redhat.com>
> <mailto:dceara at redhat.com <mailto:dceara at redhat.com>>>>
>> >> >         >
>> >> >         > (cherry-picked from master commit
>> >> >         94a32fca2d2b825fece0ef5b1873459bd9857dd3)
>> >> >
>> >> >         Hi Dumitru,
>> >> >
>> >>
>> >> Hi Han, Numan,
>> >>
>> >> >         Sorry that I missed the original review, but this patch seems
>> >> >         causing
>> >> >         problem in master already. When RBAC is enabled, it will
> result
>> >> >         in below
>> >> >         error:
>> >> >         2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming
> lport lsp1
>> >> >         for this
>> >> >         chassis.
>> >> >         2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction
> error:
>> >> >         {"details":"RBAC rules for client \"chassis-1\" role
>> >> >         \"ovn-controller\"
>> >> >         prohibit modification of table
> \"Encap\".","error":"permission
>> >> >         error"}
>> >> >
>> >> >         You can simply reproduce by: make sandbox, and then run:
>> > ovn-nbctl
>> >> >         --wait=hv sync, which will hang forever because of the
> problem.
>> >> >         This patch seems to be updating the "name" field of Encap
> table,
>> >> >         which
>> >> >         violates the design of RBAC, which uses "name" as the
> identity
>> >> >         of the
>> >> >         client. We shouldn't allow an user to change
>> >> >         system-id/chassis-id directly.
>> >>
>> >> How can we enforce this from within OVN?
>> >>
>> > My suggestion is don't enforce this from OVN. We can simply document
>> > this clearly. It would be good to prevent misuse as much as we can, but
>> > there are so many ways to break OVN if users don't follow the correct
>> > procedure, and we can't prevent all of them. Changing system-id is
>> > something I think very abnormal and if someone does it purposely they
>> > must know what they are doing. The logic is becoming more and more
>> > complex in OVN just for this misuse case, so I am thinking about
>> > simplifying this from the requirement point of view, unless there is a
>> > decent way of solving the problem.
>> >
>> >> >         I think the proper way is firstly destroying the old
> chassis and
>> >> >         then
>> >> >         creating a new chassis, which would also avoid the complex
>> > logic in
>> >> >         ovn-controller regarding the "stale record" handling. What do
>> >> >         you think?
>> >> >
>> >> >
>> >>
>> >> I'm afraid this would be a behavior change that might introduce bugs in
>> >> CMSs that allow changing system-ids (without RBAC enabled) without
>> >> explicitly deleting the "stale" Chassis *and* Chassis_private record.
>> >>
>> >> Also, forcing CMS to do this housekeeping will probably add complex
> code
>> >> in the CMS. For example, if the "old" Chassis record is deleted before
>> >> ovn-controller receives the update about the system-id change in OVS DB
>> >> then ovn-controller will just recreate it.
>> >>
>> > Is changing system-id a common operation? I couldn't think about a real
>> > use case when we need to change system-id. Also, it doesn't make much
>> > sense to me if the solution supports changing it only when RBAC is
>> > disabled. In such case we will at least document more clearly what's
>> > supported in RBAC case and non-RBAC case, which could be more confusing
>> > than just don't support changing system-id. For corner cases, would it
>> > be sufficient to just document the proper steps of destroying and
>> > recreating a chassis whenever necessary?
>> >
>>
>> Right, it shouldn't happen very often but it may happen.
>>
>> >> >     Hi Han,
>> >> >
>> >> >     Yes. Dumitru submitted the patch to fix this issue in master
> and I
>> >> >     applied it just now.
>> >> >     Can you please test again with the latest master.
>> >> >
>> >> >    
>> >
> https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
>> >> >
>> >> >     We see the same issue with branch-20.06 too.
>> >> >
>> >> >
>> >> >
>> >> > Hi Dumitru,
>> >> >
>> >> > Can you please make a 2 patch series for branch-20.03 which includes
>> >> > this patch and the other one
>> >> > which fixes the RBAC issue.
>> >> >
>> >> > Thanks
>> >> > Numan
>> >> >  
>> >>
>> >> Like you said, we have two options:
>> >>
>> >> 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when
>> >> system-id changes and completely ignore in ovn-controller any stale
>> >> chassis records that might exist in the SB DB. In this case we'd just
>> >> let transactions fail due to constraint violations if the CMS
> cleanup of
>> >> the Chassis/Chassis_private records doesn't happen or is incorrect.
>> >>
>> >> 2. Make it more user friendly for the case when RBAC is _not used_ and
>> >> try to cleanup after ourselves in ovn-controller and reuse the stale
>> >> records. This is already happening with the code on the master branch
>> >> and if we cherry pick [0] as Numan suggested and add it to this series
>> >> it would be the case for branch-20.03 too. [0] would have to go to
>> >> branch-20.06 as well.
>> >>
>> >> I would say, at least for now, the least intrusive change, with the
>> >> least chance of breaking CMSs, is #2 above but I'll wait for
>> >> confirmation before sending a new version of this patch.
>> >
>> > Unfortunately [0] is still broken. The steps to reproduce:
>> >     $ make sandbox
>> >     $ ovs-vsctl set open . external_ids:system-id=new-chassis
>> >     $ ovs-vsctl set open . external_ids:system-id=new-chassis-2
>> >
>> > The problem is, the "name" column is the RBAC identity of a
>> > ovn-controller client. RBAC defines what operations are allowed for an
>> > authenticated client, but the identity itself should never be changed.
>> > Otherwise, there is no point for RBAC.
>> >
>>
>> OK but you're changing the system-id with RBAC enabled which we decided
>> already that is not working and can't be fixed.
>>
>> In any case, you're right, this is getting too complicated and we
>> clearly can't handle all cases without operator help so I'll send a new
>> patch (on master first) to:
>> - document that changing system-id is something that can't be handled
>> completely in ovn-controller on its own and will need operator help to
>> cleanup stale records in SB.
> 
> Thanks Dumitru. The plan sounds good to me. Let's be clear what will be
> supported in OVN and what will not. IMHO, I'd say we don't support
> directly changing system-id. If it needs to be changed, follow the
> procedure of deleting and recreating a new chassis, which can be
> described in detail for RBAC and non-RBAC respectively. Is this the same
> as what you are about to address?
> 

Exactly.

Thanks,
Dumitru

>> - document how changing the system-id can be performed with RBAC enabled
>> (I think we need to change the SSL certificates too to use the new
>> system-id).
>> - remove the parts of the code in chassis.c that reuse stale
>> Chassis/Chassis_private records.
>>
>> What do you think?
>>
>> Thanks,
>> Dumitru
>>
>>



More information about the dev mailing list