[ovs-dev] [PATCH ovn branch-20.03] chassis: Fix the way encaps are updated for a chassis record.

Dumitru Ceara dceara at redhat.com
Tue Sep 8 18:33:09 UTC 2020


On 9/8/20 7:43 PM, Han Zhou wrote:
> 
> 
> On Tue, Sep 8, 2020 at 2:24 AM Dumitru Ceara <dceara at redhat.com
> <mailto:dceara at redhat.com>> wrote:
>>
>> On 9/8/20 9:47 AM, Numan Siddique wrote:
>> >
>> >
>> > On Tue, Sep 8, 2020 at 1:16 PM Numan Siddique <numans at ovn.org
> <mailto:numans at ovn.org>
>> > <mailto:numans at ovn.org <mailto:numans at ovn.org>>> wrote:
>> >
>> >
>> >
>> >     On Tue, Sep 8, 2020 at 12:58 PM Han Zhou <zhouhan at gmail.com
> <mailto:zhouhan at gmail.com>
>> >     <mailto:zhouhan at gmail.com <mailto:zhouhan at gmail.com>>> wrote:
>> >
>> >         On Mon, Sep 7, 2020 at 3:11 AM Dumitru Ceara
> <dceara at redhat.com <mailto:dceara at redhat.com>
>> >         <mailto:dceara at redhat.com <mailto:dceara at redhat.com>>> wrote:
>> >         >
>> >         > ovn-controller always stores the last configured
>> >         system-id/chassis-id in
>> >         > memory regardless if the connection to the SB is up or down.
>> >         This is OK
>> >         > as long as the change can be committed successfully when
> the SB DB
>> >         > connection comes back up.
>> >         >
>> >         > Without this change, if the chassis-id changes while the SB
>> >         connection is
>> >         > down, ovn-controller will fail to create the new record but
>> >         nevertheless
>> >         > update its in-memory chassis-id. When the SB connection is
>> >         restored
>> >         > ovn-controller tries to find the record corresponding to the
>> >         chassis-id
>> >         > it stored in memory. This fails causing ovn-controller to try
>> >         to insert
>> >         > a new record. But at this point a constraint violation is hit
>> >         in the SB
>> >         > because the Encap records of the "stale" chassis still exist
>> >         in the DB,
>> >         > along with the old chassis record.
>> >         >
>> >         > This commit changes the way we search for a "stale" chassis
>> >         record in the
>> >         > SB to cover the above mentioned case. Also, in such cases
>> >         there's no need
>> >         > to recreate the Encaps, it's sufficient to update the
>> >         chassis_name field.
>> >         >
>> >         > Fixes: 5344f24ecb1a ("ovn-controller: Refactor chassis.c to
>> >         abstract the
>> >         string parsing")
>> >         > Signed-off-by: Dumitru Ceara <dceara at redhat.com
> <mailto:dceara at redhat.com>
>> >         <mailto:dceara at redhat.com <mailto:dceara at redhat.com>>>
>> >         >
>> >         > (cherry-picked from master commit
>> >         94a32fca2d2b825fece0ef5b1873459bd9857dd3)
>> >
>> >         Hi Dumitru,
>> >
>>
>> Hi Han, Numan,
>>
>> >         Sorry that I missed the original review, but this patch seems
>> >         causing
>> >         problem in master already. When RBAC is enabled, it will result
>> >         in below
>> >         error:
>> >         2020-09-08T07:20:31.991Z|00014|binding|INFO|Claiming lport lsp1
>> >         for this
>> >         chassis.
>> >         2020-09-08T07:20:31.992Z|00015|ovsdb_idl|WARN|transaction error:
>> >         {"details":"RBAC rules for client \"chassis-1\" role
>> >         \"ovn-controller\"
>> >         prohibit modification of table \"Encap\".","error":"permission
>> >         error"}
>> >
>> >         You can simply reproduce by: make sandbox, and then run:
> ovn-nbctl
>> >         --wait=hv sync, which will hang forever because of the problem.
>> >         This patch seems to be updating the "name" field of Encap table,
>> >         which
>> >         violates the design of RBAC, which uses "name" as the identity
>> >         of the
>> >         client. We shouldn't allow an user to change
>> >         system-id/chassis-id directly.
>>
>> How can we enforce this from within OVN?
>>
> My suggestion is don't enforce this from OVN. We can simply document
> this clearly. It would be good to prevent misuse as much as we can, but
> there are so many ways to break OVN if users don't follow the correct
> procedure, and we can't prevent all of them. Changing system-id is
> something I think very abnormal and if someone does it purposely they
> must know what they are doing. The logic is becoming more and more
> complex in OVN just for this misuse case, so I am thinking about
> simplifying this from the requirement point of view, unless there is a
> decent way of solving the problem.
> 
>> >         I think the proper way is firstly destroying the old chassis and
>> >         then
>> >         creating a new chassis, which would also avoid the complex
> logic in
>> >         ovn-controller regarding the "stale record" handling. What do
>> >         you think?
>> >
>> >
>>
>> I'm afraid this would be a behavior change that might introduce bugs in
>> CMSs that allow changing system-ids (without RBAC enabled) without
>> explicitly deleting the "stale" Chassis *and* Chassis_private record.
>>
>> Also, forcing CMS to do this housekeeping will probably add complex code
>> in the CMS. For example, if the "old" Chassis record is deleted before
>> ovn-controller receives the update about the system-id change in OVS DB
>> then ovn-controller will just recreate it.
>>
> Is changing system-id a common operation? I couldn't think about a real
> use case when we need to change system-id. Also, it doesn't make much
> sense to me if the solution supports changing it only when RBAC is
> disabled. In such case we will at least document more clearly what's
> supported in RBAC case and non-RBAC case, which could be more confusing
> than just don't support changing system-id. For corner cases, would it
> be sufficient to just document the proper steps of destroying and
> recreating a chassis whenever necessary?
> 

Right, it shouldn't happen very often but it may happen.

>> >     Hi Han,
>> >
>> >     Yes. Dumitru submitted the patch to fix this issue in master and I
>> >     applied it just now.
>> >     Can you please test again with the latest master.
>> >
>> >    
> https://github.com/ovn-org/ovn/commit/849c5a492a26337789db1a2dc97570989ef08c34
>> >
>> >     We see the same issue with branch-20.06 too.
>> >
>> >
>> >
>> > Hi Dumitru,
>> >
>> > Can you please make a 2 patch series for branch-20.03 which includes
>> > this patch and the other one
>> > which fixes the RBAC issue.
>> >
>> > Thanks
>> > Numan
>> >  
>>
>> Like you said, we have two options:
>>
>> 1. Leave it up to the CMS to cleanup Chassis/Chassis_private when
>> system-id changes and completely ignore in ovn-controller any stale
>> chassis records that might exist in the SB DB. In this case we'd just
>> let transactions fail due to constraint violations if the CMS cleanup of
>> the Chassis/Chassis_private records doesn't happen or is incorrect.
>>
>> 2. Make it more user friendly for the case when RBAC is _not used_ and
>> try to cleanup after ourselves in ovn-controller and reuse the stale
>> records. This is already happening with the code on the master branch
>> and if we cherry pick [0] as Numan suggested and add it to this series
>> it would be the case for branch-20.03 too. [0] would have to go to
>> branch-20.06 as well.
>>
>> I would say, at least for now, the least intrusive change, with the
>> least chance of breaking CMSs, is #2 above but I'll wait for
>> confirmation before sending a new version of this patch.
> 
> Unfortunately [0] is still broken. The steps to reproduce:
>     $ make sandbox
>     $ ovs-vsctl set open . external_ids:system-id=new-chassis
>     $ ovs-vsctl set open . external_ids:system-id=new-chassis-2
> 
> The problem is, the "name" column is the RBAC identity of a
> ovn-controller client. RBAC defines what operations are allowed for an
> authenticated client, but the identity itself should never be changed.
> Otherwise, there is no point for RBAC.
> 

OK but you're changing the system-id with RBAC enabled which we decided
already that is not working and can't be fixed.

In any case, you're right, this is getting too complicated and we
clearly can't handle all cases without operator help so I'll send a new
patch (on master first) to:
- document that changing system-id is something that can't be handled
completely in ovn-controller on its own and will need operator help to
cleanup stale records in SB.
- document how changing the system-id can be performed with RBAC enabled
(I think we need to change the SSL certificates too to use the new
system-id).
- remove the parts of the code in chassis.c that reuse stale
Chassis/Chassis_private records.

What do you think?

Thanks,
Dumitru




More information about the dev mailing list