[ovs-dev] [PATCH v2] ofproto-dpif-upcall: Fix race condition while purging
jianbol at nvidia.com
Tue Jun 1 02:02:00 UTC 2021
The 05/31/2021 19:58, Ilya Maximets wrote:
> On 5/28/21 5:16 AM, Jianbo Liu wrote:
> > The 05/27/2021 12:43, Ilya Maximets wrote:
> >> On 5/27/21 12:29 PM, Jianbo Liu wrote:
> >>> There is a race condidtion between purger and handler in dpif-netlink.
> >>> Handler may create new ukey and install it while executing 'ovs-appctl
> >>> revalidator/purge' command. However, before handler calls
> >>> transition_ukey() in handle_upcalls(), purger may get this ukey from
> >>> umap, then evict and delete it. This will trigger ovs_abort in
> >>> transition_ukey() for handler because it is trying to set state to
> >>> EVICTED or OPERATIONAL, but ukey is already in DELETED state.
> >>> To fix this issue, purger must not delete ukey in VISIBLE state.
> >> Hi. This is not a good thing to trigger abort(), but "purge" means
> >> "purge". And, AFAIU, most of ukeys are visible. This is a purely
> >> debug interface that was introduced to test functionality of OVS and
> >> should not be used in production environment. The fact that "purge"
> >> doesn't remove visible ukeys, in my opinion, just makes the appctl
> >> "revalidator/purge" useless. Can we replace abort() with rate-limited
> > But currently ukey is also not removed if we can't hold its lock.
> > Besides, new ukey will be installed while purging. We can't make sure
> > that no visilbe/operational ukeys exist at the monent this command is
> > finished. So in order to resolve the race, visible ukey should be kept,
> > it is just like a new incoming ukey.
> The purpose of this call is to be used in the testsuite where we have
> a full control over traffic that flows through OVS ports, so locking
> issue or installation of a new ukey is not possible there.
> >> error message for this scenario instead to avoid process termination?
> > No sure if there are issues, because ukey is deleted (maybe freed) while
> > handler still access it by the pointer.
> Well, if there is a race that could lead to a crash, it make sense
> fixing, but it doesn't make sense to fix only for one datapath type.
It is related to handle_upcalls(), and dpif-netlink calls it, but
dpif-netdev does not, right?
> If it's not possible to fix keeping the semantics of the call, I'd
> prefer to not fix that at all as we need this appctl for unit testing,
> other use cases are not supported. i.e. there is no point to change
> this call if the fix beaks its main functionality.
I don't think it breaks. If it's used only for unit test as you
suggested, there should not be VISIBLE ukey while purge for
dpif-netlink, so the behavior is still the same.
> >> BTW, how did you catch this? Is it reproducible with system tests?
> > We found this issue when testing CT offload, do purge without stop
> > traffic.
> What is the point of calling 'purge' in this scenario if it will
> not purge visible ukeys?
Maybe stability, I think. We don't know what command user will run on
his system. Whatever he do, we must make sure openvswitch does not
> > Thanks!
> > Jianbo
> >> Best regards, Ilya Maximets.
More information about the dev