[ovs-dev] [PATCH] dpif-netdev: Fix crash when PACKET_OUT is metered
i.maximets at ovn.org
Mon Jun 21 20:58:06 UTC 2021
On 6/19/21 4:11 AM, Tony van der Peet wrote:
> Hi all
> I am in favour of a better fix, bandaids tend to come back and bite you in the end anyway. So, will be happy to help with this effort, though I am probably going to have to defer to the rest of you when it comes to actually knowing what to do in this area.
It's great to have a full correct solution, but unfortunately, I don't
think that we can come up with something like a full and correct
reference counting in a short period of time. But we still need to fix
a crash that is pretty easy to trigger. I'm also nervous that OVN is
using meters more and more (e.g. control plane protection patch set)
and they might actually hit this issue at the high load. Another point
is that we need a fix that we can backport to LTS, and I don't think that
reference counting would be a small change that we can easily backport
without worrying about breaking something else.
All in all, I think that we need to come up with a "bandaid" solution
and work further on correct reference counting after that.
And we also need to create a unit test that will mimic packet-out that
I encountered in OVN tests, so we can test this kind of behavior in OVS.
For the reference, the packet-out generated by OVN controller had a few
set() actions and the resubmit() to a different table. And this
table had rules leading to packet output to a tunnel port, resulting
in a tunnel push + output datapath actions.
> From: Aaron Conole <aconole at redhat.com>
> Sent: Saturday, 19 June 2021 2:50 a.m.
> To: Ilya Maximets
> Cc: Ben Pfaff; Tony van der Peet; dev at openvswitch.org; Tony van der Peet
> Subject: Re: [ovs-dev] [PATCH] dpif-netdev: Fix crash when PACKET_OUT is metered
> Ilya Maximets <i.maximets at ovn.org> writes:
>> On 6/17/21 11:59 PM, Ben Pfaff wrote:
>>> All these flags for stealing, allowing stealing, blah blah, are just
>>> ways to do some kind of dumb reference counting without actually have a
>>> reference count. When it gets super complex like this, maybe
>>> introducing a reference count is the way to go. It would be a bigger
>>> change, but perhaps more maintainable over time.
>> Yes, I absolutely agree. We just removed one of these hacky flags from
>> the struct dp_packet_batch and we need to get rid of the rest of them.
> +1 from me - it's a bigger scoped change, but over-all, I think it's a
> better one, and makes the most sense.
>> One thing that bothers me is that some parts of the code treats
>> 'should_steal=false' as "do not modify". For example, I don't really
>> understand why these functions are carrying cutlen around and clones
>> the packet before truncating if 'should_steal=false'.
>> Some action executors also have optimizations that allows to not clone
>> the packet before the fatal action if this action is the last one.
>> So, in general, yes, we need to get rid of these flags and accurately
>> re-work all the packet paths. And yes, this would be not a small
>> For now, I think, we need to find a less ugly as possible solution for
>> existing crash (maybe the one that I suggested), so we will be able to
>> backport it and continue working on correct reference counting.
>> What do you think? Tony? Aaron?
> That makes sense to me, and I hope we will actually work on the
> ref-count stuff. I had started taking a look a few weeks back, but the
> idea of 'steal' is pretty ingrained throughout the code, so getting a
> ref-count semantic correct is a big effort. As an example, the
> odp-execute area expects that each sub-action will have its own copy to
> modify (and this is documented with the 'should_steal' semantics).
> Taking that out will be a big rework in that area. I think it makes
> the most sense, and we probably could implement something like COW to
> cover those cases where we really need to modify a packet without
> propagating those changes back up.
>> Best regards, Ilya Maximets.
More information about the dev