[ovs-dev] [ovs-discuss] Patch for PACKET_OUT getting deleted twice crash

Aaron Conole aconole at redhat.com
Wed Jun 9 13:12:44 UTC 2021


Ilya Maximets <i.maximets at ovn.org> writes:

> On 6/7/21 3:59 PM, Ilya Maximets wrote:
>> On 6/7/21 3:09 PM, Aaron Conole wrote:
>>> Ilya Maximets <i.maximets at ovn.org> writes:
>>>
>>>>> Here is a patch with both a test and a fix.
>>>
>>> Thanks so much!  It's nice to get fixes, but I think it's really great
>>> when test cases come along with them.
>>>
>>>> Hi.  Thanks for working n this!
>>>>
>>>> CC: ovs-dev
>>>>
>>>>> Not submitting as a formal
>>>>> patch because I would like some feedback on whether 1) maintainers feel
>>>>> this is worth fixing and
>>>>
>>>> I can reproduce the crash with your test.  Basically, actions in userspace
>>>> datapath may drop packets if something goes wrong.  'meter' action just
>>>> seems to be the most explicit variant.  So, I think, this is definitely
>>>> worth fixing as some other condition might trigger this crash on packet-out
>>>> as well.
>>>>
>>>> ==2568112==ERROR: AddressSanitizer: heap-use-after-free
>>>> on address 0x61600000699c at pc 0x000000573860 bp 0x7ffebc6cc880 sp 0x7ffebc6cc878
>>>> READ of size 1 at 0x61600000699c thread T0
>>>>     #0 0x57385f in dp_packet_delete lib/dp-packet.h:242:16
>>>>     #1 0x57372c in ofproto_packet_out_uninit ofproto/ofproto.c:3562:5
>>>>     #2 0x585e77 in handle_packet_out ofproto/ofproto.c:3722:5
>>>>     #3 0x583801 in handle_single_part_openflow ofproto/ofproto.c:8499:16
>>>>     #4 0x570c9c in handle_openflow ofproto/ofproto.c:8686:21
>>>>     #5 0x611781 in ofconn_run ofproto/connmgr.c:1329:13
>>>>     #6 0x6112ed in connmgr_run ofproto/connmgr.c:356:9
>>>>     #7 0x56fdf4 in ofproto_run ofproto/ofproto.c:1891:5
>>>>     #8 0x545ec0 in bridge_run__ vswitchd/bridge.c:3251:9
>>>>     #9 0x5456a5 in bridge_run vswitchd/bridge.c:3310:5
>>>>     #10 0x55f5b1 in main vswitchd/ovs-vswitchd.c:127:9
>>>>     #11 0x7f85bfe09081 in __libc_start_main (/lib64/libc.so.6+0x27081)
>>>>     #12 0x46d00d in _start (vswitchd/ovs-vswitchd+0x46d00d)
>>>>
>>>>> 2) whether this is the way to fix it.
>>>>
>>>> See inline.
>>>>
>>>>>
>>>>> I have tried to make the most minimal change possible, but this means that
>>>>> there might be paths through the code that give unexpected behaviour (which
>>>>> in the worst case would be a memory leak I suppose).
>>>>>
>>>>> Tony
>>>>>
>>>>> diff --git a/lib/dp-packet.h b/lib/dp-packet.h
>>>>> index 246be14d0..5e0dabe67 100644
>>>>> --- a/lib/dp-packet.h
>>>>> +++ b/lib/dp-packet.h
>>>>> @@ -739,6 +739,7 @@ struct dp_packet_batch {
>>>>>     size_t count;
>>>>>     bool trunc; /* true if the batch needs truncate. */
>>>>>     bool do_not_steal; /* Indicate that the packets should not be stolen.
>>>>> */
>>>>> +    bool packet_out; /* Indicate single packet is PACKET_OUT */
>>>>>     struct dp_packet *packets[NETDEV_MAX_BURST];
>>>>> };
>>>>>
>>>>> @@ -748,6 +749,7 @@ dp_packet_batch_init(struct dp_packet_batch *batch)
>>>>>     batch->count = 0;
>>>>>     batch->trunc = false;
>>>>>     batch->do_not_steal = false;
>>>>> +    batch->packet_out = false;
>>>>> }
>>>>>
>>>>> static inline void
>>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>>> index 650e67ab3..deba4a94a 100644
>>>>> --- a/lib/dpif-netdev.c
>>>>> +++ b/lib/dpif-netdev.c
>>>>> @@ -4170,6 +4170,7 @@ dpif_netdev_execute(struct dpif *dpif, struct
>>>>> dpif_execute *execute)
>>>>>
>>>>>     dp_packet_batch_init_packet(&pp, execute->packet);
>>>>>     pp.do_not_steal = true;
>>>>> +    pp.packet_out = execute->packet_out;
>>>>>     dp_netdev_execute_actions(pmd, &pp, false, execute->flow,
>>>>>                               execute->actions, execute->actions_len);
>>>>
>>>> There is already a dirty hack named "do_not_steal" that was introduced,
>>>> I guess, exactly to avoid crash in the conntrack code that could drop/steal
>>>> the packet just like meter action.  And it seems that here in
>>>> dpif_netdev_execute() is the only problematic entry point as all other
>>>> normal paths expects that packet might be destroyed.
>>>>
>>>> The problem was, I suppose, introduced when we tried to unify semantics
>>>> of "may_steal" flag by turning it into "should_steal".  But it seems that
>>>> in this function we really need to prohibit stealing of the packet since
>>>> ofproto layer still owns it regardless of the result of execution.
>>>>
>>>> I don't think that we need one more flag here, but we have several options
>>>> how to fix the crash:
>>>>
>>>> 1. Start honoring batch->do_not_steal flag in all actions that may result
>>>>    in packet drops.  As the original idea of having 'do_not_steal' flag for
>>>>    a batch is very hacky, I'd like to not do that.
>>>
>>> +1
>>>
>>>> 2. Try to propagate information that packet was deleted up to ofproto layer,
>>>>    i.e. make handle_packet_out() aware of that.  Will, probably, be not that
>>>>    easy to do.
>>>
>>> I had a look at doing this, but as you note it is quite intrusive, and
>>> we need to make changes all over.
>>>
>>>> 3. This function (dpif_netdev_execute) is not on a hot path in userspace
>>>>    datapath, IIUC.  It might be that it's just easier to remove the
>>>>    'do_not_steal' flag entirely, clone the packet here and call the
>>>>    dp_netdev_execute_actions() with should_steal=true.
>>>>    This sounds like the best solution for me, unless I overlooked some
>>>>    scenario, where this code is on a hot path.
>>>
>>> I like this approach.
>>>
>>>> Thoughts?
>>>>
>>>> Aaron, you have a patch[1] to remove 'do_not_steal' flag, so fix for this issue
>>>> will, likely, touch the same parts of the code.  What do you think about this
>>>> issue and possible solutions?
>>>
>>> I guess we should do the same thing we do in other places, ie: default
>>> assume that the packet cannot be 'stolen' and we should clone our own
>>> copies.
>> 
>> I'm not sure that I understood this correctly, but the idea was that default
>> assumption is that packet can be stolen at any point of datapath processing
>> and higher layers should deal with this.

Re-reading what I wrote, I don't either.  :-/  I agree with what you've
written.

>> For the IP fragmentation handling this will mean that ipf will just take a
>> packet directly from the original batch without copying, so the original
>> batch will not have this packet anymore and ipf is allowed to free it at
>> any point in time, because now it owns this packet.
>> This aligns with the "should_steal" semantics, as any function called with
>> "should_steal=true" must take the ownership of the packet.  If the function
>> called with "should_steal=false" it still allowed to take ownership of some
>> packets from the batch and caller must be prepared for that.
>> If some function in datapath has no "should_steal" argument, it should be
>> treated as a function with "should_steal=false".   This applies to both
>> conntrack_execute() and dp_netdev_run_meter(), also to netdev_pop_header()
>> and so on.
>
> Hmm, OVS_ACTION_ATTR_TUNNEL_POP implies recirculation, so netdev_pop_header()
> is not a fully valid example here.   Actions that implies recirculation or
> recirculates packets in any other way (e.g. OVS_ACTION_ATTR_USERSPACE) should
> clone packets before doing that if not asked to take ownership.

The ownership of the dp_packet object needs to be well established.  Use
of should_steal could be okay, but as an example, we have something in
the packet batch but it doesn't actually get honored (which is why I
proposed removing it).

>>> If we are worried about the time it takes to copy the dp_packet
>>> structure and buffer, we can always introduce a reference counting
>>> mechanism later as an optimization.
>>>
>>> I would just prefer to do clone, and then the functional area which
>>> needs to hold a reference to a valid packet buffer can delete when it
>>> makes sense.
>>>
>>> Hope it helps.
>>>
>>>> Best regards, Ilya Maximets.
>>>>
>>>>
>>>> [1] https://patchwork.ozlabs.org/project/openvswitch/patch/20210521175905.165779-1-aconole@redhat.com/
>>>>
>>>> Non-line-wrapped version of the test for convenience:
>>>>
>>>> diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
>>>> index 31064ed95..d01f438b8 100644
>>>> --- a/tests/ofproto-dpif.at
>>>> +++ b/tests/ofproto-dpif.at
>>>> @@ -2159,6 +2159,27 @@ meter:controller flow_count:0 packet_in_count:8 byte_in_count:112 duration:0.0s
>>>>  OVS_VSWITCHD_STOP
>>>>  AT_CLEANUP
>>>>  
>>>> +AT_SETUP([ofproto-dpif packet-out table meter drop qwe])
>>>> +OVS_VSWITCHD_START
>>>> +add_of_ports br0 1 2
>>>> +
>>>> +AT_CHECK([ovs-ofctl -O OpenFlow13 add-meter br0 'meter=1 pktps bands=type=drop rate=1'])
>>>> +AT_CHECK([ovs-ofctl -O OpenFlow13 add-flow br0 'in_port=1 action=meter:1,output:2'])
>>>> +
>>>> +ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000400080000 actions=resubmit(,0)"
>>>> +ovs-ofctl -O OpenFlow13 packet-out br0 "in_port=1 packet=50540000000a50540000000908004500001c000000000011a4cd0a0101010a0101020001000400080000 actions=resubmit(,0)"
>>>> +
>>>> +# Check that vswitchd hasn't crashed by dumping the meter added above
>>>> +AT_CHECK([ovs-ofctl -O OpenFlow13 dump-meters br0 | ofctl_strip], [0], [dnl
>>>> +OFPST_METER_CONFIG reply (OF1.3):
>>>> +meter=1 pktps bands=
>>>> +type=drop rate=1
>>>> +])
>>>> +
>>>> +OVS_VSWITCHD_STOP
>>>> +AT_CLEANUP
>>>> +
>>>> +
>>>>  AT_SETUP([ofproto-dpif - MPLS handling])
>>>>  OVS_VSWITCHD_START([dnl
>>>>     add-port br0 p1 -- set Interface p1 type=dummy
>>>> ---
>>>
>> 



More information about the dev mailing list