[ovs-dev] [PATCH v1] Fix crash due to multiple tnl push action

Ilya Maximets i.maximets at samsung.com
Tue May 15 08:34:01 UTC 2018


Thanks for review.

Yes, you're right that currently 'netdev_push_header' always returns 0.
I think it happened while introducing 'netdev_has_tunnel_push_pop()'. But
even before that, the error code was only for the ENOTSUPP and the only
place, where 'netdev_push_header' actually could fail is 'OVS_NOT_REACHED()'
inside 'dp_packet_resize__()'. And it will abort there.

So, the only way to make a proper error handling is to allow at least
'dp_packet_resize__()' to fail and return proper error code. But this will
require, I guess, a big refactoring of the whole dp_packet module with
it's dependencies.

But, yes, I still think that things like 'OVS_NOT_REACHED()' for the
reachable code paths is a really bad thing. And we need to fix this.

About my patch:
The main idea was just to 'break' instead of 'return' in case where
'may_steal == true'. This triggers the 'dp_packet_batch_free()' placed
at the end of the function and prevents the packet leak.
There are no changes in error handling code, only refactoring to prevent
double free. Maybe I should make it more clear in a first non-RFC version.
Maybe something like this:
--------------------------------------------------------------------------
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index f86ed2a..9476a03 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -5672,12 +5672,23 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
         break;
 
     case OVS_ACTION_ATTR_TUNNEL_PUSH:
-        if (*depth < MAX_RECIRC_DEPTH) {
-            dp_packet_batch_apply_cutlen(packets_);
-            push_tnl_action(pmd, a, packets_);
-            return;
+        /*
+         * XXX: 'may_steal' concept is broken here, because we're
+         *      unconditionally changing the packets just like for other PUSH_*
+         *      actions in 'odp_execute()'. 'false' value could be ignored,
+         *      because we could reach here only after clone, but we still need
+         *      to free the packets in case 'may_steal == true'.
+         */
+        if (may_steal) {
+            /* We're requested to push tunnel header, but also we need to take
+             * the ownership of these packets. So, we may not execute the
+             * action because the caller will not use the result anyway.
+             * Just break to free the batch. */
+            break;
         }
-        break;
+        dp_packet_batch_apply_cutlen(packets_);
+        push_tnl_action(pmd, a, packets_);
+        return;
 
     case OVS_ACTION_ATTR_TUNNEL_POP:
         if (*depth < MAX_RECIRC_DEPTH) {
--------------------------------------------------------------------------

Best regards, Ilya Maximets.

On 15.05.2018 09:43, Anju Thomas wrote:
> Hi Ilya,
> I had a look at the patch that you have submitted.
> In push_tnl_action, I can see that with the patch you are trying to return error  in case  we don’t find the tunnel port or return the error that netdev_push header returns.  And we are deciding to delete the batch in the calling function dp_execute_cb in case of error.  But netdev_push_header always returns 0.  Actually the following code in push_tnl_action is dead code.
> 
>  
>     if (!err) {
>         return 0;
>     }
> 
> The actual issue is that the push_header function always return void . So any error happening during push_header is never captured or acted upon.
> 
> I think the fix also be should be to add/capture error scenarios during tnl header push as well.
> 
> Regards & Thanks
> Anju
> 
> -----Original Message-----
> From: Ilya Maximets [mailto:i.maximets at samsung.com] 
> Sent: Monday, May 14, 2018 6:28 PM
> To: ovs-dev at openvswitch.org; Anju Thomas <anju.thomas at ericsson.com>; Ben Pfaff <blp at ovn.org>
> Cc: Tiago Lam <tiago.lam at intel.com>
> Subject: Re: [ovs-dev] [PATCH v1] Fix crash due to multiple tnl push action
> 
> Hello Anju, Ben,
> 
> Looks like I fixed a leak reported here in my recent patch:
>   https://mail.openvswitch.org/pipermail/ovs-dev/2018-May/346892.html
> 
> Could you, please, take a look at it?
> 
> I actually managed to reproduce the packet leak on tunnel-push-pop unit tests with valgrind:
> 
> ==6445== 2,912 (2,208 direct, 704 indirect) bytes in 4 blocks are definitely lost in loss record 400 of 410
> ==6445==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==6445==    by 0x516314: xmalloc (util.c:120)
> ==6445==    by 0x466154: dp_packet_new (dp-packet.c:138)
> ==6445==    by 0x466154: dp_packet_new_with_headroom (dp-packet.c:148)
> ==6445==    by 0x4F6CFE: eth_from_hex (packets.c:498)
> ==6445==    by 0x48EB43: eth_from_packet (netdev-dummy.c:1450)
> ==6445==    by 0x48EB43: netdev_dummy_receive (netdev-dummy.c:1562)
> ==6445==    by 0x515980: process_command (unixctl.c:313)
> ==6445==    by 0x515980: run_connection (unixctl.c:347)
> ==6445==    by 0x515980: unixctl_server_run (unixctl.c:398)
> ==6445==    by 0x406F1E: main (ovs-vswitchd.c:120)
> 
> Above patch fixes it.
> 
> Best regards, Ilya Maximets.
> 
>> Hi Ben,
>>
>> I will work on these changes as well.
>>
>> Regards
>> Anju
>>
>> -----Original Message-----
>> From: Ben Pfaff [mailto:blp at ovn.org]
>> Sent: Friday, May 11, 2018 2:00 AM
>> To: Anju Thomas <anju.thomas at ericsson.com>
>> Cc: dev at openvswitch.org
>> Subject: Re: [ovs-dev] [PATCH v1] Fix crash due to multiple tnl push 
>> action
>>
>> On Thu, May 10, 2018 at 08:51:03AM +0000, Anju Thomas wrote:
>>> It looks like your commit message describes at least two other bugs 
>>> in OVS, though.  First, if OVS crashes when it pushes tunnel headers, 
>>> even if there's not enough headroom, that's really bad.  At worst, it 
>>> should drop the packet.  Do you know where the crash occurs?  We 
>>> should fix the problem, since it might recur in some other context.
>>>
>>> [Anju] OVS was actually crashing in the dpdk datapath. The crash is a manual assert in our case.
>>> The rootcause is that dp receives the actions after the upcall (say with >=3 tunnel pushes ) . Now as part of action processing , since it is a tunnel push action , we try to find space in the dpdk mbuf packet headroom (which Is 128 bytes). By the time we try to process the third tunnel push , there is no headroom left since each tunnel header is of 50 bytes (50 *3 > 128 bytes headroom). Hence it manually asserts .  This assert is to catch any unexpected code flow . Do you think that in this case we should still go ahead and  prevent the crash ? 
>>
>> I don't understand why it's OK to crash in this case.  Why do you think so?
>>
>>> Second, it's a little shocking to hear that an encapsulation action without a following output action causes a memory leak.  We also need to fix that.  Do you have any more details?
>>> [Anju] Now as explained above, the crash happens because we run out of headroom. But in case we have say 2 or less than 2 tunnel pushes we will have a mem leak as packet is never freed because the tnl push is the dp last action and there is no other output action or any other action like recirc that may translate to output action in the end leading  to packet buffer not being freed.
>>> Are you proposing that we have some sort of preventive fix in the dp to handle an incorrect action list from the upcall handling? 
>>
>> Yes.  It's unacceptable to leak memory because there's a packet modification without an output action.  The kernel datapath never does this, for example.
> 


More information about the dev mailing list