[ovs-dev] [PATCH] ofproto-dpif-xlate: Incorrect handling of errors in group action processing

Vishal Deep Ajmera vishal.deep.ajmera at ericsson.com
Wed Dec 6 11:57:48 UTC 2017


Hi Ben,

Thank you for reviewing the patch. Please see inline for comments.

-----Original Message-----
From: Ben Pfaff [mailto:blp at ovn.org] 
Sent: Tuesday, December 05, 2017 12:52 AM
To: Vishal Deep Ajmera <vishal.deep.ajmera at ericsson.com>
Cc: dev at openvswitch.org
Subject: Re: [ovs-dev] [PATCH] ofproto-dpif-xlate: Incorrect handling of errors in group action processing

On Mon, Dec 04, 2017 at 08:47:38AM +0000, Vishal Deep Ajmera wrote:
> As per OpenFlow v1.3 specification, when an action list contains a 
> group action a copy of the packet is passed to the group for 
> processing by the group. This means that if there is an error 
> encountered during group processing, only the copy of packet should be 
> dropped, but subsequent actions in the action list should be executed on the original packet.
> 
> Additionally, if the group type is "ALL", each action bucket of the 
> group should process a copy of the packet. If there is an error while 
> processing one bucket other buckets should still be processed.
> 
> Example 1:
> table=0,in_port=tap0 actions=output:tap1,group:10,output:tap2
> 
> Even if any error is encountered while processing the group action, 
> the packet should still be forwarded to ports tap1 and tap2.
> 
> Example 2:
> group_id=1,type=all,bucket=actions=output:tap1,bucket=actions=encap(et
> h)
> 
> Even if processing the action in the second bucket fails because the 
> packet already has an Ethernet header, the other copy of the packet 
> should still be processed by the first bucket and output to port tap1.
> 
> Currently the error handling in OVS does not comply with those rules. 
> When any group bucket execution fails the error is recorded in the 
> so-called "translation context" which is global for the processing of 
> the original packet. Once an error is recorded, OVS skips processing 
> subsequent buckets and installs a drop action in the datapath even if 
> parts of the action list were previously processed successfully.
> 
> This patch clears the error flag after any bucket of a group is executed.
> This way the error encountered in processing any bucket of the group 
> will not impact other actions of action-list or other buckets of the group.
> 
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera at ericsson.com>
> Signed-off-by: Keshav Gupta <keshav.gupta at ericsson.com>

Thank you for thinking about this issue.  Clearly there is something to be done here.

When I look at the uses of translation errors, I see now that there are a few categories:

   1. Translation can't start at all: XLATE_NO_RECIRCULATION_CONTEXT,
      XLATE_BRIDGE_NOT_FOUND, XLATE_RECIRCULATION_CONFLICT,
      XLATE_INVALID_TUNNEL_METADATA.

   2. Translation has run too long or used too much space and must end
      to bound memory or CPU time.  XLATE_RECURSION_TOO_DEEP and
      XLATE_TOO_MANY_RESUBMITS are in this category for time,
      XLATE_STACK_TOO_DEEP for space.

   3. Some step can't be executed: XLATE_TOO_MANY_MPLS_LABELS,
      XLATE_UNSUPPORTED_PACKET_TYPE.

I believe that the appropriate handling varies among category.  Category
1 is fundamentally unrecoverable.  For category 2, we must not try to recover because that would extend CPU time (perhaps exponentially) or it would cause malfunctions (because a later "pop" would miss out on some data that wasn't "push"ed).  For category 3, we could recover at an appropriate point in the way you suggest.
=======
[Vishal/Keshav] Any packet modification done in group bucket processing is local to bucket only and original packet is retained (as it was before entering the group processing) after bucket is executed. OVS takes care of this by restoring the flow back again in ctx->flow in xlate_group_bucket ().  Now if there are errors in any action in the bucket then the bucket processing will halt for the packet. Also any changes to the ctx->flow done in the bucket will not be committed in ctx->odp_actions. So, if we reset the ctx->error back to old value (which must be 0) we still will be able to process further actions in the action-list. 

Is our understanding correct ? Or are we missing some case which can potentially break because of this change ? We think that XLATE_RECURSION_TOO_DEEP and XLATE_TOO_MANY_RESUBMITS should be treated local for group processing. Ideally we should also be restoring the ctx->resubmits to old value after bucket processing is complete to make sure one bucket does not disrupt processing (exhaust number of resubmits) of other buckets or other actions of the action-list. ctx->depth parameter is already adjusted to original value after bucket processing is done in xlate_group_bucket ().

We agree that for cases where errors are global in nature and can cause data path issues should not be masked for e.g. XLATE_STACK_TOO_DEEP. Is there any more such errors we should not mask ?
=======

I think that the patch should be rewritten to make these more fine grained distinctions.  I don't think that category 1 needs special handling because this code should never see such an error, but it seems important to distinguish categories 2 and 3.

Thanks,

Ben.


More information about the dev mailing list