[ovs-dev] [PATCH v5] tunneling: Avoid recirculation on datapath by computing the recirculate actions at translate time.

Joe Stringer joe at ovn.org
Wed May 10 19:13:36 UTC 2017


On 9 May 2017 at 04:46, Zoltán Balogh <zoltan.balogh at ericsson.com> wrote:
> Hi,
>
> Thank you for your comments. Actually, I was wrong in my previous e-mail when I stated that the packet is truncated only at the end of the pipeline, when output is applied. The packet size is set according to the max_len set by truncate when tunnel_push is applied, so the size of the packet is correct.
>
> The root cause of the problem, i.e. why the stats in the underlay bridge are wrong, is that statistics are incremented by the packet count and packet size set by upcall_xlate():
>
> static void
> upcall_xlate(struct udpif *udpif, struct upcall *upcall,
>              struct ofpbuf *odp_actions, struct flow_wildcards *wc)
> {
>     struct dpif_flow_stats stats;
>     struct xlate_in xin;
>
>     stats.n_packets = 1;
>     stats.n_bytes = dp_packet_size(upcall->packet);
>     stats.used = time_msec();
>     stats.tcp_flags = ntohs(upcall->flow->tcp_flags);
> ...
> }
>
> Since we excluded recirculation in the "Avoid recirculation" patch, there is no second upcall when the packet enters the tunnel, so the stats created by the "first" and only upcall are used for the statistics of both the integration and underlay bridges. That is not correct.
>
> I extended my old patch to solve this problem by adding two new members to struct rule_dpif and modifying the stats with their values.

Hi Zoltan, thanks again for looking into this.

I had trouble reviewing this, in part because changes that fix
multiple issues have been placed into one patch. If there are
separate issues being addressed, each should be fixed in a separate
patch, each with a clear description of the intent of the change
(with links to the patches that introduced the breakage, if
possible!). If there is refactoring to do, consider separating the
non-functional changes into their own patch. When looking at the
original patch that started this discussion, I found it difficult to
review post-merge because the refactoring was included and I couldn't
tell which changes were actually functional.

Given that this is taking a bit longer than I thought and we need to
take a holistic approach to how all of these interactions should work,
I plan to go ahead and apply the revert patch. I look forward to
seeing a fresh series proposal that will bring back the benefits of
the original patch.

More feedback below.

> Here comes the patch:
>
>
> Since tunneling and forwarding via a patch port use the clone action, the tunneled

Where does forwarding via a patch port use the clone action?

> packet will not be parsed by miniflow_extract() and the flow data will not be
> updated within the clone. So when a tunnel header is added to the packet, the MAC
> and IP data of struct flow will not be updated according to the newly created
> tunnel header.
>
> This patch stores MAC and IP data of flow before calling
> apply_nested_clone_actions() in build_tunnel_send(), then restores the data
> after apply_nested_clone_actions() has returned.
>
> Furthermore, it introduces new struct rule_dpif members, truncated_packet_size
> and tunnel_header_size, in order to correct the n_bytes statistics of OF flows.
> ---
>  ofproto/ofproto-dpif-xlate.c | 92 ++++++++++++++++++++++++++++++++++++++++++++
>  ofproto/ofproto-dpif.c       | 14 ++++++-
>  ofproto/ofproto-dpif.h       |  5 +++
>  tests/ofproto-dpif.at        | 11 +++++-
>  4 files changed, 120 insertions(+), 2 deletions(-)
>
> diff --git a/ofproto/ofproto-dpif-xlate.c b/ofproto/ofproto-dpif-xlate.c
> index b308f21..55015d7 100644
> --- a/ofproto/ofproto-dpif-xlate.c
> +++ b/ofproto/ofproto-dpif-xlate.c
> @@ -3141,6 +3141,33 @@ tnl_send_arp_request(struct xlate_ctx *ctx, const struct xport *out_dev,
>      dp_packet_uninit(&packet);
>  }
>
> +/*
> + * Copy IP data of 'flow->tunnel' into 'flow' and 'base_flow'.
> + */
> +static void
> +propagate_tunnel_data_to_flow(struct flow *flow, struct flow *base_flow)
> +{
> +    flow->nw_dst = flow->tunnel.ip_dst;
> +    flow->nw_src = flow->tunnel.ip_src;
> +    flow->ipv6_dst = flow->tunnel.ipv6_dst;
> +    flow->ipv6_src = flow->tunnel.ipv6_src;
> +
> +    flow->nw_tos = flow->tunnel.ip_tos;
> +    flow->nw_ttl = flow->tunnel.ip_ttl;
> +    flow->tp_dst = flow->tunnel.tp_dst;
> +    flow->tp_src = flow->tunnel.tp_src;
> +
> +    base_flow->nw_dst = flow->nw_dst;
> +    base_flow->nw_src = flow->nw_src;
> +    base_flow->ipv6_dst = flow->ipv6_dst;
> +    base_flow->ipv6_src = flow->ipv6_src;
> +
> +    base_flow->nw_tos = flow->nw_tos;
> +    base_flow->nw_ttl = flow->nw_ttl;
> +    base_flow->tp_dst = flow->tp_dst;
> +    base_flow->tp_src = flow->tp_src;
> +}
> +
>  static int
>  build_tunnel_send(struct xlate_ctx *ctx, const struct xport *xport,
>                    const struct flow *flow, odp_port_t tunnel_odp_port)
> @@ -3156,6 +3183,11 @@ build_tunnel_send(struct xlate_ctx *ctx, const struct xport *xport,
>      int err;
>      char buf_sip6[INET6_ADDRSTRLEN];
>      char buf_dip6[INET6_ADDRSTRLEN];
> +    /* Structures to backup Ethernet and IP data of flow and base_flow. */
> +    struct flow old_flow = ctx->xin->flow;
> +    struct flow old_base_flow = ctx->base_flow;
> +    /* Variable to backup actual tunnel header size. */
> +    uint64_t old_ths = 0;

apply_nested_clone_actions() also takes copies of the flow and
base_flow, so do we need to do it twice in this path? I suspect that
this work could benefit from some further refactoring, taking care to
keep the refactoring and the functional changes in separate patches.

>
>      err = tnl_route_lookup_flow(flow, &d_ip6, &s_ip6, &out_dev);
>      if (err) {
> @@ -3216,16 +3248,57 @@ build_tunnel_send(struct xlate_ctx *ctx, const struct xport *xport,
>      tnl_push_data.tnl_port = odp_to_u32(tunnel_odp_port);
>      tnl_push_data.out_port = odp_to_u32(out_dev->odp_port);
>
> +    /* After the tunnel header has been added, the MAC and IP data of flow
> +     * and base_flow need to be set properly, since there is no recirculation
> +     * anymore when sending the packet to the tunnel. */
> +    if (!eth_addr_is_zero(ctx->xin->flow.dl_dst)) {
> +        ctx->xin->flow.dl_dst = dmac;
> +        ctx->base_flow.dl_dst = ctx->xin->flow.dl_dst;
> +    }

What's the expectation if ctx->xin->flow.dl_dst is zero? Is that for
L3 tunneled packets?

> +
> +    ctx->xin->flow.dl_src = smac;
> +    ctx->base_flow.dl_src = ctx->xin->flow.dl_src;
> +
> +    propagate_tunnel_data_to_flow(&ctx->xin->flow, &ctx->base_flow);
> +
> +    if (!tnl_params.is_ipv6) {
> +        ctx->xin->flow.dl_type = htons(ETH_TYPE_IP);
> +    } else {
> +        ctx->xin->flow.dl_type = htons(ETH_TYPE_IPV6);
> +    }

When writing if (!condition) { ... } else { ... }, please consider
whether you can swap the branches and drop the negation. It's less
cognitive load if readers don't have to negate a negative condition
to work out which case the "else" applies to. (This otherwise looks
correct, so it's just a style request.)

> +
> +    if (tnl_push_data.tnl_type == OVS_VPORT_TYPE_GRE) {
> +        ctx->xin->flow.nw_proto = IPPROTO_GRE;
> +    } else {
> +        ctx->xin->flow.nw_proto = IPPROTO_UDP;
> +    }
> +    ctx->base_flow.nw_proto = ctx->xin->flow.nw_proto;

Should we use a switch () statement here to ensure that newer tunnel
port types are properly handled if/when they are added? What about
things like LISP or STT?

> +
>      size_t push_action_size = 0;
>      size_t clone_ofs = nl_msg_start_nested(ctx->odp_actions,
>                                             OVS_ACTION_ATTR_CLONE);
>      odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
>      push_action_size = ctx->odp_actions->size;
> +
> +    if (ctx->rule) {
> +        old_ths = ctx->rule->tunnel_header_size;
> +        ctx->rule->tunnel_header_size = tnl_push_data.header_len;
> +    }
> +
>      apply_nested_clone_actions(ctx, xport, out_dev);
>      if (ctx->odp_actions->size > push_action_size) {
>          /* Update the CLONE action only when combined */
>          nl_msg_end_nested(ctx->odp_actions, clone_ofs);
>      }

If there are no actions on the underlay bridge (i.e. drop), the else
condition here should probably reset the nl_msg nesting so that we
don't generate any actions for the datapath at all. You might also
consider adding an xlate_report() call to explain why the drop is
occurring; this should help debugging later on using ofproto/trace.

> +
> +    if (ctx->rule) {
> +        ctx->rule->tunnel_header_size = old_ths;
> +    }
> +
> +    /* Restore flow and base_flow data. */
> +    ctx->xin->flow = old_flow;
> +    ctx->base_flow = old_base_flow;
> +
>      return 0;
>  }
>
> @@ -3728,6 +3801,10 @@ xlate_table_action(struct xlate_ctx *ctx, ofp_port_t in_port, uint8_t table_id,
>          }
>
>          if (rule) {
> +            if (ctx->rule) {
> +                rule->truncated_packet_size = ctx->rule->truncated_packet_size;
> +                rule->tunnel_header_size = ctx->rule->tunnel_header_size;
> +            }
>              /* Fill in the cache entry here instead of xlate_recursively
>               * to make the reference counting more explicit.  We take a
>               * reference in the lookups above if we are going to cache the
> @@ -4602,6 +4679,8 @@ xlate_output_trunc_action(struct xlate_ctx *ctx,
>      bool support_trunc = ctx->xbridge->support.trunc;
>      struct ovs_action_trunc *trunc;
>      char name[OFP10_MAX_PORT_NAME_LEN];
> +    /* Variable to backup truncate max len. */
> +    uint64_t old_tps = 0;
>
>      switch (port) {
>      case OFPP_TABLE:
> @@ -4634,7 +4713,20 @@ xlate_output_trunc_action(struct xlate_ctx *ctx,
>                                  OVS_ACTION_ATTR_TRUNC,
>                                  sizeof *trunc);
>              trunc->max_len = max_len;
> +
> +            /* Update truncate correction. */
> +            if (ctx->rule) {
> +                old_tps = ctx->rule->truncated_packet_size;
> +                ctx->rule->truncated_packet_size = max_len;
> +            }
> +
>              xlate_output_action(ctx, port, max_len, false);
> +
> +            /* Restore truncate correction. */
> +            if (ctx->rule) {
> +                ctx->rule->truncated_packet_size = old_tps;
> +            }
> +
>              if (!support_trunc) {
>                  ctx->xout->slow |= SLOW_ACTION;
>              }
> diff --git a/ofproto/ofproto-dpif.c b/ofproto/ofproto-dpif.c
> index dc5f004..800d0f6 100644
> --- a/ofproto/ofproto-dpif.c
> +++ b/ofproto/ofproto-dpif.c
> @@ -3959,8 +3959,18 @@ rule_dpif_credit_stats__(struct rule_dpif *rule,
>      OVS_REQUIRES(rule->stats_mutex)
>  {
>      if (credit_counts) {
> +        uint64_t stats_n_bytes = 0;
> +
> +        if (rule->truncated_packet_size) {
> +            stats_n_bytes = MIN(rule->truncated_packet_size, stats->n_bytes);
> +        } else {
> +            stats_n_bytes = stats->n_bytes;
> +        }

Is this fixing a separate issue? I'm confused about whether truncated
packet stats are being correctly attributed today on master. If this
is a separate issue, I think it should be split out into its own
patch. You might also consider whether it makes more sense to modify
'stats' earlier in the path so that each rule doesn't need to
individually apply the stats adjustment. I could imagine a
store/restore of the stats plus modifying for truncation during the
xlate_output_trunc_action() processing, rather than pushing this
complexity all the way into the rule stats attribution.

> +
> +        stats_n_bytes += rule->tunnel_header_size;
> +
>          rule->stats.n_packets += stats->n_packets;
> -        rule->stats.n_bytes += stats->n_bytes;
> +        rule->stats.n_bytes += stats_n_bytes;
>      }
>      rule->stats.used = MAX(rule->stats.used, stats->used);
>  }
> @@ -4153,6 +4163,8 @@ rule_dpif_lookup_from_table(struct ofproto_dpif *ofproto,
>              entry->table.match = (rule != NULL);
>          }
>          if (rule) {
> +            rule->truncated_packet_size = 0;
> +            rule->tunnel_header_size = 0;
>              goto out;   /* Match. */
>          }
>          if (honor_table_miss) {
> diff --git a/ofproto/ofproto-dpif.h b/ofproto/ofproto-dpif.h
> index a3a6f1f..748df4c 100644
> --- a/ofproto/ofproto-dpif.h
> +++ b/ofproto/ofproto-dpif.h
> @@ -93,6 +93,11 @@ struct rule_dpif {
>       * The recirculation id and associated internal flow should
>       * be freed when the rule is freed */
>      uint32_t recirc_id;
> +
> +    /* Variables to be used to correct statistic when packet is sent to tunnel
> +     * or truncated. */
> +    uint64_t truncated_packet_size;
> +    uint64_t tunnel_header_size;
>  };
>
>  struct rule_dpif *rule_dpif_lookup_from_table(struct ofproto_dpif *,
> diff --git a/tests/ofproto-dpif.at b/tests/ofproto-dpif.at
> index e52ab32..1f6cd84 100644
> --- a/tests/ofproto-dpif.at
> +++ b/tests/ofproto-dpif.at
> @@ -6257,6 +6257,15 @@ HEADER
>         dgramSeqNo=1
>         ds=127.0.0.1>2:1000
>         fsSeqNo=1
> +       tunnel4_out_length=0
> +       tunnel4_out_protocol=47
> +       tunnel4_out_src=1.1.2.88
> +       tunnel4_out_dst=1.1.2.92
> +       tunnel4_out_src_port=0
> +       tunnel4_out_dst_port=0
> +       tunnel4_out_tcp_flags=0
> +       tunnel4_out_tos=0
> +       tunnel_out_vni=456
>         in_vlan=0
>         in_priority=0
>         out_vlan=0
> @@ -6266,7 +6275,7 @@ HEADER
>         dropEvents=0
>         in_ifindex=2011
>         in_format=0
> -       out_ifindex=2
> +       out_ifindex=1
>         out_format=2
>         hdr_prot=1
>         pkt_len=46
> --
> 2.7.4

