[ovs-dev] [PATCH v2 4/4] tunneling: Avoid datapath-recirc by combining recirc actions at xlate.

Joe Stringer joe at ovn.org
Fri Jul 14 23:46:18 UTC 2017


On 4 July 2017 at 02:46, Sugesh Chandran <sugesh.chandran at intel.com> wrote:
> This patch set removes the recirculation of encapsulated tunnel packets
> if possible. It is done by computing the post tunnel actions at the time of
> translation. The combined nested action set are programmed in the datapath
> using CLONE action.
>
> Signed-off-by: Sugesh Chandran <sugesh.chandran at intel.com>
> Signed-off-by: Zoltán Balogh <zoltan.balogh at ericsson.com>
> Co-authored-by: Zoltán Balogh <zoltan.balogh at ericsson.com>
> ---

Hi Sugesh, Zoltán,

Thanks for working on this; it's subtle code, but it seems like you've
found a useful optimization here. Hopefully we can move this forward
soon.

Would you be able to put your performance numbers into the commit
message here? They could be useful to refer back to in the future.

This patch needs rebasing.

More feedback below.

>  lib/dpif-netdev.c                  |  18 +--
>  ofproto/ofproto-dpif-xlate-cache.c |  14 +-
>  ofproto/ofproto-dpif-xlate-cache.h |  12 +-
>  ofproto/ofproto-dpif-xlate.c       | 295 ++++++++++++++++++++++++++++++++++++-
>  ofproto/ofproto-dpif-xlate.h       |   1 +
>  ofproto/ofproto-dpif.c             |   3 +-
>  tests/packet-type-aware.at         |  24 +--
>  7 files changed, 324 insertions(+), 43 deletions(-)
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 4e29085..4d996c1 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -5028,24 +5028,8 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
>
>      case OVS_ACTION_ATTR_TUNNEL_PUSH:
>          if (*depth < MAX_RECIRC_DEPTH) {
> -            struct dp_packet_batch tnl_pkt;
> -            struct dp_packet_batch *orig_packets_ = packets_;
> -            int err;
> -
> -            if (!may_steal) {
> -                dp_packet_batch_clone(&tnl_pkt, packets_);
> -                packets_ = &tnl_pkt;
> -                dp_packet_batch_reset_cutlen(orig_packets_);
> -            }
> -
>              dp_packet_batch_apply_cutlen(packets_);
> -
> -            err = push_tnl_action(pmd, a, packets_);
> -            if (!err) {
> -                (*depth)++;
> -                dp_netdev_recirculate(pmd, packets_);
> -                (*depth)--;
> -            }
> +            push_tnl_action(pmd, a, packets_);

If I follow, this was the previous datapath-level code that ensured
that when packets are output to a tunnel, only the copy of the packet
that goes to the tunnel gets the cutlen applied. With the new version
of the code, we replace this by making sure that the ofproto layer
generates a clone that wraps the push_tunnel and all subsequent bridge
translation, which results in equivalent behaviour: the cutlen is only
ever applied to a clone of the packets. Is that right?

<snip>

> +/* Validate if the transalated combined actions are OK to proceed.
> + * If actions consist of TRUNC action, it is not allowed to do the
> + * tunnel_push combine as it cannot update stats correctly.
> + */
> +static bool
> +is_tunnel_actions_clone_ready(struct ofpbuf *actions)
> +{
> +    struct nlattr *tnl_actions;
> +    const struct nlattr *a;
> +    unsigned int left;
> +    size_t actions_len;
> +
> +    if (!actions) {
> +        /* No actions, no harm in doing combine. */
> +        return true;
> +    }
> +    actions_len = actions->size;
> +
> +    tnl_actions =(struct nlattr *)(actions->data);
> +    NL_ATTR_FOR_EACH_UNSAFE (a, left, tnl_actions, actions_len) {
> +        int type = nl_attr_type(a);
> +        if (type == OVS_ACTION_ATTR_TRUNC) {
> +            VLOG_DBG("Cannot do tunnel action-combine on trunc action");
> +            return false;
> +            break;
> +        }
> +    }
> +    return true;
> +}
> +
> +/*
> + * Copy the xc entry and update the references accordingly.
> + */
> +static void
> +copy_xc_entry(struct xc_entry *xc_dst, struct xc_entry *xc_src)

I think that this whole function should probably reside in
ofproto/ofproto-dpif-xlate-cache.c. However, I have an alternative
proposal detailed below that would not require us to also take
expensive refcounts due to this approach.

> +{
> +    memcpy(xc_dst, xc_src, sizeof *xc_dst);
> +    switch (xc_src->type) {
> +    case XC_RULE:
> +        ofproto_rule_ref(&xc_dst->rule->up);
> +        break;
> +    case XC_BOND:
> +        bond_ref(xc_dst->bond.bond);
> +        xc_dst->bond.flow = xmemdup(xc_src->bond.flow,
> +                            sizeof *xc_src->bond.flow);
> +        break;
> +    case XC_NETDEV:
> +        if (xc_src->dev.tx) {
> +            netdev_ref(xc_dst->dev.tx);
> +        }
> +        if (xc_src->dev.rx) {
> +            netdev_ref(xc_dst->dev.rx);
> +        }
> +        if (xc_src->dev.bfd) {
> +            bfd_ref(xc_dst->dev.bfd);
> +        }
> +        break;
> +    case XC_NETFLOW:
> +        netflow_ref(xc_dst->nf.netflow);
> +        xc_dst->nf.flow = xmemdup(xc_src->nf.flow, sizeof *xc_src->nf.flow);
> +        break;
> +    case XC_MIRROR:
> +        mbridge_ref(xc_dst->mirror.mbridge);
> +        break;
> +    case XC_LEARN:
> +    case XC_TABLE:
> +    case XC_NORMAL:
> +    case XC_FIN_TIMEOUT:
> +    case XC_GROUP:
> +    case XC_TNL_NEIGH:
> +    case XC_CONTROLLER:
> +    case XC_TUNNEL_HEADER:
> +    break;
> +    default:
> +        OVS_NOT_REACHED();
> +    }
> +}
> +
> +static bool
> +validate_and_combine_post_tnl_actions(struct xlate_ctx *ctx,
> +                                      const struct xport *xport,
> +                                      struct xport *out_dev,
> +                                      struct ovs_action_push_tnl tnl_push_data)
> +{
> +    const struct dpif_flow_stats *backup_resubmit_stats;
> +    struct xlate_cache *backup_xcache;
> +    bool nested_act_flag = false;
> +    struct flow_wildcards tmp_flow_wc;
> +    struct flow_wildcards *backup_flow_wc_ptr;
> +    bool backup_side_effects;
> +    const struct dp_packet *backup_pkt;
> +
> +    memset(&tmp_flow_wc, 0 , sizeof tmp_flow_wc);
> +    backup_flow_wc_ptr = ctx->wc;
> +    ctx->wc = &tmp_flow_wc;
> +    ctx->xin->wc = NULL;
> +    backup_resubmit_stats = ctx->xin->resubmit_stats;
> +    backup_xcache = ctx->xin->xcache;
> +    backup_side_effects = ctx->xin->allow_side_effects;
> +    backup_pkt = ctx->xin->packet;
> +
> +    size_t push_action_size = 0;
> +    size_t clone_ofs = nl_msg_start_nested(ctx->odp_actions,
> +                                           OVS_ACTION_ATTR_CLONE);
> +    odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);
> +    push_action_size = ctx->odp_actions->size;
> +
> +    ctx->xin->resubmit_stats =  NULL;
> +    ctx->xin->xcache = xlate_cache_new(); /* Use new temporary cache. */
> +    ctx->xin->allow_side_effects = false;
> +    ctx->xin->packet = NULL;

I realised that you asked about this after the previous patch, but I
didn't understand the implications of it. I believe that this packet
is how, for instance, the second bridge can forward packets to the
controller. If you reset it to NULL here, then I don't think that the
controller action is executed correctly during the second bridge
translation. (That's a test we could add).

I suspect that multicast snooping and some other pieces would also be
affected by this.

> +
> +    if (ctx->xin->xcache) {

A few lines above, this is unconditionally set. Could this check be
removed?

> +        /* Push the cache entry for the tunnel first. */
> +        struct xc_entry *entry;
> +        entry = xlate_cache_add_entry(ctx->xin->xcache, XC_TUNNEL_HEADER);
> +        entry->tunnel_hdr.hdr_size = tnl_push_data.header_len;
> +        entry->tunnel_hdr.operation = ADD;
> +    }
> +
> +    apply_nested_clone_actions(ctx, xport, out_dev);
> +    nested_act_flag = is_tunnel_actions_clone_ready(ctx->odp_actions);
> +
> +    if (nested_act_flag) {
> +        struct dpif_flow_stats tmp_resubmit_stats;
> +        struct xc_entry *entry;
> +        struct ofpbuf entries;
> +
> +        memset(&tmp_resubmit_stats, 0, sizeof tmp_resubmit_stats);

I think that tmp_resubmit_stats is only ever used in the if block
below, which initializes the entire structure anyway, so the
declaration can be moved there and this memset line can be removed.

> +         /* Similar to the stats update in revalidation, the x_cache entries
> +          * are populated by the previous translation are used to update the
> +          * stats correctly.
> +         .*/
> +        ovs_mutex_lock(&ctx->xin->resubmit_stats_mutex);

Which piece of shared memory is this mutex protecting?

> +        if (backup_resubmit_stats) {
> +            memcpy(&tmp_resubmit_stats, backup_resubmit_stats,
> +                   sizeof tmp_resubmit_stats);
> +            xlate_push_stats(ctx->xin->xcache, &tmp_resubmit_stats);
> +        }
> +        entries = ctx->xin->xcache->entries;
> +        XC_ENTRY_FOR_EACH (entry, &entries) {
> +            struct xc_entry *xc_copy_entry;
> +            if (!backup_xcache) {
> +                break;

Do we need to do this for every single XC entry?

> +            }
> +            xc_copy_entry = xlate_cache_add_entry(backup_xcache, entry->type);
> +            copy_xc_entry(xc_copy_entry, entry);

I wonder if, instead of open-coding this here, there could be a
function defined in ofproto-dpif-xlate-cache.c which appends a second
xlate_cache onto a first. A function prototype like this, perhaps?

/* Iterates through 'src', removes the xc_entry elements from it,
 * and places them inside 'dst'. */
void xlate_cache_steal_entries(struct xlate_cache *dst,
                               struct xlate_cache *src);

The current execution should own both xlate_caches so I think that it
should be safe to more or less just append the latter xlate_cache onto
the original one.
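A minimal sketch of what such a helper might look like, with
simplified stand-in types (the real xlate_cache keeps its entries in a
struct ofpbuf, so an actual implementation would splice buffers rather
than a flat array):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the OVS types in
 * ofproto/ofproto-dpif-xlate-cache.h; a flat array is used here just
 * to illustrate the ownership transfer. */
struct xc_entry {
    int type;
};

struct xlate_cache {
    struct xc_entry *entries;
    size_t n;
};

/* Moves every xc_entry from 'src' into 'dst' and leaves 'src' empty.
 * Because whole entries change owner rather than being duplicated, no
 * refcounts (rule, bond, netdev, ...) need to be taken or released. */
static void
xlate_cache_steal_entries(struct xlate_cache *dst, struct xlate_cache *src)
{
    if (src->n) {
        dst->entries = realloc(dst->entries,
                               (dst->n + src->n) * sizeof *dst->entries);
        memcpy(dst->entries + dst->n, src->entries,
               src->n * sizeof *src->entries);
        dst->n += src->n;
    }
    free(src->entries);
    src->entries = NULL;
    src->n = 0;
}
```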

> +        }
> +        ovs_mutex_unlock(&ctx->xin->resubmit_stats_mutex);
> +    }
> +    else {

Please place the else on the same line as the close bracket.

> +        /* Combine is not valid. */
> +        nl_msg_cancel_nested(ctx->odp_actions, clone_ofs);
> +        goto out;
> +    }
> +    if (ctx->odp_actions->size > push_action_size) {
> +        /* Update the CLONE action only when combined. */
> +        nl_msg_end_nested(ctx->odp_actions, clone_ofs);
> +    }
> +    else {

Same style request.

> +        /* No actions after the tunnel, no need of clone. */
> +        nl_msg_cancel_nested(ctx->odp_actions, clone_ofs);
> +        odp_put_tnl_push_action(ctx->odp_actions, &tnl_push_data);

If there are no actions after the tunnel, then do we need the tunnel
header at all?

Do we have a test case that outputs to a tunnel leading to another
bridge which drops the packet, and then continues to execute further
actions on the first bridge, to make sure this path works the way we
expect?
