[ovs-dev] [ovs-dev, v4] dpif-netdev: Avoid reordering of packets in a batch with same megaflow

Ilya Maximets i.maximets at samsung.com
Tue Jul 10 08:51:23 UTC 2018


Not a full review.
One comment inline.

Best regards, Ilya Maximets.

On 10.07.2018 00:13, Vishal Deep Ajmera wrote:
> OVS reads packets in batches from a given port and packets in the
> batch are subjected to potentially 3 levels of lookups to identify
> the datapath megaflow entry (or flow) associated with the packet.
> Each megaflow entry has a dedicated buffer in which packets that match
> the flow classification criteria are collected. This buffer helps OVS
> perform batch processing for all packets associated with a given flow.
> 
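For context, the per-flow buffer is, roughly, the following structure
from lib/dpif-netdev.c (simplified sketch, field details from memory):

    struct packet_batch_per_flow {
        unsigned int byte_count;          /* Bytes accumulated so far. */
        uint16_t tcp_flags;               /* OR'ed TCP flags seen. */
        struct dp_netdev_flow *flow;      /* Flow the packets map to. */
        struct dp_packet_batch array;     /* The buffered packets. */
    };

All packets classified to the same flow are later executed together
from 'array'.
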
> Each packet in the received batch is first subjected to lookup in the
> Exact Match Cache (EMC). Each EMC entry will point to a flow. If the
> EMC lookup is successful, the packet is moved from the rx batch to the
> per-flow buffer.
> 
> Packets that did not match any EMC entry are moved to the beginning
> of the rx batch and are then subjected to a lookup in the megaflow
> cache. Packets that match a megaflow cache entry are *appended* to
> the per-flow buffer.
> 
> Packets that do not match any megaflow entry are subjected to slow-path
> processing through the upcall mechanism. This cannot change the order
> of packets since, by definition, upcall processing is only done for
> packets without a matching megaflow entry.
> 
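For context, the lookup cascade per packet is roughly the following
(illustrative pseudocode only; dpcls_lookup_one() and upcall() are
made-up single-packet stand-ins for the real batched dpcls_lookup()
and the upcall path):

    flow = emc_lookup(&pmd->flow_cache, key);     /* 1. EMC */
    if (!flow) {
        flow = dpcls_lookup_one(cls, key);        /* 2. megaflow cache */
    }
    if (!flow) {
        upcall(pmd, packet, key);                 /* 3. slow path */
    }

Stages 1 and 2 each move matching packets into the per-flow buffers,
which is where the reordering described below creeps in.
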
> The EMC entry match fields encompass all potentially significant header
> fields, typically more than specified in the associated flow's match
> criteria. Hence, multiple EMC entries can point to the same flow. Given
> that per-flow batching happens at each lookup stage, packets belonging
> to the same megaflow can get re-ordered because some packets match EMC
> entries while others do not.
> 
> The following example illustrates the issue. Consider the following
> batch of packets (labelled P1 to P8) associated with a single TCP
> connection and hence with a single flow. Let us assume that packets
> with just the ACK bit set in TCP flags were also received in a prior
> batch, so a corresponding EMC entry exists.
> 
> 1. P1 (TCP Flag: ACK)
> 2. P2 (TCP Flag: ACK)
> 3. P3 (TCP Flag: ACK)
> 4. P4 (TCP Flag: ACK, PSH)
> 5. P5 (TCP Flag: ACK)
> 6. P6 (TCP Flag: ACK)
> 7. P7 (TCP Flag: ACK)
> 8. P8 (TCP Flag: ACK)
> 
> The megaflow classification criteria do not include TCP flags while
> the EMC match criteria do. Thus, all packets other than P4 match
> the existing EMC entry and are moved to the per-flow packet batch.
> Subsequently, packet P4 is moved to the same per-flow packet batch as
> a result of the megaflow lookup, so the batch ends up holding the
> packets in the order P1, P2, P3, P5, P6, P7, P8, P4. Though the
> packets have all been correctly classified as being associated with
> the same flow, the packet order has not been preserved because of the
> per-flow batching performed during the EMC lookup stage. This packet
> re-ordering has performance implications for TCP applications.
> 
> This patch preserves the packet ordering by performing the per-flow
> batching only after both the EMC and megaflow lookups are complete.
> As an optimization, packets are still flow-batched during EMC
> processing until the first EMC miss in the batch.
> 
> A new flow map is maintained to keep the original order of the
> packets along with their flow information. After fast-path
> processing, packets from the flow map are *appended* to the per-flow
> buffer.
> 
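To make the intended behaviour concrete, here is how the P1..P8
example above would flow through the patch, if I read it correctly:

    /* P1-P3 hit the EMC before any miss occurs and are queued
     * directly into the per-flow batch (the batch_enable
     * optimization).
     * P4 misses the EMC: flow-map slot 0 is reserved for it
     * (flow == NULL) and direct batching is disabled for the rest
     * of the rx batch.
     * P5-P8 hit the EMC but now go into flow-map slots 1..4.
     * fast_path_processing() later fills slot 0 with P4 and its
     * flow.  Draining the map then appends P4, P5, P6, P7, P8
     * after P1-P3, i.e. in the received order. */
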
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera at ericsson.com>
> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep at ericsson.com>
> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep at ericsson.com>
> ---
>  lib/dpif-netdev.c | 103 +++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 87 insertions(+), 16 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 8b3556d..d4b8f99 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -208,6 +208,13 @@ struct dpcls_rule {
>      /* 'flow' must be the last field, additional space is allocated here. */
>  };
>  
> +/* Data structure to keep the packet order till fast-path processing. */
> +struct dp_packet_flow_map {
> +    struct dp_packet *packet;
> +    struct dp_netdev_flow *flow;
> +    uint16_t tcp_flags;
> +};
> +
>  static void dpcls_init(struct dpcls *);
>  static void dpcls_destroy(struct dpcls *);
>  static void dpcls_sort_subtable_vector(struct dpcls *);
> @@ -5602,6 +5609,19 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>      packet_batch_per_flow_update(batch, pkt, tcp_flags);
>  }
>  
> +static inline void
> +packet_enqueue_to_flow_map(struct dp_packet_flow_map *flow_map,
> +                           struct dp_netdev_flow *flow,
> +                           struct dp_packet *packet,
> +                           uint16_t tcp_flags,
> +                           size_t *map_cnt)
> +{
> +    struct dp_packet_flow_map *map = &flow_map[(*map_cnt)++];
> +    map->flow = flow;
> +    map->packet = packet;
> +    map->tcp_flags = tcp_flags;
> +}
> +
>  /* Try to process all ('cnt') the 'packets' using only the exact match cache
>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>   * miniflow is copied into 'keys' and the packet pointer is moved at the
> @@ -5621,6 +5641,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>                 struct dp_packet_batch *packets_,
>                 struct netdev_flow_key *keys,
>                 struct packet_batch_per_flow batches[], size_t *n_batches,
> +               struct dp_packet_flow_map *flow_map,
> +               size_t *n_flows,
> +               uint8_t *index_map,
>                 bool md_is_valid, odp_port_t port_no)
>  {
>      struct emc_cache *flow_cache = &pmd->flow_cache;
> @@ -5631,6 +5654,8 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>      uint32_t cur_min;
>      int i;
>      uint16_t tcp_flags;
> +    size_t map_cnt = 0;
> +    bool batch_enable = true;
>  
>      atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
>      pmd_perf_update_counter(&pmd->perf_stats,
> @@ -5661,11 +5686,20 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>          if ((*recirc_depth_get() == 0) &&
>              dp_packet_has_flow_mark(packet, &mark)) {
>              flow = mark_to_flow_find(pmd, mark);
> -            if (flow) {
> +            if (OVS_LIKELY(flow)) {
>                  tcp_flags = parse_tcp_flags(packet);
> -                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> -                                        n_batches);
> -                continue;
> +                if (OVS_LIKELY(batch_enable)) {
> +                    dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> +                                            n_batches);
> +                    continue;
> +                } else {
> +                    /* Flow batching should be performed only after
> +                     * fast-path processing is also completed for packets
> +                     * with an EMC miss, or else it will result in
> +                     * reordering of packets with the same datapath flows. */
> +                    packet_enqueue_to_flow_map(flow_map, flow, packet,
> +                                               tcp_flags, &map_cnt);
> +                }
>              }
>          }
>  
> @@ -5685,8 +5719,18 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>          }
>          if (OVS_LIKELY(flow)) {
>              tcp_flags = miniflow_get_tcp_flags(&key->mf);
> -            dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> -                                    n_batches);
> +            if (OVS_LIKELY(batch_enable)) {
> +                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> +                                        n_batches);
> +            } else {
> +                /* Flow batching should be performed only after fast-path
> +                 * processing is also completed for packets with an EMC
> +                 * miss, or else it will result in reordering of packets
> +                 * with the same datapath flows. */
> +                packet_enqueue_to_flow_map(flow_map, flow, packet, tcp_flags,
> +                                           &map_cnt);
> +            }
> +
>          } else {
>              /* Exact match cache missed. Group missed packets together at
>               * the beginning of the 'packets' array. */
> @@ -5695,9 +5739,16 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>               * must be returned to the caller. The next key should be extracted
>               * to 'keys[n_missed + 1]'. */
>              key = &keys[++n_missed];
> +
> +            /* Preserve the order of packets for flow batching. */
> +            index_map[packets_->count - 1] = map_cnt;

Using the "packets_->count" without accessor + inside a refill loop is
highly inaccurate/potentially dangerous. Isn't it equal to "n_missed"?
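
E.g., if I'm reading the refill logic right, the just-refilled packet
sits at index 'n_missed - 1' ('n_missed' was incremented on the line
above), so something like this (untested) should be equivalent:

    index_map[n_missed - 1] = map_cnt;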

> +            flow_map[map_cnt++].flow = NULL;
> +
> +            /* Skip batching for subsequent packets to avoid reordering. */
> +            batch_enable = false;
>          }
>      }
> -
> +    *n_flows = map_cnt;
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
>                              cnt - n_dropped - n_missed);
>  
> @@ -5784,8 +5835,8 @@ static inline void
>  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                       struct dp_packet_batch *packets_,
>                       struct netdev_flow_key *keys,
> -                     struct packet_batch_per_flow batches[],
> -                     size_t *n_batches,
> +                     struct dp_packet_flow_map *flow_map,
> +                     uint8_t *index_map,
>                       odp_port_t in_port)
>  {
>      const size_t cnt = dp_packet_batch_size(packets_);
> @@ -5864,6 +5915,8 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>  
>      DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
>          struct dp_netdev_flow *flow;
> +        /* Get the original order of this packet in the received batch. */
> +        int recv_idx = index_map[i];
>  
>          if (OVS_UNLIKELY(!rules[i])) {
>              continue;
> @@ -5872,9 +5925,13 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>          flow = dp_netdev_flow_cast(rules[i]);
>  
>          emc_probabilistic_insert(pmd, &keys[i], flow);
> -        dp_netdev_queue_batches(packet, flow,
> -                                miniflow_get_tcp_flags(&keys[i].mf),
> -                                batches, n_batches);
> +        /* Add these packets into the flow map in the same order
> +         * as received.
> +         */
> +        flow_map[recv_idx].packet = packet;
> +        flow_map[recv_idx].flow = flow;
> +        flow_map[recv_idx].tcp_flags = miniflow_get_tcp_flags(&keys[i].mf);
> +
>      }
>  
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
> @@ -5905,19 +5962,34 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
>          struct netdev_flow_key keys[PKT_ARRAY_SIZE];
>      struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
> -    size_t n_batches;
> +    struct dp_packet_flow_map flow_map[PKT_ARRAY_SIZE];
> +    uint8_t index_map[PKT_ARRAY_SIZE];
> +    size_t n_batches, n_flows = 0;
>      odp_port_t in_port;
> +    size_t i;
>  
>      n_batches = 0;
>      emc_processing(pmd, packets, keys, batches, &n_batches,
> -                            md_is_valid, port_no);
> +                   flow_map, &n_flows, index_map, md_is_valid, port_no);
>      if (!dp_packet_batch_is_empty(packets)) {
>          /* Get ingress port from first packet's metadata. */
>          in_port = packets->packets[0]->md.in_port.odp_port;
>          fast_path_processing(pmd, packets, keys,
> -                             batches, &n_batches, in_port);
> +                             flow_map, index_map, in_port);
>      }
>  
> +    /* Batch the rest of the packets which are in the flow map. */
> +    for (i = 0; i < n_flows; i++) {
> +        struct dp_packet_flow_map *map = &flow_map[i];
> +
> +        if (OVS_UNLIKELY(!map->flow)) {
> +            continue;
> +        }
> +        dp_netdev_queue_batches(map->packet, map->flow, map->tcp_flags,
> +                                batches, &n_batches);
> +    }
> +
> +
>      /* All the flow batches need to be reset before any call to
>       * packet_batch_per_flow_execute() as it could potentially trigger
>       * recirculation. When a packet matching flow ‘j’ happens to be
> @@ -5927,7 +5999,6 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>       * already its own batches[k] still waiting to be served.  So if its
>       * ‘batch’ member is not reset, the recirculated packet would be wrongly
>       * appended to batches[k] of the 1st call to dp_netdev_input__(). */
> -    size_t i;
>      for (i = 0; i < n_batches; i++) {
>          batches[i].flow->batch = NULL;
>      }
> 

