[ovs-dev] [PATCH v3] dpif-netdev: Avoid reordering of packets in a batch with same megaflow

Shahaf Shuler shahafs at mellanox.com
Mon Jul 9 07:07:58 UTC 2018


Hi Vishal,

I guess the series is not rebased on top of the HWOL series; many of the comments below are related to this.

Saturday, July 7, 2018 3:47 AM, Vishal Deep Ajmera:
> Subject: [ovs-dev] [PATCH v3] dpif-netdev: Avoid reordering of packets in a
> batch with same megaflow
> 
> OVS reads packets in batches from a given port, and each packet in the
> batch is subjected to up to three levels of lookup to identify the
> datapath megaflow entry (or flow) associated with it.
> Each megaflow entry has a dedicated buffer in which packets that match the
> flow classification criteria are collected. This buffer helps OVS perform batch
> processing for all packets associated with a given flow.
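
For readers following the code, the per-flow buffer referred to here is
struct packet_batch_per_flow in lib/dpif-netdev.c; a rough sketch of its
shape (field comments are mine):

    struct packet_batch_per_flow {
        unsigned int byte_count;       /* Total bytes of queued packets. */
        uint16_t tcp_flags;            /* OR of the queued packets' TCP flags. */
        struct dp_netdev_flow *flow;   /* The megaflow all packets matched. */
        struct dp_packet_batch array;  /* The queued packets themselves. */
    };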
> 
> Each packet in the received batch is first subjected to a lookup in the Exact
> Match Cache (EMC). Each EMC entry points to a flow. If the EMC lookup
> succeeds, the packet is moved from the rx batch to the per-flow buffer.
> 
> Packets that did not match any EMC entry are regrouped at the beginning of
> the rx batch and are then subjected to a lookup in the megaflow cache.
> Packets that match a megaflow cache entry are *appended* to the per-flow
> buffer.
> 
> Packets that do not match any megaflow entry are subjected to slow-path
> processing through the upcall mechanism. This cannot change the order of
> packets since, by definition, upcall processing is only done for packets
> without a matching megaflow entry.
> 
> The EMC entry match fields encompass all potentially significant header
> fields, typically more than those specified in the associated flow's match
> criteria.
> Hence, multiple EMC entries can point to the same flow. Given that per-flow
> batching happens at each lookup stage, packets belonging to the same
> megaflow can get re-ordered because some packets match EMC entries
> while others do not.
> 
> The following example illustrates the issue. Consider the following batch of
> packets (labelled P1 to P8) belonging to a single TCP connection and hence
> to a single flow. Assume that packets with just the ACK bit set in the TCP
> flags have also been received in a prior batch, so a corresponding EMC entry
> exists.
> 
> 1. P1 (TCP Flag: ACK)
> 2. P2 (TCP Flag: ACK)
> 3. P3 (TCP Flag: ACK)
> 4. P4 (TCP Flag: ACK, PSH)
> 5. P5 (TCP Flag: ACK)
> 6. P6 (TCP Flag: ACK)
> 7. P7 (TCP Flag: ACK)
> 8. P8 (TCP Flag: ACK)
> 
> The megaflow classification criteria do not include TCP flags, while the EMC
> match criteria do. Thus, all packets other than P4 match the existing EMC
> entry and are moved to the per-flow packet batch.
> Subsequently, packet P4 is moved to the same per-flow packet batch as a
> result of the megaflow lookup. Though the packets have all been correctly
> classified as being associated with the same flow, the packet order has not
> been preserved because of the per-flow batching performed during the EMC
> lookup stage. This packet re-ordering has performance implications for TCP
> applications.
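
To make the resulting order concrete, here is a small standalone program
(an illustration only, not OVS code) that mimics the two-stage per-flow
batching on the batch above:

    #include <stdio.h>

    int main(void)
    {
        int flow_batch[8];
        int n = 0;
        int i, p;

        /* Stage 1: EMC lookup. Every packet except P4 hits the existing
         * EMC entry and is appended to the per-flow batch immediately. */
        for (p = 1; p <= 8; p++) {
            if (p != 4) {
                flow_batch[n++] = p;
            }
        }

        /* Stage 2: megaflow (dpcls) lookup. The EMC miss, P4, matches
         * the same flow and is only appended now. */
        flow_batch[n++] = 4;

        for (i = 0; i < n; i++) {
            printf("P%d ", flow_batch[i]);
        }
        printf("\n");    /* Prints: P1 P2 P3 P5 P6 P7 P8 P4 */
        return 0;
    }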
> 
> This patch preserves the packet ordering by performing the per-flow
> batching only after both the EMC and megaflow lookups are complete. As an
> optimization, packets are still flow-batched during EMC processing until
> the first packet in the batch misses the EMC.
> 
> A new flow map is maintained to keep the original order of packets along
> with the flow information. After fast-path processing, packets from the
> flow map are *appended* to the per-flow buffer.
> 
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera at ericsson.com>
> Co-authored-by: Venkatesan Pradeep
> <venkatesan.pradeep at ericsson.com>
> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep at ericsson.com>
> ---
>  lib/dpif-netdev.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 67 insertions(+), 13 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 9390fff..23cda57 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -207,6 +207,13 @@ struct dpcls_rule {
>      /* 'flow' must be the last field, additional space is allocated here. */
>  };
> 
> +/* data structure to keep packet order till fastpath processing */
> +struct dp_packet_flow_map {
> +    struct dp_packet *packet;
> +    struct dp_netdev_flow *flow;
> +    uint16_t tcp_flags;
> +};
> +
>  static void dpcls_init(struct dpcls *);
>  static void dpcls_destroy(struct dpcls *);
>  static void dpcls_sort_subtable_vector(struct dpcls *);
> @@ -5081,10 +5088,10 @@ struct packet_batch_per_flow {
>  static inline void
>  packet_batch_per_flow_update(struct packet_batch_per_flow *batch,
>                               struct dp_packet *packet,
> -                             const struct miniflow *mf)
> +                             uint16_t tcp_flags)

This change is already done. 

>  {
>      batch->byte_count += dp_packet_size(packet);
> -    batch->tcp_flags |= miniflow_get_tcp_flags(mf);
> +    batch->tcp_flags |= tcp_flags;

Ditto

>      batch->array.packets[batch->array.count++] = packet;
>  }
> 
> @@ -5118,7 +5125,7 @@ packet_batch_per_flow_execute(struct packet_batch_per_flow *batch,
> 
>  static inline void
>  dp_netdev_queue_batches(struct dp_packet *pkt,
> -                        struct dp_netdev_flow *flow, const struct miniflow *mf,
> +                        struct dp_netdev_flow *flow, uint16_t tcp_flags,

Ditto

>                          struct packet_batch_per_flow *batches,
>                          size_t *n_batches)
>  {
> @@ -5129,7 +5136,7 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>          packet_batch_per_flow_init(batch, flow);
>      }
> 
> -    packet_batch_per_flow_update(batch, pkt, mf);
> +    packet_batch_per_flow_update(batch, pkt, tcp_flags);

Ditto 

>  }
> 
>  /* Try to process all ('cnt') the 'packets' using only the exact match cache
> @@ -5151,6 +5158,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>                 struct dp_packet_batch *packets_,
>                 struct netdev_flow_key *keys,
>                 struct packet_batch_per_flow batches[], size_t *n_batches,
> +               struct dp_packet_flow_map *flow_map,
> +               size_t *n_flows,
> +               uint8_t *index_map,
>                 bool md_is_valid, odp_port_t port_no)
>  {
>      struct emc_cache *flow_cache = &pmd->flow_cache;
> @@ -5160,6 +5170,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>      const size_t cnt = dp_packet_batch_size(packets_);
>      uint32_t cur_min;
>      int i;
> +    size_t map_cnt = 0;
> +    uint16_t tcp_flags = 0;

This variable was already defined by the HWOL series.

> +    bool batch_enable = true;
> 
>      atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
>      pmd_perf_update_counter(&pmd->perf_stats,
> @@ -5168,6 +5181,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
> 
>      DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>          struct dp_netdev_flow *flow;
> +        struct dp_packet_flow_map *map;
> 
>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>              dp_packet_delete(packet);
> @@ -5200,8 +5214,20 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>              flow = NULL;
>          }
>          if (OVS_LIKELY(flow)) {
> -            dp_netdev_queue_batches(packet, flow, &key->mf, batches,
> -                                    n_batches);
> +            tcp_flags = miniflow_get_tcp_flags(&key->mf);

This change was already done. 

> +            if (OVS_LIKELY(batch_enable)) {
> +                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> +                                        n_batches);

I think we should have the same if/else in case of a hit on the flow-mark lookup (mark_to_flow_find()).
Currently, when a packet carries a flow mark we immediately queue it into the batch, which can cause the same reordering you describe.
Perhaps this logic should be in a separate inline function.
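
Something like the sketch below is what I have in mind (hypothetical name
and signature, reusing the code from this patch); both the EMC hit path and
the mark_to_flow_find() hit path could then call it:

    static inline void
    dp_netdev_batch_or_map(struct dp_packet *packet,
                           struct dp_netdev_flow *flow, uint16_t tcp_flags,
                           bool batch_enable,
                           struct packet_batch_per_flow *batches,
                           size_t *n_batches,
                           struct dp_packet_flow_map *flow_map,
                           size_t *map_cnt)
    {
        if (OVS_LIKELY(batch_enable)) {
            /* No EMC miss seen yet in this rx batch, so batching now
             * cannot reorder packets of the same flow. */
            dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
                                    n_batches);
        } else {
            /* Defer batching to the flow map so it happens in received
             * order after fast-path processing completes. */
            struct dp_packet_flow_map *map = &flow_map[(*map_cnt)++];

            map->flow = flow;
            map->packet = packet;
            map->tcp_flags = tcp_flags;
        }
    }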

> +            } else {
> +                /* Flow batching should be performed only after fast-path
> +                 * processing is also completed for packets with emc miss
> +                 * or else it will result in reordering of packets with
> +                 * same datapath flows. */
> +                map = &flow_map[map_cnt++];
> +                map->flow = flow;
> +                map->packet = packet;
> +                map->tcp_flags = tcp_flags;
> +            }
>          } else {
>              /* Exact match cache missed. Group missed packets together at
>               * the beginning of the 'packets' array. */
> @@ -5210,9 +5236,17 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>               * must be returned to the caller. The next key should be extracted
>               * to 'keys[n_missed + 1]'. */
>              key = &keys[++n_missed];
> +
> +            /* preserve the order of packet for flow batching */
> +            index_map[packets_->count - 1] = map_cnt;
> +            flow_map[map_cnt++].flow = NULL;
> +
> +            /* skip batching for subsequent packets to avoid reordering */
> +            batch_enable = false;
>          }
>      }
> 
> +    *n_flows = map_cnt;
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
>                              cnt - n_dropped - n_missed);
> 
> @@ -5299,8 +5333,8 @@ static inline void
>  fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                       struct dp_packet_batch *packets_,
>                       struct netdev_flow_key *keys,
> -                     struct packet_batch_per_flow batches[],
> -                     size_t *n_batches,
> +                     struct dp_packet_flow_map *flow_map,
> +                     uint8_t *index_map,
>                       odp_port_t in_port)
>  {
>      const size_t cnt = dp_packet_batch_size(packets_);
> @@ -5379,6 +5413,8 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
> 
>      DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
>          struct dp_netdev_flow *flow;
> +        /* get the original order of this packet in received batch */
> +        int recv_idx = index_map[i];
> 
>          if (OVS_UNLIKELY(!rules[i])) {
>              continue;
> @@ -5387,7 +5423,12 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>          flow = dp_netdev_flow_cast(rules[i]);
> 
>          emc_probabilistic_insert(pmd, &keys[i], flow);
> -        dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
> +        /* add these packets into the flow map in the same order
> +         * as received.
> +         */
> +        flow_map[recv_idx].packet = packet;
> +        flow_map[recv_idx].flow = flow;
> +        flow_map[recv_idx].tcp_flags = miniflow_get_tcp_flags(&keys[i].mf);
>      }
> 
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
> @@ -5418,17 +5459,31 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
>          struct netdev_flow_key keys[PKT_ARRAY_SIZE];
>      struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
> -    size_t n_batches;
> +    struct dp_packet_flow_map flow_map[PKT_ARRAY_SIZE];
> +    uint8_t index_map[PKT_ARRAY_SIZE];
> +    size_t n_batches, n_flows = 0;
>      odp_port_t in_port;
> +    size_t i;
> 
>      n_batches = 0;
>      emc_processing(pmd, packets, keys, batches, &n_batches,
> -                            md_is_valid, port_no);
> +                   flow_map, &n_flows, index_map, md_is_valid, port_no);
>      if (!dp_packet_batch_is_empty(packets)) {
>          /* Get ingress port from first packet's metadata. */
>          in_port = packets->packets[0]->md.in_port.odp_port;
>          fast_path_processing(pmd, packets, keys,
> -                             batches, &n_batches, in_port);
> +                             flow_map, index_map, in_port);
> +    }
> +
> +    /* batch rest of packets which are in flow map */
> +    for (i = 0; i < n_flows; i++) {
> +        struct dp_packet_flow_map *map = &flow_map[i];
> +
> +        if (OVS_UNLIKELY(!map->flow)) {
> +            continue;
> +        }
> +        dp_netdev_queue_batches(map->packet, map->flow, map->tcp_flags,
> +                                batches, &n_batches);
>      }
> 
>      /* All the flow batches need to be reset before any call to
> @@ -5440,7 +5495,6 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>       * already its own batches[k] still waiting to be served.  So if its
>       * ‘batch’ member is not reset, the recirculated packet would be wrongly
>       * appended to batches[k] of the 1st call to dp_netdev_input__(). */
> -    size_t i;
>      for (i = 0; i < n_batches; i++) {
>          batches[i].flow->batch = NULL;
>      }
> --
> 1.9.1
> 
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

