[ovs-dev] [PATCH] dpif-netdev: Avoid reordering of packets in a batch with same megaflow

Ian Stokes ian.stokes at intel.com
Tue Jul 3 16:32:51 UTC 2018


On 6/17/2018 3:21 AM, Vishal Deep Ajmera wrote:
> OVS reads packets in batches from a given port and packets in the
> batch are subjected to potentially 3 levels of lookups to identify
> the datapath megaflow entry (or flow) associated with the packet.
> Each megaflow entry has a dedicated buffer in which packets that match
> the flow classification criteria are collected. This buffer helps OVS
> perform batch processing for all packets associated with a given flow.
> 
> Each packet in the received batch is first subjected to lookup in the
> Exact Match Cache (EMC). Each EMC entry will point to a flow. If the
> EMC lookup is successful, the packet is moved from the rx batch to the
> per-flow buffer.
> 
> Packets that did not match any EMC entry are rearranged in the rx batch
> at the beginning and are now subjected to a lookup in the megaflow cache.
> Packets that match a megaflow cache entry are *appended* to the per-flow
> buffer.
> 
> Packets that do not match any megaflow entry are subjected to slow-path
> processing through the upcall mechanism. This cannot change the order of
> packets as by definition upcall processing is only done for packets
> without matching megaflow entry.
> 
> The EMC entry match fields encompass all potentially significant header
> fields, typically more than specified in the associated flow's match
> criteria. Hence, multiple EMC entries can point to the same flow. Given
> that per-flow batching happens at each lookup stage, packets belonging
> to the same megaflow can get re-ordered because some packets match EMC
> entries while others do not.
> 
> The following example can illustrate the issue better. Consider
> following batch of packets (labelled P1 to P8) associated with a single
> TCP connection and associated with a single flow. Let us assume that
> packets with just the ACK bit set in TCP flags have been received in a
> prior batch also and a corresponding EMC entry exists.
> 
> 1. P1 (TCP Flag: ACK)
> 2. P2 (TCP Flag: ACK)
> 3. P3 (TCP Flag: ACK)
> 4. P4 (TCP Flag: ACK, PSH)
> 5. P5 (TCP Flag: ACK)
> 6. P6 (TCP Flag: ACK)
> 7. P7 (TCP Flag: ACK)
> 8. P8 (TCP Flag: ACK)
> 
> The megaflow classification criteria does not include TCP flags while
> the EMC match criteria does. Thus, all packets other than P4 match
> the existing EMC entry and are moved to the per-flow packet batch.
> Subsequently, packet P4 is moved to the same per-flow packet batch as
> a result of the megaflow lookup. Though the packets have all been
> correctly classified as being associated with the same flow, the
> packet order has not been preserved because of the per-flow batching
> performed during the EMC lookup stage. This packet re-ordering has
> performance implications for TCP applications.
> 
> This patch preserves the packet ordering by performing the per-flow
> batching after both the EMC and megaflow lookups are complete. As an
> optimization, packets are flow-batched in emc processing till any
> packet in the batch has an EMC miss.
> 
> A new flow map is maintained to keep the original order of packet
> along with flow information. Post fastpath processing, packets from
> flow map are *appended* to per-flow buffer.
> 
> Signed-off-by: Vishal Deep Ajmera <vishal.deep.ajmera at ericsson.com>
> Co-authored-by: Venkatesan Pradeep <venkatesan.pradeep at ericsson.com>
> Signed-off-by: Venkatesan Pradeep <venkatesan.pradeep at ericsson.com>
> ---
>   lib/dpif-netdev.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 66 insertions(+), 13 deletions(-)
> 

Hi Vishal, thanks for the patch, I've some minor comments below. I'm 
still testing but so far it seems good.

Unless there are objections (or any new issues come to light) I will put 
this as part of the pull request this week and back port to the 
appropriate branches.

Thanks
Ian

> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 9390fff..2c423ed 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -207,6 +207,13 @@ struct dpcls_rule {
>       /* 'flow' must be the last field, additional space is allocated here. */
>   };
>   
> +/* data structure to keep packet order till fastpath processing */
> +struct dp_packet_flow_map {
> +    struct dp_packet *packet;
> +    struct dp_netdev_flow *flow;
> +    uint16_t tcp_flags;
> +};
> +
>   static void dpcls_init(struct dpcls *);
>   static void dpcls_destroy(struct dpcls *);
>   static void dpcls_sort_subtable_vector(struct dpcls *);
> @@ -5081,10 +5088,10 @@ struct packet_batch_per_flow {
>   static inline void
>   packet_batch_per_flow_update(struct packet_batch_per_flow *batch,
>                                struct dp_packet *packet,
> -                             const struct miniflow *mf)
> +                             uint16_t tcp_flags)
>   {
>       batch->byte_count += dp_packet_size(packet);
> -    batch->tcp_flags |= miniflow_get_tcp_flags(mf);
> +    batch->tcp_flags |= tcp_flags;
>       batch->array.packets[batch->array.count++] = packet;
>   }
>   
> @@ -5118,7 +5125,7 @@ packet_batch_per_flow_execute(struct packet_batch_per_flow *batch,
>   
>   static inline void
>   dp_netdev_queue_batches(struct dp_packet *pkt,
> -                        struct dp_netdev_flow *flow, const struct miniflow *mf,
> +                        struct dp_netdev_flow *flow, uint16_t tcp_flags,
>                           struct packet_batch_per_flow *batches,
>                           size_t *n_batches)
>   {
> @@ -5129,7 +5136,7 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>           packet_batch_per_flow_init(batch, flow);
>       }
>   
> -    packet_batch_per_flow_update(batch, pkt, mf);
> +    packet_batch_per_flow_update(batch, pkt, tcp_flags);
>   }
>   
>   /* Try to process all ('cnt') the 'packets' using only the exact match cache
> @@ -5151,6 +5158,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>                  struct dp_packet_batch *packets_,
>                  struct netdev_flow_key *keys,
>                  struct packet_batch_per_flow batches[], size_t *n_batches,
> +               struct dp_packet_flow_map *flow_map,
> +               size_t *n_flows,
> +               uint8_t *index_map,
>                  bool md_is_valid, odp_port_t port_no)
>   {
>       struct emc_cache *flow_cache = &pmd->flow_cache;
> @@ -5160,6 +5170,9 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>       const size_t cnt = dp_packet_batch_size(packets_);
>       uint32_t cur_min;
>       int i;
> +    size_t map_cnt = 0;
> +    uint16_t tcp_flags;
> +    bool batch_enable = true;
>   
>       atomic_read_relaxed(&pmd->dp->emc_insert_min, &cur_min);
>       pmd_perf_update_counter(&pmd->perf_stats,
> @@ -5168,6 +5181,7 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>   
>       DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>           struct dp_netdev_flow *flow;
> +        struct dp_packet_flow_map *map;
>   
>           if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>               dp_packet_delete(packet);
> @@ -5200,8 +5214,20 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>               flow = NULL;
>           }
>           if (OVS_LIKELY(flow)) {
> -            dp_netdev_queue_batches(packet, flow, &key->mf, batches,
> -                                    n_batches);
> +            if (OVS_LIKELY(batch_enable)) {
> +                tcp_flags = miniflow_get_tcp_flags(&key->mf);
> +                dp_netdev_queue_batches(packet, flow, tcp_flags, batches,
> +                                        n_batches);
> +            } else {
> +                /* Flow batching should be performed only after fast-path processing
> +                 * is also completed for packets with emc miss or else it will
> +                 * result in reordering of packets with same datapath flows.
> +                 */

You need to reformat the comment above as per ovs style guide as 
currently it exceeds 79 characters in width.

> +                map = &flow_map[map_cnt++];
> +                map->flow = flow;
> +                map->packet = packet;
> +                map->tcp_flags = miniflow_get_tcp_flags(&key->mf);
> +            }
>           } else {
>               /* Exact match cache missed. Group missed packets together at
>                * the beginning of the 'packets' array. */
> @@ -5210,9 +5236,17 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>                * must be returned to the caller. The next key should be extracted
>                * to 'keys[n_missed + 1]'. */
>               key = &keys[++n_missed];
> +
> +            /* preserve the order of packet for flow batching */
> +            index_map[packets_->count - 1] = map_cnt;
> +            flow_map[map_cnt++].flow = NULL;
> +
> +            /* skip batching for subsequent packets to avoid reordering */
> +            batch_enable = false;
>           }
>       }
>   
> +    *n_flows = map_cnt;
>       pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
>                               cnt - n_dropped - n_missed);
>   
> @@ -5299,8 +5333,8 @@ static inline void
>   fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>                        struct dp_packet_batch *packets_,
>                        struct netdev_flow_key *keys,
> -                     struct packet_batch_per_flow batches[],
> -                     size_t *n_batches,
> +                     struct dp_packet_flow_map *flow_map,
> +                     uint8_t *index_map,
>                        odp_port_t in_port)
>   {
>       const size_t cnt = dp_packet_batch_size(packets_);
> @@ -5379,6 +5413,8 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>   
>       DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
>           struct dp_netdev_flow *flow;
> +        /* get the original order of this packet in received batch */
> +        int recv_idx = index_map[i];
>   
>           if (OVS_UNLIKELY(!rules[i])) {
>               continue;
> @@ -5387,7 +5423,12 @@ fast_path_processing(struct dp_netdev_pmd_thread *pmd,
>           flow = dp_netdev_flow_cast(rules[i]);
>   
>           emc_probabilistic_insert(pmd, &keys[i], flow);
> -        dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches, n_batches);
> +        /* add these packets into the flow map in the same order
> +         * as received.
> +         */
> +        flow_map[recv_idx].packet = packet;
> +        flow_map[recv_idx].flow = flow;
> +        flow_map[recv_idx].tcp_flags = miniflow_get_tcp_flags(&keys[i].mf);
>       }
>   
>       pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_MASKED_HIT,
> @@ -5418,17 +5459,30 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>       OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
>           struct netdev_flow_key keys[PKT_ARRAY_SIZE];
>       struct packet_batch_per_flow batches[PKT_ARRAY_SIZE];
> -    size_t n_batches;
> +    struct dp_packet_flow_map flow_map[PKT_ARRAY_SIZE];
> +    uint8_t index_map[PKT_ARRAY_SIZE];
> +    size_t n_batches, n_flows = 0;
>       odp_port_t in_port;
>   
>       n_batches = 0;
>       emc_processing(pmd, packets, keys, batches, &n_batches,
> -                            md_is_valid, port_no);
> +                   flow_map, &n_flows, index_map, md_is_valid, port_no);
>       if (!dp_packet_batch_is_empty(packets)) {
>           /* Get ingress port from first packet's metadata. */
>           in_port = packets->packets[0]->md.in_port.odp_port;
>           fast_path_processing(pmd, packets, keys,
> -                             batches, &n_batches, in_port);
> +                             flow_map, index_map, in_port);
> +    }
> +
> +    size_t i;

I would prefer to see 'size_t i' declared at the beginning of the 
function to keep with the existing style.

> +    /* batch rest of packets which are in flow map */
> +    for (i = 0;i < n_flows; i++) {

OVS style guide, missing space between ; and i above.

> +        struct dp_packet_flow_map *map = &flow_map[i];
> +
> +        if (OVS_UNLIKELY(!map->flow)) {
> +            continue;
> +        }
> +        dp_netdev_queue_batches(map->packet, map->flow, map->tcp_flags, batches, &n_batches);

Above exceeds 79 character width, need to split arguments to new line 
from batches.

>       }
>   
>       /* All the flow batches need to be reset before any call to
> @@ -5440,7 +5494,6 @@ dp_netdev_input__(struct dp_netdev_pmd_thread *pmd,
>        * already its own batches[k] still waiting to be served.  So if its
>        * ‘batch’ member is not reset, the recirculated packet would be wrongly
>        * appended to batches[k] of the 1st call to dp_netdev_input__(). */
> -    size_t i;
>       for (i = 0; i < n_batches; i++) {
>           batches[i].flow->batch = NULL;
>       }
> 
> 
> 
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> 



More information about the dev mailing list