[ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert for recirculated packets.

O Mahony, Billy billy.o.mahony at intel.com
Fri Jun 23 17:38:47 UTC 2017


Hi Antonio,

This is a really interesting patch. Comments inline below. 

Thanks,
/Billy.

> -----Original Message-----
> From: ovs-dev-bounces at openvswitch.org [mailto:ovs-dev-
> bounces at openvswitch.org] On Behalf Of antonio.fischetti at intel.com
> Sent: Monday, June 19, 2017 11:12 AM
> To: dev at openvswitch.org
> Subject: [ovs-dev] [PATCH RFC 2/4] dpif-netdev: Skip EMC lookup/insert for
> recirculated packets.
> 
> From: Antonio Fischetti <antonio.fischetti at intel.com>
> 
> When OVS is configured as a firewall, with thousands of active concurrent
> connections, the EMC gets quicly saturated and may come under heavy
> thrashing for the reason that original and recirculated packets keep overwrite
> existing active EMC entries due to its limited size(8k).
> 
> This thrashing causes the EMC to be less efficient than the dcpls in terms of
> lookups and insertions.
> 
> This patch allows to use the EMC efficiently by allowing only the 'original'
> packets to hit EMC. All recirculated packets are sent to classifier directly.
> An empirical threshold (EMC_FULL_THRESHOLD - of 50%) for EMC occupancy
> is set to trigger this logic. By doing so when EMC utilization exceeds
> EMC_FULL_THRESHOLD.
>  - EMC Insertions are allowed just for original packets. EMC insertion
>    and look up is skipped for recirculated packets.
>  - Recirculated packets are sent to classifier.
> 
> This patch depends on the previous one in this series. It's based on patch
> "dpif-netdev: add EMC entry count and %full figure to pmd-stats-show" at:
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.html
> 
> Signed-off-by: Antonio Fischetti <antonio.fischetti at intel.com>
> Signed-off-by: Bhanuprakash Bodireddy
> <bhanuprakash.bodireddy at intel.com>
> Co-authored-by: Bhanuprakash Bodireddy
> <bhanuprakash.bodireddy at intel.com>
> ---
> In our Connection Tracker testbench set up with
> 
>  table=0, priority=1 actions=drop
>  table=0, priority=10,arp actions=NORMAL  table=0, priority=100,ct_state=-
> trk,ip actions=ct(table=1)  table=1, ct_state=+new+trk,ip,in_port=1
> actions=ct(commit),output:2  table=1, ct_state=+est+trk,ip,in_port=1
> actions=output:2  table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
> table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
> 
> we saw the following performance improvement.
> 
> Measured packet Rx rate (regardless of packet loss). Bidirectional test with
> 64B UDP packets.
> Each row is a test with a different number of traffic streams. The traffic
> generator is set so that each stream establishes one UDP connection.
> Mpps columns reports the Rx rates on the 2 sides.
> 
>  Traffic |    Orig    |     Orig      |  +changes  |   +changes
>  Streams |   [Mpps]   | [EMC entries] |   [Mpps]   | [EMC entries]
> ---------+------------+---------------+------------+---------------
>      10  |  3.4, 3.4  |      20       |  3.4, 3.4  |      20
>     100  |  2.6, 2.7  |     200       |  2.6, 2.7  |     201
>   1,000  |  2.4, 2.4  |    2009       |  2.4, 2.4  |    1994
>   2,000  |  2.2, 2.2  |    3903       |  2.2, 2.2  |    3900
>   3,000  |  2.1, 2.1  |    5473       |  2.2, 2.2  |    4798
>   4,000  |  2.0, 2.0  |    6478       |  2.2, 2.2  |    5663
>  10,000  |  1.8, 1.9  |    8070       |  2.0, 2.0  |    7347
> 100,000  |  1.7, 1.7  |    8192       |  1.8, 1.8  |    8192
> 

[[BO'M]] 
A few questions on the test:
Are all the pkts rxd being recirculated?
Are there any flows present where the pkts do not require recirculation? 
Was the rxd rss hash calculation offloaded to the NIC?
For the cases with larger numbers of flows (10K , 100K) did you investigate the results when the EMC is simply switched off? 

Say we have 3000 flows (the lowest figure at which we see a positive effect) that means 6000 flows are contending for places in the emc.  
Is the effect we see here to do with disabling recirculated packets in particular or just reducing contention on the emc in general. I know that the recirculated pkt hashes require software hashing albeit a small amount so they do make a good category of packet to drop from the emc when contention is severe.

Once there are too many entries contending  for any cache it's going to become a liability as the lookup_cost + (miss_rate * cost_of_miss) grows to be greater that the cost_of_miss. And in that case any scheme where a very cheap decision can be made to not insert a certain category of packets and to also not check the emc for that same category will reduce the cache miss rate.

But I'm not sure that is_recirculated is the best categorization on which to make that decision. Mainly because it is not controllable. In this test case 50% pkts were recirculated so by ruling these packets out of eligibility for the EMC you get a really large reduction in EMC contention. However you can never know ahead of time if any packets will be recirc'd. You may have a situation where the EMC is totally swamped with 200K flows as above but none of them are re-circ'd packets so ruling out those packets will give no benefit. 


>  lib/dpif-netdev.c | 46 ++++++++++++++++++++++++++++++++++++++++--
> ----
>  1 file changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index fd2ed52..64a3cd4
> 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -4538,6 +4538,8 @@ dp_netdev_queue_batches(struct dp_packet *pkt,
>      packet_batch_per_flow_update(batch, pkt, mf);  }
> 
> +#define EMC_FULL_THRESHOLD 0x0000F000
> +
[[BO'M]] 

Use of FULL in #def is a little misleading as it is really indicating the threshold after which recirc pkts do not get inserted to the EMC. 

EMC_RECIRCT_NO_INSERT_THRESHOLD maybe?

As this #def is used with an & operator not a > operator in the patch it needs to be clear that number has the restrictions of 
a) having exactly one bit set so is limited to power of two values. 
b) needs to not greater  than 1u << (EM_FLOW_HASH_SHIFT-1)

Would suggest that value is based directly off EM_FLOW_HASH_SHIFT

e.g 
#def EMC_RECIRCT_NO_INSERT_HALF (1u << (EM_FLOW_HASH_SHIFT-1)) 
#def EMC_RECIRCT_NO_INSERT_QUARTER (1u << (EM_FLOW_HASH_SHIFT-2))
#def EMC_RECIRCT_NO_INSERT_EIGHT (1u << (EM_FLOW_HASH_SHIFT-3))

#def EMC_RECIRCT_NO_INSERT_THRESHOLD  EMC_RECIRCT_NO_INSERT_HALF


>  /* Try to process all ('cnt') the 'packets' using only the exact match cache
>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>   * miniflow is copied into 'keys' and the packet pointer is moved at the @@ -
> 4582,6 +4584,19 @@ emc_processing(struct dp_netdev_pmd_thread *pmd,
>              pkt_metadata_prefetch_init(&packets[i+1]->md);
>          }
> 
> +        /*
> +         * EMC lookup is skipped when one or both of the following
> +         * two cases occurs:
> +         *
> +         *   - EMC is disabled.  This is detected from cur_min.
> +         *
> +         *   - The EMC occupancy exceeds EMC_FULL_THRESHOLD and the
> +         *     packet to be classified is being recirculated.  When this
> +         *     happens also EMC insertions are skipped for recirculated
> +         *     packets.  So that EMC is used just to store entries which
> +         *     are hit from the 'original' packets.  This way the EMC
> +         *     thrashing is mitigated with a benefit on performance.
> +         */
>          if (!md_is_valid) {
>              pkt_metadata_init(&packet->md, port_no);
>              miniflow_extract(packet, &key->mf); @@ -4603,11 +4618,18 @@
> emc_processing(struct dp_netdev_pmd_thread *pmd,
>          } else {
>              /* Recirculated packets. */
>              miniflow_extract(packet, &key->mf);
> -            if (OVS_LIKELY(cur_min)) {
> -                key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
> -                flow = emc_lookup(flow_cache, key);
> -            } else {
> +            if (flow_cache->n_entries & EMC_FULL_THRESHOLD) {
> +                /* EMC occupancy is over the threshold.  We skip EMC
> +                 * lookup for recirculated packets. */
>                  flow = NULL;
> +            } else {
> +                if (OVS_LIKELY(cur_min)) {
> +                    key->hash = dpif_netdev_packet_get_rss_hash(packet,
> +                                    &key->mf);
> +                    flow = emc_lookup(flow_cache, key);
> +                } else {
> +                    flow = NULL;
> +                }
>              }
>          }
>          key->len = 0; /* Not computed yet. */ @@ -4695,7 +4717,13 @@
> handle_packet_upcall(struct dp_netdev_pmd_thread *pmd,
>                                               add_actions->size);
>          }
>          ovs_mutex_unlock(&pmd->flow_mutex);
> -        emc_probabilistic_insert(pmd, key, netdev_flow);
> +        /* When EMC occupancy goes over a threshold we avoid inserting new
> +         * entries for recirculated packets. */
> +        if (!packet->md.recirc_id) {
> +            emc_probabilistic_insert(pmd, key, netdev_flow);
> +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> +            emc_probabilistic_insert(pmd, key, netdev_flow);
> +        }
>      }
>  }
> 
> @@ -4788,7 +4816,13 @@ fast_path_processing(struct
> dp_netdev_pmd_thread *pmd,
> 
>          flow = dp_netdev_flow_cast(rules[i]);
> 
> -        emc_probabilistic_insert(pmd, &keys[i], flow);
> +        /* When EMC occupancy goes over a threshold we avoid inserting new
> +         * entries for recirculated packets. */
> +        if (!packet->md.recirc_id) {
> +            emc_probabilistic_insert(pmd, &keys[i], flow);
> +        } else if (!(pmd->flow_cache.n_entries & EMC_FULL_THRESHOLD)) {
> +            emc_probabilistic_insert(pmd, &keys[i], flow);
> +        }
>          dp_netdev_queue_batches(packet, flow, &keys[i].mf, batches,
> n_batches);
>      }
> 
> --
> 2.4.11
> 
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev


More information about the dev mailing list