[ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert for recirc packets

Jan Scheurich jan.scheurich at ericsson.com
Wed Aug 16 16:23:43 UTC 2017


Hi, 

I agree that in the event of EMC overload it is beneficial to reduce the number of EMC insertions and lookups as they just generate overhead and degrade overall throughput. At the same time we want to keep as much of the EMC acceleration as possible for a fraction of traffic that can benefit from EMC most.

For EMC insertion we have already done this earlier by introducing probabilistic EMC insertion, which greatly reduces the costly effect of EMC thrashing. But we didn't touch the lookup part. How should we select the packets (or rather packet datapath traversals) for which to perform lookup?

Several proposals are in the air: do it only for the first pass and not for recirculated packets; do it only for RSS hash values below a (dynamic) threshold; possibly others.

For EMC insertion we consciously settled on a random selection as the datapath has no a priori insight into which flows are better candidates than others and big flows that benefit most have a higher chance of getting cached.
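
To illustrate why random insertion favours big flows, here is a minimal sketch (not the dpif-netdev code; the function name and its parameters are made up for this example). It assumes every EMC miss of a flow is an independent insertion attempt that succeeds with probability 1/inv_prob, so the flow has been cached at least once after n misses with probability 1 - (1 - 1/inv_prob)^n.

```c
#include <assert.h>

/* Hypothetical helper, for illustration only.
 *
 * Returns the probability that a flow has been inserted at least once
 * after 'n_misses' independent insertion attempts, each succeeding with
 * probability 1 / 'inv_prob':  1 - (1 - 1/inv_prob)^n_misses. */
static double
insert_chance_after(unsigned int n_misses, unsigned int inv_prob)
{
    double not_inserted = 1.0;

    assert(inv_prob > 0);
    for (unsigned int i = 0; i < n_misses; i++) {
        /* The flow stays uncached only if this attempt fails as well. */
        not_inserted *= 1.0 - 1.0 / inv_prob;
    }
    return 1.0 - not_inserted;
}
```

With inv_prob = 100 (an example value), a flow seen once is cached with a chance of 1%, a flow with 100 misses with roughly 63%, and a flow with 1000 misses almost certainly.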

Is there a reason to assume that a deterministic selection on some non-random criterion like the recirculation count will on average (over deployments and applications) give better performance than a random selection?

I don't believe so. For example, the number of "EMC flows" in each pass through the datapath can differ hugely: 1 GRE tunnel flow in first pass (from phy port), 100K tenant flows after tunnel decapsulation. Or 100K tenant flows in first pass (from VM) but 1 flow after NSH encapsulation in second pass.

I believe a random selection with dynamically adapted probability is the best we can do without a priori knowledge about the traffic patterns and pipeline organization.

The RSS hash threshold method looks like the only pseudo-random criterion we can use that produces a consistent result for every packet of a flow and does not require more information. Of course elephant flows with an unlucky hash value might never get to use the EMC, but that risk we have with any stateless selection scheme.
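
A rough sketch of what such a hash threshold check could look like (both helpers are hypothetical and not part of any posted patch; the sketch also assumes the 32-bit RSS hash is roughly uniformly distributed over flows):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: decide whether a datapath traversal may use the EMC.
 * 'rss_hash' is the 32-bit flow hash of the packet, 'threshold' the current
 * cut-off.  All packets of a flow carry the same hash, so the decision is
 * the same for every packet of that flow.  A threshold of 0 admits nothing. */
static inline bool
emc_hash_admits(uint32_t rss_hash, uint32_t threshold)
{
    return rss_hash < threshold;
}

/* Hypothetical helper: map the share of flows to admit (0..100 percent)
 * to a threshold in the 32-bit hash space.  Note that even at 100% the
 * single hash value UINT32_MAX is still excluded by the '<' above. */
static inline uint32_t
emc_threshold_from_percent(unsigned int percent)
{
    assert(percent <= 100);
    return (uint32_t) (((uint64_t) UINT32_MAX * percent) / 100);
}
```

The per-packet cost is a single compare against a value that is only recomputed when the admitted share changes.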

The new thing required will be the dynamic adjustment of lookup probability to the EMC fill level and/or hit ratio. Any ideas for that? I guess we'd need a scheme that periodically increases the probability again to probe for changed traffic patterns. 
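
Just to make the question concrete, one conceivable scheme (purely a sketch; all names and constants are invented here and nothing like it has been tested) would back off multiplicatively when the hit ratio of the last interval is poor and otherwise creep up additively, which gives the periodic probing for free:

```c
#include <assert.h>
#include <stdint.h>

/* All names and constants below are hypothetical example values. */
#define EMC_ADMIT_MIN    1u   /* Never go below 1%, so probing stays possible. */
#define EMC_ADMIT_MAX  100u
#define EMC_ADMIT_STEP   5u   /* Additive probe step, in percent. */
#define EMC_HIT_TARGET  50u   /* Hit ratio regarded as good enough, in percent. */

/* Called once per measurement interval with the number of EMC lookups and
 * hits counted in that interval.  Returns the new share (in percent) of
 * traversals that may use the EMC.
 *  - Hit ratio below target: halve the share to back off from thrashing.
 *  - Otherwise (or with no lookups at all): raise the share by one step to
 *    probe whether the traffic pattern has changed. */
static unsigned int
emc_adapt_share(unsigned int share, uint64_t lookups, uint64_t hits)
{
    assert(share >= EMC_ADMIT_MIN && share <= EMC_ADMIT_MAX);
    assert(hits <= lookups);

    if (lookups > 0 && hits * 100 < lookups * EMC_HIT_TARGET) {
        share /= 2;
        if (share < EMC_ADMIT_MIN) {
            share = EMC_ADMIT_MIN;
        }
    } else {
        share += EMC_ADMIT_STEP;
        if (share > EMC_ADMIT_MAX) {
            share = EMC_ADMIT_MAX;
        }
    }
    return share;
}
```

The resulting share could then be mapped to whatever the fast path checks per packet, for example a hash threshold or an insertion probability. Whether hit ratio, fill level or a combination is the right input remains open.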

Once we have that, I think the same dynamic probability could also be used for probabilistic EMC insertion.

BR, Jan

> -----Original Message-----
> From: ovs-dev-bounces at openvswitch.org [mailto:ovs-dev-
> bounces at openvswitch.org] On Behalf Of Fischetti, Antonio
> Sent: Wednesday, 16 August, 2017 14:42
> To: Darrell Ball <dball at vmware.com>; dev at openvswitch.org
> Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert
> for recirc packets
> 
> 
> > -----Original Message-----
> > From: Darrell Ball [mailto:dball at vmware.com]
> > Sent: Wednesday, August 16, 2017 9:09 AM
> > To: Fischetti, Antonio <antonio.fischetti at intel.com>;
> dev at openvswitch.org
> > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert
> for
> > recirc packets
> >
> >
> >
> > -----Original Message-----
> > From: "Fischetti, Antonio" <antonio.fischetti at intel.com>
> > Date: Tuesday, August 15, 2017 at 6:55 AM
> > To: Darrell Ball <dball at vmware.com>, "dev at openvswitch.org"
> > <dev at openvswitch.org>
> > Subject: RE: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC lookup/insert
> for
> > recirc packets
> >
> >
> >
> >     > -----Original Message-----
> >     > From: Darrell Ball [mailto:dball at vmware.com]
> >     > Sent: Monday, August 14, 2017 7:27 AM
> >     > To: Fischetti, Antonio <antonio.fischetti at intel.com>;
> dev at openvswitch.org
> >     > Subject: Re: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC
> lookup/insert
> > for
> >     > recirc packets
> >     >
> >     >
> >     >
> >     > -----Original Message-----
> >     > From: <ovs-dev-bounces at openvswitch.org> on behalf of
> >     > "antonio.fischetti at intel.com" <antonio.fischetti at intel.com>
> >     > Date: Friday, August 11, 2017 at 8:52 AM
> >     > To: "dev at openvswitch.org" <dev at openvswitch.org>
> >     > Subject: [ovs-dev] [PATCH v3 1/4] dpif-netdev: Skip EMC
> lookup/insert for
> >     > 	recirc packets
> >     >
> >     >     When OVS is configured as a firewall, with thousands of active
> >     >     concurrent connections, the EMC gets quickly saturated and may
> >     >     come under heavy thrashing for the reason that original and
> >     >     recirculated packets keep overwriting the existing active EMC
> >     >     entries due to its limited size (8k).
> >     >
> >     >
> >     > The recirculated packet could have been modified, in which case,
> maybe we
> >     > still want to do the emc lookup/insert ?
> >
> >     [Antonio]
> >     IMPO I'd say we should still skip emc anyway, because the purpose is
> to
> >     mitigate thrashing when emc is full. So any recirculated packet should
> >     be classified at the dpcls/ofproto layers.
> >     I don't know if I'm missing something from your question?
> >
> >     We can expect that a recirc pkt that has been modified - similarly to
> all
> >     other recirculated pkts - could result in a miss when emc is full.
> >     Later we should do an emc insertion that is likely to overwrite some
> >     active entry. And recursively, this new insertion itself could be
> >     overwritten - due to the shortage of locations - even before it is hit
> >     again. This proposal is to mitigate the thrashing with the criteria of
> >     reserving emc usage to original packets only.
> >     So a limited resource like emc hopefully could be used more
> efficiently,
> >     especially when there is more than 1 recirculation.
> >     I guess that adding an exception for modified recirc pkts could also
> >     drop the throughput a bit as we should add another if statement
> >     inside emc_processing.
> >
> > [Darrell]
> > I can drop the edited packet case as my concern was really more general.
> > The concern is that recirculated packets should still be forwarded quickly
> if
> > possible
> > and using emc should help that. The first time through, emc is used for
> the
> > packet and then the second
> > time through, emc is not used, so it is slower. But, possibly the argument
> > could be made that since it is recirculated,
> > it is already slower, in which case, maybe a penalty for recirculated
> packets
> > is reasonable.
> 
> [Antonio]
> Agree. Other than that, in case of an emc congestion - eg a firewall with
> say 6,000 connections - with a lot of overwrites, the effect could be that
> a lot of lookups will fail and the new insertions are just overwriting active
> flows. This keeps a high failure for lookups and the continuous overwrites
> for insertions become an overhead. So in this case there's a penalty both
> for the original (ie the 1st time through) and for the recirculated packets.
> With this approach we are considering that with 6,000 flows we would
> need at least 12,000 entries with 1 recirculation. So one strategy to
> reduce thrashing could be to restrict emc usage to original packets only.
> The counterpart is that recirculated packets are slower, but the overall
> effect should be a benefit.
> 
> 
> > Instead of having a simple 50% black and white cutoff, maybe a penalty
> > to the insertion probability could be used?
> 
> [Antonio]
> Yes, at the beginning I was considering this solution. I then preferred
> the current one because it allows not only to skip insertions but also
> to skip lookups, especially when RSS hash must be computed in software.
> 
> The check of the threshold - as this is happening inside emc_processing -
> is done with an '&' operation so as to use as few cpu cycles as possible.
> 
> 
> >
> >
> >     >
> >     >
> >     >     This thrashing causes the EMC to be less efficient than the dpcls
> >     >     in terms of lookups and insertions.
> >     >
> >     >     This patch allows to use the EMC efficiently by allowing only
> >     >     the 'original' packets to hit EMC. All recirculated packets are
> >     >     sent to the classifier directly.
> >     >     An empirical threshold EMC_RECIRCT_NO_INSERT_THRESHOLD -
> of 50% -
> >     >     for EMC occupancy is set to trigger this logic. By doing so when
> >     >     EMC utilization exceeds EMC_RECIRCT_NO_INSERT_THRESHOLD:
> >     >      - EMC Insertions are allowed just for original packets.
> >     >        EMC insertion and look up are skipped for recirculated packets.
> >     >      - Recirculated packets are sent to the classifier.
> >     >
> >     >     This patch is based on patch
> >     >     "dpif-netdev: add EMC entry count and %full figure to
> >     >     pmd-stats-show" at:
> >     >     https://mail.openvswitch.org/pipermail/ovs-dev/2017-January/327570.html
> >     >
> >     >     CC: Jan Scheurich <jan.scheurich at ericsson.com>
> >     >     Signed-off-by: Antonio Fischetti <antonio.fischetti at intel.com>
> >     >     Signed-off-by: Bhanuprakash Bodireddy
> > <bhanuprakash.bodireddy at intel.com>
> >     >     Co-authored-by: Bhanuprakash Bodireddy
> > <bhanuprakash.bodireddy at intel.com>
> >     >     ---
> >     >     Connection Tracker testbench set up with
> >     >
> >     >      table=0, priority=1 actions=drop
> >     >      table=0, priority=10,arp actions=NORMAL
> >     >      table=0, priority=100,ct_state=-trk,ip actions=ct(table=1)
> >     >      table=1, ct_state=+new+trk,ip,in_port=1
> actions=ct(commit),output:2
> >     >      table=1, ct_state=+est+trk,ip,in_port=1 actions=output:2
> >     >      table=1, ct_state=+new+trk,ip,in_port=2 actions=drop
> >     >      table=1, ct_state=+est+trk,ip,in_port=2 actions=output:1
> >     >
> >     >     2 PMDs, 3 Tx queues.
> >     >
> >     >     I measured packet Rx rate (regardless of packet loss).
> Bidirectional
> >     >     test with 64B UDP packets.
> >     >     Each row is a test with a different number of traffic streams. The
> > traffic
> >     >     generator is set so that each stream establishes one UDP
> connection.
> >     >     Mpps columns reports the Rx rates on the 2 sides.
> >     >
> >     >     I set up the generator to loop on the dest IP addr on one side,
> >     >     and loop instead on the source IP addr on the other side.
> >     >
> >     >     For example to generate 10 different flows, I was sending to phy
> port
> > #1
> >     >     UDP, IPsrc:10.10.10.10, IPdest: 20.20.20.[20-29], PortSrc: 63,
> > PortDest: 63
> >     >
> >     >     Instead to phy port #2 (source and dest IPs are now swapped):
> >     >     UDP, IPsrc: 20.20.20.[20-29], IPdest: 10.10.10.10, PortSrc: 63,
> > PortDest:
> >     > 63
> >     >
> >     >     I saw the following performance improvement.
> >     >
> >     >     Original OvS-DPDK means at Commit ID:
> >     >       6b1babacc3ca0488e07596bf822fe356c9bab646
> >     >
> >     >               +----------------------+-----------------------+
> >     >               |  Original OvS-DPDK   |   Original OvS-DPDK   |
> >     >               |                      |    + this patch       |
> >     >      ---------+------------+---------+------------+----------+
> >     >       Traffic |     Rx     |   EMC   |     Rx     |   EMC    |
> >     >       Streams |   [Mpps]   | entries |   [Mpps]   | entries  |
> >     >      ---------+------------+---------+------------+----------+
> >     >          100  | 2.43, 2.49 |   200   | 2.55, 2.57 |   201    |
> >     >        1,000  | 2.01, 2.02 |  2007   | 2.12, 2.12 |  2006    |
> >     >        2,000  | 1.93, 1.95 |  3868   | 1.98, 1.96 |  3884    |
> >     >        3,000  | 1.87, 1.91 |  5086   | 1.97, 1.97 |  4757    |
> >     >        4,000  | 1.83, 1.82 |  6173   | 1.94, 1.93 |  5280    |
> >     >       10,000  | 1.67, 1.69 |  7826   | 1.82, 1.81 |  7090    |
> >     >       30,000  | 1.57, 1.59 |  8192   | 1.66, 1.67 |  8192    |
> >     >      ---------+------------+---------+------------+----------+
> >     >
> >     >     This test setup implies 1 recirculation on each received packet.
> >     >     We didn't check this patch in a test scenario where more than 1
> >     >     recirculation is occurring per packet.
> >     >     ---
> >     >      lib/dpif-netdev.c | 65
> >     > +++++++++++++++++++++++++++++++++++++++++++++++++++----
> >     >      1 file changed, 61 insertions(+), 4 deletions(-)
> >     >
> >     >     diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> >     >     index bea1c3f..8f6b96b 100644
> >     >     --- a/lib/dpif-netdev.c
> >     >     +++ b/lib/dpif-netdev.c
> >     >     @@ -4663,6 +4663,9 @@ dp_netdev_queue_batches(struct
> dp_packet *pkt,
> >     >          packet_batch_per_flow_update(batch, pkt, mf);
> >     >      }
> >     >
> >     >     +/* Threshold to skip EMC for recirculated packets. */
> >     >     +#define EMC_RECIRCT_NO_INSERT_THRESHOLD 0xFFFFF000
> >     >     +
> >     >      /* Try to process all ('cnt') the 'packets' using only the exact
> > match
> >     > cache
> >     >       * 'pmd->flow_cache'. If a flow is not found for a packet
> > 'packets[i]',
> >     > the
> >     >       * miniflow is copied into 'keys' and the packet pointer is moved
> at
> > the
> >     >     @@ -4714,8 +4717,36 @@ emc_processing(struct
> dp_netdev_pmd_thread
> > *pmd,
> >     >              key->len = 0; /* Not computed yet. */
> >     >              key->hash = dpif_netdev_packet_get_rss_hash(packet, &key-
> > >mf);
> >     >
> >     >     -        /* If EMC is disabled skip emc_lookup */
> >     >     -        flow = (cur_min == 0) ? NULL: emc_lookup(flow_cache, key);
> >     >     +        /*
> >     >     +         * EMC lookup is skipped when one or both of the following
> >     >     +         * two cases occurs:
> >     >     +         *
> >     >     +         *    - EMC is disabled.  This is detected from cur_min.
> >     >     +         *
> >     >     +         *    - The EMC occupancy exceeds
> > EMC_RECIRCT_NO_INSERT_THRESHOLD
> >     > and
> >     >     +         *      the packet to be classified is being recirculated.
> > When
> >     > this
> >     >     +         *      happens also EMC insertions are skipped for
> > recirculated
> >     >     +         *      packets.  So that EMC is used just to store entries
> > which
> >     >     +         *      are hit from the 'original' packets.  This way the
> > EMC
> >     >     +         *      thrashing is mitigated with a benefit on
> > performance.
> >     >     +         */
> >     >     +        if (OVS_LIKELY(cur_min)) {
> >     >     +            if (!md_is_valid) {
> >     >     +                flow = emc_lookup(flow_cache, key);
> >     >     +            } else {
> >     >     +                /* Recirculated packet. */
> >     >     +                if (flow_cache->n_entries &
> >     > EMC_RECIRCT_NO_INSERT_THRESHOLD) {
> >     >     +                    /* EMC occupancy is over the threshold.  We skip
> > EMC
> >     >     +                     * lookup for recirculated packets. */
> >     >     +                    flow = NULL;
> >     >     +                } else {
> >     >     +                    flow = emc_lookup(flow_cache, key);
> >     >     +                }
> >     >     +            }
> >     >     +        } else {
> >     >     +            flow = NULL;
> >     >     +        }
> >     >     +
> >     >              if (OVS_LIKELY(flow)) {
> >     >                  dp_netdev_queue_batches(packet, flow, &key->mf,
> batches,
> >     >                                          n_batches);
> >     >     @@ -4800,7 +4831,20 @@ handle_packet_upcall(struct
> > dp_netdev_pmd_thread
> >     > *pmd,
> >     >                                                   add_actions->size);
> >     >              }
> >     >              ovs_mutex_unlock(&pmd->flow_mutex);
> >     >     -        emc_probabilistic_insert(pmd, key, netdev_flow);
> >     >     +        /* EMC insertion can be skipped by a probabilistic criteria
> > or
> >     >     +         * - in case of recirculated packets - depending on the
> > number of
> >     >     +         * EMC entries. */
> >     >     +        if (!packet->md.recirc_id) {
> >     >     +            emc_probabilistic_insert(pmd, key, netdev_flow);
> >     >     +        } else {
> >     >     +            /* Recirculated packets.  When EMC occupancy goes over
> >     >     +             * a threshold we avoid inserting new entries. */
> >     >     +            if (!(pmd->flow_cache.n_entries &
> >     >     +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {
> >     >     +                /* Still under the threshold. */
> >     >     +                emc_probabilistic_insert(pmd, key, netdev_flow);
> >     >     +            }
> >     >     +        }
> >     >          }
> >     >      }
> >     >
> >     >     @@ -4893,7 +4937,20 @@ fast_path_processing(struct
> > dp_netdev_pmd_thread
> >     > *pmd,
> >     >
> >     >              flow = dp_netdev_flow_cast(rules[i]);
> >     >
> >     >     -        emc_probabilistic_insert(pmd, &keys[i], flow);
> >     >     +        /* EMC insertion can be skipped by a probabilistic criteria
> > or
> >     >     +         * - in case of recirculated packets - depending on the
> > number of
> >     >     +         * EMC entries. */
> >     >     +        if (!packet->md.recirc_id) {
> >     >     +            emc_probabilistic_insert(pmd, &keys[i], flow);
> >     >     +        } else {
> >     >     +            /* Recirculated packets.  When EMC occupancy goes over
> >     >     +             * a threshold we avoid inserting new entries. */
> >     >     +            if (!(pmd->flow_cache.n_entries &
> >     >     +                    EMC_RECIRCT_NO_INSERT_THRESHOLD)) {
> >     >     +                /* Still under the threshold. */
> >     >     +                emc_probabilistic_insert(pmd, &keys[i], flow);
> >     >     +            }
> >     >     +        }
> >     >              dp_netdev_queue_batches(packet, flow, &keys[i].mf,
> batches,
> >     > n_batches);
> >     >          }
> >     >
> >     >     --
> >     >     2.4.11
> >     >
> >     >     _______________________________________________
> >     >     dev mailing list
> >     >     dev at openvswitch.org
> >     >     https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >     >
> >
> >
> 
> _______________________________________________
> dev mailing list
> dev at openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

