[ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action.

Ferriter, Cian cian.ferriter at intel.com
Wed Jun 16 10:48:08 UTC 2021


Hi Ian,

Thanks for the review. My responses are inline.

> -----Original Message-----
> From: Stokes, Ian <ian.stokes at intel.com>
> Sent: Tuesday 15 June 2021 18:04
> To: Ferriter, Cian <cian.ferriter at intel.com>; ovs-dev at openvswitch.org; Ferriter, Cian <cian.ferriter at intel.com>; Van Haaren, Harry
> <harry.van.haaren at intel.com>
> Cc: i.maximets at ovn.org
> Subject: RE: [ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action.
> 
> > From: Harry van Haaren <harry.van.haaren at intel.com>
> >
> > This commit optimizes the output action, by enabling the compiler to
> > optimize the code better through reducing code complexity.
> >
> > The core concept of this optimization is that the array-length checks
> > have already been performed above the copying code, so they can be removed.
> > Removing the per-packet length checks allows the compiler to
> > auto-vectorize the stores using SIMD registers.
> >
> > Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> >
> 
> Thanks for the patch Cian/Harry.
> 
> Comments inline.
> > ---
> >
> > v8: Add NEWS entry.
> > ---
> >  NEWS              |  1 +
> >  lib/dpif-netdev.c | 23 ++++++++++++++++++-----
> >  2 files changed, 19 insertions(+), 5 deletions(-)
> >
> > diff --git a/NEWS b/NEWS
> > index d04dac746..a7ec34af1 100644
> > --- a/NEWS
> > +++ b/NEWS
> > @@ -17,6 +17,7 @@ Post-v2.15.0
> >       * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction if the
> >         CPU supports it. This enhances performance by using the native vpopcount
> >         instructions, instead of the emulated version of vpopcount.
> > +     * Optimize dp_netdev_output by enhancing compiler optimization potential.
> >     - ovs-ctl:
> >       * New option '--no-record-hostname' to disable hostname configuration
> >         in ovsdb on startup.
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index 04a4f71cb..f46269ae3 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -7290,12 +7290,25 @@ dp_execute_output_action(struct dp_netdev_pmd_thread *pmd,
> >          pmd->n_output_batches++;
> >      }
> >
> > -    struct dp_packet *packet;
> > -    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> > -        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> > -            pmd->ctx.last_rxq;
> > -        dp_packet_batch_add(&p->output_pkts, packet);
> > +    /* The above checks ensure that there is enough space in the output batch.
> > +     * Using dp_packet_batch_add() has a branch to check if the batch is full.
> > +     * This branch reduces the compiler's ability to optimize efficiently. The
> > +     * below code implements packet movement between batches without checks,
> > +     * with the required semantics of output batch perhaps containing packets.
> > +     */
> 
> What is the performance gain recorded with this approach with compiler optimizations?
> 

The test was run using the scalar DPIF with a single flow and EMC enabled. The per-packet cycle cost drops from 243 cycles before the patch to 230 cycles after, so this optimization saves 13 cycles per packet in my measurements.

> > +    int batch_size = dp_packet_batch_size(packets_);
> > +    int out_batch_idx = dp_packet_batch_size(&p->output_pkts);
> > +    struct dp_netdev_rxq *rxq = pmd->ctx.last_rxq;
> > +    struct dp_packet_batch *output_batch = &p->output_pkts;
> > +
> > +    for (int i = 0; i < batch_size; i++) {
> > +        struct dp_packet *packet = packets_->packets[i];
> > +        p->output_pkts_rxqs[out_batch_idx] = rxq;
> > +        output_batch->packets[out_batch_idx] = packet;
> > +        out_batch_idx++;
> >      }
> > +    output_batch->count += batch_size;
> 
> So I guess the counter argument here is that the previous approach uses the dp_packet standard functions.
> 
> There has been work over time to ensure that dp_packets are handled in a similar way using the dp_packet api.
> 
> This I think reverses the position. That's why I'd like to get a sense of the performance gain as I think it's important to justify breaking
> from code conformance.
> 
> Has there been thought to modifying the dp_packet_batch_add ?
> 

We considered adding an "unsafe" variant of dp_packet_batch_add() that skips the batch-full check. However, this optimization is only correct because of other checks in the surrounding code, so it isn't appropriate to expose it as a generic function.

> Regards
> Ian
> > +
> >      return true;
> >  }
> >
> > --
> > 2.31.1
> >
> > _______________________________________________
> > dev mailing list
> > dev at openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev


