[ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action.

Wed Jun 16 12:46:44 UTC 2021

> Hi Ian,
> 
> Thanks for the review. My responses are inline.
> 
> > -----Original Message-----
> > From: Stokes, Ian <ian.stokes at intel.com>
> > Sent: Tuesday 15 June 2021 18:04
> > To: Ferriter, Cian <cian.ferriter at intel.com>; ovs-dev at openvswitch.org;
> Ferriter, Cian <cian.ferriter at intel.com>; Van Haaren, Harry
> > <harry.van.haaren at intel.com>
> > Cc: i.maximets at ovn.org
> > Subject: RE: [ovs-dev] [v12 15/16] dpif-netdev: Optimize dp output action.
> >
> > > From: Harry van Haaren <harry.van.haaren at intel.com>
> > >
> > > This commit optimizes the output action, by enabling the compiler to
> > > optimize the code better through reducing code complexity.
> > >
> > > The core concept of this optimization is that the array-length checks
> > > have already been performed above the copying code, so can be removed.
> > > Removing of the per-packet length checks allows the compiler to auto-
> vectorize
> > > the stores using SIMD registers.
> > >
> > > Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> > >
> >
> > Thanks for the patch Cian/Harry.
> >
> > Comments inline.
> > > ---
> > >
> > > v8: Add NEWS entry.
> > > ---
> > >  NEWS              |  1 +
> > >  lib/dpif-netdev.c | 23 ++++++++++++++++++-----
> > >  2 files changed, 19 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/NEWS b/NEWS
> > > index d04dac746..a7ec34af1 100644
> > > --- a/NEWS
> > > +++ b/NEWS
> > > @@ -17,6 +17,7 @@ Post-v2.15.0
> > >       * Enable the AVX512 DPCLS implementation to use VPOPCNT instruction
> if
> > > the
> > >         CPU supports it. This enhances performance by using the native
> vpopcount
> > >         instructions, instead of the emulated version of vpopcount.
> > > +     * Optimize dp_netdev_output by enhancing compiler optimization
> potential.
> > >     - ovs-ctl:
> > >       * New option '--no-record-hostname' to disable hostname configuration
> > >         in ovsdb on startup.
> > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > > index 04a4f71cb..f46269ae3 100644
> > > --- a/lib/dpif-netdev.c
> > > +++ b/lib/dpif-netdev.c
> > > @@ -7290,12 +7290,25 @@ dp_execute_output_action(struct
> > > dp_netdev_pmd_thread *pmd,
> > >          pmd->n_output_batches++;
> > >      }
> > >
> > > -    struct dp_packet *packet;
> > > -    DP_PACKET_BATCH_FOR_EACH (i, packet, packets_) {
> > > -        p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
> > > -            pmd->ctx.last_rxq;
> > > -        dp_packet_batch_add(&p->output_pkts, packet);
> > > +    /* The above checks ensure that there is enough space in the output
> batch.
> > > +     * Using dp_packet_batch_add() has a branch to check if the batch is full.
> > > +     * This branch reduces the compiler's ability to optimize efficiently. The
> > > +     * below code implements packet movement between batches without
> > > checks,
> > > +     * with the required semantics of output batch perhaps containing
> packets.
> > > +     */
> >
> > What is the performance gain recorded with this approach with compiler
> optimizations?
> >
> 
> The test was run using the scalar DPIF, 1 flow and EMC on, where the
> before/after cycle cost of the patch is 243 cycles before to 230 cycles per packet
> after. Hence this optimization saves 13 cycles per packet in my measurements.

So I think that info is important and should be added to the commit message and will help inform whether to make the change.

Ilya has worked in this area extensively so would be interested to hear his opinion?

I don't think this is a blocker to the DPIF work so could possibly be submitted separately as a patch for master?

Regards
Ian
> 
> > > +    int batch_size = dp_packet_batch_size(packets_);
> > > +    int out_batch_idx = dp_packet_batch_size(&p->output_pkts);
> > > +    struct dp_netdev_rxq *rxq = pmd->ctx.last_rxq;
> > > +    struct dp_packet_batch *output_batch = &p->output_pkts;
> > > +
> > > +    for (int i = 0; i < batch_size; i++) {
> > > +        struct dp_packet *packet = packets_->packets[i];
> > > +        p->output_pkts_rxqs[out_batch_idx] = rxq;
> > > +        output_batch->packets[out_batch_idx] = packet;
> > > +        out_batch_idx++;
> > >      }
> > > +    output_batch->count += batch_size;
> >
> > So I guess the counter argument here is that the previous approach uses the
> dp_packet standard functions.
> >
> > There has been work over time to ensure that dp_packets are handled in a
> similar way using the dp_packet api.
> >
> > This I think reverses the position. That's why I'd like to get a sense of the
> performance gain as I think it's important to justify breaking
> > from code conformance.
> >
> > Has there been thought to modifying the dp_packet_batch_add ?
> >
> 
> We thought about adding an "unsafe" version of dp_packet_batch_add() which
> doesn't check whether the packet batch is full. Due to this optimization being
> dependent on other checks in the surrounding code, it isn't appropriate to have
> a generic function containing this optimization.
> 
> > Regards
> > Ian
> > > +
> > >      return true;
> > >  }
> > >
> > > --
> > > 2.31.1
> > >
> > > _______________________________________________
> > > dev mailing list
> > > dev at openvswitch.org
> > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>