[ovs-dev] [v12 11/16] dpif-netdev/dpcls-avx512: Enable 16 block processing.

Stokes, Ian ian.stokes at intel.com
Wed Jun 16 12:21:17 UTC 2021


> > -----Original Message-----
> > From: Stokes, Ian <ian.stokes at intel.com>
> > Sent: Wednesday, June 9, 2021 1:03 PM
> > To: Ferriter, Cian <cian.ferriter at intel.com>; ovs-dev at openvswitch.org; Van
> > Haaren, Harry <harry.van.haaren at intel.com>
> > Cc: i.maximets at ovn.org
> > Subject: RE: [ovs-dev] [v12 11/16] dpif-netdev/dpcls-avx512: Enable 16 block
> > processing.
> <snip commit message body>
> > > The code has been run with unit tests under autovalidation and
> > > passes all cases, and unit test coverage has been checked to
> > > ensure the 16 block code paths are executing.
> > >
> >
> > Hi Harry/Cian, thanks for the patch.
> 
> Thanks for review!
> 
> 
> > I looked through it the past few days. There is a lot here in terms of
> > masking/shifting and taking advantage of the intrinsics to accomplish that.
> > The logic seemed OK to my mind, but I didn't run something like GDB to check
> > it line by line.
> >
> > I think key to this is what you've called out above: although this is the
> > specific AVX512 implementation, the unit tests match what is expected from
> > the scalar implementation.
> >
> > Although the implementations differ, the end results are the same. The
> > performance of the two methods will differ for sure, but that's to be
> > expected.
> >
> > With that in mind, and given that this has been tested for months with
> > different traffic patterns, I have confidence in this patch.
> 
> All the above is accurate IMO too, thanks.
> 
> > A few more comments inline below.
> 
> Addressed inline, with <snip>s to reduce context to a minimum.
> 
> 
> > > Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> > >
> <snip code without comments>
> 
> > >
> > >  VLOG_DEFINE_THIS_MODULE(dpif_lookup_avx512_gather);
> > > @@ -69,22 +83,83 @@ netdev_rule_matches_key(const struct dpcls_rule *rule,
> > >  {
> > >      const uint64_t *keyp = miniflow_get_values(&rule->flow.mf);
> > >      const uint64_t *maskp = miniflow_get_values(&rule->mask->mf);
> > > -    const uint32_t lane_mask = (1 << mf_bits_total) - 1;
> > > +    const uint32_t lane_mask = (1ULL << mf_bits_total) - 1;
> > >
> > >      /* Always load a full cache line from blocks_cache. Other loads must be
> > >       * trimmed to the amount of data required for mf_bits_total blocks.
> > >       */
> > > -    __m512i v_blocks = _mm512_loadu_si512(&block_cache[0]);
> > > -    __m512i v_mask   = _mm512_maskz_loadu_epi64(lane_mask, &maskp[0]);
> > > -    __m512i v_key    = _mm512_maskz_loadu_epi64(lane_mask, &keyp[0]);
> > > +    uint32_t res_mask;
> > > +
> > > +    {
> >
> > Just curious as to the need for the brace above and the brace below on this
> > code block? Is this a specific requirement for operations with _mm512 calls?
> 
> This is just using the { and } to create a new scope for variables.
> As you can see, the two workloads are near identical, and the variable
> names are actually the same. By using { and }, the variables in the inner
> scope are distinct from (and hide) any same-named variables in the outer
> scope.
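> 
> As a minimal standalone illustration of the C scoping rule relied on here
> (plain C, not the patch code):
> 
>     int x = 1;          /* outer-scope x */
>     {
>         int x = 2;      /* inner-scope x, hides the outer one */
>         /* x is 2 here */
>     }
>     /* x is 1 again here; the inner x no longer exists */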

That was my suspicion afterwards, having thought about it.

> 
> Continuing this comment under the code/patch snippet here for context;
> 
> > > +        __m512i v_blocks = _mm512_loadu_si512(&block_cache[0]);
> > > +        __m512i v_mask   = _mm512_maskz_loadu_epi64(lane_mask, &maskp[0]);
> > > +        __m512i v_key    = _mm512_maskz_loadu_epi64(lane_mask, &keyp[0]);
> > > +        __m512i v_data = _mm512_and_si512(v_blocks, v_mask);
> > > +        res_mask = _mm512_mask_cmpeq_epi64_mask(lane_mask, v_data, v_key);
> > > +    }
> > >
> > > -    __m512i v_data = _mm512_and_si512(v_blocks, v_mask);
> > > -    uint32_t res_mask = _mm512_mask_cmpeq_epi64_mask(lane_mask, v_data, v_key);
> > > +    if (mf_bits_total > 8) {
> > > +        uint32_t lane_mask_gt8 = lane_mask >> 8;
> > > +        __m512i v_blocks = _mm512_loadu_si512(&block_cache[8]);
> > > +        __m512i v_mask   = _mm512_maskz_loadu_epi64(lane_mask_gt8, &maskp[8]);
> > > +        __m512i v_key    = _mm512_maskz_loadu_epi64(lane_mask_gt8, &keyp[8]);
> > > +        __m512i v_data = _mm512_and_si512(v_blocks, v_mask);
> > > +        uint32_t c = _mm512_mask_cmpeq_epi64_mask(lane_mask_gt8, v_data,
> > > +                                                  v_key);
> > > +        res_mask |= (c << 8);
> > > +    }
> 
> Notice there are patch lines removed with - and added with + here...
> it's a bit hard to read! But the part being added starts with the loading
> of "block_cache[8]", and the processing that follows is very similar
> to that of the first 8 blocks.
> 
> We don't want a loop here, as this code gets inlined and special-cased
> based on the "mf_bits_total", so using { and } seemed an elegant solution
> to having similar-but-slightly-different code here :)
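> 
> As a rough, hypothetical sketch of that specialization idea (not the actual
> OVS wrappers): an always-inline implementation takes the block count as a
> parameter, and thin wrappers pass compile-time constants, so the compiler
> folds the "n_blocks > 8" branch away entirely in the small variants.
> 
>     #include <stdint.h>
> 
>     /* OVS marks the real impl ALWAYS_INLINE; plain "static inline" here. */
>     static inline uint64_t
>     sum_blocks_impl(const uint64_t *blocks, const uint32_t n_blocks)
>     {
>         uint64_t sum = 0;
>         uint32_t first = n_blocks > 8 ? 8 : n_blocks;
> 
>         for (uint32_t i = 0; i < first; i++) {
>             sum += blocks[i];
>         }
>         if (n_blocks > 8) {     /* dead code when the caller passes <= 8 */
>             for (uint32_t i = 8; i < n_blocks; i++) {
>                 sum += blocks[i];
>             }
>         }
>         return sum;
>     }
> 
>     /* One specialized entry point per constant block count. */
>     static uint64_t sum_blocks_5(const uint64_t *b)  { return sum_blocks_impl(b, 5); }
>     static uint64_t sum_blocks_12(const uint64_t *b) { return sum_blocks_impl(b, 12); }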
> 
> We can change it to the following "double-named variables" style if preferred?
> __m512i v_blocks_00_07 = _mm512_loadu_si512(&block_cache[0]);
> __m512i v_blocks_08_15 = _mm512_loadu_si512(&block_cache[8]);
> 
> I prefer the { scope } style, so will leave it as is unless told to change it.

Ya, this is hard to say; there's no guide in OVS for this type of case.

For readability I would prefer the double-named variables, but if you think that
doesn't look as nice, then maybe add a comment on the current format to flag the
scope of the variables, as they have similar/same names.

Either of the above is fine with me.
> 
> > >
> > > -    /* returns 1 assuming result of SIMD compare is all blocks. */
> > > +    /* returns 1 assuming result of SIMD compare is all blocks matching. */
> > >      return res_mask == lane_mask;
> > >  }
> > >
> > > +/* Takes u0 and u1 inputs, and gathers the next 8 blocks to be stored
> > > + * contiguously into the blocks cache. Note that the pointers and bitmasks
> > > + * passed into this function must be incremented for handling next 8 blocks.
> > > + */
> > > +static inline ALWAYS_INLINE __m512i
> > > +avx512_blocks_gather(__m512i v_u0, /* reg of u64 of all u0 bits */
> > > +                     __m512i v_u1, /* reg of u64 of all u1 bits */
> > > +                     const uint64_t *pkt_blocks, /* ptr pkt blocks to load */
> > > +                     const void *tbl_blocks,     /* ptr to blocks in tbl */
> > > +                     const void *tbl_mf_masks,   /* ptr to subtable mf masks */
> > > +                     __mmask64 u1_bcast_msk,      /* mask of u1 lanes */
> > > +                     const uint64_t pkt_mf_u0_pop, /* num bits in u0 of pkt */
> > > +                     __mmask64 zero_mask, /* maskz if pkt not have mf bit */
> > Typo in comment above, maskz -> masks.
> 
> Ack, will fix.
> 
> 
> > > +                     __mmask64 u64_lanes_mask) /* total lane count to use */
> >
> > Not sure on the format of the comment above; would it be better to explain
> > each parameter in the comment preceding the function? As regards OVS
> > standards, I have not seen an example where arguments and comments are
> > mixed like this.
> 
> Ack, can refactor if preferred, will do.
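> 
> For reference, the refactored "preceding comment" style might look roughly
> like the below (wording illustrative, taken from the existing per-argument
> comments):
> 
>     /* Takes u0 and u1 inputs, and gathers the next 8 blocks to be stored
>      * contiguously into the blocks cache.
>      *   v_u0, v_u1:     registers of u64s of all u0/u1 bits.
>      *   pkt_blocks:     pointer to packet blocks to load.
>      *   tbl_blocks:     pointer to blocks in the subtable.
>      *   tbl_mf_masks:   pointer to subtable miniflow masks.
>      *   u1_bcast_msk:   mask of u1 lanes.
>      *   pkt_mf_u0_pop:  number of bits in u0 of the packet.
>      *   zero_mask:      zeroes lanes where the packet has no miniflow bit.
>      *   u64_lanes_mask: total lane count to use.
>      */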

Thanks

> 
> >
> > > +{
> > > +        /* Suggest to compiler to load tbl blocks ahead of gather(). */
> > > +        __m512i v_tbl_blocks = _mm512_maskz_loadu_epi64(u64_lanes_mask,
> > > +                                                        tbl_blocks);
> > > +
> > > +        /* Blend u0 and u1 bits together for these 8 blocks. */
> > > +        __m512i v_pkt_bits = _mm512_mask_blend_epi64(u1_bcast_msk, v_u0, v_u1);
> > > +
> > > +        /* Load pre-created tbl miniflow bitmasks, bitwise AND with them. */
> > > +        __m512i v_tbl_masks = _mm512_maskz_loadu_epi64(u64_lanes_mask,
> > > +                                                      tbl_mf_masks);
> > > +        __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_tbl_masks);
> > > +
> > > +        /* Manual AVX512 popcount for u64 lanes. */
> > > +        __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
> >
> > Can I ask what's specific about the manual count? I could find intrinsic
> > guides for the other functions above but not the manual count here. Is
> > there a performance hit associated with a manual count?
> 
> The _manual() is a separate function, which _emulates_ the real popcnt_epi64
> functionality. As only Icelake (3rd gen Xeon) has a native instruction, older
> processors that support AVX512 can emulate this functionality using "normal"
> AVX512 integer instructions.
> 
> Yes, it has a performance cost: instead of a single instruction, about 8 or 10
> AVX512 instructions are required to bit-shift, bit-AND, and translate the
> results into the required popcount output.
> 
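> For reference, one way such an emulation can be written with plain AVX512F/DQ
> intrinsics is the classic SWAR popcount applied per u64 lane (this is only a
> sketch of the idea, not the exact sequence in the _mm512_popcnt_epi64_manual()
> helper):
> 
>     #include <immintrin.h>
> 
>     static inline __m512i
>     popcnt_epi64_swar(__m512i v)
>     {
>         const __m512i m1  = _mm512_set1_epi64(0x5555555555555555ULL);
>         const __m512i m2  = _mm512_set1_epi64(0x3333333333333333ULL);
>         const __m512i m4  = _mm512_set1_epi64(0x0F0F0F0F0F0F0F0FULL);
>         const __m512i h01 = _mm512_set1_epi64(0x0101010101010101ULL);
> 
>         /* Pairwise bit sums, then nibble sums, within each 64-bit lane. */
>         v = _mm512_sub_epi64(v, _mm512_and_si512(_mm512_srli_epi64(v, 1), m1));
>         v = _mm512_add_epi64(_mm512_and_si512(v, m2),
>                              _mm512_and_si512(_mm512_srli_epi64(v, 2), m2));
>         v = _mm512_and_si512(_mm512_add_epi64(v, _mm512_srli_epi64(v, 4)), m4);
> 
>         /* Sum the byte counts into the top byte of each lane, shift down.
>          * Note: _mm512_mullo_epi64() needs AVX512DQ. */
>         return _mm512_srli_epi64(_mm512_mullo_epi64(v, h01), 56);
>     }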

Thanks for the explanation. I spotted in a later patch where it can be replaced by the native vpopcnt, so it made more sense when I saw it there.

> 
> > > +        /* Add popcounts and offset for u1 bits. */
> > > +        __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_msk,
> > > +                                                          pkt_mf_u0_pop);
> >
> > Above, is the assumption that the popcount for u0 will be the same as u1?
> > Is there a case where they would not be?
> 
> No, there is no assumption here about u0 and u1 being the same.
> This code puts the "u0_popcount" from the packet into _parts_ of
> the v_idx_u0_offset register. This is later added to each miniflow block
> offset, allowing a single "gather" instruction to load blocks from both
> u0 and u1 at the same time.
> 
> Think of it this way: the mf bits[0] and bits[1] are indexing into a single
> flat array. If popcount(bits[0]) is 5, then any block selected via bits[1]
> must be accessed at array[u0_pop + offset_within_bits[1]].
> 
> By broadcasting (set1 intrinsic) the value of u0_pop to the register, we
> can offset the "lanes" in the register that are using bits[1] by u0_pop.
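> 
> A scalar view of the same indexing, with hypothetical names (pkt_blocks,
> pkt_bits_u0/u1 and b are illustrative, not the patch's variables;
> __builtin_popcountll is the GCC/Clang builtin):
> 
>     #include <stdint.h>
> 
>     /* All u0 blocks come first in the packet's flat block array, then the
>      * u1 blocks. Index of the block selected by bit 'b' of the u1 bitmask: */
>     static inline uint64_t
>     get_block_u1(const uint64_t *pkt_blocks, uint64_t pkt_bits_u0,
>                  uint64_t pkt_bits_u1, unsigned int b)
>     {
>         uint32_t u0_pop = __builtin_popcountll(pkt_bits_u0);
>         uint32_t below  = __builtin_popcountll(pkt_bits_u1 & ((1ULL << b) - 1));
> 
>         return pkt_blocks[u0_pop + below];
>     }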
> 
> To test its behavior, change "pkt_mf_u0_pop" to a hard-coded zero (or other
> value...) and run the autovalidator with the unit tests ... bam ... failure
> on block processing. :)

Thanks for the explanation, that makes more sense now.

> 
> 
> > > +        __m512i v_indexes = _mm512_add_epi64(v_popcnts, v_idx_u0_offset);
> > > +
> > > +        /* Gather u64 blocks from packet miniflow. */
> > > +        __m512i v_zeros = _mm512_setzero_si512();
> > > +        __m512i v_blocks = _mm512_mask_i64gather_epi64(v_zeros, u64_lanes_mask,
> > > +                                                       v_indexes, pkt_blocks,
> > > +                                                       GATHER_SCALE_8);
> > > +
> > > +        /* Mask pkt blocks with subtable blocks, k-mask to zero lanes. */
> > > +        __m512i v_masked_blocks = _mm512_maskz_and_epi64(zero_mask, v_blocks,
> > > +                                                         v_tbl_blocks);
> > > +        return v_masked_blocks;
> > > +}
> > > +
> > >  static inline uint32_t ALWAYS_INLINE
> > >  avx512_lookup_impl(struct dpcls_subtable *subtable,
> > >                     uint32_t keys_map,
> > > @@ -94,76 +169,86 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
> > >                     const uint32_t bit_count_u1)
> > >  {
> > >      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t block_cache[BLOCKS_CACHE_SIZE];
> > > -
> > > -    const uint32_t bit_count_total = bit_count_u0 + bit_count_u1;
> > > -    int i;
> > >      uint32_t hashes[NETDEV_MAX_BURST];
> > > +
> > Whitespace added here; it should probably have been done in a previous
> > patch rather than fixed here.
> 
> Not sure about this; the code being modified is from the OVS 2.14 era, the
> AVX512 DPCLS patchset. Will just remove this whitespace addition instead,
> and leave the old code as it was.
> 
> 
> > BR
> > Ian
> 
> Regards, -Harry

Thanks for the explanations Harry.

Regards
Ian


