[ovs-dev] [v12 11/16] dpif-netdev/dpcls-avx512: Enable 16 block processing.

Van Haaren, Harry harry.van.haaren at intel.com
Thu Jun 10 19:42:02 UTC 2021


> -----Original Message-----
> From: Stokes, Ian <ian.stokes at intel.com>
> Sent: Wednesday, June 9, 2021 1:03 PM
> To: Ferriter, Cian <cian.ferriter at intel.com>; ovs-dev at openvswitch.org; Van
> Haaren, Harry <harry.van.haaren at intel.com>
> Cc: i.maximets at ovn.org
> Subject: RE: [ovs-dev] [v12 11/16] dpif-netdev/dpcls-avx512: Enable 16 block
> processing.
<snip commit message body>
> > The code has been run with unit tests under autovalidation and
> > passes all cases, and unit test coverage has been checked to
> > ensure the 16 block code paths are executing.
> >
> 
> Hi Harry/Cian, thanks for the patch.

Thanks for review!


> I looked through it over the past few days. There is a lot here in terms of
> masking/shifting and taking advantage of the intrinsics to accomplish that.
> The logic seemed OK to my mind but I didn't run something like GDB to check line by
> line.
> 
> I think the key point here is what you've called out above: although this is the
> specific AVX512 implementation, the unit tests confirm it matches what is expected
> from the scalar implementation.
> 
> Although the implementation differs, the end results are the same. The performance
> of the two methods will differ for sure, but that's to be expected.
> 
> With that in mind, and given that this has been tested for months with different
> traffic patterns, I have confidence in this patch.

All the above is accurate IMO too, thanks.

> A few more comments inline below.

Addressed inline, with <snip>s to reduce the context to a minimum.


> > Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> >
<snip code without comments>

> >
> >  VLOG_DEFINE_THIS_MODULE(dpif_lookup_avx512_gather);
> > @@ -69,22 +83,83 @@ netdev_rule_matches_key(const struct dpcls_rule
> > *rule,
> >  {
> >      const uint64_t *keyp = miniflow_get_values(&rule->flow.mf);
> >      const uint64_t *maskp = miniflow_get_values(&rule->mask->mf);
> > -    const uint32_t lane_mask = (1 << mf_bits_total) - 1;
> > +    const uint32_t lane_mask = (1ULL << mf_bits_total) - 1;
> >
> >      /* Always load a full cache line from blocks_cache. Other loads must be
> >       * trimmed to the amount of data required for mf_bits_total blocks.
> >       */
> > -    __m512i v_blocks = _mm512_loadu_si512(&block_cache[0]);
> > -    __m512i v_mask   = _mm512_maskz_loadu_epi64(lane_mask,
> > &maskp[0]);
> > -    __m512i v_key    = _mm512_maskz_loadu_epi64(lane_mask, &keyp[0]);
> > +    uint32_t res_mask;
> > +
> > +    {
> 
> Just curious as to the need for the brace above and the brace below in this code
> block? Is this a specific requirement for operations with _mm512 calls?

This is just using { and } to create a new scope for variables.
As you can see, the two workloads are near identical, and the variable
names are actually the same. By using { and }, the variables declared in
each scope are distinct from (and shadow) any of the same name outside it.

Continuing this comment under the code/patch snippet here for context:

> > +        __m512i v_blocks = _mm512_loadu_si512(&block_cache[0]);
> > +        __m512i v_mask   = _mm512_maskz_loadu_epi64(lane_mask,
> > &maskp[0]);
> > +        __m512i v_key    = _mm512_maskz_loadu_epi64(lane_mask,
> > &keyp[0]);
> > +        __m512i v_data = _mm512_and_si512(v_blocks, v_mask);
> > +        res_mask = _mm512_mask_cmpeq_epi64_mask(lane_mask, v_data,
> > v_key);
> > +    }
> >
> > -    __m512i v_data = _mm512_and_si512(v_blocks, v_mask);
> > -    uint32_t res_mask = _mm512_mask_cmpeq_epi64_mask(lane_mask,
> > v_data, v_key);
> > +    if (mf_bits_total > 8) {
> > +        uint32_t lane_mask_gt8 = lane_mask >> 8;
> > +        __m512i v_blocks = _mm512_loadu_si512(&block_cache[8]);
> > +        __m512i v_mask   = _mm512_maskz_loadu_epi64(lane_mask_gt8,
> > &maskp[8]);
> > +        __m512i v_key    = _mm512_maskz_loadu_epi64(lane_mask_gt8,
> > &keyp[8]);
> > +        __m512i v_data = _mm512_and_si512(v_blocks, v_mask);
> > +        uint32_t c = _mm512_mask_cmpeq_epi64_mask(lane_mask_gt8,
> > v_data,
> > +                                                  v_key);
> > +        res_mask |= (c << 8);
> > +    }

Notice there are patch lines removed with - and added with + here, so
it's a bit hard to read! But the part being added starts with the loading
of "block_cache[8]", and the processing that follows is very similar to
that of the first 8 blocks.

We don't want a loop here, as this code gets inlined and special-cased
based on "mf_bits_total", so using { and } seemed an elegant solution
to having similar-but-slightly-different code here :)

We can change it to the following "double-named-variables" style if preferred:
__m512i v_blocks_00_07 = _load(blocks[0]);
__m512i v_blocks_08_15 = _load(blocks[8]);

I prefer the { scope } style, so will leave it as is unless told to change it.
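
For reference, here is the same added hunk de-quoted (identical code to the
patch above, reproduced only so the { scope } style is easier to see without
the +/- quoting noise; the comments are added here for explanation):

    uint32_t res_mask;

    {
        /* Blocks 0..7: these names live only inside this scope. */
        __m512i v_blocks = _mm512_loadu_si512(&block_cache[0]);
        __m512i v_mask   = _mm512_maskz_loadu_epi64(lane_mask, &maskp[0]);
        __m512i v_key    = _mm512_maskz_loadu_epi64(lane_mask, &keyp[0]);
        __m512i v_data   = _mm512_and_si512(v_blocks, v_mask);
        res_mask = _mm512_mask_cmpeq_epi64_mask(lane_mask, v_data, v_key);
    }

    if (mf_bits_total > 8) {
        /* Blocks 8..15: same names in a new scope. This branch is expected
         * to be resolved at compile time for each specialized mf_bits_total. */
        uint32_t lane_mask_gt8 = lane_mask >> 8;
        __m512i v_blocks = _mm512_loadu_si512(&block_cache[8]);
        __m512i v_mask   = _mm512_maskz_loadu_epi64(lane_mask_gt8, &maskp[8]);
        __m512i v_key    = _mm512_maskz_loadu_epi64(lane_mask_gt8, &keyp[8]);
        __m512i v_data   = _mm512_and_si512(v_blocks, v_mask);
        uint32_t c = _mm512_mask_cmpeq_epi64_mask(lane_mask_gt8, v_data, v_key);
        res_mask |= (c << 8);
    }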

> >
> > -    /* returns 1 assuming result of SIMD compare is all blocks. */
> > +    /* returns 1 assuming result of SIMD compare is all blocks matching. */
> >      return res_mask == lane_mask;
> >  }
> >
> > +/* Takes u0 and u1 inputs, and gathers the next 8 blocks to be stored
> > + * contigously into the blocks cache. Note that the pointers and bitmasks
> > + * passed into this function must be incremented for handling next 8 blocks.
> > + */
> > +static inline ALWAYS_INLINE __m512i
> > +avx512_blocks_gather(__m512i v_u0, /* reg of u64 of all u0 bits */
> > +                     __m512i v_u1, /* reg of u64 of all u1 bits */
> > +                     const uint64_t *pkt_blocks, /* ptr pkt blocks to load */
> > +                     const void *tbl_blocks,     /* ptr to blocks in tbl */
> > +                     const void *tbl_mf_masks,   /* ptr to subtable mf masks */
> > +                     __mmask64 u1_bcast_msk,      /* mask of u1 lanes */
> > +                     const uint64_t pkt_mf_u0_pop, /* num bits in u0 of pkt */
> > +                     __mmask64 zero_mask, /* maskz if pkt not have mf bit */
> Typo in comment above, maskz -> masks.

Ack, will fix.


> > +                     __mmask64 u64_lanes_mask) /* total lane count to use */
> 
> Not sure about the format of the comment above; would it be better to explain each
> parameter in the comment preceding the function?
> As regards OVS standards, I have not seen an example where arguments and
> comments are mixed like this.

Ack, can refactor if preferred, will do. 

> 
> > +{
> > +        /* Suggest to compiler to load tbl blocks ahead of gather(). */
> > +        __m512i v_tbl_blocks =
> > _mm512_maskz_loadu_epi64(u64_lanes_mask,
> > +                                                        tbl_blocks);
> > +
> > +        /* Blend u0 and u1 bits together for these 8 blocks. */
> > +        __m512i v_pkt_bits = _mm512_mask_blend_epi64(u1_bcast_msk,
> > v_u0, v_u1);
> > +
> > +        /* Load pre-created tbl miniflow bitmasks, bitwise AND with them. */
> > +        __m512i v_tbl_masks =
> > _mm512_maskz_loadu_epi64(u64_lanes_mask,
> > +                                                      tbl_mf_masks);
> > +        __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_tbl_masks);
> > +
> > +        /* Manual AVX512 popcount for u64 lanes. */
> > +        __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
> 
> Can I ask what's specific about the manual count? I could find intrinsics guide entries
> for the other functions above but not for the manual count here. Is there a performance
> hit associated with a manual count?

The _manual() suffix indicates a separate function, which _emulates_ the real
popcnt_epi64 functionality. As only Ice Lake (3rd gen Xeon) has the native
instruction, older processors that support AVX512 emulate this functionality using
"normal" AVX512 integer instructions.

Yes, it has a performance cost: instead of a single instruction, about 8 to 10 AVX512
instructions are required to bit-shift, bit-AND, and translate the results into the
required popcount output.
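
If it helps as a reference, this is a rough sketch of the usual nibble-LUT way
to emulate a per-u64-lane popcount (illustrative only; the helper in the patch
may differ in detail):

/* Requires <immintrin.h>; uses only AVX512F + AVX512BW intrinsics. */
static inline __m512i
popcnt_epi64_sketch(__m512i v)
{
    /* Byte LUT with popcount(i) at byte i for i = 0..15, replicated into
     * every 128-bit lane so the in-lane byte shuffle can index it. */
    const __m512i v_lut = _mm512_set4_epi32(0x04030302, 0x03020201,
                                            0x03020201, 0x02010100);
    const __m512i v_nib = _mm512_set1_epi8(0x0F);

    /* Split each byte into its low and high nibble. */
    __m512i v_lo = _mm512_and_si512(v, v_nib);
    __m512i v_hi = _mm512_and_si512(_mm512_srli_epi64(v, 4), v_nib);

    /* Per-byte popcounts via table lookup, adding low + high nibble counts. */
    __m512i v_cnt = _mm512_add_epi8(_mm512_shuffle_epi8(v_lut, v_lo),
                                    _mm512_shuffle_epi8(v_lut, v_hi));

    /* Horizontal-add the 8 byte counts inside each u64 lane. */
    return _mm512_sad_epu8(v_cnt, _mm512_setzero_si512());
}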


> > +        /* Add popcounts and offset for u1 bits. */
> > +        __m512i v_idx_u0_offset =
> > _mm512_maskz_set1_epi64(u1_bcast_msk,
> > +                                                          pkt_mf_u0_pop);
> 
> Above, is the assumption that the popcount for u0 will be the same as u1?
> Is there a case where they would not be?

No, there is no assumption here about u0 and u1 being the same.
This code puts the "u0_popcount" from the packet into _parts_ of
the v_idx_u0_offset register. This is later added to each miniflow block
offset, allowing a single "gather" instruction to load blocks from both
u0 and u1 at the same time.

Think of it this way: the mf bits[0] and bits[1] index into a single flat array.
If popcount(bits[0]) is 5, then any block selected by bits[1] must be accessed at
array[ u0_pop + offset_within_u1 ].

By broadcasting (set1 intrinsic) the value of u0_pop into the register, we can
offset the "lanes" in the register that use bits[1] by u0_pop.

To test its behavior, change "pkt_mf_u0_pop" to a hard-coded zero (or other value..)
and run the autovalidator with unit tests ... bam ... failure on block processing.  :)


> > +        __m512i v_indexes = _mm512_add_epi64(v_popcnts,
> > v_idx_u0_offset);
> > +
> > +        /* Gather u64 blocks from packet miniflow. */
> > +        __m512i v_zeros = _mm512_setzero_si512();
> > +        __m512i v_blocks = _mm512_mask_i64gather_epi64(v_zeros,
> > u64_lanes_mask,
> > +                                                       v_indexes, pkt_blocks,
> > +                                                       GATHER_SCALE_8);
> > +
> > +        /* Mask pkt blocks with subtable blocks, k-mask to zero lanes. */
> > +        __m512i v_masked_blocks = _mm512_maskz_and_epi64(zero_mask,
> > v_blocks,
> > +                                                         v_tbl_blocks);
> > +        return v_masked_blocks;
> > +}
> > +
> >  static inline uint32_t ALWAYS_INLINE
> >  avx512_lookup_impl(struct dpcls_subtable *subtable,
> >                     uint32_t keys_map,
> > @@ -94,76 +169,86 @@ avx512_lookup_impl(struct dpcls_subtable
> > *subtable,
> >                     const uint32_t bit_count_u1)
> >  {
> >      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t
> > block_cache[BLOCKS_CACHE_SIZE];
> > -
> > -    const uint32_t bit_count_total = bit_count_u0 + bit_count_u1;
> > -    int i;
> >      uint32_t hashes[NETDEV_MAX_BURST];
> > +
> Whitespace added here; this should probably have been done in the previous patch
> rather than fixed here.

Not sure about this; the code being modified is from the OVS 2.14-era AVX512 DPCLS patchset.
Will just remove this whitespace addition instead, and leave the old code as it was.


> BR
> Ian

Regards, -Harry

