[ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather implementation

Van Haaren, Harry harry.van.haaren at intel.com
Mon May 18 16:12:43 UTC 2020


> -----Original Message-----
> From: William Tu <u9012063 at gmail.com>
> Sent: Monday, May 18, 2020 3:58 PM
> To: Van Haaren, Harry <harry.van.haaren at intel.com>
> Cc: ovs-dev at openvswitch.org; i.maximets at ovn.org
> Subject: Re: [ovs-dev] [PATCH v2 5/5] dpif-lookup: add avx512 gather
> implementation
> 
> On Wed, May 06, 2020 at 02:06:09PM +0100, Harry van Haaren wrote:
> > This commit adds an AVX-512 dpcls lookup implementation.
> > It uses the AVX-512 SIMD ISA to perform multiple miniflow
> > operations in parallel.
> >
> > To run this implementation, the "avx512f" and "bmi2" ISAs are
> > required. These ISA checks are performed at runtime while
> > probing the subtable implementation. If a CPU does not provide
> > both "avx512f" and "bmi2", then this code does not execute.
> >
> > The avx512 code is built as a separate static library, with added
> > CFLAGS to enable the required ISA features. By building only this
> > static library with avx512 enabled, it is ensured that the main OVS
> > core library is *not* using avx512, and that OVS continues to run
> > as before on CPUs that do not support avx512.
> >
> > The approach taken in this implementation is to use the
> > gather instruction to access the packet miniflow, allowing
> > any miniflow blocks to be loaded into an AVX-512 register.
> > This maximises the usefulness of the register, and hence this
> > implementation handles any subtable with up to 8 miniflow bits.
> >
> > Note that specialization of these avx512 lookup routines
> > still provides performance value, as the hashing of the
> > resulting data is performed in scalar code, and compile-time
> > loop unrolling occurs when specialized to miniflow bits.
> >
> 
> Hi Harry,
> 
> I haven't tried running the code because my machine only
> supports avx2. There are some minor issues such as indentation,
> but I read through it with the example below and I think it's correct.

Thanks for the review! I'll post replies inline for context.

Note, the Software Development Emulator (SDE) tool enables emulation of the AVX512 ISA.
Full details are provided at the link below; using it would enable running the AVX512
DPCLS implementation itself, should you want to test it locally:
https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html
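
As an illustration, an invocation could look like the below (hypothetical command
line; "-skx" selects a Skylake-X target in SDE, and the binary and arguments are
whatever you would normally run):

    $ sde64 -skx -- ./vswitchd/ovs-vswitchd <your usual arguments>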


> Given that you have to do a lot of preparation (ex: popcount, creating
> bit_masks, broadcast, ... etc) before using avx instructions, do you
> have some performance numbers? I didn't see any from ovsconf 18 or 19.
> Is using avx512 much better than avx2?

Correct, there is some "pre-work" to do before the miniflow manipulation itself.
Note that much of the more complex work (e.g. miniflow bitmask generation for the
subtable) is done at subtable instantiation time, instead of on the critical path.
Also, the popcount lookup table is "static const", which turns into a single AVX512
load at runtime.
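
To illustrate that instantiation-time pre-work, below is a minimal sketch of how
the per-block masks could be generated (the function name is hypothetical, and the
real code in the patch may differ): for each bit set in the subtable's miniflow
map, store a mask of all lower bits, so that at lookup time
popcount(pkt_bits & mf_masks[j]) yields the index of block j inside the packet's
packed miniflow data.

    #include <stdint.h>

    static void
    subtable_mf_masks_init(uint64_t map_bits, uint64_t *mf_masks)
    {
        int j = 0;

        while (map_bits) {
            uint64_t lowest = map_bits & -map_bits; /* Isolate lowest set bit. */
            mf_masks[j++] = lowest - 1;             /* Mask of all bits below it. */
            map_bits &= ~lowest;
        }
    }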

AVX512 provides some very useful features, which are used throughout the code
below. In particular, the AVX512 "k-mask" feature allows the developer to switch
off a lane in the SIMD register (this is sometimes referred to as a predication
mask). Using these "k-masks" avoids needing extra instructions later to "merge"
results back together (as SSE or AVX2 code would have to do).
Example: "mask_set1_epi64" allows setting a specific value into the "lanes" given
by the k-mask, and results in an AVX512 register with those contents.
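
A minimal sketch of the two k-mask variants (the values are arbitrary, purely to
show the predication; compile with -mavx512f):

    #include <immintrin.h>

    static __m512i
    kmask_example(void)
    {
        /* maskz variant: lanes with a 0 k-mask bit are zeroed outright. */
        __m512i v_a = _mm512_maskz_set1_epi64(0xFE, 42);
        /* v_a = [0, 42, 42, 42, 42, 42, 42, 42] (lane 0 listed first) */

        /* mask variant: lanes with a 0 k-mask bit keep their value from the
         * src register, so no separate blend/merge instruction is needed. */
        return _mm512_mask_set1_epi64(v_a, 0x01, 7);
        /* = [7, 42, 42, 42, 42, 42, 42, 42] */
    }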

There are also new instructions in AVX512 which make the ISA even more powerful.
For example, the "AVX512VPOPCNTDQ" CPUID feature provides a vectorized popcount,
which can be used instead of the "_mm512_popcnt_epi64_manual()" helper function.
Enabling the AVX512 VPOPCNT instruction is planned in future patches to OVS.
Details of the instruction are available in the intrinsics guide:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_popcnt_epi64&expand=4368
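
For reference, a sketch of the difference is below. The nibble-LUT approach is one
common way to implement a manual per-u64-lane popcount (it assumes AVX512BW for the
byte shuffle and SAD instructions; the actual helper in the patch may be implemented
differently):

    #include <immintrin.h>

    /* Count bits per nibble via a VPSHUFB table lookup, then horizontally sum
     * the per-byte counts within each 64-bit lane via VPSADBW against zero. */
    static __m512i
    popcnt_epi64_sketch(__m512i v)
    {
        const __m512i lut = _mm512_set4_epi32(0x04030302, 0x03020201,
                                              0x03020201, 0x02010100);
        const __m512i nib = _mm512_set1_epi8(0x0F);
        __m512i lo = _mm512_and_si512(v, nib);
        __m512i hi = _mm512_and_si512(_mm512_srli_epi64(v, 4), nib);
        __m512i cnt8 = _mm512_add_epi8(_mm512_shuffle_epi8(lut, lo),
                                       _mm512_shuffle_epi8(lut, hi));
        return _mm512_sad_epu8(cnt8, _mm512_setzero_si512());
    }

    /* With AVX512VPOPCNTDQ, the same result becomes a single instruction:
     *     __m512i pc = _mm512_popcnt_epi64(v);
     */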

Finally, although the code can seem a bit verbose, most _mm512_xxx_yyy() intrinsics
result in a single instruction. This means that although the code looks "big", the
resulting instruction stream is often extremely densely packed. Combine that with
the fact that the implementation is focused on using instructions to deliver the
maximum amount of required compute without any waste, and it can result in very
high performance :)

Regarding performance numbers, unfortunately I don't have official numbers to state
here. As an approximation (caveats such as "depends on exact usage" etc. apply): for
about the same packet rate, the CPU cycles spent in DPCLS are roughly halved in the
AVX512 version compared to the scalar version.

<snip lots of patch contents>

> I think the below function is the most difficult one.
> I wonder if there is a better way to make it easier to understand?
> ex: break it into subfunctions or utility functions

My experience has been that breaking it up into smaller snippets causes me to
lose sight of the big picture. Code like the below is typically not written in one
pass, but through an iterative process. Seeing the desired register contents is
valuable, and knowing the context and state of nearby registers can often suggest
new optimizations or strength reductions of existing code.

Clearly commenting the reason for the compute, and sometimes how it is computed,
is the best known method for writing maintainable SIMD code. This method is also
used in DPDK for its PMDs, for example in the i40e driver SIMD rx codepath:
http://git.dpdk.org/dpdk/tree/drivers/net/i40e/i40e_rxtx_vec_avx2.c#n221 


> I ended up using an example from slide 2 of your deck here:
> https://www.openvswitch.org/support/ovscon2019/day1/1108-next_steps_sw_datapath_hvh.pdf
> and the API document here
> https://software.intel.com/sites/landingpage/IntrinsicsGuide/

Aha, you've found the colorful instruction set architecture guide :)
There is another resource which presents the data movement more graphically.
I'll mention it, but I advise using the IntrinsicsGuide as linked above, as it is
the official resource and is maintained and kept up to date. The graphical
webpage is here: https://www.officedaytime.com/simd512e/simd.html



> > +static inline uint32_t ALWAYS_INLINE
> > +avx512_lookup_impl(struct dpcls_subtable *subtable,
> > +                   uint32_t keys_map,
> > +                   const struct netdev_flow_key *keys[],
> > +                   struct dpcls_rule **rules,
> > +                   const uint32_t bit_count_u0,
> > +                   const uint32_t bit_count_u1)
> > +{
> > +    const uint32_t bit_count_total = bit_count_u0 + bit_count_u1;
> > +    int i;
> > +    uint32_t hashes[NETDEV_MAX_BURST];
> > +    const uint32_t n_pkts = __builtin_popcountll(keys_map);
> > +    ovs_assert(NETDEV_MAX_BURST >= n_pkts);
> > +
> > +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t block_cache[NETDEV_MAX_BURST * 8];
> > +
> > +    const uint64_t tbl_u0 = subtable->mask.mf.map.bits[0];
> > +    const uint64_t tbl_u1 = subtable->mask.mf.map.bits[1];
> > +    ovs_assert(__builtin_popcountll(tbl_u0) == bit_count_u0);
> > +    ovs_assert(__builtin_popcountll(tbl_u1) == bit_count_u1);
> > +
> > +    /* Load subtable blocks for masking later */
> > +    const uint64_t *tbl_blocks = miniflow_get_values(&subtable->mask.mf);
> > +    const __m512i v_tbl_blocks = _mm512_loadu_si512(&tbl_blocks[0]);
> > +
> > +    /* Load pre-created subtable masks for each block in subtable */
> > +    const __mmask8 bit_count_total_mask = (1 << bit_count_total) - 1;
> > +    const __m512i v_mf_masks = _mm512_maskz_loadu_epi64(bit_count_total_mask,
> > +                                                        subtable->mf_masks);
> > +
> > +    ULLONG_FOR_EACH_1 (i, keys_map) {
> > +        const uint64_t pkt_mf_u0_bits = keys[i]->mf.map.bits[0];
> > +        const uint64_t pkt_mf_u0_pop = __builtin_popcountll(pkt_mf_u0_bits);
> > +
> > +        /* Pre-create register with *PER PACKET* u0 offset */
> > +        const __mmask8 u1_bcast_mask = (UINT8_MAX << bit_count_u0);
> > +        const __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_mask,
> > +                                                                pkt_mf_u0_pop);
> > +
> > +        /* Broadcast u0, u1 bitmasks to 8x u64 lanes */
> > +        __m512i v_u0 = _mm512_set1_epi64(pkt_mf_u0_bits);
> > +        __m512i v_pkt_bits = _mm512_mask_set1_epi64(v_u0, u1_bcast_mask,
> > +                                         keys[i]->mf.map.bits[1]);
> > +
> > +        /* Bitmask by pre-created masks */
> > +        __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_mf_masks);
> > +
> > +        /* Manual AVX512 popcount for u64 lanes */
> > +        __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
> > +
> > +        /* Offset popcounts for u1 with pre-created offset register */
> > +        __m512i v_indexes = _mm512_add_epi64(v_popcnts, v_idx_u0_offset);
> > +
> > +        /* Gather u64 on single packet, merge with zero reg, up to 8 blocks */
> > +        const __m512i v_zeros = _mm512_setzero_si512();
> > +        const uint64_t *pkt_data = miniflow_get_values(&keys[i]->mf);
> > +        __m512i v_all_blocks = _mm512_mask_i64gather_epi64(v_zeros,
> > +                                 bit_count_total_mask, v_indexes, pkt_data, 8);
> indent

Thanks!

> > +        /* Zero out bits that pkt doesn't have:
> > +         * - 2x pext() to extract bits from packet miniflow as needed by TBL
> > +         * - Shift u1 over by bit_count of u0, OR to create zero bitmask
> > +         */
> > +         uint64_t u0_to_zero = _pext_u64(keys[i]->mf.map.bits[0], tbl_u0);
> > +         uint64_t u1_to_zero = _pext_u64(keys[i]->mf.map.bits[1], tbl_u1);
> > +         uint64_t zero_mask = (u1_to_zero << bit_count_u0) | u0_to_zero;
> indentation: remove one space

Will fix.


> > +        /* Mask blocks using AND with subtable blocks, use k-mask to zero
> > +         * lanes as required for this packet.
> > +         */
> > +        __m512i v_masked_blocks = _mm512_maskz_and_epi64(zero_mask,
> > +                                                v_all_blocks, v_tbl_blocks);
> > +
> > +        /* Store to blocks cache, full cache line aligned */
> > +        _mm512_storeu_si512(&block_cache[i * 8], v_masked_blocks);
> > +    }
> > +
> > +    /* Hash the now linearized blocks of packet metadata. */
> > +    ULLONG_FOR_EACH_1 (i, keys_map) {
> > +        uint64_t *block_ptr = &block_cache[i * 8];
> > +        uint32_t hash = hash_add_words64(0, block_ptr, bit_count_total);
> > +        hashes[i] = hash_finish(hash, bit_count_total * 8);
> > +    }
> > +
> > +    /* Lookup: this returns a bitmask of packets where the hash table had
> > +     * an entry for the given hash key. Presence of a hash key does not
> > +     * guarantee matching the key, as there can be hash collisions.
> > +     */
> > +    uint32_t found_map;
> > +    const struct cmap_node *nodes[NETDEV_MAX_BURST];
> > +    found_map = cmap_find_batch(&subtable->rules, keys_map, hashes, nodes);
> > +
> > +    /* Verify that packet actually matched rule. If not found, a hash
> > +     * collision has taken place, so continue searching with the next node.
> > +     */
> > +    ULLONG_FOR_EACH_1 (i, found_map) {
> > +        struct dpcls_rule *rule;
> > +
> > +        CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) {
> > +            const uint32_t cidx = i * 8;
> > +            uint32_t match = netdev_rule_matches_key(rule, bit_count_total,
> > +                                                     &block_cache[cidx]);
> > +            if (OVS_LIKELY(match)) {
> > +                rules[i] = rule;
> > +                subtable->hit_cnt++;
> > +                goto next;
> > +            }
> > +        }
> > +
> > +        /* None of the found rules was a match.  Clear the i-th bit to
> > +         * search for this key in the next subtable. */
> > +        ULLONG_SET0(found_map, i);
> > +    next:
> > +        ;                     /* Keep Sparse happy. */
> > +    }
> > +
> > +    return found_map;
> > +}
> 
> If someone is interested, the example below with the slides
> help understand the above function.

Wow - nice work! Impressive to see the code taken apart and reduced
to its logical behavior like this, interesting to see.


> diff --git a/lib/dpif-netdev-lookup-avx512-gather.c b/lib/dpif-netdev-lookup-avx512-gather.c
> index 52348041bd00..f84a95423cf8 100644
> --- a/lib/dpif-netdev-lookup-avx512-gather.c
> +++ b/lib/dpif-netdev-lookup-avx512-gather.c
> @@ -93,56 +93,77 @@ avx512_lookup_impl(struct dpcls_subtable *subtable,
> 
>      OVS_ALIGNED_VAR(CACHE_LINE_SIZE)uint64_t block_cache[NETDEV_MAX_BURST * 8];
> 
> -    const uint64_t tbl_u0 = subtable->mask.mf.map.bits[0];
> -    const uint64_t tbl_u1 = subtable->mask.mf.map.bits[1];
> -    ovs_assert(__builtin_popcountll(tbl_u0) == bit_count_u0);
> -    ovs_assert(__builtin_popcountll(tbl_u1) == bit_count_u1);
> +    const uint64_t tbl_u0 = subtable->mask.mf.map.bits[0]; //1000,0000
> +    const uint64_t tbl_u1 = subtable->mask.mf.map.bits[1]; //0100,0000
> +    ovs_assert(__builtin_popcountll(tbl_u0) == bit_count_u0); //1
> +    ovs_assert(__builtin_popcountll(tbl_u1) == bit_count_u1); //1
> +    // bit_count_total = 2
> 
>      /* Load subtable blocks for masking later */
> -    const uint64_t *tbl_blocks = miniflow_get_values(&subtable->mask.mf);
> -    const __m512i v_tbl_blocks = _mm512_loadu_si512(&tbl_blocks[0]);
> +    const uint64_t *tbl_blocks = miniflow_get_values(&subtable->mask.mf); //point to ipv4 mask
> +    const __m512i v_tbl_blocks = _mm512_loadu_si512(&tbl_blocks[0]); //point to ipv4 mask
> 
>      /* Load pre-created subtable masks for each block in subtable */
> -    const __mmask8 bit_count_total_mask = (1 << bit_count_total) - 1;
> -    const __m512i v_mf_masks = _mm512_maskz_loadu_epi64(bit_count_total_mask,
> +    const __mmask8 bit_count_total_mask = (1 << bit_count_total) - 1; // (1 << 2) - 1 = 0x3
> +    const __m512i v_mf_masks = _mm512_maskz_loadu_epi64(bit_count_total_mask /* 0x3 */,
>                                                          subtable->mf_masks);
> +    // subtable->mf_masks[0] = 0b01111111
> +    // subtable->mf_masks[1] = 0b00111111
> +    // v_mf_masks = [0,0,0,0,0,0, 0b00111111, 0b01111111]
> 
> -    ULLONG_FOR_EACH_1 (i, keys_map) {
> -        const uint64_t pkt_mf_u0_bits = keys[i]->mf.map.bits[0];
> -        const uint64_t pkt_mf_u0_pop = __builtin_popcountll(pkt_mf_u0_bits);
> +    ULLONG_FOR_EACH_1 (i, keys_map) { // for each packet in batch
> +        const uint64_t pkt_mf_u0_bits = keys[i]->mf.map.bits[0]; //0b1000,0100
> +        const uint64_t pkt_mf_u0_pop = __builtin_popcountll(pkt_mf_u0_bits); //2
> 
>          /* Pre-create register with *PER PACKET* u0 offset */
> -        const __mmask8 u1_bcast_mask = (UINT8_MAX << bit_count_u0);
> +        const __mmask8 u1_bcast_mask = (UINT8_MAX << bit_count_u0); //(0xff << 1) = 0xfe
>          const __m512i v_idx_u0_offset = _mm512_maskz_set1_epi64(u1_bcast_mask,
>                                                                  pkt_mf_u0_pop);
> +        //v_idx_u0_offset = [2,2,2,2,2,2,2,0]
> 
>          /* Broadcast u0, u1 bitmasks to 8x u64 lanes */
> -        __m512i v_u0 = _mm512_set1_epi64(pkt_mf_u0_bits);
> -        __m512i v_pkt_bits = _mm512_mask_set1_epi64(v_u0, u1_bcast_mask,
> -                                         keys[i]->mf.map.bits[1]);
> +        __m512i v_u0 = _mm512_set1_epi64(pkt_mf_u0_bits); // [0b10000100, 0b10000100, 0b10000100, ...]
> +
> +        __m512i v_pkt_bits = _mm512_mask_set1_epi64(v_u0, u1_bcast_mask /*0xfe*/,
> +                                         keys[i]->mf.map.bits[1] /* 0b01100000 */);
> +        // [0b01100000, 0b01100000, 0b01100000, 0b01100000, 0b01100000, 0b01100000, 0b01100000, 0b10000100]
> 
> -        /* Bitmask by pre-created masks */
> +
> +        /* Bitmask by pre-created masks. */
>          __m512i v_masks = _mm512_and_si512(v_pkt_bits, v_mf_masks);
> +        // v_masks = [0,0,0,0,0,0, 0b00100000,0b00000100]
> 
>          /* Manual AVX512 popcount for u64 lanes */
>          __m512i v_popcnts = _mm512_popcnt_epi64_manual(v_masks);
> +        // v_popcnts = [0,0,0,0,0,0,1,1]
> 
>          /* Offset popcounts for u1 with pre-created offset register */
>          __m512i v_indexes = _mm512_add_epi64(v_popcnts, v_idx_u0_offset);
> +        // v_indexes = [0,0,0,0,0,0,3,1]
> 
>          /* Gather u64 on single packet, merge with zero reg, up to 8 blocks */
>          const __m512i v_zeros = _mm512_setzero_si512();
>          const uint64_t *pkt_data = miniflow_get_values(&keys[i]->mf);
> +        // pkt_data = ipv4_src, ipv4_dst, mac_src, vlan_tci
> +
>          __m512i v_all_blocks = _mm512_mask_i64gather_epi64(v_zeros,
> -                                 bit_count_total_mask, v_indexes, pkt_data, 8);
> +                                 bit_count_total_mask /* 0x3 */,
> +                                 v_indexes, pkt_data, 8);
> +        //v_all_blocks: use v_index[0]=1*8 , v_index[1]=3*8 to gather data
> +        //v_all_blocks = [0,0,0,0,0,0, ipv4_dst, vlan_tci]
> 
>          /* Zero out bits that pkt doesn't have:
>           * - 2x pext() to extract bits from packet miniflow as needed by TBL
>           * - Shift u1 over by bit_count of u0, OR to create zero bitmask
>           */
> -         uint64_t u0_to_zero = _pext_u64(keys[i]->mf.map.bits[0], tbl_u0);
> -         uint64_t u1_to_zero = _pext_u64(keys[i]->mf.map.bits[1], tbl_u1);
> +         uint64_t u0_to_zero = _pext_u64(keys[i]->mf.map.bits[0] /* 0b1000,0100 */,
> +                                         tbl_u0 /* 0b1000,0000 */);
> +         // u0_to_zero = 0b00000001
> +         uint64_t u1_to_zero = _pext_u64(keys[i]->mf.map.bits[1] /* 0b0110,0000 */,
> +                                         tbl_u1 /* 0b0100,0000 */);
> +         // u1_to_zero = 0b00000001
>           uint64_t zero_mask = (u1_to_zero << bit_count_u0) | u0_to_zero;
> +        // 0b00000011
> 
>          /* Mask blocks using AND with subtable blocks, use k-mask to zero
>           * lanes as required for this packet.
> 
> ---
> Pretty cool piece of code. Thanks!
> 
> William

Pretty cool review. Thanks!

Harry


