[ovs-dev] [v12 04/16] dpif-avx512: Add ISA implementation of dpif.

Ferriter, Cian cian.ferriter at intel.com
Wed Jun 16 13:13:04 UTC 2021


Hi Ian,

Further comments are inline.

> -----Original Message-----
> From: Stokes, Ian <ian.stokes at intel.com>
> Sent: Wednesday 16 June 2021 12:03
> To: Ferriter, Cian <cian.ferriter at intel.com>; ovs-dev at openvswitch.org; Van Haaren, Harry <harry.van.haaren at intel.com>
> Cc: i.maximets at ovn.org
> Subject: RE: [ovs-dev] [v12 04/16] dpif-avx512: Add ISA implementation of dpif.
> 
> > Hi Ian,
> >
> > Thanks for the review. My responses are inline.
> >
> > > -----Original Message-----
> > > From: Stokes, Ian <ian.stokes at intel.com>
> > > Sent: Tuesday 1 June 2021 19:59
> > > To: Ferriter, Cian <cian.ferriter at intel.com>; ovs-dev at openvswitch.org; Van
> > Haaren, Harry <harry.van.haaren at intel.com>
> > > Cc: i.maximets at ovn.org
> > > Subject: RE: [ovs-dev] [v12 04/16] dpif-avx512: Add ISA implementation of
> > dpif.
> > >
> > > > This commit adds the AVX512 implementation of DPIF functionality,
> > > > specifically the dp_netdev_input_outer_avx512 function. This function
> > > > only handles outer (no re-circulations), and is optimized to use the
> > > > AVX512 ISA for packet batching and other DPIF work.
> > > >
> > > > Sparse is not able to handle the AVX512 intrinsics, causing compile
> > > > time failures, so it is disabled for this file.
> > > >
> > > > Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> > > > Co-authored-by: Cian Ferriter <cian.ferriter at intel.com>
> > > > Signed-off-by: Cian Ferriter <cian.ferriter at intel.com>
> > >
> > > Thanks for the patch Harry/Cian. Still testing this to a degree, but questions
> > > below on initial thoughts.
> > >
> > > >
> > > > ---
> > > >
> > > > v8:
> > > > - Fixup AVX512 mask to uint32_t conversion compilation warning.
> > > > ---
> > > >  lib/automake.mk                  |   5 +-
> > > >  lib/dpif-netdev-avx512.c         | 265 +++++++++++++++++++++++++++++++
> > > >  lib/dpif-netdev-private-dfc.h    |   8 +
> > > >  lib/dpif-netdev-private-dpif.h   |  32 ++++
> > > >  lib/dpif-netdev-private-thread.h |  11 +-
> > > >  lib/dpif-netdev-private.h        |  25 +++
> > > >  lib/dpif-netdev.c                |  70 ++++++--
> > > >  7 files changed, 400 insertions(+), 16 deletions(-)
> > > >  create mode 100644 lib/dpif-netdev-avx512.c
> > > >  create mode 100644 lib/dpif-netdev-private-dpif.h
> > > >
> > > > diff --git a/lib/automake.mk b/lib/automake.mk
> > > > index 0bef0cc69..5fab8ba4f 100644
> > > > --- a/lib/automake.mk
> > > > +++ b/lib/automake.mk
> > > > @@ -33,11 +33,13 @@ lib_libopenvswitchavx512_la_CFLAGS = \
> > > >  -mavx512f \
> > > >  -mavx512bw \
> > > >  -mavx512dq \
> > > > +-mbmi \
> > >
> > > Can I ask what's needed in bmi that was not already included in bmi2? Just
> > > out of curiosity.
> > >
> >
> > So in the dp_netdev_input_outer_avx512() function (the AVX512 DPIF
> > implementation), '_blsr_u64' is used.
> > It's used twice to reset (set to 0) the lowest set bit in a variable.
> >
> > More info on '_blsr_u64':
> > https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_blsr_u64&expand=463
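
To make the operation concrete: it's the classic clear-lowest-set-bit trick. A minimal portable sketch of what the intrinsic computes (illustrative only, not code from the patch):

```c
#include <stdint.h>

/* Portable equivalent of the BMI1 _blsr_u64 intrinsic: clear the lowest
 * set bit of 'x'.  Subtracting 1 flips the lowest set bit and all bits
 * below it, so ANDing with the original clears exactly that bit. */
static inline uint64_t
blsr_u64_portable(uint64_t x)
{
    return x & (x - 1);
}
```

The hardware instruction just does this in a single instruction without the temporary.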
> >
> > > >  -mbmi2 \
> > > >  -fPIC \
> > > >  $(AM_CFLAGS)
> > > >  lib_libopenvswitchavx512_la_SOURCES = \
> > > > -lib/dpif-netdev-lookup-avx512-gather.c
> > > > +lib/dpif-netdev-lookup-avx512-gather.c \
> > > > +lib/dpif-netdev-avx512.c
> > > >  lib_libopenvswitchavx512_la_LDFLAGS = \
> > > >  -static
> > > >  endif
> > > > @@ -113,6 +115,7 @@ lib_libopenvswitch_la_SOURCES = \
> > > >  lib/dpif-netdev.h \
> > > >  lib/dpif-netdev-private-dfc.h \
> > > >  lib/dpif-netdev-private-dpcls.h \
> > > > +lib/dpif-netdev-private-dpif.h \
> > > >  lib/dpif-netdev-private-flow.h \
> > > >  lib/dpif-netdev-private-hwol.h \
> > > >  lib/dpif-netdev-private-thread.h \
> > > > diff --git a/lib/dpif-netdev-avx512.c b/lib/dpif-netdev-avx512.c
> > > > new file mode 100644
> > > > index 000000000..91f51c479
> > > > --- /dev/null
> > > > +++ b/lib/dpif-netdev-avx512.c
> > > > @@ -0,0 +1,265 @@
> > > > +/*
> > > > + * Copyright (c) 2021 Intel Corporation.
> > > > + *
> > > > + * Licensed under the Apache License, Version 2.0 (the "License");
> > > > + * you may not use this file except in compliance with the License.
> > > > + * You may obtain a copy of the License at:
> > > > + *
> > > > + *     http://www.apache.org/licenses/LICENSE-2.0
> > > > + *
> > > > + * Unless required by applicable law or agreed to in writing, software
> > > > + * distributed under the License is distributed on an "AS IS" BASIS,
> > > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > > > implied.
> > > > + * See the License for the specific language governing permissions and
> > > > + * limitations under the License.
> > > > + */
> > > > +
> > > > +#ifdef __x86_64__
> > > > +/* Sparse cannot handle the AVX512 instructions. */
> > >
> > > So is this a limitation with sparse currently? Do you know if there are any plans
> > for support in sparse for AVX512 in the future?
> > >
> >
> > Yes, unfortunately this is a limitation with sparse. I'm not sure if this will be
> > added in the future.
> >
> OK, important to keep in mind for any future AVX512 work that it needs to be excluded from the Sparse checks.
> 
> Would be interesting to get line of sight on whether a solution will emerge from the sparse community in the future; not a blocker
> here for the moment, I would say.
> 
> > > > +#if !defined(__CHECKER__)
> > > > +
> > > > +#include <config.h>
> > > > +
> > > > +#include "dpif-netdev.h"
> > > > +#include "dpif-netdev-perf.h"
> > > > +
> > > > +#include "dpif-netdev-private.h"
> > > > +#include "dpif-netdev-private-dpcls.h"
> > > > +#include "dpif-netdev-private-flow.h"
> > > > +#include "dpif-netdev-private-thread.h"
> > > > +
> > > > +#include "dp-packet.h"
> > > > +#include "netdev.h"
> > > > +
> > > > +#include "immintrin.h"
> > > > +
> > > > +/* Structure to contain per-packet metadata that must be attributed to the
> > > > + * dp netdev flow. This is unfortunate to have to track per packet, however
> > > > + * it's a bit awkward to maintain them in a performant way. This structure
> > > > + * helps to keep two variables on a single cache line per packet.
> > > > + */
> > > > +struct pkt_flow_meta {
> > > > +    uint16_t bytes;
> > > > +    uint16_t tcp_flags;
> > > > +};
> > > > +
> > > > +/* Structure of heap allocated memory for DPIF internals. */
> > > > +struct dpif_userdata {
> > > > +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
> > > > +        struct netdev_flow_key keys[NETDEV_MAX_BURST];
> > > > +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
> > > > +        struct netdev_flow_key *key_ptrs[NETDEV_MAX_BURST];
> > > > +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
> > > > +        struct pkt_flow_meta pkt_meta[NETDEV_MAX_BURST];
> > > > +};
> > > > +
> > > > +int32_t
> > > > +dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd,
> > > > +                             struct dp_packet_batch *packets,
> > > > +                             odp_port_t in_port)
> > > > +{
> > > > +    /* Allocate DPIF userdata. */
> > > > +    if (OVS_UNLIKELY(!pmd->netdev_input_func_userdata)) {
> > > > +        pmd->netdev_input_func_userdata =
> > > > +                xmalloc_pagealign(sizeof(struct dpif_userdata));
> > > > +    }
> > > > +
> > > > +    struct dpif_userdata *ud = pmd->netdev_input_func_userdata;
> > > > +    struct netdev_flow_key *keys = ud->keys;
> > > > +    struct netdev_flow_key **key_ptrs = ud->key_ptrs;
> > > > +    struct pkt_flow_meta *pkt_meta = ud->pkt_meta;
> > > > +
> > > > +    /* The AVX512 DPIF implementation handles rules in a way that is
> > > > optimized
> > > > +     * for reducing data-movement between HWOL/EMC/SMC and DPCLS.
> > > > This is
> > > > +     * achieved by separating the rule arrays. Bitmasks are kept for each
> > > > +     * packet, indicating if it matched in the HWOL/EMC/SMC array or DPCLS
> > > > +     * array. Later the two arrays are merged by AVX-512 expand
> > instructions.
> > > > +     */
> > > > +
> > > > +    /* Stores the computed output: a rule pointer for each packet. */
> > > > +    struct dpcls_rule *rules[NETDEV_MAX_BURST];
> > > > +    struct dpcls_rule *dpcls_rules[NETDEV_MAX_BURST];
> > >
> > > At first glance there's not much difference between rules and dpcls_rules
> > > above; maybe it becomes clearer through the code later, but a one-line
> > > comment on each to explain their use would be beneficial.
> > >
> >
> > Good point. Let's make this clear. I'll add comments in the next version to
> > highlight that "*rules[]" is for storing rule pointers from HWOL/EMC/SMC and
> > "*dpcls_rules[]" is for rule pointers from DPCLS.
> >
> > > > +    uint32_t dpcls_key_idx = 0;
> > > > +
> > > > +    for (uint32_t i = 0; i < NETDEV_MAX_BURST; i += 8) {
> > >
> > > Magic number 8 above. As you are using it to index into the array of rules I can
> > > see why you would use it in this form if it was a one-off.
> > >
> > > But from a quick glimpse of the code I can see "8" being used to index into the
> > > arrays multiple times, so I'd suggest a define equal to 8 at
> > > the beginning of the file in this case instead.
> > >
> >
> > Unless you feel strongly about the #define, I prefer having the "8" number at the
> > code locations where it's used rather than having to remember the value of a
> > macro called "UINT64_PER_ZMM" or something similar.
> >
> > Let me know what you think.
> 
> So if it's a one-off I think it's ok, but if it's used more than once in different areas of the code I'd prefer to see a #define.
> 
> If you are aware of other areas in the code that will use it I would suggest using the define.
> 

I've refactored the code to use a #define here. I'll send it in the next version.

> >
> > > > +        _mm512_storeu_si512(&rules[i], _mm512_setzero_si512());
> > > > +        _mm512_storeu_si512(&dpcls_rules[i], _mm512_setzero_si512());
> > > > +    }
> > >
> > > So from the above, the first operation is to set all elements in rules and
> > > dpcls_rules to 0; as they were just allocated, the values may not have been
> > > set to zero already, correct?
> > >
> >
> > Correct, this is a faster way of initializing the entire rules and dpcls_rules arrays
> > to 0.
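
As a scalar reference point (illustrative only; NETDEV_MAX_BURST assumed to be 32 here), the same initialization is just two memsets, with each AVX512 store clearing 64 bytes, i.e. 8 rule pointers, at a time:

```c
#include <string.h>

#define NETDEV_MAX_BURST 32 /* assumed value for illustration */

/* Scalar equivalent of the AVX512 zero-initialization loop: each
 * _mm512_storeu_si512() of _mm512_setzero_si512() clears 64 bytes,
 * i.e. 8 rule pointers, per iteration. */
static void
zero_rule_arrays(void *rules[NETDEV_MAX_BURST],
                 void *dpcls_rules[NETDEV_MAX_BURST])
{
    memset(rules, 0, NETDEV_MAX_BURST * sizeof rules[0]);
    memset(dpcls_rules, 0, NETDEV_MAX_BURST * sizeof dpcls_rules[0]);
}
```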
> >
> > > > +
> > > > +    /* Prefetch each packet's metadata. */
> > > > +    const size_t batch_size = dp_packet_batch_size(packets);
> > > > +    for (int i = 0; i < batch_size; i++) {
> > > > +        struct dp_packet *packet = packets->packets[i];
> > > > +        OVS_PREFETCH(dp_packet_data(packet));
> > > > +        pkt_metadata_prefetch_init(&packet->md);
> > > > +    }
> > > > +
> > > > +    /* Check if EMC or SMC are enabled. */
> > > > +    struct dfc_cache *cache = &pmd->flow_cache;
> > > > +    const uint32_t emc_enabled = pmd->ctx.emc_insert_min != 0;
> > > > +    const uint32_t smc_enabled = pmd->ctx.smc_enable_db;
> > >
> > > I assume there is a lock on the pmd being called here (i.e. before
> > > dp_netdev_input_outer_avx512 is called) so as to avoid these pmd
> > > values being changed while this check is occurring?
> > >
> >
> > I think there's no lock on the pmd. These "pmd->ctx.*mc" values are set by the
> > pmd thread in pmd_thread_main() and are private to each thread. Then
> > pmd_thread_main() calls into either scalar or AVX512 DPIFs and reads the
> > values. They can't be changed by another thread in the meantime. They are only
> > set by the thread which will use the values. Hopefully that makes sense.
> >
> 
> OK that makes sense.
> 
> > > > +
> > > > +    uint32_t emc_hits = 0;
> > > > +    uint32_t smc_hits = 0;
> > > > +
> > > > +    /* A 1 bit in this mask indicates a hit, so no DPCLS lookup on the pkt. */
> > > > +    uint32_t hwol_emc_smc_hitmask = 0;
> > > > +
> > > > +    /* Perform first packet interation. */
> > >
> > > Minor typo above, iteration.
> > >
> >
> > Good catch, I'll fix this in the next version.
> >
> > > > +    uint32_t lookup_pkts_bitmask = (1ULL << batch_size) - 1;
> > > > +    uint32_t iter = lookup_pkts_bitmask;
> > > > +    while (iter) {
> > > > +        uint32_t i = __builtin_ctz(iter);
> > > > +        iter = _blsr_u64(iter);
> > >
> > > Having some trouble understanding the logic above for iter, i and
> > lookup_pkts_bitmask.
> > >
> > > From what I can tell, lookup_pkts_bitmask will represent the number of
> > packets in a batch that require a lookup?
> > >
> > > Set iter equal to the lookup_pkts_bitmask (assuming lookup_pkts_bitmask will
> > > be required later so must be unchanged until then).
> > >
> >
> > Yes, all the above is true.
> >
> > > For "i", set it to the number of trailing zeroes following the LSB in iter. So at
> > > this point, does each trailing zero represent a packet yet
> > > to be extracted? Are you expecting that it has already hit the EMC/SMC/HWOL
> > > here, or is this for a packet that has not hit any of these
> > > yet?
> > >
> >
> > Each trailing zero represents a packet that has already been extracted and
> > looked up with HWOL/EMC/SMC if any of those are enabled. "i" is for all packets
> > that have entered this AVX512 DPIF.
> >
> > > Also __builtin_ctz can return an undefined result; I'm thinking is there a case
> > > to be handled here for that situation?
> > >
> >
> > I think this isn't an issue. From
> > https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html:
> > Built-in Function: int __builtin_ctz (unsigned int x)
> > Returns the number of trailing 0-bits in x, starting at the least significant bit
> > position. If x is 0, the result is undefined.
> >
> > We check whether "iter" is 0 as part of the loop "while (iter)". So we shouldn't
> > get an undefined result.
> >
> 
> Yes, you're correct.
> 
> > > Finally, iter is set to _blsr_u64(iter). As this is the only change I saw that
> > > would influence iter to break the while loop, could you
> > > explain the expected operation here?
> > >
> > > From the intrinsic guide on _blsr_u64:
> > >
> > > Copy all bits from unsigned 64-bit integer a to dst, and reset (set to 0) the bit in
> > dst that corresponds to the lowest set bit in a.
> > >
> > > From the initial comment before this block, I assumed this is a one-off
> > > operation on the first packet, but is it the case that it happens for all
> > > packets?
> > >
> >
> > The expected operation here is that we have a number of bits set in "iter"
> > representing packets that we want to process (HWOL->MFEX->EMC->SMC). As
> > each packet is processed, we clear (set to 0) the bit representing that packet
> > using "_blsr_u64()". The "__builtin_ctz ()" will give us the correct index into the
> > "packets", "pkt_meta", "keys" and "rules" arrays.
> >
> > Let's walk through the below code with a 4 packet batch example. Let's
> > represent "iter" in binary.
> >
> > while (iter) {
> >     uint32_t i = __builtin_ctz(iter);
> >     iter = _blsr_u64(iter);
> >
> > Replace the "while" with an "if" and write out each iteration
> > if (iter) { // iter = 1111
> >     uint32_t i = __builtin_ctz(iter); // i = 0
> >     iter = _blsr_u64(iter); // iter = 1110
> >     // do all processing (HWOL->MFEX->EMC->SMC)
> > }
> > if (iter) { // iter = 1110
> >     uint32_t i = __builtin_ctz(iter); // i = 1
> >     iter = _blsr_u64(iter); // iter = 1100
> >     // do all processing (HWOL->MFEX->EMC->SMC)
> > }
> > if (iter) { // iter = 1100
> >     uint32_t i = __builtin_ctz(iter); // i = 2
> >     iter = _blsr_u64(iter); // iter = 1000
> >     // do all processing (HWOL->MFEX->EMC->SMC)
> > }
> > if (iter) { // iter = 1000
> >     uint32_t i = __builtin_ctz(iter); // i = 3
> >     iter = _blsr_u64(iter); // iter = 0000
> >     // do all processing (HWOL->MFEX->EMC->SMC)
> > }
> > if (iter) { // iter = 0000
> >     // fail if check, this isn't reached, move on to DPCLS code.
> > }
> >
> > Hopefully that makes sense.
> >
> 
> That's really helpful. Can I make a suggestion to provide a summarized example of this as a comment before the function itself?
> As this is the expected behavior, it will make it easier to test against/modify if needs be in future. It does not have to be as detailed
> as above (maybe even just the 1100 example). I think long term it would help with maintainability.
> 

I've added a comment with this explanation, thanks for the suggestion!
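
For reference, the pattern from that walkthrough can also be written as a small compilable sketch (portable, with x & (x - 1) standing in for _blsr_u64; illustrative only, not the patch code):

```c
#include <stdint.h>

/* Emulates the packet-iteration pattern: __builtin_ctz() yields the
 * index of the next set bit (the next packet to process), and clearing
 * the lowest set bit retires it.  Records the visited indices and
 * returns how many bits were set.  The real loop does the
 * HWOL->MFEX->EMC->SMC work at each index. */
static int
iterate_packets(uint32_t lookup_pkts_bitmask, uint32_t indices_out[32])
{
    uint32_t iter = lookup_pkts_bitmask;
    int n = 0;

    while (iter) {
        uint32_t i = __builtin_ctz(iter); /* never called with iter == 0 */
        iter &= iter - 1;                 /* portable _blsr_u64() */
        indices_out[n++] = i;
    }
    return n;
}
```

So for a 4-packet batch (mask 1111) the visited indices are 0, 1, 2, 3, exactly as in the unrolled example.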

> > > > +
> > > > +        /* Initialize packet md and do miniflow extract. */
> > > > +        struct dp_packet *packet = packets->packets[i];
> > > > +        pkt_metadata_init(&packet->md, in_port);
> > > > +        struct netdev_flow_key *key = &keys[i];
> > > > +        miniflow_extract(packet, &key->mf);
> > > > +
> > > > +        /* Cache TCP and byte values for all packets. */
> > > > +        pkt_meta[i].bytes = dp_packet_size(packet);
> > > > +        pkt_meta[i].tcp_flags = miniflow_get_tcp_flags(&key->mf);
> > > > +
> > > > +        key->len = netdev_flow_key_size(miniflow_n_values(&key->mf));
> > > > +        key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
> > > > &key->mf);
> > > > +
> > > > +        struct dp_netdev_flow *f = NULL;
> > > > +
> > > > +        if (emc_enabled) {
> > > > +            f = emc_lookup(&cache->emc_cache, key);
> > > > +
> > > > +            if (f) {
> > > > +                rules[i] = &f->cr;
> > > > +                emc_hits++;
> > > > +                hwol_emc_smc_hitmask |= (1 << i);
> > >
> > > So HWOL may be disabled, but the assumption here would be that if it is
> > enabled you would have a hit, is that correct?
> > >
> >
> > We implement HWOL in a similar way to the scalar DPIF. We
> > don't check recirculation depth since the AVX512 DPIF is for outer packets only.
> > Otherwise, the checks are the same as with the scalar DPIF. So basically the
> > checks are:
> > 1. Does the packet have a flow mark?
> > 2. If so, does the flow mark match an actual flow?
> >
> > So there is no assumption that there will be a hit. We just check whether there is
> > a hit.
> 
> Understood.
> 
> >
> > > I'm wondering, is there a case with this logic where you have a traffic type
> > > that we have a hit for in EMC/SMC but that possibly is not
> > > supported by HWOL, and as such you may not have a hit?
> > >
> >
> > Yes, this is possible. When there is no HWOL hit, we will continue to EMC and
> > SMC lookups. These will be performed if they are enabled respectively.
> >
> 
> Thanks for clarifying.
> 
> > > > +                continue;
> > > > +            }
> > > > +        };
> > >
> > > Is the semi colon a typo above?
> > >
> >
> > Good catch, I'll remove this in the next version.
> >
> > > > +
> > > > +        if (smc_enabled && !f) {
> > > > +            f = smc_lookup_single(pmd, packet, key);
> > > > +            if (f) {
> > > > +                rules[i] = &f->cr;
> > > > +                smc_hits++;
> > > > +                hwol_emc_smc_hitmask |= (1 << i);
> > > > +                continue;
> > > > +            }
> > > > +        }
> > > > +
> > > > +        /* The flow pointer was not found in HWOL/EMC/SMC, so add it to the
> > > > +         * dpcls input keys array for batch lookup later.
> > > > +         */
> > > > +        key_ptrs[dpcls_key_idx] = &keys[i];
> > > > +        dpcls_key_idx++;
> > > > +    }
> > > > +
> > > > +
> > > > +    /* DPCLS handles any packets missed by HWOL/EMC/SMC. It operates on
> > > > the
> > > > +     * key_ptrs[] for input miniflows to match, storing results in the
> > > > +     * dpcls_rules[] array.
> > > > +     */
> > > > +    if (dpcls_key_idx > 0) {
> > > > +        struct dpcls *cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
> > > > +        if (OVS_UNLIKELY(!cls)) {
> > > > +            return -1;
> > > > +        }
> > > > +        int any_miss = !dpcls_lookup(cls,
> > > > +                                    (const struct netdev_flow_key **) key_ptrs,
> > > > +                                    dpcls_rules, dpcls_key_idx, NULL);
> > >
> > > Is there a reason you've used int any_miss rather than a bool above?
> > >
> >
> > This is just personal preference. I'll change this to use a bool in the next version.
> >
> > > dpcls_lookup returns bool anyway (true if all entries are found, otherwise
> > > false), so it could avoid the ! operand on the call to
> > > dpcls_lookup, no?
> > >
> >
> > We call "dpcls_lookup" in a similar way to the scalar DPIF here, with the "!"
> > operand to represent "any_miss" and branch on this afterwards. Hopefully that's
> > OK.
> 
> I think it is, I just had to re-read it at first.
> 
> >
> > > > +        if (OVS_UNLIKELY(any_miss)) {
> > > > +            return -1;
> > > > +        }
> > > > +
> > > > +        /* Merge DPCLS rules and HWOL/EMC/SMC rules. */
> > > > +        uint32_t dpcls_idx = 0;
> > > > +        for (int i = 0; i < NETDEV_MAX_BURST; i += 8) {
> > > > +            /* Indexing here is somewhat complicated due to DPCLS output rule
> > > > +             * load index depending on the hitmask of HWOL/EMC/SMC. More
> > > > +             * packets from HWOL/EMC/SMC bitmask means less DPCLS rules
> > are
> > > > +             * used.
> > > > +             */
> > > > +            __m512i v_cache_rules = _mm512_loadu_si512(&rules[i]);
> > > > +            __m512i v_merged_rules =
> > > > +                        _mm512_mask_expandloadu_epi64(v_cache_rules,
> > > > +                                                      ~hwol_emc_smc_hitmask,
> > > > +                                                      &dpcls_rules[dpcls_idx]);
> > > To clarify above, where is the destination for the memory load here? Is it
> > v_merged_rules?
> > >
> >
> > We are storing the result of the memory load to "v_merged_rules".
> >
> > > Also, is it that the mask is interacting with dpcls_rules[dpcls_idx]? i.e. for
> > > each bit set in the mask take the value from dpcls_rules, but
> > > when the bit is not set take the value from v_cache_rules; just trying to get
> > > my head around the operation of how the merged rules
> > > would look.
> > >
> >
> > Yes, you are correct. This is where we are merging some elements of the
> > "dpcls_rules" array and some elements of the "rules" array. For the
> > "_mm512_mask_expandloadu_epi64()", we are loading from "dpcls_rules" or
> > "v_cache_rules" (which is the "rules" array loaded into an AVX512 zmm register).
> > When a bit in the mask is set, we take that "dpcls_rule". Otherwise, we use the
> > "rule".
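
A scalar model of that merge may make it easier to follow (illustrative sketch only; in the patch the mask is ~hwol_emc_smc_hitmask, so a set bit means "take the next DPCLS rule"):

```c
#include <stdint.h>

/* Scalar model of _mm512_mask_expandloadu_epi64(): for each of the 8
 * lanes, a set mask bit takes the next consecutive element from
 * 'dpcls_rules' (the "expand" part), while a clear bit keeps the
 * existing HWOL/EMC/SMC rule pointer from 'cache_rules'.  Returns the
 * number of dpcls entries consumed, which equals popcount(mask) and is
 * what the real code uses to advance dpcls_idx. */
static int
merge_rules_lane8(void *merged[8], void *cache_rules[8],
                  void *dpcls_rules[], uint8_t mask)
{
    int consumed = 0;

    for (int lane = 0; lane < 8; lane++) {
        if (mask & (1u << lane)) {
            merged[lane] = dpcls_rules[consumed++];
        } else {
            merged[lane] = cache_rules[lane];
        }
    }
    return consumed;
}
```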
> >
> > > > +            _mm512_storeu_si512(&rules[i], v_merged_rules);
> > > > +
> > > > +            /* Update DPCLS load index and bitmask for HWOL/EMC/SMC hits.
> > > > +             * There are 8 output pointer per register, subtract the
> > > > +             * HWOL/EMC/SMC lanes equals the number of DPCLS rules
> > > > consumed.
> > > > +             */
> > > > +            uint32_t hitmask_FF = (hwol_emc_smc_hitmask & 0xFF);
> > > > +            dpcls_idx += 8 - __builtin_popcountll(hitmask_FF);
> > > > +            hwol_emc_smc_hitmask = (hwol_emc_smc_hitmask >> 8);
> > > > +        }
> > > > +    }
> > > > +
> > > > +    /* At this point we don't return error anymore, so commit stats here. */
> > > > +    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_RECV,
> > > > batch_size);
> > > > +    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
> > > > emc_hits);
> > > > +    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT,
> > > > smc_hits);
> > > > +    pmd_perf_update_counter(&pmd->perf_stats,
> > > > PMD_STAT_MASKED_HIT,
> > > > +                            dpcls_key_idx);
> > > > +    pmd_perf_update_counter(&pmd->perf_stats,
> > > > PMD_STAT_MASKED_LOOKUP,
> > > > +                            dpcls_key_idx);
> > > > +
> > > > +    /* Initialize the "Action Batch" for each flow handled below. */
> > > > +    struct dp_packet_batch action_batch;
> > > > +    action_batch.trunc = 0;
> > > > +    action_batch.do_not_steal = false;
> > > > +
> > > > +    while (lookup_pkts_bitmask) {
> > > > +        uint32_t rule_pkt_idx = __builtin_ctz(lookup_pkts_bitmask);
> > > > +        uint64_t needle = (uintptr_t) rules[rule_pkt_idx];
> > > > +
> > > > +        /* Parallel compare 8 flow* 's to the needle, create a bitmask. */
> > > > +        uint32_t batch_bitmask = 0;
> > > > +        for (uint32_t j = 0; j < NETDEV_MAX_BURST; j += 8) {
> > > > +            /* Pre-calculate store addr */
> > >
> > > Minor, period missing above in comment.
> > >
> >
> > Good catch, I'll fix this in the next version.
> >
> > > > +            uint32_t num_pkts_in_batch = __builtin_popcountll(batch_bitmask);
> > >
> > > So num_pkts_in_batch will always be 0 on first iteration as batch_bitmask will
> > be equal to zero?
> > >
> >
> > Correct.
> >
> > > A little unsure here of the ordering, but I guess the key is to update the
> > > bitmask for the next iteration?
> > >
> >
> > The popcount of "batch_bitmask" looks silly for the first iteration of the for
> > loop, but makes sense for the subsequent iterations of the for loop. We
> > calculate "num_pkts_in_batch" to get the correct index to use for the
> > "action_batch" of packets.
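
To illustrate how that index feeds the compress-store, here is a scalar sketch of the whole batching step (illustrative only; a 32-packet burst processed in 8-lane chunks, modelling _mm512_cmpeq_epi64_mask() plus _mm512_mask_compressstoreu_epi64()):

```c
#include <stdint.h>

/* Scalar model of the batching step: the popcount of the bitmask built
 * so far gives the next free slot in the action batch, and the
 * compress-store packs the packets whose rule pointer equals 'needle'
 * contiguously from that slot, one 8-lane chunk at a time.  Returns
 * the bitmask of matching packets. */
static uint32_t
batch_matching_packets(void *batch_out[32], void *packets[32],
                       void *rules[32], const void *needle)
{
    uint32_t batch_bitmask = 0;

    for (int j = 0; j < 32; j += 8) {
        /* Pre-calculate the store address, as in the patch. */
        uint32_t num_pkts_in_batch = __builtin_popcountll(batch_bitmask);
        void **store = &batch_out[num_pkts_in_batch];

        /* Model of _mm512_cmpeq_epi64_mask(): one bit per lane. */
        uint32_t cmp_bits = 0;
        for (int lane = 0; lane < 8; lane++) {
            if (rules[j + lane] == needle) {
                cmp_bits |= 1u << lane;
            }
        }
        batch_bitmask |= cmp_bits << j;

        /* Model of the compress-store: matching lanes packed densely. */
        int out = 0;
        for (int lane = 0; lane < 8; lane++) {
            if (cmp_bits & (1u << lane)) {
                store[out++] = packets[j + lane];
            }
        }
    }
    return batch_bitmask;
}
```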
> >
> 
> Sure, that confirms what I thought.
> 
> > > > +            void *store_addr = &action_batch.packets[num_pkts_in_batch];
> > > > +
> > > > +            /* Search for identical flow* in burst, update bitmask. */
> > > > +            __m512i v_needle = _mm512_set1_epi64(needle);
> > > > +            __m512i v_hay = _mm512_loadu_si512(&rules[j]);
> > > > +            __mmask8 k_cmp_bits = _mm512_cmpeq_epi64_mask(v_needle,
> > > > v_hay);
> > > > +            uint32_t cmp_bits = k_cmp_bits;
> > > > +            batch_bitmask |= cmp_bits << j;
> > > > +
> > > > +            /* Compress and store the batched packets. */
> > > > +            struct dp_packet **packets_ptrs = &packets->packets[j];
> > > > +            __m512i v_pkt_ptrs = _mm512_loadu_si512(packets_ptrs);
> > > > +            _mm512_mask_compressstoreu_epi64(store_addr, cmp_bits,
> > > > v_pkt_ptrs);
> > > > +        }
> > > > +
> > > > +        /* Strip all packets in this batch from the lookup_pkts_bitmask. */
> > > > +        lookup_pkts_bitmask &= (~batch_bitmask);
> > > > +        action_batch.count = __builtin_popcountll(batch_bitmask);
> > > > +
> > > > +        /* Loop over all packets in this batch, to gather the byte and tcp_flag
> > > > +         * values, and pass them to the execute function. It would be nice to
> > > > +         * optimize this away, however it is not easy to refactor in dpif.
> > > > +         */
> > > > +        uint32_t bytes = 0;
> > > > +        uint16_t tcp_flags = 0;
> > > > +        uint32_t bitmask_iter = batch_bitmask;
> > > > +        for (int i = 0; i < action_batch.count; i++) {
> > > > +            uint32_t idx = __builtin_ctzll(bitmask_iter);
> > > > +            bitmask_iter = _blsr_u64(bitmask_iter);
> > > > +
> > > > +            bytes += pkt_meta[idx].bytes;
> > > > +            tcp_flags |= pkt_meta[idx].tcp_flags;
> > > > +        }
> > > > +
> > > > +        dp_netdev_batch_execute(pmd, &action_batch, rules[rule_pkt_idx],
> > > > +                                bytes, tcp_flags);
> > > > +    }
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +#endif
> > > > +#endif
> > > > diff --git a/lib/dpif-netdev-private-dfc.h b/lib/dpif-netdev-private-dfc.h
> > > > index 52349a3fc..bd18bd3fd 100644
> > > > --- a/lib/dpif-netdev-private-dfc.h
> > > > +++ b/lib/dpif-netdev-private-dfc.h
> > > > @@ -81,6 +81,9 @@ extern "C" {
> > > >  #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
> > > >                                      DEFAULT_EM_FLOW_INSERT_INV_PROB)
> > > >
> > > > +/* Forward declaration for SMC function prototype. */
> > >
> > > A little vague, maybe extend with "prototypes that require access to
> > > dp_netdev_pmd_thread".
> > >
> >
> > I'll add more detail in the next version, like this:
> > /* Forward declaration for SMC function prototype that requires access to
> >  * 'struct dp_netdev_pmd_thread'. */
> >
> 
> Sounds good.
> 
> > > > +struct dp_netdev_pmd_thread;
> > > > +
> > > >  struct emc_entry {
> > > >      struct dp_netdev_flow *flow;
> > > >      struct netdev_flow_key key;   /* key.hash used for emc hash value. */
> > > > @@ -237,6 +240,11 @@ emc_lookup(struct emc_cache *cache, const struct
> > > > netdev_flow_key *key)
> > > >      return NULL;
> > > >  }
> > > >
> > > > +struct dp_netdev_flow *
> > > > +smc_lookup_single(struct dp_netdev_pmd_thread *pmd,
> > > > +                  struct dp_packet *packet,
> > > > +                  struct netdev_flow_key *key);
> > > > +
> > > >  #ifdef  __cplusplus
> > > >  }
> > > >  #endif
> > > > diff --git a/lib/dpif-netdev-private-dpif.h b/lib/dpif-netdev-private-dpif.h
> > > > new file mode 100644
> > > > index 000000000..2fd7cc400
> > > > --- /dev/null
> > > > +++ b/lib/dpif-netdev-private-dpif.h
> > > > @@ -0,0 +1,32 @@
> > > > +/*
> > > > + * Copyright (c) 2021 Intel Corporation.
> > > > + *
> > > > + * Licensed under the Apache License, Version 2.0 (the "License");
> > > > + * you may not use this file except in compliance with the License.
> > > > + * You may obtain a copy of the License at:
> > > > + *
> > > > + *     http://www.apache.org/licenses/LICENSE-2.0
> > > > + *
> > > > + * Unless required by applicable law or agreed to in writing, software
> > > > + * distributed under the License is distributed on an "AS IS" BASIS,
> > > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > > > implied.
> > > > + * See the License for the specific language governing permissions and
> > > > + * limitations under the License.
> > > > + */
> > > > +
> > > > +#ifndef DPIF_NETDEV_PRIVATE_DPIF_H
> > > > +#define DPIF_NETDEV_PRIVATE_DPIF_H 1
> > > > +
> > > > +#include "openvswitch/types.h"
> > > > +
> > > > +/* Forward declarations to avoid including files. */
> > > > +struct dp_netdev_pmd_thread;
> > > > +struct dp_packet_batch;
> > > > +
> > > > +/* Available implementations for dpif work. */
> > > > +int32_t
> > > > +dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd,
> > > > +                             struct dp_packet_batch *packets,
> > > > +                             odp_port_t in_port);
> > >
> > > Just on the function name dp_netdev_input_outer_avx512 above.
> > >
> > > Is it likely to see a separate inner equivalent in the future? And would it be a
> > > separate function such as
> > > dp_netdev_input_inner_avx512, or would it just be one function handling both
> > > cases?
> > >
> >
> > We want to add support for recirculation, or handling of the inner headers of
> > packets to provide more performance benefits. Whether it will be separate
> > functions is a good question. We'll need to look closer when it comes to actually
> > adding support for inner packets and find the appropriate solution.
> >
> > I guess naming this with "outer" hopefully makes sense for now since it only
> > handles a packet's first pass through OVS.
> 
> Sure, outer makes sense for the moment, could be updated at a later stage depending on the implementation.
> 
> >
> > > > +
> > > > +#endif /* netdev-private.h */
> > > > diff --git a/lib/dpif-netdev-private-thread.h b/lib/dpif-netdev-private-
> > > > thread.h
> > > > index 01a28a681..c0c94c566 100644
> > > > --- a/lib/dpif-netdev-private-thread.h
> > > > +++ b/lib/dpif-netdev-private-thread.h
> > > > @@ -45,14 +45,19 @@ struct dp_netdev_pmd_thread_ctx {
> > > >      struct dp_netdev_rxq *last_rxq;
> > > >      /* EMC insertion probability context for the current processing cycle. */
> > > >      uint32_t emc_insert_min;
> > > > +    /* Enable the SMC cache from ovsdb config. */
> > > > +    bool smc_enable_db;
> > > >  };
> > > >
> > > >  /* Forward declaration for typedef. */
> > > >  struct dp_netdev_pmd_thread;
> > > >
> > > > -typedef void (*dp_netdev_input_func)(struct dp_netdev_pmd_thread
> > > > *pmd,
> > > > -                                     struct dp_packet_batch *packets,
> > > > -                                     odp_port_t port_no);
> > > > +/* Typedef for DPIF functions.
> > > > + * Returns a bitmask of packets to handle, possibly including upcall/misses.
> > > > + */
> > > > +typedef int32_t (*dp_netdev_input_func)(struct dp_netdev_pmd_thread
> > > > *pmd,
> > > > +                                        struct dp_packet_batch *packets,
> > > > +                                        odp_port_t port_no);
> > > >
> > > >  /* PMD: Poll modes drivers.  PMD accesses devices via polling to eliminate
> > > >   * the performance overhead of interrupt processing.  Therefore netdev can
> > > > diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h
> > > > index d7b6fd7ec..0315b5bf6 100644
> > > > --- a/lib/dpif-netdev-private.h
> > > > +++ b/lib/dpif-netdev-private.h
> > > > @@ -31,4 +31,29 @@
> > > >  #include "dpif-netdev-private-dfc.h"
> > > >  #include "dpif-netdev-private-thread.h"
> > > >
> > > > +/* Allow other implementations to lookup the DPCLS instances. */
> > > > +struct dpcls *
> > > > +dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
> > > > +                           odp_port_t in_port);
> > > > +
> > > > +/* Allow other implementations to call dpcls_lookup() for subtable search.
> > > > */
> > > > +bool
> > > > +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
> > > > +             struct dpcls_rule **rules, const size_t cnt,
> > > > +             int *num_lookups_p);
> > > > +
> > > > +/* Allow other implementations to execute actions on a batch. */
> > > > +void
> > > > +dp_netdev_batch_execute(struct dp_netdev_pmd_thread *pmd,
> > > > +                        struct dp_packet_batch *packets,
> > > > +                        struct dpcls_rule *rule,
> > > > +                        uint32_t bytes,
> > > > +                        uint16_t tcp_flags);
> > > > +
> > > > +/* Available implementations for dpif work. */
> > > > +int32_t
> > > > +dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd,
> > > > +                             struct dp_packet_batch *packets,
> > > > +                             odp_port_t in_port);
> > > > +
> > > >  #endif /* netdev-private.h */
> > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > > > index bec984643..5ed61d08b 100644
> > > > --- a/lib/dpif-netdev.c
> > > > +++ b/lib/dpif-netdev.c
> > > > @@ -185,10 +185,6 @@ static uint32_t
> > > > dpcls_subtable_lookup_reprobe(struct dpcls *cls);
> > > >  static void dpcls_insert(struct dpcls *, struct dpcls_rule *,
> > > >                           const struct netdev_flow_key *mask);
> > > >  static void dpcls_remove(struct dpcls *, struct dpcls_rule *);
> > > > -static bool dpcls_lookup(struct dpcls *cls,
> > > > -                         const struct netdev_flow_key *keys[],
> > > > -                         struct dpcls_rule **rules, size_t cnt,
> > > > -                         int *num_lookups_p);
> > > >
> > > >  /* Set of supported meter flags */
> > > >  #define DP_SUPPORTED_METER_FLAGS_MASK \
> > > > @@ -485,7 +481,7 @@ static void dp_netdev_execute_actions(struct
> > > > dp_netdev_pmd_thread *pmd,
> > > >                                        const struct flow *flow,
> > > >                                        const struct nlattr *actions,
> > > >                                        size_t actions_len);
> > > > -static void dp_netdev_input(struct dp_netdev_pmd_thread *,
> > > > +static int32_t dp_netdev_input(struct dp_netdev_pmd_thread *,
> > > >                              struct dp_packet_batch *, odp_port_t port_no);
> > > >  static void dp_netdev_recirculate(struct dp_netdev_pmd_thread *,
> > > >                                    struct dp_packet_batch *);
> > > > @@ -557,7 +553,7 @@ dpif_netdev_xps_revalidate_pmd(const struct
> > > > dp_netdev_pmd_thread *pmd,
> > > >                                 bool purge);
> > > >  static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread
> > > > *pmd,
> > > >                                        struct tx_port *tx);
> > > > -static inline struct dpcls *
> > > > +inline struct dpcls *
> > > >  dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
> > > >                             odp_port_t in_port);
> > > >
> > > > @@ -1922,7 +1918,7 @@ void dp_netdev_flow_unref(struct
> > > > dp_netdev_flow *flow)
> > > >      }
> > > >  }
> > > >
> > > > -static inline struct dpcls *
> > > > +inline struct dpcls *
> > > >  dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
> > > >                             odp_port_t in_port)
> > > >  {
> > > > @@ -2722,7 +2718,7 @@ dp_netdev_pmd_lookup_flow(struct
> > > > dp_netdev_pmd_thread *pmd,
> > > >                            int *lookup_num_p)
> > > >  {
> > > >      struct dpcls *cls;
> > > > -    struct dpcls_rule *rule;
> > > > +    struct dpcls_rule *rule = NULL;
> > > >      odp_port_t in_port = u32_to_odp(MINIFLOW_GET_U32(&key->mf,
> > > >                                                       in_port.odp_port));
> > > >      struct dp_netdev_flow *netdev_flow = NULL;
> > > > @@ -4236,7 +4232,10 @@ dp_netdev_process_rxq_port(struct
> > > > dp_netdev_pmd_thread *pmd,
> > > >          }
> > > >
> > > >          /* Process packet batch. */
> > > > -        pmd->netdev_input_func(pmd, &batch, port_no);
> > > > +        int32_t ret = pmd->netdev_input_func(pmd, &batch, port_no);
> > > > +        if (ret) {
> > > > +            dp_netdev_input(pmd, &batch, port_no);
> > > > +        }
> > > >
> > > >          /* Assign processing cycles to rx queue. */
> > > >          cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
> > > > @@ -5254,6 +5253,8 @@ dpif_netdev_run(struct dpif *dpif)
> > > >                      non_pmd->ctx.emc_insert_min = 0;
> > > >                  }
> > > >
> > > > +                non_pmd->ctx.smc_enable_db = dp->smc_enable_db;
> > > > +
> > > >                  for (i = 0; i < port->n_rxq; i++) {
> > > >
> > > >                      if (!netdev_rxq_enabled(port->rxqs[i].rx)) {
> > > > @@ -5525,6 +5526,8 @@ reload:
> > > >                  pmd->ctx.emc_insert_min = 0;
> > > >              }
> > > >
> > > > +            pmd->ctx.smc_enable_db = pmd->dp->smc_enable_db;
> > > > +
> > > >              process_packets =
> > > >                  dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
> > > >                                             poll_list[i].port_no);
> > > > @@ -6419,6 +6422,24 @@ packet_batch_per_flow_execute(struct
> > > > packet_batch_per_flow *batch,
> > > >                                actions->actions, actions->size);
> > > >  }
> > > >
> > > > +void
> > > > +dp_netdev_batch_execute(struct dp_netdev_pmd_thread *pmd,
> > > > +                        struct dp_packet_batch *packets,
> > > > +                        struct dpcls_rule *rule,
> > > > +                        uint32_t bytes,
> > > > +                        uint16_t tcp_flags)
> > > > +{
> > > > +    /* Gets action* from the rule. */
> > > > +    struct dp_netdev_flow *flow = dp_netdev_flow_cast(rule);
> > > > +    struct dp_netdev_actions *actions = dp_netdev_flow_get_actions(flow);
> > > > +
> > > > +    dp_netdev_flow_used(flow, dp_packet_batch_size(packets), bytes,
> > > > +                        tcp_flags, pmd->ctx.now / 1000);
> > > > +    const uint32_t steal = 1;
> > > > +    dp_netdev_execute_actions(pmd, packets, steal, &flow->flow,
> > > > +                              actions->actions, actions->size);
> > > > +}
> > > > +
> > > >  static inline void
> > > >  dp_netdev_queue_batches(struct dp_packet *pkt,
> > > >                          struct dp_netdev_flow *flow, uint16_t tcp_flags,
> > > > @@ -6523,6 +6544,30 @@ smc_lookup_batch(struct
> > > > dp_netdev_pmd_thread *pmd,
> > > >      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT,
> > > > n_smc_hit);
> > > >  }
> > > >
> > > > +struct dp_netdev_flow *
> > > > +smc_lookup_single(struct dp_netdev_pmd_thread *pmd,
> > > > +                  struct dp_packet *packet,
> > > > +                  struct netdev_flow_key *key)
> > > > +{
> > > > +    const struct cmap_node *flow_node = smc_entry_get(pmd, key->hash);
> > > > +
> > > > +    if (OVS_LIKELY(flow_node != NULL)) {
> > > > +        struct dp_netdev_flow *flow = NULL;
> > > > +
> > > > +        CMAP_NODE_FOR_EACH (flow, node, flow_node) {
> > > > +            /* Since we dont have per-port megaflow to check the port
> > > > +             * number, we need to verify that the input ports match. */
> > > > +            if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, key) &&
> > > > +                flow->flow.in_port.odp_port == packet->md.in_port.odp_port)) {
> > > > +
> > > > +                return (void *) flow;
> > > > +            }
> > > > +        }
> > > > +    }
> > > > +
> > > > +    return NULL;
> > > > +}
> > > > +
> > > >  /* Try to process all ('cnt') the 'packets' using only the datapath flow cache
> > > >   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
> > > >   * miniflow is copied into 'keys' and the packet pointer is moved at the
> > > > @@ -6928,12 +6973,13 @@ dp_netdev_input__(struct
> > > > dp_netdev_pmd_thread *pmd,
> > > >      }
> > > >  }
> > > >
> > > > -static void
> > > > +static int32_t
> > > >  dp_netdev_input(struct dp_netdev_pmd_thread *pmd,
> > > >                  struct dp_packet_batch *packets,
> > > >                  odp_port_t port_no)
> > > >  {
> > > >      dp_netdev_input__(pmd, packets, false, port_no);
> > > > +    return 0;
> > >
> > > Always returning 0 here; I think this would change at a later stage in
> > > the patch series, no?
> > >
> >
> > This won't change later in the patch series. The "return 0;" is added so
> > the scalar DPIF implementation matches the newly introduced API for DPIF
> > functions defined by "dp_netdev_input_func" in
> > "lib/dpif-netdev-private-dpif.h". Other DPIF implementations might return a
> > nonzero value to indicate packets that still need handling because of
> > misses in that implementation. "dp_netdev_input" must handle all the cases
> > that other implementations might not, so it will always return 0.
> >
> > > BR
> > > Ian
> >
> > Thanks again for the review Ian.
> 
> No problem, looking forward to the next revision.
> 
> Regards
> Ian
> > > >  }
> > > >
> > > >  static void
> > > > @@ -8374,7 +8420,7 @@ netdev_flow_key_gen_masks(const struct
> > > > netdev_flow_key *tbl,
> > > >
> > > >  /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
> > > >   * in 'mask' the values in 'key' and 'target' are the same. */
> > > > -bool
> > > > +inline bool ALWAYS_INLINE
> > > >  dpcls_rule_matches_key(const struct dpcls_rule *rule,
> > > >                         const struct netdev_flow_key *target)
> > > >  {
> > > > @@ -8400,7 +8446,7 @@ dpcls_rule_matches_key(const struct dpcls_rule
> > > > *rule,
> > > >   * priorities, instead returning any rule which matches the flow.
> > > >   *
> > > >   * Returns true if all miniflows found a corresponding rule. */
> > > > -static bool
> > > > +bool
> > > >  dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
> > > >               struct dpcls_rule **rules, const size_t cnt,
> > > >               int *num_lookups_p)
> > > > --
> > > > 2.31.1
> > > >
> > > > _______________________________________________
> > > > dev mailing list
> > > > dev at openvswitch.org
> > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> >
> 