[ovs-dev] [v12 04/16] dpif-avx512: Add ISA implementation of dpif.

Ferriter, Cian cian.ferriter at intel.com
Thu Jun 10 14:32:43 UTC 2021


Hi Ian,

Thanks for the review. My responses are inline.

> -----Original Message-----
> From: Stokes, Ian <ian.stokes at intel.com>
> Sent: Tuesday 1 June 2021 19:59
> To: Ferriter, Cian <cian.ferriter at intel.com>; ovs-dev at openvswitch.org; Van Haaren, Harry <harry.van.haaren at intel.com>
> Cc: i.maximets at ovn.org
> Subject: RE: [ovs-dev] [v12 04/16] dpif-avx512: Add ISA implementation of dpif.
> 
> > This commit adds the AVX512 implementation of DPIF functionality,
> > specifically the dp_netdev_input_outer_avx512 function. This function
> > only handles outer (no re-circulations), and is optimized to use the
> > AVX512 ISA for packet batching and other DPIF work.
> >
> > Sparse is not able to handle the AVX512 intrinsics, causing compile
> > time failures, so it is disabled for this file.
> >
> > Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> > Co-authored-by: Cian Ferriter <cian.ferriter at intel.com>
> > Signed-off-by: Cian Ferriter <cian.ferriter at intel.com>
> 
> Thanks for the patch Harry/Cian, still testing this to a degree but questions below on initial thoughts.
> 
> >
> > ---
> >
> > v8:
> > - Fixup AVX512 mask to uint32_t conversion compilation warning.
> > ---
> >  lib/automake.mk                  |   5 +-
> >  lib/dpif-netdev-avx512.c         | 265 +++++++++++++++++++++++++++++++
> >  lib/dpif-netdev-private-dfc.h    |   8 +
> >  lib/dpif-netdev-private-dpif.h   |  32 ++++
> >  lib/dpif-netdev-private-thread.h |  11 +-
> >  lib/dpif-netdev-private.h        |  25 +++
> >  lib/dpif-netdev.c                |  70 ++++++--
> >  7 files changed, 400 insertions(+), 16 deletions(-)
> >  create mode 100644 lib/dpif-netdev-avx512.c
> >  create mode 100644 lib/dpif-netdev-private-dpif.h
> >
> > diff --git a/lib/automake.mk b/lib/automake.mk
> > index 0bef0cc69..5fab8ba4f 100644
> > --- a/lib/automake.mk
> > +++ b/lib/automake.mk
> > @@ -33,11 +33,13 @@ lib_libopenvswitchavx512_la_CFLAGS = \
> >  -mavx512f \
> >  -mavx512bw \
> >  -mavx512dq \
> > +-mbmi \
> 
> Can I ask what's needed in bmi that was not already included in bmi2? Just curiosity.
> 

So in the dp_netdev_input_outer_avx512() function (the AVX512 DPIF implementation), '_blsr_u64' is used.
It's used twice to reset (set to 0) the lowest set bit in a variable.

More info on '_blsr_u64':
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_blsr_u64&expand=463

> >  -mbmi2 \
> >  -fPIC \
> >  $(AM_CFLAGS)
> >  lib_libopenvswitchavx512_la_SOURCES = \
> > -lib/dpif-netdev-lookup-avx512-gather.c
> > +lib/dpif-netdev-lookup-avx512-gather.c \
> > +lib/dpif-netdev-avx512.c
> >  lib_libopenvswitchavx512_la_LDFLAGS = \
> >  -static
> >  endif
> > @@ -113,6 +115,7 @@ lib_libopenvswitch_la_SOURCES = \
> >  lib/dpif-netdev.h \
> >  lib/dpif-netdev-private-dfc.h \
> >  lib/dpif-netdev-private-dpcls.h \
> > +lib/dpif-netdev-private-dpif.h \
> >  lib/dpif-netdev-private-flow.h \
> >  lib/dpif-netdev-private-hwol.h \
> >  lib/dpif-netdev-private-thread.h \
> > diff --git a/lib/dpif-netdev-avx512.c b/lib/dpif-netdev-avx512.c
> > new file mode 100644
> > index 000000000..91f51c479
> > --- /dev/null
> > +++ b/lib/dpif-netdev-avx512.c
> > @@ -0,0 +1,265 @@
> > +/*
> > + * Copyright (c) 2021 Intel Corporation.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifdef __x86_64__
> > +/* Sparse cannot handle the AVX512 instructions. */
> 
> So is this a limitation with sparse currently? Do you know if there are any plans for support in sparse for AVX512 in the future?
> 

Yes, unfortunately this is a limitation with sparse. I'm not sure if this will be added in the future.

> > +#if !defined(__CHECKER__)
> > +
> > +#include <config.h>
> > +
> > +#include "dpif-netdev.h"
> > +#include "dpif-netdev-perf.h"
> > +
> > +#include "dpif-netdev-private.h"
> > +#include "dpif-netdev-private-dpcls.h"
> > +#include "dpif-netdev-private-flow.h"
> > +#include "dpif-netdev-private-thread.h"
> > +
> > +#include "dp-packet.h"
> > +#include "netdev.h"
> > +
> > +#include "immintrin.h"
> > +
> > +/* Structure to contain per-packet metadata that must be attributed to the
> > + * dp netdev flow. This is unfortunate to have to track per packet, however
> > + * it's a bit awkward to maintain them in a performant way. This structure
> > + * helps to keep two variables on a single cache line per packet.
> > + */
> > +struct pkt_flow_meta {
> > +    uint16_t bytes;
> > +    uint16_t tcp_flags;
> > +};
> > +
> > +/* Structure of heap allocated memory for DPIF internals. */
> > +struct dpif_userdata {
> > +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
> > +        struct netdev_flow_key keys[NETDEV_MAX_BURST];
> > +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
> > +        struct netdev_flow_key *key_ptrs[NETDEV_MAX_BURST];
> > +    OVS_ALIGNED_VAR(CACHE_LINE_SIZE)
> > +        struct pkt_flow_meta pkt_meta[NETDEV_MAX_BURST];
> > +};
> > +
> > +int32_t
> > +dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd,
> > +                             struct dp_packet_batch *packets,
> > +                             odp_port_t in_port)
> > +{
> > +    /* Allocate DPIF userdata. */
> > +    if (OVS_UNLIKELY(!pmd->netdev_input_func_userdata)) {
> > +        pmd->netdev_input_func_userdata =
> > +                xmalloc_pagealign(sizeof(struct dpif_userdata));
> > +    }
> > +
> > +    struct dpif_userdata *ud = pmd->netdev_input_func_userdata;
> > +    struct netdev_flow_key *keys = ud->keys;
> > +    struct netdev_flow_key **key_ptrs = ud->key_ptrs;
> > +    struct pkt_flow_meta *pkt_meta = ud->pkt_meta;
> > +
> > +    /* The AVX512 DPIF implementation handles rules in a way that is
> > optimized
> > +     * for reducing data-movement between HWOL/EMC/SMC and DPCLS.
> > This is
> > +     * achieved by separating the rule arrays. Bitmasks are kept for each
> > +     * packet, indicating if it matched in the HWOL/EMC/SMC array or DPCLS
> > +     * array. Later the two arrays are merged by AVX-512 expand instructions.
> > +     */
> > +
> > +    /* Stores the computed output: a rule pointer for each packet. */
> > +    struct dpcls_rule *rules[NETDEV_MAX_BURST];
> > +    struct dpcls_rule *dpcls_rules[NETDEV_MAX_BURST];
> 
> At first glimpse there's not much difference between rules and dpcls_rules above, maybe clearer through the code later but  one line
> comment on each to explain their use would be beneficial.
> 

Good point. Let's make this clear. I'll add comments in the next version to highlight that "*rules[]" is for storing rule pointers from HWOL/EMC/SMC and "*dpcls_rules[]" is for rule pointers from DPCLS.

> > +    uint32_t dpcls_key_idx = 0;
> > +
> > +    for (uint32_t i = 0; i < NETDEV_MAX_BURST; i += 8) {
> 
> Magic number 8 above. As you using it to index into the array of rules I can see why you would use it in this form if it was a once off.
> 
> But from a quick glimpse of the code I can see "8" being used to index in the arrays multiple times, I'd suggest a define equal to 8 at
> the beginning of the file in this case instead.
> 

Unless you feel strongly about the #define, I prefer having the "8" at the code locations where it's used, rather than having to remember the value of a macro called "UINT64_PER_ZMM" or something similar.

Let me know what you think.

> > +        _mm512_storeu_si512(&rules[i], _mm512_setzero_si512());
> > +        _mm512_storeu_si512(&dpcls_rules[i], _mm512_setzero_si512());
> > +    }
> 
> So from above the first operation is to set all elements in rules and dpcls_rules to 0, as they were allocated values may not have been
> set to zero already correct?
> 

Correct, this is a faster way of initializing the entire rules and dpcls_rules arrays to 0.

> > +
> > +    /* Prefetch each packet's metadata. */
> > +    const size_t batch_size = dp_packet_batch_size(packets);
> > +    for (int i = 0; i < batch_size; i++) {
> > +        struct dp_packet *packet = packets->packets[i];
> > +        OVS_PREFETCH(dp_packet_data(packet));
> > +        pkt_metadata_prefetch_init(&packet->md);
> > +    }
> > +
> > +    /* Check if EMC or SMC are enabled. */
> > +    struct dfc_cache *cache = &pmd->flow_cache;
> > +    const uint32_t emc_enabled = pmd->ctx.emc_insert_min != 0;
> > +    const uint32_t smc_enabled = pmd->ctx.smc_enable_db;
> 
> I assume there is a lock on the pmd being called here (i.e. before dp_netdev_input_outer_avx512 Is called) so as to avoid these pmd
> values being changed while this check is occurring?
> 

There's no lock on the pmd, and none is needed: these "pmd->ctx.*mc" values are set by the pmd thread itself in pmd_thread_main() and are private to each thread. pmd_thread_main() then calls into either the scalar or the AVX512 DPIF, which reads the values. They can't be changed by another thread in the meantime, since they are only set by the thread that will use them. Hopefully that makes sense.

> > +
> > +    uint32_t emc_hits = 0;
> > +    uint32_t smc_hits = 0;
> > +
> > +    /* A 1 bit in this mask indicates a hit, so no DPCLS lookup on the pkt. */
> > +    uint32_t hwol_emc_smc_hitmask = 0;
> > +
> > +    /* Perform first packet interation. */
> 
> Minor typo above, interaction.
> 

Good catch, I'll fix this in the next version.

> > +    uint32_t lookup_pkts_bitmask = (1ULL << batch_size) - 1;
> > +    uint32_t iter = lookup_pkts_bitmask;
> > +    while (iter) {
> > +        uint32_t i = __builtin_ctz(iter);
> > +        iter = _blsr_u64(iter);
> 
> Having some trouble understanding the logic above for iter, i and lookup_pkts_bitmask.
> 
> From what I can tell, lookup_pkts_bitmask will represent the number of packets in a batch that require a lookup?
> 
> Set iter equal to the lookup_pkst_bitmask (assuming lookup_pkst_bitmask will be required later so must be unchanged until then).
> 

Yes, all the above is true.

> For "i" set it to the number of trailing zeroes following the LSB in in iter. So at this point, does each trailing zero represent a packet yet
> to be extracted? Are you expecting that it has already hit the EMC/SMC/HWOL here or is this for a packet that has not hit any of these
> yet?
> 

Each trailing zero represents a packet that has already been extracted and looked up with HWOL/EMC/SMC if any of those are enabled. "i" is for all packets that have entered this AVX512 DPIF.

> Also __builtin_ctz Can return undefined, I'm thinking is there a case to be handled here for that situation?
> 

I think this isn't an issue. From https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html:
Built-in Function: int __builtin_ctz (unsigned int x)
Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.

We check whether "iter" is 0 as part of the loop "while (iter)". So we shouldn't get an undefined result.

> Finally set iter to _blsr_u64(iter), As this is the only change that I saw that would influence iter to break the while loop, could you
> explain the expected operation here?
> 
> From the intrinsic guide on _blsr_u64:
> 
> Copy all bits from unsigned 64-bit integer a to dst, and reset (set to 0) the bit in dst that corresponds to the lowest set bit in a.
> 
> From the initial comment before this block, I assumed this is a once off operation on the First packet, but is it the case it happens for all
> packets?
> 

The expected operation here is that "iter" has a bit set for each packet that we want to process (HWOL->MFEX->EMC->SMC). As each packet is processed, we clear (set to 0) the bit representing that packet using "_blsr_u64()", and "__builtin_ctz()" gives us the correct index into the "packets", "pkt_meta", "keys" and "rules" arrays.

Let's walk through the code below with a 4-packet batch example, representing "iter" in binary.

while (iter) {
    uint32_t i = __builtin_ctz(iter);
    iter = _blsr_u64(iter);

Replace the "while" with an "if" and write out each iteration
if (iter) { // iter = 1111
    uint32_t i = __builtin_ctz(iter); // i = 0
    iter = _blsr_u64(iter); // iter = 1110
    // do all processing (HWOL->MFEX->EMC->SMC)
}
if (iter) { // iter = 1110
    uint32_t i = __builtin_ctz(iter); // i = 1
    iter = _blsr_u64(iter); // iter = 1100
    // do all processing (HWOL->MFEX->EMC->SMC)
}
if (iter) { // iter = 1100
    uint32_t i = __builtin_ctz(iter); // i = 2
    iter = _blsr_u64(iter); // iter = 1000
    // do all processing (HWOL->MFEX->EMC->SMC)
}
if (iter) { // iter = 1000
    uint32_t i = __builtin_ctz(iter); // i = 3
    iter = _blsr_u64(iter); // iter = 0000
    // do all processing (HWOL->MFEX->EMC->SMC)
}
if (iter) { // iter = 0000
    // fail if check, this isn't reached, move on to DPCLS code.
}

Hopefully that makes sense.

> > +
> > +        /* Initialize packet md and do miniflow extract. */
> > +        struct dp_packet *packet = packets->packets[i];
> > +        pkt_metadata_init(&packet->md, in_port);
> > +        struct netdev_flow_key *key = &keys[i];
> > +        miniflow_extract(packet, &key->mf);
> > +
> > +        /* Cache TCP and byte values for all packets. */
> > +        pkt_meta[i].bytes = dp_packet_size(packet);
> > +        pkt_meta[i].tcp_flags = miniflow_get_tcp_flags(&key->mf);
> > +
> > +        key->len = netdev_flow_key_size(miniflow_n_values(&key->mf));
> > +        key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
> > &key->mf);
> > +
> > +        struct dp_netdev_flow *f = NULL;
> > +
> > +        if (emc_enabled) {
> > +            f = emc_lookup(&cache->emc_cache, key);
> > +
> > +            if (f) {
> > +                rules[i] = &f->cr;
> > +                emc_hits++;
> > +                hwol_emc_smc_hitmask |= (1 << i);
> 
> So HWOL may be disabled, but the assumption here would be that if it is enabled you would have a hit, is that correct?
> 

We implement HWOL in a similar way to the scalar DPIF. We don't check the recirculation depth, since the AVX512 DPIF is for outer packets only. Otherwise, the checks are the same as in the scalar DPIF:
1. Does the packet have a flow mark?
2. If so, does the flow mark match an actual flow?

So there is no assumption that there will be a hit; we just check whether there is one.

> I'm wondering is there a case with this logic that you have a traffic type that we have a hit for in EMC/SMC but that possibly is not
> supported by HWOL and as such you may not have a hit?
> 

Yes, this is possible. When there is no HWOL hit, we continue to the EMC and SMC lookups, each of which is performed only if the corresponding cache is enabled.

> > +                continue;
> > +            }
> > +        };
> 
> Is the semi colon a typo above?
> 

Good catch, I'll remove this in the next version.

> > +
> > +        if (smc_enabled && !f) {
> > +            f = smc_lookup_single(pmd, packet, key);
> > +            if (f) {
> > +                rules[i] = &f->cr;
> > +                smc_hits++;
> > +                hwol_emc_smc_hitmask |= (1 << i);
> > +                continue;
> > +            }
> > +        }
> > +
> > +        /* The flow pointer was not found in HWOL/EMC/SMC, so add it to the
> > +         * dpcls input keys array for batch lookup later.
> > +         */
> > +        key_ptrs[dpcls_key_idx] = &keys[i];
> > +        dpcls_key_idx++;
> > +    }
> > +
> > +
> > +    /* DPCLS handles any packets missed by HWOL/EMC/SMC. It operates on
> > the
> > +     * key_ptrs[] for input miniflows to match, storing results in the
> > +     * dpcls_rules[] array.
> > +     */
> > +    if (dpcls_key_idx > 0) {
> > +        struct dpcls *cls = dp_netdev_pmd_lookup_dpcls(pmd, in_port);
> > +        if (OVS_UNLIKELY(!cls)) {
> > +            return -1;
> > +        }
> > +        int any_miss = !dpcls_lookup(cls,
> > +                                    (const struct netdev_flow_key **) key_ptrs,
> > +                                    dpcls_rules, dpcls_key_idx, NULL);
> 
> Is there a reason you've used int any_miss rather than a bool above?
> 

This is just personal preference. I'll change this to use a bool in the next version.

> dpcls_lookup reutrns bool anyway, true if all entries are found, otherwise false so it could avoid the ! operand also on the call to
> dpcls_lookup no?
> 

We call "dpcls_lookup" in a similar way to the scalar DPIF here, using the "!" operator so the result represents "any_miss", and branch on this afterwards. Hopefully that's OK.

> > +        if (OVS_UNLIKELY(any_miss)) {
> > +            return -1;
> > +        }
> > +
> > +        /* Merge DPCLS rules and HWOL/EMC/SMC rules. */
> > +        uint32_t dpcls_idx = 0;
> > +        for (int i = 0; i < NETDEV_MAX_BURST; i += 8) {
> > +            /* Indexing here is somewhat complicated due to DPCLS output rule
> > +             * load index depending on the hitmask of HWOL/EMC/SMC. More
> > +             * packets from HWOL/EMC/SMC bitmask means less DPCLS rules are
> > +             * used.
> > +             */
> > +            __m512i v_cache_rules = _mm512_loadu_si512(&rules[i]);
> > +            __m512i v_merged_rules =
> > +                        _mm512_mask_expandloadu_epi64(v_cache_rules,
> > +                                                      ~hwol_emc_smc_hitmask,
> > +                                                      &dpcls_rules[dpcls_idx]);
> To clarify above, where is the destination for the memory load here? Is it v_merged_rules?
> 

We are storing the result of the memory load to "v_merged_rules".

> Also is it the mask is interacting with the dpcls_rules[dpcls_idx] ? i.e. for each bit set in mask take the value from dpcls_rules, but
> when the bit is not set take the value from v_cache_rules, just trying to get my head around the operation of how the merged rules
> would look.
> 

Yes, you are correct. This is where we merge some elements of the "dpcls_rules" array with some elements of the "rules" array. For "_mm512_mask_expandloadu_epi64()", we load from "dpcls_rules" or fall back to "v_cache_rules" (the "rules" array loaded into an AVX512 zmm register). When a bit in the mask is set, we take the next "dpcls_rules" element; otherwise, we keep the corresponding "rules" element.

> > +            _mm512_storeu_si512(&rules[i], v_merged_rules);
> > +
> > +            /* Update DPCLS load index and bitmask for HWOL/EMC/SMC hits.
> > +             * There are 8 output pointer per register, subtract the
> > +             * HWOL/EMC/SMC lanes equals the number of DPCLS rules
> > consumed.
> > +             */
> > +            uint32_t hitmask_FF = (hwol_emc_smc_hitmask & 0xFF);
> > +            dpcls_idx += 8 - __builtin_popcountll(hitmask_FF);
> > +            hwol_emc_smc_hitmask = (hwol_emc_smc_hitmask >> 8);
> > +        }
> > +    }
> > +
> > +    /* At this point we don't return error anymore, so commit stats here. */
> > +    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_RECV,
> > batch_size);
> > +    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_EXACT_HIT,
> > emc_hits);
> > +    pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT,
> > smc_hits);
> > +    pmd_perf_update_counter(&pmd->perf_stats,
> > PMD_STAT_MASKED_HIT,
> > +                            dpcls_key_idx);
> > +    pmd_perf_update_counter(&pmd->perf_stats,
> > PMD_STAT_MASKED_LOOKUP,
> > +                            dpcls_key_idx);
> > +
> > +    /* Initialize the "Action Batch" for each flow handled below. */
> > +    struct dp_packet_batch action_batch;
> > +    action_batch.trunc = 0;
> > +    action_batch.do_not_steal = false;
> > +
> > +    while (lookup_pkts_bitmask) {
> > +        uint32_t rule_pkt_idx = __builtin_ctz(lookup_pkts_bitmask);
> > +        uint64_t needle = (uintptr_t) rules[rule_pkt_idx];
> > +
> > +        /* Parallel compare 8 flow* 's to the needle, create a bitmask. */
> > +        uint32_t batch_bitmask = 0;
> > +        for (uint32_t j = 0; j < NETDEV_MAX_BURST; j += 8) {
> > +            /* Pre-calculate store addr */
> 
> Minor, period missing above in comment.
> 

Good catch, I'll fix this in the next version.

> > +            uint32_t num_pkts_in_batch = __builtin_popcountll(batch_bitmask);
> 
> So num_pkts_in_batch will always be 0 on first iteration as batch_bitmask will be equal to zero?
> 

Correct.

> A little unsure here of the ordering but I guess the key is to update the bitmask for the next iteriation?
> 

The popcount of "batch_bitmask" looks redundant on the first iteration of the for loop, but it matters on subsequent iterations: we calculate "num_pkts_in_batch" to get the correct store index into the "action_batch" of packets.

> > +            void *store_addr = &action_batch.packets[num_pkts_in_batch];
> > +
> > +            /* Search for identical flow* in burst, update bitmask. */
> > +            __m512i v_needle = _mm512_set1_epi64(needle);
> > +            __m512i v_hay = _mm512_loadu_si512(&rules[j]);
> > +            __mmask8 k_cmp_bits = _mm512_cmpeq_epi64_mask(v_needle,
> > v_hay);
> > +            uint32_t cmp_bits = k_cmp_bits;
> > +            batch_bitmask |= cmp_bits << j;
> > +
> > +            /* Compress and store the batched packets. */
> > +            struct dp_packet **packets_ptrs = &packets->packets[j];
> > +            __m512i v_pkt_ptrs = _mm512_loadu_si512(packets_ptrs);
> > +            _mm512_mask_compressstoreu_epi64(store_addr, cmp_bits,
> > v_pkt_ptrs);
> > +        }
> > +
> > +        /* Strip all packets in this batch from the lookup_pkts_bitmask. */
> > +        lookup_pkts_bitmask &= (~batch_bitmask);
> > +        action_batch.count = __builtin_popcountll(batch_bitmask);
> > +
> > +        /* Loop over all packets in this batch, to gather the byte and tcp_flag
> > +         * values, and pass them to the execute function. It would be nice to
> > +         * optimize this away, however it is not easy to refactor in dpif.
> > +         */
> > +        uint32_t bytes = 0;
> > +        uint16_t tcp_flags = 0;
> > +        uint32_t bitmask_iter = batch_bitmask;
> > +        for (int i = 0; i < action_batch.count; i++) {
> > +            uint32_t idx = __builtin_ctzll(bitmask_iter);
> > +            bitmask_iter = _blsr_u64(bitmask_iter);
> > +
> > +            bytes += pkt_meta[idx].bytes;
> > +            tcp_flags |= pkt_meta[idx].tcp_flags;
> > +        }
> > +
> > +        dp_netdev_batch_execute(pmd, &action_batch, rules[rule_pkt_idx],
> > +                                bytes, tcp_flags);
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +#endif
> > +#endif
> > diff --git a/lib/dpif-netdev-private-dfc.h b/lib/dpif-netdev-private-dfc.h
> > index 52349a3fc..bd18bd3fd 100644
> > --- a/lib/dpif-netdev-private-dfc.h
> > +++ b/lib/dpif-netdev-private-dfc.h
> > @@ -81,6 +81,9 @@ extern "C" {
> >  #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
> >                                      DEFAULT_EM_FLOW_INSERT_INV_PROB)
> >
> > +/* Forward declaration for SMC function prototype. */
> 
> A little vague, maybe extend with " prototypes that require access to dp_netdev_pmd_thread "
> 

I'll add more detail in the next version, like this:
/* Forward declaration for SMC function prototype that requires access to
 * 'struct dp_netdev_pmd_thread'. */

> > +struct dp_netdev_pmd_thread;
> > +
> >  struct emc_entry {
> >      struct dp_netdev_flow *flow;
> >      struct netdev_flow_key key;   /* key.hash used for emc hash value. */
> > @@ -237,6 +240,11 @@ emc_lookup(struct emc_cache *cache, const struct
> > netdev_flow_key *key)
> >      return NULL;
> >  }
> >
> > +struct dp_netdev_flow *
> > +smc_lookup_single(struct dp_netdev_pmd_thread *pmd,
> > +                  struct dp_packet *packet,
> > +                  struct netdev_flow_key *key);
> > +
> >  #ifdef  __cplusplus
> >  }
> >  #endif
> > diff --git a/lib/dpif-netdev-private-dpif.h b/lib/dpif-netdev-private-dpif.h
> > new file mode 100644
> > index 000000000..2fd7cc400
> > --- /dev/null
> > +++ b/lib/dpif-netdev-private-dpif.h
> > @@ -0,0 +1,32 @@
> > +/*
> > + * Copyright (c) 2021 Intel Corporation.
> > + *
> > + * Licensed under the Apache License, Version 2.0 (the "License");
> > + * you may not use this file except in compliance with the License.
> > + * You may obtain a copy of the License at:
> > + *
> > + *     http://www.apache.org/licenses/LICENSE-2.0
> > + *
> > + * Unless required by applicable law or agreed to in writing, software
> > + * distributed under the License is distributed on an "AS IS" BASIS,
> > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
> > implied.
> > + * See the License for the specific language governing permissions and
> > + * limitations under the License.
> > + */
> > +
> > +#ifndef DPIF_NETDEV_PRIVATE_DPIF_H
> > +#define DPIF_NETDEV_PRIVATE_DPIF_H 1
> > +
> > +#include "openvswitch/types.h"
> > +
> > +/* Forward declarations to avoid including files. */
> > +struct dp_netdev_pmd_thread;
> > +struct dp_packet_batch;
> > +
> > +/* Available implementations for dpif work. */
> > +int32_t
> > +dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd,
> > +                             struct dp_packet_batch *packets,
> > +                             odp_port_t in_port);
> 
> Just on the function name dp_netdev_input_outer_avx512 Above.
> 
> Is it likely to see a separate inner equivalent in the future? And would it be a separate function such as
> dp_netdev_input_inner_avx512 or would it just be on function handling both cases?
> 

We want to add support for recirculation, i.e. handling of the inner headers of packets, to provide more performance benefits. Whether that will be a separate function is a good question; we'll need to look closer when it comes to actually adding support for inner packets and find the appropriate solution.

I guess naming this with "outer" hopefully makes sense for now, since it only handles a packet's first pass through OVS.

> > +
> > +#endif /* netdev-private.h */
> > diff --git a/lib/dpif-netdev-private-thread.h b/lib/dpif-netdev-private-
> > thread.h
> > index 01a28a681..c0c94c566 100644
> > --- a/lib/dpif-netdev-private-thread.h
> > +++ b/lib/dpif-netdev-private-thread.h
> > @@ -45,14 +45,19 @@ struct dp_netdev_pmd_thread_ctx {
> >      struct dp_netdev_rxq *last_rxq;
> >      /* EMC insertion probability context for the current processing cycle. */
> >      uint32_t emc_insert_min;
> > +    /* Enable the SMC cache from ovsdb config. */
> > +    bool smc_enable_db;
> >  };
> >
> >  /* Forward declaration for typedef. */
> >  struct dp_netdev_pmd_thread;
> >
> > -typedef void (*dp_netdev_input_func)(struct dp_netdev_pmd_thread
> > *pmd,
> > -                                     struct dp_packet_batch *packets,
> > -                                     odp_port_t port_no);
> > +/* Typedef for DPIF functions.
> > + * Returns a bitmask of packets to handle, possibly including upcall/misses.
> > + */
> > +typedef int32_t (*dp_netdev_input_func)(struct dp_netdev_pmd_thread
> > *pmd,
> > +                                        struct dp_packet_batch *packets,
> > +                                        odp_port_t port_no);
> >
> >  /* PMD: Poll modes drivers.  PMD accesses devices via polling to eliminate
> >   * the performance overhead of interrupt processing.  Therefore netdev can
> > diff --git a/lib/dpif-netdev-private.h b/lib/dpif-netdev-private.h
> > index d7b6fd7ec..0315b5bf6 100644
> > --- a/lib/dpif-netdev-private.h
> > +++ b/lib/dpif-netdev-private.h
> > @@ -31,4 +31,29 @@
> >  #include "dpif-netdev-private-dfc.h"
> >  #include "dpif-netdev-private-thread.h"
> >
> > +/* Allow other implementations to lookup the DPCLS instances. */
> > +struct dpcls *
> > +dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
> > +                           odp_port_t in_port);
> > +
> > +/* Allow other implementations to call dpcls_lookup() for subtable search.
> > */
> > +bool
> > +dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
> > +             struct dpcls_rule **rules, const size_t cnt,
> > +             int *num_lookups_p);
> > +
> > +/* Allow other implementations to execute actions on a batch. */
> > +void
> > +dp_netdev_batch_execute(struct dp_netdev_pmd_thread *pmd,
> > +                        struct dp_packet_batch *packets,
> > +                        struct dpcls_rule *rule,
> > +                        uint32_t bytes,
> > +                        uint16_t tcp_flags);
> > +
> > +/* Available implementations for dpif work. */
> > +int32_t
> > +dp_netdev_input_outer_avx512(struct dp_netdev_pmd_thread *pmd,
> > +                             struct dp_packet_batch *packets,
> > +                             odp_port_t in_port);
> > +
> >  #endif /* netdev-private.h */
> > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> > index bec984643..5ed61d08b 100644
> > --- a/lib/dpif-netdev.c
> > +++ b/lib/dpif-netdev.c
> > @@ -185,10 +185,6 @@ static uint32_t
> > dpcls_subtable_lookup_reprobe(struct dpcls *cls);
> >  static void dpcls_insert(struct dpcls *, struct dpcls_rule *,
> >                           const struct netdev_flow_key *mask);
> >  static void dpcls_remove(struct dpcls *, struct dpcls_rule *);
> > -static bool dpcls_lookup(struct dpcls *cls,
> > -                         const struct netdev_flow_key *keys[],
> > -                         struct dpcls_rule **rules, size_t cnt,
> > -                         int *num_lookups_p);
> >
> >  /* Set of supported meter flags */
> >  #define DP_SUPPORTED_METER_FLAGS_MASK \
> > @@ -485,7 +481,7 @@ static void dp_netdev_execute_actions(struct
> > dp_netdev_pmd_thread *pmd,
> >                                        const struct flow *flow,
> >                                        const struct nlattr *actions,
> >                                        size_t actions_len);
> > -static void dp_netdev_input(struct dp_netdev_pmd_thread *,
> > +static int32_t dp_netdev_input(struct dp_netdev_pmd_thread *,
> >                              struct dp_packet_batch *, odp_port_t port_no);
> >  static void dp_netdev_recirculate(struct dp_netdev_pmd_thread *,
> >                                    struct dp_packet_batch *);
> > @@ -557,7 +553,7 @@ dpif_netdev_xps_revalidate_pmd(const struct
> > dp_netdev_pmd_thread *pmd,
> >                                 bool purge);
> >  static int dpif_netdev_xps_get_tx_qid(const struct dp_netdev_pmd_thread
> > *pmd,
> >                                        struct tx_port *tx);
> > -static inline struct dpcls *
> > +inline struct dpcls *
> >  dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
> >                             odp_port_t in_port);
> >
> > @@ -1922,7 +1918,7 @@ void dp_netdev_flow_unref(struct
> > dp_netdev_flow *flow)
> >      }
> >  }
> >
> > -static inline struct dpcls *
> > +inline struct dpcls *
> >  dp_netdev_pmd_lookup_dpcls(struct dp_netdev_pmd_thread *pmd,
> >                             odp_port_t in_port)
> >  {
> > @@ -2722,7 +2718,7 @@ dp_netdev_pmd_lookup_flow(struct
> > dp_netdev_pmd_thread *pmd,
> >                            int *lookup_num_p)
> >  {
> >      struct dpcls *cls;
> > -    struct dpcls_rule *rule;
> > +    struct dpcls_rule *rule = NULL;
> >      odp_port_t in_port = u32_to_odp(MINIFLOW_GET_U32(&key->mf,
> >                                                       in_port.odp_port));
> >      struct dp_netdev_flow *netdev_flow = NULL;
> > @@ -4236,7 +4232,10 @@ dp_netdev_process_rxq_port(struct
> > dp_netdev_pmd_thread *pmd,
> >          }
> >
> >          /* Process packet batch. */
> > -        pmd->netdev_input_func(pmd, &batch, port_no);
> > +        int32_t ret = pmd->netdev_input_func(pmd, &batch, port_no);
> > +        if (ret) {
> > +            dp_netdev_input(pmd, &batch, port_no);
> > +        }
> >
> >          /* Assign processing cycles to rx queue. */
> >          cycles = cycle_timer_stop(&pmd->perf_stats, &timer);
> > @@ -5254,6 +5253,8 @@ dpif_netdev_run(struct dpif *dpif)
> >                      non_pmd->ctx.emc_insert_min = 0;
> >                  }
> >
> > +                non_pmd->ctx.smc_enable_db = dp->smc_enable_db;
> > +
> >                  for (i = 0; i < port->n_rxq; i++) {
> >
> >                      if (!netdev_rxq_enabled(port->rxqs[i].rx)) {
> > @@ -5525,6 +5526,8 @@ reload:
> >                  pmd->ctx.emc_insert_min = 0;
> >              }
> >
> > +            pmd->ctx.smc_enable_db = pmd->dp->smc_enable_db;
> > +
> >              process_packets =
> >                  dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
> >                                             poll_list[i].port_no);
> > @@ -6419,6 +6422,24 @@ packet_batch_per_flow_execute(struct
> > packet_batch_per_flow *batch,
> >                                actions->actions, actions->size);
> >  }
> >
> > +void
> > +dp_netdev_batch_execute(struct dp_netdev_pmd_thread *pmd,
> > +                        struct dp_packet_batch *packets,
> > +                        struct dpcls_rule *rule,
> > +                        uint32_t bytes,
> > +                        uint16_t tcp_flags)
> > +{
> > +    /* Gets action* from the rule. */
> > +    struct dp_netdev_flow *flow = dp_netdev_flow_cast(rule);
> > +    struct dp_netdev_actions *actions = dp_netdev_flow_get_actions(flow);
> > +
> > +    dp_netdev_flow_used(flow, dp_packet_batch_size(packets), bytes,
> > +                        tcp_flags, pmd->ctx.now / 1000);
> > +    const uint32_t steal = 1;
> > +    dp_netdev_execute_actions(pmd, packets, steal, &flow->flow,
> > +                              actions->actions, actions->size);
> > +}
> > +
> >  static inline void
> >  dp_netdev_queue_batches(struct dp_packet *pkt,
> >                          struct dp_netdev_flow *flow, uint16_t tcp_flags,
> > @@ -6523,6 +6544,30 @@ smc_lookup_batch(struct
> > dp_netdev_pmd_thread *pmd,
> >      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT,
> > n_smc_hit);
> >  }
> >
> > +struct dp_netdev_flow *
> > +smc_lookup_single(struct dp_netdev_pmd_thread *pmd,
> > +                  struct dp_packet *packet,
> > +                  struct netdev_flow_key *key)
> > +{
> > +    const struct cmap_node *flow_node = smc_entry_get(pmd, key->hash);
> > +
> > +    if (OVS_LIKELY(flow_node != NULL)) {
> > +        struct dp_netdev_flow *flow = NULL;
> > +
> > +        CMAP_NODE_FOR_EACH (flow, node, flow_node) {
> > +            /* Since we dont have per-port megaflow to check the port
> > +             * number, we need to verify that the input ports match. */
> > +            if (OVS_LIKELY(dpcls_rule_matches_key(&flow->cr, key) &&
> > +                flow->flow.in_port.odp_port == packet->md.in_port.odp_port)) {
> > +
> > +                return (void *) flow;
> > +            }
> > +        }
> > +    }
> > +
> > +    return NULL;
> > +}
> > +
> >  /* Try to process all ('cnt') the 'packets' using only the datapath flow cache
> >   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
> >   * miniflow is copied into 'keys' and the packet pointer is moved at the
> > @@ -6928,12 +6973,13 @@ dp_netdev_input__(struct
> > dp_netdev_pmd_thread *pmd,
> >      }
> >  }
> >
> > -static void
> > +static int32_t
> >  dp_netdev_input(struct dp_netdev_pmd_thread *pmd,
> >                  struct dp_packet_batch *packets,
> >                  odp_port_t port_no)
> >  {
> >      dp_netdev_input__(pmd, packets, false, port_no);
> > +    return 0;
> 
> Always returning 0, I think this would change at a later stage no in the patch series?
> 

This won't change later in the patch series. The "return 0;" is added so the scalar DPIF implementation matches the DPIF function API newly defined by "dp_netdev_input_func" in "lib/dpif-netdev-private-dpif.h". Other DPIF implementations may return a nonzero value to signal that some packets were not handled (e.g. due to misses in that implementation) and must be reprocessed. Since "dp_netdev_input" handles every case that other implementations might not, it always returns 0.

> BR
> Ian

Thanks again for the review Ian.
> >  }
> >
> >  static void
> > @@ -8374,7 +8420,7 @@ netdev_flow_key_gen_masks(const struct
> > netdev_flow_key *tbl,
> >
> >  /* Returns true if 'target' satisfies 'key' in 'mask', that is, if each 1-bit
> >   * in 'mask' the values in 'key' and 'target' are the same. */
> > -bool
> > +inline bool ALWAYS_INLINE
> >  dpcls_rule_matches_key(const struct dpcls_rule *rule,
> >                         const struct netdev_flow_key *target)
> >  {
> > @@ -8400,7 +8446,7 @@ dpcls_rule_matches_key(const struct dpcls_rule
> > *rule,
> >   * priorities, instead returning any rule which matches the flow.
> >   *
> >   * Returns true if all miniflows found a corresponding rule. */
> > -static bool
> > +bool
> >  dpcls_lookup(struct dpcls *cls, const struct netdev_flow_key *keys[],
> >               struct dpcls_rule **rules, const size_t cnt,
> >               int *num_lookups_p)
> > --
> > 2.31.1
> >
> > _______________________________________________
> > dev mailing list
> > dev at openvswitch.org
> > https://mail.openvswitch.org/mailman/listinfo/ovs-dev



More information about the dev mailing list