[ovs-dev] [PATCH v13 4/5] dpif-netdev: Refactor generic implementation

Ilya Maximets i.maximets at samsung.com
Thu Jul 18 11:34:11 UTC 2019


On 17.07.2019 21:21, Harry van Haaren wrote:
> This commit refactors the generic implementation. The
> goal of this refactor is to simplify the code to enable
> "specialization" of the functions at compile time.
> 
> Given compile-time optimizations, the compiler is able
> to unroll loops, and create optimized code sequences due
> to compile time knowledge of loop-trip counts.
> 
> In order to enable these compiler optimizations, we must
> refactor the code to pass the loop-trip counts to functions
> as compile time constants.
> 
> This patch allows the number of miniflow-bits set per "unit"
> in the miniflow to be passed around as a function argument.
> 
> Note that this patch does NOT yet take advantage of doing so,
> this is only a refactor to enable it in the next patches.
> 
> Signed-off-by: Harry van Haaren <harry.van.haaren at intel.com>
> Tested-by: Malvika Gupta <malvika.gupta at arm.com>
> 
> ---
> 
> v13:
> - Moved blocks scratch array to thread local storage. This encapsulates
>   the blocks array better inside the implementation where it is
>   required, and avoids bleeding the details to the dpcls or PMD level.
>   Thanks Ilya for suggesting the DEFINE_STATIC_PER_THREAD_DATA method.
> - Removed blocks scratch array from struct dpcls
> - Removed blocks_scratch parameter from lookup_func prototype
> 
> v12:
> - Fix Caps and . (Ilya)
> - Fixed typos (Ilya)
> - Added mf_bits and mf_masks in this patch (Ilya)
> - Fixed rebase conflicts
> 
> v11:
> - Rebased to previous changes
> - Fix typo in commit message (Ian)
> - Fix variable declaration spacing (Ian)
> - Remove function names from comments (Ian)
> - Replace magic 8 with sizeof(uint64_t) (Ian)
> - Captialize and end comments with a stop. (Ian/Ilya)
> - Add build time assert to validate FLOWMAP_UNITS (Ilya)
> - Add note on ALWAYS_INLINE operation
> - Add space after ULLONG_FOR_EACH_1 (Ilya)
> - Use hash_add_words64() instead of rolling own loop (Ilya)
>     Note that hash_words64_inline() calls hash_finish() with an
>     fixed value, so it was not the right hash function for this
>     usage. Used hash_add_words64() and manual hash_finish() to
>     re-use as much of hashing code as we can.
> 
> v10:
> - Rebase updates from previous patches
> - Fix whitespace indentation of func params
> - Removed restrict keyword, Windows CI failing when it is used (Ian)
> - Fix integer 0 used to set NULL pointer (Ilya)
> - Postpone free() call on cls->blocks_scratch (Ilya)
> - Fix indentation of a function
> 
> v9:
> - Use count_1bits in favour of __builtin_popcount (Ilya)
> - Use ALWAYS_INLINE instead of __attribute__ synatx (Ilya)
> 
> v8:
> - Rework block_cache and mf_masks to avoid variable-lenght array
>   due to compiler issues. Provisioning for worst case is not a
>   good solution due to magnitue of over-provisioning required.
> - Rework netdev_flatten function removing unused parameter
> ---
>  lib/dpif-netdev-lookup-generic.c | 241 +++++++++++++++++++++++++++----
>  lib/dpif-netdev-private.h        |  12 +-
>  lib/dpif-netdev.c                |  51 ++++++-
>  3 files changed, 269 insertions(+), 35 deletions(-)
> 
> diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
> index 8064911b3..de45099f8 100644
> --- a/lib/dpif-netdev-lookup-generic.c
> +++ b/lib/dpif-netdev-lookup-generic.c
> @@ -27,61 +27,226 @@
>  #include "dpif-netdev-perf.h"
>  #include "dpif-provider.h"
>  #include "flow.h"
> +#include "ovs-thread.h"
>  #include "packets.h"
>  #include "pvector.h"
>  
> -/* Returns a hash value for the bits of 'key' where there are 1-bits in
> - * 'mask'. */
> -static inline uint32_t
> -netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key,
> -                             const struct netdev_flow_key *mask)
> +VLOG_DEFINE_THIS_MODULE(dpif_lookup_generic);
> +
> +/* Lookup functions below depends on the internal structure of flowmap. */
> +BUILD_ASSERT_DECL(FLOWMAP_UNITS == 2);
> +
> +/* Per thread data to store the blocks cache. The 'blocks_cache_count' variable
> + * stores the size of the allocated space in uint64_t blocks (so * 8 to get the
> + * size in bytes).
> + */
> +DEFINE_STATIC_PER_THREAD_DATA(uint64_t *, blocks_scratch_ptr, 0);
> +DEFINE_STATIC_PER_THREAD_DATA(uint32_t, blocks_scratch_count_ptr, 0);


Since you have a malloced data stored here it will be leaked on thread destruction.
You need to use DEFINE_PER_THREAD_MALLOCED_DATA instead that will free the memory
in this case.

Please, consider following incremental:

diff --git a/lib/dpif-netdev-lookup-generic.c b/lib/dpif-netdev-lookup-generic.c
index 50edc483c..0addbad61 100644
--- a/lib/dpif-netdev-lookup-generic.c
+++ b/lib/dpif-netdev-lookup-generic.c
@@ -36,12 +36,34 @@ VLOG_DEFINE_THIS_MODULE(dpif_lookup_generic);
 /* Lookup functions below depends on the internal structure of flowmap. */
 BUILD_ASSERT_DECL(FLOWMAP_UNITS == 2);
 
-/* Per thread data to store the blocks cache. The 'blocks_cache_count' variable
- * stores the size of the allocated space in uint64_t blocks (so * 8 to get the
- * size in bytes).
- */
-DEFINE_STATIC_PER_THREAD_DATA(uint64_t *, blocks_scratch_ptr, 0);
-DEFINE_STATIC_PER_THREAD_DATA(uint32_t, blocks_scratch_count_ptr, 0);
+struct array {
+    uint32_t allocated; /* Number of elements allocated in 'elems'. */
+    uint64_t elems[];
+};
+
+/* Per thread data to store the blocks cache. */
+DEFINE_PER_THREAD_MALLOCED_DATA(struct array *, blocks_scratch_array);
+
+static inline uint64_t *
+get_blocks_scratch(uint32_t count)
+{
+    struct array *darray = blocks_scratch_array_get();
+
+    /* Check if this thread already has a large enough array allocated.
+     * This is a predictable UNLIKLEY branch as it will only occur once at
+     * startup, or if a subtable with higher blocks count is added.
+     */
+    if (OVS_UNLIKELY(!darray || darray->allocated < count)) {
+        /* Allocate new memory for blocks_scratch, and store new size. */
+        darray = xrealloc(darray,
+                          sizeof *darray + count * sizeof darray->elems[0]);
+        darray->allocated = count;
+        blocks_scratch_array_set_unsafe(darray);
+        VLOG_DBG("Block scratch array resized to %"PRIu32, count);
+    }
+
+    return &darray->elems[0];
+}
 
 /* Given a packet, table and mf_masks, this function iterates over each bit
  * set in the subtable, and calculates the appropriate metadata to store in the
@@ -184,29 +206,7 @@ lookup_generic_impl(struct dpcls_subtable *subtable,
      * and hence improves performance. The blocks_scratch array is stored as a
      * thread local variable, as each thread requires its own blocks memory.
      */
-    uint64_t **blocks_scratch_ptr = blocks_scratch_ptr_get();
-    uint32_t *blocks_scratch_count_ptr = blocks_scratch_count_ptr_get();
-    uint32_t blocks_scratch_count = *blocks_scratch_count_ptr;
-
-    /* Check if this thread already has a large enough blocks_scratch array
-     * allocated. This is a predictable UNLIKLEY branch as it will only occur
-     * once at startup, or if a subtable with higher blocks count is added.
-     */
-    if (OVS_UNLIKELY(blocks_scratch_count < block_count_required ||
-                     !*blocks_scratch_ptr)) {
-        /* Free old scratch memory if it was allocated. */
-        if (*blocks_scratch_ptr) {
-                free(*blocks_scratch_ptr);
-        }
-
-        /* Allocate new memory for blocks_scratch, and store new size */
-        uint64_t *new_ptr = xmalloc(sizeof(uint64_t) * block_count_required);
-        *blocks_scratch_ptr = new_ptr;
-        *blocks_scratch_count_ptr = block_count_required;
-        VLOG_DBG("block scratch array resized to %d\n", block_count_required);
-    }
-
-    uint64_t *blocks_scratch = *blocks_scratch_ptr;
+    uint64_t *blocks_scratch = get_blocks_scratch(block_count_required);
 
     /* Flatten the packet metadata into the blocks_scratch[] using subtable. */
     ULLONG_FOR_EACH_1 (i, keys_map) {
---

Best regards, Ilya Maximets.


More information about the dev mailing list