[ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."

Bodireddy, Bhanuprakash bhanuprakash.bodireddy at intel.com
Fri Nov 24 16:08:59 UTC 2017


>On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>
>>> Padding and aligning of dp_netdev_pmd_thread structure members is
>>> useless, broken in several ways, and only greatly degrades
>>> maintainability and extensibility of the structure.
>>
>> The idea of my earlier patch was to mark the cache lines and reduce the
>> holes while still maintaining the grouping of related members in this
>> structure.
>
>Some of the grouping aspects look strange. For example, it seems illogical
>that 'exit_latch' is grouped with 'flow_table' rather than with 'reload_seq'
>and the other reload-related members. It also looks strange that statistics
>and counters are spread across different groups. So, IMHO, it's not well
>grouped.

I had to strike a fine balance, and some members ended up in a different group
because of their size and importance. Let me think about whether I can make it
better.
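For instance, all of the reload machinery could sit together in one marked
group. A rough, untested sketch of what I mean (the placement of these members
is just my assumption):

    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline_reload,
        struct seq *reload_seq;      /* Sequence used to trigger reloads. */
        uint64_t last_reload_seq;    /* Last value read from 'reload_seq'. */
        atomic_bool reload;          /* Do we need to reload ports? */
        bool need_reload;            /* Set if the thread needs a reload. */
        struct latch exit_latch;     /* For terminating the pmd thread. */
    );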

>
>> Also, cache line marking is good practice to make someone extra cautious
>> when extending or editing important structures.
>> Most importantly, I was experimenting with prefetching on this structure
>> and needed cache line markers for it.
>>
>> I see that you are on ARM (I don't have HW to test) and want to know if
>> this commit has any negative effect; any numbers would be appreciated.
>
>Basic VM-VM testing shows a stable 0.5% performance improvement with the
>revert applied.

I did P2P, PVP and PVVP tests with IXIA and didn't notice any drop on x86.

>Padding adds 560 additional bytes of holes.
As the cache line on ARM is 128 bytes, the padding created holes; I can find a
workaround to handle this.
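To make the cost concrete (numbers assumed for a 128-byte line): a group whose
members exactly fill one 64-byte line on x86 gains a full 64-byte hole on such
a machine, because PADDED_MEMBERS rounds every group up to CACHE_LINE_SIZE:

    PADDED_MEMBERS(CACHE_LINE_SIZE,   /* 128 on this kind of ARM config. */
        uint64_t counters[8];         /* 64 bytes of real payload...     */
    );                                /* ...plus 64 bytes of padding.    */

One such hole per padded group quickly adds up to the 560 bytes you measured.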

>
>> More comments inline.
>>
>>>
>>> Issues:
>>>
>>>    1. It's not working because all the instances of struct
>>>       dp_netdev_pmd_thread are allocated with plain malloc. The
>>>       memory is not aligned to cache lines, so the structure almost
>>>       never starts at an aligned memory address. This means that any
>>>       further paddings and alignments inside the structure are
>>>       completely useless. For example:
>>>
>>>       Breakpoint 1, pmd_thread_main
>>>       (gdb) p pmd
>>>       $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>       (gdb) p &pmd->cacheline1
>>>       $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>       (gdb) p &pmd->cacheline0
>>>       $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>       (gdb) p &pmd->flow_cache
>>>       $53 = (struct emc_cache *) 0x1b1afe0
>>>
>>>       All of the above addresses shifted from cacheline start by 32B.
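(A quick way to check this directly in gdb is to mask the low address bits;
for the addresses above, 0x1b1af20 & 0x3f is 0x20, i.e. 32 bytes past the
64-byte boundary:

    (gdb) p/x (unsigned long) pmd & (64 - 1)
    $54 = 0x20

A non-zero result means the structure does not start on a cache line.)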
>>
>> If you see below, all the addresses are 64-byte aligned.
>>
>> (gdb) p pmd
>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>> (gdb) p &pmd->cacheline0
>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>> (gdb) p &pmd->cacheline1
>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>> (gdb) p &pmd->flow_cache
>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>> (gdb) p &pmd->flow_table
>> $5 = (struct cmap *) 0x7fc1e9fba100
>> (gdb) p &pmd->stats
>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>> (gdb) p &pmd->port_mutex
>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>> (gdb) p &pmd->poll_list
>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>> (gdb) p &pmd->tnl_port_cache
>> $9 = (struct hmap *) 0x7fc1e9fba200
>> (gdb) p &pmd->stats_zero
>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>>
>> I tried using xzalloc_cacheline() instead of the default xzalloc() here.
>> I tried tens of times and always found that the address is 64-byte
>> aligned, so it should start at the beginning of a cache line on x86.
>> Not sure why the comment "(The memory returned will not be at the start
>> of a cache line, though, so don't assume such alignment.)" says otherwise?
>
>Yes, you will always get aligned addresses on your x86 Linux system that
>supports the posix_memalign() call. The comment says what it says because
>the function falls back to memory allocation tricks when posix_memalign()
>is not available (Windows, some MacOS, maybe some Linux systems (not sure)),
>and the address will not be aligned in that case.

I also verified the other case, when posix_memalign() isn't available, and
even then it returns an address aligned on a CACHE_LINE_SIZE boundary. I will
send out a patch to use xzalloc_cacheline() for allocating this memory.
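For anyone curious, the usual fallback trick works roughly like below. This is
a generic sketch of the technique, not OVS's actual implementation, so treat
the names and details as assumptions:

    #include <stdint.h>
    #include <stdlib.h>

    /* Over-allocate, round up to the next cache line boundary, and stash the
     * base pointer just below the returned payload so it can be freed later. */
    static void *
    aligned_malloc_fallback(size_t size)
    {
        void *base = malloc(size + CACHE_LINE_SIZE + sizeof base);
        uintptr_t payload;

        if (!base) {
            return NULL;
        }
        payload = ((uintptr_t) base + sizeof base + CACHE_LINE_SIZE - 1)
                  & ~(uintptr_t) (CACHE_LINE_SIZE - 1);
        ((void **) payload)[-1] = base;    /* Saved for the free() path. */
        return (void *) payload;
    }

    static void
    aligned_free_fallback(void *p)
    {
        if (p) {
            free(((void **) p)[-1]);
        }
    }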

>
>>
>>>
>>>       Can we fix it properly? NO.
>>>       OVS currently doesn't have an appropriate API to allocate
>>>       aligned memory. The best candidate is 'xmalloc_cacheline()' but
>>>       it clearly states that "The memory returned will not be at the
>>>       start of a cache line, though, so don't assume such alignment".
>>>       Also, this function will never return aligned memory on
>>>       Windows or MacOS.
>>>
>>>    2. CACHE_LINE_SIZE is not constant. Different architectures have
>>>       different cache line sizes, but the code assumes that
>>>       CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>>>       members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>>>       This leads to huge holes in the structures if CACHE_LINE_SIZE
>>>       differs from 64. This is the opposite of portability. If I want
>>>       good performance from cmap I need CACHE_LINE_SIZE to equal the
>>>       real cache line size, but then I get huge holes in the
>>>       structures. If you take a look at struct rte_mbuf from DPDK
>>>       you'll see that it uses two defines, RTE_CACHE_LINE_SIZE and
>>>       RTE_CACHE_LINE_MIN_SIZE, to avoid holes in the mbuf structure.
>>
>> I understand that ARM and a few other processors (like OCTEON) have
>> 128-byte cache lines.
>> But again, I'm curious about the performance impact in your case with
>> this new alignment.
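The rte_mbuf trick could probably be borrowed here if the grouping is kept:
group members by a fixed minimum line size and align only the structure as a
whole to the real one. A rough sketch (the OVS_* define name is made up):

    /* Grouping granularity: the smallest cache line size we care about.
     * CACHE_LINE_SIZE stays the real line size (64 on x86, 128 on this ARM). */
    #define OVS_CACHE_LINE_MIN_SIZE 64

    struct example_thread {
        PADDED_MEMBERS(OVS_CACHE_LINE_MIN_SIZE,
            uint64_t hot_counter;      /* Members grouped in 64-byte units. */
        );
        PADDED_MEMBERS(OVS_CACHE_LINE_MIN_SIZE,
            uint64_t other_counter;
        );
    };

On a 128-byte machine two adjacent groups then share one line instead of each
dragging a 64-byte hole behind it; the tradeoff is possible false sharing
between those two groups.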
>>
>>>
>>>    3. Sizes of system/libc defined types are not constant across
>>>       systems. For example, sizeof(pthread_mutex_t) == 48 on my
>>>       ARMv8 machine, but only 40 on x86. The difference could be
>>>       much bigger on Windows or MacOS systems. But the code assumes
>>>       that sizeof(struct ovs_mutex) is always 48 bytes. This can lead
>>>       to broken alignment, big holes when padding, and wrong comments
>>>       about the number of free pad bytes.
>>
>> This isn't really an issue in itself, as you mention; it's more about the
>> comment that states the number of pad bytes.
>> On ARM it would be just 8 pad bytes instead of the 16 on x86:
>>
>>         union {
>>                 struct {
>>                         struct ovs_mutex port_mutex;     /* 4849984    48 */
>>                 };    /*          48 */
>>                 uint8_t            pad13[64];            /*          64 */
>>         };
>>
>
>It's not only about 'port_mutex'. If you take a look at 'flow_mutex', you
>will see that it is not even padded. So, increasing the size of 'flow_mutex'
>shifts all the following padded blocks, and none of them will be properly
>aligned even if the structure is allocated on aligned memory.
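A trivial program makes these platform differences easy to compare; a sketch
(the sizes in the comment are what my glibc/x86_64 box reports, other
platforms will differ):

    #include <pthread.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* 40 and 48 on glibc/x86_64; ARMv8 reportedly gives 48 for the
         * mutex, matching the numbers discussed above. */
        printf("sizeof(pthread_mutex_t) = %zu\n", sizeof(pthread_mutex_t));
        printf("sizeof(pthread_cond_t)  = %zu\n", sizeof(pthread_cond_t));
        return 0;
    }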
>
>>>
>>>    4. The sizes of many fields in the structure depend on defines
>>>       like DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on.
>>>       Any change to these defines, or to any structure contained in
>>>       the thread structure, requires a not-so-simple refactoring of
>>>       the whole dp_netdev_pmd_thread structure. This greatly reduces
>>>       maintainability and complicates development of new features.
>>
>> I don't think it complicates development; instead, I feel the commit
>> gives a clear indication to the developer that the members are grouped,
>> aligned, and marked with cacheline markers.
>> This makes the developer extra cautious when adding new members, so that
>> holes can be avoided.
>
>While rebasing the output batching patch-set I found that I need to remove
>'unsigned long long last_cycles' and add 'struct dp_netdev_pmd_thread_ctx
>ctx', which is 8 bytes larger. Could you please suggest where I should
>place that new structure member and what to do with the hole left by
>'last_cycles'?
>
>This is not a trivial question, because the already poor grouping will
>become worse almost any way you do it.

Ah, I realize now that the batching series doesn't apply cleanly on master.
Let me check this and I will send across the changes that should fix it.
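A first thought, purely hypothetical until I rebase: 'ctx' could take over the
'last_cycles' slot in the same padded group, with the extra 8 bytes coming out
of that group's pad bytes, assuming it has enough left:

    PADDED_MEMBERS(CACHE_LINE_SIZE,
        struct cmap flow_table OVS_GUARDED;  /* Flow table. */
        struct cmap classifiers;             /* One classifier per in_port. */
        long long int next_optimization;
        long long int rxq_next_cycle_store;
        struct dp_netdev_pmd_cycles cycles;  /* Cycles counters. */
        struct dp_netdev_pmd_thread_ctx ctx; /* Replaces 'last_cycles'. */
        struct latch exit_latch;             /* For terminating the thread. */
    );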

- Bhanuprakash

>>
>> Cacheline-marking the structure is good practice, and I am sure this
>> structure is significant and should be carefully extended in the future.
>
>Not so sure about that.
>
>>
>>>
>>>    5. There is no reason to align the 'flow_cache' member, because
>>>       it's too big and its entries are usually accessed at random,
>>>       by a single thread only.
>>>
>>
>> I see your point. This patch wasn't done for performance; it was more
>> about bringing some order to this ever-growing structure. During
>> testing I found that for some test cases aligning the flow_cache
>> consistently gave me a 100k+ improvement, so the alignment was added.
>
>That was a random performance boost. You achieved it without aligned
>memory allocation; just luck with your system environment. Using
>xzalloc_cacheline() will likely eliminate this performance difference or
>even degrade performance.
>
>>
>>> So, the padding/alignment only creates the appearance of performance
>>> optimization but does nothing useful in reality. It only complicates
>>> maintenance and adds huge holes for non-x86 architectures and
>>> non-Linux systems. The performance improvement stated in the original
>>> commit message is likely random and not meaningful. I see no
>>> performance difference.
>>
>> I understand that this causes issues for architectures with different
>> cache line sizes, but the majority of them have 64-byte cache lines,
>> so this change makes sense.
>
>I understand that 64-byte cache lines are far more widespread. I also have
>x86 as a target arch, but still, IMHO, OVS is a cross-platform application
>and it should not have platform-dependent code that improves one
>architecture/platform and worsens others.
>
>>
>> If you have performance data proving that this causes severe perf
>> degradation, I can think of workarounds for ARM.
>>
>> - Bhanuprakash.
>
>
>P.S.: If you want to test with CACHE_LINE_SIZE=128 you will have to apply
>the following patch to avoid a build-time assert (I'll send it formally later):
>
>-----------------------------------------------------------------------------
>diff --git a/lib/cmap.c b/lib/cmap.c
>index 35decea..5b15ecd 100644
>--- a/lib/cmap.c
>+++ b/lib/cmap.c
>@@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
> /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */
> #define CMAP_K ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
>
>-/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */
>-#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K * CMAP_ENTRY_SIZE))
>-
> /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly one cache
>  * line long. */
> struct cmap_bucket {
>+    /* Padding to make cmap_bucket exactly one cache line long. */
>+    PADDED_MEMBERS(CACHE_LINE_SIZE,
>     /* Allows readers to track in-progress changes.  Initially zero, each
>      * writer increments this value just before and just after each change (see
>      * cmap_set_bucket()).  Thus, a reader can ensure that it gets a consistent
>@@ -145,11 +144,7 @@ struct cmap_bucket {
>      * slots. */
>     uint32_t hashes[CMAP_K];
>     struct cmap_node nodes[CMAP_K];
>-
>-    /* Padding to make cmap_bucket exactly one cache line long. */
>-#if CMAP_PADDING > 0
>-    uint8_t pad[CMAP_PADDING];
>-#endif
>+    );
> };
> BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
>
>diff --git a/lib/util.h b/lib/util.h
>index 3c43c2c..514fdaa 100644
>--- a/lib/util.h
>+++ b/lib/util.h
>@@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
>
> /* This system's cache line size, in bytes.
>  * Being wrong hurts performance but not correctness. */
>-#define CACHE_LINE_SIZE 64
>+#define CACHE_LINE_SIZE 128
> BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>
> /* Cacheline marking is typically done using zero-sized array.
>-----------------------------------------------------------------------------
>
>
>Best regards, Ilya Maximets.
>
>>
>>>
>>> Most of the above issues are also true for some other padded/aligned
>>> structures like 'struct netdev_dpdk'. They will be treated separately.
>>>
>>> CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com>
>>> CC: Ben Pfaff <blp at ovn.org>
>>> Signed-off-by: Ilya Maximets <i.maximets at samsung.com>
>>> ---
>>> lib/dpif-netdev.c | 160 +++++++++++++++++++++++-------------------------------
>>> 1 file changed, 69 insertions(+), 91 deletions(-)
>>>
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>> index 0a62630..6784269 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -547,31 +547,18 @@ struct tx_port {
>>>  * actions in either case.
>>>  * */
>>> struct dp_netdev_pmd_thread {
>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>>> -        struct dp_netdev *dp;
>>> -        struct cmap_node node;          /* In 'dp->poll_threads'. */
>>> -        pthread_cond_t cond;            /* For synchronizing pmd thread
>>> -                                           reload. */
>>> -    );
>>> -
>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>> -        struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>> -        pthread_t thread;
>>> -        unsigned core_id;               /* CPU core id of this pmd thread. */
>>> -        int numa_id;                    /* numa node id of this pmd thread. */
>>> -    );
>>> +    struct dp_netdev *dp;
>>> +    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>> +    struct cmap_node node;          /* In 'dp->poll_threads'. */
>>> +
>>> +    pthread_cond_t cond;            /* For synchronizing pmd thread reload. */
>>> +    struct ovs_mutex cond_mutex;    /* Mutex for condition variable. */
>>>
>>>     /* Per thread exact-match cache.  Note, the instance for cpu core
>>>      * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>>>      * need to be protected by 'non_pmd_mutex'.  Every other instance
>>>      * will only be accessed by its own pmd thread. */
>>> -    OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>>> -    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>> -
>>> -    /* Queue id used by this pmd thread to send packets on all netdevs if
>>> -     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>> -     * than 'cmap_count(dp->poll_threads)'. */
>>> -    uint32_t static_tx_qid;
>>> +    struct emc_cache flow_cache;
>>>
>>>     /* Flow-Table and classifiers
>>>      *
>>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>>>      * 'flow_mutex'.
>>>      */
>>>     struct ovs_mutex flow_mutex;
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>> -
>>> -        /* One classifier per in_port polled by the pmd */
>>> -        struct cmap classifiers;
>>> -        /* Periodically sort subtable vectors according to hit frequencies */
>>> -        long long int next_optimization;
>>> -        /* End of the next time interval for which processing cycles
>>> -           are stored for each polled rxq. */
>>> -        long long int rxq_next_cycle_store;
>>> -
>>> -        /* Cycles counters */
>>> -        struct dp_netdev_pmd_cycles cycles;
>>> -
>>> -        /* Used to count cycles. See 'cycles_counter_end()'. */
>>> -        unsigned long long last_cycles;
>>> -        struct latch exit_latch;        /* For terminating the pmd thread. */
>>> -    );
>>> -
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        /* Statistics. */
>>> -        struct dp_netdev_pmd_stats stats;
>>> -
>>> -        struct seq *reload_seq;
>>> -        uint64_t last_reload_seq;
>>> -        atomic_bool reload;             /* Do we need to reload ports? */
>>> -        bool isolated;
>>> -
>>> -        /* Set to true if the pmd thread needs to be reloaded. */
>>> -        bool need_reload;
>>> -        /* 5 pad bytes. */
>>> -    );
>>> -
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        struct ovs_mutex port_mutex;    /* Mutex for 'poll_list'
>>> -                                           and 'tx_ports'. */
>>> -        /* 16 pad bytes. */
>>> -    );
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        /* List of rx queues to poll. */
>>> -        struct hmap poll_list OVS_GUARDED;
>>> -        /* Map of 'tx_port's used for transmission.  Written by the main
>>> -         * thread, read by the pmd thread. */
>>> -        struct hmap tx_ports OVS_GUARDED;
>>> -    );
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        /* These are thread-local copies of 'tx_ports'.  One contains only
>>> -         * tunnel ports (that support push_tunnel/pop_tunnel), the other
>>> -         * contains ports with at least one txq (that support send).
>>> -         * A port can be in both.
>>> -         *
>>> -         * There are two separate maps to make sure that we don't try to
>>> -         * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>>> -         * non-tunnel device.
>>> -         *
>>> -         * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>>> -         * multiple threads and thusly need to be protected by
>>> 'non_pmd_mutex'.
>>> -         * Every other instance will only be accessed by its own pmd thread.
>*/
>>> -        struct hmap tnl_port_cache;
>>> -        struct hmap send_port_cache;
>>> -    );
>>> -
>>> -    PADDED_MEMBERS(CACHE_LINE_SIZE,
>>> -        /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>> -         * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>> -         * values and subtracts them from 'stats' and 'cycles' before
>>> -         * reporting to the user */
>>> -        unsigned long long stats_zero[DP_N_STATS];
>>> -        uint64_t cycles_zero[PMD_N_CYCLES];
>>> -        /* 8 pad bytes. */
>>> -    );
>>> +    struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>> +
>>> +    /* One classifier per in_port polled by the pmd */
>>> +    struct cmap classifiers;
>>> +    /* Periodically sort subtable vectors according to hit frequencies */
>>> +    long long int next_optimization;
>>> +    /* End of the next time interval for which processing cycles
>>> +       are stored for each polled rxq. */
>>> +    long long int rxq_next_cycle_store;
>>> +
>>> +    /* Statistics. */
>>> +    struct dp_netdev_pmd_stats stats;
>>> +
>>> +    /* Cycles counters */
>>> +    struct dp_netdev_pmd_cycles cycles;
>>> +
>>> +    /* Used to count cicles. See 'cycles_counter_end()' */
>>> +    unsigned long long last_cycles;
>>> +
>>> +    struct latch exit_latch;        /* For terminating the pmd thread. */
>>> +    struct seq *reload_seq;
>>> +    uint64_t last_reload_seq;
>>> +    atomic_bool reload;             /* Do we need to reload ports? */
>>> +    pthread_t thread;
>>> +    unsigned core_id;               /* CPU core id of this pmd thread. */
>>> +    int numa_id;                    /* numa node id of this pmd thread. */
>>> +    bool isolated;
>>> +
>>> +    /* Queue id used by this pmd thread to send packets on all netdevs if
>>> +     * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>> +     * than 'cmap_count(dp->poll_threads)'. */
>>> +    uint32_t static_tx_qid;
>>> +
>>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
>>> +    /* List of rx queues to poll. */
>>> +    struct hmap poll_list OVS_GUARDED;
>>> +    /* Map of 'tx_port's used for transmission.  Written by the main thread,
>>> +     * read by the pmd thread. */
>>> +    struct hmap tx_ports OVS_GUARDED;
>>> +
>>> +    /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>>> +     * ports (that support push_tunnel/pop_tunnel), the other contains ports
>>> +     * with at least one txq (that support send).  A port can be in both.
>>> +     *
>>> +     * There are two separate maps to make sure that we don't try to execute
>>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel
>>> +     * device.
>>> +     *
>>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
>>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
>>> +     * other instance will only be accessed by its own pmd thread. */
>>> +    struct hmap tnl_port_cache;
>>> +    struct hmap send_port_cache;
>>> +
>>> +    /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>> +     * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>> +     * values and subtracts them from 'stats' and 'cycles' before
>>> +     * reporting to the user */
>>> +    unsigned long long stats_zero[DP_N_STATS];
>>> +    uint64_t cycles_zero[PMD_N_CYCLES];
>>> +
>>> +    /* Set to true if the pmd thread needs to be reloaded. */
>>> +    bool need_reload;
>>> };
>>>
>>> /* Interface to netdev-based datapath. */
>>> --
>>> 2.7.4

