[ovs-dev] [PATCH] Revert "dpif_netdev: Refactor dp_netdev_pmd_thread structure."
Ilya Maximets
i.maximets at samsung.com
Mon Nov 27 11:59:29 UTC 2017
On 24.11.2017 19:08, Bodireddy, Bhanuprakash wrote:
>> On 22.11.2017 20:14, Bodireddy, Bhanuprakash wrote:
>>>> This reverts commit a807c15796ddc43ba1ffb2a6b0bd2ad4e2b73941.
>>>>
>>>> Padding and aligning of dp_netdev_pmd_thread structure members is
>>>> useless, broken in several ways, and only greatly degrades the
>>>> maintainability and extensibility of the structure.
>>>
>>> The idea of my earlier patch was to mark the cache lines and reduce the
>> holes while still maintaining the grouping of related members in this structure.
>>
>> Some of the grouping aspects look strange. For example, it looks illogical that
>> 'exit_latch' is grouped with 'flow_table' but not with 'reload_seq' and the other
>> reload-related stuff. It looks strange that statistics and counters are spread
>> across different groups. So, IMHO, it's not well grouped.
>
> I had to strike a fine balance and some members may be placed in a different group
> due to their sizes and importance. Let me think if I can make it better.
>
>>
>>> Also, cache line marking is a good practice to make someone extra cautious
>> when extending or editing important structures.
>>> Most importantly I was experimenting with prefetching on this structure
>> and needed cache line markers for it.
>>>
>>> I see that you are on ARM (I don't have HW to test) and want to know if this
>> commit has any negative effect; any numbers would be appreciated.
>>
>> Basic VM-VM testing shows a stable 0.5% performance improvement with the
>> revert applied.
>
> I did P2P, PVP and PVVP with IXIA and haven't noticed any drop on X86.
>
>> Padding adds 560 additional bytes of holes.
> As the cache line on ARM is 128 bytes, it created holes; I can find a workaround to handle this.
>
>>
>>> More comments inline.
>>>
>>>>
>>>> Issues:
>>>>
>>>> 1. It's not working because all the instances of struct
>>>> dp_netdev_pmd_thread are allocated by plain malloc. The
>>>> memory is not aligned to cachelines -> the structure almost never
>>>> starts at an aligned memory address. This means that any further
>>>> paddings and alignments inside the structure are completely
>>>> useless. For example:
>>>>
>>>> Breakpoint 1, pmd_thread_main
>>>> (gdb) p pmd
>>>> $49 = (struct dp_netdev_pmd_thread *) 0x1b1af20
>>>> (gdb) p &pmd->cacheline1
>>>> $51 = (OVS_CACHE_LINE_MARKER *) 0x1b1af60
>>>> (gdb) p &pmd->cacheline0
>>>> $52 = (OVS_CACHE_LINE_MARKER *) 0x1b1af20
>>>> (gdb) p &pmd->flow_cache
>>>> $53 = (struct emc_cache *) 0x1b1afe0
>>>>
>>>> All of the above addresses are shifted from the cacheline start by 32B.
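For reference, the check that the gdb session performs by hand can be written as a tiny standalone helper (a sketch; `is_cacheline_aligned` is a made-up name, not an OVS function):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64

/* True if 'addr' sits exactly at the start of a cache line. */
static bool
is_cacheline_aligned(uintptr_t addr)
{
    return (addr % CACHE_LINE_SIZE) == 0;
}
```

Applied to the addresses above: 0x1b1af20 is 32 bytes past a line start, while 0x7fc1e9b1a040 is exactly on one.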
>>>
>>> If you see below all the addresses are 64 byte aligned.
>>>
>>> (gdb) p pmd
>>> $1 = (struct dp_netdev_pmd_thread *) 0x7fc1e9b1a040
>>> (gdb) p &pmd->cacheline0
>>> $2 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a040
>>> (gdb) p &pmd->cacheline1
>>> $3 = (OVS_CACHE_LINE_MARKER *) 0x7fc1e9b1a080
>>> (gdb) p &pmd->flow_cache
>>> $4 = (struct emc_cache *) 0x7fc1e9b1a0c0
>>> (gdb) p &pmd->flow_table
>>> $5 = (struct cmap *) 0x7fc1e9fba100
>>> (gdb) p &pmd->stats
>>> $6 = (struct dp_netdev_pmd_stats *) 0x7fc1e9fba140
>>> (gdb) p &pmd->port_mutex
>>> $7 = (struct ovs_mutex *) 0x7fc1e9fba180
>>> (gdb) p &pmd->poll_list
>>> $8 = (struct hmap *) 0x7fc1e9fba1c0
>>> (gdb) p &pmd->tnl_port_cache
>>> $9 = (struct hmap *) 0x7fc1e9fba200
>>> (gdb) p &pmd->stats_zero
>>> $10 = (unsigned long long (*)[5]) 0x7fc1e9fba240
>>>
>>> I tried using xzalloc_cacheline instead of default xzalloc() here. I
>>> tried tens of times and always found that the address is
>>> 64 byte aligned and it should start at the beginning of cache line on X86.
>>> Not sure why the comment " (The memory returned will not be at the start
>> of a cache line, though, so don't assume such alignment.)" says otherwise?
>>
>> Yes, you will always get aligned addresses on your x86 Linux system that
>> supports the posix_memalign() call. The comment says what it says because
>> the function falls back to memory allocation tricks in case posix_memalign()
>> is not available (Windows, some MacOS, maybe some Linux systems (not sure)),
>> and the address will not be aligned in this case.
>
> I also verified the other case when posix_memalign isn't available and even in that case
> it returns the address aligned on CACHE_LINE_SIZE boundary. I will send out a patch to use
> xzalloc_cacheline for allocating the memory.
I don't know how you tested this, because it is impossible:
1. OVS allocates some memory:
base = xmalloc(...);
2. Rounds it up to the cache line start:
payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);
3. Returns the pointer increased by 8 bytes:
return (char *) payload + MEM_ALIGN;
So, unless you redefined MEM_ALIGN to zero, you will never get an aligned memory
address when allocating with xmalloc_cacheline() on a system without posix_memalign().
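To illustrate, the three steps above can be modeled in standalone C (a sketch under assumptions: `fake_xmalloc_cacheline` and the constant definitions are stand-ins for the real lib/util.c code, not the actual implementation):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64
#define MEM_ALIGN sizeof(void *)            /* 8 on a 64-bit build. */
#define ROUND_UP(x, y) ((((x) + ((y) - 1)) / (y)) * (y))

/* Simplified model of the fallback path when posix_memalign() is missing:
 * over-allocate, round the payload up to a cache line boundary, stash the
 * base pointer there for the free path, and return payload + MEM_ALIGN.
 * The returned address is therefore always MEM_ALIGN bytes past a line
 * start, never on one. */
static void *
fake_xmalloc_cacheline(size_t size)
{
    void *base = malloc((2 * CACHE_LINE_SIZE) + size);
    void **payload = (void **) ROUND_UP((uintptr_t) base, CACHE_LINE_SIZE);

    *payload = base;                        /* Remembered for freeing. */
    return (char *) payload + MEM_ALIGN;
}
```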
>
>>
>>>
>>>>
>>>> Can we fix it properly? NO.
>>>> OVS currently doesn't have appropriate API to allocate aligned
>>>> memory. The best candidate is 'xmalloc_cacheline()' but it
>>>> clearly states that "The memory returned will not be at the
>>>> start of a cache line, though, so don't assume such alignment".
>>>> And also, this function will never return aligned memory on
>>>> Windows or MacOS.
>>>>
>>>> 2. CACHE_LINE_SIZE is not constant. Different architectures have
>>>> different cache line sizes, but the code assumes that
>>>> CACHE_LINE_SIZE is always equal to 64 bytes. All the structure
>>>> members are grouped by 64 bytes and padded to CACHE_LINE_SIZE.
>>>> This leads to huge holes in the structures if CACHE_LINE_SIZE
>>>> differs from 64. This is the opposite of portability. If I want
>>>> good performance of cmap I need to have CACHE_LINE_SIZE equal
>>>> to the real cache line size, but I will have huge holes in the
>>>> structures. If you'll take a look to struct rte_mbuf from DPDK
>>>> you'll see that it uses 2 defines: RTE_CACHE_LINE_SIZE and
>>>> RTE_CACHE_LINE_MIN_SIZE to avoid holes in mbuf structure.
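As an illustration of the two-define idea (a sketch of the DPDK convention, not OVS or DPDK code; `struct example` and its fields are made up, and the GCC/Clang `aligned` attribute is assumed):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* DPDK-style two defines: members are grouped by the 64-byte minimum line
 * size, while the whole object is aligned to the (possibly larger) real
 * line size.  Groups stay hole-free on 64-byte machines; only the tail
 * padding grows on 128-byte ones. */
#define RTE_CACHE_LINE_MIN_SIZE 64
#define RTE_CACHE_LINE_SIZE 128             /* e.g. some ARM/POWER parts. */

struct example {
    uint64_t hot[8];                        /* Fills one 64-byte group. */
    uint64_t cold[8] __attribute__((aligned(RTE_CACHE_LINE_MIN_SIZE)));
} __attribute__((aligned(RTE_CACHE_LINE_SIZE)));
```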
>>>
>>> I understand that ARM and few other processors (like OCTEON) have 128
>> bytes cache lines.
>>> But again curious of performance impact in your case with this new
>> alignment.
>>>
>>>>
>>>> 3. Sizes of system/libc defined types are not constant for all the
>>>> systems. For example, sizeof(pthread_mutex_t) == 48 on my
>>>> ARMv8 machine, but only 40 on x86. The difference could be
>>>> much bigger on Windows or MacOS systems. But the code assumes
>>>> that sizeof(struct ovs_mutex) is always 48 bytes. This may lead
>>>> to broken alignment or big holes in case of padding, and to wrong
>>>> comments about the number of free pad bytes.
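One way to keep the layout stable despite the varying mutex size is a union pad (a hedged sketch; `padded_mutex` is a made-up name, and it assumes the mutex fits within one 64-byte line on the build machine):

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64

/* The union is exactly one cache line regardless of whether
 * sizeof(pthread_mutex_t) is 40 (x86-64 glibc) or 48 (the ARMv8 box in
 * this thread): the pad absorbs the difference instead of shifting every
 * member that follows. */
union padded_mutex {
    pthread_mutex_t mutex;
    uint8_t pad[CACHE_LINE_SIZE];
};
```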
>>>
>>> This isn't a real issue, as you mentioned; it's more an issue with the
>> comment that reports the number of pad bytes.
>>> In case of ARM it would be just 8 pad bytes instead of 16 on X86.
>>>
>>> union {
>>>     struct {
>>>         struct ovs_mutex port_mutex;     /* 4849984    48 */
>>>     };                                   /*            48 */
>>>     uint8_t pad13[64];                   /*            64 */
>>> };
>>>
>>
>> It's not only about 'port_mutex'. If you take a look at 'flow_mutex', you will
>> see that it is not even padded. So, increasing the size of 'flow_mutex'
>> shifts all the following padded blocks, and no other blocks will be
>> properly aligned even if the structure is allocated on aligned memory.
>>
>>>>
>>>> 4. Sizes of many fields in the structure depend on defines like
>>>> DP_N_STATS, PMD_N_CYCLES, EM_FLOW_HASH_ENTRIES and so on.
>>>> Any change in these defines, or any change in any structure
>>>> contained by the thread structure, leads to a not-so-simple
>>>> refactoring of the whole dp_netdev_pmd_thread structure. This
>>>> greatly reduces maintainability and complicates development of
>>>> new features.
>>>
>>> I don't think it complicates development; instead, I feel the commit
>>> gives a clear indication to the developer that the members are grouped,
>> aligned, and marked with cacheline markers.
>>> This makes the developer extra cautious when adding new members so that
>> holes can be avoided.
>>
>> While rebasing the output batching patch-set I figured out that I need to
>> remove 'unsigned long long last_cycles' and add 'struct
>> dp_netdev_pmd_thread_ctx ctx',
>> which is 8 bytes larger. Could you please suggest where I should place
>> that new structure member and what to do with the hole left by 'last_cycles'?
>>
>> This is not a trivial question, because the already poor grouping will
>> almost certainly become worse.
>
> Aah, realized now that the batching series doesn't cleanly apply on master.
> Let me check this and will send across the changes that should fix this.
>
> - Bhanuprakash
>
>>>
>>> Cacheline marking the structure is a good practice and I am sure this
>>> structure is significant and should be carefully extended in the future.
>>
>> Not so sure about that.
>>
>>>
>>>>
>>>> 5. There is no reason to align the flow_cache member: it is
>>>> too big, and random entries are usually accessed by a single
>>>> thread only.
>>>>
>>>
>>> I see your point. This patch wasn't done for performance; it was more
>>> about bringing some order to this ever-growing structure. During
>>> testing I found that for some test cases aligning the flow_cache consistently
>> gave me a 100k+ performance improvement, and so it was added.
>>
>> This was a random performance boost. You achieved it without aligned
>> memory allocation.
>> Just luck with your system environment. Using xzalloc_cacheline will likely
>> eliminate this performance difference or even degrade the performance.
>>
>>>
>>>> So, the padding/alignment only creates the appearance of a performance
>>>> optimization but does nothing useful in reality. It only complicates
>>>> maintenance and adds huge holes for non-x86 architectures and
>>>> non-Linux systems. The performance improvement stated in the original
>>>> commit message is likely random and not meaningful. I see no
>>>> performance difference.
>>>
>>> I understand that this is causing issues with architectures having
>>> different cache line sizes, but unfortunately the majority of them have
>> 64-byte cache lines, so this change makes sense.
>>
>> I understand that 64-byte cache lines are far more widespread. I also have x86 as
>> a target arch, but still, IMHO, OVS is a cross-platform application and it should
>> not have platform-dependent stuff which makes one architecture/platform
>> better and worsens others.
>>
>>>
>>> If you have performance data proving that this causes severe perf
>> degradation, I can think of workarounds for ARM.
>>>
>>> - Bhanuprakash.
>>
>>
>> P.S.: If you want to test with CACHE_LINE_SIZE=128 you will have to apply the
>> following patch to avoid a build-time assert (I'll send it formally later):
>>
>> -----------------------------------------------------------------------------
>> diff --git a/lib/cmap.c b/lib/cmap.c
>> index 35decea..5b15ecd 100644
>> --- a/lib/cmap.c
>> +++ b/lib/cmap.c
>> @@ -123,12 +123,11 @@ COVERAGE_DEFINE(cmap_shrink);
>> /* Number of entries per bucket: 7 on 32-bit, 5 on 64-bit. */ #define CMAP_K
>> ((CACHE_LINE_SIZE - 4) / CMAP_ENTRY_SIZE)
>>
>> -/* Pad to make a bucket a full cache line in size: 4 on 32-bit, 0 on 64-bit. */
>> -#define CMAP_PADDING ((CACHE_LINE_SIZE - 4) - (CMAP_K * CMAP_ENTRY_SIZE))
>> -
>>  /* A cuckoo hash bucket.  Designed to be cache-aligned and exactly one
>>   * cache line long. */
>> struct cmap_bucket {
>> + /* Padding to make cmap_bucket exactly one cache line long. */
>> + PADDED_MEMBERS(CACHE_LINE_SIZE,
>> /* Allows readers to track in-progress changes. Initially zero, each
>> * writer increments this value just before and just after each change (see
>> * cmap_set_bucket()). Thus, a reader can ensure that it gets a consistent
>> @@ -145,11 +144,7 @@ struct cmap_bucket {
>> * slots. */
>> uint32_t hashes[CMAP_K];
>> struct cmap_node nodes[CMAP_K];
>> -
>> - /* Padding to make cmap_bucket exactly one cache line long. */
>> -#if CMAP_PADDING > 0
>> - uint8_t pad[CMAP_PADDING];
>> -#endif
>> + );
>> };
>> BUILD_ASSERT_DECL(sizeof(struct cmap_bucket) == CACHE_LINE_SIZE);
>>
>> diff --git a/lib/util.h b/lib/util.h
>> index 3c43c2c..514fdaa 100644
>> --- a/lib/util.h
>> +++ b/lib/util.h
>> @@ -61,7 +61,7 @@ struct Bad_arg_to_ARRAY_SIZE {
>>
>>  /* This system's cache line size, in bytes.
>>   * Being wrong hurts performance but not correctness. */
>> -#define CACHE_LINE_SIZE 64
>> +#define CACHE_LINE_SIZE 128
>>  BUILD_ASSERT_DECL(IS_POW2(CACHE_LINE_SIZE));
>>
>> /* Cacheline marking is typically done using zero-sized array.
>> -----------------------------------------------------------------------------
>>
>>
>> Best regards, Ilya Maximets.
>>
>>>
>>>>
>>>> Most of the above issues are also true for some other padded/aligned
>>>> structures like 'struct netdev_dpdk'. They will be treated separately.
>>>>
>>>> CC: Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com>
>>>> CC: Ben Pfaff <blp at ovn.org>
>>>> Signed-off-by: Ilya Maximets <i.maximets at samsung.com>
>>>> ---
>>>>  lib/dpif-netdev.c | 160 +++++++++++++++++++++++--------------------------
>>>>  1 file changed, 69 insertions(+), 91 deletions(-)
>>>>
>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>> index 0a62630..6784269 100644
>>>> --- a/lib/dpif-netdev.c
>>>> +++ b/lib/dpif-netdev.c
>>>> @@ -547,31 +547,18 @@ struct tx_port {
>>>> * actions in either case.
>>>> * */
>>>> struct dp_netdev_pmd_thread {
>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline0,
>>>> - struct dp_netdev *dp;
>>>> - struct cmap_node node; /* In 'dp->poll_threads'. */
>>>> - pthread_cond_t cond; /* For synchronizing pmd thread
>>>> - reload. */
>>>> - );
>>>> -
>>>> -    PADDED_MEMBERS_CACHELINE_MARKER(CACHE_LINE_SIZE, cacheline1,
>>>> - struct ovs_mutex cond_mutex; /* Mutex for condition variable. */
>>>> - pthread_t thread;
>>>> - unsigned core_id; /* CPU core id of this pmd thread. */
>>>> - int numa_id; /* numa node id of this pmd thread. */
>>>> - );
>>>> + struct dp_netdev *dp;
>>>> +    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>> + struct cmap_node node; /* In 'dp->poll_threads'. */
>>>> +
>>>> + pthread_cond_t cond; /* For synchronizing pmd thread reload. */
>>>> + struct ovs_mutex cond_mutex; /* Mutex for condition variable. */
>>>>
>>>> /* Per thread exact-match cache. Note, the instance for cpu core
>>>> * NON_PMD_CORE_ID can be accessed by multiple threads, and thusly
>>>> * need to be protected by 'non_pmd_mutex'. Every other instance
>>>> * will only be accessed by its own pmd thread. */
>>>> - OVS_ALIGNED_VAR(CACHE_LINE_SIZE) struct emc_cache flow_cache;
>>>> -    struct ovs_refcount ref_cnt;    /* Every reference must be refcount'ed. */
>>>> -
>>>> - /* Queue id used by this pmd thread to send packets on all netdevs if
>>>> - * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>> - * than 'cmap_count(dp->poll_threads)'. */
>>>> - uint32_t static_tx_qid;
>>>> + struct emc_cache flow_cache;
>>>>
>>>> /* Flow-Table and classifiers
>>>> *
>>>> @@ -580,77 +567,68 @@ struct dp_netdev_pmd_thread {
>>>> * 'flow_mutex'.
>>>> */
>>>> struct ovs_mutex flow_mutex;
>>>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> - struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>> -
>>>> - /* One classifier per in_port polled by the pmd */
>>>> - struct cmap classifiers;
>>>> - /* Periodically sort subtable vectors according to hit frequencies */
>>>> - long long int next_optimization;
>>>> - /* End of the next time interval for which processing cycles
>>>> - are stored for each polled rxq. */
>>>> - long long int rxq_next_cycle_store;
>>>> -
>>>> - /* Cycles counters */
>>>> - struct dp_netdev_pmd_cycles cycles;
>>>> -
>>>> - /* Used to count cycles. See 'cycles_counter_end()'. */
>>>> - unsigned long long last_cycles;
>>>> - struct latch exit_latch; /* For terminating the pmd thread. */
>>>> - );
>>>> -
>>>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> - /* Statistics. */
>>>> - struct dp_netdev_pmd_stats stats;
>>>> -
>>>> - struct seq *reload_seq;
>>>> - uint64_t last_reload_seq;
>>>> - atomic_bool reload; /* Do we need to reload ports? */
>>>> - bool isolated;
>>>> -
>>>> - /* Set to true if the pmd thread needs to be reloaded. */
>>>> - bool need_reload;
>>>> - /* 5 pad bytes. */
>>>> - );
>>>> -
>>>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> - struct ovs_mutex port_mutex; /* Mutex for 'poll_list'
>>>> - and 'tx_ports'. */
>>>> - /* 16 pad bytes. */
>>>> - );
>>>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> - /* List of rx queues to poll. */
>>>> - struct hmap poll_list OVS_GUARDED;
>>>> - /* Map of 'tx_port's used for transmission. Written by the main
>>>> - * thread, read by the pmd thread. */
>>>> - struct hmap tx_ports OVS_GUARDED;
>>>> - );
>>>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> - /* These are thread-local copies of 'tx_ports'. One contains only
>>>> - * tunnel ports (that support push_tunnel/pop_tunnel), the other
>>>> - * contains ports with at least one txq (that support send).
>>>> - * A port can be in both.
>>>> - *
>>>> - * There are two separate maps to make sure that we don't try to
>>>> - * execute OUTPUT on a device which has 0 txqs or PUSH/POP on a
>>>> - * non-tunnel device.
>>>> - *
>>>> -        * The instances for cpu core NON_PMD_CORE_ID can be accessed by
>>>> -        * multiple threads and thusly need to be protected by 'non_pmd_mutex'.
>>>> -        * Every other instance will only be accessed by its own pmd thread. */
>>>> - struct hmap tnl_port_cache;
>>>> - struct hmap send_port_cache;
>>>> - );
>>>> -
>>>> - PADDED_MEMBERS(CACHE_LINE_SIZE,
>>>> - /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>> - * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>> - * values and subtracts them from 'stats' and 'cycles' before
>>>> - * reporting to the user */
>>>> - unsigned long long stats_zero[DP_N_STATS];
>>>> - uint64_t cycles_zero[PMD_N_CYCLES];
>>>> - /* 8 pad bytes. */
>>>> - );
>>>> + struct cmap flow_table OVS_GUARDED; /* Flow table. */
>>>> +
>>>> + /* One classifier per in_port polled by the pmd */
>>>> + struct cmap classifiers;
>>>> + /* Periodically sort subtable vectors according to hit frequencies */
>>>> + long long int next_optimization;
>>>> + /* End of the next time interval for which processing cycles
>>>> + are stored for each polled rxq. */
>>>> + long long int rxq_next_cycle_store;
>>>> +
>>>> + /* Statistics. */
>>>> + struct dp_netdev_pmd_stats stats;
>>>> +
>>>> + /* Cycles counters */
>>>> + struct dp_netdev_pmd_cycles cycles;
>>>> +
>>>> + /* Used to count cicles. See 'cycles_counter_end()' */
>>>> + unsigned long long last_cycles;
>>>> +
>>>> + struct latch exit_latch; /* For terminating the pmd thread. */
>>>> + struct seq *reload_seq;
>>>> + uint64_t last_reload_seq;
>>>> + atomic_bool reload; /* Do we need to reload ports? */
>>>> + pthread_t thread;
>>>> + unsigned core_id; /* CPU core id of this pmd thread. */
>>>> + int numa_id; /* numa node id of this pmd thread. */
>>>> + bool isolated;
>>>> +
>>>> + /* Queue id used by this pmd thread to send packets on all netdevs if
>>>> + * XPS disabled for this netdev. All static_tx_qid's are unique and less
>>>> + * than 'cmap_count(dp->poll_threads)'. */
>>>> + uint32_t static_tx_qid;
>>>> +
>>>> +    struct ovs_mutex port_mutex;    /* Mutex for 'poll_list' and 'tx_ports'. */
>>>> + /* List of rx queues to poll. */
>>>> + struct hmap poll_list OVS_GUARDED;
>>>> + /* Map of 'tx_port's used for transmission. Written by the main thread,
>>>> + * read by the pmd thread. */
>>>> + struct hmap tx_ports OVS_GUARDED;
>>>> +
>>>> +    /* These are thread-local copies of 'tx_ports'.  One contains only tunnel
>>>> +     * ports (that support push_tunnel/pop_tunnel), the other contains ports
>>>> +     * with at least one txq (that support send).  A port can be in both.
>>>> + *
>>>> +     * There are two separate maps to make sure that we don't try to execute
>>>> +     * OUTPUT on a device which has 0 txqs or PUSH/POP on a non-tunnel
>>>> +     * device.
>>>> +     *
>>>> +     * The instances for cpu core NON_PMD_CORE_ID can be accessed by multiple
>>>> +     * threads, and thusly need to be protected by 'non_pmd_mutex'.  Every
>>>> +     * other instance will only be accessed by its own pmd thread. */
>>>> + struct hmap tnl_port_cache;
>>>> + struct hmap send_port_cache;
>>>> +
>>>> + /* Only a pmd thread can write on its own 'cycles' and 'stats'.
>>>> + * The main thread keeps 'stats_zero' and 'cycles_zero' as base
>>>> + * values and subtracts them from 'stats' and 'cycles' before
>>>> + * reporting to the user */
>>>> + unsigned long long stats_zero[DP_N_STATS];
>>>> + uint64_t cycles_zero[PMD_N_CYCLES];
>>>> +
>>>> + /* Set to true if the pmd thread needs to be reloaded. */
>>>> + bool need_reload;
>>>> };
>>>>
>>>> /* Interface to netdev-based datapath. */
>>>> --
>>>> 2.7.4
>>>
>>>
>>>
>>>