[ovs-dev] [PATCH 1/4] compiler: Introduce OVS_PREFETCH variants.

Bodireddy, Bhanuprakash bhanuprakash.bodireddy at intel.com
Tue Mar 13 14:52:48 UTC 2018


>
>> -----Original Message-----
>> From: ovs-dev-bounces at openvswitch.org
>> [mailto:ovs-dev-bounces at openvswitch.org] On Behalf Of Bhanuprakash Bodireddy
>> Sent: Friday, January 12, 2018 5:41 PM
>> To: dev at openvswitch.org
>> Subject: [ovs-dev] [PATCH 1/4] compiler: Introduce OVS_PREFETCH variants.
>>
>> This commit introduces prefetch variants by using the GCC built-in
>> prefetch function.
>>
>> The prefetch variants gives the user better control on designing data
>> caching strategy in order to increase cache efficiency and minimize
>> cache pollution. Data reference patterns here can be classified in to
>>
>>  - Non-temporal(NT) - Data that is referenced once and not reused in
>>                       immediate future.
>>  - Temporal         - Data will be used again soon.
>>
>> The Macro variants can be used where there are
>>  - Predictable memory access patterns.
>>  - Execution pipeline can stall if data isn't available.
>>  - Time consuming loops.
>>
>> For example:
>>
>>   OVS_PREFETCH_CACHE(addr, OPCH_LTR)
>>     - OPCH_LTR : OVS PREFETCH CACHE HINT-LOW TEMPORAL READ.
>>     - __builtin_prefetch(addr, 0, 1)
>>     - Prefetch data in to L3 cache for readonly purpose.
>>
>>   OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>>     - OPCH_HTW : OVS PREFETCH CACHE HINT-HIGH TEMPORAL WRITE.
>>     - __builtin_prefetch(addr, 1, 3)
>>     - Prefetch data in to all caches in anticipation of write. In doing
>>       so it invalidates other cached copies so as to gain 'exclusive'
>>       access.
>>
>>   OVS_PREFETCH(addr)
>>     - OPCH_HTR : OVS PREFETCH CACHE HINT-HIGH TEMPORAL READ.
>>     - __builtin_prefetch(addr, 0, 3)
>>     - Prefetch data in to all caches in anticipation of read and that
>>       data will be used again soon (HTR - High Temporal Read).
>>
>> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy at intel.com>
>> ---
>>  include/openvswitch/compiler.h | 147 ++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 139 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/openvswitch/compiler.h b/include/openvswitch/compiler.h
>> index c7cb930..94bb24d 100644
>> --- a/include/openvswitch/compiler.h
>> +++ b/include/openvswitch/compiler.h
>> @@ -222,18 +222,149 @@
>>      static void f(void)
>>  #endif
>>
>> -/* OVS_PREFETCH() can be used to instruct the CPU to fetch the cache
>> - * line containing the given address to a CPU cache.
>> - * OVS_PREFETCH_WRITE() should be used when the memory is going to be
>> - * written to.  Depending on the target CPU, this can generate the same
>> - * instruction as OVS_PREFETCH(), or bring the data into the cache in an
>> - * exclusive state. */
>>  #if __GNUC__
>> -#define OVS_PREFETCH(addr) __builtin_prefetch((addr))
>> -#define OVS_PREFETCH_WRITE(addr) __builtin_prefetch((addr), 1)
>> +enum cache_locality {
>> +    NON_TEMPORAL_LOCALITY,
>> +    LOW_TEMPORAL_LOCALITY,
>> +    MODERATE_TEMPORAL_LOCALITY,
>> +    HIGH_TEMPORAL_LOCALITY
>> +};
>> +
>> +enum cache_rw {
>> +    PREFETCH_READ,
>> +    PREFETCH_WRITE
>> +};
>> +
>> +/* The prefetch variants gives the user better control on designing data
>> + * caching strategy in order to increase cache efficiency and minimize
>> + * cache pollution. Data reference patterns here can be classified in to
>> + *
>> + *   Non-temporal(NT) - Data that is referenced once and not reused in
>> + *                      immediate future.
>> + *   Temporal         - Data will be used again soon.
>> + *
>> + * The Macro variants can be used where there are
>> + *   o Predictable memory access patterns.
>> + *   o Execution pipeline can stall if data isn't available.
>> + *   o Time consuming loops.
>> + *
>> + * OVS_PREFETCH_CACHE() can be used to instruct the CPU to fetch the cache
>> + * line containing the given address to a CPU cache. The second argument
>> + * OPCH_XXR (or) OPCH_XXW is used to hint if the prefetched data is going
>> + * to be read or written to by core.
>> + *
>> + * Example Usage:
>> + *
>> + *   OVS_PREFETCH_CACHE(addr, OPCH_LTR)
>> + *       - OPCH_LTR : OVS PREFETCH CACHE HINT-LOW TEMPORAL READ.
>> + *       - __builtin_prefetch(addr, 0, 1)
>> + *       - Prefetch data in to L3 cache for readonly purpose.
>> + *
>> + *   OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>> + *       - OPCH_HTW : OVS PREFETCH CACHE HINT-HIGH TEMPORAL WRITE.
>> + *       - __builtin_prefetch(addr, 1, 3)
>> + *       - Prefetch data in to all caches in anticipation of write. In doing
>> + *         so it invalidates other cached copies so as to gain 'exclusive'
>> + *         access.
>> + *
>> + *   OVS_PREFETCH(addr)
>> + *       - OPCH_HTR : OVS PREFETCH CACHE HINT-HIGH TEMPORAL READ.
>> + *       - __builtin_prefetch(addr, 0, 3)
>> + *       - Prefetch data in to all caches in anticipation of read and that
>> + *         data will be used again soon (HTR - High Temporal Read).
>> + *
>> + * Implementation details of prefetch hint instructions may vary across
>> + * different processors and microarchitectures.
>
>Herein lies a potential problem, have you tested this on systems that have
>different interpretations of the prefetch hints? What about systems that
>don't support it?

[BHANU]
I have tested it on different Intel microarchitectures (Haswell, Broadwell, Skylake).
I understand that you are concerned about the ARM platform. ARM does support prefetch variants, and they provide the same functionality as on x86_64.

For example, consider the code snippet below, compiled on ARM64 with gcc 5.4:

void pref(void *p) {
  /* Read prefetches (rw = 0), locality 0 to 3. */
  __builtin_prefetch(p, 0, 0);
  __builtin_prefetch(p, 0, 1);
  __builtin_prefetch(p, 0, 2);
  __builtin_prefetch(p, 0, 3);

  /* Write prefetches (rw = 1), locality 0 to 3. */
  __builtin_prefetch(p, 1, 0);
  __builtin_prefetch(p, 1, 1);
  __builtin_prefetch(p, 1, 2);
  __builtin_prefetch(p, 1, 3);
}

On ARM64 (gcc 5.4):

pref:
        prfm    PLDL1STRM, [x0]
        prfm    PLDL3KEEP, [x0]
        prfm    PLDL2KEEP, [x0]
        prfm    PLDL1KEEP, [x0]
        prfm    PSTL1STRM, [x0]
        prfm    PSTL3KEEP, [x0]
        prfm    PSTL2KEEP, [x0]
        prfm    PSTL1KEEP, [x0]
        ret

Instruction details: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/PRFM_imm.html

The easiest way to check the generated code for different platforms and compiler versions is to use https://gcc.godbolt.org/

>
>In some cases OVS will be compiled on one system but then deployed on
>another, they might not be the same HW platform. What happens in that
>case?

If the target doesn't support the prefetch instruction, it would be treated as a NOP on that platform; it should not cause application crashes or performance penalties.
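
A minimal sketch of the "hint only" behaviour, under stated assumptions: the names SKETCH_PREFETCH, PREFETCH_AHEAD, sum_with_prefetch() and sketch_prefetch_unmapped() are hypothetical and not part of the patch, and the look-ahead distance of 8 elements is arbitrary. GCC documents that __builtin_prefetch does not generate a fault for an invalid address as long as the address expression itself is valid, and the non-GCC fallback below simply compiles the hint away, much like the #else branch in compiler.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrapper: high temporal locality read prefetch on GCC-like
 * compilers, nothing at all elsewhere. */
#if defined(__GNUC__)
#define SKETCH_PREFETCH(addr) __builtin_prefetch((addr), 0, 3)
#else
#define SKETCH_PREFETCH(addr) ((void) 0)
#endif

/* Arbitrary look-ahead distance, in elements (an assumption to be tuned). */
#define PREFETCH_AHEAD 8

/* Sums 'n' elements of 'a', hinting the element PREFETCH_AHEAD iterations
 * ahead so that it may already be cached when the loop reaches it. */
static uint64_t
sum_with_prefetch(const uint32_t *a, size_t n)
{
    uint64_t sum = 0;

    assert(a != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            SKETCH_PREFETCH(&a[i + PREFETCH_AHEAD]);
        }
        sum += a[i];
    }
    return sum;
}

/* A prefetch is only a hint: issuing it for an unmapped address must not
 * fault.  Returns 1 if execution gets past the hint. */
static int
sketch_prefetch_unmapped(void)
{
    SKETCH_PREFETCH((const void *) 0);
    return 1;
}
```

The result of sum_with_prefetch() is the same with or without the hint; only the timing can differ, which is why a target that ignores the instruction still behaves correctly.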

>
>Will it behave as expected i.e. similar fashion to how prefetch currently
>behaves?

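Yes. OVS_PREFETCH(addr) now expands to __builtin_prefetch(addr, 0, 3) and OVS_PREFETCH_WRITE(addr) to __builtin_prefetch(addr, 1, 3), which are the values the existing one- and two-argument calls already use by default.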

>
>> + *
>> + * OPCH_NTW, OPCH_LTW, OPCH_MTW uses prefetchwt1 instruction and OPCH_HTW
>> + * uses prefetchw instruction when available. Refer Documentation on how
>> + * to enable prefetchwt1 instruction.
>
>Just to clarify, Is it HW documentation for a user's setup they must refer to?

[BHANU]
No, I meant the OVS documentation in this patch:
https://mail.openvswitch.org/pipermail/ovs-dev/2018-January/343101.html


>Are there any extra setup steps for compilers etc. for these instructions?

[BHANU] Yes, and this is covered in the documentation at the link above.
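
As a rough sketch of how a build could check which write-prefetch instructions the compiler has been told it may emit: to the best of my knowledge, GCC's x86 back end predefines __PRFCHW__ when -mprfchw is in effect and __PREFETCHWT1__ when -mprefetchwt1 is in effect. The SKETCH_* names and both functions below are hypothetical and not part of the patch or of OVS; on non-x86 targets or without those options both flags simply come out as 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical flag bits, for illustration only. */
enum {
    SKETCH_HAVE_PREFETCHW   = 1 << 0,  /* __PRFCHW__ predefined. */
    SKETCH_HAVE_PREFETCHWT1 = 1 << 1   /* __PREFETCHWT1__ predefined. */
};

/* Returns a bitmask describing which write-prefetch instructions were
 * enabled at compile time, judged only by the compiler's predefined macros. */
static int
sketch_write_prefetch_flags(void)
{
    int flags = 0;

#ifdef __PRFCHW__
    flags |= SKETCH_HAVE_PREFETCHW;
#endif
#ifdef __PREFETCHWT1__
    flags |= SKETCH_HAVE_PREFETCHWT1;
#endif
    return flags;
}

/* Formats the compile-time state into 'buf' as a short human-readable line. */
static void
sketch_describe_write_prefetch(char *buf, size_t len)
{
    int flags = sketch_write_prefetch_flags();

    assert(buf != NULL && len > 0);
    snprintf(buf, len, "prefetchw=%s prefetchwt1=%s",
             (flags & SKETCH_HAVE_PREFETCHW) ? "yes" : "no",
             (flags & SKETCH_HAVE_PREFETCHWT1) ? "yes" : "no");
}
```

This only reports what the compiler was allowed to generate; whether the deployment CPU implements the instruction is a separate, run-time question.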
>
>I would expect something like this to be added to the OVS docs.
>
>> + *
>> + * PREFETCH HINT    Instruction     GCC builtin function
>> + * -------------------------------------------------------
>> + *   OPCH_NTR       prefetchnta  __builtin_prefetch(a, 0, 0)
>> + *   OPCH_LTR       prefetcht2   __builtin_prefetch(a, 0, 1)
>> + *   OPCH_MTR       prefetcht1   __builtin_prefetch(a, 0, 2)
>> + *   OPCH_HTR       prefetcht0   __builtin_prefetch(a, 0, 3)
>> + *
>> + *   OPCH_NTW       prefetchwt1  __builtin_prefetch(a, 1, 0)
>> + *   OPCH_LTW       prefetchwt1  __builtin_prefetch(a, 1, 1)
>> + *   OPCH_MTW       prefetchwt1  __builtin_prefetch(a, 1, 2)
>> + *   OPCH_HTW       prefetchw    __builtin_prefetch(a, 1, 3)
>> + *
>> + * */
>> +#define OVS_PREFETCH_CACHE_HINT                                               \
>> +    OPCH(OPCH_NTR, PREFETCH_READ, NON_TEMPORAL_LOCALITY,                      \
>> +         "Fetch data to non-temporal cache close to processor"                \
>> +         "to minimize cache pollution")                                       \
>> +    OPCH(OPCH_LTR, PREFETCH_READ, LOW_TEMPORAL_LOCALITY,                      \
>> +         "Fetch data to L2 and L3 cache")                                     \
>> +    OPCH(OPCH_MTR, PREFETCH_READ, MODERATE_TEMPORAL_LOCALITY,                 \
>> +         "Fetch data to L2 and L3 caches, same as LTR on"                     \
>> +         "Nehalem, Westmere, Sandy Bridge and newer microarchitectures")      \
>> +    OPCH(OPCH_HTR, PREFETCH_READ, HIGH_TEMPORAL_LOCALITY,                     \
>> +         "Fetch data in to all cache levels L1, L2 and L3")                   \
>> +    OPCH(OPCH_NTW, PREFETCH_WRITE, NON_TEMPORAL_LOCALITY,                     \
>> +         "Fetch data to L2 and L3 cache in exclusive state"                   \
>> +         "in anticipation of write")                                          \
>> +    OPCH(OPCH_LTW, PREFETCH_WRITE, LOW_TEMPORAL_LOCALITY,                     \
>> +         "Fetch data to L2 and L3 cache in exclusive state")                  \
>> +    OPCH(OPCH_MTW, PREFETCH_WRITE, MODERATE_TEMPORAL_LOCALITY,                \
>> +         "Fetch data in to L2 and L3 caches in exclusive state")              \
>> +    OPCH(OPCH_HTW, PREFETCH_WRITE, HIGH_TEMPORAL_LOCALITY,                    \
>> +         "Fetch data in to all cache levels in exclusive state")
>> +
>> +/* Indexes for cache prefetch types. */
>> +enum {
>> +#define OPCH(ENUM, RW, LOCALITY, EXPLANATION) ENUM##_INDEX,
>> +    OVS_PREFETCH_CACHE_HINT
>> +#undef OPCH
>> +};
>> +
>> +/* Cache prefetch types. */
>> +enum ovs_prefetch_type {
>> +#define OPCH(ENUM, RW, LOCALITY, EXPLANATION) ENUM = 1 << ENUM##_INDEX,
>> +    OVS_PREFETCH_CACHE_HINT
>> +#undef OPCH
>> +};
>> +
>> +#define OVS_PREFETCH_CACHE(addr, TYPE) switch(TYPE)                           \
>
>Checkpatch caught the following:
>
>ERROR: Improper whitespace around control block
>#164 FILE: include/openvswitch/compiler.h:331:
>#define OVS_PREFETCH_CACHE(addr, TYPE) switch(TYPE)                           \
>
>Lines checked: 204, Warnings: 0, Errors: 1

[BHANU]
I will fix this.
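
For illustration only, one possible shape of the corrected macro. The only change checkpatch asks for is the space in "switch (TYPE)"; the do { } while (0) wrapper is an extra suggestion so that the macro behaves as a single statement, and is not something the patch does. All SKETCH_* names and sketch_increment() are hypothetical stand-ins for the OPCH_* hints, not the patch's identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the eight OPCH_* hints in the patch. */
enum sketch_hint {
    SKETCH_NTR, SKETCH_LTR, SKETCH_MTR, SKETCH_HTR,
    SKETCH_NTW, SKETCH_LTW, SKETCH_MTW, SKETCH_HTW
};

#if defined(__GNUC__)
/* __builtin_prefetch() needs compile-time constant 'rw' and 'locality'
 * arguments, hence one case per hint. */
#define SKETCH_PREFETCH_CACHE(addr, TYPE)                                     \
    do {                                                                      \
        switch (TYPE) {                                                       \
        case SKETCH_NTR:                                                      \
            __builtin_prefetch((addr), 0, 0);                                 \
            break;                                                            \
        case SKETCH_LTR:                                                      \
            __builtin_prefetch((addr), 0, 1);                                 \
            break;                                                            \
        case SKETCH_MTR:                                                      \
            __builtin_prefetch((addr), 0, 2);                                 \
            break;                                                            \
        case SKETCH_HTR:                                                      \
            __builtin_prefetch((addr), 0, 3);                                 \
            break;                                                            \
        case SKETCH_NTW:                                                      \
            __builtin_prefetch((addr), 1, 0);                                 \
            break;                                                            \
        case SKETCH_LTW:                                                      \
            __builtin_prefetch((addr), 1, 1);                                 \
            break;                                                            \
        case SKETCH_MTW:                                                      \
            __builtin_prefetch((addr), 1, 2);                                 \
            break;                                                            \
        case SKETCH_HTW:                                                      \
            __builtin_prefetch((addr), 1, 3);                                 \
            break;                                                            \
        }                                                                     \
    } while (0)
#else
#define SKETCH_PREFETCH_CACHE(addr, TYPE) ((void) 0)
#endif

/* Example caller: hint that '*counter' is about to be written, then update
 * it.  Returns the new value. */
static int
sketch_increment(int *counter)
{
    assert(counter != NULL);
    SKETCH_PREFETCH_CACHE(counter, SKETCH_HTW);
    return ++*counter;
}
```

When TYPE is a constant, as in sketch_increment(), an optimizing compiler can reduce the switch to the single matching prefetch.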

>> +{                                                                             \
>> +    case OPCH_NTR:                                                            \
>> +        __builtin_prefetch((addr), PREFETCH_READ, NON_TEMPORAL_LOCALITY);     \
>> +        break;                                                                \
>> +    case OPCH_LTR:                                                            \
>> +        __builtin_prefetch((addr), PREFETCH_READ, LOW_TEMPORAL_LOCALITY);     \
>> +        break;                                                                \
>> +    case OPCH_MTR:                                                            \
>> +        __builtin_prefetch((addr), PREFETCH_READ,                             \
>> +                           MODERATE_TEMPORAL_LOCALITY);                       \
>> +        break;                                                                \
>> +    case OPCH_HTR:                                                            \
>> +        __builtin_prefetch((addr), PREFETCH_READ, HIGH_TEMPORAL_LOCALITY);    \
>> +        break;                                                                \
>> +    case OPCH_NTW:                                                            \
>> +        __builtin_prefetch((addr), PREFETCH_WRITE, NON_TEMPORAL_LOCALITY);    \
>> +        break;                                                                \
>> +    case OPCH_LTW:                                                            \
>> +        __builtin_prefetch((addr), PREFETCH_WRITE, LOW_TEMPORAL_LOCALITY);    \
>> +        break;                                                                \
>> +    case OPCH_MTW:                                                            \
>> +        __builtin_prefetch((addr), PREFETCH_WRITE,                            \
>> +                           MODERATE_TEMPORAL_LOCALITY);                       \
>> +        break;                                                                \
>> +    case OPCH_HTW:                                                            \
>> +        __builtin_prefetch((addr), PREFETCH_WRITE, HIGH_TEMPORAL_LOCALITY);   \
>> +        break;                                                                \
>> +}
>> +
>> +/* Retain this for backward compatibility. */
>> +#define OVS_PREFETCH(addr) OVS_PREFETCH_CACHE(addr, OPCH_HTR)
>> +#define OVS_PREFETCH_WRITE(addr) OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>>  #else
>>  #define OVS_PREFETCH(addr)
>>  #define OVS_PREFETCH_WRITE(addr)
>> +#define OVS_PREFETCH_CACHE(addr, OP)
>>  #endif
>>
>>  /* Build assertions.
>> --
>> 2.4.11
>>
>> _______________________________________________
>> dev mailing list
>> dev at openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
