[ovs-dev] [PATCH RFC 2/5] configure: Include -mprefetchwt1 explicitly.

Ilya Maximets i.maximets at samsung.com
Tue Dec 5 15:00:10 UTC 2017


On 05.12.2017 16:54, Bodireddy, Bhanuprakash wrote:
>>>> On Mon, Dec 04, 2017 at 08:16:47PM +0000, Bhanuprakash Bodireddy wrote:
>>>>> Processors support a prefetch instruction in anticipation of a write, but
>>>>> compilers (gcc) won't use it unless explicitly asked to do so, even
>>>>> with '-march=native' specified.
>>>>>
>>>>> [Problem]
>>>>>   Case A:
>>>>>     OVS_PREFETCH_CACHE(addr, OPCH_HTW)
>>>>>        __builtin_prefetch(addr, 1, 3)
>>>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>>>          prefetchw  (%rax)
>>>>>
>>>>>   Case B:
>>>>>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>>>>        __builtin_prefetch(addr, 1, 1)
>>>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>>>          prefetchw  (%rax)             <***problem***>
>>>>>
>>>>>   In spite of specifying -march=native and using Low Temporal
>>>>>   Write (OPCH_LTW), the compiler generates a 'prefetchw' instruction
>>>>>   instead of the 'prefetchwt1' instruction available on the processor.
>>>>>
>>>>> [Solution]
>>>>>   Include -mprefetchwt1
>>>>>
>>>>>   Case B:
>>>>>     OVS_PREFETCH_CACHE(addr, OPCH_LTW)
>>>>>        __builtin_prefetch(addr, 1, 1)
>>>>>          leaq    -112(%rbp), %rax        [Assembly]
>>>>>          prefetchwt1  (%rax)
>>>>>
>>>>> [Testing]
>>>>>   $ ./boot.sh
>>>>>   $ ./configure
>>>>>      checking target hint for cgcc... x86_64
>>>>>      checking whether gcc accepts -mprefetchwt1... yes
>>>>>   $ make -j
>>>>>
>>>>> Signed-off-by: Bhanuprakash Bodireddy <bhanuprakash.bodireddy at
>>>>> intel.com>
>>>>
>>>> Does this have any effect if the architecture or CPU configured for
>>>> use does not support prefetchwt1?
>>>
>>> That's a good question, and I spent a reasonable amount of time today figuring this out.
>>> I have Haswell, Broadwell and Skylake CPUs and they all support this
>>> instruction.
>>
>> Hmm. I have 2 different Broadwell machines (Xeon E5 v4 and i7-6800K) and
>> neither of them has the prefetchwt1 instruction according to cpuid:
>>
>> 	PREFETCHWT1                              = false
> 
> Xeon E5-26XX v4 is a Broadwell workstation/server part, but i7-6800K is a Skylake desktop variant, whereas E3-12XX v5 is the equivalent Skylake workstation/server variant.
> AFAIK, prefetchwt1 should be available on the above processors; I am not sure why cpuid reports otherwise.

That is totally weird. I tried to compile the following simple program:

int main()
{
        int c;

        __builtin_prefetch(&c, 1, 1);
        c = 8;

        return c;
}

on my old Ivy Bridge i7-3770 CPU. It does not even support 'prefetchw':

      PREFETCHWT1                              = false
      3DNow! PREFETCH/PREFETCHW instructions = false

Results:

$ gcc 1.c 
$ objdump -S ./a.out | grep prefetch -A2 -B2
  40055b:       31 c0                   xor    %eax,%eax
  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
  400561:       0f 18 18                prefetcht2 (%rax)
  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax

$ gcc 1.c -march=native
$ objdump -S ./a.out | grep prefetch -A2 -B2
  40055b:       31 c0                   xor    %eax,%eax
  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
  400561:       0f 18 18                prefetcht2 (%rax)
  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax

$ gcc 1.c -march=native -mprefetchwt1
$ objdump -S ./a.out | grep prefetch -A2 -B2
  40055b:       31 c0                   xor    %eax,%eax
  40055d:       48 8d 45 f4             lea    -0xc(%rbp),%rax
  400561:       0f 0d 10                prefetchwt1 (%rax)
  400564:       c7 45 f4 08 00 00 00    movl   $0x8,-0xc(%rbp)
  40056b:       8b 45 f4                mov    -0xc(%rbp),%eax

So, it inserts this instruction even though I have no such instruction in the CPU.
More interesting is that the program still works without any issues.
I assume that the CPU just skips that instruction or executes something else.
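
The experiment above suggests the choice is made from the compiler flags, not
from the CPU the build runs on. A minimal sketch to make that visible from
inside the program: it assumes gcc predefines __PRFCHW__ and __PREFETCHWT1__
when -mprfchw and -mprefetchwt1 are in effect (my understanding, not something
checked for every gcc version), and the BUILD_ALLOWS_* macros and helper names
are made up for illustration.

```c
#include <stdio.h>

/* Assumption: gcc predefines these macros when the matching -m flag (or a
 * -march value that implies it) is in effect.  If a compiler does not know
 * them, both simply evaluate to 0. */
#ifdef __PRFCHW__
# define BUILD_ALLOWS_PREFETCHW 1
#else
# define BUILD_ALLOWS_PREFETCHW 0
#endif

#ifdef __PREFETCHWT1__
# define BUILD_ALLOWS_PREFETCHWT1 1
#else
# define BUILD_ALLOWS_PREFETCHWT1 0
#endif

/* Hypothetical helper: describes which write-prefetch ISA extension the
 * compiler was allowed to use for this translation unit. */
static const char *
build_write_prefetch_support(void)
{
    if (BUILD_ALLOWS_PREFETCHWT1) {
        return "prefetchwt1 enabled at build time";
    }
    if (BUILD_ALLOWS_PREFETCHW) {
        return "prefetchw enabled at build time";
    }
    return "no write-prefetch extension enabled at build time";
}

/* Same body as the test program: a low-temporal-locality write prefetch
 * followed by a store.  The prefetch is only a hint. */
static int
touch_with_write_hint(void)
{
    int c;

    __builtin_prefetch(&c, 1, 1);
    c = 8;

    return c;
}

/* Hypothetical helper: prints the build-time view. */
static void
print_build_report(void)
{
    printf("%s\n", build_write_prefetch_support());
}
```

Calling print_build_report() from main() and building once with and once
without -mprefetchwt1 should show the report changing on the same machine,
which says nothing about what the CPU supports.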

So, it's really strange, and it's unclear what the CPU really executes in
the case where we have 'prefetchwt1' in the code but it is not supported by the CPU.

If the CPU just skips this instruction, we will lose all the prefetching optimizations
because all the calls will be replaced by the non-existent 'prefetchwt1'.

How can we be sure that 'prefetchwt1' was really executed?
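
This does not answer what the CPU does with the opcode, but the program can at
least ask the CPU whether it advertises the instruction, which is the same
information the cpuid tool prints. A minimal sketch, assuming the feature bits
are CPUID.(EAX=07H,ECX=0):ECX[0] for PREFETCHWT1 and CPUID.80000001H:ECX[8]
for PREFETCHW, and using the helpers from gcc's <cpuid.h>; the function names
are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Decodes ECX of CPUID leaf 7, subleaf 0: bit 0 advertises PREFETCHWT1. */
static int
leaf7_ecx_has_prefetchwt1(uint32_t ecx)
{
    return (int) (ecx & 1u);
}

/* Decodes ECX of CPUID leaf 0x80000001: bit 8 advertises PREFETCHW. */
static int
ext_ecx_has_prefetchw(uint32_t ecx)
{
    return (int) ((ecx >> 8) & 1u);
}

/* Returns 1 if the running CPU advertises PREFETCHWT1, otherwise 0
 * (also 0 on non-x86 builds). */
static int
cpu_has_prefetchwt1(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (__get_cpuid_max(0, NULL) < 7) {
        return 0;               /* Leaf 7 is not implemented. */
    }
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    (void) eax; (void) ebx; (void) edx;
    return leaf7_ecx_has_prefetchwt1(ecx);
#else
    return 0;
#endif
}

/* Returns 1 if the running CPU advertises PREFETCHW, otherwise 0
 * (also 0 on non-x86 builds). */
static int
cpu_has_prefetchw(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (__get_cpuid_max(0x80000000, NULL) < 0x80000001) {
        return 0;               /* Extended leaf is not implemented. */
    }
    __cpuid(0x80000001, eax, ebx, ecx, edx);
    (void) eax; (void) ebx; (void) edx;
    return ext_ecx_has_prefetchw(ecx);
#else
    return 0;
#endif
}
```

A check like this only tells us what the CPU reports. It could gate the choice
between prefetch variants at run time, but it cannot prove that a given
prefetch actually did anything.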

Best regards, Ilya Maximets.

> 
> pmd_thread_main()
> -------------------------------------------------------------------------------------------
> WITH OPCH_HTW, we see prefetchw instruction. 
> 
> OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_HTW);
>     cycles_count_start(pmd);
>     for (;;) {
>         for (i = 0; i < poll_cnt; i++) {
>             process_packets =
>                 dp_netdev_process_rxq_port(pmd, poll_list[i].rxq->rx,
>                                            poll_list[i].port_no);
>             cycles_count_intermediate(pmd, poll_list[i].rxq,
> 
> 
> Address	Source Line	Assembly	
> 0x6e29ef	4,086	movl  0x823ecb(%rip), %edi							
> 0x6e29f5	4,085	movq  0x50(%rsp), %rax							
> 0x6e29fa	4,086	test %edi, %edi							
> 0x6e29fc	4,085	prefetchw  (%rax)
> ----------------------------------------------------------------------------------------
> With OPCH_LTW, we can see the prefetchwt1 instruction being used (change made to show this).
> 
> OVS_PREFETCH_CACHE(&pmd->cachelineC, OPCH_LTW);
>     cycles_count_start(pmd);
>     for (;;) {
>         for (i = 0; i < poll_cnt; i++) {
>             ..........
> 
> Address	Source Line	Assembly	
> 0x6e29ef	4,086	movl  0x823ecb(%rip), %edi							
> 0x6e29f5	4,085	movq  0x50(%rsp), %rax							
> 0x6e29fa	4,086	test %edi, %edi							
> 0x6e29fc	4,085	prefetchwt1  (%rax)
> -----------------------------------------------------------------------------------------
> 
>>
>> This means that introducing this change will break binary compatibility even
>> between CPUs of the same generation, i.e. I will not be able to run binaries
>> compiled on your system on mine.
>>
>> If it's true I prefer to not have this change.
>>
>> Anyway, adding this change will make compiling a generic binary for
>> different platforms impossible if your build server supports prefetchwt1.
>> There should be a way to disable this arch-specific compiler flag even if it
>> is supported on my current platform.
> 
> I see your point: a build server can be advanced and support the prefetchwt1 instruction,
> and when I copy the precompiled binaries to a server that does not support it and run them, how do they behave?
> 
> Not sure about this. Maybe Red Hat/Canonical developers can comment on how they handle this kind of case.
> 
> I will try to check this on my side.
> 
> - Bhanuprakash.
> 
>>
>> Best regards, Ilya Maximets.
>>
>>> But I found that this instruction isn't enabled by default even with
>>> -march=native, so it needs to be enabled explicitly.
>>>
>>> Coming to your question, there won't be side effects from using OPCH_LTW.
>>> On processors that *don't* support PREFETCHW and PREFETCHWT1, the
>>> compiler generates a 'prefetcht1' instruction.
>>> On processors that support PREFETCHW, the compiler generates a 'prefetchw'
>>> instruction.
>>> On processors that support PREFETCHW & PREFETCHWT1, the compiler
>>> generates a 'prefetchwt1' instruction with -mprefetchwt1 explicitly enabled.
>>>
>>>> If it could lead to that situation, then this does not seem like the
>>>> right thing to do, and we might want to fall back to recommending use
>>>> of the option when the person building knows that the software will
>>>> run on a machine with prefetchwt1.
>>>
>>> According to the above, on processors that don't support this instruction,
>>> a 'prefetcht1' instruction would be generated, and it has no side effects.
>>> I verified this using https://gcc.godbolt.org/ and by carefully checking the
>>> instructions generated for different compiler versions and -march flags.
>>>
>>> - Bhanuprakash.

