[ovs-dev] [PATCH v4 2/5] bitmap: add bitmap_count1 function

Alexander Wu alexander.wu at huawei.com
Mon Dec 9 03:34:26 UTC 2013


On 07/12/2013 00:13, Jarno Rajahalme wrote:
>
> On Dec 6, 2013, at 1:18 AM, Alexander Wu <alexander.wu at huawei.com> wrote:
>
>> Hi Jarno,
>>
>> I've read your patch "better count1_bits", and I tested the gcc
>> builtins separately.
>>
>> Call __builtin_popcount|__builtin_popcountl|__builtin_popcountll 10 million times
>> --------------------------------------
>>     suse-kvm-of13:/test # time ./bit4
>>
>>     real    0m0.034s
>>     user    0m0.032s
>>     sys     0m0.000s
>>
>> Call count1_bits 10 million times
>> --------------------------------------
>>     suse-kvm-of13:/test # time ./bit1
>>
>>     real    0m0.080s
>>     user    0m0.076s
>>     sys     0m0.000s
>>
>> Looks good, but I have a problem, described below.
>> My cpuinfo: 16U * Intel(R) Xeon(R) CPU E5620 @ 2.40GHz. (westmere)
>> I've read the gcc source and found M_INTEL_COREI7_WESTMERE, which seems
>> to say westmere is corei7, but the following code doesn't work:
>>
>>     #if defined(__corei7)
>>         int i;
>>         for (i = 0; i < 10000000; i++)
>>             __builtin_popcount(i);
>>     #endif
>>
>
> You need to tell gcc to compile for your processor:
>
> $ echo | gcc -dM -E - | grep core
> $ echo | gcc -march=native -dM -E - | grep core
> #define __corei7 1
> #define __tune_corei7__ 1
> #define __corei7__ 1
> $
>
> Also, you need to be careful both to let the compiler optimize as we do when building OVS (-O2) and to make sure the test cases are not optimized away.
>
>> I believe there are some particular CPUs for which __builtin_popcount
>> is suitable. Is there any way to identify them?
>>
>
> I think it is trial and error, since the builtin popcount is kind of bad without direct CPU support.
>
>> On 06/12/2013 12:26, Ben Pfaff wrote:
>>>
>>> But I'm inclined to believe that a 65536-byte array wastes too much
>>> memory.
>>>
>
> I’m inclined to agree that it might waste too much (L1 cache) memory.
>
>     Jarno
>

Hi Jarno,

My gcc predefines __core2. But the performance of __builtin_popcount seems
to be worse when I add '-O2'. I'm not sure whether this reflects reality.

Here is part of my test code, the compile command, and the results.

Code:

     uint32_t i, last_bits;
     struct timespec start = {0};
     struct timespec end = {0};
     srand(time(NULL));
     int r = rand();
#define N_LOOP 100000
     int random_array[N_LOOP];

     srand(time(NULL));
     for (i = 0; i < N_LOOP; i++) {
         r = rand();
         random_array[i] = r;
     }

//__builtin_popcount
     clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
     for (i = 0; i < N_LOOP; i++) {
         last_bits = __builtin_popcount(random_array[i]);
     }
     clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
     printf("time-diff:%ld\n", end.tv_nsec - start.tv_nsec);
     printf("last-bits:%d\n", last_bits);

//original ovs count_1bits_32
     clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
     for (i = 0; i < N_LOOP; i++) {
         last_bits = count_1bits_32(random_array[i]);
     }
     clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
     printf("time-diff:%ld\n", end.tv_nsec - start.tv_nsec);
     printf("last-bits:%d\n", last_bits);

//simple foo function, to measure the cost of '=' and a function call.
     clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
     for (i = 0; i < N_LOOP; i++) {
         last_bits = foo();
     }
     clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
     printf("time-diff:%ld\n", end.tv_nsec - start.tv_nsec);
     printf("last-bits:%d\n", last_bits);

Compile:
     gcc bit1.c -o bit1 -march=native -mtune=native -lrt -O2  && ./bit1

Result:

     time-diff:1063893 //__builtin_popcount
     last-bits:10
     time-diff:293463  //original ovs count_1bits_32
     last-bits:10
     time-diff:188     //simple foo function, to measure the cost of '=' and a function call (it may have been optimized out)
     last-bits:99999

Result without -O2:

     time-diff:1317450
     last-bits:10
     time-diff:991438
     last-bits:10
     time-diff:416265
     last-bits:99999


Note that I use last_bits to store the return value. When I do, the
performance of __builtin_popcount seems to decrease; I guess that when
the result is unused the compiler is free to optimize the
__builtin_popcount call as it wishes, as it does with -O2.

So do you think this is enough to show that __builtin_popcount is not
suitable for __core2?
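
If so, one option might be to key on instruction support rather than on
CPU model macros such as __core2 or __corei7: GCC defines __POPCNT__ when
the POPCNT instruction is enabled for the target (for example with
-mpopcnt or a suitable -march). A minimal sketch of that idea; popcount32
and popcount32_swar are hypothetical names, and the fallback is the
generic bit-parallel count, not the actual OVS count_1bits_32
implementation:

```c
#include <stdint.h>

/* Portable fallback: bit-parallel (SWAR) population count.  Needs neither
 * a lookup table nor hardware support. */
static inline unsigned int
popcount32_swar(uint32_t x)
{
    x -= (x >> 1) & 0x55555555u;                      /* 2-bit sums. */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums. */
    x = (x + (x >> 4)) & 0x0f0f0f0fu;                 /* 8-bit sums. */
    return (x * 0x01010101u) >> 24;                   /* Add the 4 bytes. */
}

/* Hypothetical wrapper: use the builtin only when the compiler reports
 * that POPCNT is enabled, otherwise use the fallback above. */
static inline unsigned int
popcount32(uint32_t x)
{
#if defined(__GNUC__) && defined(__POPCNT__)
    return (unsigned int) __builtin_popcount(x);
#else
    return popcount32_swar(x);
#endif
}
```

Which branch was selected for a given build can be checked with
"echo | gcc -march=native -dM -E - | grep POPCNT".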

Best regards,
Alexander Wu




