mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Fri Dec 21, 2012 2:48 pm Post subject: |
|
|
| Quote: | | I am wondering how pgcc optimize the code so that it can have a very good performance? |
While my gcc time isn't as bad as yours, yes, PGI is much faster, even without optimization. With "-Minfo" you can see some of the optimizations performed. Also, I believe our log function is faster.
| Code: | % gcc interesting.c -O0 -std=c99 -lm ; a.out
Result is 2185160991.755704, time is 9168.019775
% gcc interesting.c -O3 -ftree-vectorize -ffast-math -std=c99 -lm ; a.out
Result is 2185160991.755704, time is 6343.856934
% pgcc interesting.c -Minfo -O0 -Msafeptr ; a.out
Result is 2185160991.755704, time is 7534.363037
% pgcc interesting.c -Minfo -fast -Msafeptr ; a.out
main:
55, Loop not vectorized: may not be beneficial
Unrolled inner loop 8 times
Residual loop unrolled 4 times (completely unrolled)
67, Memory set idiom, loop replaced by call to __c_mset8
75, Generated vector sse code for the loop
Generated a prefetch instruction for the loop
83, Generated vector sse code for the loop
Generated a prefetch instruction for the loop
Result is 2185160991.755704, time is 4462.258057
|
| Quote: | | Is there anything I can do to minimize that time? |
Hoist the data region above the outer loop and put the data init routines on the accelerator:
| Code: | % cat interesting.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include <time.h>
#define T 4756
void stopwatch_start(double* stopwatch) {
    struct timeval tval;
    gettimeofday(&tval, NULL);
    *stopwatch = (tval.tv_sec * 1000 + tval.tv_usec/1000.0);
}
/**
 * Returns the elapsed time since the stopwatch started via stopwatch_start
 * @param[in] stopwatch the stopwatch handler
 * @return elapsed time in milliseconds
 */
double stopwatch_elapsed(double* stopwatch) {
    struct timeval tval;
    gettimeofday(&tval, NULL);
    return ((tval.tv_sec * 1000 + tval.tv_usec/1000.0) - *stopwatch);
}
double fRand(double fMin, double fMax)
{
    double f = (double)rand() / RAND_MAX;
    return fMin + f * (fMax - fMin);
}
int main(void){
    unsigned int len1 = 3182;
    int *temp1=(int *)calloc(T,sizeof(int));
    double * temp2 = (double *) calloc(T*32, sizeof(double)); /* flattened T x 32 array of doubles */
    /* for (int i = 0; i < T; i++){
    temp2[i] = (double *) calloc(32, sizeof(double));
    }
    */
    double a = 1.0f;
    double b = 2.0f;
    double c = 3.0f;
    double reduction_1 = 0;
    double timer;
    stopwatch_start(&timer);
    // high level loop
    #pragma acc data create(temp1[0:T],temp2[0:T*32])
    {
        for(int r_est = 0; r_est < len1; r_est ++){
            #pragma acc kernels
            {
                for(int i = 0; i < T;i++)
                {
                    if(i%2 == 0)
                        temp1[i] = 0;
                    else
                        temp1[i] = 1;
                    // printf("Temp1 is %d\n",temp1[i]);
                }
                #pragma acc loop independent
                for(int i = 0; i < T; i++)
                {
                    #pragma acc loop independent
                    for(int j = 0; j < 32; j++)
                        temp2[i*32+j] = 1.5f;
                }
                // printf("enter %d times\n",r_est);
                for(int n = 0; n < T; n++){
                    double tem = 0;
                    if (temp1[n]== 1){
                        for(int f= 0;f<32;f++)
                            tem += (a+b)*log(c+a*temp2[n*32+f]);
                    }
                    reduction_1 += tem;
                }
                for(int n = 0; n < T; n++){
                    double tem = 0;
                    if (temp1[n]== 0){
                        for(int f= 0;f<32;f++)
                            tem += (a+b)*log(c+a*temp2[n*32+f]);
                    }
                    reduction_1 += tem;
                }
            }
        }
    }
    printf("Result is %f, time is %f\n",reduction_1,stopwatch_elapsed(&timer));
    return 0; /* 0 signals success to the shell */
}
% pgcc interesting.c -Minfo=accel -fast -acc -ta=nvidia,4.2 -Msafeptr ; a.out
main:
49, Generating create(temp2[0:152192])
Generating create(temp1[0:4756])
53, Generating present_or_create(temp2[0:152192])
Generating present_or_create(temp1[0:4756])
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
55, Loop is parallelizable
Accelerator kernel generated
55, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
CC 1.3 : 8 registers; 32 shared, 20 constant, 0 local memory bytes
CC 2.0 : 12 registers; 0 shared, 56 constant, 0 local memory bytes
64, Loop is parallelizable
67, Loop is parallelizable
Accelerator kernel generated
64, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
67, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
CC 1.3 : 11 registers; 32 shared, 20 constant, 0 local memory bytes
CC 2.0 : 14 registers; 0 shared, 56 constant, 0 local memory bytes
72, Loop is parallelizable
Accelerator kernel generated
72, #pragma acc loop gang /* blockIdx.x */
CC 1.3 : 31 registers; 96 shared, 60 constant, 0 local memory bytes
CC 2.0 : 28 registers; 32 shared, 112 constant, 0 local memory bytes
75, #pragma acc loop vector(32) /* threadIdx.x */
78, Sum reduction generated for reduction_1
75, Loop is parallelizable
80, Loop is parallelizable
Accelerator kernel generated
80, #pragma acc loop gang /* blockIdx.x */
CC 1.3 : 31 registers; 96 shared, 60 constant, 0 local memory bytes
CC 2.0 : 28 registers; 32 shared, 112 constant, 0 local memory bytes
83, #pragma acc loop vector(32) /* threadIdx.x */
86, Sum reduction generated for reduction_1
83, Loop is parallelizable
Result is 2185160991.286413, time is 1749.938965
Accelerator Kernel Timing data
interesting.c
main
53: region entered 3182 times
time(us): total=1,645,141 init=183 region=1,644,958
kernels=976,429
w/o init: total=1,644,958 max=189,133 min=440 avg=516
55: kernel launched 3182 times
grid: [38] block: [128]
time(us): total=39,806 max=197 min=11 avg=12
67: kernel launched 3182 times
grid: [1x595] block: [32x8]
time(us): total=64,299 max=38 min=18 avg=20
72: kernel launched 3182 times
grid: [4756] block: [32]
time(us): total=373,710 max=444 min=114 avg=117
78: kernel launched 3182 times
grid: [1] block: [256]
time(us): total=63,636 max=42 min=14 avg=19
80: kernel launched 3182 times
grid: [4756] block: [32]
time(us): total=371,370 max=304 min=113 avg=116
86: kernel launched 3182 times
grid: [1] block: [256]
time(us): total=63,608 max=42 min=18 avg=19
interesting.c
main
49: region entered 1 time
time(us): total=1,749,932 init=103,785 region=1,646,147
w/o init: total=1,646,147 max=1,646,147 min=1,646,147 avg=1,646,147
|
|
|
bo_fang
Joined: 21 Dec 2012 Posts: 2
|
Posted: Fri Dec 21, 2012 3:30 pm Post subject: |
|
|
Hi Mat,
I modified the code as you suggested, and my gcc version is still not good :). Perhaps it has something to do with the CPU that I am using.
Anyway, for GPU+OpenACC, my new timing info is below. My initialization time is still 3s. Could it be because I have three GPUs on my machine and I am only using one of them? I couldn't find a suitable command to disable, or avoid initializing, the other two GPUs.
| Code: |
[bo@node240 ~]$ ./a.out
Result is 2185160991.286413, time is 4671.968018
Accelerator Kernel Timing data
/home/bo/test1.c
main
54: region entered 3182 times
time(us): total=1,425,879 init=248 region=1,425,631
kernels=1,099,710
w/o init: total=1,425,631 max=1,793 min=444 avg=448
57: kernel launched 3182 times
grid: [38] block: [128]
time(us): total=14,595 max=17 min=4 avg=4
68: kernel launched 3182 times
grid: [1x595] block: [32x8]
time(us): total=69,944 max=27 min=21 avg=21
72: kernel launched 3182 times
grid: [4756] block: [32]
time(us): total=458,477 max=155 min=143 avg=144
78: kernel launched 3182 times
grid: [1] block: [256]
time(us): total=50,967 max=27 min=15 avg=16
80: kernel launched 3182 times
grid: [4756] block: [32]
time(us): total=454,763 max=149 min=142 avg=142
86: kernel launched 3182 times
grid: [1] block: [256]
time(us): total=50,964 max=21 min=15 avg=16
/home/bo/test1.c
main
49: region entered 1 time
time(us): total=4,671,945 init=3,244,745 region=1,427,200
w/o init: total=1,427,200 max=1,427,200 min=1,427,200 avg=1,427,200
|
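To put the two profiles in this thread side by side, here is a small sketch that splits each line-49 "region entered 1 time" entry into initialization and compute time. The struct and function names are invented for this illustration; the microsecond values are copied from the two profile dumps.

```c
#include <stdio.h>

/* One "region entered 1 time" profile entry, all values in microseconds. */
struct region_time {
    const char *label;
    long total_us;
    long init_us;
    long region_us;
};

/* Values copied from the line-49 entries of the two profiles above. */
static const struct region_time mat = {"mkcolg",  1749932,  103785, 1646147};
static const struct region_time bo  = {"bo_fang", 4671945, 3244745, 1427200};

/* Share of the total wall time spent in device initialization, in percent. */
double init_share_percent(struct region_time r)
{
    return 100.0 * (double)r.init_us / (double)r.total_us;
}

/* Print one line of the comparison in seconds. */
void print_breakdown(struct region_time r)
{
    printf("%-8s total=%.3f s  init=%.3f s (%.1f%%)  region=%.3f s\n",
           r.label, r.total_us / 1e6, r.init_us / 1e6,
           init_share_percent(r), r.region_us / 1e6);
}
```

Calling `print_breakdown(mat)` and `print_breakdown(bo)` from a `main` makes the pattern visible: initialization is roughly 69% of the second run against about 6% of the first, while the compute region itself is slightly shorter on the second machine. The remaining gap is therefore start-up cost, not kernel performance.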
|
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Fri Dec 21, 2012 4:07 pm Post subject: |
|
|
| Quote: | | My initialization time is still 3s. Could it be because I have three GPUs on my machine and I am only using one of them? |
Yes, the init time is ~1 second per device, even if you're only using one of them. However, pgcudainit should hold them all open.
You might want to investigate using NVIDIA's SMI utility (nvidia-smi): https://developer.nvidia.com/nvidia-system-management-interface
- Mat |
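If the goal is only to keep the start-up cost out of the measured time, one option is to initialize the device before starting the stopwatch. The sketch below is an assumption about how that could look, not something posted in this thread: `acc_init` and `acc_device_nvidia` come from the OpenACC runtime API in openacc.h, while `now_ms`, `warm_up_device`, `time_without_init` and `demo_work` are hypothetical helper names. It does not make initialization faster; it only moves it outside the timed span.

```c
#include <stdio.h>
#include <sys/time.h>
#ifdef _OPENACC
#include <openacc.h>   /* only available when compiling with OpenACC enabled */
#endif

/* Wall-clock time in milliseconds, same approach as stopwatch_start(). */
double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

/* Pay the one-time device start-up cost now. Returns 1 if the OpenACC
 * runtime was initialized, 0 if built without OpenACC (then a no-op). */
int warm_up_device(void)
{
#ifdef _OPENACC
    acc_init(acc_device_nvidia);
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical stand-in for the accelerated loops being measured. */
void demo_work(void)
{
    volatile double s = 0.0;
    for (int i = 0; i < 1000; i++)
        s += i;
}

/* Time only `work`; device start-up happens before the clock starts. */
double time_without_init(void (*work)(void))
{
    warm_up_device();
    double t0 = now_ms();
    work();
    return now_ms() - t0;
}
```

In interesting.c the equivalent change would be a single `acc_init` call placed before `stopwatch_start(&timer)`. The `_OPENACC` guard lets the same file still build with a compiler that has no OpenACC support.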
|