PGI User Forum

CUDA-x86.

cuModuleGetGlobal error
mkcolg



Joined: 30 Jun 2004
Posts: 6146
Location: The Portland Group Inc.

Posted: Fri Dec 21, 2012 2:48 pm

Quote:
I am wondering how pgcc optimizes the code so that it gets such good performance?


While my gcc time isn't as bad as yours, yes, PGI is much faster, even without optimization. With "-Minfo" you can see some of the optimizations performed. Also, I believe our log function is faster.
Code:
% gcc interesting.c -O0 -std=c99 -lm ; a.out
Result is 2185160991.755704,  time is 9168.019775
% gcc interesting.c -O3 -ftree-vectorize -ffast-math -std=c99 -lm ; a.out
Result is 2185160991.755704,  time is 6343.856934
% pgcc interesting.c -Minfo -O0 -Msafeptr ; a.out
Result is 2185160991.755704,  time is 7534.363037
% pgcc interesting.c -Minfo -fast -Msafeptr ; a.out
main:
     55, Loop not vectorized: may not be beneficial
         Unrolled inner loop 8 times
         Residual loop unrolled 4 times (completely unrolled)
     67, Memory set idiom, loop replaced by call to __c_mset8
     75, Generated vector sse code for the loop
         Generated a prefetch instruction for the loop
     83, Generated vector sse code for the loop
         Generated a prefetch instruction for the loop
Result is 2185160991.755704,  time is 4462.258057


Quote:
Is there anything I can do to minimize that time?

Hoist the data region above the outer loop and put the data init routines on the accelerator:

Code:
% cat interesting.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include <time.h>
#define T 4756

void stopwatch_start(double* stopwatch) {
  struct timeval tval;
  gettimeofday(&tval, NULL);
  *stopwatch = (tval.tv_sec * 1000 + tval.tv_usec/1000.0);
}

/**
 * Returns the elapsed time since the stopwatch was started via stopwatch_start.
 * @param[in] stopwatch the stopwatch handle
 * @return elapsed time in milliseconds
 */
double stopwatch_elapsed(double* stopwatch) {
  struct timeval tval;
  gettimeofday(&tval, NULL);
  return ((tval.tv_sec * 1000 + tval.tv_usec/1000.0) - *stopwatch);
}


double fRand(double fMin, double fMax)
{
  double f = (double)rand() / RAND_MAX;
  return fMin + f * (fMax - fMin);
}

int main(void){
  unsigned int len1 = 3182;
  int *temp1=(int *)calloc(T,sizeof(int));
  double *temp2 = (double *)calloc(T*32, sizeof(double));
  /*  for (int i = 0; i < T; i++){
 
      temp2[i] = (double *) calloc(32, sizeof(double));
      }
  */
  double a = 1.0f;
  double b = 2.0f;
  double c = 3.0f;
  double reduction_1 = 0;

  double timer;
  stopwatch_start(&timer);
  // high level loop
#pragma acc data create(temp1[0:T],temp2[0:T*32])
  {
    for(int r_est = 0; r_est < len1; r_est ++){

#pragma acc kernels
      {
        for(int i = 0; i < T; i++)
        {
          if(i%2 == 0)
            temp1[i] = 0;
          else
            temp1[i] = 1;
          // printf("Temp1  is %d\n",temp1[i]);
        }
#pragma acc loop independent
        for(int i = 0; i < T; i++)
        {
#pragma acc loop independent
          for(int j = 0; j < 32; j++)
            temp2[i*32+j] = 1.5f;
        }

        // printf("enter %d times\n",r_est);
        for(int n = 0; n < T; n++){
          double tem = 0;
          if (temp1[n] == 1){
            for(int f = 0; f < 32; f++)
              tem += (a+b)*log(c+a*temp2[n*32+f]);
          }
          reduction_1 += tem;
        }
        for(int n = 0; n < T; n++){
          double tem = 0;
          if (temp1[n] == 0){
            for(int f = 0; f < 32; f++)
              tem += (a+b)*log(c+a*temp2[n*32+f]);
          }
          reduction_1 += tem;
        }
   
      }
    }
  }
  printf("Result is %f,  time is %f\n",reduction_1,stopwatch_elapsed(&timer));
  return 0;
}
% pgcc interesting.c -Minfo=accel -fast -acc -ta=nvidia,4.2 -Msafeptr ; a.out
main:
     49, Generating create(temp2[0:152192])
         Generating create(temp1[0:4756])
     53, Generating present_or_create(temp2[0:152192])
         Generating present_or_create(temp1[0:4756])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     55, Loop is parallelizable
         Accelerator kernel generated
         55, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 1.3 : 8 registers; 32 shared, 20 constant, 0 local memory bytes
             CC 2.0 : 12 registers; 0 shared, 56 constant, 0 local memory bytes
     64, Loop is parallelizable
     67, Loop is parallelizable
         Accelerator kernel generated
         64, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         67, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
             CC 1.3 : 11 registers; 32 shared, 20 constant, 0 local memory bytes
             CC 2.0 : 14 registers; 0 shared, 56 constant, 0 local memory bytes
     72, Loop is parallelizable
         Accelerator kernel generated
         72, #pragma acc loop gang /* blockIdx.x */
             CC 1.3 : 31 registers; 96 shared, 60 constant, 0 local memory bytes
             CC 2.0 : 28 registers; 32 shared, 112 constant, 0 local memory bytes
         75, #pragma acc loop vector(32) /* threadIdx.x */
         78, Sum reduction generated for reduction_1
     75, Loop is parallelizable
     80, Loop is parallelizable
         Accelerator kernel generated
         80, #pragma acc loop gang /* blockIdx.x */
             CC 1.3 : 31 registers; 96 shared, 60 constant, 0 local memory bytes
             CC 2.0 : 28 registers; 32 shared, 112 constant, 0 local memory bytes
         83, #pragma acc loop vector(32) /* threadIdx.x */
         86, Sum reduction generated for reduction_1
     83, Loop is parallelizable
Result is 2185160991.286413,  time is 1749.938965

Accelerator Kernel Timing data
interesting.c
  main
    53: region entered 3182 times
        time(us): total=1,645,141 init=183 region=1,644,958
                  kernels=976,429
        w/o init: total=1,644,958 max=189,133 min=440 avg=516
        55: kernel launched 3182 times
            grid: [38]  block: [128]
            time(us): total=39,806 max=197 min=11 avg=12
        67: kernel launched 3182 times
            grid: [1x595]  block: [32x8]
            time(us): total=64,299 max=38 min=18 avg=20
        72: kernel launched 3182 times
            grid: [4756]  block: [32]
            time(us): total=373,710 max=444 min=114 avg=117
        78: kernel launched 3182 times
            grid: [1]  block: [256]
            time(us): total=63,636 max=42 min=14 avg=19
        80: kernel launched 3182 times
            grid: [4756]  block: [32]
            time(us): total=371,370 max=304 min=113 avg=116
        86: kernel launched 3182 times
            grid: [1]  block: [256]
            time(us): total=63,608 max=42 min=18 avg=19
interesting.c
  main
    49: region entered 1 time
        time(us): total=1,749,932 init=103,785 region=1,646,147
        w/o init: total=1,646,147 max=1,646,147 min=1,646,147 avg=1,646,147
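
A rough way to read the profile above, as a sketch rather than anything the profiler itself does: add up the per-kernel totals and subtract them from the region time (without init) to see how much of the region is spent outside kernels. The function names `sum_us` and `non_kernel_us` are hypothetical helpers made up for this illustration; the numbers you would feed in are the `total=` values printed for region 53.

```c
#include <stdio.h>

/* Sum an array of per-kernel totals, in microseconds. */
long sum_us(const long *t, int n) {
  long s = 0;
  for (int i = 0; i < n; i++)
    s += t[i];
  return s;
}

/* Time spent inside the region but outside any kernel, in microseconds. */
long non_kernel_us(long region_us, const long *kernels, int n) {
  return region_us - sum_us(kernels, n);
}

/* Print a short breakdown for one profiled region (hypothetical helper). */
void report_region(long region_us, long entries, const long *kernels, int n) {
  long k = sum_us(kernels, n);
  long rest = region_us - k;
  printf("kernels: %ld us, non-kernel: %ld us (%.1f us per entry)\n",
         k, rest, (double)rest / (double)entries);
}
```

For the run above, the six kernel totals (39,806, 64,299, 373,710, 63,636, 371,370 and 63,608) add up to the reported kernels=976,429, so about 668,529 us of the 1,644,958 us region time falls outside the kernels. That is roughly 210 us per region entry, presumably launch and synchronization overhead, though the profile does not break it down further.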
bo_fang



Joined: 21 Dec 2012
Posts: 2

Posted: Fri Dec 21, 2012 3:30 pm

Hi Mat,

I modified the code as you suggested, and my gcc version is still slow :). Perhaps it has something to do with the CPU that I am using.

Anyway, for GPU+OpenACC, my new timing info is below. My initialization time is still 3s. Could it be because I have three GPUs in my machine and am only using one of them? I couldn't find a suitable command to disable the other two GPUs or to keep them from being initialized.

Code:


[bo@node240 ~]$ ./a.out
Result is 2185160991.286413,  time is 4671.968018

Accelerator Kernel Timing data
/home/bo/test1.c
  main
    54: region entered 3182 times
        time(us): total=1,425,879 init=248 region=1,425,631
                  kernels=1,099,710
        w/o init: total=1,425,631 max=1,793 min=444 avg=448
        57: kernel launched 3182 times
            grid: [38]  block: [128]
            time(us): total=14,595 max=17 min=4 avg=4
        68: kernel launched 3182 times
            grid: [1x595]  block: [32x8]
            time(us): total=69,944 max=27 min=21 avg=21
        72: kernel launched 3182 times
            grid: [4756]  block: [32]
            time(us): total=458,477 max=155 min=143 avg=144
        78: kernel launched 3182 times
            grid: [1]  block: [256]
            time(us): total=50,967 max=27 min=15 avg=16
        80: kernel launched 3182 times
            grid: [4756]  block: [32]
            time(us): total=454,763 max=149 min=142 avg=142
        86: kernel launched 3182 times
            grid: [1]  block: [256]
            time(us): total=50,964 max=21 min=15 avg=16
/home/bo/test1.c
  main
    49: region entered 1 time
        time(us): total=4,671,945 init=3,244,745 region=1,427,200
        w/o init: total=1,427,200 max=1,427,200 min=1,427,200 avg=1,427,200
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6146
Location: The Portland Group Inc.

Posted: Fri Dec 21, 2012 4:07 pm

Quote:
My initialization time is still 3s. Could it be because I have three GPUs in my machine and am only using one of them?

Yes, the init time is ~1 second per device, even if you're only using one of them. Running pgcudainit should hold them all open, though.

You might want to investigate using NVIDIA's SMI utility (nvidia-smi): https://developer.nvidia.com/nvidia-system-management-interface

- Mat
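
Since the init cost is paid once per process, one option when benchmarking is to trigger it explicitly before starting the stopwatch, so it is reported separately from the compute time. The following is a minimal sketch, assuming the OpenACC runtime routine `acc_init` and the `acc_device_default` device type from `openacc.h`; `now_ms` and `warm_up_device_ms` are hypothetical helper names. It does not make initialization faster, it only moves it out of the timed section.

```c
#include <stddef.h>
#include <sys/time.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Wall-clock time in milliseconds, same approach as the stopwatch in the thread. */
double now_ms(void) {
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

/* Initialize the accelerator up front and return how long that took (ms).
   When compiled without OpenACC support this does nothing and returns ~0. */
double warm_up_device_ms(void) {
  double t0 = now_ms();
#ifdef _OPENACC
  acc_init(acc_device_default);
#endif
  return now_ms() - t0;
}
```

Calling `warm_up_device_ms()` at the top of `main`, before `stopwatch_start(&timer)`, should leave the data region and kernels as the only things inside the timed section.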