PGI User Forum

Calculation fails when using long long with OpenACC kernels

Feng Chen



Joined: 16 May 2014
Posts: 10

Posted: Wed May 28, 2014 4:49 pm    Post subject: Calculation fails when using long long with OpenACC kernels

Hi, All:

I am testing a very simple Pi calculation with OpenACC acceleration, but it gives erroneous results when I use a very large iteration count stored in a long long:

The code pi_acc.orig.c is listed below:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char** argv) {
    long long int i, n=10000000000; //10^10
    if (argc>1) n=atoi(argv[1]);
    double start_time, end_time;
    double x, pi;
    double sum = 0.0;
    double step = 1.0/(double) n;
    printf("step = %17.15f\n",step);

#pragma acc kernels
    for (i = 0; i < n; i++) {
        x = (i+0.5)*step;
        sum +=  4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("pi = %17.15f\n",pi);
    return 0;
}


With n=10000000000 (10^10), the program no longer gives the correct value; with a smaller value (e.g. 10^9), the pi value is fine.

Without the #pragma acc kernels, the serial version gives the correct result. The session below shows the -acc build first, then the serial build:

[fchen14@shelob006 c]$ pgcc -acc pi_acc.orig.c
[fchen14@shelob006 c]$ ./a.out
step = 0.000000000100000
pi = 0.560331986334500
[fchen14@shelob006 c]$ ./a.out 1000000000
step = 0.000000001000000
pi = 3.141592653589794
[fchen14@shelob006 c]$ pgcc pi_acc.orig.c
[fchen14@shelob006 c]$ ./a.out
step = 0.000000000100000
pi = 3.141592653589451

Hopefully I have made my problem clear. Could anyone tell me how to use a long long value with OpenACC?

Thanks a lot!

Feng
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Thu May 29, 2014 4:07 pm

Hi Feng,

Looks like the compiler isn't handling reductions with very large loop trip counts. I've added a problem report (TPR#20503) and sent it on to engineering.

The workaround is to split the loop into two nested reductions:

Code:
% cat testA.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <stdint.h>
#include <math.h>

int main(int argc, char** argv) {
     long i, j, n; //10^10
     if (argc>1) n=atoi(argv[1]);
     double start_time, end_time;
     double x, pi;
     double sum = 0.0;
     double sumA = 0.0;
     double step;
     n=10000000000; //10^10
   //n=1000000000; //10^9
     step = 1.0/(double) n;
     printf("step = %17.15f %ld\n",step,n);
     n = (long) sqrtf((float)n);
#pragma acc kernels loop gang reduction(+:sum)
     for (i = 0; i < n; i++) {
         sumA = 0.0;
#pragma acc loop vector reduction(+:sumA)
         for (j = 0; j < n; j++) {
             x = ((n*i)+j+0.5)*step;
             sumA +=  4.0/(1.0+x*x);
         }
         sum+=sumA;
     }
     pi = step * sum;
     printf("pi = %17.15f \n",pi);
     return 0;
}
% pgcc testA.c -fast -acc -Minfo=accel  ; a.out
main:
     20, Generating Tesla code
     21, Loop is parallelizable
         Accelerator kernel generated
         21, #pragma acc loop gang /* blockIdx.x */
             Sum reduction generated for sum
         24, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for sumA
         Loop is parallelizable
step = 0.000000000100000 10000000000
pi = 3.141592653589793


Thanks!
Mat
Feng Chen



Joined: 16 May 2014
Posts: 10

Posted: Sat May 31, 2014 2:32 pm

Hi Mat, thanks for providing the workaround. I use long long because I am trying to demonstrate the speedup OpenACC gives with a simple directive. For this example, however, the speedup is not apparent unless "n" is as large as 10^10.

Feng

mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Tue Jun 03, 2014 11:38 am

Hi Feng,

While not as good as with 10^10, I do see a nice speed-up with 10^9 as well. If you're on Linux, you might want to run the "pgcudainit" utility in the background. The OS powers down your device when it is not in use, and powering it back up costs about 1-2 seconds per device. "pgcudainit" holds the device open so you don't incur the start-up cost. For longer-running applications the start-up penalty doesn't matter, but it can have an impact in small examples like these.

- Mat


Code:
% pgcudainit &
[1] 21130
 pgcudainit called cuInit, now waiting for input
% time gpu.out
step = 0.000000000100000 10000000000
pi = 3.141592653589793
0.247u 0.347s 0:00.62 93.5%     0+0k 0+0io 0pf+0w
% time cpu.out
step = 0.000000000100000 10000000000
pi = 3.141592653589754
28.907u 0.004s 0:28.99 99.6%    0+0k 0+0io 0pf+0w

% time gpu9.out
step = 0.000000001000000 1000000000
pi = 3.141494419177372
0.023u 0.219s 0:00.26 88.4%     0+0k 0+0io 0pf+0w
% time cpu9.out
step = 0.000000001000000 1000000000
pi = 3.141494419177361
2.900u 0.001s 0:02.91 99.6%     0+0k 0+0io 0pf+0w


Without pgcudainit:
Code:
% time gpu9.out
step = 0.000000001000000 1000000000
pi = 3.141494419177372
0.010u 0.575s 0:02.78 20.8%     0+0k 0+0io 0pf+0w
jtull



Joined: 30 Jun 2004
Posts: 445

Posted: Fri Jul 25, 2014 4:09 pm    Post subject: TPR 20503 is fixed in 14.7

TPR 20503 (OpenACC: reduction gives bad answers when summing loop with very large trip count) is fixed in the current 14.7 release.
All times are GMT - 7 Hours