PGI User Forum


OpenACC cache directive

 
i_alex2004



Joined: 18 Aug 2012
Posts: 8

Posted: Sat Oct 13, 2012 4:08 am    Post subject: OpenACC cache directive

Hello, I ran into a problem when trying to use the #pragma acc cache directive. It is supposed to cache the specified data in the GPU's shared memory, but the compiler log says my kernel doesn't use shared memory:
Code:

main:
     12, Generating copyout(B[0:N-1][0:M-1])
         Generating copyin(A[0:N-1][0:M-1])
     15, Generating present_or_copyin(A[0:N-1][0:M-1])
         Generating present_or_copyout(B[0:N-1][0:M-1])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     17, Loop is parallelizable
     20, Loop is parallelizable
         Accelerator kernel generated
         17, #pragma acc loop gang, vector(32) /* blockIdx.y threadIdx.y */
             Cached references to size [(y+2)x(x+2)] block of 'A'
         20, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
             CC 1.0 : 26 registers; 48 shared, 20 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 68 constant, 0 local memory bytes


Actually it should use at least 324 shared memory elements to cache a block of size [(16+2)x(16+2)].
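That is 18 x 18 = 324 elements, or 1296 bytes for 4-byte ints, if I have counted correctly.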

Here is the code:
Code:

#include <openacc.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
   int A[1000][1000];
   int B[1000][1000];
   int N=1000;//count of elements
   int M=1000;//count of elements
   
   #pragma acc data copyin (A[0:N-1][0:M-1]), copyout(B[0:N-1][0:M-1])
   {
   
   #pragma acc kernels loop independent vector(32)
   {
   for (int i=1;i<N-1;i++)
      {
      #pragma acc loop independent vector (32)
      for (int j=1; j<M-1;j++)
      {
         //#pragma acc cache (A[i-1:i+1][j-1:j+1])
         B[i][j]=0;         
         B[i][j]+=A[i-1][j];
         B[i][j]+=A[i-1][j-1];
         B[i][j]+=A[i-1][j+1];
         B[i][j]+=A[i][j];
         B[i][j]+=A[i][j-1];
         B[i][j]+=A[i][j+1];
         B[i][j]+=A[i+1][j];
         B[i][j]+=A[i+1][j-1];
         B[i][j]+=A[i+1][j+1];
         B[i][j]=B[i][j]/9;
      }
      }   
   }   
   }
}


Could you please tell me how to place data into GPU shared memory?

Yours sincerely,
Alex Ivakhnenko
mkcolg



Joined: 30 Jun 2004
Posts: 6125
Location: The Portland Group Inc.

Posted: Mon Oct 15, 2012 9:49 am

Hi Alex,

You have it correct. The ptxas information shown is only for static shared memory. The shared memory you're using is dynamically allocated at launch and is adjusted to match the thread block size.
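
To illustrate the mechanism, here is a rough hand-written CUDA sketch (not the code the compiler actually generates): the shared array is declared "extern __shared__" with no compile-time size, so ptxas counts 0 static shared bytes, and the actual size is passed as the third launch parameter to match the thread block:

Code:

// Hand-written sketch of dynamic shared memory (illustration only,
// not the PGI-generated kernel).
__global__ void stage_tile(const int *A, int *B, int width)
{
   extern __shared__ int tile[];         // no static size -> 0 shared bytes in ptxas
   int tx = threadIdx.x, ty = threadIdx.y;
   int gx = blockIdx.x * blockDim.x + tx;
   int gy = blockIdx.y * blockDim.y + ty;

   int pitch = blockDim.x + 2;           // row width leaves room for a one-element halo
   // each thread stages its own element; a real stencil would also load the halo
   tile[(ty + 1) * pitch + (tx + 1)] = A[gy * width + gx];
   __syncthreads();

   B[gy * width + gx] = tile[(ty + 1) * pitch + (tx + 1)];
}

int main(void)
{
   const int N = 1024;
   int *A, *B;
   cudaMalloc((void**)&A, N * N * sizeof(int));
   cudaMalloc((void**)&B, N * N * sizeof(int));

   dim3 block(32, 32), grid(N / 32, N / 32);
   // dynamic shared memory: a (32+2) x (32+2) tile of ints per thread block
   size_t shmem = (block.x + 2) * (block.y + 2) * sizeof(int);
   stage_tile<<<grid, block, shmem>>>(A, B, N);
   cudaDeviceSynchronize();

   cudaFree(A);
   cudaFree(B);
   return 0;
}

This is the kind of allocation the ptxas summary does not count, which is why the CC 2.0 build shows 0 shared bytes even though the cache directive was honored.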

Quote:
cache block size of [(16+2)x(16+2)].
In this case it's actually [(32+2)x(32+2)], since the generated kernel uses a 32x32 thread block (vector(32) in both dimensions).
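That works out to 34 x 34 = 1156 elements, or 4624 bytes of dynamically allocated shared memory per thread block for int data.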

Hope this helps,
Mat