PGI User Forum

Managing Shared Memory

 
Karthee



Joined: 08 Jul 2010
Posts: 4

Posted: Sun Jul 18, 2010 10:29 am    Post subject: Managing Shared Memory

Is it possible to manage shared memory? There is a cache clause that can be used as a hint to the compiler. I have the following acc region. The kreal and kimag values for a particular point are stored in separate arrays (ktempreal, ktempimag) and reused in all nterms iterations. The idea is to move these arrays to shared memory.

Code:
!$acc region
!$acc do kernel private (ktempreal, ktempimag) cache(ktempreal,ktempimag)
        DO p = 1,npoints
          ! store k values in ktemp array
          DO r = 1,order+1
            DO mu = 0,NDIR-1
              ktempreal(mu,r) = kreal(mu,r,p)
              ktempimag(mu,r) = kimag(mu,r,p)
            END DO
          END DO
          ! use the ktemp values for all nterms iterations
          DO i = 1,nterms
            phase3real = 0
            phase3imag = 0
            DO r = 1,order+1
              DO mu = 0,NDIR-1
                phase3real = phase3real - ktempimag(mu,r) * yxv(mu,r,i)
                phase3imag = phase3imag + ktempreal(mu,r) * yxv(mu,r,i)
              END DO
            END DO
            vtxgpureal(p) = vtxgpureal(p) + phase3real
            vtxgpuimag(p) = vtxgpuimag(p) + phase3imag
          END DO
        END DO
!$acc end region


The compiler-generated messages are as follows:
Code:
60, Generating copyin(kimag(0:ndir-1,1:order+1,1:npoints))
         Generating copyin(kreal(0:ndir-1,1:order+1,1:npoints))
         Generating copy(vtxgpuimag(0,0,0,1:npoints))
         Generating copy(vtxgpureal(0,0,0,1:npoints))
         Generating copyin(yxv(0:ndir-1,1:order+1,1:nterms))
         Generating compute capability 2.0 binary
     62, Loop is parallelizable
         Accelerator kernel generated
         62, !$acc do parallel, vector(32)
             Non-stride-1 accesses for array 'vtxgpuimag'
             Non-stride-1 accesses for array 'vtxgpureal'
             CC 2.0 : 30 registers; 4 shared, 152 constant, 0 local memory bytes; 16 occupancy
     63, Loop is parallelizable
     64, Loop is parallelizable
     69, Complex loop carried dependence of 'vtxgpureal' prevents parallelization
         Loop carried dependence of 'vtxgpureal' prevents parallelization
         Loop carried backward dependence of 'vtxgpureal' prevents vectorization
         Complex loop carried dependence of 'vtxgpuimag' prevents parallelization
         Loop carried dependence of 'vtxgpuimag' prevents parallelization
         Loop carried backward dependence of 'vtxgpuimag' prevents vectorization
     72, Loop is parallelizable
     73, Loop is parallelizable


Please advise.
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Mon Jul 19, 2010 12:19 pm

Hi Karthee,

Quote:
Is it possible to manage shared memory?

For this particular code, the ktempreal and ktempimag variables are local (private) arrays. Hence they would be placed in registers unless they are too large or there are too many other local variables, in which case the registers may 'spill' into local memory, which resides in the much slower device (global) memory. The "0 local memory bytes" in your compiler output suggests that nothing spilled here. Register allocation is performed by the back-end NVIDIA compiler, which the user has no control over except to limit the maximum number of registers used ("-ta=nvidia,maxregcount:<n>").

"Shared" memory is a small, fast on-chip memory on each multiprocessor that acts as a software-managed cache for data from global memory and can be accessed by all threads within a single thread block. CUDA Fortran requires programmers to manage shared memory themselves. In the PGI Accelerator model, the compiler manages shared memory for you.

Quote:
The idea here is to move this array to shared memory.

Provided that ktempreal and ktempimag are stored in registers (i.e. not spilled), you already get essentially the same benefit as shared memory, since registers and shared memory both reside on the multiprocessor itself. The difference is that registers are private to each thread, which is what you want for a private array.

Another possibility is to have the compiler put 'kreal' and 'kimag' into shared memory. In other words, remove the temp arrays and access kreal and kimag directly. However, you'll need to move the parallel dimension (p) to the first dimension, so that adjacent threads access adjacent elements and memory access is contiguous.
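
A minimal sketch of that restructuring, under some assumptions: the arrays are default REAL (the original declarations are not shown), and the names kreal_t, kimag_t, move_p_first, accumulate_vtx, sumreal and sumimag are hypothetical, introduced only for illustration. The reordered arrays hold the same data as kreal and kimag with p as the first dimension. Whether the compiler actually places anything in shared memory is still its decision, so the compiler messages should be checked after the change.

```fortran
module direct_access_sketch
  implicit none
contains

  ! Hypothetical helper: copy k(mu,r,p) into k_t(p,mu,r) so that the
  ! parallel index p becomes the first (fastest-varying) dimension.
  subroutine move_p_first(ndir, order, npoints, k, k_t)
    integer, intent(in) :: ndir, order, npoints
    real, intent(in)    :: k(0:ndir-1, order+1, npoints)
    real, intent(out)   :: k_t(npoints, 0:ndir-1, order+1)
    integer :: p, r, mu
    do r = 1, order+1
      do mu = 0, ndir-1
        do p = 1, npoints
          k_t(p,mu,r) = k(mu,r,p)
        end do
      end do
    end do
  end subroutine move_p_first

  ! Same arithmetic as the original region, but reading the reordered
  ! arrays directly instead of staging values in ktempreal/ktempimag.
  subroutine accumulate_vtx(ndir, order, npoints, nterms, &
                            kreal_t, kimag_t, yxv, vtxgpureal, vtxgpuimag)
    integer, intent(in) :: ndir, order, npoints, nterms
    real, intent(in)    :: kreal_t(npoints, 0:ndir-1, order+1)
    real, intent(in)    :: kimag_t(npoints, 0:ndir-1, order+1)
    real, intent(in)    :: yxv(0:ndir-1, order+1, nterms)
    real, intent(inout) :: vtxgpureal(npoints), vtxgpuimag(npoints)
    integer :: p, i, r, mu
    real :: sumreal, sumimag

!$acc region
    do p = 1, npoints
      ! Per-point accumulators
      sumreal = 0.0
      sumimag = 0.0
      do i = 1, nterms
        do r = 1, order+1
          do mu = 0, ndir-1
            sumreal = sumreal - kimag_t(p,mu,r) * yxv(mu,r,i)
            sumimag = sumimag + kreal_t(p,mu,r) * yxv(mu,r,i)
          end do
        end do
      end do
      ! Single update per point, outside the i loop
      vtxgpureal(p) = vtxgpureal(p) + sumreal
      vtxgpuimag(p) = vtxgpuimag(p) + sumimag
    end do
!$acc end region
  end subroutine accumulate_vtx

end module direct_access_sketch
```

Calling move_p_first once for kreal and once for kimag on the host, then passing the results to accumulate_vtx, reproduces the original sums up to floating-point rounding. Accumulating into scalars and updating vtxgpureal(p) and vtxgpuimag(p) once per point may also avoid the loop-carried dependence messages reported for those arrays, though that depends on the compiler.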

Hope this helps,
Mat
All times are GMT - 7 Hours