PGI User Forum

Privatization of array
 
mkcolg
Location: The Portland Group Inc.
Posted: Tue Oct 06, 2009 7:53 am

Hi Viet,

The inner "n" loops should be executed sequentially, but just in case, you can use "#pragma acc for kernel" or "#pragma acc for seq" to force it. Be sure to compile with "-Minfo=accel" and watch the generated output. The compiler will tell you how the loops were scheduled.

You can also use "#pragma acc for parallel" and "#pragma acc for vector" to manually adjust the parallel schedule.

On a side note, compile with "-Msafeptr" or declare your arrays with the C99 restrict keyword. Otherwise the compiler won't be able to parallelize them.
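
For C code, the aliasing point can be sketched roughly as below. This is only an illustration: the function scale_add and its arguments are invented here, and the directives use the "#pragma acc region" / "#pragma acc for" spelling quoted elsewhere in this thread (a compiler that does not recognize them simply ignores them).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example, not taken from the thread.
   The restrict qualifiers promise the compiler that out and in never
   overlap; that is roughly the promise -Msafeptr makes for every
   pointer at once. */
void scale_add(int n, float *restrict out, const float *restrict in, float a)
{
    assert(out != NULL && in != NULL);
#pragma acc region
    {
#pragma acc for
        for (int i = 0; i < n; ++i) {
            /* Each iteration touches only element i, so without
               aliasing the iterations are independent. */
            out[i] = a * in[i] + out[i];
        }
    }
}
```

Without restrict (or -Msafeptr) the compiler has to assume that out and in might overlap, and that assumption is what blocks parallelization.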

- Mat

Alistair Hart
Location: Cray Exascale Research Initiative, Edinburgh
Posted: Wed Jul 07, 2010 5:18 am

TheMatt wrote:

Did you try privatizing the array using the pragma clauses? With Fortran it's an !$acc do clause, so I imagine it's a #pragma acc for one in C (I'm a Fortran programmer so caveat lector).

So you might try:
#pragma acc region
{
#pragma acc for private(arr1)
for(n=1;n<nn-1;++n){...

Essentially, you add that private clause on the line directly before the for-loop that it must apply to.

Can "!$acc do private" be used for nested loops, e.g. for "u" as below?
Code:

PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 4
  INTEGER :: b(N,N,N),i,j,k,u(N)

  b(:,:,:) = 0

!$acc region
  DO k = 1,N
     DO j = 1,N
        DO i = 1,N
           u(i) = i+j+k
        ENDDO
        DO i = 2,N-1
           b(i,j,k) = u(i-1) + u(i)
        ENDDO
     ENDDO
  ENDDO
!$acc end region

  PRINT '(4I6)',b

END PROGRAM test


I tried putting "!$acc do private(u)" above k and/or j loops but nothing worked (wrong answers or cuMemFree error).

I can, of course, add a j-index to u and a "do private" to k, but I would like to accelerate the original source version just using directives.

Thanks for your help,

Alistair.

P.S. I'm using v10.5 of the compiler, if it matters.
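
For comparison, here is how the same loop nest might look in C with the temporary privatized purely by scoping. It is a host-only sketch with no directives, and the name fill_b is made up. The point is only that declaring u inside the j loop gives every (j,k) iteration its own copy, which is the behaviour private(u) is being asked for here. Whether the accelerator compiler handles a loop-local array this way is not something this sketch shows.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N 4

/* Host-only C transliteration of the Fortran test program.
   Loop variables keep their 1-based Fortran values so the arithmetic
   matches; only the array subscripts are shifted to 0-based. */
void fill_b(int b[N][N][N])
{
    assert(b != NULL);
    memset(b, 0, sizeof(int) * N * N * N);   /* b(:,:,:) = 0 */

    for (int k = 1; k <= N; ++k) {
        for (int j = 1; j <= N; ++j) {
            int u[N];   /* declared here: private to this (j,k) pair */

            for (int i = 1; i <= N; ++i) {
                u[i - 1] = i + j + k;
            }
            for (int i = 2; i <= N - 1; ++i) {
                /* Fortran b(i,j,k) is stored at b[k-1][j-1][i-1] */
                b[k - 1][j - 1][i - 1] = u[i - 2] + u[i - 1];
            }
        }
    }
}
```

Because u never outlives one pass of the j loop, no (j,k) iteration can see values written by another one.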

mkcolg
Location: The Portland Group Inc.
Posted: Wed Jul 07, 2010 8:09 am

Hi Alistair,

It seems to work for me when I add "!$acc do private(u)" above the k loop. Try rebooting your system (or resetting your CUDA driver). The pinned memory that NVIDIA drivers use can enter an inconsistent state leading to this type of behavior.

- Mat

Code:
% cat test1.f90
PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 4
  INTEGER :: b(N,N,N),i,j,k,u(N)

  b(:,:,:) = 0

!$acc region
!$acc do private(u)
  DO k = 1,N
     DO j = 1,N
        DO i = 1,N
           u(i) = i+j+k
        ENDDO
        DO i = 2,N-1
           b(i,j,k) = u(i-1) + u(i)
        ENDDO
     ENDDO
  ENDDO
!$acc end region

  PRINT '(4I6)',b

END PROGRAM test

% pgf90 -ta=nvidia -Minfo=accel test1.f90 -V10.5
test:
     10, Generating copyout(b(2:3,1:4,1:4))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     12, Loop is parallelizable
     13, Loop carried reuse of 'u' prevents parallelization
     14, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc do parallel, vector(4)
         13, !$acc do seq
         14, !$acc do parallel, vector(4)
             Using register for 'u'
             CC 1.0 : 7 registers; 24 shared, 40 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 7 registers; 24 shared, 40 constant, 0 local memory bytes; 25 occupancy
     17, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc do parallel, vector(4)
         13, !$acc do seq
         17, !$acc do parallel, vector(2)
             Cached references to size [3] block of 'u'
             CC 1.0 : 8 registers; 40 shared, 44 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 8 registers; 40 shared, 44 constant, 0 local memory bytes; 25 occupancy
% a.out
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0

Alistair Hart
Location: Cray Exascale Research Initiative, Edinburgh
Posted: Wed Jul 14, 2010 4:36 am

Thanks for the reply.

mkcolg wrote:

It seems to work for me when I add "!$acc do private(u)" above the k loop.
<snip>
Code:

% a.out
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0

But the correct result (compiled without the -ta flag and run on the host) is
Code:

% pgf90 test1.F90
% ./a.out
     0     7     9     0
     0     9    11     0
     0    11    13     0
     0    13    15     0
     0     9    11     0
     0    11    13     0
     0    13    15     0
     0    15    17     0
     0    11    13     0
     0    13    15     0
     0    15    17     0
     0    17    19     0
     0    13    15     0
     0    15    17     0
     0    17    19     0
     0    19    21     0

The only way I could get agreement with the host code is to explicitly privatise u by making it a three-index array in the source code. I was hoping to do the acceleration without changing the Fortran.
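
A quick way to check any of these tables: since u(i) = i+j+k, the loop body reduces to b(i,j,k) = u(i-1) + u(i) = 2*(i+j+k) - 1 for i = 2..N-1, and 0 elsewhere. The C sketch below encodes that closed form and counts the entries of a printed table that disagree with it. The helper names expected_b and count_mismatches are hypothetical and not part of any PGI tooling.

```c
#include <assert.h>

#define N 4

/* Closed form of the test kernel for 1-based Fortran indices:
   2*(i+j+k) - 1 on the interior planes i = 2..N-1, and 0 on the
   i = 1 and i = N planes, which the loops never write. */
int expected_b(int i, int j, int k)
{
    assert(i >= 1 && i <= N && j >= 1 && j <= N && k >= 1 && k <= N);
    return (i >= 2 && i <= N - 1) ? 2 * (i + j + k) - 1 : 0;
}

/* Count entries of a printed table that disagree with the closed form.
   rows[r][c] is row r, column c of the PRINT '(4I6)' output, which
   walks b in Fortran order: i fastest, then j, then k. */
int count_mismatches(int rows[N * N][N])
{
    int bad = 0;
    for (int r = 0; r < N * N; ++r) {
        int j = r % N + 1;
        int k = r / N + 1;
        for (int c = 0; c < N; ++c) {
            if (rows[r][c] != expected_b(c + 1, j, k)) {
                ++bad;
            }
        }
    }
    return bad;
}
```

Fed the host table above it reports 0 mismatches, while the accelerator table in which every row reads "0 13 15 0" reports 24.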

mkcolg wrote:

Try rebooting your system (or resetting your CUDA driver). The
pinned memory that NVIDIA drivers use can enter an inconsistent
state leading to this type of behavior.

I don't like the sound of that - how can I diagnose this problem
other than my code not working? And rebooting a remote system you
don't own is non-trivial.

Cheers,

Alistair.

mkcolg
Location: The Portland Group Inc.
Posted: Wed Jul 14, 2010 10:16 am

Sorry about that, Alistair. Let me try again, this time adding the 'kernel' clause to the directive on the j loop.

Code:
% cat test1.f90
PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 4
  INTEGER :: b(N,N,N),i,j,k,u(N)

  b(:,:,:) = 0

!$acc region
  DO k = 1,N
!$acc do private(u), kernel
     DO j = 1,N
        DO i = 1,N
           u(i) = i+j+k
        ENDDO
        DO i = 2,N-1
           b(i,j,k) = u(i-1) + u(i)
        ENDDO
     ENDDO
  ENDDO
!$acc end region

  PRINT '(4I6)',b

END PROGRAM test

% pgf90 -ta=nvidia,time test1.f90 -fast -V10.6 -Minfo=accel
test:
     10, Generating copyout(b(2:3,1:4,1:4))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     11, Loop is parallelizable
     13, Loop is vectorizable
         Accelerator kernel generated
         11, !$acc do parallel, vector(4)
         13, !$acc do vector(4)
             CC 1.0 : 9 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 8 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     14, Loop is parallelizable
     17, Loop is parallelizable
% a.out
     0     7    15     0
     0     9    17     0
     0    11    19     0
     0    13    21     0
     0     9    15     0
     0    11    17     0
     0    13    19     0
     0    15    21     0
     0    11    15     0
     0    13    17     0
     0    15    19     0
     0    17    21     0
     0    13    15     0
     0    15    17     0
     0    17    19     0
     0    19    21     0

Accelerator Kernel Timing data
/tmp/mec/test1.f90
  test
    10: region entered 1 time
        time(us): total=92946 init=92536 region=410
                  kernels=21 data=24
        w/o init: total=410 max=410 min=410 avg=410
        13: kernel launched 1 times
            grid: [1]  block: [4x4]
            time(us): total=21 max=21 min=21 avg=21


Quote:

I don't like the sound of that - how can I diagnose this problem
other than my code not working? And rebooting a remote system you
don't own is non-trivial.

The NVIDIA driver has an issue where the pinned host memory it uses to perform data transfers can get corrupted if a CUDA program aborts abnormally (for example, if the user kills the program with Ctrl-C). NVIDIA is aware of the problem, but I don't know its status.

Whenever I start getting wrong answers or unexpected behavior from a program that had been working, my first step is to either reset the NVIDIA driver or reboot the system. I agree that it is not ideal, but it's the only workaround that I have found for this issue.

- Mat