PGI User Forum

Problem accelerating nested arrays
 
Alistair Hart



Joined: 06 Jul 2010
Posts: 21
Location: Cray Exascale Research Initiative, Edinburgh

Posted: Tue Jul 06, 2010 7:53 am    Post subject: Problem accelerating nested arrays

Hi,

I'm new to PGI directives and may be doing something stupid.

The Fortran test code below populates a 3d array with the number 5 using nested loops. I accelerate it in three ways, placing the accelerator region around the inner loop (Case 1), the middle loop (Case 2) or the outer loop (Case 3).

The outer and inner cases work correctly. When I put "!$acc region" around the middle loop, however, I only seem to populate b(:,1,:) rather than b(:,:,:). What is wrong?

I am using: "pgf90 10.5-0 64-bit target on x86-64 Linux -tp nehalem-64"

Thanks for any help,

Alistair.

Code:
PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 10
  INTEGER :: b(N,N,N),i,j,k

!!$ CASE 1 ****************************************                           

  b(:,:,:) = 0

  DO k = 1,N
     DO j = 1,N
!$acc region                                                                   
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
!$acc end region                                                               
     ENDDO
  ENDDO

  PRINT '(/,"Case ",I1)',1
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

!!$ CASE 2 ****************************************                           

  b(:,:,:) = 0

  DO k = 1,N
!$acc region                                                                   
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
!$acc end region                                                               
  ENDDO

  PRINT '(/,"Case ",I1)',2
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

!!$ CASE 3 ****************************************                           

  b(:,:,:) = 0

!$acc region                                                                   
  DO k = 1,N
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
  ENDDO
!$acc end region                                                               

  PRINT '(/,"Case ",I1)',3
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

END PROGRAM test


The compiler report is:
Code:
 pgf90 test.F90 -ta=nvidia -Minfo=accel
test:
     17, Generating copyout(b(1:10,j,k))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     18, Loop is parallelizable
         Accelerator kernel generated
         18, !$acc do parallel, vector(10)
             CC 1.0 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     35, Generating copyout(b(1:10,1:10,k))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     36, Loop is parallelizable
     37, Loop is parallelizable
         Accelerator kernel generated
         36, !$acc do parallel, vector(10)
         37, !$acc do parallel, vector(10)
             CC 1.0 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
     53, Generating copyout(b(1:10,1:10,1:10))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     54, Loop is parallelizable
     55, Loop is parallelizable
     56, Loop is parallelizable
         Accelerator kernel generated
         54, !$acc do parallel, vector(4)
         55, !$acc do parallel, vector(4)
         56, !$acc do vector(10)
             CC 1.0 : 7 registers; 24 shared, 20 constant, 0 local memory bytes; 83 occupancy
             CC 1.3 : 7 registers; 24 shared, 20 constant, 0 local memory bytes; 93 occupancy

and the output is
Code:
./a.out

Case 1
            1            5            5
            2            5            5
            3            5            5
            4            5            5
            5            5            5
            6            5            5
            7            5            5
            8            5            5
            9            5            5
           10            5            5

Case 2
            1            5            0
            2            5            0
            3            5            0
            4            5            0
            5            5            0
            6            5            0
            7            5            0
            8            5            0
            9            5            0
           10            5            0

Case 3
            1            5            5
            2            5            5
            3            5            5
            4            5            5
            5            5            5
            6            5            5
            7            5            5
            8            5            5
            9            5            5
           10            5            5
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Wed Jul 07, 2010 9:02 am

Hi Alistair,

For the second case, add "copy(b)" to "!$acc region".
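
A minimal sketch of what that change looks like when applied to Case 2. It is illustrative only: the program name test_copy is made up, and the comment on the data movement is based on the -Minfo report above, which shows the compiler choosing copyout(b(1:10,1:10,k)) on its own. Because the directives are comments, the file also builds as plain Fortran without -ta=nvidia.

```fortran
PROGRAM test_copy

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 10
  INTEGER :: b(N,N,N),i,j,k

  b(:,:,:) = 0

  DO k = 1,N
! copy(b) asks for the whole array to be copied to the GPU on entry
! and back to the host on exit, instead of the per-k slice that the
! compiler picked by itself
!$acc region copy(b)
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
!$acc end region
  ENDDO

  PRINT '(/,"Case ",I1)',2
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO

END PROGRAM test_copy
```

The trade-off is that the full array now moves in both directions on every iteration of the k loop.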

Hope this helps,
Mat
Alistair Hart



Joined: 06 Jul 2010
Posts: 21
Location: Cray Exascale Research Initiative, Edinburgh

Posted: Wed Jul 14, 2010 4:10 am

mkcolg wrote:

For the second case, add "copy(b)" to "!$acc region".

Thanks. This works.

Will future versions of the compiler recognise the need for this clause automatically?

Cheers,

Alistair.
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Wed Jul 14, 2010 9:59 am

Hi Alistair,

Quote:
Will future versions of the compiler recognise the need for this clause automatically?

I have submitted a problem report (TPR#17096).

Note that the first and second cases would perform poorly. On every iteration of the enclosing loops, the b array has to be copied to and from the GPU. Copying data between the host and the GPU is very slow, so it should be avoided whenever possible.
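
To put rough numbers on this, here is a small host-only sketch (plain Fortran, no directives) that tabulates how many accelerator regions each placement enters and how many array elements each one copies back. The counts are taken from the copyout sub-arrays in the -Minfo report earlier in the thread. The program and variable names are hypothetical, and the sketch only counts transfers; it does not measure time.

```fortran
PROGRAM count_transfers

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 10
  INTEGER :: launches(3), elems(3), c

  ! Case 1: region inside the k and j loops, copyout(b(1:N,j,k))
  launches(1) = N*N
  elems(1)    = N

  ! Case 2: region inside the k loop, copyout(b(1:N,1:N,k))
  launches(2) = N
  elems(2)    = N*N

  ! Case 3: region around all three loops, copyout(b(1:N,1:N,1:N))
  launches(3) = 1
  elems(3)    = N*N*N

  DO c = 1,3
     PRINT '("Case ",I1,": ",I4," regions, ",I5," elements copied out per region")', &
          c, launches(c), elems(c)
  ENDDO

END PROGRAM count_transfers
```

With copy(b) added to Case 2, each of the N regions instead moves all N*N*N elements in both directions, which is the cost a data region is meant to remove.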

In scenarios where you do need to put an accelerator region within a loop, try to use the 'data region' directives to move the copies outside the loop. For example:
Code:
% cat b.f90
PROGRAM test

  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 10
  INTEGER :: b(N,N,N),i,j,k

!!$ CASE 2 ****************************************
  b(:,:,:) = 0

!$acc data region copyout(b)
  DO k = 1,N
!$acc region
     DO j = 1,N
        DO i = 1,N
           b(i,j,k) = 5
        ENDDO
     ENDDO
!$acc end region
  ENDDO
!$acc end data region

  PRINT '(/,"Case ",I1)',2
  DO i = 1,N
     PRINT *,i,b(i,1,1),b(i,2,1)
  ENDDO


END PROGRAM test
% pgf90 -ta=nvidia b.f90 -V10.6 -Minfo=accel -fast
test:
     10, Generating copyout(b(:,:,:))
     12, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     13, Loop is parallelizable
     14, Loop is parallelizable
         Accelerator kernel generated
         13, !$acc do parallel, vector(10)
         14, !$acc do parallel, vector(10)
             CC 1.0 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 6 registers; 24 shared, 28 constant, 0 local memory bytes; 100 occupancy
% a.out

Case 2
            1            5            5
            2            5            5
            3            5            5
            4            5            5
            5            5            5
            6            5            5
            7            5            5
            8            5            5
            9            5            5
           10            5            5


- Mat
Alistair Hart



Joined: 06 Jul 2010
Posts: 21
Location: Cray Exascale Research Initiative, Edinburgh

Posted: Thu Jul 15, 2010 1:27 am

Thanks for the prompt and informative reply.

I did find that it was most efficient to put the acc region around all the loops, but I was interested to see what was possible if I wanted to place it elsewhere.

Cheers,

Alistair.
All times are GMT - 7 Hours