PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

Sharing device data with subroutines and Fortran !$acc direc
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
Alistair Hart



Joined: 06 Jul 2010
Posts: 21
Location: Cray Exascale Research Initiative, Edinburgh

PostPosted: Fri Jul 16, 2010 9:19 am    Post subject: Sharing device data with subroutines and Fortran !$acc direc Reply with quote

Hi (again),

Is there any way using !$acc directives to share data already located on the GPU with Fortran subroutines?

When I compile the test code below with v10.6.0 of the pgf90 compiler, the (edited) output reads:
Code:

test_data_spans_subs:
     32, Generating copyin(u(:))
sub1:
     47, Generating copyin(u(1:100))
sub2:
     63, Generating copyin(u(1:100))

Line 32 is the data region in the main program. I would like to eliminate the repeat data copying in the subroutines (lines 47 and 63).

I don't mind promoting u into MODULE my_data (and not passing it) but I couldn't find how to make that work either.

Thanks for any advice you can offer,

Alistair.
Code:

!!$**************************************       
MODULE my_data
   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 100
END MODULE my_data
!!$**************************************                                                 
PROGRAM test_data_spans_subs

   USE my_data
   IMPLICIT NONE

   INTEGER :: u(N)
   INTEGER :: i,r

   DO i = 1,N
    u(i) = i
   ENDDO

!!$ Attempt 1: inlined                                                                     
!$acc data region copyin(u)                                                               
!$acc region                                                                               
   r = SUM(u)
!$acc end region                                                                           
   PRINT *,"Total 1: ",r
!$acc region                                                                               
   r = SUM(u**2)
!$acc end region                                                                           
   PRINT *,"Total 2: ",r
!$acc end data region                                                                     

!!$ Attempt 2: subroutines                                                                 
!$acc data region copyin(u)                                                               
   CALL sub1(u)
   CALL sub2(u)
!$acc end data region                                                                     

END PROGRAM test_data_spans_subs
!!$**************************************                                                 
SUBROUTINE sub1(u)

   USE my_data
   IMPLICIT NONE
   INTEGER, INTENT(in) :: u(N)

   INTEGER :: r

!$acc region                                                                               
   r = SUM(u)
!$acc end region                                                                           

   PRINT *,"Total 1: ",r

END SUBROUTINE sub1
!!$**************************************                                                 
SUBROUTINE sub2(u)

   USE my_data
   IMPLICIT NONE
   INTEGER, INTENT(in) :: u(N)

   INTEGER :: r

!$acc region                                                                               
   r = SUM(u**2)
!$acc end region                                                                           

   PRINT *,"Total 2: ",r

END SUBROUTINE sub2
!!$**************************************                                                 
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

PostPosted: Fri Jul 16, 2010 10:01 am    Post subject: Reply with quote

Hi Alistair,

What you need is the 'reflected' clause. (See section 2.6.6. of http://www.pgroup.com/lit/whitepapers/pgi_accel_prog_model_1.2.pdf). Unfortunately, implementation of this feature was more complex than originally thought and has been delayed until 10.8/10.9 time frame.

In the meantime, you can use CUDA Fortran device pointers to pass the device data to routines. The caveat is that mixing CUDA Fortran and the PGI Accelerator is only supported on Linux and should be considered experimental.

Example:

Code:
% cat test.f90

#ifdef _ACCEL
#define CUDA_DEVICE ,device
#else
#define CUDA_DEVICE
#endif

!!$**************************************
MODULE my_data
   IMPLICIT NONE
   INTEGER, PARAMETER :: N = 100
END MODULE my_data
!!$**************************************
PROGRAM test_data_spans_subs

   USE my_data
   IMPLICIT NONE

   INTEGER CUDA_DEVICE :: u(N)
   INTEGER :: i,r

!$acc region
   DO i = 1,N
    u(i) = i
   ENDDO
!$acc end region

!!$ Attempt 1: inlined
!$acc region
   r = SUM(u)
!$acc end region
   PRINT *,"Total 1: ",r
!$acc region
   r = SUM(u**2)
!$acc end region
   PRINT *,"Total 2: ",r

!!$ Attempt 2: subroutines
   CALL sub1(u)
   CALL sub2(u)

END PROGRAM test_data_spans_subs
!!$**************************************
SUBROUTINE sub1(u)

   USE my_data
   IMPLICIT NONE
   INTEGER CUDA_DEVICE, INTENT(in) :: u(N)

   INTEGER :: r

!$acc region
   r = SUM(u)
!$acc end region

   PRINT *,"Total 1: ",r

END SUBROUTINE sub1
!!$**************************************
SUBROUTINE sub2(u)

   USE my_data
   IMPLICIT NONE
   INTEGER CUDA_DEVICE, INTENT(in) :: u(N)

   INTEGER :: r

!$acc region
   r = SUM(u**2)
!$acc end region

   PRINT *,"Total 2: ",r

END SUBROUTINE sub2
!!$**************************************

% pgf90 -ta=nvidia -Mcuda test.f90 -Minfo=accel -Mpreprocess -o test_gpu.out
test_data_spans_subs:
     22, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     23, Loop is parallelizable
         Accelerator kernel generated
         23, !$acc do parallel, vector(96)
             CC 1.0 : 4 registers; 20 shared, 20 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 4 registers; 20 shared, 20 constant, 0 local memory bytes; 75% occupancy
     29, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     30, Loop is parallelizable
         Accelerator kernel generated
         30, !$acc do parallel, vector(96)
             Sum reduction generated for u$r
             CC 1.0 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 75% occupancy
     33, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     34, Loop is parallelizable
         Accelerator kernel generated
         34, !$acc do parallel, vector(96)
             Sum reduction generated for u$r1
             CC 1.0 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 75% occupancy
sub1:
     52, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     53, Loop is parallelizable
         Accelerator kernel generated
         53, !$acc do parallel, vector(96)
             Sum reduction generated for u$r
             CC 1.0 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 75% occupancy
sub2:
     68, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     69, Loop is parallelizable
         Accelerator kernel generated
         69, !$acc do parallel, vector(96)
             Sum reduction generated for u$r
             CC 1.0 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 5 registers; 404 shared, 40 constant, 0 local memory bytes; 75% occupancy
% pgf90 test.f90 -Mpreprocess -o test_cpu.out
% test_gpu.out
 Total 1:          5050
 Total 2:        338350
 Total 1:          5050
 Total 2:        338350
% test_cpu.out
 Total 1:          5050
 Total 2:        338350
 Total 1:          5050
 Total 2:        338350   


Hope this helps,
Mat
Back to top
View user's profile
Alistair Hart



Joined: 06 Jul 2010
Posts: 21
Location: Cray Exascale Research Initiative, Edinburgh

PostPosted: Mon Jul 19, 2010 4:11 am    Post subject: Reply with quote

Hi Mat,

Thanks for the reply: as usual, quick and informative.

Yes, this code works. As you say, this mixing of CUDA Fortran and directives is experimental. Even so, I'll report the following bug.

The code below sums an array slice.

-DCase1: Explicit copyin/out of u for each region. Works.
-DCase2: Data region with local u spanning all regions. Works.
-DCase3: DEVICE array u. Does not work.

If I set ns=1, the code appears to work for all three cases.

I get the same result for r1 and r2 as the compiler recognises the reduction for r1.

The compiler was v10.6.0 and the other compilation options were:

pgf90 -ta=nvidia:cc20,time -Mcuda -Minfo=accel

I'm not doing anything wrong, am I?

Cheers,


Alistair
Code:

#if defined(Case3) && defined(_ACCEL)
#define CUDA_DEVICE ,device
#else
#define CUDA_DEVICE
#endif

PROGRAM test_sum_reduction

   IMPLICIT NONE

   INTEGER, PARAMETER :: N = 100

   INTEGER CUDA_DEVICE :: u(N)
   INTEGER :: i,r1,r2

   INTEGER :: ns,no,nf

   ns = 25
   no = 50
   nf = ns + no - 1

#ifdef Case2
!$acc data region local(u)                                                     
#endif

!$acc region                                                                   
   DO i = 1,N
    u(i) = i
   ENDDO
!$acc end region                                                               

!$acc region                                                                   
   r1 = 0
   DO i = ns,nf
    r1 = r1 + u(i)
   ENDDO
!$acc end region                                                               

!$acc region                                                                   
   r2 = SUM(u(ns:nf))
!$acc end region                                                               

#ifdef Case2
!$acc end data region                                                           
#endif

   PRINT *,"Explicit sum:   ",r1
   PRINT *,"SUM intrinsic:  ",r2
   PRINT *,"Correct answer: ",nf*(nf+1)/2 - (ns-1)*(ns-1+1)/2

END PROGRAM test_sum_reduction

Back to top
View user's profile
Alistair Hart



Joined: 06 Jul 2010
Posts: 21
Location: Cray Exascale Research Initiative, Edinburgh

PostPosted: Mon Jul 19, 2010 6:43 am    Post subject: Reply with quote

Update to message above:

Removing the "!$acc region" statements from around the reductions for r1 and r2 seemed to make it work for Case3.

Is it tautology to make an array "DEVICE" and then address it using a "!$acc region"?
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

PostPosted: Mon Jul 19, 2010 3:43 pm    Post subject: Reply with quote

Hi Alistair,

It appears to me that the compiler is starting the summation of CUDA Fortran device "u" at index 1 instead of "ns". This is because the device pointer is being passed instead of the section. For the accelerator version, only a section of the array (starting at ns) is being passed to the reduction.

I'll note it,but I don't think it will be something we'll fix anytime soon. Hence, please consider this a limitation that when using CUDA Fortran device arrays in reductions, you can only sections that begin at index 1.

Quote:
Is it tautology to make an array "DEVICE" and then address it using a "!$acc region"?
No. This works because there is an implicit copy from the device into a temporary host array and the summation is done on the host.

It's the same as:

u_host = u
r2 = SUM(u_host(ns:nf))

Hope this helps,
Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group