PGI User Forum

Keeping data on GPU while looping and calling subroutines
ncharvin



Joined: 21 Feb 2012
Posts: 5

Posted: Fri Mar 02, 2012 5:29 am    Post subject: Keeping data on GPU while looping and calling subroutines

Hi,

I would like to copy data to GPU memory once at the beginning of the program and then use this data (which is never modified) inside a loop that calls a subroutine.
Here is a trivial example:

Code:

subroutine accumulateTrigo(a, size, sum)
    integer :: ii,jj, size
    real, dimension(size) :: a
    real :: sum

!$acc data region copyin(a)
    do jj=1,500
    sum=0.0
!$acc region
      do ii=1,size
            sum = sum + sin(a(ii)) ** 2 + cos(a(ii)) ** 2
      enddo
!$acc end region     
    enddo
!$acc end data region
    print *, "sum = ", sum
    return
end subroutine



                       
program main
    real, dimension(100000) ::  X
    integer :: Xsize,m,i,k,c1,c2   
    real :: lastSum
   
    Xsize = 100000   
    m = 5           ! m calls to subroutine accumulateTrigo
 
! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   

! initialization of array X
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo

! computations on GPU   
    call system_clock( count=c1 )
    do k= 1, m     
        call accumulateTrigo(X, Xsize, lastSum)
    enddo
   
    print *, "LAST = ", lastSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end program


This works perfectly. However, array X is copied to the GPU 5 times (once at the data region entry on each call to the subroutine). Since X is never modified, I'd like to copy it only once.
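
A quick sanity check on the value the example should print: since sin^2 + cos^2 = 1 for every element, the inner loop should return the array length whatever the contents of a are. Written out, with a_i standing for a(ii):

```latex
\mathrm{sum} \;=\; \sum_{i=1}^{\mathrm{size}} \left( \sin^2 a_i + \cos^2 a_i \right)
\;=\; \sum_{i=1}^{\mathrm{size}} 1 \;=\; \mathrm{size}
```

With size = 100000 the exact result is 100000. In single precision the computed value may differ from that by a small rounding error.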

I tried to use the "!$acc mirror" and "!$acc reflected" directives, but either the code does not compile or it segfaults at run time. I am probably missing something obvious, since I am not a Fortran expert and get confused by that "dummy array" thing, so I'd really appreciate any help.

Please note that I read this post about inlining the subroutine, but when I try to put a region around the k loop in the main program, I get the same "ACON unsupported operation" error described here.

Thanks a lot
Nicolas
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Fri Mar 02, 2012 9:24 am

Hi Nicolas,

Quote:
I tried to use the "!$acc mirror" and "!$acc reflected" directives, but either the code does not compile or it segfaults at run time. I am probably missing something obvious, since I am not a Fortran expert and get confused by that "dummy array" thing, so I'd really appreciate any help.
Using mirror or reflected is the way to do this. The only thing you're missing is an F90 interface. The easiest thing to do is to put your routine in a module, where the interface is generated for you automatically. Alternatively, you can write an explicit interface block by hand. Examples of both are below. Note that I changed the module version a bit to show how mirror works, but you could use reflected instead.

- Mat

Using reflected:
Code:
% cat reflect.f90

subroutine accumulateTrigo(a, size, sum)
    integer :: ii,jj, size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)

    do jj=1,500
    sum=0.0
!$acc region
      do ii=1,size
            sum = sum + sin(a(ii)) ** 2 + cos(a(ii)) ** 2
      enddo
!$acc end region     
    enddo
    print *, "sum = ", sum
    return
end subroutine

                       
program main
    real, dimension(100000) ::  X
    integer :: Xsize,m,i,k,c1,c2   
    real :: lastSum
 
interface
  subroutine accumulateTrigo(a, size, sum)
    integer :: size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
  end subroutine accumulateTrigo
end interface
 
    Xsize = 100000   
    m = 5           ! m calls to subroutine accumulateTrigo
 
! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   

! initialization of array X
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo

!$acc data region copyin(X)
! computations on GPU   
    call system_clock( count=c1 )
    do k= 1, m     
        call accumulateTrigo(X, Xsize, lastSum)
    enddo
!$acc end data region
   
    print *, "LAST = ", lastSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end program
% pgfortran reflect.f90 -Mpreprocess -ta=nvidia -Minfo=accel -fast
accumulatetrigo:
      6, Generating reflected(a(:))
     10, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     11, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 13 registers; 1080 shared, 132 constant, 28 local memory bytes; 66% occupancy
             CC 2.0 : 15 registers; 1032 shared, 140 constant, 4 local memory bytes; 100% occupancy
         12, Sum reduction generated for sum
main:
     48, Generating copyin(x(:))
% a.out
 sum =     100000.0   
 sum =     100000.0   
 sum =     100000.0   
 sum =     100000.0   
 sum =     100000.0   
 LAST =     100000.0   
    2011.558      milliseconds


Using Mirror:
Code:

% cat mirror.f90
module mymod

!MEC Change this to be an allocatable array so Mirror can be used.
    real, allocatable, dimension(:) ::  X
!$acc mirror (X)

contains

! Since X is a module array, no need to pass it in
subroutine accumulateTrigo(size, sum)
    integer :: ii,jj, size
    real :: sum

    do jj=1,500
    sum=0.0
!$acc region
      do ii=1,size
            sum = sum + sin(X(ii)) ** 2 + cos(X(ii)) ** 2
      enddo
!$acc end region     
    enddo
    print *, "sum = ", sum
    return
end subroutine

end module mymod
                       
program main
    use mymod
    integer :: Xsize,m,i,k,c1,c2   
    real :: lastSum

! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   
   
    Xsize = 100000 
!MEC X needs to be allocated. One copy is done on the device
!MEC and one on the host.  No data movement is done!
    allocate(X(Xsize))
    m = 5           ! m calls to subroutine accumulateTrigo
 
! initialization of array X
!MEC Either put this into  a compute region
!$acc region do
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo
!
!MEC or use the update directive to synchronize the host and device copies.
!acc update device(X)

! computations on GPU   
    call system_clock( count=c1 )
    do k= 1, m     
        call accumulateTrigo(Xsize, lastSum)
    enddo
   
    print *, "LAST = ", lastSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end program
% pgf90 mirror.f90 -Mpreprocess -ta=nvidia -Minfo=accel -fast
accumulatetrigo:
     16, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     17, Loop is parallelizable
         Accelerator kernel generated
         17, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 13 registers; 1088 shared, 132 constant, 28 local memory bytes; 66% occupancy
             CC 2.0 : 15 registers; 1032 shared, 148 constant, 4 local memory bytes; 100% occupancy
         18, Sum reduction generated for sum
main:
     46, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     47, Loop is parallelizable
         Accelerator kernel generated
         47, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 3 registers; 48 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 6 registers; 8 shared, 56 constant, 0 local memory bytes; 100% occupancy
% a.out
 sum =     100000.0   
 sum =     100000.0   
 sum =     100000.0   
 sum =     100000.0   
 sum =     100000.0   
 LAST =     100000.0   
    1960.875      milliseconds
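
The examples above time the loop with system_clock and divide the tick difference by 1000.0, which gives a correct millisecond value only when the clock ticks one million times per second. A minimal sketch of a rate-independent conversion follows. The module name timing_sketch and the function elapsed_ms are hypothetical helpers, not part of the thread's code, and counter wrap-around is not handled.

```fortran
module timing_sketch
    use iso_fortran_env, only: int64, real64
    implicit none
contains
    ! Hypothetical helper: convert two system_clock readings into
    ! elapsed milliseconds.
    !   c1, c2 : counts taken before and after the timed code
    !   rate   : ticks per second, from system_clock(count_rate=rate)
    ! Returns -1.0 when the processor reports no usable clock (rate <= 0).
    pure function elapsed_ms(c1, c2, rate) result(ms)
        integer(int64), intent(in) :: c1, c2, rate
        real(real64) :: ms
        if (rate <= 0) then
            ms = -1.0_real64
        else
            ms = 1000.0_real64 * real(c2 - c1, real64) / real(rate, real64)
        end if
    end function elapsed_ms
end module timing_sketch
```

To use it, declare c1, c2 and rate as integer(int64), call system_clock(count=c1, count_rate=rate) before the k loop and system_clock(count=c2) after it, then print elapsed_ms(c1, c2, rate).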
sslgamess



Joined: 23 Nov 2009
Posts: 35

Posted: Fri Mar 02, 2012 11:13 am

This is very interesting!

What is the difference between mirror and reflected? When would you use one over the other?
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Fri Mar 02, 2012 12:19 pm

Quote:
What is the difference between mirror and reflected?
"Mirror" mirrors the allocation status between the device and host of allocatable arrays. It also creates an implicit data region having the same scope as the array.

"Reflected" is an attribute applied to a dummy argument, indicating that it already has a device copy. A reflected argument must be contained within a higher-level explicit or implicit data region.

Quote:
When would you use one over the other?
Mirror gives a more 'global' view of your data but requires direct knowledge of the variable name (this is why I changed the subroutine in the mirror.f90 example to use "X" instead of "a"). Reflected works on smaller scopes but allows different device arrays to be used.

The two ideas can work together. For example, let's modify the reflect.f90 example to have different "mirror" arrays passed to the subroutine.

Code:
% cat reflect2.f90
module mymod

    real, allocatable, dimension(:) ::  X,Y
!$acc mirror(X,Y)
!MEC This creates an implicit data region for X, Y within the same scope.

contains

subroutine accumulateTrigo(a, size, sum)
    integer :: ii,jj, size
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)

    do jj=1,500
    sum=0.0
!$acc region
      do ii=1,size
            sum = sum + sin(a(ii)) ** 2 + cos(a(ii)) ** 2
      enddo
!$acc end region     
    enddo
    print *, "sum = ", sum
    return
end subroutine

end module mymod
                       
program main
    use mymod
    integer :: Xsize,Ysize,m,i,k,c1,c2   
    real :: lastXSum, lastYSum
    Xsize = 100000   
    Ysize = 160000   
    m = 5           ! m calls to subroutine accumulateTrigo
 
! GPU initialization
#ifdef _ACCEL
    call acc_init( acc_device_nvidia )
#endif   

    allocate(X(Xsize),Y(Ysize))

! initialization of arrays X and Y
    do i = 1,Xsize
        X(i) = (i*2.0)
    enddo
    do i = 1,Ysize
        Y(i) = (i*2.0)
    enddo

!$acc update device(X,Y)

! computations on GPU   
    call system_clock( count=c1 )
    do k= 1, m     
        call accumulateTrigo(X, Xsize, lastXSum)
        call accumulateTrigo(Y, Ysize, lastYSum)
    enddo
   
    print *, "LAST = ", lastXSum, lastYSum
    call system_clock( count=c2 )
    print *, (c2-c1)/1000.0, ' milliseconds'
end program
% pgf90 -ta=nvidia -Mpreprocess reflect2.f90 -Minfo=accel
accumulatetrigo:
     13, Generating reflected(a(:))
     17, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     18, Loop is parallelizable
         Accelerator kernel generated
         18, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 13 registers; 1080 shared, 132 constant, 28 local memory bytes; 66% occupancy
             CC 2.0 : 15 registers; 1032 shared, 140 constant, 4 local memory bytes; 100% occupancy
         19, Sum reduction generated for sum
main:
     52, Generating update device(y(:))
         Generating update device(x(:))
% a.out
 sum =     100000.0   
 sum =     160000.0   
 sum =     100000.0   
 sum =     160000.0   
 sum =     100000.0   
 sum =     160000.0   
 sum =     100000.0   
 sum =     160000.0   
 sum =     100000.0   
 sum =     160000.0   
 LAST =     100000.0        160000.0   
    4220.299      milliseconds
sslgamess



Joined: 23 Nov 2009
Posts: 35

Posted: Sat Mar 03, 2012 2:56 pm

Can the interface be between two subprograms? Or is it limited to a main program and a subprogram?
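
For reference, standard Fortran allows an interface block in the specification part of any program unit, including a subroutine. A minimal sketch of the subroutine-to-subroutine case follows, assuming a hypothetical caller named driver that is not part of the thread's code. Whether the PGI reflected directive behaves the same way in this position as it does in a main program is an assumption this page does not confirm. Compilers without PGI Accelerator support treat the !$acc lines as comments.

```fortran
! Callee: expects its dummy array to already have a device copy.
subroutine accumulateTrigo(a, size, sum)
    implicit none
    integer :: size, ii
    real, dimension(size) :: a
    real :: sum
!$acc reflected (a)
    sum = 0.0
!$acc region
    do ii = 1, size
        sum = sum + sin(a(ii))**2 + cos(a(ii))**2
    enddo
!$acc end region
end subroutine accumulateTrigo

! Hypothetical caller that is itself a subroutine, not the main program.
subroutine driver(a, size, sum)
    implicit none
    integer :: size
    real, dimension(size) :: a
    real :: sum
    ! Explicit interface so the caller knows that "a" is reflected.
    interface
        subroutine accumulateTrigo(a, size, sum)
            integer :: size
            real, dimension(size) :: a
            real :: sum
!$acc reflected (a)
        end subroutine accumulateTrigo
    end interface
    ! Enclosing data region that supplies the device copy (assumed placement).
!$acc data region copyin(a)
    call accumulateTrigo(a, size, sum)
!$acc end data region
end subroutine driver
```

Placing both routines in one module, as in the mirror.f90 example, would avoid writing the interface block by hand.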
All times are GMT - 7 Hours