PGI User Forum
Automatically Combine Allocation and Memory Copies

 
odlantern



Joined: 30 Aug 2010
Posts: 17

Posted: Fri Nov 09, 2012 9:06 am    Post subject: Automatically Combine Allocation and Memory Copies

I am attempting to create a Fortran library that automatically combines small allocations and data transfers into a single large allocation and a single large data transfer. The code I am currently porting to the GPU has a lot of small arrays. Transferring these arrays to the GPU takes a considerable amount of time because it performs many small transfers instead of a small number of large transfers.

Something like

Code:
...
real, allocatable :: a(:), b(:)
real, device, allocatable :: ad(:), bd(:)

allocate(a(N), b(N))
allocate(ad(N), bd(N))
...
ad = a
bd = b


becomes

Code:
...
real, pointer :: a(:), b(:)
real, device, pointer :: ad(:), bd(:)

call dbAllocCopy( a, ad, N )
call dbAllocCopy( b, bd, N )
call dbFlushAlloc()
call dbFlushCopy()
...


The routine dbAllocCopy in my library would take each request and store it in a list. Once dbFlushAlloc is called, all of the items in the list would be allocated as one large memory allocation on the CPU and one large memory allocation on the GPU. The pointers a, b, ad, and bd would then be associated with the proper sub-arrays of these larger arrays. dbFlushCopy would then copy the data as one large data transfer. The code could then proceed as normal.
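The host-side half of the bookkeeping described above can be sketched in standard Fortran. This is only an illustration under stated assumptions, not the poster's library: the module name `pool_sketch`, the routines `pool_request`, `pool_flush`, and `pool_view`, and the fixed capacity `max_requests` are all hypothetical. It shows requests being recorded, a single allocation covering all of them, and pointers being associated with the resulting sub-arrays. The device-side counterpart would apply the same offsets to one device array, which depends on the compiler support discussed in this thread.

```fortran
! Host-only sketch in standard Fortran; all names here are hypothetical.
module pool_sketch
  implicit none
  private
  public :: pool_request, pool_flush, pool_view

  integer, parameter :: max_requests = 64   ! hypothetical fixed capacity
  integer :: n_requests = 0
  integer :: sizes(max_requests)
  integer :: offsets(max_requests)          ! 0-based start of each block in the pool
  real, allocatable, target :: pool(:)

contains

  ! Record a request for n elements; returns a handle to use after the flush.
  integer function pool_request(n) result(handle)
    integer, intent(in) :: n
    if (n_requests >= max_requests) error stop "pool_sketch: too many requests"
    n_requests = n_requests + 1
    sizes(n_requests) = n
    handle = n_requests
  end function pool_request

  ! Perform one allocation that covers every recorded request.
  subroutine pool_flush()
    integer :: i, total
    total = 0
    do i = 1, n_requests
      offsets(i) = total
      total = total + sizes(i)
    end do
    if (allocated(pool)) deallocate(pool)
    allocate(pool(total))
    pool = 0.0
  end subroutine pool_flush

  ! Associate p with the sub-array reserved for the given handle.
  subroutine pool_view(handle, p)
    integer, intent(in) :: handle
    real, pointer, intent(out) :: p(:)
    p => pool(offsets(handle)+1 : offsets(handle)+sizes(handle))
  end subroutine pool_view

end module pool_sketch
```

A caller would obtain one handle per array from `pool_request`, call `pool_flush` once, and then call `pool_view` for each handle to get ordinary rank-1 pointers. Because every view lies inside the same `pool` array, the whole set can be moved with a single array assignment.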

I could easily do something like this in C with pointers and pointer arithmetic, but Fortran is a bit limited in these features (intentionally so, from what I understand). If device pointers worked, I could implement something like this easily, but the documentation indicates that device pointers are not supported yet.

The following simple program compiles, but the kernel fails to run.

Code:
module kernelModule

contains
      attributes(global) subroutine gg(ad, bd, cd,N)
       
          real, pointer, intent(in) :: ad(:), bd(:)
          integer, value, intent(in) :: N
          real, pointer, intent(out) :: cd(:)
         
          integer :: i
          i = threadidx%x + (blockidx%x - 1) * blockdim%x   
          if (i<=N) then     
            cd(i) = ad(i) * bd(i)
          endif
      end subroutine
end module

  program PointerTest
 
     use cudafor
     use kernelModule

      implicit none

      integer, parameter :: N = 1000
      real, allocatable, target :: aa(:)
      real, device, allocatable, target :: aad(:)
      real, pointer :: a(:), b(:), c(:)
      real, device, pointer :: ad(:), bd(:), cd(:)
      real :: m
      integer :: i
      integer :: err

      type(dim3) :: grid, block
      allocate( aa(N*3) )
      allocate( aad(N*3 ))
     
      a=>aa(1:N)
      b=>aa(N+1:2*N)
      c=>aa(2*N+1:3*N)

      ad=>aad(1:N)
      bd=>aad(N+1:2*N)
      cd=>aad(2*N+1:3*N)
     
      a = 3.0
      b = 0.8
      c = 0.0

      aad = aa
     

      grid = dim3( (N+255)/256, 1, 1)
      block = dim3( 256, 1, 1)

      print *, "calling kernel"
      call gg<<<grid, block>>>(ad,bd,cd,N)
      err = cudaGetLastError()
      print *, "error is ", err, cudaGetErrorString(err)
      print *, "done calling kernel"
      c = cd

      ! check the results on the CPU
      print *, "checking results on CPU"
      do i=1, N
        m = a(i) * b(i)
        if ( abs(c(i) - m) > 0.0001 ) then
            print *, "error at index ", i
            stop
        end if
      end do
      print *, "test ran succesfully"
      deallocate(aa)
      deallocate(aad)

      end program PointerTest


I am assuming this is because I am attempting to use device pointers. Is there a different way to do this that is supported by PGI Fortran? I have been looking into c_ptr and c_devptr and possible cross-language solutions, but this seems messy and I would like to stay in Fortran if possible.

Thanks for your time,
David
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Fri Nov 09, 2012 4:53 pm    Post subject:

Hi David,

Unfortunately, F90 pointers are still under development. We added basic support for them in order to support CUDA Texture memory, but full support is still forthcoming.

The good news is that I was able to get your program to run correctly if I remove the "pointer" attribute in the CUDA device kernel and, instead of copying back "c", copy back "aa=aad".

Code:
% cat pointer.cuf
module kernelModule

contains
      attributes(global) subroutine gg(ad, bd, cd,N)
       
          real :: ad(:), bd(:)
          integer, value, intent(in) :: N
          real :: cd(:)
         
          integer :: i
          i = threadidx%x + (blockidx%x - 1) * blockdim%x   
          if (i<=N) then     
            cd(i) = ad(i) * bd(i)
          endif
      end subroutine
end module

  program PointerTest
 
     use cudafor
     use kernelModule

      implicit none

      integer, parameter :: N = 1000
      real, allocatable, target :: aa(:)
      real, device, allocatable, target :: aad(:)
      real, pointer :: a(:), b(:), c(:)
      real, device, pointer :: ad(:), bd(:), cd(:)
      real :: m
      integer :: i
      integer :: err

      type(dim3) :: grid, block
      allocate( aa(N*3) )
      allocate( aad(N*3 ))
     
      a=>aa(1:N)
      b=>aa(N+1:2*N)
      c=>aa(2*N+1:3*N)

      ad=>aad(1:N)
      bd=>aad(N+1:2*N)
      cd=>aad(2*N+1:3*N)
     
      a = 3.0
      b = 0.8
      c = 0.0

      aad = aa
     
      grid = dim3( (N+255)/256, 1, 1)
      block = dim3( 256, 1, 1)

      print *, "calling kernel"
      call gg<<<grid, block>>>(ad,bd,cd,N)
      err = cudaGetLastError()
      print *, "error is ", err, cudaGetErrorString(err)
      print *, "done calling kernel"
      !c = cd
      aa(2*N+1:3*N)=aad(2*N+1:3*N)

      ! check the results on the CPU
      print *, "checking results on CPU"
      do i=1, N
        m = a(i) * b(i)
        if ( abs(c(i) - m) > 0.0001 ) then
            print *, "error at index ", i
            stop
        end if
      end do
      print *, "test ran succesfully"
      deallocate(aa)
      deallocate(aad)

      end program PointerTest   

% pgf90 pointer.cuf ; a.out
 calling kernel
 error is             0
 no error                                                                                                                       
 done calling kernel
 checking results on CPU
 test ran succesfully
odlantern



Joined: 30 Aug 2010
Posts: 17

Posted: Mon Nov 12, 2012 1:06 pm    Post subject:

I didn't have much luck using device pointers directly, but I did get this to work using c_ptr and c_devptr along with c_loc, c_devloc, and c_f_pointer. The c_f_pointer extension that PGI supports allows me to take a c_devptr and create an allocatable Fortran array (it doesn't have to be a pointer).
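For readers unfamiliar with the address-based approach, here is a host-only analog in standard Fortran, using just `c_loc` and `c_f_pointer` from `iso_c_binding`. It is a sketch under assumptions, not the code used above: the module name `cptr_view_sketch` and the routine `view_block` are hypothetical, and the device-side version with `c_devptr` and `c_devloc` is deliberately not shown because its details depend on the PGI compiler version.

```fortran
! Host-only analog in standard Fortran; names are hypothetical.
module cptr_view_sketch
  use, intrinsic :: iso_c_binding, only: c_ptr, c_loc, c_f_pointer, c_float
  implicit none
  private
  public :: view_block

contains

  ! Map a rank-1 Fortran pointer onto n elements of pool, starting at
  ! index first. Pass a whole, contiguous array that has the TARGET
  ! attribute, so the resulting pointer stays valid after the call.
  subroutine view_block(pool, first, n, p)
    real(c_float), target, intent(inout) :: pool(:)
    integer, intent(in) :: first, n
    real(c_float), pointer, intent(out) :: p(:)
    type(c_ptr) :: addr

    addr = c_loc(pool(first))        ! address of the block's first element
    call c_f_pointer(addr, p, [n])   ! view n elements from that address
  end subroutine view_block

end module cptr_view_sketch
```

With a nine-element target array `aa`, the call `call view_block(aa, 4, 3, b)` associates `b` with `aa(4:6)`, so writes through `b` are visible in `aa`. The routine does no bounds checking, so the caller must ensure that `first + n - 1` does not exceed the size of the array.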