PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

cuda fortran module data
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
SarahA



Joined: 29 Aug 2006
Posts: 16

PostPosted: Fri Apr 16, 2010 1:53 pm    Post subject: cuda fortran module data Reply with quote

I'm still having problems with device resident module data.
OK, it's late Friday afternoon. Am I doing something wrong here...

With 10.4.0 this fails:

pgfortran -DFAILS -g -r8 -Mextend -Mcuda=cc13,keepptx -c isolate.F
/tmp/pgcudaforLYthVwCYC_fd.gpu(27): error: identifier "_anymodule_17" is undefined

1 error detected in the compilation of "/tmp/pgnvd2buhIG25C5l2.nv0".
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code 0 (isolate.F: 66)
PGF90/x86-64 Linux 10.4-0: compilation aborted

Thanks in advance, Sarah

Code:
      module anymodule

      use cudafor

      integer, parameter :: Ncol=3, NS=512, Nx2=2, Nyzt=128, ND=4

#ifdef FAILS
      integer, device, allocatable, dimension(:) :: Leo
#endif

      end module anymodule

! -------------------------------------------------------------------------------
#ifdef FAILS
      attributes(global) SUBROUTINE mult(V2x,U_eo1)
#else
      attributes(global) SUBROUTINE mult(V2x,U_eo1,Leo )
#endif

      use anymodule
      complex*16, device, dimension(Ncol,Ncol,NS,2) :: U_eo1
      complex*16 Vt2_1, Vt2_2

#ifndef FAILS
      integer, device, dimension(Nyzt) :: Leo
#endif

C------------Variables--------------------------------------
      complex*16, device, dimension(Ncol,ND,Nx2,Nyzt)  :: V2x

      integer tidy
!
      ic = threadidx%x          ! 1..3
      tidy = threadidx%y
      iyzt = threadidx%y + (blockidx%x-1) * blockdim%y  ! 1..Nyzt

      ieo = 1

       if ( iyzt <= Nyzt ) then
!      DO 1000 iyzt = 1, Nyzt

       nn = ( 1-Leo(iyzt) )*(Nx2-1) + 1

       iv = Nx2 + (iyzt-1)*Nx2

!        DO 1200 ic = 1, Ncol
         Vt2_1 =
     &                U_eo1(1,ic,iv,ieo)*2.0
     &              + U_eo1(2,ic,iv,ieo)*3.0
     &              + U_eo1(3,ic,iv,ieo)*4.0
         Vt2_2 =
     &                U_eo1(1,ic,iv,ieo)*0.5
     &              + U_eo1(2,ic,iv,ieo)*1.5
     &              + U_eo1(3,ic,iv,ieo)*3.5

         V2x(ic,1,Nx2,iyzt)= + Vt2_1
         V2x(ic,2,Nx2,iyzt)= + Vt2_2
         V2x(ic,3,Nx2,iyzt)= DCMPLX(DIMAG(Vt2_2),-DBLE(Vt2_2))
         V2x(ic,4,Nx2,iyzt)= DCMPLX(DIMAG(Vt2_1),-DBLE(Vt2_1))
! 1200   CONTINUE
C
! 1000 CONTINUE
      endif

      return
      end subroutine mult
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

PostPosted: Fri Apr 16, 2010 3:30 pm    Post subject: Reply with quote

Hi Sarah,

Device code can only access a module's device data if it's contained in the same module. So in this case, you just need to make mult a contained subroutine within "anymodule".

Code:
      module anymodule

      use cudafor

      integer, parameter :: Ncol=3, NS=512, Nx2=2, Nyzt=128, ND=4
      integer, device, allocatable, dimension(:) :: Leo

      contains

! -------------------------------------------------------------------------------
      attributes(global) SUBROUTINE mult(V2x,U_eo1)

      complex*16, device, dimension(Ncol,Ncol,NS,2) :: U_eo1
      complex*16 Vt2_1, Vt2_2

C------------Variables--------------------------------------
      complex*16, device, dimension(Ncol,ND,Nx2,Nyzt)  :: V2x

      integer tidy
!
      ic = threadidx%x          ! 1..3
      tidy = threadidx%y
      iyzt = threadidx%y + (blockidx%x-1) * blockdim%y  ! 1..Nyzt

      ieo = 1

       if ( iyzt <= Nyzt ) then
!      DO 1000 iyzt = 1, Nyzt

       nn = ( 1-Leo(iyzt) )*(Nx2-1) + 1

       iv = Nx2 + (iyzt-1)*Nx2

!        DO 1200 ic = 1, Ncol
         Vt2_1 =
     &                U_eo1(1,ic,iv,ieo)*2.0
     &              + U_eo1(2,ic,iv,ieo)*3.0
     &              + U_eo1(3,ic,iv,ieo)*4.0
         Vt2_2 =
     &                U_eo1(1,ic,iv,ieo)*0.5
     &              + U_eo1(2,ic,iv,ieo)*1.5
     &              + U_eo1(3,ic,iv,ieo)*3.5

         V2x(ic,1,Nx2,iyzt)= + Vt2_1
         V2x(ic,2,Nx2,iyzt)= + Vt2_2
         V2x(ic,3,Nx2,iyzt)= DCMPLX(DIMAG(Vt2_2),-DBLE(Vt2_2))
         V2x(ic,4,Nx2,iyzt)= DCMPLX(DIMAG(Vt2_1),-DBLE(Vt2_1))
! 1200   CONTINUE
C
! 1000 CONTINUE
      endif

      return
      end subroutine mult

      end module anymodule

Hope this helps,
Mat
Back to top
View user's profile
SarahA



Joined: 29 Aug 2006
Posts: 16

PostPosted: Mon Apr 19, 2010 10:00 am    Post subject: Reply with quote

Quote:

Device code can only access a module's device data if it's contained in the same module. So in this case, you just need to make mult a contained subroutine within "anymodule"


It seems to be more than that. Even with NO device data, it seems that a global ( kernel ) routine must be in a module.

Is this the case? This is not documented as far as I can see. CUDA/fortran sec. 3.1.4 does state a device&host routine must be in a module.

Putting the global routines in a module will unfortunately enforce parameter type checking, and I am having problems porting ( for performance ) a fortran-77 style code which aliases arrays as formal parameters.

For example, deleting the module definition module/end-module and use statements from the example code gives the same sort of unsat. external message.

/tmp/pgfortran_tOg63owqMhb.o: In function `mmul_':
/home/users/saraha/./matmul.CUF:121: undefined reference to `mmul_kernel_'

Code:
! start the module containing the matrix multiply kernel
!module mmul_mod
!    use cudafor
!    contains

! mmul_kernel computes A*B into C where A is NxM, B is MxL, C is then NxL

    attributes(global) subroutine mmul_kernel( A, B, C, N, M, L )
   use cudafor
       real,device :: A(N,M), B(M,L), C(N,L)
       integer, value :: N, M, L
       integer :: i, j, kb, k, tx, ty

! submatrices are declared to be in CUDA shared memory

       real, shared :: Asub(16,16), Bsub(16,16)

! the value of C(i,j) being computed, a temporary scalar

       real :: Cij

! Start execution, first get my thread indices

       tx = threadidx%x
       ty = threadidx%y

! This thread computes C(i,j) = sum(A(i,:) * B(:,j))

       i = (blockidx%x-1) * 16 + tx
       j = (blockidx%y-1) * 16 + ty

       Cij = 0.0

! Do the k loop in chunks of 16, the block size

       do kb = 1, M, 16

! Fill the submatrices; each of 16x16 threads in the thread block
! loads one element of Asub and Bsub

          Asub(tx,ty) = A(i,kb+ty-1)
          Bsub(tx,ty) = B(kb+tx-1,j)

! Wait until all elements are filled

          call syncthreads()

! Multiply the two submatrices; ! Each of the 16x16 threads accumulates the
! dot product for its element of C(i,j)

          do k = 1,16
             Cij = Cij + Asub(tx,k) * Bsub(k,ty)
          enddo

! Synchronize to make sure all threads are done reading the submatrices before
! overwriting them in the next iteration of the kb loop

          call syncthreads()

       enddo

! Each of the 16x16 threads stores its element to the global C array

       C(i,j) = Cij

    end subroutine mmul_kernel


! The host routine to drive the matrix multiplication

    subroutine mmul( A, B, C )
   use cudafor

! assumed shape input arrays

       real, dimension(:,:) :: A, B, C

! Array dimensions

       integer :: N, M, L

! allocatable device arrays

       real, device, allocatable, dimension(:,:) :: Adev,Bdev,Cdev

! dim3 variables to define the grid and block shapes

       type(dim3) :: dimGrid, dimBlock
       integer :: r

! Get the array sizes

       real ctimeall, ctimekernel, flops, mflopskernel, mflopsall
       integer c1, c2, c3, c4

! Begin execution, first determine the sizes of the input arrays

       N = size( A, 1 )
       M = size( A, 2 )
       L = size( B, 2 )

! Start data xfer-inclusive timer and allocate the device arrays using
! F90 ALLOCATE

       call system_clock( count=c1 )
       allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )

! Copy A and B to the device using F90 array assignments

       Adev = A(1:N,1:M)
       Bdev = B(1:M,1:L)

! Create the grid and block dimensions

       dimGrid = dim3( N/16, L/16, 1 )
       dimBlock = dim3( 16, 16, 1 )

! Start data xfer-exclusive timer, launch the GPU kernel, wait for completion

       call system_clock( count=c2 )
       call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
       r = cudathreadsynchronize()

! Stop data xfer-exlusive timer, copy the results back, stop data xfer-
! inclusive timer

       call system_clock( count=c3 )
       C(1:N,1:L) = Cdev
       call system_clock( count=c4 )

! Calculate inclusive/exclusive execution times, and report MFLOPS

       flops = float(N) * float(M) * float(L)
       ctimekernel = c3 - c2
       mflopskernel = flops / ctimekernel
       ctimeall = c4 - c1
       mflopsall = flops / ctimeall

!  Print out results

       print *, 'Kernel time excluding data xfer:', ctimekernel, ' microseconds'
       print *, 'Megaflops excluding data xfer:  ', mflopskernel
       print *, 'Total time including data xfer: ', ctimeall, ' microseconds'
       print *, 'Megaflops including data xfer:  ', mflopsall

! Deallocate device arrays and exit

       deallocate( Adev, Bdev, Cdev )

    end subroutine mmul
!end module mmul_mod

! Main program to initialize arrays, invoke mmul, check results

program matmul
!   use mmul_mod
   use cudafor
   real,dimension(:,:),allocatable :: A,B,C,CC
   integer N, M, L
   integer idevice, istat

! Begin execution

   N = 512
   M = 1024
   L = 512
   idevice = 0
   print *,' arrays sized ', N, ' by ', M, ' by ', L
   allocate(A(N,M),B(M,L),C(N,L),CC(N,L))

! Initialize the A and B arrays;  zero out the C array to be computed
! on the GPU, and the CC array to be computed on the host

   do j = 1,M
      do i = 1,N
         A(i,j) = i*10 + j*1000
      enddo
   enddo
   do j = 1,L
      do i = 1,M
         B(i,j) = i-j
      enddo
   enddo
   do j = 1,L
      do i = 1,N
         CC(i,j) = 0.0
         C(i,j) = 0.0
      enddo
   enddo

! Initialize CPU device

  istat = cudaSetDevice(idevice) 

! Call matrix multiply subroutine to execute on the GPU to compute C

   print *,'calling mmul'
   call mmul( A, B, C )
   print *,' C(1,1) = ', C(1,1)
   print *,' C(2,2) = ', C(2,2)

! Perform matrix multiply on host to compute CC

   do i = 1,N
      do j = 1,L
         do k = 1,M
            CC(i,j) = CC(i,j) + A(i,k)*B(k,j)
         enddo
      enddo
   enddo

! Check for errors

   ierr = 0
   do j = 1,L
      do i = 1,N
         diff = abs(C(i,j) - CC(i,j))
         denom = CC(i,j)
         if ( denom == 0.0 ) denom = 1.0
         error = diff / denom
         if ( error > 2.0e-5 ) then
            ierr = ierr + 1
            if ( ierr <= 10 ) then
               print *, 'C(',i,',',j,') = ',C(i,j), ' should be ', CC(i,j), ' error=', error
            endif
         endif
      enddo
   enddo

   if( ierr == 0 )then
      print *, ' No errors found'
   else
      print *, ierr, ' ERRORS FOUND!!!'
   endif

end program
Back to top
View user's profile
SarahA



Joined: 29 Aug 2006
Posts: 16

PostPosted: Mon Apr 19, 2010 3:00 pm    Post subject: kernel code wrapper routine Reply with quote

To answer my own question..

Assuming it's true that kernel ("global") subroutines must be in a module, the only way to do the array aliasing is to use the approach suggested earlier.

For instance,

Code:
complex, device A(2,100)
call subr(A)
...
module cudastuff
attributes(global) subr_kernel(A)
complex, device A(200)
...
end subr_kernel
end module cudastuff

subroutine subr(A)
use cudastuff
complex, device A(200)
call subr_kernel<<<nblock>>>(A)
return
end subroutine subr
Back to top
View user's profile
sseyler



Joined: 09 Feb 2010
Posts: 1

PostPosted: Thu Sep 02, 2010 12:47 pm    Post subject: Reply with quote

I know I'm bumping an old thread, but in case people don't read the PGI CUDA Fortran user guide closely enough, subprograms with the "global" or "device" attribute are all considered "device" subprograms. Thus, a "global" subroutine must be contained in a module, since it is a device subprogram.
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group