PGI User Forum


Declaring local arrays in device code
mkcolg



Joined: 30 Jun 2004
Posts: 6142
Location: The Portland Group Inc.

Posted: Thu Jun 07, 2012 11:42 am

cudadevicesetcacheconfig was added along with the other new CUDA 4.0 features in 11.7, though we did not switch over to using CUDA 4.0 by default until 12.0. Hence, if you are using 11.7 through 11.10, compile with -Mcuda=4.0.

- Mat
TheMatt



Joined: 06 Jul 2009
Posts: 317
Location: Greenbelt, MD

Posted: Thu Jun 07, 2012 12:19 pm

mkcolg wrote:
cudadevicesetcacheconfig was added along with the other new CUDA 4.0 features in 11.7. Though we did not switch over to use CUDA 4.0 by default till 12.0. Hence, if you are using 11.7 through 11.10, compile with -Mcuda=4.0.

That's it! I was racking my brain to figure out how I was running this with 11.8: it was because I'd moved to 4.0 by default (and hopefully, soon, to 4.1!).

I'd be interested to know if you see any big benchmark differences with the FuncCache usage.
crip_crop



Joined: 28 Jul 2010
Posts: 68

Posted: Thu Jun 07, 2012 12:24 pm

That's worked a treat, cheers guys.

But, as programming tends to go, one problem solved, another problem formed...

I really don't have a clue what it doesn't like about this code:

Code:
      Module Acceler_formd
      USE cudafor
      implicit none

      parameter(maxatm=DEFMAXATM,maxelmnt=DEFMAXELMNT)

!     GPU specific declarations
      integer,allocatable,device,dimension(:)::ian_d
      integer,allocatable,device,dimension(:)::natorb_d
      integer,allocatable,device,dimension(:)::lowlim_d
      double precision,allocatable,device,dimension(:)::globdens,ftot_d
      integer,allocatable,device,dimension(:,:)::totsubsys_d,
     &     coresubsys_d
      integer,allocatable,device,dimension(:)::subbasis_d
      double precision,allocatable,device,dimension(:,:)::B,
     &     subeval,subnelec
      double precision,allocatable,device,dimension(:,:,:)::coeff,
     &     subevec,subscr1,subdens
     
      integer::maxatm,maxelmnt

      CONTAINS

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

      subroutine formd_cuda(ftot_h,dtot,lowt,natoms,ian_h,natorb_h,
     &     lowlim_h,itr,nbasis,ifact_h,nelecs,maxbasfun,
     &     subsystems,subnelec_h,subbasis_h,totsubsys_h,coresubsys_h)
     
      implicit none

      integer,dimension(maxatm)::ian_h
      integer,dimension(maxelmnt)::natorb_h
      integer,dimension(natoms)::lowlim_h
      integer,dimension(nbasis)::ifact_h
      double precision, dimension(lowt)::ftot_h,dtot
      integer::subsystems,cresidues,bresidues,x,y,I,xend,
     &     xj,xk,maxbasfun
      double precision::temp
      integer,dimension(subsystems,2)::totsubsys_h,coresubsys_h
      integer,dimension(subsystems)::subbasis_h
      double precision,dimension(subsystems,maxbasfun)::B_h,subeval_h,
     &     subnelec_h
      integer::lowt,natoms,itr,counter,nbasis,nelecs,
     &     j,llk,ij,callno
      double precision::ef

!     GPU specific declarations
      integer:: nthreads,blocksize,threadblocks,istat,cuError
      type(dim3)::dimGrid,dimBlock
      character*120 errmsg

!     Set device to prefer L1 cache to shared memory
      istat=cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)


      write(*,*)maxatm,maxelmnt,natoms,lowt,subsystems,maxbasfun

      write(*,*)"1"
      allocate(ian_d(maxatm))
      write(*,*)"2"
      allocate(natorb_d(maxelmnt))
      write(*,*)"3"
      allocate(lowlim_d(natoms))
       write(*,*)"4"
      allocate(globdens(lowt))
      write(*,*)"5"
      allocate(ftot_d(lowt))
      write(*,*)"6"
      allocate(totsubsys_d(subsystems,2))
      write(*,*)"7"
      allocate(coresubsys_d(subsystems,2))
      write(*,*)"8"
      allocate(subbasis_d(subsystems))
      write(*,*)"9"
      allocate(B(subsystems,maxbasfun))
      write(*,*)"10"
      allocate(subeval(subsystems,maxbasfun))
      write(*,*)"11"
      allocate(subnelec(subsystems,maxbasfun))
      write(*,*)"12"
      allocate(subevec(subsystems,maxbasfun,maxbasfun))
      write(*,*)"13"
      allocate(coeff(subsystems,maxbasfun,maxbasfun))
      write(*,*)"14"
      allocate(subscr1(subsystems,maxbasfun,maxbasfun))
      write(*,*)"15"
      allocate(subdens(subsystems,maxbasfun,maxbasfun))
      write(*,*)"16"
           .
           .
           .
 
         


I'm getting the runtime error:

Quote:
0: ALLOCATE: copyin Symbol Memcpy FAILED:11(invalid argument)


As far as I can see, I've declared the correct arrays as device, allocatable, and I've checked that all the extent scalars have values. When I run it with the "write" statements in, it doesn't get further than "1"....

Any ideas?

Cheers for your help,
Crip_crop
mkcolg



Joined: 30 Jun 2004
Posts: 6142
Location: The Portland Group Inc.

Posted: Thu Jun 07, 2012 3:21 pm

Does the code run without the call to cudaDeviceSetCacheConfig? You need a Fermi card to use this feature (compute capability 2.0), though it should just be a noop on other devices. Doubt it would cause this error, but maybe.

Other than that, I don't see anything obvious. The error seems to suggest the problem occurs at a memcpy call, not an allocate, so it's unclear what's happening. Try running in emulation mode (-Mcuda=emu) and see if the error still occurs. If so, then you can run the code through the debugger to find the error. Otherwise, start commenting out code until you can narrow down the problem.

- Mat
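
One way to do the narrowing-down step suggested above is to give every device allocate its own status check, so the first failing statement identifies itself. The following is only a minimal sketch, not the poster's code: it is written in free-form CUDA Fortran rather than the fixed form used in the thread, and the module name, the check routine, and the arrays a_d and b_d are all hypothetical stand-ins. It also assumes that a nonzero stat= value signals a failed device allocation; the text printed comes from cudaGetLastError, as in the poster's own error check.

```fortran
! Sketch only: free-form CUDA Fortran, hypothetical names throughout.
module alloc_check_sketch
  use cudafor
  implicit none
  ! Hypothetical device arrays standing in for ian_d, natorb_d, B, ...
  integer, allocatable, device, dimension(:) :: a_d
  double precision, allocatable, device, dimension(:,:) :: b_d
contains
  ! Report which step failed, plus the last CUDA runtime error string.
  subroutine check(label, istat)
    character(len=*), intent(in) :: label
    integer, intent(in) :: istat
    integer :: cuError
    if (istat /= 0) then
      cuError = cudaGetLastError()
      write(*,*) 'FAILED at ', label, ' stat=', istat
      write(*,*) trim(cudaGetErrorString(cuError))
      stop
    end if
  end subroutine check
end module alloc_check_sketch

program main
  use alloc_check_sketch
  implicit none
  integer :: istat, n, m
  n = 1000   ! assumed extents, for illustration only
  m = 16
  ! One allocate per statement, each with its own status check.
  allocate(a_d(n), stat=istat)
  call check('allocate a_d', istat)
  allocate(b_d(n, m), stat=istat)
  call check('allocate b_d', istat)
  deallocate(a_d, b_d)
end program main
```

With one labelled check per allocate, there is no need to rely on write statements between the allocates or on the line number reported by the runtime.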
crip_crop



Joined: 28 Jul 2010
Posts: 68

Posted: Fri Jun 08, 2012 12:12 am

Quote:
Does the code run without the call to cudaDeviceSetCacheConfig? You need a Fermi card to use this feature (compute capability 2.0), though it should just be a noop on other devices. Doubt it would cause this error, but maybe.


I'm running it on a Fermi so this shouldn't be an issue.

When I run it in emulation mode it gives me this compiler error:

Quote:
/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: undefined reference to `pgf90_dev_mod_alloc03_i8'
/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: undefined reference to `pgf90_dev_mod_alloc03_i8'
/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: undefined reference to `pgf90_dev_mod_alloc03_i8'
/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: undefined reference to `pgf90_dev_mod_alloc03_i8'
/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: undefined reference to `pgf90_dev_mod_alloc03_i8'
acceler_formd.o:/home/mbdx6pn2/work/DivNCon/gpu_div/div/./acceler_formd.f:58: more undefined references to `pgf90_dev_mod_alloc03_i8' follow
/opt/pgi/linux86-64/11.8/libso/libcudaforemu.so: undefined reference to `cublasAlloc'
/opt/pgi/linux86-64/11.8/libso/libcudaforemu.so: undefined reference to `cublasFree'


which is referring to the first line in the following block of code (line 58):

Code:
      write(*,*)"2"
      write(*,*)"3"
       write(*,*)"4"
      allocate(globdens(lowt))
      write(*,*)"5"
      allocate(ftot_d(lowt))
      write(*,*)"6"
      allocate(totsubsys_d(subsystems,2))
      write(*,*)"7"
      allocate(coresubsys_d(subsystems,2))
      write(*,*)"8"
      allocate(subbasis_d(subsystems))
      write(*,*)"9"
      allocate(B(subsystems,maxbasfun))
      write(*,*)"10"
      allocate(subeval(subsystems,maxbasfun))
      write(*,*)"11"
      allocate(subnelec(subsystems,maxbasfun))
      write(*,*)"12"
      allocate(subevec(subsystems,maxbasfun,maxbasfun))
      write(*,*)"13"
      allocate(coeff(subsystems,maxbasfun,maxbasfun))
      write(*,*)"14"
      allocate(subscr1(subsystems,maxbasfun,maxbasfun))
      write(*,*)"15"
      allocate(subdens(subsystems,maxbasfun,maxbasfun))
      write(*,*)"16"
      allocate(ian_d(maxatm))
      allocate(natorb_d(maxelmnt))
      allocate(lowlim_d(natoms))


      istat=cudathreadsynchronize()
      cuError = cudaGetLastError()
      if (cuError .ne. 0) then
         errMsg = cudaGetErrorString(cuError)
         print *, "allocate"
         print *, trim(errMsg)
      end if



   

! Copy host arrays to device memory
     
      ftot_d=ftot_h
      globdens=dtot
      B=B_h
      subeval=subeval_h
     

      nthreads=subsystems
      blocksize=1

      if (mod(nthreads, blocksize)==0) then
         threadblocks=nthreads/blocksize
      else
         threadblocks=(nthreads/blocksize)+1
      end if
     
! Create the grid and block dimensions
      dimGrid= dim3(threadblocks, 1, 1)
      dimBlock= dim3 (blocksize, 1, 1)

...although it reports all the errors as being at line 58, which can't be right as this is a simple write statement.

I'm really stuck here... I'm struggling to isolate the problem because I think the compiler may be merging all the allocates together.

When will these compiler error messages become more specific? I've been programming with CUDA Fortran now for over a year and this is always what slows me down.

Crip_crop
All times are GMT - 7 Hours