PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

array copy from device to device

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
bguler



Joined: 28 May 2010
Posts: 17

PostPosted: Tue Nov 02, 2010 10:13 am    Post subject: array copy from device to device Reply with quote

Hi,

I have a problem with a simple fortran cuda application. Below is the code. It works fine if I set the dimensions of the array small (i.e. Nb=2, Ns=3 in the below code). However, if I increase the dimensions to the below numbers, then it gives the following error:
0: copyover Memcpy <...> FAILED: 30<unknown error>

It looks like there is a problem during the copy of the device array. But I don't understand it. Why do I have the problem when the array dimension is high only?

Can anyone suggest anything?


Thanks

Bulent

!==============================================!

module gpu
use cudafor

implicit none

real, device, dimension(:,:,:,:), allocatable :: vd,vdp
integer, device :: Ns_dev,Nb_dev


contains

attributes(global) subroutine v_kernel()
implicit none

integer :: i,j,k,s,b,n,q
real :: v1,v2
real, shared :: vtemp(768),vtemp2(768)

i=blockidx%y
j=blockidx%x
k=ceiling(real(j)/(Ns_dev*Nb_dev))
s=ceiling(real(j-(k-1)*Ns_dev*Nb_dev)/Nb_dev)
b=j-(k-1)*(Ns_dev*Nb_dev)-(s-1)*Nb_dev
j=threadidx%x

vtemp(j)=vdp(b,s,k,j)
call syncthreads()
vtemp2(j)=log(real(b+s+k+i-j))+vtemp(j)
call syncthreads()
vd(b,s,k,i)=maxval(vtemp2)
end subroutine v_kernel
end module

!==============================================================================!

program main
use cudafor
use gpu

implicit none

integer :: istat, idevice
integer :: j
type(dim3) :: dimGrid, dimBlock
real :: begin, finish, error
integer, parameter :: Nb=20,Ns=30,Nk=7,Ni=768

idevice=0

istat=cudaSetDevice(idevice)

dimGrid = dim3(Nb*Ns*Nk,Ni,1)
dimBlock = dim3(Ni,1,1)

Ns_dev=Ns
Nb_dev=Nb

allocate(vd(Nb,Ns,Nk,Ni),vdp(Nb,Ns,Nk,Ni))

vdp=0
call cpu_time(begin)
do j=1,5
call v_kernel<<<dimGrid,dimBlock>>>()
istat = cudaThreadSynchronize()
vdp=vd
end do
istat = cudaThreadSynchronize()
call cpu_time(finish)
print*, 'gpu time=', finish-begin, 'seconds'

deallocate(vd,vdp)

end program
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6146
Location: The Portland Group Inc.

PostPosted: Tue Nov 02, 2010 5:30 pm    Post subject: Reply with quote

Hi Bulent,

I'm not sure what's going on here. When I run the program, it seems to succeed some times and fails other times with:
Quote:
0: copyout Memcpy (host=0x2abfc0ea9020, dev=0x5100000, size=12835200) FAILED: 4(unspecified launch failure)


"unspecified launch failure" typically means that a memory access violation occurred so we'll need to look at your kernel to make sure your not getting an out-of-bound error. Though, I'm out of time for today. I'll look again tommorrow.

- Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group