PGI User Forum


call to cuMemcpy2D error 700

 
roonimathew



Joined: 17 Sep 2008
Posts: 4

Posted: Fri Feb 19, 2010 2:21 pm    Post subject: call to cuMemcpy2D error 700

Hello,
I'm exploring the possibility of porting my Fortran codes to CUDA. Using the example code (f3.f90) shipped with PGI 10.1, I wanted to compare GPU and host processing speeds on a 3D matrix. My modified version of f3.f90 is below. It compiles fine (compiler output is below), but I can't get it to run. The runtime error message is "call to cuMemcpy2D returned error 700: Launch failed".

Any insight into this will be appreciated.

Thanks,
Rooni.




module sm
contains
  subroutine smooth( a, b, w0, w1, w2, n, m, niters )
    real, dimension(:,:,:) :: a,b
    real :: w0, w1, w2
    integer :: n, m, niters
    integer :: i, j, iter, k
    !$acc region
    do iter = 1,niters
      do i = 2,n-1
        do j = 2,m-1
          a(i,j,iter) = w0 * b(i,j,iter) + &
            w1 * (b(i-1,j,iter) + b(i,j-1,iter) + b(i+1,j,iter) + b(i,j+1,iter)) + &
            w2 * (b(i-1,j-1,iter) + b(i-1,j+1,iter) + b(i+1,j-1,iter) + b(i+1,j+1,iter))
        enddo
      enddo
      do i = 2,n-1
        do j = 2,m-1
          b(i,j,iter) = a(i,j,iter)
        enddo
      enddo
    enddo
    !$acc end region
  end subroutine
  subroutine smoothhost( a, b, w0, w1, w2, n, m, niters )
    real, dimension(:,:,:) :: a,b
    real :: w0, w1, w2
    integer :: n, m, niters
    integer :: i, j, iter, k
    do iter = 1,niters
      do i = 2,n-1
        do j = 2,m-1
          a(i,j,iter) = w0 * b(i,j,iter) + &
            w1 * (b(i-1,j,iter) + b(i,j-1,iter) + b(i+1,j,iter) + b(i,j+1,iter)) + &
            w2 * (b(i-1,j-1,iter) + b(i-1,j+1,iter) + b(i+1,j-1,iter) + b(i+1,j+1,iter))
        enddo
      enddo
      do i = 2,n-1
        do j = 2,m-1
          b(i,j,iter) = a(i,j,iter)
        enddo
      enddo
    enddo
  end subroutine
end module

program main
  use sm
  use accel_lib
  real,dimension(:,:,:),allocatable :: aa, bb
  real,dimension(:,:,:),allocatable :: aahost, bbhost
  real :: w0, w1, w2
  integer :: i,j,n,m,k
  integer :: c0, c1, c2, c3, cgpu, chost
  integer :: errs, args
  character(10) :: arg
  real :: dif, tol

  n = 0
  m = 0
  args = command_argument_count()
  if( args .gt. 0 )then
    call getarg( 1, arg )
    read(arg,'(i10)') n
    if( args .gt. 1 )then
      call getarg( 2, arg )
      read(arg,'(i10)') m
      if( args .gt. 2 )then
        call getarg( 3, arg )
        if( arg .eq. 'host' .or. arg .eq. 'HOST' )then
          call acc_set_device( acc_device_host )
          print *, 'set host'
        else if( arg .eq. 'nvidia' .or. arg .eq. 'NVIDIA' )then
          call acc_set_device( acc_device_nvidia )
          call acc_init( acc_device_nvidia )
          print *, 'initialize nvidia'
        else
          print *, 'unknown device:', arg
          print *, 'using default'
        endif
      endif
    endif
  endif
  if( n .le. 0 ) n = 1000
  if( m .le. 0 ) m = n
  k = 11

  allocate( aa(n,m,k) )
  allocate( bb(n,m,k) )
  allocate( aahost(n,m,k) )
  allocate( bbhost(n,m,k) )
  do k = 1,11
    do i = 1,n
      do j = 1,m
        aa(i,j,k) = 0
        bb(i,j,k) = i*1000 + j
        aahost(i,j,k) = 0
        bbhost(i,j,k) = i*1000 + j
      enddo
    enddo
  enddo
  w0 = 0.5
  w1 = 0.3
  w2 = 0.2
  call system_clock( count=c1 )
  call smooth( aa, bb, w0, w1, w2, n, m, 11 )
  call system_clock( count=c2 )
  cgpu = c2 - c1
  call smoothhost( aahost, bbhost, w0, w1, w2, n, m, 11 )
  call system_clock( count=c3)
  chost = c3 - c2
  ! check the results
  errs = 0
  tol = 0.000005
  do k = 1,11
    do i = 1,n
      do j = 1,m
        dif = abs(aa(i,j,k) - aahost(i,j,k))
        if( aahost(i,j,k) .ne. 0 ) dif = abs(dif/aahost(i,j,k))
        if( dif .gt. tol )then
          errs = errs + 1
          if( errs .le. 10 )then
            print *, i, j, aa(i,j,k), aahost(i,j,k)
          endif
        endif
      enddo
    enddo
  enddo
  print *, errs, ' errors found'
  print *, cgpu, ' microseconds on GPU'
  print *, chost, ' microseconds on host'
end program






[localhost@localhost TEST]$ make f3.exe
pgfortran -o f3.exe f3.f90 -ta=nvidia -Minfo=accel -O2
NOTE: your trial license will expire in 14 days, 7.7 hours.
NOTE: your trial license will expire in 14 days, 7.7 hours.
smooth:
     24, Generating copyout(a(2:n-1,2:m-1,1:niters))
         Generating copyin(b(1:n,1:m,1:niters))
         Generating copyout(b(2:n-1,2:m-1,1:niters))
     25, Loop is parallelizable
     26, Loop is parallelizable
     27, Loop is parallelizable
         Accelerator kernel generated
         25, !$acc do parallel
             Cached references to size [18x18] block of 'b'
         26, !$acc do parallel, vector(16)
         27, !$acc do vector(16)
     33, Loop is parallelizable
     34, Loop is parallelizable
         Accelerator kernel generated
         25, !$acc do parallel, vector(4)
         33, !$acc do parallel, vector(16)
         34, !$acc do vector(4)
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Fri Feb 19, 2010 4:21 pm

Hi Rooni,

Thanks for the report. It looks like this was an error that was fixed in the 10.2 release. However, while the code runs with 10.2, it gets wrong answers. I have sent a report to our engineers (TPR#16640), and hopefully we can get this fixed for 10.3.

In the meantime, you can work around the problem by adding the flag "-ta=nvidia,oldcg". For example:
Code:
% pgf90 f3.f90 -ta=nvidia,oldcg -V10.1 -o f3.out
% f3.out
            0  errors found
       163691  microseconds on GPU
       389848  microseconds on host


Hope this helps,
Mat
roonimathew



Joined: 17 Sep 2008
Posts: 4

Posted: Mon Feb 22, 2010 8:26 am

Hi Mat,
That solution worked for array size n up to 5000. Beyond that, I get different error messages depending on the value of n. Error messages are:

1. call to cuMemcpy2D returned error 700: Launch failed
2. call to cuMemAlloc returned error 2: Out of memory

I'm trying to understand the problem. Is this a hardware limitation specific to the GPU, and can I get around the memory limit by using a newer, more powerful GPU? I'm currently using an NVIDIA Quadro FX3800 with PGI 10.2. Or is this a software problem related to the compiler or the code I'm trying to run?
Rooni.
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Mon Feb 22, 2010 2:12 pm

Hi Rooni,

Each 5000x5000x11 single precision array needs over 1 GB of memory. However, your Quadro FX3800 only has 1 GB total, so there is not enough space for the two arrays. You either need to reduce the size to below ~3400 or strip mine your code (i.e. compute only a section of your array at a time).
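
The sizes quoted above can be checked with a short calculation. This is only a sketch under stated assumptions: 4 bytes per single precision element, m = n (the program's default), 1 GB taken as 2^30 bytes, and no allowance for anything else the driver or runtime keeps in device memory.

```latex
\begin{align*}
\text{one } 5000 \times 5000 \times 11 \text{ array:}\quad
  & 5000 \times 5000 \times 11 \times 4 = 1.1 \times 10^{9}\ \text{bytes} \approx 1.02 \times 2^{30}\ \text{bytes} \\
\text{two } n \times n \times 11 \text{ arrays:}\quad
  & 2 \times 11 \times 4 \times n^{2} = 88\,n^{2}\ \text{bytes} \\
\text{fit in } 2^{30} \text{ bytes:}\quad
  & 88\,n^{2} \le 2^{30} \;\Rightarrow\; n \le \sqrt{2^{30}/88} \approx 3493
\end{align*}
```

Taking 1 GB as 10^9 bytes instead gives n of about 3370, so a limit of roughly 3400 is consistent with either convention once some headroom is left for other device allocations.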

Hope this helps,
Mat
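
As an illustration of the strip mining suggested above, here is a minimal host-only sketch. It relies on the fact that, in the posted code, each plane of the third dimension is smoothed independently of the others, so the planes can be processed a few at a time. The names smooth_slab, strip_mine_demo and nslab, and the small array sizes, are hypothetical and chosen only for this example. The accelerator directives are left out so that the sketch compiles with any Fortran compiler. In the PGI version the !$acc region pair would wrap the loops inside the slab routine, with the intent that only the current slab is copied to the GPU; whether the compiler generates exactly those transfers is an assumption that should be checked against the -Minfo=accel output.

```fortran
! Sketch only: strip mining the smoothing over the third array dimension.
! smooth_slab, nslab and the sizes below are hypothetical, not from the post.
module strip_mine_sketch
  implicit none
contains
  ! Same stencil as smooth() in the original post, applied to whatever
  ! slab of planes it is handed. In PGI Accelerator code the region
  ! directives from the original post would wrap the loop nest below.
  subroutine smooth_slab(a, b, w0, w1, w2)
    real, dimension(:,:,:), intent(inout) :: a, b
    real, intent(in) :: w0, w1, w2
    integer :: i, j, p
    do p = 1, size(a, 3)
      do j = 2, size(a, 2) - 1
        do i = 2, size(a, 1) - 1
          a(i,j,p) = w0 * b(i,j,p) + &
            w1 * (b(i-1,j,p) + b(i,j-1,p) + b(i+1,j,p) + b(i,j+1,p)) + &
            w2 * (b(i-1,j-1,p) + b(i-1,j+1,p) + b(i+1,j-1,p) + b(i+1,j+1,p))
        end do
      end do
      ! Copy the smoothed interior back, as in the original routine.
      do j = 2, size(a, 2) - 1
        do i = 2, size(a, 1) - 1
          b(i,j,p) = a(i,j,p)
        end do
      end do
    end do
  end subroutine smooth_slab
end module strip_mine_sketch

program strip_mine_demo
  use strip_mine_sketch
  implicit none
  integer, parameter :: n = 64, m = 48, k = 11
  integer, parameter :: nslab = 4      ! planes processed per call (assumed)
  real, allocatable :: a_all(:,:,:), b_all(:,:,:)
  real, allocatable :: a_strip(:,:,:), b_strip(:,:,:)
  integer :: i, j, p, k0, k1

  allocate(a_all(n,m,k), b_all(n,m,k), a_strip(n,m,k), b_strip(n,m,k))
  do p = 1, k
    do j = 1, m
      do i = 1, n
        b_all(i,j,p) = i*1000 + j
      end do
    end do
  end do
  a_all = 0.0
  a_strip = 0.0
  b_strip = b_all

  ! Reference: all planes in one call.
  call smooth_slab(a_all, b_all, 0.5, 0.3, 0.2)

  ! Strip mined: at most nslab planes per call.
  do k0 = 1, k, nslab
    k1 = min(k0 + nslab - 1, k)
    call smooth_slab(a_strip(:,:,k0:k1), b_strip(:,:,k0:k1), 0.5, 0.3, 0.2)
  end do

  print *, 'max difference between the two results:', &
           maxval(abs(a_all - a_strip))
end program strip_mine_demo
```

Because the planes do not depend on one another, the strip-mined result is expected to match the single-call result. Choosing nslab is a trade-off: smaller slabs need less device memory per call but incur more separate transfers and kernel launches.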
jtull



Joined: 30 Jun 2004
Posts: 373

Posted: Wed Mar 24, 2010 1:54 pm

Rooni,


TPR 16640 has been corrected in the current 10.3 release.

regards,
dave
All times are GMT - 7 Hours