PGI User Forum

cudaLaunch returned status 4: unspecified launch failure

 
roberto_bobenrieth_miserd



Joined: 10 Dec 2009
Posts: 3

Posted: Wed Sep 05, 2012 7:15 pm    Post subject: cudaLaunch returned status 4: unspecified launch failure

Hi,

I am porting an aeroacoustics code from MPI/OpenMP/Accelerator to MPI/OpenMP/CUDA Fortran. To simplify the process, I am using the CUF kernel loop directive in the following subroutine:

Code:
subroutine cuda_surfaces (surface_accel,surf_x_ip,surf_x_im,surf_y_jp,surf_y_jm)

use accel
use cudafor

double precision, dimension (0:i_max_accel+1,0:j_max_accel+1,4), intent (in), device :: surface_accel
double precision, dimension (0:i_max_accel+1,0:j_max_accel+1), intent (out), device :: surf_x_ip,surf_x_im,surf_y_jp,surf_y_jm

integer :: i,j

!$cuf kernel do(2) <<< (*,*), (n_cores_smp,nu_smps_gpu) >>>
do j=0,j_max_accel+1
   do i=0,i_max_accel+1  !line 2161:cudaLaunch returned status 4
      surf_x_ip(i,j)=surface_accel(i,j,1)
      surf_x_im(i,j)=surface_accel(i,j,2)
      surf_y_jp(i,j)=surface_accel(i,j,3)
      surf_y_jm(i,j)=surface_accel(i,j,4)
   end do
end do

end subroutine cuda_surfaces


where the accel module is given by:
Code:
module accel

integer, parameter :: n_cores_smp=32
integer, parameter :: nu_smps_gpu=14
integer, parameter :: i_max_accel=2336
integer, parameter :: j_max_accel=1624

end module accel


Subroutine cuda_surfaces is called by subroutine euler_solver, where the related parts are:

Code:
subroutine euler_solver (....)
....
use accel
use cudafor
....
double precision, dimension (0:i_max_accel+1,0:j_max_accel+1,4) :: surface_host
double precision, dimension (0:i_max_accel+1,0:j_max_accel+1,4), device :: surface_accel
double precision, dimension (0:i_max_accel+1,0:j_max_accel+1), device :: surf_x_ip,surf_x_im,surf_y_jp,surf_y_jm,vol
....
call mpi_barrier (mpi_comm_world,ierr)
....
!setting the number of threads for the following openmp parallel region equal to the number of gpus
call omp_set_num_threads(n_threads_gpus)
....
!$omp parallel, private (surface_accel,surf_x_ip,surf_x_im,surf_y_jp,surf_y_jm)
....
devnum=omp_get_thread_num()
istat=cudaSetDevice(devnum)
....
!$omp do
do l_block=1,l_max_block !block size is i_max_accel*j_max_accel
   .....
   surface_accel=surface_host
   ....
   call cuda_surfaces  (surface_accel,surf_x_ip,surf_x_im,surf_y_jp,surf_y_jm)
   ....
end do
!$omp end do
!$omp end parallel
....
call mpi_barrier (mpi_comm_world,ierr)

return

end subroutine euler_solver


When I try to run the code, I get the following error message:

line 2161: cudaLaunch returned status 4: unspecified launch failure

where line 2161 is marked with a comment in subroutine cuda_surfaces and corresponds to the innermost loop. The compiler output for this subroutine is:

Code:

cuda_surfaces:
   2161, CUDA kernel generated
       2161, !$cuf kernel do <<< (*,*), (32,14) >>>


Any help will be appreciated.

Regards,

Roberto
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Thu Sep 06, 2012 9:10 am    Post subject:

Hi Roberto,

A "cudaLaunch returned status 4: unspecified launch failure" typically means that the kernel aborted abnormally during execution. I tried to reproduce your error with a simple example, but was unable to. Hence, my guess is that the error has to do with how you are using OpenMP. While OpenMP and CUDA Fortran work together, it can be a bit tricky to get the data allocation on the device correct. I'd like you to try two things:

1) Compile without OpenMP. If it works, then we know it's an OpenMP issue.
2) Change your shared device arrays to be allocatable and then allocate them after you have entered the OpenMP region and after you have called cudaSetDevice.

If neither of these suggestions helps, can you please put together a reproducing example and send it to PGI Customer Support (trs@pgroup.com)? Ask them to forward it to me.

Thanks,
Mat
roberto_bobenrieth_miserd



Joined: 10 Dec 2009
Posts: 3

Posted: Sun Sep 23, 2012 6:55 am    Post subject:

Hi Mat,

Sorry for the delayed response. Regarding suggestion (1), when the code is compiled without OpenMP, it runs without errors, showing that it is an OpenMP problem. So I implemented suggestion (2), allocating the private device arrays only after entering the OpenMP parallel region and calling cudaSetDevice. The result was also an execution without errors.
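
For readers hitting the same error, here is a minimal sketch of what suggestion (2) might look like. It is not the actual aeroacoustics code: the program name, the reduced set of arrays, the fill value, and the use of cudaGetDeviceCount to choose the thread count are all assumptions made for illustration. The point it shows is the ordering: the device arrays are declared allocatable, and each thread allocates its private copies only after it has called cudaSetDevice.

```fortran
! Sketch only: program name, array subset, and fill value are illustrative assumptions.
module accel
  implicit none
  integer, parameter :: i_max_accel=2336
  integer, parameter :: j_max_accel=1624
end module accel

program alloc_after_setdevice
  use accel
  use cudafor
  use omp_lib
  implicit none

  ! Allocatable device arrays: nothing is reserved on any GPU at declaration.
  double precision, allocatable, device :: surface_accel(:,:,:)
  double precision, allocatable, device :: surf_x_ip(:,:)
  double precision, allocatable :: surface_host(:,:,:)
  integer :: n_threads_gpus, devnum, istat

  ! Assumption: one OpenMP thread per visible GPU.
  istat = cudaGetDeviceCount(n_threads_gpus)
  if (istat /= cudaSuccess .or. n_threads_gpus < 1) stop 'no CUDA device found'

  allocate (surface_host(0:i_max_accel+1,0:j_max_accel+1,4))
  surface_host = 1.0d0

  call omp_set_num_threads(n_threads_gpus)

  !$omp parallel private(surface_accel,surf_x_ip,devnum,istat)
  ! Select this thread's GPU first ...
  devnum = omp_get_thread_num()
  istat = cudaSetDevice(devnum)

  ! ... then allocate, so each private copy lives on the selected device.
  allocate (surface_accel(0:i_max_accel+1,0:j_max_accel+1,4))
  allocate (surf_x_ip(0:i_max_accel+1,0:j_max_accel+1))

  surface_accel = surface_host   ! host-to-device copy
  ! ... call cuda_surfaces (or other kernels) here ...

  deallocate (surface_accel,surf_x_ip)
  !$omp end parallel

  deallocate (surface_host)
end program alloc_after_setdevice
```

Deallocating inside the parallel region keeps each free on the same device that performed the allocation. This sketch has not been checked against a particular compiler release, so treat it as a starting point rather than a verified fix.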

Thanks!

Roberto
All times are GMT - 7 Hours