cudaLaunchKernel returned status 1: invalid argument

OpenACC and CUDA Fortran
mr0202
Posts: 3
Joined: Mar 20 2020

cudaLaunchKernel returned status 1: invalid argument

Post by mr0202 » Mon Mar 23, 2020 10:36 pm

Hi, if I have the following simple acc region:

Code:

!$acc parallel num_gangs(5)
 print*,"abc"
!$acc end parallel

it compiles and runs as expected. But if I do the following:

Code:

!$acc parallel num_gangs(5) num_workers(3)
 print*,"abc"
!$acc end parallel

it compiles without any warnings or errors. But when I run it, I get:

line X: cudaLaunchKernel returned status 1: invalid argument

where X is the line number of

Code:

!$acc parallel num_gangs(5) num_workers(3)

Can you help me find out why I get this cudaLaunchKernel error?

mkcolg
Posts: 8319
Joined: Jun 30 2004

Re: cudaLaunchKernel returned status 1: invalid argument

Post by mkcolg » Tue Mar 24, 2020 7:29 am

Not sure where this is coming from, but you can work around it if vector_length is also set (see below). Workers and vectors won't be used here anyway, since you have no parallel loops. Instead, "gang-redundant mode" applies (i.e. each gang will execute the print redundantly), so using "num_workers" here isn't really valid.

Code:

% cat test.f90
program foo
!$acc parallel num_gangs(4) num_workers(3) vector_length(1)
 print*,"abc"
!$acc end parallel
end program foo
% pgfortran -ta=tesla test.f90 -Minfo=accel; a.out
foo:
      2, Generating Tesla code
 abc
 abc
 abc
 abc
-Mat

Re: cudaLaunchKernel returned status 1: invalid argument

Post by mr0202 » Tue Mar 24, 2020 10:30 am

Thanks. Actually, I think "num_gangs" together with "num_workers" should be valid, unless I am missing something. I based this example on a similar one (Figure 15.5) in "Programming Massively Parallel Processors: A Hands-on Approach" by D. B. Kirk and W. W. Hwu, which is as follows:

Code:

#pragma acc parallel copyout(a) num_gangs(1024) num_workers(32)
{
 a=23;
}

Am I missing anything here?

So, just to clarify for myself: "parallel" without a loop has only "num_gangs" copies executed, independent of the "num_workers" and "vector_length" sizes, whereas "parallel loop num_gangs(a) num_workers(b) vector_length(c)" would divide the total number of loop iterations into "a*b*c" independent chunks. Am I right?

Re: cudaLaunchKernel returned status 1: invalid argument

Post by mkcolg » Wed Mar 25, 2020 7:12 am

mr0202 wrote:
So, just to clarify for myself: "parallel" without a loop has only "num_gangs" copies executed, independent of the "num_workers" and "vector_length" sizes, whereas "parallel loop num_gangs(a) num_workers(b) vector_length(c)" would divide the total number of loop iterations into "a*b*c" independent chunks.

Correct. A parallel region without a work-shared loop (i.e. without a loop directive) runs in "gang-redundant mode": each gang executes the same executable statements. So it's not that using "num_workers" here is invalid, it's just not applicable in this case.
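
Gang-redundant mode can be observed with a small counter. The following is a minimal C sketch, not code from the book or from the compiler documentation; the function name count_region_executions is hypothetical, and it assumes the code is built with OpenACC enabled for a GPU target. Each gang runs the region body once, so each gang adds one to the counter:

```c
#include <assert.h>

/* Hypothetical helper: returns how many times the region body executed.
   There is no loop directive, so the region runs in gang-redundant mode
   and every gang executes the body once. */
int count_region_executions(int gangs)
{
    assert(gangs > 0);
    int count = 0;

    /* num_gangs is a request; the runtime may launch fewer gangs. */
    #pragma acc parallel num_gangs(gangs) copy(count)
    {
        /* The atomic update keeps concurrent gangs from racing on count. */
        #pragma acc atomic update
        count += 1;
    }
    return count;
}
```

With OpenACC enabled, count_region_executions(4) should return the number of gangs actually launched (4 unless the runtime reduces it). If the pragmas are ignored, for example when OpenACC is not enabled at compile time, the body runs once and the function returns 1.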

If you did have one or more loops with a gang, worker, and/or vector clause, then, yes, the loop iterations would be divided across the total number of gangs, workers, and vectors. Best practice, though, is not to set the number of gangs, workers, or vectors explicitly, except when tuning a specific algorithm, and instead to let the compiler and runtime choose these values based on the target device. Different target devices may need different values, so hard-coding these sizes will reduce performance portability.
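
As a rough illustration of the work-shared case, here is a minimal C sketch; the function name scale_indices and the doubling computation are made up for the example. The loop directive carries gang, worker, and vector clauses but no explicit sizes, so the compiler and runtime choose them for the target device:

```c
#include <assert.h>

/* Hypothetical example: fills out[i] = 2 * i using a work-shared loop.
   The iterations are divided across gangs, workers, and vector lanes. */
void scale_indices(int *out, int n)
{
    assert(out != 0 && n >= 0);

    /* No num_gangs, num_workers, or vector_length: sizes are left to the
       compiler and runtime, which keeps the code performance portable. */
    #pragma acc parallel loop gang worker vector copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        out[i] = 2 * i;
    }
}
```

Calling scale_indices(a, 1000) on a 1000-element int array gives the same contents whether the loop runs on the device or serially with the pragma ignored; only the way iterations are distributed changes. Adding the -Minfo=accel flag shown earlier should report the schedule the compiler picked.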

-Mat

Re: cudaLaunchKernel returned status 1: invalid argument

Post by mr0202 » Wed Mar 25, 2020 10:26 am

thank you!
