mkcolg
Joined: 30 Jun 2004  Posts: 4996  Location: The Portland Group Inc.
Posted: Wed Feb 24, 2010 4:35 pm
Hi Matt,

Customer support sent me your code, and it turns out that the "unexpected runtime function call" was a call to "pgf90_auto_alloc". This routine handles the allocation of automatic arrays, so the compiler should be giving a semantic error, since automatics aren't allowed in device routines. The reason is that a thread can't call malloc, which automatics require. I've sent a report to engineering (TPR#16653) to have them catch this semantic error. To fix it, you'll need to use fixed-size local arrays.
Also, I was wrong about DATA statements. Engineering is working on allowing DATA statements for module variables, but they won't be allowed for local device variables. The reason is that CUDA C can only initialize data that has file scope, not local scope, so there's not yet a way to map local DATA statements to CUDA C.
Thanks,
Mat
TheMatt
Joined: 06 Jul 2009  Posts: 263  Location: Greenbelt, MD
Posted: Thu Feb 25, 2010 1:26 pm
mkcolg wrote:
> Customer support sent me your code, and it turns out that the "unexpected runtime function call" was a call to "pgf90_auto_alloc". This routine handles the allocation of automatic arrays, so the compiler should be giving a semantic error, since automatics aren't allowed in device routines. The reason is that a thread can't call malloc, which automatics require. I've sent a report to engineering (TPR#16653) to have them catch this semantic error. To fix it, you'll need to use fixed-size local arrays.
Not sure I understand this one. Does this mean I need to hardwire some array sizes that currently aren't? Do you have an example of what was wrong and what is correct?
Quote:
> Also, I was wrong about DATA statements. Engineering is working on allowing DATA statements for module variables, but they won't be allowed for local device variables. The reason is that CUDA C can only initialize data that has file scope, not local scope, so there's not yet a way to map local DATA statements to CUDA C.
So, in this case, should I instantiate within the program and figure out a massive RESHAPE? Or READ/DATA before the CUDA call and pass it into the program as an extra input (which, essentially, it is)?
mkcolg
Joined: 30 Jun 2004  Posts: 4996  Location: The Portland Group Inc.
Posted: Thu Feb 25, 2010 2:51 pm
Quote:
> Does this mean I need to hardwire some array sizes that currently aren't?
You'll need to use a fixed size for your local arrays. The code currently passes the size in at run time.
Quote:
> Do you have an example of what was wrong and what is correct?
Here's a module that uses an automatic array called "local_data".
Code:
```fortran
module test_cuda
contains

  attributes(global) subroutine kernel_1(data, N, NP)
    use cudafor
    implicit none
    real, device, dimension(N) :: data
    integer, value :: N, NP
    integer :: i, j, idx, nthrd
    real :: local_data(NP)   ! automatic array: size unknown at compile time

    idx = (blockidx%x-1)*blockdim%x + threadidx%x
    nthrd = blockDim%x * gridDim%x
    do i = idx, N, nthrd
      do j = 1, NP
        local_data(j) = data(i) * j
      end do
    end do
  end subroutine

end module
```

```
% pgf90 -c test.cuf
PGF90-S-0155-device arrays may not be automatic - local_data (test.cuf)
  0 inform,   0 warnings,   1 severes,   0 fatal for kernel_1
```
The problem is that the size of local_data is not known until run time, so it would need to be allocated on entry to the subroutine. Unfortunately, threads can't allocate memory, so the size of local arrays must be known at compile time. To fix this, local_data's size must be fixed:
Code:
```fortran
module test_cuda
  integer, parameter :: maxNP = 10   ! upper bound known at compile time
contains

  attributes(global) subroutine kernel_1(data, N, NP)
    use cudafor
    implicit none
    real, device, dimension(N) :: data
    integer, value :: N, NP
    integer :: i, j, idx, nthrd
    real :: local_data(maxNP)   ! fixed-size local array

    idx = (blockidx%x-1)*blockdim%x + threadidx%x
    nthrd = blockDim%x * gridDim%x
    do i = idx, N, nthrd
      do j = 1, NP
        local_data(j) = data(i) * j
      end do
    end do
  end subroutine

end module
```
Quote:
> So, in this case, should I instantiate within the program and figure out a massive RESHAPE? Or READ/DATA before the CUDA call and pass it into the program as an extra input (which, essentially, it is)?
Spot checking, it appears that all the variables where you use a DATA statement are constant values. In this case, I would make them module variables and add the 'constant' attribute to have them placed in constant memory. While you're limited in the amount of data that can be stored in constant memory (on my Tesla it's 64 KB), constant memory is much faster. You would then set the values in your host code:
Code:
```fortran
module test_cuda
  real, dimension(3), constant :: aig   ! lives in device constant memory
contains

  attributes(global) subroutine kernel_1(data, N)
    use cudafor
    implicit none
    real, device, dimension(N) :: data
    integer, value :: N
    integer :: i, idx, nthrd

    idx = (blockidx%x-1)*blockdim%x + threadidx%x
    nthrd = blockDim%x * gridDim%x
    do i = idx, N, nthrd
      data(i) = i * aig(1) * aig(2) * aig(3)
    end do
  end subroutine

end module

program test
  use cudafor
  use test_cuda
  real, device, dimension(:), allocatable :: dData
  real, dimension(:), allocatable :: data
  integer :: N

  N = 256

  ! update the device's constant memory from the host
  aig(1) = 1.1
  aig(2) = 2.1
  aig(3) = 3.1

  allocate(data(N))
  allocate(dData(N))
  call kernel_1<<<128,2>>>(dData, N)
  data = dData
  print *, data(1), data(2)
  deallocate(data)
  deallocate(dData)
end program test
```

```
% pgf90 test.cuf -V10.2
% a.out
    7.161000        14.32200
```
Hope this helps,
Mat
TheMatt
Joined: 06 Jul 2009  Posts: 263  Location: Greenbelt, MD
Posted: Fri Feb 26, 2010 11:03 am
mkcolg wrote:
> The problem is that the size of local_data is not known until run time, so it would need to be allocated on entry to the subroutine. Unfortunately, threads can't allocate memory, so the size of local arrays must be known at compile time. To fix this, local_data's size must be fixed.
Ahh, I get it now. It makes sense thanks to your example and further explanation. I was essentially asking the GPU to do some magical memory allocation.
Quote:
> Spot checking, it appears that all the variables where you use a DATA statement are constant values. In this case, I would make them module variables and add the 'constant' attribute to have them placed in constant memory. While you're limited in the amount of data that can be stored in constant memory (on my Tesla it's 64 KB), constant memory is much faster. You would then set the values in your host code:
I was indeed planning on using constant memory for this data, which is one reason I wanted to try a CUDA Fortran version, since I'm not sure the directive-based generator can use constant memory yet. Can it? (At least, I'm pretty sure I can't direct it to put data in constant memory.)
However, I seem to have found another ICE, though this one comes with an error message:

Code:
```
ptxas error   : Entry function 'soradcuf' uses too much local data (0x60e0 bytes, 0x4000 max)
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code 0 (src/sorad.cudafor.constant.cuf: 3622)
PGF90/x86-64 Linux 10.2-0: compilation aborted
```
So it's saying I'm trying to use 24800 bytes of local data when I only have 16384 bytes available. I'm back at "I have no idea" when it comes to what this means.
FYI, I am using a Tesla S1070, so I have compute capability 1.3, if you are trying to figure out what hardware resource I'm hitting. The file that throws this now has all my "constant" DATA arrays as in your example (36768 bytes' worth), and my count of local data is 25848 bytes using maxNP=72 (which I guess I thought would go into global memory with no attribute).
Thanks for all the help as I struggle with this,
Matt
mkcolg
Joined: 30 Jun 2004  Posts: 4996  Location: The Portland Group Inc.
Posted: Fri Feb 26, 2010 3:01 pm
Hi Matt,

Well, this one was new for me as well. I knew there were limits on constant and shared memory, but it appears there's one on local memory too. Granted, I haven't tried porting a 3500-line subroutine, so I wouldn't have hit this limit.

One thing I see is that you have code that stores values from a global array into a local array (like "tai(k) = ta(idx,k)"). Can you get rid of the local arrays and use the global arrays directly? It will most likely be slower, but the first step is getting it working.

Also, could you break the subroutine up into multiple kernels?
- Mat