PGI User Forum

Error running simple CUDA Fortran program
PGI User Forum Forum Index -> Accelerator Programming
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Wed Feb 24, 2010 4:35 pm

Hi Matt,

Customer support sent me your code, and it turns out that the "unexpected runtime function call" was a call to "pgf90_auto_alloc". This routine handles the allocation of automatic arrays. The compiler should instead be giving a semantic error, since automatics aren't allowed in device routines; the reason is that a thread can't call malloc, which is required for automatics. I've sent a report to engineering (TPR#16653) to have them catch this semantic error. To fix it, you'll need to use fixed-size local arrays.

Also, I was wrong about data statements. Engineering is working on allowing data statements for module variables, but they won't be allowed for local device variables. The reason is that CUDA C can only initialize data that has file scope, not local scope, so there's not yet a way to map local data statements to CUDA C.

Thanks,
Mat
TheMatt



Joined: 06 Jul 2009
Posts: 306
Location: Greenbelt, MD

Posted: Thu Feb 25, 2010 1:26 pm

mkcolg wrote:
Hi Matt,

Customer support sent me your code, and it turns out that the "unexpected runtime function call" was a call to "pgf90_auto_alloc". This routine handles the allocation of automatic arrays. The compiler should instead be giving a semantic error, since automatics aren't allowed in device routines; the reason is that a thread can't call malloc, which is required for automatics. I've sent a report to engineering (TPR#16653) to have them catch this semantic error. To fix it, you'll need to use fixed-size local arrays.

Not sure I understand this one. Does this mean I need to hardwire some array sizes that currently aren't? Do you have an example of what was wrong and what is correct?

Quote:
Also, I was wrong about data statements. Engineering is working on allowing data statements for module variables, but they won't be allowed for local device variables. The reason is that CUDA C can only initialize data that has file scope, not local scope, so there's not yet a way to map local data statements to CUDA C.

So, in this case, I should instantiate within the program and figure out a massive RESHAPE? Or, READ/DATA before the CUDA call and pass into the program as an extra input (which, essentially, it is)?
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Thu Feb 25, 2010 2:51 pm

Quote:
Does this mean I need to hardwire some array sizes that currently aren't?


You'll need to use fixed sizes for your local arrays; the code currently passes the size in at run time.

Quote:
Do you have an example of what was wrong and what is correct?

Here's a module that uses an automatic array called "local_data".
Code:
module test_cuda

  contains

  attributes(global) subroutine kernel_1(data,N,NP)

    use cudafor

    implicit none

    real, device, dimension(N) :: data
    integer, value :: N, NP

    integer :: i, j, idx, nthrd
    real :: local_data(NP)

    idx = (blockidx%x-1)*blockdim%x + threadidx%x
    nthrd = blockDim%x * gridDim%x
    do i=idx,N,nthrd
         do j=1,NP
              local_data(j) = data(i) * j
         end do
    end do

  end subroutine
end module
% pgf90 -c test.cuf
PGF90-S-0155-device arrays may not be automatic - local_data (test.cuf)
  0 inform,   0 warnings,   1 severes, 0 fatal for kernel_1


The problem is that the size of local_data is not known until run time, so it would need to be allocated on entry to the subroutine. Unfortunately, threads can't allocate memory, so the size of local arrays needs to be known at compile time. To fix this, local_data's size must be fixed.

Code:
module test_cuda

  integer, parameter :: maxNP = 10

  contains

  attributes(global) subroutine kernel_1(data,N,NP)

    use cudafor

    implicit none

    real, device, dimension(N) :: data
    integer, value :: N, NP

    integer :: i, j, idx, nthrd
    real :: local_data(maxNP)

    idx = (blockidx%x-1)*blockdim%x + threadidx%x
    nthrd = blockDim%x * gridDim%x
    do i=idx,N,nthrd
         do j=1,NP
              local_data(j) = data(i) * j
         end do
    end do

  end subroutine
end module


Quote:
So, in this case, I should instantiate within the program and figure out a massive RESHAPE? Or, READ/DATA before the CUDA call and pass into the program as an extra input (which, essentially, it is)?

Spot checking, it appears that all the variables where you use a data statement are constant values. In this case, I would make them module variables and add the 'constant' attribute to have them placed in constant memory. While you're limited in the amount of data that can be stored in constant memory (on my Tesla it's 64K), constant memory is much faster. You would then set the values in your host code:

Code:
module test_cuda

  real, dimension(3), constant :: aig

  contains

  attributes(global) subroutine kernel_1(data,N)

    use cudafor

    implicit none

    real, device, dimension(N) :: data
    integer, value :: N

    integer :: i, j, idx, nthrd

    idx = (blockidx%x-1)*blockdim%x + threadidx%x
    nthrd = blockDim%x * gridDim%x
    do i=idx,N,nthrd
      data(i) = i * aig(1) * aig(2) * aig(3)
    end do

  end subroutine
end module

program test
 
   use cudafor
   use test_cuda

   real, device, dimension(:),allocatable :: dData
   real, dimension(:),allocatable :: data
   integer :: N
   N = 256

   ! update the device's constant memory
   aig(1) = 1.1
   aig(2) = 2.1
   aig(3) = 3.1

   allocate(data(N))
   allocate(dData(N))
   call kernel_1<<<128,2>>>(dData,N)
   data=dData
   print *, data(1), data(2)

   deallocate(data)
   deallocate(dData)
end program test
% pgf90 test.cuf -V10.2
% a.out
    7.161000        14.32200


Hope this helps,
Mat
TheMatt



Joined: 06 Jul 2009
Posts: 306
Location: Greenbelt, MD

Posted: Fri Feb 26, 2010 11:03 am

mkcolg wrote:

The problem is that the size of local_data is not known until run time, so it would need to be allocated on entry to the subroutine. Unfortunately, threads can't allocate memory, so the size of local arrays needs to be known at compile time. To fix this, local_data's size must be fixed.

Ahh, I get it now. It makes sense thanks to your example and further explanation. I was essentially asking the GPU to do some magical memory allocation.
Quote:
Spot checking, it appears that all the variables where you use a data statement are constant values. In this case, I would make them module variables and add the 'constant' attribute to have them placed in constant memory. While you're limited in the amount of data that can be stored in constant memory (on my Tesla it's 64K), constant memory is much faster. You would then set the values in your host code:

I was indeed planning on using constant memory for this data, which was one reason I wanted to try out a CUDA Fortran version, since I'm not sure the directive-based generator can use constant memory yet, can it? (At least, I'm pretty sure I can't direct it to put data in constant memory.)

However, I seem to have found another ICE, but this one comes with an error message:
Code:
ptxas error   : Entry function 'soradcuf' uses too much local data (0x60e0 bytes, 0x4000 max)
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code       0 (src/sorad.cudafor.constant.cuf: 3622)
PGF90/x86-64 Linux 10.2-0: compilation aborted

So it's saying I'm trying to use 24800 bytes (0x60E0) of local data and I only have 16384 bytes (0x4000) max. I'm back at "I have no idea" when it comes to what this means.

FYI, I am using a Tesla S1070, so I have compute capability 1.3, if you are trying to figure out what hardware resource I'm hitting. The file that throws this now has all my "constant" DATA arrays as in your example (36768 bytes' worth), and my count of local data is 25848 bytes using maxnp=72 (which I guess I thought would go into global memory with no attribute).
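For reference, a quick sanity check of the byte counts involved, decoding the hex values from the ptxas message and comparing my totals (25848 bytes local, 36768 bytes constant) against the limits:

```python
# Decode the byte counts from the ptxas error message.
local_used = 0x60E0   # local data the kernel requests, per thread
local_max  = 0x4000   # ptxas per-thread local-memory limit for this target
print(local_used, local_max)   # 24800 16384

# My totals: local data vs. the ptxas limit, and constant DATA arrays
# vs. the 64 KB of constant memory mentioned above.
print(25848 > local_max)       # True  -> over the local-memory limit
print(36768 <= 64 * 1024)      # True  -> fits in constant memory
```

So the local-memory cap here is 16 KB per thread, my local arrays alone blow past it, while the constant data fits comfortably under 64 KB.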

Thanks for all the help as I struggle with this,
Matt
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Fri Feb 26, 2010 3:01 pm

Hi Matt,

Well, this one was new for me as well. I knew that there were limits on constant and shared memory, but it appears there's one on local memory as well. Granted, I haven't tried porting over a 3500-line subroutine, so I wouldn't have hit this limit.

One thing I see is that you have code which stores the value of a global array into a local array (like "tai(k) = ta(idx,k)"). Can you get rid of the local arrays and use the global arrays directly? It will most likely be slower, but the first step is getting it working.

Also, could you break up the subroutine into multiple kernels?

- Mat
Page 2 of 2

 