PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

Device arrays may not be automatic

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling
View previous topic :: View next topic  
Author Message
SPHriction-3D



Joined: 12 Jan 2014
Posts: 48

PostPosted: Fri Mar 14, 2014 12:47 pm    Post subject: Device arrays may not be automatic Reply with quote

Hi,

I am wondering how to have an array that is used in a device kernel that has array bounds that are not known prior to run time.

I have made a little example where A and B are vectors of nTotal elements. nTotal is not known until runtime. A is a variable that gets allocated and assigned in the main program. B is intended to be a device array that stays on the device only. B needs to be an array because the elements of B are unique:

Code:
Module Kernel

Use cudafor

Contains

   Attributes(Global) Subroutine NotPassedAutomatic(A,nTotal)

   Implicit None

   Integer:: i
   Integer, Value:: nTotal
   Integer, Device:: A(nTotal), B(nTotal)

   B = 2

   i = (blockIdx%x-1)*blockDim%x + threadIdx%x

   If (i >= 1 .and. i <= nTotal) Then
      A(i) = A(i) + B(i)
   End If

   End Subroutine NotPassedAutomatic

End Module Kernel

Program Main

   Use cudafor
   Use Kernel
   
   Integer:: nTotal
   Integer, Allocatable:: A(:)
   
   Integer, Device, Allocatable:: A_d(:)

   nTotal = 1024

   Allocate(A(nTotal),A_d(nTotal))
   A = 1

   A_d = A

   Call NotPassedAutomatic<<<ceiling(real(nTotal)/128),128>>>(A_d,nTotal)

   A = A_d

   If (any(A /= 3)) Then
      Write(*,*) "Failed A not equal to 3"
   Else
      Write(*,*) "Passed A equal to 3"
   End If

   Deallocate(A,A_d)

End Program Main



The program will compile and run if you change B to a scalar value in the device kernel.

Can someone help me understand how to do this properly.

Thank you,

Kirk
Back to top
View user's profile
SPHriction-3D



Joined: 12 Jan 2014
Posts: 48

PostPosted: Sat Mar 15, 2014 3:39 pm    Post subject: Reply with quote

To give a bit more context, B would typically be a array to hold a state variable that is unique for each point in a continuum. The state variable is needed to advance the solution of the simulation, but is not needed to be displayed in the results.

Kirk
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6120
Location: The Portland Group Inc.

PostPosted: Mon Mar 17, 2014 10:35 am    Post subject: Reply with quote

Hi Kirk,

It looks to me that you meant for "B" to be global with only "B(i)" to be accessed by an individual thread. In this case, it's best to move "B" to the module data section, make it an allocatable device array, then assign it's value from the host.

Code:
% cat test.cuf
Module Kernel

Use cudafor

   integer, allocatable, device, dimension(:) :: B
Contains

   Attributes(Global) Subroutine NotPassedAutomatic(A,nTotal)

   Implicit None

   Integer:: i
   Integer, Value:: nTotal
   Integer, Device:: A(nTotal)

   i = (blockIdx%x-1)*blockDim%x + threadIdx%x

   If (i >= 1 .and. i <= nTotal) Then
      A(i) = A(i) + B(i)
   End If

   End Subroutine NotPassedAutomatic

End Module Kernel

Program Main

   Use cudafor
   Use Kernel

   Integer:: nTotal
   Integer, Allocatable:: A(:)

   Integer, Device, Allocatable:: A_d(:)

   nTotal = 1024

   Allocate(A(nTotal),A_d(nTotal),B(nTotal))
   A = 1
   B = 2
   A_d = A

   Call NotPassedAutomatic<<<ceiling(real(nTotal)/128),128>>>(A_d,nTotal)

   A = A_d

   If (any(A /= 3)) Then
      Write(*,*) "Failed A not equal to 3"
   Else
      Write(*,*) "Passed A equal to 3"
   End If

   Deallocate(A,A_d,B)

End Program Main
% pgfortran test.cuf ; a.out
 Passed A equal to 3


Note the failure in your original code is because you were trying to use automatic arrays. Automatics are only supported when used as shared arrays who's size in bytes are passed as the fourth argument in the kernel launch configuration.

For later version of the PGI compiler and NVIDIA devices with support for relocatable device code (RDC), you could also have each thread allocate a temp array. However, I wouldn't recommend do this since having all threads allocate memory from the device can hurt performance.

- Mat
Back to top
View user's profile
SPHriction-3D



Joined: 12 Jan 2014
Posts: 48

PostPosted: Mon Mar 17, 2014 5:58 pm    Post subject: Reply with quote

That makes sense, thank you!

Kirk
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group