PGI User Forum

CUDA-x86.

Declaring local arrays in device code
crip_crop



Joined: 28 Jul 2010
Posts: 68

PostPosted: Thu Jun 07, 2012 2:28 am    Post subject: Declaring local arrays in device code Reply with quote

Hi folks,

I'm trying to get around a problem in CUDA Fortran but I'm not having much luck, so I'm putting it to the masses. Any suggestions would be much appreciated.

Okay, here's the problem....

I have a device array declared in MODULE scope as A(N,M).

The number of threads in my system is equal to N.

Within device code I want to call a device subroutine and pass only one row of that array for each thread, i.e.

Code:
call subroutine(A(thread,:))


but I get the error:

Quote:
Array reshaping is not supported for device subprogram calls


So to get around this problem I tried to get each thread to copy its vector into a separate 1D array, i.e.

Code:
B(:)=A(thread,:)


but in declaring B(M) in the device subroutine I get another compiler error:

Quote:
device arrays may not be automatic


The only solution I can see is to declare B with a large fixed size, but I don't really want to do this as it may waste memory or restrict the system size.

Does anyone know a way around my conundrum?

Cheers,
Crip_crop
TheMatt



Joined: 06 Jul 2009
Posts: 317
Location: Greenbelt, MD

PostPosted: Thu Jun 07, 2012 8:27 am    Post subject: Reply with quote

Crip_crop,

I think both issues are due to the same fact: the device (GPU) cannot allocate its own memory.

When you try to call a subroutine with an array slice, the compiler (most likely) wants to make a temporary copy and pass a reference to that temporary to the subroutine. Since the device can't allocate that temporary, you get the error.[1]

The second case hits the same issue: automatic arrays are allocated upon entry, and the GPU can't do that allocation. In my code, I often do just what you are hesitant to do, which is declare local, per-thread arrays at compile time with some maximum fixed size that I know a priori (the number of levels in the system, say, which we know roughly); I set that size with the preprocessor in my Makefile. I'm lucky that my "M" is fairly small (O(100)), so I can get away with this. If your "M" is big... it might not work. But if you can do it, and you aren't using much shared memory, you can tell the compiler to prefer L1 cache (making it 48 KB), which will increase your chances of getting a hit on L1-cached local memory.

But, if you don't want to, or can't due to the size of M, the only other thought I have is to pass in the reference to all of A along with the thread number:
Code:
call subroutine(A,thread)
and then inside that subroutine, just do all your work on A(thread,:). It's not ideal, but try it out. You might find there isn't much of a performance hit at all.
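
Roughly, I mean something like this (just a sketch; the names rowwork and mykernel are made up, and I'm assuming A(N,M) is your module-scope device array):
Code:
! each thread passes the whole array plus its own index: no slice, no reshape
attributes(device) subroutine rowwork(A, thread, n, m)
  integer, value :: thread, n, m
  real :: A(n,m)
  integer :: j
  do j = 1, m
     A(thread,j) = 2.0*A(thread,j)   ! each thread touches only its own row
  end do
end subroutine rowwork

attributes(global) subroutine mykernel(n, m)
  integer, value :: n, m
  integer :: thread
  thread = (blockIdx%x - 1)*blockDim%x + threadIdx%x
  if (thread <= n) call rowwork(A, thread, n, m)
end subroutine mykernel

Since the whole of A is passed by reference, nothing needs to be copied on the device.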

Matt

[1] Note: I think this is also why you can't do math in subroutine calls:
Code:
call subroutine(2*A,...)
since the compiler would try to make a temporary array B = 2*A and use that.
crip_crop



Joined: 28 Jul 2010
Posts: 68

PostPosted: Thu Jun 07, 2012 8:48 am    Post subject: Reply with quote

Cheers Matt, that's really useful.

How would I go about increasing the size of the L1 cache?

Crip_crop
TheMatt



Joined: 06 Jul 2009
Posts: 317
Location: Greenbelt, MD

PostPosted: Thu Jun 07, 2012 9:01 am    Post subject: Reply with quote

You can do that on a per-device or per-function basis, using:
Code:
status = cudaDeviceSetCacheConfig(cacheconfig)

status = cudaFuncSetCacheConfig(func, cacheconfig)
where "cacheconfig" is cudaFuncCachePreferNone, cudaFuncCachePreferShared, or cudaFuncCachePreferL1 (pretty self-explanatory). For cudaFuncSetCacheConfig, "func" is usually decorated with the module name, so subroutine dothis in module mymodule would (I think) be mymodule_dothis.
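
So in CUDA Fortran that might look like (a sketch; "mymodule_dothis" is my guess at the decorated name, so verify it against your binary if the call fails):
Code:
use cudafor
integer :: istat

! device-wide: prefer the 48 KB L1 / 16 KB shared split
istat = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1)

! per-kernel: "mymodule_dothis" is my guess at the decorated name of
! subroutine dothis in module mymodule
istat = cudaFuncSetCacheConfig('mymodule_dothis', cudaFuncCachePreferL1)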

Obviously, it's probably not best to do the device-wide set if you use nearly 48 KB of shared memory in most of your code, but the per-function version might help.

But benchmark, as always: some codes might respond, some might not. Per the spec, this is only the "preferred" configuration; I think the CUDA runtime can always choose its own configuration when it determines one is better. Or when it's Tuesday. I'm not sure I've ever seen how it decides this (probably by looking at how many registers spilled into local memory, etc.).

Oh, and if PGI chimes in here saying differently, believe them more than me!

Matt
mkcolg



Joined: 30 Jun 2004
Posts: 6138
Location: The Portland Group Inc.

PostPosted: Thu Jun 07, 2012 9:26 am    Post subject: Reply with quote

Hi Crip_crop,

Matt is correct that the problem is the lack of dynamic allocation from device code. That may change with CUDA 5 and the Kepler K20 GPUs, but for now we're stuck.

Though, another thing to try is using automatic arrays in shared memory. The third argument in the kernel launch chevron is the number of bytes to allocate dynamically in shared memory. The compiler can then map this dynamic shared memory to automatic arrays declared in device code. The glitch is that those automatic arrays are shared by all the threads in a block, and the amount of shared memory is relatively small.
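
For example (a sketch; the kernel name is made up, and I'm assuming 4-byte reals for the size calculation):
Code:
attributes(global) subroutine mykernel(m)
  integer, value :: m
  real, shared :: B(m)   ! automatic array, backed by the dynamic shared allocation
  ! ... note: all threads in the block share this one B ...
end subroutine mykernel

! host side: third chevron argument = bytes of dynamic shared memory
call mykernel<<<grid, tblock, 4*m>>>(m)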

- Mat