PGI User Forum


Entry function uses too much local data

 
PGI User Forum Forum Index -> Accelerator Programming
brush



Joined: 26 Jun 2012
Posts: 44

Posted: Sun Apr 06, 2014 11:48 pm    Post subject: Entry function uses too much local data

Hi, I get the compiler error:

ptxas error : Entry function 'ppush_kernel' uses too much local data (0x30d40 bytes, 0x4000 max)

'ppush_kernel' is a global subroutine that is called by the main host program.

It takes 11 largish device arrays (50,000 elements of type real each) as parameters. These arrays are declared in a .h file that is included in a module file used by the main program. But this module is not used in the ppush_kernel subroutine itself.

Within the ppush_kernel subroutine I declare another 50,000-element (local) array. I am assuming it is automatically shared amongst threads, and thus stored in memory only once, though I never explicitly declared it as a shared array.

I'm wondering which arrays might be causing the problem, and what the solutions are.

Also another question: how does the compiler know the memory limits of the GPU I am using? And if I compile on a machine different from the one I actually run on, could this cause a problem? (e.g., when submitting a job to a cluster.)
mkcolg



Joined: 30 Jun 2004
Posts: 5943
Location: The Portland Group Inc.

Posted: Mon Apr 07, 2014 12:11 pm

Quote:
Within the ppush_kernel subroutine I declare another 50,000 element (local) array. I am assuming it is automatically shared amongst threads, and so it is only stored in memory once, though I never explicitly declared it as a shared array.
Is this an automatic array whose size is set via the kernel's execution configuration? If not, then you do need to add the "shared" attribute explicitly; otherwise, each thread gets its own local copy of the array. Given the size 0x30d40 (i.e. 200,000 bytes, or 50,000 elements times 4 bytes per element), I'm guessing this is the problem.
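To illustrate the distinction, here is a minimal CUDA Fortran sketch (the kernel and array names are hypothetical, not from the original post):

```fortran
! Hypothetical sketch: per-thread local storage vs. block-shared storage.
attributes(global) subroutine ppush_kernel_sketch(n)
  integer, value :: n

  ! Per-thread: every thread gets its OWN 50,000-element copy,
  ! so ptxas counts 50,000 * 4 = 200,000 bytes of local data
  ! per thread -- hence the 0x30d40 in the error message.
  real :: work(50000)

  ! Block-shared: the "shared" attribute gives one copy per thread
  ! block, placed in on-chip shared memory. Note that shared memory
  ! is small (16 KB on cc13 devices), so a 200,000-byte array would
  ! not fit there either; it would need to be shrunk or moved to
  ! global device memory.
  real, shared :: swork(1024)
end subroutine
```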

Quote:
Also another question: how does the compiler know what the memory limits are of the GPU I am using? And if I compile on a machine different then what I actually run on could this cause a problem? (e.g. some cases of submitting a job to a cluster).
If I remember correctly, ptxas enforced the local memory size for the older Tesla (cc13) cards. I think these limits were lifted, or moved to the runtime, when targeting newer devices.

By default, we target multiple devices, including CC 1.3. Try targeting a newer device such as CC 3.5 ("-Mcuda=cc35,cuda5.5") to see if that works around this limit.
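For reference, a hypothetical compile line using that option (the pgfortran driver and the source/output file names are illustrative):

```shell
# Hypothetical: build for compute capability 3.5 only, against the
# CUDA 5.5 toolkit, instead of the default multi-target build.
pgfortran -Mcuda=cc35,cuda5.5 -o ppush ppush.cuf
```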

- Mat