PGI User Forum
cuda memory issues

mkcolg

Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Thu Jan 19, 2012 3:11 pm

Hi Morgan,

You might be interested in a few of our PGinsider articles. In particular, Michael Wolfe's article Understanding the CUDA Data Parallel Threading Model and mine on Multi-GPU programming with CUDA Fortran.

What you're currently trying to optimize is the "occupancy" of your program. As you know, each streaming multiprocessor has a finite amount of shared memory and registers. These memories are divided up among the active threads from one or more blocks. The more memory each thread consumes, the fewer threads can be running at any given time. The percentage of active threads versus the maximum potential number of threads is the occupancy.

To calculate the occupancy, use the information provided by the "-Mcuda=ptxinfo" flag as inputs to NVIDIA's Occupancy Calculator spreadsheet (http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls). This will help answer the question "given that I use N registers, how many threads should I use in my thread block?".
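As a rough sketch of the arithmetic the spreadsheet does for the register limit (the numbers here are hypothetical, and the real calculation rounds to the hardware's allocation granularity), consider a CC 2.0 device with 32768 registers per multiprocessor:

63 registers/thread x 256 threads/block = 16128 registers/block
floor(32768 / 16128) = 2 resident blocks per multiprocessor
2 blocks x 256 threads = 512 active threads
512 / 1536 (the CC 2.0 maximum) = ~33% occupancy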

Note that higher occupancy does not guarantee higher performance. So while it's a worthwhile effort to try to reduce the number of registers used, it may or may not be beneficial. You can also try the flag "-Mcuda=maxregcount:n" to limit the number of registers allowed per thread.
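For example, to print the resource usage and cap register use at 64 per thread in one build (the file and program names here are hypothetical):

pgfortran -Mcuda=ptxinfo,maxregcount:64 mykernels.cuf -o myprog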


It sounds like you have both OpenMP threads trying to use the same GPU. While a single host thread/process can have multiple contexts (i.e. use multiple GPUs), a single GPU cannot have multiple contexts (i.e. multiple host threads can't share a single GPU). While it may "work" sometimes, it is not supported by NVIDIA and may fail arbitrarily.
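If you do have multiple GPUs in the system, the usual pattern is to have each OpenMP thread select its own device before making any other CUDA calls, so each thread gets its own context on its own GPU. Here's a minimal CUDA Fortran sketch (the kernel and launch configuration are placeholders, not your code):

module gpu_kernels
contains
   ! Trivial placeholder kernel; substitute your own.
   attributes(global) subroutine mykernel()
   end subroutine
end module

subroutine run_one_gpu_per_thread()
   use cudafor
   use omp_lib
   use gpu_kernels
   implicit none
   integer :: istat, ndev
   istat = cudaGetDeviceCount(ndev)
!$omp parallel num_threads(ndev) private(istat)
   ! Bind this OpenMP thread to its own device and context.
   istat = cudaSetDevice(omp_get_thread_num())
   call mykernel<<<64, 256>>>()
   istat = cudaThreadSynchronize()
!$omp end parallel
end subroutine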

Newer cards are capable of running multiple kernels, but this applies to asynchronous kernels launched on separate streams. What happens is that the second kernel doesn't start running until the first one begins finishing: as the first kernel stops using a multiprocessor, the next one can begin using it. The two kernels never share a multiprocessor, so you don't need to worry about their local variables contending for registers.
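In CUDA Fortran, you create the streams with cudaStreamCreate and pass them as the fourth chevron argument. A minimal sketch (the kernels are empty placeholders):

module stream_kernels
contains
   attributes(global) subroutine kernel_a()
   end subroutine
   attributes(global) subroutine kernel_b()
   end subroutine
end module

program two_streams
   use cudafor
   use stream_kernels
   implicit none
   integer(kind=cuda_stream_kind) :: s1, s2
   integer :: istat
   istat = cudaStreamCreate(s1)
   istat = cudaStreamCreate(s2)
   ! The third chevron argument is dynamic shared memory
   ! (none here); the fourth selects the stream.
   call kernel_a<<<64, 256, 0, s1>>>()
   call kernel_b<<<64, 256, 0, s2>>>()
   istat = cudaThreadSynchronize()
end program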

Hope this helps,
Mat

vibrantcascade

Joined: 04 Aug 2011
Posts: 28

Posted: Sun Feb 05, 2012 2:41 pm

Thanks Mat!

That actually answered a lot of my questions and got me going again for the time being!

Morgan

vibrantcascade

Joined: 04 Aug 2011
Posts: 28

Posted: Mon Mar 19, 2012 2:50 pm

OK, a question on the occupancy calculator.

After compiling with -Mcuda=ptxinfo I received:

123 registers
48896 + 0 bytes lmem
24 + 16 bytes smem
5456 bytes cmem[0]
32 bytes cmem[1]

(I'm running a Tesla C2050, which is compute capability 2.0.)

The occupancy calculator only seems to care about registers, block size, and shared memory size.

For shared memory I added up 48896 + 24 + 16 = 48936 bytes,
then entered that into the calculator along with the 123 registers and 32 threads per block.

According to the calculator I should just barely be able to run 1 warp per multi-processor due to shared memory constraints. Yet I keep getting the error:

ptxas error : Entry function 'case8' uses too much local data (0xbf00 bytes, 0x4000 max)
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code 0 (ibe-25CudaC.f: 643)

(Line 643 is the last line of the module containing the CUDA functions.)

Do I need to add in the cmem or something? Technically my code should be able to run with up to 256 threads per block if the calculator is correct. Or is there something else I'm overlooking?


Thanks,
Morgan

mkcolg

Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Mon Mar 19, 2012 4:29 pm

Hi Morgan,

My guess is that the issue occurs when creating the compute capability 1.3 version of your code, which only allows 16k of local memory per thread. (The 0xbf00 bytes in the error is exactly the 48896 bytes of lmem that ptxinfo reported, and the 0x4000 limit is 16384 bytes, i.e. 16k.) By default, the compiler targets both CC 1.3 and CC 2.0. Instead, try targeting just CC 2.0. If you're using an 11.x compiler, I'd also use CUDA 4.0, i.e. "-Mcuda=cc20,4.0".
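For example (the output name is hypothetical):

pgfortran -Mcuda=cc20,4.0 ibe-25CudaC.f -o ibe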

Hope this helps,
Mat