PGI User Forum


Are arrays in global routines really in global memory?

 
alfvenwave

Joined: 08 Apr 2010
Posts: 78

Posted: Tue Oct 05, 2010 2:19 pm    Post subject: Are arrays in global routines really in global memory?

Hi. After a few months or so of using the PGF compiler, I'm still finding that my use of shared, global and constant memory involves some trial and error. OK, a lot of trial and error. And mainly errors. Basically, I don't think I really understand what's going on.

So here is my question: if I have a device subroutine or function declared with attributes "global", and inside it I declare a single-precision real array of modest size, will the PGF compiler place this array in global memory, or will it reside in registers? The reason I ask is that sometimes I find that using shared memory actually slows my algorithm down, or simply doesn't make any difference.
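
Something like this, say (kernel and array names made up, inside a module that uses cudafor):

Code:
attributes(global) subroutine mykernel(a, n)
  integer, value :: n
  real :: a(n)          ! dummy arguments in a global routine default to device arrays
  real :: work(16)      ! "modest" local array - registers? local memory? global?
  integer :: i, j
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  if (i <= n) then
     do j = 1, 16
        work(j) = a(i) * j
     end do
     a(i) = sum(work)
  end if
end subroutine mykernel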

I was also surprised a few days ago when I decided to make a block of regularly used data (in a 100x100 array) sit in constant memory instead of global memory, and my code ended up being slower! Is there a limit to how much constant memory can be cached?

Lastly, I haven't got a clue how to use "pinned" memory, so if anyone has an example of how this might be useful, I'd be very grateful.

Cheers,

Rob.
mkcolg

Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Wed Oct 06, 2010 9:51 am

Hi Rob,

Use of on-chip memory can be a balancing act. All threads in a block share the same shared memory (16384 bytes per SM, or streaming multiprocessor, on a Tesla-generation card), and each SM also has a fixed-size register file. Local scalars and arrays are kept in registers as long as they fit; once a kernel uses too much, the excess spills to local memory, which physically resides in slow global memory.

So where do those local arrays end up? It depends on how many threads you have in a block and how many registers and how much shared memory each SM has. The utility 'pgaccelinfo' will show how much shared memory and how many registers are available. The flag "-Mcuda=ptxinfo" will list the number of registers used per thread and the shared memory used per block. This, combined with the number of threads in a block, should help you determine whether you're spilling to global memory.
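
For example (assuming your kernel lives in a file called mykernel.cuf; the file name is made up):

Code:
% pgaccelinfo                            # per-device limits: shared memory, registers, etc.
% pgfortran -Mcuda=ptxinfo mykernel.cuf  # per-kernel register and shared memory usage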

Quote:
The reason I ask is that sometimes I find that using shared memory actually slows my algorithm down, or simply doesn't make any difference.

If you're using a Fermi card, then using shared memory doesn't matter as much: Fermi adds an L1/L2 cache hierarchy that transparently caches global and local memory accesses. Software-managed caching in shared memory can still help, just not as much.
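
The usual software-caching pattern looks something like this (just a sketch, with made-up names; it assumes a block size of 256 threads):

Code:
attributes(global) subroutine smooth(a, b, n)
  integer, value :: n
  real :: a(n), b(n)
  real, shared :: tile(0:257)        ! blockDim%x + 2 halo cells
  integer :: i, t
  t = threadIdx%x
  i = (blockIdx%x-1)*blockDim%x + t
  if (i <= n) tile(t) = a(i)                            ! each thread stages one element
  if (t == 1 .and. i > 1)          tile(0)   = a(i-1)   ! left halo
  if (t == blockDim%x .and. i < n) tile(t+1) = a(i+1)   ! right halo
  call syncthreads()                                    ! wait for the whole tile
  if (i > 1 .and. i < n) b(i) = (tile(t-1) + tile(t) + tile(t+1)) / 3.0
end subroutine smooth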

Quote:
I was also surprised a few days ago when I decided to make a block of regularly used data (in a 100x100 array) sit in constant memory instead of global memory, and my code ended up being slower! Is there a limit to how much constant memory can be cached?

Constant memory is also finite (64K), but your program would have failed if you had exceeded that limit. The constant cache itself, though, is only 8KB per SM, so a 100x100 single-precision array (about 40KB) can never be entirely cache-resident. Unfortunately I don't know why it's slower for you.
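
For reference, placing data in constant memory looks something like this in CUDA Fortran (module and variable names made up):

Code:
module coef_m
  use cudafor
  real, constant :: coef(100,100)    ! 40000 bytes of the 64KB constant space
contains
  attributes(global) subroutine scale(a, n)
    integer, value :: n
    real :: a(n)
    integer :: i
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    if (i <= n) a(i) = a(i) * coef(1,1)   ! broadcast: every thread reads the same element
  end subroutine scale
end module coef_m

! on the host, a plain assignment copies into constant memory:
!   coef = hostArray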

Quote:
Lastly, I haven't got a clue how to use "pinned" memory, so if anyone has an example of how this might be useful, I'd be very grateful.

In order for host data to be copied to and from the device, the data must be placed in 'pinned' (page-locked) memory: physical memory that the operating system cannot swap out. The DMA transfer can then proceed while the CPU moves on to other work, without any worry of the pages being swapped out mid-transfer.

By default, host memory is allocated in normal pageable virtual memory. To transfer the data to the device, it must first be copied into a pinned staging buffer. When you use the 'pinned' attribute, you save this extra copy, since the host array is allocated directly in pinned memory. The caveats are that pinned memory is finite and the OS does not need to honor the request. Also, it's worth noting that the CUDA driver manages this memory. Hence, if you destroy your context, all device and pinned memory will be destroyed as well; normal host-allocated memory is not.
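
A minimal sketch (program and array names made up; note that the pinned= specifier on ALLOCATE reports whether the request was honored):

Code:
program pinned_demo
  use cudafor
  real, pinned, allocatable :: h(:)     ! page-locked host array
  real, device, allocatable :: d(:)
  logical :: isPinned
  integer :: istat
  integer, parameter :: n = 4*1024*1024
  allocate(h(n), stat=istat, pinned=isPinned)
  if (.not. isPinned) print *, 'warning: OS declined, memory is pageable'
  allocate(d(n))
  h = 1.0
  d = h          ! host-to-device copy, no staging buffer needed
  h = d          ! device-to-host copy
  deallocate(d, h)
end program pinned_demo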

Hope this helps,
Mat
TheMatt

Joined: 06 Jul 2009
Posts: 304
Location: Greenbelt, MD

Posted: Wed Oct 06, 2010 10:44 am

mkcolg wrote:
Constant memory is also finite (64K), but your program would have failed if you had exceeded that limit. [...] Unfortunately I don't know why it's slower for you.


I can say from experience that there are times when it's not advantageous to have a constant array in constant memory, though I'm not sure I've seen a slowdown of any great magnitude.

In my case, I found that the advantages of constant memory were wiped out because the array wasn't being accessed in a broadcast fashion. That is, the threads in a warp weren't all asking for the same value at the same time; rather, each thread asked for a different part of the array depending on some previously calculated index. Could this be happening with your code?
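
In other words, something like the difference between these two reads (coef, k, and idx are made-up names; idx holds the precomputed per-thread indices):

Code:
! fast: every thread in the warp reads the same constant element (one broadcast)
s = coef(k)

! slow: each thread reads a different element through its own index,
! so the warp's accesses to constant memory are serialized
s = coef(idx(i))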
alfvenwave

Joined: 08 Apr 2010
Posts: 78

Posted: Wed Oct 06, 2010 11:14 am

Yep - that could be happening. Lots to think about and lots to learn!

Thanks guys.

Rob.