PGI User Forum

CUDA FORTRAN/OpenACC "Overflow" Register with maxr
frnkyl004



Joined: 06 Dec 2011
Posts: 50

Posted: Fri Jan 31, 2014 2:41 am    Post subject: CUDA FORTRAN/OpenACC "Overflow" Register with maxr

Hi All,

Compute Capability 3.5 cards have a maximum of 65536 registers per block and 255 registers per thread, where (AFAIK) the 256th register stores the location in global memory to which registers are spilled (the "overflow" register). If I use 512 threads per block, I can use a maximum of 65536(registers/block)/512(threads/block) = 128 registers per thread, which means I need to use
Code:
maxregcount:n
when compiling. A value of 129 or more for n results in a launch error due to unavailable resources (as it should), and a value of 128 or less works, but I'm not sure why 128 is OK. Should the value of n be 128 or 127? If it should/can be 128, where is the "overflow" register?

Cheers,
Kyle
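
The arithmetic in the post above can be sketched in a few lines of Python. This is only an illustration: the two limits are the Compute Capability 3.5 figures quoted in the post, `max_regs_per_thread` is a hypothetical helper name, and the sketch ignores any hardware register-allocation granularity.

```python
# Limits quoted in the post for Compute Capability 3.5 (assumed here).
REGS_PER_BLOCK = 65536      # registers available to one thread block
MAX_REGS_PER_THREAD = 255   # architectural per-thread cap


def max_regs_per_thread(threads_per_block: int) -> int:
    """Largest per-thread register count that still fits in one block.

    Hypothetical helper for illustration; it does not model the
    hardware's register-allocation granularity.
    """
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")
    # Share the block's register file evenly among its threads ...
    per_block_budget = REGS_PER_BLOCK // threads_per_block
    # ... but never exceed the per-thread cap.
    return min(per_block_budget, MAX_REGS_PER_THREAD)


if __name__ == "__main__":
    for threads in (128, 256, 512, 1024):
        print(threads, max_regs_per_thread(threads))
```

For 512 threads per block the helper returns 128, which matches the largest value of n that launches successfully in the post.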
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Fri Jan 31, 2014 12:01 pm

Hi Kyle,

The reason this works is that the "overflow" register isn't a 256th register; rather, it's a special register encoding in the instruction (all 1's) identified as "RZ".

- Mat
frnkyl004



Joined: 06 Dec 2011
Posts: 50

Posted: Fri Jan 31, 2014 12:27 pm

Hi Mat,

Thanks for the info. I've seen RZ in PTX before but couldn't figure out what it was. I thought it might be something to do with round-to-zero, but it didn't make sense for it to be that in the way it was used in PTX.

Knowing this, why is it possible to use 512(threads/block)*128(registers/thread)=65536(registers/block), but only 256(threads/block)*255(registers/thread)=65280?

Is there anywhere I can find the answers to questions like this?

Cheers,
Kyle
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Fri Jan 31, 2014 5:20 pm

Hi Kyle,

You're welcome to ask these questions here, and I can ask around if I don't know the answer. That said, stackoverflow.com is also a good place to ask these types of low-level CUDA questions.

While I don't know for a fact, I would think the reason for the per-thread limit is that register numbers are represented by a 1-byte (8-bit) value (00-FF). Given that RZ is represented as FF (all 1's), only registers R0 through R254 can be identified. Hence, with 256 threads per block you're hitting the limit of 255 registers per thread. In the other case, you're hitting the limit of 65536 registers per block. These are two separate but related limits.

If you want a more definitive answer, I can investigate.

- Mat
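
The reasoning in the reply above can be written out as a small Python sketch. It assumes, as the reply itself does without confirming it, that a register operand is an 8-bit field whose all-ones value is reserved for RZ; the names `usable_registers` and `limiting_factor` are hypothetical and used only for illustration.

```python
REG_FIELD_BITS = 8                        # assumed width of a register operand field
RZ_ENCODING = (1 << REG_FIELD_BITS) - 1   # all 1's (0xFF), assumed reserved for RZ
REGS_PER_BLOCK = 65536                    # per-block limit quoted in the thread


def usable_registers() -> int:
    """Number of encodings left for general registers (R0 up to R254)."""
    return RZ_ENCODING  # encodings 0x00..0xFE


def limiting_factor(threads_per_block: int) -> tuple[int, str]:
    """Return (max registers per thread, name of the limit that binds)."""
    per_block_budget = REGS_PER_BLOCK // threads_per_block
    if per_block_budget < usable_registers():
        return per_block_budget, "per-block"
    return usable_registers(), "per-thread"


if __name__ == "__main__":
    for threads in (512, 256):
        regs, limit = limiting_factor(threads)
        # Total registers used by the block at the maximum per-thread count.
        print(threads, regs, threads * regs, limit)
```

Under these assumptions, 512 threads per block are capped by the per-block limit (512 * 128 = 65536), while 256 threads per block are capped by the per-thread limit (256 * 255 = 65280), which is the asymmetry asked about earlier in the thread.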
frnkyl004



Joined: 06 Dec 2011
Posts: 50

Posted: Sat Feb 01, 2014 12:13 am

Hi Mat,

Quote:
If you want a more definitive answer, I can investigate.


No need. Couldn't have asked for a better answer.

Cheers,
Kyle
All times are GMT - 7 Hours
Page 1 of 2

 