PGI User Forum


Can't I launch kernel with the maxThreadsPerBlock of GTX260?
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Wed Aug 29, 2012 8:00 am

Hi Feng,

Quote:
Does that mean that each thread uses 51 registers, larger than 32, when I launched 512 threads per block ?
This means that each thread will use 51 registers, which limits you to 321 threads per block (i.e. 16,384/51), though I'd round this down to a multiple of 32 (the warp size), or 320 threads.

The other thing you can do is use the flag "-Mcuda=maxregcount:32" to limit the number of registers used by each thread so you can run more threads. The caveat is that each thread will spill the extra registers to local memory (which resides in off-chip device memory, like global memory), making each thread run a bit slower.
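The register arithmetic above can be sketched in a few lines of Python. This is only a rough estimate, not what the CUDA tools actually compute: it assumes the 16,384 registers quoted above are shared by all threads in a block, takes the 512-thread hardware limit from the question, and ignores register allocation granularity. The function name and constants are made up for illustration.

```python
# Rough estimate of the largest launchable block size for a given
# per-thread register count. All constants are assumptions taken from
# the discussion above, not values queried from the device.
REGISTERS_PER_BLOCK = 16384    # register figure quoted in the post
WARP_SIZE = 32                 # threads per warp
MAX_THREADS_PER_BLOCK = 512    # block size the question tried to launch


def max_threads_per_block(regs_per_thread):
    """Return the largest warp-multiple block size that fits in registers."""
    raw = REGISTERS_PER_BLOCK // regs_per_thread   # e.g. 16384 // 51 = 321
    rounded = (raw // WARP_SIZE) * WARP_SIZE       # round down to a full warp
    return min(rounded, MAX_THREADS_PER_BLOCK)     # never exceed the hardware limit


if __name__ == "__main__":
    for regs in (51, 32):
        print(regs, "registers/thread ->", max_threads_per_block(regs), "threads/block")
```

Under these assumptions, 51 registers per thread gives 320 threads per block, and capping at 32 registers per thread makes the full 512 threads per block possible again.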

I would try multiple configurations to see which one works best.

Quote:
Would you tell me how to calculate the amount of registers used in device subprogram?
You can't know exactly except from the ptxas information. That said, local scalars and temporary variables used to store address calculations are often held in registers (the back-end CUDA tools do the actual register allocation). Basically, to use fewer registers, write smaller kernels.

Quote:
Does a single precision real variable use a register?
Possibly, and a double precision variable would use two.

Quote:
And would you please give me some advice about modifying the code?
In your case, most of the register usage is coming from your local variables. If you can use single precision instead of double, you will probably get below 32 registers per thread. You might save a few more if you manually inline "gpu_get_intersection_of_two_lines", since its local variables are also put into registers and inlining means you no longer need them. Finally, if you can reuse some of the local variables and eliminate others, you can save a few more.

However, do not make any of these changes if they compromise your algorithm. While performance is important, it is secondary to producing correct and reliable results.

- Mat
cyfengMIT



Joined: 07 Mar 2012
Posts: 22

Posted: Thu Aug 30, 2012 8:11 am

Hi Mat,

I really do appreciate your advice.
I'll try your suggestions and keep them all in mind.
Thank you very much. :)

Feng.
All times are GMT - 7 Hours