Joined: 30 Jun 2004
Location: The Portland Group Inc.
|Posted: Wed Aug 29, 2012 8:00 am Post subject:
This means that each thread will use 51 registers, which limits you to 321 threads per block (i.e. 16,384/51, rounded down), though I'd round this down to a multiple of 32 (the warp size), or 320 threads.
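As a rough sketch of that arithmetic (the function and constant names are made up for illustration, and the 16,384 figure is simply the register limit quoted above, so check it against your own device):

```python
# Register limit per block as quoted in the post; other devices may differ.
REGISTERS_PER_BLOCK = 16_384
WARP_SIZE = 32


def max_threads_per_block(regs_per_thread,
                          total_regs=REGISTERS_PER_BLOCK,
                          warp=WARP_SIZE):
    """Largest warp-multiple thread count that fits in the register budget."""
    raw_limit = total_regs // regs_per_thread   # 16384 // 51 -> 321
    return (raw_limit // warp) * warp           # round down to a multiple of 32


if __name__ == "__main__":
    print(max_threads_per_block(51))  # 320
```

This ignores any register allocation granularity the hardware may apply, so treat the result as an upper bound rather than a guarantee.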
|Does that mean that each thread uses 51 registers, larger than 32, when I launched 512 threads per block ? |
The other thing you can do is use the flag "-Mcuda=maxregcount:32" to limit the number of registers used by each thread so you can run more threads per block. The caveat is that each thread will spill the extra registers to local memory (which resides in the much slower device global memory), making each thread run a bit slower.
I would try multiple configurations to see which one works best.
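To get a list of candidate block sizes worth timing, something like the following sketch may help. It assumes the same 16,384-register limit mentioned above and caps the search at 512 threads (the block size from the question); the helper name is hypothetical.

```python
def fitting_block_sizes(regs_per_thread, total_regs=16_384,
                        warp=32, max_block=512):
    """Warp-multiple block sizes whose total register use fits the budget."""
    return [n for n in range(warp, max_block + 1, warp)
            if n * regs_per_thread <= total_regs]


if __name__ == "__main__":
    # 51 registers per thread (as compiled) vs. 32 (with maxregcount:32).
    for regs in (51, 32):
        sizes = fitting_block_sizes(regs)
        print(regs, "registers/thread -> largest block:", sizes[-1])
```

With 51 registers per thread the largest candidate is 320 threads, while capping at 32 registers lets a 512-thread block fit. Which one actually runs faster depends on how much the spilling costs, so it still needs to be measured.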
You can't know exactly except from the ptxas information. However, things like local scalars and temporary variables used to store address calculations are often stored in registers (the back-end CUDA tools do the actual register allocation). Basically, to use fewer registers, write smaller kernels.
|Would you tell me how to calculate the amount of registers used in device subprogram? |
Possibly, and a double precision variable would use two.
|Does a single precision real variable use a register? |
In your case, most of the register usage is coming from your local variables. If you can use single precision instead of double, you will probably get below 32 registers per thread. You might save a few more if you manually inline "gpu_get_intersection_of_two_lines", since its local variables are also put into registers and inlining means you no longer need them. Finally, if you can reuse some of the local variables and eliminate others, you can save a few more.
|And would you please give me some advice about modifying the code? |
However, do not make any of these changes if they compromise your algorithm. While performance is important, it is secondary to producing correct and reliable results.