S. Soll
Joined: 17 Sep 2010 Posts: 3
Posted: Tue Oct 19, 2010 5:56 am Post subject: CUDA Fortran: unroll directive for kernel code? |
Hello,
I'm trying to compare the performance of CUDA C and CUDA Fortran because we have a large Fortran code base and want to avoid porting it to C. So far I've been able to optimize most aspects of the CUDA Fortran code, but as soon as I add a well-placed #pragma unroll to the C code, the Fortran version falls behind.
For example, this kernel code in CUDA C is unrolled:

Code:
uint end = min(gpu_shared_mem_block_size, n - block_offset);
#pragma unroll 16
for (uint j = 0; j < end; j++)
    sum += block_a[threadIdx.y][j] * block_b[j][threadIdx.x];
And I'm wondering if there is a way to do the same in this CUDA Fortran code:

Code:
block_end = min(16, n - shared_block_offset)
do k = 1, block_end
    sum = sum + A_shared(threadidx%x, k) * B_shared(k, threadidx%y)
end do
I've searched the documentation and have found "!$pgi unroll" and "!$acc unroll". It seems that both only apply to host code and do not change the way kernel code is generated. Did I miss something?
Thanks in advance.
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Tue Oct 19, 2010 4:28 pm Post subject: |
Hi S. Soll,
Quote:
It seems that both only apply to host code and do not change the way kernel code is generated. Did I miss something?

No, you didn't miss anything. The unroll directive only applies to host code. However, the compiler may automatically unroll loops if it finds it advantageous to do so.
- Mat
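For reference (this is not something suggested in the thread itself): when no kernel-side unroll directive is available, the loop can be unrolled by hand. Below is a minimal sketch in the spirit of the loop from the question, assuming a 16x16 tile and that n is a multiple of the tile size; the module and kernel wrapper, the TILE parameter, and all names other than the inner loop body are illustrative.

Code:
! Sketch only: tiled matrix multiply with the inner product loop
! unrolled by hand in groups of four. Launch with TILE x TILE thread
! blocks covering C; n is assumed to be a multiple of TILE.
module unroll_sketch
  use cudafor
  implicit none
  integer, parameter :: TILE = 16
contains
  attributes(global) subroutine matmul_tiled(A, B, C, n)
    integer, value :: n
    real, device :: A(n, n), B(n, n), C(n, n)
    real, shared :: A_shared(TILE, TILE), B_shared(TILE, TILE)
    real :: sum
    integer :: tx, ty, row, col, k, kb

    tx  = threadidx%x
    ty  = threadidx%y
    row = (blockidx%x - 1) * TILE + tx
    col = (blockidx%y - 1) * TILE + ty
    sum = 0.0

    do kb = 0, n - 1, TILE
      ! Stage one tile of A and one tile of B in shared memory.
      A_shared(tx, ty) = A(row, kb + ty)
      B_shared(tx, ty) = B(kb + tx, col)
      call syncthreads()

      ! Hand-unrolled inner product: four products per statement and no
      ! runtime trip-count test, mimicking "#pragma unroll" in the C kernel.
      do k = 1, TILE, 4
        sum = sum + A_shared(tx, k    ) * B_shared(k    , ty) &
                  + A_shared(tx, k + 1) * B_shared(k + 1, ty) &
                  + A_shared(tx, k + 2) * B_shared(k + 2, ty) &
                  + A_shared(tx, k + 3) * B_shared(k + 3, ty)
      end do
      call syncthreads()
    end do

    C(row, col) = sum
  end subroutine matmul_tiled
end module unroll_sketch

Because the inner trip count is now the compile-time constant TILE rather than a runtime block_end, the compiler also has an easier time unrolling it further on its own.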
S. Soll
Joined: 17 Sep 2010 Posts: 3
Posted: Wed Oct 20, 2010 7:35 am Post subject: |
Thanks for the reply.
I noticed that with -O3 the loop gets unrolled in groups of 4, which increases performance quite a bit. Is there any way to get the compiler to unroll the entire loop?
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Wed Oct 20, 2010 10:53 am Post subject: |
You can try using the "-Munroll" flag.
Setting "-Munroll=c:16" will completely unroll loops having a loop count of 16 or less. However, this only works on loops who's counts are known at compile time and in your case, the count isn't known until runtime.
Instead, you can try using "-Munroll:n:16" which will unroll single block loops by a 16. However, this may not be optimal when block_end is less than 16.
"-Munroll=m:<n>" is the same as "-Munroll=n:<n>" except sets the unroll factor for multi-block loops.
- Mat
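For concreteness, a compile line combining these suboptions might look like the following sketch (the source file name is made up; -Minfo simply makes the compiler report which loops it unrolled and by how much):

Code:
# Completely unroll loops with a known trip count of 16 or less, and
# unroll other single-block loops by a factor of 16.
pgfortran -O2 -Mcuda -Munroll=c:16,n:16 -Minfo mmul_kernels.cuf

# As above, but also set the unroll factor for multi-block loops.
pgfortran -O2 -Mcuda -Munroll=c:16,n:16,m:16 -Minfo mmul_kernels.cuf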
S. Soll
Joined: 17 Sep 2010 Posts: 3
Posted: Thu Oct 21, 2010 9:56 am Post subject: |
Thanks for the hint. I already tried -Munroll (sorry, I should have written that in my first post) but it had no effect on the performance.
I tried it again with n:16, m:16, and c:16,n:16,m:16 and looked at the code generated by CUDA Fortran. In the generated C code the loop is always unrolled with a block size of 4, and the resulting PTX looks the same (four groups of ld.shared instructions).
The -Munroll option is what triggers that unroll by 4: if I omit it (and don't pass any -O option), no unrolling is performed. This is pretty much what the man page says, and the default factor of 4 also matches (the default is mentioned for the c suboption). However, the n and m suboptions do not seem to change this default for loops whose counts are unknown at compile time.
So apart from its presence (and the default factor of 4), -Munroll does not seem to affect the generated GPU code at all. Is this the intended behavior for GPU code?
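As a side note for anyone reproducing the experiment above: the intermediate GPU C code and PTX can be kept on disk for inspection with the keepgpu/keepptx suboptions of -Mcuda; a possible invocation (the file name is again illustrative):

Code:
# Keep the intermediate generated-C (.gpu) and .ptx files so the
# unrolling of the kernel loop can be inspected directly.
pgfortran -O3 -Mcuda=keepgpu,keepptx -Munroll=n:16 mmul_kernels.cuf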