PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

CUDA Fortran: unroll directive for kernel code?
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Performance and Benchmarking
View previous topic :: View next topic  
Author Message
S. Soll



Joined: 17 Sep 2010
Posts: 3

PostPosted: Tue Oct 19, 2010 5:56 am    Post subject: CUDA Fortran: unroll directive for kernel code? Reply with quote

Hello,

I'm trying to compare the performance of CUDA C and CUDA Fortran because we have a large Fortran code base and want to avoid porting it to C. So far I've been able to optimize most aspects of CUDA Fortran but as soon as I add a well placed #pragma unroll to the C code the Fortran codes falls behind.

E.g. this kernel code in CUDA C is unrolled:
Code:
uint end = min(gpu_shared_mem_block_size, n - block_offset);
#pragma unroll 16
for (uint j = 0; j < end; j++)
   sum += block_a[threadIdx.y][j] * block_b[j][threadIdx.x];

And I'm wondering if there is a way to do the same in this CUDA Fortran code:
Code:
block_end = min(16, n - shared_block_offset)
do k = 1, block_end
   sum = sum + A_shared(threadidx%x, k) * B_shared(k, threadidx%y)
end do


I've searched the documentation and have found "!$pgi unroll" and "!$acc unroll". It seems that both only apply to host code and do not change the way kernel code is generated. Did I miss something?

Thanks in advance.
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

PostPosted: Tue Oct 19, 2010 4:28 pm    Post subject: Reply with quote

Hi S. Soll,

Quote:
It seems that both only apply to host code and do not change the way kernel code is generated. Did I miss something?
No, you didn't miss anything. The unroll directive only applies to host code. However, the compiler may automatically unroll loops if it finds it advantageous to do so.

- Mat
Back to top
View user's profile
S. Soll



Joined: 17 Sep 2010
Posts: 3

PostPosted: Wed Oct 20, 2010 7:35 am    Post subject: Reply with quote

Thanks for the reply.

I noticed that with -O3 the loop gets unrolled into groups of 4. That increases performance quite a bit. Is there any way to let the compiler unroll the entire loop?
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

PostPosted: Wed Oct 20, 2010 10:53 am    Post subject: Reply with quote

You can try using the "-Munroll" flag.

Setting "-Munroll=c:16" will completely unroll loops having a loop count of 16 or less. However, this only works on loops who's counts are known at compile time and in your case, the count isn't known until runtime.

Instead, you can try using "-Munroll:n:16" which will unroll single block loops by a 16. However, this may not be optimal when block_end is less than 16.

"-Munroll=m:<n>" is the same as "-Munroll=n:<n>" except sets the unroll factor for multi-block loops.

- Mat
Back to top
View user's profile
S. Soll



Joined: 17 Sep 2010
Posts: 3

PostPosted: Thu Oct 21, 2010 9:56 am    Post subject: Reply with quote

Thanks for the hint. I already tried -Munroll (sorry, I should have written that in my first post) but it had no effect on the performance.

I tried it again with n:16, m:16 and c:16,:n:16:m:16 and looked at the code generated by CUDA Fortran. In the generated C code the code is always unrolled with a block size of 4. The resulting PTX code also looks like that (4 groups of ld.shared instructions).

However the -Munroll option triggered the unroll with a block size of 4. If I omit -Munroll no unrolling is performed. This is pretty much what the man page says and the default value of 4 also matches (the default value is mentioned for the c option). However the options n and m do not seem to change this default value for loops with unknown counts.

Without -Munroll (and no -O option) the loop was not unrolled. But apart from it's presence (and the default value of 4) the -Munroll option does not seem to affect the generated GPU code. Is this intended for GPU code?
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Performance and Benchmarking All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group