PGI User Forum


How to choose the correct Loop Scheduling

 
Fedele.Stabile




Posted: Mon Mar 26, 2012 3:50 am    Post subject: How to choose the correct Loop Scheduling

Hi,
I have trouble understanding the loop scheduling policy adopted by the compiler.
I run my tests on a Tesla M2050 GPU.

First example, a single loop that sums two vectors of length 512:

do i = 1, 512
   c(i) = a(i) + b(i)
enddo

I think it is possible to use the directive
!$acc do vector(512)
to instruct the compiler on how to parallelize the loop.
Is that correct?
And what happens if I have vectors of length 10,000, for example?
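For reference, here is the full region I am compiling (a minimal sketch, assuming the PGI Accelerator region syntax; declarations abbreviated):

real :: a(512), b(512), c(512)

!$acc region
!$acc do vector(512)
do i = 1, 512
   c(i) = a(i) + b(i)
enddo
!$acc end region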

Second example, nested loops:
Suppose I have to sum two 512x512 2-D arrays:

do i = 1, 512
   do j = 1, 512
      c(i,j) = a(i,j) + b(i,j)
   enddo
enddo

The compiler chooses these two directives to parallelize:
!$acc do parallel, vector(16)   (for the i-loop)
!$acc do parallel, vector(16)   (for the j-loop)

Why does it choose the value of 16?
I suppose the GPU is not fully used in this way; is that correct?
But I noticed that if I force the compiler to use different values by inserting explicit directives in the code, for example
!$acc do parallel, vector(64)   (for the i-loop)
!$acc do parallel, vector(64)   (for the j-loop)
I don't obtain better performance.
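For completeness, the full region with the explicit schedule looks like this (a sketch; declarations omitted):

!$acc region
!$acc do parallel, vector(64)
do i = 1, 512
   !$acc do parallel, vector(64)
   do j = 1, 512
      c(i,j) = a(i,j) + b(i,j)
   enddo
enddo
!$acc end region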
mkcolg



Location: The Portland Group Inc.

Posted: Mon Mar 26, 2012 11:38 am    Post subject:

Hi Fedele.Stabile,

The compiler typically chooses a good schedule, but this is not guaranteed. Unfortunately, there isn't a better way to find the best schedule than trying the candidate schedules. I typically spend an hour or two varying the schedule to see how it affects performance, though most of the time I can't beat the default.

Quote:
I think it is possible to use the directive
!$acc do vector(512)
to instruct the compiler on how to parallelize the loop.
Is that correct?
You are just setting the block size (i.e., the number of CUDA threads per block). If you are not familiar with the CUDA threading model, Michael Wolfe has a great introductory article, Understanding the CUDA Data Parallel Threading Model: A Primer.
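To map the clauses onto CUDA terms (a sketch; the comments are my shorthand, not compiler output):

!$acc do parallel, vector(512)
! parallel    -> iterations are distributed across CUDA thread blocks (the grid)
! vector(512) -> each block runs 512 CUDA threads (the block size)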

Quote:
And what happens if I have vectors of length 10,000, for example?
There has to be at least one "parallel" clause. In this case, since one was not specified, the compiler adds it. Since you are not limiting the number of blocks (i.e., parallel), more blocks are created as the loop grows.
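For example (a sketch; the vector width of 256 is just illustrative):

!$acc region
!$acc do parallel, vector(256)
do i = 1, 10000
   c(i) = a(i) + b(i)
enddo
!$acc end region

With this schedule the compiler launches ceiling(10000/256) = 40 thread blocks of 256 threads each, so the block count scales with the loop length.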

Quote:
Why does it choose the value of 16?
It's the largest square dimension that can fit on a Tesla card with compute capability 1.3. Newer cards could use 32x32, but then other factors such as shared memory and register usage may still warrant the use of a 16x16 block.
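On a Fermi part such as your M2050 (compute capability 2.0, up to 1024 threads per block), a 32x32 schedule is at least legal. A sketch (whether it is actually faster depends on register and shared memory pressure):

!$acc region
!$acc do parallel, vector(32)
do i = 1, 512
   !$acc do parallel, vector(32)
   do j = 1, 512
      c(i,j) = a(i,j) + b(i,j)
   enddo
enddo
!$acc end region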
Quote:
!$acc do parallel, vector(64)   (for the i-loop)
!$acc do parallel, vector(64)   (for the j-loop)
I don't obtain better performance.
Check the -Minfo=accel output. A 64x64 thread block is too large for your device, so the compiler is most likely ignoring your values and using the default. To see the maximum number of threads per block for your device, please run the utility "pgaccelinfo". On Fermi this maximum is 1024, and on Tesla it is 512.
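To make the arithmetic explicit:

16 x 16 =  256 threads per block  -> fits both limits (256 <= 512)
32 x 32 = 1024 threads per block  -> fits Fermi only (1024 <= 1024)
64 x 64 = 4096 threads per block  -> exceeds both limits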

- Mat