PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

how gang and vector parallelization of a loop map to the GPU
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
brush



Joined: 26 Jun 2012
Posts: 44

PostPosted: Sun Oct 13, 2013 6:04 pm    Post subject: how gang and vector parallelization of a loop map to the GPU Reply with quote

Hi,

Consider a single parallel loop which contains no nested loops and where n is sufficiently big to fill the entire GPU with work

Code:

!$acc kernels do
do i = 1,n
  calculations
end do


so that the compiler output yields something like
!$acc loop gang, vector(128)

So what does this mean exactly? Will this ensure that each core on the GPU has a loop iteration to work on? Is it possible that only one multiprocessor is being used, or that each multiprocessor isn't being entirely filled with work? Why doesn't it tell me how many gangs are being used?

My guess is that the n iterations are divided amongst the different gangs, which each correspond to a thread block. The number of thread blocks, or gangs we get, depends on the number of multi-processors our GPU has. The vector(128) specifies that 128 threads are in each thread block.

So, upon execution, each multiprocessor executes 128 threads in parallel, where each thread corresponds to an iteration of the loop. How accurate is this?
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6141
Location: The Portland Group Inc.

PostPosted: Mon Oct 14, 2013 7:25 am    Post subject: Reply with quote

Quote:
So what does this mean exactly?
The best way to think of this is that the compiler is strip-mining this loop (i.e. adding an inner loop which computes 128 elements). The inner stip-mined loop then mapped to the thread block where each thread computes one iteration. The outer loop would then be mapped to the grid block with one inner loop (i.e. groups of 128).

Quote:
Will this ensure that each core on the GPU has a loop iteration to work on?
Depends upon the value of N and how many cores your GPU has. While not always the case, in this example each iteration of the loop gets mapped to a thread. If the number of iterations is less then the number of the core, then you will have idle cores. If you have more iterations than cores, some gangs will need to wait for others to complete before computing.

Quote:
Is it possible that only one multiprocessor is being used, or that each multiprocessor isn't being entirely filled with work?
Sure, if N is less than the number of cores.

Quote:
Why doesn't it tell me how many gangs are being used?
It's variable at runtime given the size of N. Set the environment variable "PGI_ACC_TIME=1" to see the profile, including the grid dimensions.

This article is a bit dated but still gives a great explanation of the CUDA threading model. http://www.pgroup.com/lit/articles/insider/v2n1a5.htm

- Mat
Back to top
View user's profile
xray



Joined: 21 Jan 2010
Posts: 85

PostPosted: Tue Feb 25, 2014 2:28 am    Post subject: Reply with quote

Dear Mat,
one further question concerning the schedule that is taken for a given number of gangs. Lets say we have:
Code:
int n = 1024;
int blocks = 8;
int threadsperblock = 32;

#pragma acc parallel num_gangs(block) vector_length(threadsperblock)
#pragma acc loop gang vector
for (int i=0; i<n; i++) // do something


I assume that internally happens the code below. Given a fixed gang size, is it correct that we have some kind of "static" schedule among the gangs. What I mean by this is the distribution methodology from OpenMP's static for schedule: Put the same size of iterations into one block and put close iterations together in one block. If so, I believe the schedule would look like the one below (with two nested loops for the gangs, and one strip-mined loop for the threads within the gang).
Code:
int n = 1024;
int blocks = 8;
int threadsperblock = 32;

#pragma acc parallel num_gangs(block) vector_length(threadsperblock)
#pragma acc loop gang
for (int i=0; i<blocks; i++) { // 0..8 - number of blocks
 for (int j=i*(n/blocks); j<(i+1)*(n/blocks); j+=32) { // 0..127, 128..255,.. - number of iterations within one block that is strip-mined by threadsperblock
#pragma acc loop vector
  for (int k=j; k<j+threadsperblock; k++) { //0..32 - threadsperblock
    // do something
}}}

Thanks, Sandra
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6141
Location: The Portland Group Inc.

PostPosted: Tue Feb 25, 2014 4:39 pm    Post subject: Reply with quote

Hi Sandra,

While the exact kernel may vary depending on the host code, it would be conceptionally similar to what you have. Though, the kernel would stride by the gang*vector length (256). Hence block 0 would compute elements 0-31, 256-287, 512-543, 768-799

Something like:
Code:

#pragma acc parallel num_gangs(8) vector_length(32)
{
#pragma acc loop gang vector
   for (int i=0; i < N; ++i) {
      val[i] = i;
   }
}


Would get turned into something like:
Code:
// outer gang loop
      for (int i = 0; i < 8; ++i) {
// stripmine - stride by num gangs * vector length
       for (int j = i*32; j < 1024; j+=(32*8)) {
// vector loop
        for (int k=j; k<j+32;++k) {
            val2[k] = k;
        }
   }
}


Last edited by mkcolg on Wed Feb 26, 2014 4:53 pm; edited 1 time in total
Back to top
View user's profile
xray



Joined: 21 Jan 2010
Posts: 85

PostPosted: Wed Feb 26, 2014 12:39 am    Post subject: Reply with quote

Hi Mat,
Thanks for your answer. Just to make sure: The outer 32 is taken from n/threadsperblock = 1024 / 32 = 32, correct?
Sandra
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group