PGI User Forum

OpenACC reductions

 
nickaj



Joined: 06 Sep 2011
Posts: 21

Posted: Mon Mar 26, 2012 9:29 am    Post subject: OpenACC reductions

Hi,

I was trying to implement a 1D FFT using the OpenACC directive support in the 12.3 release.

Here's the core of my code:
Code:

#pragma acc data copyin(x_in,N), copy(x_out), create(k,n,tmp)
  {
#pragma acc kernels
#pragma acc loop gang(32), vector(16)
    for(k=0;k<N;k++){
#pragma acc loop
      for(n=0;n<N;n++){
        tmp = x_in[n] * cos((-2*M_PI/N)*k*n);
      }
      x_out[k] = tmp;
    }
  }


I expected that the compiler would spot the reduction on the inner loop. Is there a better way to let it be detected or should I wait for a later compiler version where the reduction clause is supported?
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

Posted: Mon Mar 26, 2012 2:25 pm    Post subject: Re: OpenACC reductions

Hi nickaj,

The "kernels" model works on tightly nested loops. Here, the inner loop is not tightly nested, so it would not be accelerated, and using the "reduction" clause would not help. (Note that in your code tmp isn't a reduction anyway: each inner iteration overwrites tmp instead of accumulating into it.)

I think what you really want is a way to express that the outer loop is performed by a "gang" while the inner loop is performed by a "vector". Something like:
Code:

#pragma acc data copyin(x_in,N), copy(x_out), create(k,n,tmp)
  {
#pragma acc parallel
    {
#pragma acc loop gang
      for(k=0;k<N;k++){

        // This section of code is performed by one thread in the gang:
        // if (thread == 1) {
        tmp = 0;
        // }
        // syncthreads() to synchronize the threads

#pragma acc loop vector(32)
        for(n=0;n<N;n++){
          // Perform this loop in parallel across all threads in a gang,
          // creating a partial sum
          tmp = tmp + (x_in[n] * cos((-2*M_PI/N)*k*n));
        }
        // syncthreads()
        // Back into sequential code:
        // if (thread == 1) {
        // perform the final sum reduction of tmp and store the result back to memory
        x_out[k] = tmp;
        // }
        // syncthreads()
      }
    }
  }


We just started to get requests like this in the last few months and are investigating how we can express this in OpenACC. We're not sure if we can do this within the current "parallel" model specs, or if the OpenACC API needs to be extended. It's very early.

- Mat
All times are GMT - 7 Hours