PGI User Forum
parallel + independent and kernels + vector_length()
PaulPa



Joined: 02 Aug 2012
Posts: 35

Posted: Thu Aug 02, 2012 5:19 am    Post subject: parallel + independent and kernels + vector_length()

Hello,

I am currently working with the new PGI compiler 12.6 (Linux) and I am still running into problems with some OpenACC directives.

The compiler does not seem to be capable of handling the independent clause within a parallel region. Is it supposed to be that way, or is this feature still missing?

I encounter a similar problem if I try to compile my code with a vector_length clause within a kernels region (the vector_length clause is part of an acc loop directive within the kernels region). What am I doing wrong here?

Furthermore, I noticed that the 12.6 compiler decides to schedule the workload among gangs and vectors even though I explicitly tell it to schedule the work among gangs only. Is this behaviour expected, or is it a bug?

If you guys want, I can provide you with the source code and/or file a bug report.

Best,
Paul
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Thu Aug 02, 2012 9:47 am

Hi Paul,

Quote:
The compiler does not seem to be capable of handling the independent clause within a parallel region.
Per the OpenACC spec, Section 2.8.6, the independent clause only applies to loop directives within kernels regions.

Quote:
I encounter a similar problem if I try to compile my code with a vector_length clause within a kernels region (the vector_length clause is part of an acc loop directive within the kernels region). What am I doing wrong here?
Per Section 2.4.7, the vector_length clause is only allowed on the parallel construct.
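
Editor's sketch (not part of the original reply): the two placements described above might look as follows in C. The function names, the `copyin`/`copyout` clauses, and the vector length of 128 are illustrative assumptions, not taken from Paul's code. Without an OpenACC compiler the pragmas are ignored and the loops simply run serially.

```c
#include <assert.h>

/* "independent" on a loop directive inside a kernels region. */
void scale_kernels(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n >= 0);
    #pragma acc kernels copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc loop independent
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];
    }
}

/* "vector_length" on the parallel construct itself, not on the loop. */
void scale_parallel(int n, float a, const float *restrict x, float *restrict y)
{
    assert(n >= 0);
    #pragma acc parallel vector_length(128) copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc loop gang vector
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i];
    }
}
```

Both functions compute the same result; only the place where the scheduling hint is attached differs.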

Quote:
Furthermore, I noticed that the 12.6 compiler decides to schedule the workload among gangs and vectors even though I explicitly tell it to schedule the work among gangs only. Is this behaviour expected, or is it a bug?
Can you post a basic example of the code (not necessarily the specific code, but rather its structure)? My guess is that you have either non-tightly nested loops, or a parallel region with tightly nested loops but no collapse clause.

For example, in the non-tightly nested case, a gang schedule on the inner loop is illegal because of the code between the two loops. The inner loop would need to be executed by the vectors in order to create a valid kernel.
Quote:

!$acc parallel
!$acc loop gang
do i=1,N
   ! ... do something
!$acc loop gang   ! << this will be ignored
   do j=1,M
      ! ... do more
   end do
end do
!$acc end parallel
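
Editor's sketch (not part of the original reply): a C version of the non-tightly nested case, with the inner loop given to the vectors as described above. The row-sum computation, the function name `row_sums`, and the data clauses are assumptions chosen only to make the example self-contained.

```c
#include <assert.h>

/* Row sums of an n-by-m row-major matrix. The statement between the two
   loops makes the nest non-tightly nested, so the outer loop is mapped to
   gangs and the inner loop to vector lanes. */
void row_sums(int n, int m, const float *restrict a, float *restrict sum)
{
    assert(n >= 0 && m >= 0);
    #pragma acc parallel copyin(a[0:n*m]) copyout(sum[0:n])
    {
        #pragma acc loop gang
        for (int i = 0; i < n; ++i) {
            float s = 0.0f;                 /* code between the loops */
            #pragma acc loop vector reduction(+:s)
            for (int j = 0; j < m; ++j)
                s += a[i * m + j];
            sum[i] = s;
        }
    }
}
```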


In the tightly nested loop case, the default is to schedule only the outer loop across the gangs. In order to schedule multiple loops across the gangs, the collapse clause is needed:
Quote:

!$acc parallel
!$acc loop gang collapse(2)
do i=1,N
   do j=1,M
      ! ... do more
   end do
end do
!$acc end parallel
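
Editor's sketch (not part of the original reply): one way to picture `collapse(2)` is as a single loop over the combined `n*m` iteration space. The two hypothetical functions below are meant to produce identical output; the second flattens the nest by hand. Names and data clauses are assumptions for illustration.

```c
#include <assert.h>

/* Tightly nested loops, both scheduled across gangs via collapse(2). */
void fill_collapsed(int n, int m, float *restrict out)
{
    assert(n >= 0 && m > 0);
    #pragma acc parallel copyout(out[0:n*m])
    {
        #pragma acc loop gang collapse(2)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j)
                out[i * m + j] = (float)(i + j);
    }
}

/* The same iteration space flattened manually into one gang loop. */
void fill_flattened(int n, int m, float *restrict out)
{
    assert(n >= 0 && m > 0);
    #pragma acc parallel copyout(out[0:n*m])
    {
        #pragma acc loop gang
        for (int idx = 0; idx < n * m; ++idx) {
            int i = idx / m;   /* recover the outer index */
            int j = idx % m;   /* recover the inner index */
            out[idx] = (float)(i + j);
        }
    }
}
```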


Hope this helps,
Mat
PaulPa



Joined: 02 Aug 2012
Posts: 35

Posted: Wed Aug 15, 2012 2:03 am

Hi Mat,

Thank you for your help and, once again, sorry for the late response.

Here is a code snippet:
Code:

#pragma acc kernels present(Ahat[0:n*k],x[0:k],tmpArray[0:n*numBlocksK])
{
#pragma acc loop independent gang collapse(2)
   for (int i = 0; i < numBlocksN; i++) {
      for (int j = 0; j < numBlocksK; j++) {
#pragma acc loop independent vector
         for (int l = 0; l < BLOCK_SIZE; ++l) {
            precision tmp;
            tmp = 0.0;
#pragma unroll(UNROLL_SIZE)
            for (int m = 0; m < BLOCK_SIZE; ++m) {
               tmp += Ahat[(i*BLOCK_SIZE + l)*k + j*BLOCK_SIZE + m] * x[j*BLOCK_SIZE + m];
            }
            tmpArray[(i*BLOCK_SIZE + l)*numBlocksK + j] += tmp;
         }
      } // for j
   } // for i
}


The compiler output looks like this:
Code:

 95, Loop is parallelizable
         Accelerator kernel generated
         92, #pragma acc loop gang /* blockIdx.y */
         93, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
         95, #pragma acc loop vector(4) /* threadIdx.y */
             CC 2.0 : 28 registers; 0 shared, 84 constant, 0 local memory bytes

Line 92 corresponds to the i-loop.
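
Editor's sketch (not part of the original post): one way to read the `gang, vector(64)` line in the output above is as strip-mining, where the loop is split into blocks of 64 iterations, with `blockIdx.x` selecting the block and `threadIdx.x` the lane within it. The plain C emulation below assumes this usual CUDA-style mapping; the thread does not show the code the compiler actually generated, and `strip_mine_visits` is a hypothetical helper.

```c
#include <assert.h>

/* Emulate strip-mining a loop of `trip` iterations into blocks of `vlen`
   lanes. Increments visited[j] once for every iteration j that gets
   executed and returns the number of blocks (gangs) that were needed. */
int strip_mine_visits(int trip, int vlen, int *visited)
{
    assert(trip >= 0 && vlen > 0);
    int gangs = (trip + vlen - 1) / vlen;           /* ceil(trip / vlen) */
    for (int block = 0; block < gangs; ++block) {   /* plays blockIdx.x  */
        for (int lane = 0; lane < vlen; ++lane) {   /* plays threadIdx.x */
            int j = block * vlen + lane;
            if (j < trip)                           /* guard the last, partial block */
                visited[j] += 1;
        }
    }
    return gangs;
}
```

Under this reading, every iteration of the j-loop is still executed exactly once; the loop is just distributed over both gangs and vector lanes.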

Quote:

Section 2.4.7, the vector_length is only allowed on the parallel construct.

Is there any other way to influence the vector length within a kernels region?
The vector length is a performance-critical parameter, so I don't understand why the programmer should not be able to change it.

Thank you.

Best,
Paul
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Wed Aug 15, 2012 11:31 am

Hi Paul,

For this code, it might be better to use the "parallel" construct rather than "kernels". It will give you the finer-grained control you're looking for. Something along the lines of:

Code:
#pragma acc parallel present(Ahat[0:n*k],x[0:k],tmpArray[0:n*numBlocksK]) vector_length(256)
{
// "independent" is dropped here: it only applies to loops in kernels regions
#pragma acc loop gang collapse(2)
   for (int i = 0; i < numBlocksN; i++) {
      for (int j = 0; j < numBlocksK; j++) {
#pragma acc loop vector
         for (int l = 0; l < BLOCK_SIZE; ++l) {
            precision tmp;
            tmp = 0.0;
#pragma unroll(UNROLL_SIZE)
            for (int m = 0; m < BLOCK_SIZE; ++m) {
               tmp += Ahat[(i*BLOCK_SIZE + l)*k + j*BLOCK_SIZE + m] * x[j*BLOCK_SIZE + m];
            }
            tmpArray[(i*BLOCK_SIZE + l)*numBlocksK + j] += tmp;
         }
      } // for j
   } // for i
}
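
Editor's sketch (not part of the original reply): for comparison, the loop directive's `vector` clause can carry an argument inside a kernels region, which is the same `vector(64)` form the compiler feedback earlier in this thread prints. Whether PGI 12.6 honors the requested length is not confirmed anywhere in this thread. The simplified matrix-vector product, the function name, and the data clauses below are assumptions for illustration, not Paul's blocked code.

```c
#include <assert.h>

/* y = A * x for an n-by-k row-major matrix, inside a kernels region.
   The vector width is requested on the loop directive rather than with
   vector_length on the compute construct. */
void matvec_kernels(int n, int k, const float *restrict a,
                    const float *restrict x, float *restrict y)
{
    assert(n >= 0 && k >= 0);
    #pragma acc kernels copyin(a[0:n*k], x[0:k]) copyout(y[0:n])
    {
        #pragma acc loop independent gang vector(64)
        for (int i = 0; i < n; ++i) {
            float tmp = 0.0f;
            #pragma acc loop seq
            for (int j = 0; j < k; ++j)
                tmp += a[i * k + j] * x[j];
            y[i] = tmp;
        }
    }
}
```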


Hope this helps,
Mat
PaulPa



Joined: 02 Aug 2012
Posts: 35

Posted: Sun Aug 19, 2012 2:43 am

Hi Mat,

Thanks for your reply.

I actually had such an implementation before, and it worked fine, but I would like to achieve the same thing with the kernels region as well.

Hence, will it become possible to change the vector length within a kernels region?

Furthermore, do you know why the compiler schedules the workload among gangs and vectors?

Best,
Paul
All times are GMT - 7 Hours
Page 1 of 2

 