PGI User Forum

CUDA-x86.

MatMul with openACC
 
PGI User Forum Forum Index -> Accelerator Programming
PaulPa



Joined: 02 Aug 2012
Posts: 35

Posted: Mon Dec 17, 2012 12:44 pm

Hi Mat,

mkcolg wrote:
Here's the OpenACC versions:
Code:

void
MatrixMultiplication4(float * restrict a,float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k ;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
{
#pragma acc kernels loop gang, vector(8)
    for (i=0; i<m; i++){
#pragma acc loop gang, vector (8)
        for (j=0; j<n; j++)
        {
            float sum = 0.0 ;
#pragma acc loop seq
            for (k=0; k<p; k++)
                sum += b[i*p+k]*c[k*n+j] ;
            a[i*n+j] = sum ;
        }
    }
}
}


I compiled your matrix-matrix multiplication once with pgcc 12.9 and once with pgcc 12.5. The two compilers produced different schedules, which result in different runtimes.

Compiler feedback of pgcc 12.9:
Code:

     24, Generating present_or_copyin(B[0:n*n])
         Generating present_or_copyin(A[0:n*n])
         Generating present_or_copy(C[0:n*n])
         Generating compute capability 2.0 binary
     27, Loop carried dependence of '*(C)' prevents parallelization
         Loop carried backward dependence of '*(C)' prevents vectorization
     29, Loop is parallelizable
         Accelerator kernel generated
         29, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
             CC 2.0 : 25 registers; 0 shared, 80 constant, 0 local memory bytes
     32, Loop is parallelizable


Compiler feedback of pgcc 12.5:
Code:

     24, Generating copyin(B[0:n*n])
         Generating copyin(A[0:n*n])
         Generating copy(C[0:n*n])
         Generating compute capability 2.0 binary
     27, Loop carried dependence of '*(C)' prevents parallelization
         Loop carried backward dependence of '*(C)' prevents vectorization
     29, Loop is parallelizable
         Accelerator kernel generated
         27, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         29, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
             CC 2.0 : 22 registers; 8 shared, 80 constant, 0 local memory bytes
     32, Loop is parallelizable


To my surprise, pgcc 12.5 yields slightly better performance than pgcc 12.9 (2.4 s vs. 2.8 s for matrices of size 1024 x 1024 on a Quadro 6000, with acc_init() called beforehand).

Why does the more recent compiler ignore my schedule, while pgcc 12.5 respects it?

Thank you.
Best,
Paul
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Mon Dec 17, 2012 3:26 pm

Hi Paul,

12.9 is being a little more picky about the loop dependency and only parallelizing the "j" loop. I should have done this in my first post, but the solution is to add the independent clause.

Code:
% cat mm4.c
void
MatrixMultiplication4(float * restrict a,float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k ;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
{
#pragma acc kernels loop gang, vector(8) independent
    for (i=0; i<m; i++){
#pragma acc loop gang, vector(8) independent
        for (j=0; j<n; j++)
        {
            float sum = 0.0 ;
            for (k=0; k<p; k++)
                sum += b[i*p+k]*c[k*n+j] ;
            a[i*n+j] = sum ;
        }
    }
}
}
% pgcc -acc -Minfo mm4.c -V12.9 -c
MatrixMultiplication4:
      6, Generating copyout(a[0:m*n])
         Generating copyin(c[0:p*n])
         Generating copyin(b[0:p*m])
      8, Generating present_or_copyout(a[0:m*n])
         Generating present_or_copyin(b[0:p*m])
         Generating present_or_copyin(c[0:p*n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
     11, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         11, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.0 : 20 registers; 76 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 92 constant, 0 local memory bytes
     14, Loop is parallelizable


Though, I doubt an 8x8 vector is still the ideal schedule. Personally, I've started to avoid the loop schedule clauses altogether and let the compiler choose: a schedule I tune for one device may not be the best for another.

Code:
% cat mm4.c
void
MatrixMultiplication4(float * restrict a,float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k ;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
{
#pragma acc kernels loop collapse(2) independent
    for (i=0; i<m; i++){
        for (j=0; j<n; j++)
        {
            float sum = 0.0 ;
            for (k=0; k<p; k++)
                sum += b[i*p+k]*c[k*n+j] ;
            a[i*n+j] = sum ;
        }
    }
}
}
% pgcc -acc -Minfo mm4.c -V12.9 -c
MatrixMultiplication4:
      6, Generating copyout(a[0:m*n])
         Generating copyin(c[0:p*n])
         Generating copyin(b[0:p*m])
      8, Generating present_or_copyout(a[0:m*n])
         Generating present_or_copyin(b[0:p*m])
         Generating present_or_copyin(c[0:p*n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      9, Loop is parallelizable
     10, Loop is parallelizable
         Accelerator kernel generated
          9, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
         10, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
             CC 1.0 : 19 registers; 76 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 22 registers; 0 shared, 92 constant, 0 local memory bytes
     13, Loop is parallelizable


- Mat
PaulPa



Joined: 02 Aug 2012
Posts: 35

Posted: Mon Dec 17, 2012 4:53 pm

mkcolg wrote:

Though, I doubt an 8x8 vector is still the idea schedule. Personally I've started to avoid using the loop schedule clauses altogether and let the compiler choose. I'm concerned that if I set the schedule for one device, it may not be the best for another.


True, that would be ideal in terms of portability to future devices.

Thank you for your help.

Best,
Paul

