PGI User Forum

MatMul with openACC
Tapasya Patki



Joined: 08 Mar 2012
Posts: 1

Posted: Fri Mar 16, 2012 6:39 am    Post subject: MatMul with openACC

Hello,

Does someone have a simple matrix multiplication example that runs with pgcc on the GPU? I've tried a lot of things, but I always get stuck on parallelizing the innermost loop: either I get invalid-loop errors (when I try to create a private array to resolve the dependence), or a "cannot be parallelized/vectorized" error. The code does run if I compile it normally (not for an accelerator).
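
For reference, the kind of loop nest I'm trying to offload is essentially the textbook triple loop below (a simplified sketch; the array names and the flattened indexing are just placeholders for my real code):

Code:
/* naive matrix multiply: the k loop accumulates into a[i*n+j],
   which is the dependence the compiler keeps flagging */
void matmul_naive(float * restrict a, float * restrict b, float * restrict c,
                  int m, int n, int p)
{
    int i, j, k;
#pragma acc kernels loop
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < p; k++)
                a[i*n+j] += b[i*p+k] * c[k*n+j];
}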


Any help would be greatly appreciated. I'm a newbie trying to learn how to use OpenACC to accelerate my code on the GPU.
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Fri Mar 16, 2012 3:35 pm

Hi Tapasya Patki,

Here are some examples using the PGI Accelerator Model; I'll post the OpenACC versions next. The major differences are that OpenACC only allows 1-D arrays, so "float **" needs to be translated to "float *", and that the "copy" clauses in OpenACC take a starting bound plus the number of elements to copy (rather than a lower and upper bound).
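
To see the bounds difference at a glance, here is a minimal side-by-side sketch of a data clause in the two models (the array and size names are just placeholders for the ones used in the examples that follow):

Code:
/* PGI Accelerator Model: bounds are given as first element : last element */
#pragma acc data region copyin(b[0:(m*p)-1])

/* OpenACC: bounds are given as starting element : number of elements */
#pragma acc data copyin(b[0:(m*p)])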

Code:
$ cat MatrixMultiplication.c
#define M 1024
#define N 2048
#define P 4096

/* multiply an M-row, P-column matrix by a P-row, N-column matrix
   to produce an M-row, N-column matrix */
void
MatrixMultiplication1(restrict float a[M][N], restrict float b[M][P], restrict float c[P][N])
{
    int i, j, k ;

#pragma acc region for parallel, vector(8)
    for (i=0; i<M; i++) {
#pragma acc for parallel, vector(8)
        for (j=0; j<N; j++) {
#pragma acc for seq
            for (k=0; k<P; k++)
                a[i][j] += b[i][k]*c[k][j] ;
        }
    }
}

void
MatrixMultiplication2(restrict float a[M][N], restrict float b[M][P], restrict float c[P][N])
{
    int i, j, k ;

#pragma acc region for parallel, vector(8)
    for (i=0; i<M; i++){
#pragma acc for parallel, vector(8)
        for (j=0; j<N; j++) {
            float sum = 0.0 ;
#pragma acc for seq
            for (k=0; k<P; k++)
                sum += b[i][k]*c[k][j] ;
            a[i][j] = sum ;
        }
    }
}

void
MatrixMultiplication3(float * restrict a, float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k ;

#pragma acc data region copyout(a[0:(m*n)-1]), copyin(b[0:(m*p)-1],c[0:(p*n)-1])
{
#pragma acc region for parallel, vector(8)
    for (i=0; i<m; i++){
#pragma acc for parallel, vector (8)
        for (j=0; j<n; j++) {
#pragma acc for seq
            for (k=0; k<p; k++)
                a[i*n+j] += b[i*p+k]*c[k*n+j] ;
   }
    }
}
}

void
MatrixMultiplication4(float * restrict a,float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k ;

#pragma acc data region copyout(a[0:(m*n)-1]), copyin(b[0:(m*p)-1],c[0:(p*n)-1])
{
#pragma acc region for parallel, vector(8)
    for (i=0; i<m; i++){
#pragma acc for parallel, vector (8)
        for (j=0; j<n; j++)
        {
            float sum = 0.0 ;
#pragma acc for seq
            for (k=0; k<p; k++)
                sum += b[i*p+k]*c[k*n+j] ;
            a[i*n+j] = sum ;
        }
    }
}
}
$ pgcc -ta=nvidia -c -Minfo=accel MatrixMultiplication.c  -Msafeptr -V12.3
MatrixMultiplication1:
     12, Generating copy(a[:1023][:])
         Generating copyin(b[:1023][:])
         Generating copyin(c[:4095][:])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     13, Loop is parallelizable
     15, Loop is parallelizable
     17, Complex loop carried dependence of '*(a)' prevents parallelization
         Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
         Accelerator kernel generated
         13, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
         15, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
         17, #pragma acc for seq(8)
             Cached references to size [8x8] block of 'b'
             Cached references to size [8x8] block of 'c'
             CC 1.0 : 9 registers; 560 shared, 16 constant, 0 local memory bytes; 66% occupancy
             CC 2.0 : 17 registers; 520 shared, 56 constant, 0 local memory bytes; 33% occupancy
MatrixMultiplication2:
     28, Generating copyout(a[:1023][:])
         Generating copyin(b[:1023][:])
         Generating copyin(c[:4095][:])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     29, Loop is parallelizable
     31, Loop is parallelizable
         Accelerator kernel generated
         29, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
         31, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.0 : 12 registers; 48 shared, 8 constant, 0 local memory bytes; 66% occupancy
             CC 2.0 : 20 registers; 8 shared, 56 constant, 0 local memory bytes; 33% occupancy
     34, Loop is parallelizable
MatrixMultiplication3:
     46, Generating copyout(a[:m*n-1])
         Generating copyin(c[:p*n-1])
         Generating copyin(b[:p*m-1])
     48, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     49, Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
     51, Loop is parallelizable
     53, Complex loop carried dependence of '*(a)' prevents parallelization
         Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
         Accelerator kernel generated
         49, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
         51, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
         53, #pragma acc for seq
             CC 1.0 : 19 registers; 84 shared, 16 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 25 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
MatrixMultiplication4:
     65, Generating copyout(a[:m*n-1])
         Generating copyin(c[:p*n-1])
         Generating copyin(b[:p*m-1])
     67, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     68, Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
     70, Loop is parallelizable
         Accelerator kernel generated
         68, #pragma acc for parallel, vector(8) /* blockIdx.y threadIdx.y */
         70, #pragma acc for parallel, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.0 : 17 registers; 84 shared, 12 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 24 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
     74, Loop is parallelizable


- Mat
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Fri Mar 16, 2012 3:38 pm

Here are the OpenACC versions:


Code:
$ cat MatrixMultiplication_openacc.c

void
MatrixMultiplication3(float * restrict a, float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k ;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
{
#pragma acc kernels loop gang, vector(8)
    for (i=0; i<m; i++){
#pragma acc loop gang, vector (8)
        for (j=0; j<n; j++) {
#pragma acc loop seq
            for (k=0; k<p; k++)
                a[i*n+j] += b[i*p+k]*c[k*n+j] ;
   }
    }
}
}

void
MatrixMultiplication4(float * restrict a,float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k ;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
{
#pragma acc kernels loop gang, vector(8)
    for (i=0; i<m; i++){
#pragma acc loop gang, vector (8)
        for (j=0; j<n; j++)
        {
            float sum = 0.0 ;
#pragma acc loop seq
            for (k=0; k<p; k++)
                sum += b[i*p+k]*c[k*n+j] ;
            a[i*n+j] = sum ;
        }
    }
}
}

$ pgcc -acc -c -Minfo=accel MatrixMultiplication_openacc.c  -Msafeptr -V12.3
MatrixMultiplication3:
      7, Generating copyout(a[:m*n])
         Generating copyin(c[:p*n])
         Generating copyin(b[:p*m])
      9, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     10, Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
     12, Loop is parallelizable
     14, Complex loop carried dependence of '*(a)' prevents parallelization
         Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
         Accelerator kernel generated
         10, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         12, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
         14, #pragma acc loop seq
             Non-stride-1 accesses for array 'b'
             CC 1.0 : 16 registers; 84 shared, 12 constant, 0 local memory bytes; 66% occupancy
             CC 2.0 : 26 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
MatrixMultiplication4:
     26, Generating copyout(a[:m*n])
         Generating copyin(c[:p*n])
         Generating copyin(b[:p*m])
     28, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     29, Loop carried dependence of '*(a)' prevents parallelization
         Loop carried backward dependence of '*(a)' prevents vectorization
     31, Loop is parallelizable
         Accelerator kernel generated
         29, #pragma acc loop gang, vector(8) /* blockIdx.y threadIdx.y */
         31, #pragma acc loop gang, vector(8) /* blockIdx.x threadIdx.x */
             CC 1.0 : 17 registers; 84 shared, 12 constant, 0 local memory bytes; 50% occupancy
             CC 2.0 : 26 registers; 8 shared, 92 constant, 0 local memory bytes; 33% occupancy
     35, Loop is parallelizable
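
For comparison, here is a minimal sketch of MatrixMultiplication4 rewritten with the OpenACC "parallel" construct instead of "kernels". This variant is illustrative only; it was not part of the code compiled above:

Code:
void
MatrixMultiplication5(float * restrict a, float * restrict b, float * restrict c, int m, int n, int p)
{
    int i, j, k ;

#pragma acc data copyout(a[0:(m*n)]), copyin(b[0:(m*p)],c[0:(p*n)])
{
/* with "parallel loop" the programmer asserts that the i and j iterations are independent */
#pragma acc parallel loop collapse(2)
    for (i=0; i<m; i++) {
        for (j=0; j<n; j++) {
            float sum = 0.0 ;
#pragma acc loop seq
            for (k=0; k<p; k++)
                sum += b[i*p+k]*c[k*n+j] ;
            a[i*n+j] = sum ;
        }
    }
}
}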
sobe118



Joined: 20 Jun 2012
Posts: 1

Posted: Thu Jun 21, 2012 7:31 am    Post subject: Having trouble running code

When trying to create an exe with "pgcc matmul4.c -Msafeptr" (I renamed the PGI Accelerator file to matmul4.c), I receive this error:

libcmt.lib(crt0.oobj) : error LNK2019: unresolved external symbol main referenced in function __tmainCRTStartup
matmul4.exe : fatal error LNK1120: 1 unresolved externals

I tried adding "int main(); int main() {}" and it was able to create the exe, but I have a feeling that's not a real fix; it doesn't seem to be running the matrix multiplication. Any ideas on a solution?

Thanks
toepfer



Joined: 04 Dec 2007
Posts: 47

Posted: Thu Jun 21, 2012 9:55 am

In the main program that you create, you will need to allocate and initialize the arrays a, b, and c that are used in the matrix multiply functions, and then call one of the MatrixMultiplication functions shown above. For example, here is what the code might look like when calling the PGI Accelerator Model MatrixMultiplication1 function:

Code:
#include <stdio.h>
#include <stdlib.h>

/* sizes must match the #defines in MatrixMultiplication.c */
#define M 1024
#define N 2048
#define P 4096

void MatrixMultiplication1(float a[M][N], float b[M][P], float c[P][N]);

int main()
{
    float *a, *b, *c;
    int i;

    a = (float *) malloc(M*N*sizeof(float));
    b = (float *) malloc(M*P*sizeof(float));
    c = (float *) malloc(P*N*sizeof(float));

    for (i = 0; i <  M*N; i++) {
       a[i] = 0.0f;
    }
    for (i = 0; i <  M*P; i++) {
       b[i] = (float) i;
    }
    for (i = 0; i <  P*N; i++) {
       c[i] = 1.0f;
    }

    /* the flat buffers are passed as the 2-D array types the routine expects */
    MatrixMultiplication1((float (*)[N])a, (float (*)[P])b, (float (*)[N])c);

    printf("%f %f\n", a[0], a[M*N-1]);

    free(a);
    free(b);
    free(c);
    return 0;
}
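
Assuming the driver above is saved as main.c next to the file containing the MatrixMultiplication routines, the compile-and-link step might look like this (file names here are illustrative):

Code:
$ pgcc -ta=nvidia -Minfo=accel -Msafeptr main.c MatrixMultiplication.c -o matmul4.exe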