PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

Matrix multiplication parallelizing

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
Mikee



Joined: 19 Apr 2010
Posts: 3

PostPosted: Sun May 23, 2010 10:23 am    Post subject: Matrix multiplication parallelizing Reply with quote

Hello. I'm sorry for silly question, but i'm new in parallel programming, so i've got difficulties. I wrote simple triple loop that multiplies two matrixes (matrix B is transposed):
Code:
start = clock();
#pragma acc region copyin(a[0:n-1][0:n-1],b[0:n-1][0:n-1]) copyout(c[0:n-1][0:n-1])
{
   for(i=0;i<n;i++)
   {
      for(j=0;j<n;j++)
      {
         t=0;
         for(k=0;k<n;k++)
            t += (a[i][k]*b[j][k]);
         c[i][j]=t;
      }
   }
}
printf("on GPU = %f", (double) (clock() - start) / CLOCKS_PER_SEC);


Compiler writes:
Code:
C:\PGI\win32\10.4\bin>pgcc matrix.c -fast -ta=nvidia,time -Minfo
main:
     43, Generating copyout(c[:n-1][:n-1])
         Generating copyin(b[:n-1][:n-1])
         Generating copyin(a[:n-1][:n-1])
         Generating compute capability 1.0 kernel
         Generating compute capability 1.3 kernel
     45, Loop is parallelizable
         Accelerator kernel generated
         45, #pragma acc for parallel, vector(256)
     46, Loop is parallelizable
     49, Loop is parallelizable


So all loops are parallelizable and everything seems to be good, but execution times are strange, for matrixes 900x900 they are:
on GPU = 4.016000 sec
on CPU = 1.453000 sec
Please, give me advice how to optimize this code for GPU, or what is wrong with it.
Also there is strange time statistics with the key -ta=nvidia,time:
Code:
  main
    43: region entered 1 time
        time(us): total=8000000
                  kernels=3951644 data=4048356

I always got integer even "total" time like 2,4,6,8 seconds and so on, and this time is much greater than execution time. Also there is no statistics about init time, but if "total" time becomes less than 2 sec, it writes:
Code:
  main
    43: region entered 1 time
        time(us): init=0

My GPU is GeForce 9500 GT.
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

PostPosted: Thu May 27, 2010 11:52 am    Post subject: Reply with quote

Hi Mikee,

I tried unsuccessfully to recreate your problem here. Can you please post your full source or send it to PGI Customer Support (trs@pgroup.com)?

Thanks,
Mat
Back to top
View user's profile
Mikee



Joined: 19 Apr 2010
Posts: 3

PostPosted: Thu May 27, 2010 1:06 pm    Post subject: Reply with quote

Thank you Mat for the reply. I deleted all useless stuff, and here is the code with same problem:

Code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main( int argc, char* argv[] )
{
   int n; //size of the matrix
   float **restrict a;
   float **restrict b;
   float **restrict c; //acc result matrix
   float **restrict d; //host result matrix
   int i,j,k;
   float t; //temp sum
   clock_t start;
   n = atoi( argv[1] );

   a = (float**) malloc( sizeof(float*) * n );
   a[0] = (float*) malloc( sizeof(float) * n * n );
   b = (float**) malloc( sizeof(float*) * n );
   b[0] = (float*) malloc( sizeof(float) * n * n );
   c = (float**) malloc( sizeof(float*) * n );
   c[0] = (float*) malloc( sizeof(float) * n * n );
   d = (float**) malloc( sizeof(float*) * n );
   d[0] = (float*) malloc( sizeof(float) * n * n );
   for( i = 1; i < n; i++ )
   {
      a[i] = a[i-1] + n;
      b[i] = b[i-1] + n;
      c[i] = c[i-1] + n;
      d[i] = d[i-1] + n;
   }
   
    for( i = 0; i < n; ++i )
      for(j=0;j<n;j++)
      {
         a[i][j] = (float)(0.4*i+2.3*j+1);
         b[j][i] = (float)(3*i+0.7*j+1); //transpose matrix
         c[i][j] = 0;
         d[i][j] = 0;
      }         

   start = clock();
   #pragma acc region
   {
      for(i=0;i<n;i++)
         for(j=0;j<n;j++)
         {
            t=0;
            for(k=0;k<n;k++)
               t += (a[i][k]*b[j][k]);
            c[i][j]=t;
         }
   }
   printf("time GPU = %f", (double) (clock() - start) / CLOCKS_PER_SEC);

   start = clock(); //compare on the host
   {
      for(i=0;i<n;i++)
         for(j=0;j<n;j++)
         {
            t=0;
            for(k=0;k<n;k++)
               t += (a[i][k]*b[j][k]);
            d[i][j]=t;
         }
    }
   printf("\ntime CPU = %f", (double) (clock() - start) / CLOCKS_PER_SEC);
   free(a);
   free(b);
   free(c);
   free(d);
   return 0;
}
Back to top
View user's profile
Mikee



Joined: 19 Apr 2010
Posts: 3

PostPosted: Mon May 31, 2010 1:16 am    Post subject: Reply with quote

Ok, i've read PGIUG carefully and found the "kernel" loop scheduling clause. It is described: "The kernel clause tells the compiler that the body of this loop is to be the body of the computational kernel. Any loops contained within the kernel loop are executed sequentially on the accelerator". So i rewrote the code to store matrixes in one-dimensional arrays. My plan was to make double loop instead of triple, and with using the "kernel" clause, force outer loop execute parallel with iterations independence and inner loop - sequentially.
Definitions of arrays are like this:
Code:
double *restrict a;
a = malloc( sizeof(double) *n*n );

Now my main loop is:
Code:
   #pragma acc data region copyin(a[0:n*n-1],b[0:n*n-1]) copyout(c[0:n*n-1]) local(t,p,r)
   #pragma acc region for kernel
   for(i=0;i<n*n;i++)
   {
      t=0;
      p=i/n;
      r=i-p*n;
      for(k=0;k<n;k++)
         t += (a[p*n+k]*b[n*r+k]);
      c[i]=t;
   }

And compiler still says:
Code:
   39, Loop is parallelizable

Where line No.39 is loop "for(k=0;k<n;k++)"
How can it parallelize this loop without sum reduction (it writes nothing about applying sum reduction)? Or phrase "Loop is parallelizable" means only possibility of parallelizing, but not that loop IS PARALLELIZED?


Last edited by Mikee on Wed Jun 02, 2010 8:34 pm; edited 1 time in total
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

PostPosted: Tue Jun 01, 2010 12:38 pm    Post subject: Reply with quote

Hi Mikee,

Sorry about only getting back to this now.

Quote:
How can it parallelize this loop without sum reduction (it writes nothing about applying sum reduction)? Or phrase "Loop is parallelizable" means only possibility of parallelizing, but not that loop IS PARALLELIZED?
Correct. The message means only that it is possible to parallelize the code, not that it was parallelized. The -Minfo message "Accelerator kernel generated" tells you what was actually accelerated.

The problem with the original loop is that since you're using pointers, the compiler is only able to accelerate the outer loop. Hence, you're only getting 900 threads and slower performance. Our compiler team is aware of the issue and hopefully can enhance our analysis so both loops are accelerated. In the meantime, moving to a single dimension is a good strategy.

- Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group