PGI User Forum


Slow matmul despite optimization flags used

 
jamie9



Joined: 28 Dec 2004
Posts: 6

Posted: Thu Jan 05, 2006 2:28 pm    Post subject: Slow matmul despite optimization flags used

Hi,

I don't understand why PGI's matmul implementation is so slow. The following program
compares three methods of matrix multiplication. Result:

Explicit: 0.013
Dot_prod. 0.604
Matmul: 56.454

Compilation line:

pgf95 -fast -Mipa=fast -Mvect=nosse -O3 -tp=p6 matmul1.f90

Code:

   program Test_intr
   implicit NONE

      integer :: I
      real    :: T0, T1
      integer, parameter :: S = 2, N = 10000000
      double precision :: A(s,s), B(s,s)
      double precision :: X(s), Y(s), Z(s)

     A = reshape((/1,2,3,4/), (/s,s/))
     B = reshape((/6,7,8,9/), (/s,s/))
     X = (/ 1.1d0, 2.2d0 /)
     Y = (/ -7d0, 12d0 /)

     call cpu_time(T0)
     do i = -N, N
! i have changed the following code
       Z(1)=A(1,1)*X(1)+A(1,s)*X(s) -B(1,1)*Y(1)-B(1,s)*Y(s)
       a(s,s) = i  !  Against opt.
       Z(s)=A(s,1)*X(1)+A(s,s)*X(s) -B(s,1)*Y(1)-B(s,s)*Y(s)
     end do
     call cpu_time(T1)
     print "(' Explicit:', F8.3)" , T1 - T0

     call cpu_time(T0)
     do i = -N, N
! also equivalent [JvO]  :
       Z(1) = dot_product(A(1,:), X) - dot_product(B(1,:), Y)
       a(s,s) = i  !  Against opt.
       Z(s) = dot_product(A(s,:), X) - dot_product(B(s,:), Y)
     end do
     call cpu_time(T1)
     print "(' Dot_prod.', F8.3)" , T1 - T0

     call cpu_time(T0)
     do i = -N, N
! to the equivalent code
       Z = MATMUL(A, X) - MATMUL(B, Y)
       a(s,s) = i  !  Against opt.
     end do
     call cpu_time(T1)
     print "(' Matmul:  ', F8.3)" , T1 - T0
   end program Test_intr


Regards,
Jamie
mkcolg



Joined: 30 Jun 2004
Posts: 6146
Location: The Portland Group Inc.

Posted: Mon Jan 09, 2006 12:07 pm

Hi Jamie,

For a small, simple 2x2 matrix, your first two loops can give you a performance boost. Since MATMUL needs to accommodate a variety of shapes, it has a fixed overhead. As the complexity increases, you'll find that the performance difference shrinks. Also, using MATMUL is much easier and more flexible. Do you really want to rewrite your code any time the shape of a matrix changes?
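
One way to check the fixed-overhead point is a sketch like the one below, which times a hand-written matrix-vector loop against MATMUL for several sizes. The sizes and repetition counts are arbitrary assumptions chosen for illustration and are not part of the original test. The checksum is only there to keep the compiler from discarding the work. No timings are claimed here, since they depend on the compiler, flags, and hardware.

```fortran
program matmul_scaling
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   ! Hypothetical problem sizes and repetition counts (illustration only)
   integer, parameter :: sizes(4) = (/ 2, 8, 64, 256 /)
   integer, parameter :: reps(4)  = (/ 2000000, 200000, 2000, 100 /)
   real(dp), allocatable :: a(:,:), x(:), z(:)
   real(dp) :: checksum
   real     :: t0, t1, t_loop, t_mm
   integer  :: k, n, r, i, j

   do k = 1, size(sizes)
      n = sizes(k)
      allocate(a(n,n), x(n), z(n))
      call random_number(a)
      call random_number(x)
      checksum = 0.0_dp

      ! Hand-written matrix-vector product, column-major loop order
      call cpu_time(t0)
      do r = 1, reps(k)
         a(n,n) = real(r, dp)   ! change the input on every pass
         z = 0.0_dp
         do j = 1, n
            do i = 1, n
               z(i) = z(i) + a(i,j) * x(j)
            end do
         end do
         checksum = checksum + z(n)
      end do
      call cpu_time(t1)
      t_loop = t1 - t0

      ! Same product through the MATMUL intrinsic
      call cpu_time(t0)
      do r = 1, reps(k)
         a(n,n) = real(r, dp)   ! change the input on every pass
         z = matmul(a, x)
         checksum = checksum + z(n)
      end do
      call cpu_time(t1)
      t_mm = t1 - t0

      ! Report CPU seconds for both variants at this size
      print "(' N =', I5, '  loops:', F8.3, '  matmul:', F8.3, '  check:', ES12.4)", &
            n, t_loop, t_mm, checksum
      deallocate(a, x, z)
   end do
end program matmul_scaling
```

It can be built with the same compilation line as the original test, substituting the file name. Comparing the two timing columns as N grows should show whether the gap narrows on a given system.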

- Mat
jamie9



Joined: 28 Dec 2004
Posts: 6

Posted: Tue Jan 10, 2006 1:44 pm

Hi Mat,

You are absolutely right. I checked matmul speed for larger arrays and it runs
fast. Version 6.1 is a big step forward in my opinion. It is the first version that
compiles my code without any modifications (f95 + OpenMP 2.5 + a few common extensions).

Thanks,
Jamie
mkcolg



Joined: 30 Jun 2004
Posts: 6146
Location: The Portland Group Inc.

Posted: Tue Jan 10, 2006 3:33 pm

Thanks, Jamie. A lot of effort went into making our OpenMP implementation 2.5 compliant, so I'm glad to hear that it can be put to good use. You'll also be glad to know that in the past year the Portland Group has joined the OpenMP ARB (Architecture Review Board) and is helping shape the future of OpenMP.

- Mat
All times are GMT - 7 Hours