PGI User Forum


Why is library-based DGEMM slower than the DGEMM source itself?
Goto page 1, 2, 3  Next
 
PGI User Forum Forum Index -> Performance and Benchmarking
Sharp



Joined: 28 Aug 2008
Posts: 17

Posted: Thu Nov 20, 2008 2:04 pm    Post subject: Why is library-based DGEMM slower than the DGEMM source itself?

Hi, I have developed the following code to perform matrix multiplication.

Code:
      do i=1,n
        do j=1,n
          XA(j,i)=Zero
          XCa(j,i)=Zero
          XMA(j,i)=Zero
          XMB(j,i)=Zero
        end do
      end do
C
      itmp2=0
      do i=1,n
         XA(i,i)=real(((i*(i-1))/2)+i)
        if(i.eq.(itmp2*68+1))then
          itmp2=itmp2+1
          do j=i+1,n
            l=((j*(j-1))/2)+i
            XA(j,i)=real(l)
            XA(i,j)=XA(j,i)
          end do
        endif
      end do
C
      do i=1,n
        do j=1,n
          XCa(j,i)=One/(Ten**4)
        end do
      end do
C
      call FDate(DayTim)
      write(*,"(5X,A)") DayTim
C
      Do i=1,25
        call XGEMM(1,'N','N',n,n,n,one,XCa,n,XA,n,one,XMA,n)
        call XGEMM(1,'N','T',n,n,n,one,XCa,n,XA,n,one,XMB,n)
      End do
C
      call FDate(DayTim)
      write(*,"(5X,A)") DayTim


where the main part of XGEMM is:

Code:
        XM = M
        XK = K
        XN = N
        XLDA = LDA
        XLDB = LDB
        XLDC = LDC
        MinCoW = 3
        ....
C$OMP Parallel
C$OMP Single
C$      Np=OMP_GET_NUM_THREADS()
C$OMP End Single
C$OMP End Parallel
        ColPW = Max((N+NP-1)/NP,MinCoW)
        NWork = (N+ColPW-1)/ColPW
        IncB = LDB*ColPW
        IncC = ColPW*LDC
C$OMP Parallel Do Default(Shared) Schedule(Static,1) Private(IP,XN)
          Do 100 IP = 0, (NWork-1)
            XN = Min(N-IP*ColPW,ColPW)
            Call DGEMM(XStr1,XStr2,XM,XN,XK,Alpha,A,XLDA,B(1+IP*IncB),
     $          XLDB,Beta,C(1+IP*IncC),XLDC)
  100       Continue


As the first part of the code shows, XA is a very sparse matrix (over 99% zeros; the matrices are 3000x3000). Since the DGEMM source contains a zero test on one operand when the multiplication has the form XCa*XA or XCa*XA(T), I write both multiplications in that form.

Then the problem arises: when I compile this code with the PGI 7.1.6 library, the 25 matrix multiplications take about 635 seconds. The parallel speedup, however, is very good: on 8 processors I get a speedup of around 7.

However, when I compile the code without any library (with the DGEMM source included in my own source file), the 25 multiplications take only about 55 seconds. But the parallel speedup is quite poor: no matter how many processors I use, the speedup never exceeds 2.

I am very confused. I thought the library-based DGEMM should be much faster than the plain source, whether or not XA is sparse. So why is it slower? And why does the non-library DGEMM scale so poorly? Can someone help me with this? Thank you so much.
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Fri Nov 21, 2008 2:55 pm

Hi Sharp,

Which DGEMM source and BLAS library are you using? Also, what optimizations are you using to compile and link?

PGI ships two BLAS libraries, "-lblas" and "-lacml". ACML is AMD's optimized math library, while our generic BLAS library is built from the Netlib source (see: http://www.netlib.org/blas/dgemm.f) and compiled with "-Kieee", which disables some high-level optimizations in order to adhere to the IEEE 754 standard.

- Mat
Sharp



Joined: 28 Aug 2008
Posts: 17

Posted: Mon Nov 24, 2008 5:59 am

Hi Mat,

I have used the following script to compile:

Code:
pgf77 -i8 -mp -O2 -tp p7-64 -Kieee -time -fast -o xgemm.exe xgemm.F -lacml


Where I have added the "-Kieee" you mentioned. However, compiling this way changes nothing: the 25 matrix multiplications still take around 630 seconds.

For the non-library version I include the same DGEMM source you pointed me to, and it still finishes the 25 multiplications in about 55 seconds. But, as before, since it is not library based, its parallel speedup is quite poor.

I am still confused about this. Using the library should be much faster than compiling the source directly, right?

Sharp
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Mon Nov 24, 2008 10:48 am

Hi Sharp,

Quote:
I am still confused about this. Using library should be much better than using code only, right?
Since ACML is AMD's math kernel library, tuned for AMD chips, I would expect some performance penalty when running on an Intel-based system, though more than ten times slower seems a bit much. You can try downloading a later version (See: HERE) to see if they have done more tuning for Intel chips. Otherwise, I would contact AMD.

I would be interested to see how the performance changes with Intel's MKL or even with GotoBLAS.

- Mat
Sharp



Joined: 28 Aug 2008
Posts: 17

Posted: Tue Nov 25, 2008 9:00 am

mkcolg wrote:

Quote:
You can try downloading a later version (See: HERE) to see if they have done more tuning for Intel chips.


Hi Mat,

Since the program I am actually working with is not built against MKL or GotoBLAS, to simulate it I have to build my stand-alone code with the same kind of library, i.e. the PGI libraries. I haven't tried MKL yet. I have downloaded the latest ACML (version 4.2.0) and compiled my code with pgf77 (version 7.1.6). To finish 2 matrix multiplications the new library-based code takes 63 seconds, while the non-library version takes less than 5 seconds. So the performance doesn't change much.

Because one of the matrices used in the multiplication is quite sparse, I looked at the DGEMM source and found that it contains a zero test on one matrix, i.e.:

Code:
C
C           Form  C := alpha*A*B + beta*C.
C
            DO 90, J = 1, N
               ........
               DO 80, L = 1, K
                  IF( B( L, J ).NE.ZERO )THEN
                     TEMP = ALPHA*B( L, J )
                     DO 70, I = 1, M
                        C( I, J ) = C( I, J ) + TEMP*A( I, L )
   70                CONTINUE
                  END IF
   80          CONTINUE
   90       CONTINUE


When the DGEMM source is used directly, this zero test seems to work quite well, but when the library is used (no matter which version), the zero test apparently has no effect. On the other hand, this zero test seems to be exactly what breaks the OpenMP speedup of the DGEMM source.

I really hope this problem can be solved. Thank you very much.

Sharp