Sharp
Joined: 28 Aug 2008 Posts: 17

Posted: Thu Nov 20, 2008 2:04 pm Post subject: Why library based DGEMM slower than DGEMM code itself? 


Hi, I have developed the following code to calculate matrix multiplication.
Code:
      do i=1,n
        do j=1,n
          XA(j,i)=Zero
          XCa(j,i)=Zero
          XMA(j,i)=Zero
          XMB(j,i)=Zero
        end do
      end do
C
      itmp2=0
      do i=1,n
        XA(i,i)=real(((i*(i-1))/2)+i)
        if(i.eq.(itmp2*68+1))then
          itmp2=itmp2+1
          do j=i+1,n
            l=((j*(j-1))/2)+i
            XA(j,i)=real(l)
            XA(i,j)=XA(j,i)
          end do
        endif
      end do
C
      do i=1,n
        do j=1,n
          XCa(j,i)=One/(Ten**4)
        end do
      end do
C
      call FDate(DayTim)
      write(*,"(5X,A)") DayTim
C
      Do i=1,25
        call XGEMM(1,'N','N',n,n,n,one,XCa,n,XA,n,one,XMA,n)
        call XGEMM(1,'N','T',n,n,n,one,XCa,n,XA,n,one,XMB,n)
      End do
C
      call FDate(DayTim)
      write(*,"(5X,A)") DayTim
where the main part of XGEMM is:
Code:
      XM = M
      XK = K
      XN = N
      XLDA = LDA
      XLDB = LDB
      XLDC = LDC
      MinCoW = 3
      ....
C$OMP Parallel
C$OMP Single
C$    Np=OMP_GET_NUM_THREADS()
C$OMP End Single
C$OMP End Parallel
      ColPW = Max((N+NP-1)/NP,MinCoW)
      NWork = (N+ColPW-1)/ColPW
      IncB = LDB*ColPW
      IncC = ColPW*LDC
C$OMP Parallel Do Default(Shared) Schedule(Static,1) Private(IP,XN)
      Do 100 IP = 0, (NWork-1)
        XN = Min(N-IP*ColPW,ColPW)
        Call DGEMM(XStr1,XStr2,XM,XN,XK,Alpha,A,XLDA,B(1+IP*IncB),
     $             XLDB,Beta,C(1+IP*IncC),XLDC)
  100 Continue
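The column-block arithmetic in this loop can be checked on its own. The following is only a minimal sketch, not part of XGEMM: it assumes the formulas read N+NP-1, N+ColPW-1 and N-IP*ColPW, takes n = 3000 from the post, and assumes 8 threads to match the 8-processor run. It is written in free-form Fortran (save it as a .f90 file), and the program name is made up for illustration.

```fortran
program partition_sketch
  implicit none
  integer :: n, np, mincow, colpw, nwork, ip, xn

  n = 3000      ! matrix order quoted in the post
  np = 8        ! assumed thread count (hypothetical, matches the 8-processor run)
  mincow = 3    ! minimum columns per block, as in XGEMM

  ! Columns per block: ceiling(n/np), but never fewer than mincow
  colpw = max((n + np - 1)/np, mincow)
  ! Number of blocks: ceiling(n/colpw)
  nwork = (n + colpw - 1)/colpw

  do ip = 0, nwork - 1
    ! The last block may be narrower than colpw
    xn = min(n - ip*colpw, colpw)
    print '(A,I0,A,I0,A,I0)', 'block ', ip, ': first column ', &
          1 + ip*colpw, ', width ', xn
  end do
end program partition_sketch
```

With these values the partition comes out as 8 blocks of 375 columns each, so every thread receives the same number of columns of B and C.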
As the first part of the code shows, XA is a very sparse matrix (>99% zeros; the matrices are 3000*3000). The DGEMM source tests the elements of its second matrix for zero, so I wrote both multiplications in the form XCa*XA or XCa*XA(T) so that the test applies to XA.
Here is the problem: when I link this code against the library shipped with PGI 7.1.6, running the matrix multiplication 25 times takes about 635 seconds. The parallel speedup is very good, though: on 8 processors I get a speedup of around 7.
However, when I compile the code without any library (with the DGEMM source included in my own source code), the 25 multiplications take about 55 seconds. But the parallel speedup is quite poor: no matter how many processors I use, it never exceeds 2.
I am very confused. I thought the library-based DGEMM should be much faster than the source version whether or not XA is sparse. Why is it slower? And why does the non-library DGEMM scale so poorly? Can someone help me with this? Thank you so much.



mkcolg
Joined: 30 Jun 2004 Posts: 6070 Location: The Portland Group Inc.

Posted: Fri Nov 21, 2008 2:55 pm Post subject: 


Hi Sharp,
Which DGEMM source and BLAS library are you using? Also, which optimization flags are you using to compile and link?
PGI ships two BLAS libraries, "-lblas" and "-lacml". ACML is AMD's optimized math library, while our generic BLAS library uses the NetLIB source (See: http://www.netlib.org/blas/dgemm.f) and is compiled with "-Kieee". "-Kieee" disables some high-level optimizations in order to adhere to the IEEE 754 standard.
- Mat



Sharp
Joined: 28 Aug 2008 Posts: 17

Posted: Mon Nov 24, 2008 5:59 am Post subject: 


Hi Mat,
I used the following command to compile:
Code:  pgf77 -i8 -mp -O2 -tp p7-64 -Kieee -time -fast -o xgemm.exe xgemm.F -lacml
where I have added the -Kieee flag you mentioned. However, compiling this way changed nothing: running the matrix multiplication 25 times still takes around 630 seconds.
For the non-library version I use the same DGEMM code you pointed to, in my own source. This version still takes 55 seconds for the 25 multiplications. But I assume that, since it is not library based, its parallel speedup is quite poor.
I am still confused about this. Using the library should be much better than using the source code alone, right?
Sharp 



mkcolg
Joined: 30 Jun 2004 Posts: 6070 Location: The Portland Group Inc.

Posted: Mon Nov 24, 2008 10:48 am Post subject: 


Hi Sharp,
Quote:  I am still confused about this. Using library should be much better than using code only, right?
Since ACML is AMD's math kernel library and is tuned for AMD chips, I would expect some performance penalty when running on an Intel-based system, though 25 times slower seems a bit much. You can try downloading a later version (See: HERE) to see if they have done more tuning for Intel chips. Otherwise, I would contact AMD.
I would be interested to see how the performance changes with Intel's MKL or even with GotoBLAS.
- Mat



Sharp
Joined: 28 Aug 2008 Posts: 17

Posted: Tue Nov 25, 2008 9:00 am Post subject: 


Quote:  You can try downloading a later version (See: HERE) to see if they have done more tuning for Intel chips.
Hi Mat,
The program I am dealing with is not built with MKL or GotoBLAS, so to simulate it I have to compile my standalone code with the same kind of library, i.e. the PGI libraries. I haven't tried the MKL library yet. I downloaded the latest ACML library (version 4.2.0) and compiled my code with pgf77 (version 7.1.6). Running the matrix multiplication twice takes 63 seconds with the new library, while the non-library code takes less than 5 seconds. So the performance doesn't change much.
One of the matrices used in the multiplication is quite sparse, and I found that the DGEMM code has a zero test on one matrix, i.e.:
Code:
C
C     Form C := alpha*A*B + beta*C.
C
      DO 90, J = 1, N
      ........
          DO 80, L = 1, K
              IF( B( L, J ).NE.ZERO )THEN
                  TEMP = ALPHA*B( L, J )
                  DO 70, I = 1, M
                      C( I, J ) = C( I, J ) + TEMP*A( I, L )
   70             CONTINUE
              END IF
   80     CONTINUE
   90 CONTINUE

When the DGEMM source is used directly, this zero test seems to work quite well, but when the library is used (whatever the version), the zero test does not seem to take effect at all. On the other hand, the zero test seems to break the speedup of the OpenMP parallelization of the DGEMM code.
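One way to see how much work the zero test leaves for each thread is to count the nonzero entries of XA per column block, since in the 'N','N' case each nonzero B(L,J) triggers one inner loop of length M. The following is a rough sketch under stated assumptions, not a measurement of the real code: it rebuilds the XA pattern from the first listing (diagonal, plus full rows and columns at i = 1, 69, 137, ...), assumes n = 3000 and 8 threads, and reuses the XGEMM block arithmetic. It is written in free-form Fortran, and the program and variable names are made up for illustration.

```fortran
program zero_test_work
  implicit none
  integer, parameter :: n = 3000   ! matrix order quoted in the post
  integer, parameter :: np = 8     ! assumed thread count (hypothetical)
  integer :: j, ip, colpw, nwork, jfirst, jlast, ndense
  integer :: nnz_col(n)

  ! Count the nonzeros in each column j of XA:
  !   1 for the diagonal,
  !   + one entry for every "dense" row index i < j (i = 1, 69, 137, ...),
  !   + (n - j) entries below the diagonal if j itself is a dense index.
  ndense = 0
  do j = 1, n
    nnz_col(j) = 1 + ndense
    if (mod(j - 1, 68) == 0) then
      nnz_col(j) = nnz_col(j) + (n - j)
      ndense = ndense + 1
    end if
  end do

  ! Same column-block partition as in XGEMM (MinCoW = 3)
  colpw = max((n + np - 1)/np, 3)
  nwork = (n + colpw - 1)/colpw

  do ip = 0, nwork - 1
    jfirst = 1 + ip*colpw
    jlast = min(n, jfirst + colpw - 1)
    print '(A,I0,A,I0)', 'block ', ip, ': nonzeros passing the test = ', &
          sum(nnz_col(jfirst:jlast))
  end do
  print '(A,I0,A,I0)', 'total nonzeros ', sum(nnz_col), ' of ', n*n
end program zero_test_work
```

If the per-block counts turn out to be similar, load imbalance from the zero test is unlikely to be what limits the speedup, and other causes would be worth checking.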
I really hope this problem can be solved. Thank you very much.
Sharp 





