Sharp
Joined: 28 Aug 2008 Posts: 17
|
Posted: Thu Nov 20, 2008 2:04 pm Post subject: Why library based DGEMM slower than DGEMM code itself? |
|
|
Hi, I have developed the following code to calculate matrix multiplication.
| Code: | do i=1,n
do j=1,n
XA(j,i)=Zero
XCa(j,i)=Zero
XMA(j,i)=Zero
XMB(j,i)=Zero
end do
end do
C
itmp2=0
do i=1,n
XA(i,i)=real(((i*(i-1))/2)+i)
if(i.eq.(itmp2*68+1))then
itmp2=itmp2+1
do j=i+1,n
l=((j*(j-1))/2)+i
XA(j,i)=real(l)
XA(i,j)=XA(j,i)
end do
endif
end do
C
do i=1,n
do j=1,n
XCa(j,i)=One/(Ten**4)
end do
end do
C
call FDate(DayTim)
write(*,"(5X,A)") DayTim
C
Do i=1,25
call XGEMM(1,'N','N',n,n,n,one,XCa,n,XA,n,one,XMA,n)
call XGEMM(1,'N','T',n,n,n,one,XCa,n,XA,n,one,XMB,n)
End do
C
call FDate(DayTim)
write(*,"(5X,A)") DayTim |
where the main part of XGEMM is:
| Code: | XM = M
XK = K
XN = N
XLDA = LDA
XLDB = LDB
XLDC = LDC
MinCoW = 3
....
C$OMP Parallel
C$OMP Single
C$ Np=OMP_GET_NUM_THREADS()
C$OMP End Single
C$OMP End Parallel
ColPW = Max((N+NP-1)/NP,MinCoW)
NWork = (N+ColPW-1)/ColPW
IncB = LDB*ColPW
IncC = ColPW*LDC
C$OMP Parallel Do Default(Shared) Schedule(Static,1) Private(IP,XN)
Do 100 IP = 0, (NWork-1)
XN = Min(N-IP*ColPW,ColPW)
Call DGEMM(XStr1,XStr2,XM,XN,XK,Alpha,A,XLDA,B(1+IP*IncB),
$ XLDB,Beta,C(1+IP*IncC),XLDC)
100 Continue |
As the first part of the code shows, XA is very sparse (more than 99% zeros; all matrices are 3000*3000). The DGEMM source tests the second matrix of the product for zeros, so I wrote both multiplications with XA as the second operand, i.e. as XCa*XA and XCa*XA(T).
Here is the problem: when I link against the library that ships with PGI 7.1.6, the 25 iterations of the multiplication loop take about 635 seconds. The parallel speedup is very good, though: about 7 on 8 processors.
However, when I compile without any library (with the DGEMM source included in my own code), the same 25 iterations take about 55 seconds. In this case the parallel speedup is poor: no matter how many processors I use, it never exceeds 2.
I am very confused. I thought the library DGEMM should be much faster than the source version whether or not XA is sparse. Why is it slower? And why does the non-library DGEMM scale so poorly? Can someone help me with this? Thank you so much. |
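For reference, the column partition that XGEMM builds can be checked on its own. The following is a minimal free-form Fortran sketch, not part of the original program: it assumes n = 3000 and NP = 8 threads, reuses the ColPW/NWork arithmetic from the post, and replays the XA fill loop only to count nonzeros per column. The program name `colwork` and the array `nnzcol` are hypothetical. With the zero test in the reference DGEMM source, the nonzero count of each column block is a rough proxy for the work each thread receives.

```fortran
! Sketch only: free-form Fortran (e.g. colwork.f90), assumed n and np.
program colwork
  implicit none
  integer, parameter :: n = 3000      ! matrix dimension from the post
  integer, parameter :: np = 8        ! assumed number of threads
  integer, parameter :: mincow = 3    ! MinCoW from the post
  integer :: nnzcol(n)                ! nonzeros in each column of XA
  integer :: colpw, nwork, ip, xn, j0, blocknnz
  integer :: i, j, itmp2

  ! Replay the XA fill pattern, counting entries instead of storing them.
  nnzcol = 0
  itmp2 = 0
  do i = 1, n
     nnzcol(i) = nnzcol(i) + 1            ! diagonal entry XA(i,i)
     if (i == itmp2*68 + 1) then
        itmp2 = itmp2 + 1
        do j = i + 1, n
           nnzcol(i) = nnzcol(i) + 1      ! XA(j,i), below the diagonal
           nnzcol(j) = nnzcol(j) + 1      ! XA(i,j), mirrored entry
        end do
     end if
  end do

  ! Same block arithmetic as in XGEMM.
  colpw = max((n + np - 1) / np, mincow)
  nwork = (n + colpw - 1) / colpw

  do ip = 0, nwork - 1
     xn = min(n - ip*colpw, colpw)        ! columns in this block
     j0 = ip*colpw                        ! columns before this block
     blocknnz = sum(nnzcol(j0+1 : j0+xn))
     print '(A,I2,A,I5,A,I5,A,I8)', 'block ', ip, ': columns ', j0+1, &
           ' to ', j0+xn, ', nonzeros ', blocknnz
  end do

  print '(A,I8,A,I8)', 'total nonzeros ', sum(nnzcol), ' of ', n*n
end program colwork
```

Because XA is symmetric, the per-block counts apply to both the 'N','N' and the 'N','T' call. Strongly differing counts between blocks would point to load imbalance under Schedule(Static,1); similar counts would suggest looking elsewhere, for example at memory bandwidth.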
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Fri Nov 21, 2008 2:55 pm Post subject: |
|
|
Hi Sharp,
Which DGEMM source and BLAS library are you using? Also, what optimizations are you using to compile and link?
PGI ships two BLAS libraries, "-lblas" and "-lacml". ACML is AMD's optimized math library, while our generic BLAS library is built from the Netlib source (see: http://www.netlib.org/blas/dgemm.f) and is compiled with "-Kieee", which disables some high-level optimizations in order to adhere to the IEEE 754 standard.
- Mat |
|
Sharp
Joined: 28 Aug 2008 Posts: 17
|
Posted: Mon Nov 24, 2008 5:59 am Post subject: |
|
|
Hi Mat,
I used the following command to compile:
| Code: | pgf77 -i8 -mp -O2 -tp p7-64 -Kieee -time -fast -o xgemm.exe xgemm.F -lacml |
Here I have added the -Kieee flag you mentioned, but nothing has changed: the 25 iterations still take around 630 seconds.
For the non-library version I use the same DGEMM source you pointed to, included in my own code. That version still finishes the 25 iterations in 55 seconds. But I assume that, because it is not library-based, its parallel speedup is quite poor.
I am still confused about this. Using the library should be much better than using the code only, right?
Sharp |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Mon Nov 24, 2008 10:48 am Post subject: |
|
|
Hi Sharp,
| Quote: | I am still confused about this. Using the library should be much better than using the code only, right? |
Since ACML is AMD's math kernel library and is tuned for AMD chips, I would expect some performance penalty when running on an Intel-based system, though 25 times slower seems a bit much. You can try downloading a later version (See: HERE) to see if they have done more tuning for Intel chips. Otherwise, I would contact AMD.
I would be interested to see how the performance changes with Intel's MKL or even with GotoBLAS.
- Mat |
|
Sharp
Joined: 28 Aug 2008 Posts: 17
|
Posted: Tue Nov 25, 2008 9:00 am Post subject: |
|
|
| Quote: | mkcolg wrote: You can try downloading a later version (See: HERE) to see if they have done more tuning for Intel chips. |
Hi Mat,
The program I am actually working on is not built with MKL or GotoBLAS, so to mimic it my stand-alone test has to use the same kind of library, i.e. the PGI libraries. I haven't tried MKL yet. I downloaded the latest ACML (version 4.2.0) and compiled my code with pgf77 (version 7.1.6). For 2 matrix multiplications, the new library-based build takes 63 seconds while the non-library build takes less than 5 seconds, so the performance hasn't changed much.
One of the matrices in the multiplication is quite sparse, and I found that the DGEMM source contains a zero test on one matrix, i.e.:
| Code: | C
C Form C := alpha*A*B + beta*C.
C
DO 90, J = 1, N
........
DO 80, L = 1, K
IF( B( L, J ).NE.ZERO )THEN
TEMP = ALPHA*B( L, J )
DO 70, I = 1, M
C( I, J ) = C( I, J ) + TEMP*A( I, L )
70 CONTINUE
END IF
80 CONTINUE
90 CONTINUE
|
When I use the DGEMM source directly, this zero test seems to work quite well, but when the library is used (whichever version), the zero test seems to have no effect at all. On the other hand, the zero test seems to break the OpenMP speedup of the DGEMM source version.
I really hope this problem can be solved. Thank you very much.
Sharp |
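The effect of the zero test can be isolated without any BLAS library. The sketch below is a hedged illustration, not the code from the thread: it is free-form Fortran, uses an assumed size n = 600 instead of 3000, and stands in a purely diagonal matrix for the sparse operand. It runs the same triple loop as the DGEMM excerpt once with and once without the test on B(L,J) and prints both CPU times. The program name `zerotest` and all variable names are hypothetical.

```fortran
! Sketch only: free-form Fortran (e.g. zerotest.f90), assumed sizes and data.
program zerotest
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 600          ! assumed, smaller than in the post
  real(dp), allocatable :: a(:,:), b(:,:), c1(:,:), c2(:,:)
  real(dp) :: temp
  real :: t0, t1, t2
  integer :: i, j, l

  allocate(a(n,n), b(n,n), c1(n,n), c2(n,n))
  a = 1.0e-4_dp                          ! dense operand, like XCa
  b = 0.0_dp                             ! sparse operand: diagonal only
  do i = 1, n
     b(i,i) = real(i, dp)
  end do
  c1 = 0.0_dp
  c2 = 0.0_dp

  call cpu_time(t0)

  ! Loop nest with the zero test, as in the reference DGEMM excerpt.
  do j = 1, n
     do l = 1, n
        if (b(l,j) /= 0.0_dp) then
           temp = b(l,j)
           do i = 1, n
              c1(i,j) = c1(i,j) + temp*a(i,l)
           end do
        end if
     end do
  end do

  call cpu_time(t1)

  ! Same loop nest without the zero test: every B(L,J) costs a full pass.
  do j = 1, n
     do l = 1, n
        temp = b(l,j)
        do i = 1, n
           c2(i,j) = c2(i,j) + temp*a(i,l)
        end do
     end do
  end do

  call cpu_time(t2)

  print *, 'with zero test    (s): ', t1 - t0
  print *, 'without zero test (s): ', t2 - t1
  print *, 'max abs difference   : ', maxval(abs(c1 - c2))
end program zerotest
```

A tuned library kernel is generally organized around dense blocks rather than this column-by-column update, so it may not contain an equivalent per-element test; whether ACML does is not something this sketch can show. It only indicates how much of the 55-second versus 635-second gap could come from skipped work rather than from code quality.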
|