PGI User Forum


Why large scale DGEMM parallelization appears strange?

 
Sharp



Joined: 28 Aug 2008
Posts: 17

Posted: Thu Aug 28, 2008 10:38 am    Post subject: Why large scale DGEMM parallelization appears strange?

Hi, I am working on a program that uses DGEMM for matrix multiplication. The compiler I am using is PGI 707/pgf77. In this program the call to DGEMM has already been parallelized over column blocks of C:

Code:
C$OMP Parallel
C$OMP Single
C$      NP=omp_get_num_threads()
C$      MinCoW=16
C$OMP End Single
C$OMP End Parallel
          ColPW = Max((N+NP-1)/NP,MinCoW)
C         N is the number of columns of C(M,N).
          NWork = (N+ColPW-1)/ColPW
          If(XStr2.eq.'T'.or.XStr2.eq.'C') then
              IncB = 1
          else
              IncB = LDB
          endIf
          IncB = IncB*ColPW
          IncC = ColPW*LDC
C$OMP Parallel Do Default(Shared) Schedule(Static,1) Private(IP,XN)
          Do 100 IP = 0, (NWork-1)
              XN = Min(N-IP*ColPW,ColPW)
              Call DGEMM(XStr1,XStr2,XM,XN,XK,Alpha,A,XLDA,B(1+IP*IncB),
     $          XLDB,Beta,C(1+IP*IncC),XLDC)
100      Continue
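
As a rough check of the partitioning arithmetic above, the following standalone sketch prints the block width (ColPW), the number of DGEMM calls (NWork) and the width of the last block for 1 to 8 threads. N = 3432 and MinCoW = 16 are taken from the post; the program name ChkPart, the variable XNLast and the loop over thread counts are illustrative assumptions, not part of the original program.

```fortran
C     Sketch only: reproduces the column-partition arithmetic from the
C     post for several thread counts. N and MinCoW come from the post;
C     ChkPart, XNLast and the loop bounds are illustrative assumptions.
      Program ChkPart
      Implicit None
      Integer N, NP, MinCoW, ColPW, NWork, XNLast
      Parameter (N = 3432, MinCoW = 16)
      Do 10 NP = 1, 8
C        Columns per work item: ceiling(N/NP), but at least MinCoW
         ColPW = Max((N+NP-1)/NP, MinCoW)
C        Number of DGEMM calls: ceiling(N/ColPW)
         NWork = (N+ColPW-1)/ColPW
C        Width of the last (possibly narrower) column block
         XNLast = N - (NWork-1)*ColPW
         Write(*,'(4(A,I5))') ' NP=', NP, ' ColPW=', ColPW,
     $        ' NWork=', NWork, ' last XN=', XNLast
 10   Continue
      End
```

For N = 3432 and 8 threads this gives 8 blocks of exactly 429 columns each, so the partition itself is evenly balanced; with 7 threads it gives six blocks of 491 columns and one of 486. This suggests, but does not prove, that the poor 8-thread speedup is not caused by an uneven split of the columns.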


The command I use to compile this code and link it against the BLAS library (ATLAS) is:

pgf77 -i8 '-mcmodel=medium' -mp -O2 -tp p7-64 -Mreentrant -Mrecursive -Mnosave -Minfo -Mneginfo -time -fast -Munroll -Mvect=assoc,recog,cachesize:2097152 -o xgemm.exe xgemm.o $gdvroot/bsd/libf77blas-em64t.a $gdvroot/bsd/libatlas-em64t.a -lpthread -lm -lc

Now the problem is this: when I run the matrix multiplication in parallel (the matrices are 3432x3432), the speedup is perfect up to 7 processors, but as soon as the job runs on 8 processors the speedup becomes really poor (less than 3x). However, when I change the size of the matrices, e.g. to 924x924, the speedup on 8 processors becomes normal. I tried allocating more memory for the 3432x3432 multiplication on 8 processors, but even with 10 GB (the limit of our hardware) the speedup stays the same. Can anyone here help me? Thank you very much!
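
To put the memory question in perspective, here is a minimal sketch that estimates the storage for the three operands of one multiplication. It assumes that A, B and C are all 3432x3432, that each double-precision element takes 8 bytes, and it ignores any workspace the BLAS library allocates internally; the program name ChkMem and its variables are illustrative.

```fortran
C     Sketch only: estimates the storage for the three N x N
C     double-precision operands A, B and C of one DGEMM call.
C     Assumes square matrices and 8 bytes per element.
      Program ChkMem
      Implicit None
      Integer N
      Double Precision Bytes, MiB
      Parameter (N = 3432)
C     Three matrices, 8 bytes per double-precision element
      Bytes = 3.0D0 * 8.0D0 * Dble(N) * Dble(N)
      MiB = Bytes / (1024.0D0 * 1024.0D0)
      Write(*,'(A,F8.1,A)') ' A+B+C need about ', MiB, ' MiB'
      End
```

Under these assumptions the operands take roughly 270 MiB in total, far below the 10 GB mentioned above, which is consistent with the observation that giving the job more memory did not change the speedup.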
hongyon



Joined: 19 Jul 2004
Posts: 551

Posted: Thu Aug 28, 2008 10:44 am    Post subject:

Hi,

Have you tried our latest release? Please try it and let us know if the problem persists. There may be a performance bug in our OpenMP runtime that has been fixed in the latest release.

Hongyon
Sharp



Joined: 28 Aug 2008
Posts: 17

Posted: Fri Sep 05, 2008 4:06 am    Post subject:

Hi, thank you for your advice. Since our group doesn't have a license for the latest 7.2.x version, I tried the 7.1.6 library instead. It works all right now. Thank you.

Sharp
All times are GMT - 7 Hours