PGI User Forum


Why is library-based DGEMM slower than the DGEMM code itself?
mkcolg



Joined: 30 Jun 2004
Posts: 6125
Location: The Portland Group Inc.

Posted: Tue Nov 25, 2008 11:19 am

Since we don't control the ACML source, you'd need to contact AMD about the poor performance.

If you want to send your example program to trs@pgroup.com and ask PGI customer support to forward it to me, I can take a look at the lack of parallel speed-up. I can also run the code on an AMD system to see if the problem with ACML is general or specific to Core2.

- Mat
Sharp



Joined: 28 Aug 2008
Posts: 17

Posted: Mon Dec 08, 2008 9:31 am

Hi Mat, is there any update on the library-based DGEMM? Also, I have read through my non-library-based DGEMM code many times, but I still don't know why the speedup is so poor. Do you have any idea about this? Thank you very much.

Sharp
mkcolg



Joined: 30 Jun 2004
Posts: 6125
Location: The Portland Group Inc.

Posted: Mon Dec 08, 2008 2:34 pm

Hi Sharp,

I've looked at it, but nothing stands out. I had to put it aside for a bit but will try to look at it again later this week.

- Mat
Sharp



Joined: 28 Aug 2008
Posts: 17

Posted: Tue Dec 16, 2008 10:36 am

Hi Mat,

The speedup of the non-library-based code is quite poor. I thought it might be because of the compiler, so I compiled the same non-library-based DGEMM code with three different compilers: pgf77, pgf90, and the Intel compiler ifort. The results are as follows:

Code:
No. of CPUs:       1     2     3     4     5     6     7     8
ifort speedup:     -  1.39  1.72  2.00  2.30  2.45  2.45  2.53
pgf77 speedup:     -  1.55  1.74  2.14  2.08  2.28  2.38  2.53
pgf90 speedup:     -  1.55  1.74  2.03  2.08  2.48  2.16  2.28


I still don't understand what causes this poor speedup.
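
To put numbers on how far these runs are from linear scaling, the reported speedups can be converted to parallel efficiency (speedup divided by CPU count) and to the Karp-Flatt serial fraction. The following is a minimal sketch in Python; the function and variable names are illustrative and do not come from the thread, and the only data used are the speedups listed in this post.

```python
# Speedups reported in this post for 2..8 CPUs (1 CPU is the baseline).
SPEEDUPS = {
    "ifort": [1.39, 1.72, 2.00, 2.30, 2.45, 2.45, 2.53],
    "pgf77": [1.55, 1.74, 2.14, 2.08, 2.28, 2.38, 2.53],
    "pgf90": [1.55, 1.74, 2.03, 2.08, 2.48, 2.16, 2.28],
}


def efficiency(speedup, ncpu):
    """Parallel efficiency: 1.0 would mean perfect linear scaling."""
    return speedup / ncpu


def karp_flatt(speedup, ncpu):
    """Karp-Flatt metric: the serial fraction implied by a measured speedup."""
    return (1.0 / speedup - 1.0 / ncpu) / (1.0 - 1.0 / ncpu)


def report():
    """Return (compiler, ncpu, efficiency, serial fraction) for every run."""
    rows = []
    for compiler, values in SPEEDUPS.items():
        for ncpu, s in zip(range(2, 9), values):
            rows.append((compiler, ncpu, efficiency(s, ncpu), karp_flatt(s, ncpu)))
    return rows


if __name__ == "__main__":
    for compiler, ncpu, eff, frac in report():
        print(f"{compiler:6s} {ncpu} CPUs  efficiency={eff:.2f}  serial fraction={frac:.2f}")
```

For example, a speedup of 2.53 on 8 CPUs corresponds to an efficiency of about 0.32, so roughly two thirds of the available CPU time is not contributing to the result.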

However, when I use the library-based DGEMM, although a matrix multiplication takes 10 times longer, the speedup is perfect (linear). It seems we have a trade-off here. Is there an existing library that achieves both fast serial performance and good speedup? Thank you very much.

Hope you have a very good Christmas and Happy New Year!

Sharp
mkcolg



Joined: 30 Jun 2004
Posts: 6125
Location: The Portland Group Inc.

Posted: Fri Dec 19, 2008 4:21 pm

Hi Sharp,

I profiled the code using OProfile-based hardware counters (using the PGI utility pgcollect) and found that the "mp_barrier" routines accounted for 52% of the runtime. I then did an instrumented profile (by compiling with -Mprof=lines) and saw that the threads spend a lot of time waiting for thread 0 to finish.

To help with this, instead of dividing the array into NP chunks, where NP is the total number of threads, I divided it into NP*16 smaller chunks. This way the threads aren't left waiting and can do useful work if one thread hits a bottleneck (like memory). Profiling the modified code shows that "mp_barrier" is now less than 18% of the overall time.
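
The effect of the finer chunking can be illustrated with a toy scheduling model. This is only a sketch under stated assumptions, not the modified DGEMM code (which is not shown in this thread): it assumes that an idle thread always picks up the next unprocessed chunk, and that thread 0 runs at half the speed of the others. The helper names `split` and `makespan`, the problem size, and the thread speeds are all hypothetical.

```python
import heapq


def split(n_items, n_chunks):
    """Split n_items into n_chunks contiguous chunk sizes differing by at most 1."""
    base, extra = divmod(n_items, n_chunks)
    return [base + (1 if i < extra else 0) for i in range(n_chunks)]


def makespan(chunks, speeds):
    """Time until every chunk is done when each idle thread takes the next chunk.

    speeds[i] is the number of items thread i processes per time unit.
    With exactly one chunk per thread this reduces to a static 1:1 split.
    """
    # Heap of (time at which the thread becomes free, thread index).
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for size in chunks:
        t, i = heapq.heappop(free_at)      # earliest available thread
        t += size / speeds[i]              # it works on this chunk
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish


if __name__ == "__main__":
    NP = 8                                 # hypothetical thread count
    N = 2048                               # hypothetical number of columns
    speeds = [0.5] + [1.0] * (NP - 1)      # assumption: thread 0 is the slow one

    coarse = makespan(split(N, NP), speeds)        # NP chunks
    fine = makespan(split(N, NP * 16), speeds)     # NP*16 chunks
    print(f"NP chunks:    {coarse:.0f} time units")
    print(f"NP*16 chunks: {fine:.0f} time units")
```

In this model the run with NP chunks takes 512 time units, because every thread waits at the barrier for the slow one, while the run with NP*16 chunks takes 288, because the faster threads absorb most of the remaining work.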

On my 2-socket Penryn system, the original code took 56 seconds with 1 thread and 26 seconds with. With the modified code, the time dropped to 19 seconds. Not great, but better.

As for other libraries, you can try ATLAS or GotoBLAS. I've run out of time for today, otherwise I'd try them myself.

- Mat