mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Tue Nov 25, 2008 11:19 am
Since we don't control the ACML source, you'd need to contact AMD about the poor performance.
If you want to send your example program to trs@pgroup.com and ask PGI customer support to forward it to me, I can take a look at the lack of parallel speed-up. I can also run the code on an AMD system to see if the problem with ACML is general or specific to Core2.
- Mat

Sharp
Joined: 28 Aug 2008 Posts: 17
Posted: Mon Dec 08, 2008 9:31 am
Hi Mat, is there any update on the library-based DGEMM? Also, I have read through my non-library-based DGEMM code many times, and I still don't know why the speedup is so poor. Do you have any idea what might cause this? Thank you very much.
Sharp

mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Mon Dec 08, 2008 2:34 pm
Hi Sharp,
I've looked at it, but nothing stands out. I had to put it aside for a bit but will try to look at it again later this week.
- Mat

Sharp
Joined: 28 Aug 2008 Posts: 17
Posted: Tue Dec 16, 2008 10:36 am
Hi Mat,
The speedup of the non-library-based code is quite poor. I thought it might be because of the compiler, so I compiled the same non-library-based DGEMM code with three different compilers: pgf77, pgf90, and the Intel compiler ifort. The results are as follows:
ifort:
No. of CPUs:  1     2     3     4     5     6     7     8
Speedup:      -     1.39  1.72  2.00  2.30  2.45  2.45  2.53
pgf77:
No. of CPUs:  1     2     3     4     5     6     7     8
Speedup:      -     1.55  1.74  2.14  2.08  2.28  2.38  2.53
pgf90:
No. of CPUs:  1     2     3     4     5     6     7     8
Speedup:      -     1.55  1.74  2.03  2.08  2.48  2.16  2.28
I still don't understand what causes this poor speedup.
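One rough way to read the speedup numbers above is to convert them to parallel efficiency (speedup divided by CPU count) and to the Karp-Flatt "experimentally determined serial fraction". The following is only a sketch in Python, not something from the original code; the function names are made up for illustration, and the speedup values are simply copied from the tables.

```python
# Measured speedups copied from the tables above, for 2..8 CPUs.
SPEEDUPS = {
    "ifort": [1.39, 1.72, 2.00, 2.30, 2.45, 2.45, 2.53],
    "pgf77": [1.55, 1.74, 2.14, 2.08, 2.28, 2.38, 2.53],
    "pgf90": [1.55, 1.74, 2.03, 2.08, 2.48, 2.16, 2.28],
}


def efficiency(speedup, ncpu):
    """Parallel efficiency: 1.0 would mean perfect linear speedup."""
    return speedup / ncpu


def karp_flatt(speedup, ncpu):
    """Karp-Flatt metric: apparent serial fraction implied by a speedup."""
    return (1.0 / speedup - 1.0 / ncpu) / (1.0 - 1.0 / ncpu)


def summarize(speedups):
    """Return (ncpu, speedup, efficiency, serial_fraction) for 2..N CPUs."""
    rows = []
    for ncpu, s in enumerate(speedups, start=2):
        rows.append((ncpu, s, efficiency(s, ncpu), karp_flatt(s, ncpu)))
    return rows


if __name__ == "__main__":
    for name, values in SPEEDUPS.items():
        print(name)
        for ncpu, s, eff, frac in summarize(values):
            print(f"  {ncpu} CPUs: speedup {s:.2f}, "
                  f"efficiency {eff:.2f}, serial fraction {frac:.2f}")
```

For example, a speedup of 2.53 on 8 CPUs is an efficiency of about 0.32. A serial fraction that grows with the CPU count would point to overhead such as synchronization or memory contention rather than to a fixed serial part of the code.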
However, when I use the library-based DGEMM, although a matrix multiplication takes 10 times longer to finish, the speedup is perfect (linear). It seems we have a trade-off here. Is there an existing library that achieves both fast serial calculation and good speedup? Thank you very much.
Hope you have a very good Christmas and Happy New Year!
Sharp

mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Fri Dec 19, 2008 4:21 pm
Hi Sharp,
I profiled the code using oprofile-based hardware counters (using the PGI utility pgcollect) and found that the "mp_barrier" routines accounted for 52% of the runtime. I then did an instrumented profile (by compiling with -Mprof=lines) and saw that the threads spend a lot of time waiting for thread 0 to finish.
To help with this, instead of dividing the array into NP chunks, where NP is the total number of threads, I divided it into NP*16 smaller chunks. This way a thread that finishes early isn't left waiting and can do useful work if another thread hits a bottleneck (like memory). Profiling the modified code shows that "mp_barrier" is now less than 18% of the overall time.
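The effect of the smaller chunks can be illustrated with a toy model. This is only a sketch in Python, not the actual Fortran change: it assumes the NP*16 chunks are handed out on demand to whichever thread is free next (the post does not say how they are assigned), and the function names and per-thread speeds are hypothetical.

```python
import heapq


def split(n, parts):
    """Split n rows into `parts` chunk sizes that differ by at most one."""
    base, extra = divmod(n, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def static_makespan(n, speeds):
    """One chunk per thread: everyone waits at the barrier for the slowest."""
    chunks = split(n, len(speeds))
    return max(size / speed for size, speed in zip(chunks, speeds))


def dynamic_makespan(n, speeds, chunks_per_thread=16):
    """NP*16 chunks, each given to the thread that becomes free first."""
    chunks = split(n, len(speeds) * chunks_per_thread)
    # Min-heap of (time the thread becomes free, thread id).
    free_at = [(0.0, t) for t in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for size in chunks:
        now, t = heapq.heappop(free_at)
        done = now + size / speeds[t]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, t))
    return finish


if __name__ == "__main__":
    # Hypothetical case: thread 0 runs at half the speed of the others.
    speeds = [0.5, 1.0, 1.0, 1.0]
    print("static :", static_makespan(1024, speeds))
    print("dynamic:", dynamic_makespan(1024, speeds))
```

In this model the one-chunk-per-thread split takes 512 time units, because the three fast threads idle while the slow one finishes its 256 rows. With the smaller chunks the fast threads pick up the extra work, so the total time is lower.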
On my 2-socket Penryn system, the original code took 56 seconds with 1 thread and 26 seconds with. With the modified code the time dropped to 19 seconds. Not great, but better.
As for other libraries, you can try ATLAS or GotoBLAS. I've run out of time for today, otherwise I'd try them myself.
- Mat