PGI User Forum


Why is library-based DGEMM slower than the DGEMM code itself?
Sharp



Joined: 28 Aug 2008
Posts: 17

Posted: Mon Dec 22, 2008 12:52 pm

mkcolg wrote:

I profiled the code using oprofile based hardware counters (using the PGI utility pgcollect) and found that the "mp_barrier" routines accounted for 52% of the runtime. I then did an instrumented profile (by compiling with -Mprof=lines) and see that the threads spend a lot of time waiting for thread 0 to finish.

To help with this, instead of dividing the array up into NP chunks where NP is the total number of threads, I divided it up into smaller NP*16 chunks. This way the threads aren't waiting and can do useful work if one thread hits a bottleneck (like memory). Profiling the modified code shows that "mp_barrier" is now less than 18% of the overall time.


Hi Mat, I have already tried reducing the size of the chunks. For example, I changed the line that sets the chunk size from:

Code:
        ColPW = Max((N+NP-1)/NP,MinCoW)

to
Code:
          ColPW = Min((N+NP-1)/NP,(NP*MinCoW))


Also, since the chunk size is smaller now, I changed Schedule(Static,1) to Schedule(Dynamic,1).

With this change, the speedup for 2 processors is much better (close to 2, about 1.8), but once the number of processors goes over 2 the speedup is still very poor, e.g. for 8 processors it is about 3.73. This is still not as good as the library-based DGEMM. Since the problem does not seem to come from the compiler, it may be that my code is not efficient enough. Do you think simply trying out a better chunk size will work?

I also tried to parallelize DGEMM directly, using the same chunk sizes, without going through the stub XGEMM. It shows similar results.

Since you said
Quote:
the threads spend a lot of time waiting for thread 0 to finish

do you think putting a Nowait at the end of the parallel region would work?

mkcolg wrote:
As for other libraries, you can try ATLAS or GOTOBlas. I've run out of time for today, otherwise I'd try them myself.


For other libraries, I will let you know the results after I have tried them. Thank you very much.

Hope you have a very good Christmas and New Year!

Sharp
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Mon Dec 22, 2008 5:14 pm

Hi Sharp,

Quote:
Do you think I simply try out a better chunk size will work?

It helps, but not by much. Unfortunately, after two hours of tinkering (like you, I tried dynamic scheduling as well as guided, and nowait), reducing the chunk size was the only thing I saw that gave some improvement.

One possibility is that the overhead of thread creation and management is higher than the gain from parallelization. Given that the total time of DGEMM is less than a second, it would make sense that the overhead would reduce the parallel speed-up. However, when I increased the problem size, the speed-up didn't change (assuming a fixed overhead cost). So I haven't been able to prove this is the problem.
Quote:

Hope you have a very good Christmas and New Year!

Thank you, and I hope you also have a very good Christmas and New Year. Portland is currently under a foot of snow and ice, so it's definitely a very white Christmas!

- Mat
Sharp



Joined: 28 Aug 2008
Posts: 17

Posted: Mon Jan 12, 2009 11:36 am

Hi Mat,

Happy New Year! I hope you had a very good holiday.

Now, back to the problem we had before. I have tried enlarging the matrices to see whether the performance changes (with NP*16 chunks and dynamic scheduling), but the speedup is about the same.

Also, I just found out that we don't have ATLAS or GOTOBlas installed, so I can't tell you how they perform. But I suspect the results would be similar to ACML.

After trying many ways to solve the performance problem, it seems the best approach is to use a library. However, because the library-based DGEMM performs slower than the non-library version (one of the matrices is sparse), I searched online to see whether there is a solution for this. I found that there are sparse BLAS routines, but they seem to be supplied by the MKL libraries. Are there any sparse BLAS routines supported by the PGI libraries? Thank you very much.

Once again, Happy New Year!

Sharp
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Mon Jan 12, 2009 12:08 pm

Hi Sharp,

I don't believe the NetLib BLAS or AMD ACML libraries that we ship with our product contain any sparse routines. However, you should be able to link PGI-compiled code with the MKL libraries. There are also several free libraries available, including one from the Trilinos project (http://trilinos.sandia.gov/index.html), which I've been told works well with the PGI compilers.

- Mat