PGI User Forum
poor pgi openmp performance??
steve.xu



Joined: 20 Feb 2012
Posts: 25

Posted: Sat Jun 30, 2012 8:08 am

Thanks, toepfer.
But I still cannot see a performance improvement with -fast. Here are some details about our parallel computer and program.
Our parallel computer contains many compute nodes, and each node contains 2 Intel Xeon 5670 processors. The two Xeon 5670 processors are plugged into 2 different sockets and each has 6 cores, so each node has 12 cores with 48 GB of memory shared among them.
Our program is a real-world numerical application for computational fluid dynamics (CFD), not a benchmark. We are writing MPI, OpenMP and CUDA code to parallelize it. Since our CFD program is written in Fortran and PGI is currently the only compiler that supports CUDA Fortran, we chose PGI, specifically PGI 11.8, for our work.
Below are the results of a performance comparison between the Intel compiler and PGI 11.8 for our program:
Configuration            1 thread   6 threads   12 threads
Intel-omp-O3             23 s       5 s         3 s
Pgi-omp-fast             28 s       10 s        8.4 s
Pgi-omp-fast-ipa         26.5 s     8.5 s       6.7 s
pgi-omp-fast-ipa-bound   NA         6.8 s       NA

Compiler flags:
Intel-omp-O3: -O3 -openmp
Pgi-omp-fast: -fast -mp
Pgi-omp-fast-ipa: -fast -mp -Mipa=fast,inline
I ran the program on one compute node with one MPI process and 1, 6 and 12 threads. pgi-omp-fast-ipa-bound means we set MP_BIND and MP_BLIST for PGI. We do find that binding threads to processor cores improves OpenMP performance with PGI, but even in this case PGI is still nearly 40% slower than Intel. What surprises us is that performance can get worse if we set MP_BIND to "yes" when the number of OpenMP threads is 12.

The performance gap between PGI and Intel for the sequential program is not too large, roughly 10% to 20%. That is not too bad for our program. But I have no idea why the gap in parallel OpenMP performance between the two compilers is so large: Intel is roughly 2 times faster than PGI, and Intel also shows good parallel scalability.

OK, toepfer. I simply compiled the same source code with the two compilers and the different compiler flags, and got the results above. I cannot explain the abnormal performance gap between the two OpenMP implementations. Could you or anyone else help me?
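
One way to make the scalability comparison concrete is to convert the timings in the table above into speedup and parallel efficiency relative to each compiler's own 1-thread run. The following is only a small sketch: the numbers are copied from the table, the function names are made up for illustration, and the bound configuration is left out because it has no 1-thread baseline.

```python
# Wall-clock times in seconds, copied from the table in the post.
# Keys are OpenMP thread counts. The "bound" row is omitted because
# it has no 1-thread measurement to use as a baseline.
timings = {
    "Intel-omp-O3":     {1: 23.0, 6: 5.0, 12: 3.0},
    "Pgi-omp-fast":     {1: 28.0, 6: 10.0, 12: 8.4},
    "Pgi-omp-fast-ipa": {1: 26.5, 6: 8.5, 12: 6.7},
}


def scaling(times):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread time."""
    t1 = times[1]
    return {n: (t1 / t, (t1 / t) / n) for n, t in sorted(times.items())}


def report(all_timings):
    """Print one line per configuration and thread count."""
    for name, times in all_timings.items():
        for n, (speedup, eff) in scaling(times).items():
            print(f"{name:18s} {n:2d} threads: "
                  f"speedup {speedup:5.2f}, efficiency {eff:4.0%}")


if __name__ == "__main__":
    report(timings)
```

On these numbers the Intel build reaches a speedup of about 7.7 on 12 threads, while the PGI -fast build reaches about 3.3, so the gap comes mostly from scaling rather than from single-thread speed.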
mkcolg



Joined: 30 Jun 2004
Posts: 5871
Location: The Portland Group Inc.

Posted: Mon Jul 02, 2012 5:41 pm

Hi Steve,

I think Craig has taken it as far as we can online. If you can send us the code (trs@pgroup.com) then I'd be happy to spend an hour or two investigating the performance. You may also want to try profiling your code to get a better idea of where the performance differences occur. You can then use the compiler feedback messages (-Minfo) to see what optimizations are being applied to those sections of code. Pay particular attention to any messages that show places where the compiler attempted but failed to optimize a section of code.

- Mat
steve.xu



Joined: 20 Feb 2012
Posts: 25

Posted: Tue Jul 03, 2012 8:12 pm

Thanks Mat.
I cannot send you our source code, but I can send you the PGI compiler output from -Minfo. Please check your mail. I hope you can find some useful information in this output. Are there any other profiling tools from PGI that can provide additional performance information? I am not sure whether the -Minfo output gives you enough information.
BTW, I can improve PGI OpenMP performance within a single CPU socket. I set MP_BIND=y and MP_BLIST=5,4,3,2,1,0, since each of our Intel Xeon 5670 processors has six cores. It does have an effect on OpenMP performance. But how can I bind threads to processor cores when there are 2 CPUs in 2 different sockets?
I do not know how to set MP_BLIST in that case.
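
For the two-socket case, the binding list would simply name all 12 logical CPUs, and the order decides which socket fills first. The sketch below builds such a list; it is not PGI's recommended setting. It ASSUMES that logical CPU ids are numbered socket by socket (0 to 5 on socket 0, 6 to 11 on socket 1), which should be checked on the real node (for example in /proc/cpuinfo), since many systems interleave the ids. The helper names `cpu_list` and `bind_env` are hypothetical.

```python
def cpu_list(threads, cores_per_socket=6, sockets=2, policy="compact"):
    """Return the logical CPU ids to bind `threads` OpenMP threads to.

    ASSUMPTION: ids are numbered socket by socket, so socket s owns
    ids s*cores_per_socket .. (s+1)*cores_per_socket - 1.
    "compact" fills one socket before the next; "scatter" alternates
    between sockets.
    """
    total = cores_per_socket * sockets
    if not 1 <= threads <= total:
        raise ValueError("thread count must be between 1 and the core count")
    if policy == "compact":
        order = list(range(total))
    elif policy == "scatter":
        order = [s * cores_per_socket + c
                 for c in range(cores_per_socket)
                 for s in range(sockets)]
    else:
        raise ValueError("policy must be 'compact' or 'scatter'")
    return order[:threads]


def bind_env(threads, **kwargs):
    """Return the environment settings for a bound PGI OpenMP run."""
    return {
        "OMP_NUM_THREADS": str(threads),
        "MP_BIND": "yes",
        "MP_BLIST": ",".join(str(cpu) for cpu in cpu_list(threads, **kwargs)),
    }


if __name__ == "__main__":
    # Print shell export lines for a 12-thread run on one node.
    for name, value in bind_env(12).items():
        print(f"export {name}={value}")
```

Whether the compact or the scattered order performs better depends on how the program touches memory, so it is worth timing both.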
steve.xu



Joined: 20 Feb 2012
Posts: 25

Posted: Wed Jul 18, 2012 8:14 am

Hi Mat.
I have just sent you code that reproduces this OpenMP performance problem; please check your mail.
I hope you can give me some advice. Thank you.
mkcolg



Joined: 30 Jun 2004
Posts: 5871
Location: The Portland Group Inc.

Posted: Wed Jul 18, 2012 10:36 am

Thanks Steve. I see that Customer Service forwarded your code to Craig for further investigation. Craig is very good at diagnosing OpenMP performance issues (much better than me) so you'll be in good hands.

- Mat