PGI User Forum


poor pgi openmp performance??
steve.xu



Joined: 20 Feb 2012
Posts: 25

Posted: Thu May 31, 2012 12:03 am    Post subject: poor pgi openmp performance??

I have a Fortran CFD program parallelized with OpenMP. When it is compiled with Intel Fortran, I can achieve a speedup of almost 10 on two Intel Xeon X5670 CPUs, which have 12 cores in total. But when I compile it with PGI (version 11.8), I can only achieve a speedup of less than 5. I use the -O3 option with both compilers. For the sequential program, I observe that PGI Fortran is about 20% slower than Intel Fortran. More surprisingly, if I use the PGI compiler's -fast option, I cannot get the right result with 12 OpenMP threads, but the result is still correct when the number of threads is less than 12.
So what is the difference in implementation between Intel OpenMP and PGI OpenMP? Can anybody give me some advice on how to improve PGI OpenMP performance?
mkcolg



Joined: 30 Jun 2004
Posts: 6129
Location: The Portland Group Inc.

Posted: Thu May 31, 2012 9:59 am    Post subject:

Hi Steve,
Quote:
I use the -O3 option with both compilers.
Our -O3 is not the same as Intel's. Currently, -O3 is really the same as -O2. We sometimes use -O3 to introduce new optimizations that might impact numerical accuracy, but these typically get moved into -O2 once they have been vetted. The closest PGI equivalent to Intel's -O3 is -fast, and this difference in optimization could account for the 20% difference, if not more.

Quote:
More surprisingly, if I use the PGI compiler's -fast option, I cannot get the right result with 12 OpenMP threads, but the result is still correct when the number of threads is less than 12.

That is odd. While the optimizations that the compiler uses can impact numerical accuracy, even with -fast we stay within 1 ulp of accuracy. And while parallel execution can change results because of the order of operations, it's not clear why this would occur only at 12 threads and only with -fast. You'll need to do some digging.
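
To illustrate the order-of-operations point, here is a small Python sketch (not the CFD code; the values are chosen purely for demonstration). It sums the same four numbers once serially and once as per-thread partial sums that are combined at the end, roughly the way an OpenMP reduction might. The contiguous chunking scheme and the function names are assumptions; a real runtime may split and combine the work differently.

```python
def serial_sum(values):
    """Accumulate left to right. An explicit loop keeps the order visible."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, nthreads):
    """Mimic a reduction: each 'thread' sums one contiguous chunk,
    then the partial sums are added together."""
    size = -(-len(values) // nthreads)  # ceiling division
    partials = [serial_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return serial_sum(partials)


# Deliberately extreme inputs so the rounding difference is easy to see.
data = [1e16, 1.0, -1e16, 1.0]

print("1 thread :", serial_sum(data))
print("2 threads:", chunked_sum(data, 2))
```

The two printed values differ even though the inputs are identical, because floating-point addition is not associative. The inputs here are deliberately extreme; in practice the differences are usually much smaller.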

-fast is an aggregate flag made up of other optimizations. So what you can do is eliminate them one by one to see which optimization is giving you the verification error. The specific flags included in -fast can change from target to target, but here's a general list:
Quote:
-O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre
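
One way to organize the one-by-one elimination is to generate a build command for each variant, as in this Python sketch. It only prints command lines and does not run the compiler. The flag list is the one quoted above; the `pgfortran` driver name, the `-mp` OpenMP flag, the source file `cfd.f90`, and the executable names are assumptions for illustration and should be replaced with whatever your build actually uses.

```python
# Flag list quoted in the post above; the exact set varies by target.
FAST_FLAGS = ["-O2", "-Munroll=c:1", "-Mnoframe", "-Mlre", "-Mautoinline",
              "-Mvect=sse", "-Mscalarsse", "-Mcache_align", "-Mflushz", "-Mpre"]


def leave_one_out(flags, base="-O2"):
    """Yield (dropped_flag, remaining_flags), never dropping the base level."""
    for dropped in flags:
        if dropped == base:
            continue
        yield dropped, [f for f in flags if f != dropped]


def build_commands(source="cfd.f90", compiler="pgfortran", openmp="-mp"):
    """Return (dropped_flag, command_line) pairs, one per variant.

    source, compiler and openmp are hypothetical defaults.
    """
    commands = []
    for dropped, remaining in leave_one_out(FAST_FLAGS):
        # e.g. "-Mvect=sse" -> "cfd_without_Mvect"
        exe = "cfd_without_" + dropped.lstrip("-").split("=")[0]
        parts = [compiler, openmp, *remaining, source, "-o", exe]
        commands.append((dropped, " ".join(parts)))
    return commands


if __name__ == "__main__":
    for dropped, cmd in build_commands():
        print(f"# without {dropped}")
        print(cmd)
```

Run each resulting executable with 12 threads and note which variants still show the verification error. Keep in mind that leaving a flag off the command line only returns that setting to whatever -O2 selects by default, so check the compiler documentation if a variant behaves unexpectedly.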


Note that you'll most likely want to characterize the verification error. If you determine which optimization causes the issue, the next step is to use a binary search method to compile half of your code with the optimization and half without. Continue until you narrow down the particular file that's causing the error. Next, you can use the "opt" directive/pragma to disable optimization for a particular routine and again use a binary search method to find the particular routine where the verification error occurs.
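
The binary search described above can be sketched as follows in Python. The function `find_culprit` and the `fails` callback are hypothetical names: `fails(subset)` stands for "rebuild with the suspect optimization enabled only for `subset`, run the program, and report whether the verification error appears". The sketch assumes exactly one file (or routine) is responsible; if several interact, the search needs to be adapted.

```python
from typing import Callable, Sequence


def find_culprit(units: Sequence[str],
                 fails: Callable[[Sequence[str]], bool]) -> str:
    """Return the single unit that triggers the failure when optimized.

    units: file or routine names to search.
    fails: user-supplied check; True means the verification error occurred
           with the optimization enabled for exactly the given units.
    """
    candidates = list(units)
    if not fails(candidates):
        raise ValueError("error does not reproduce with all units optimized")
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # If optimizing only the first half reproduces the error, the culprit
        # is in it; otherwise it must be in the second half.
        candidates = half if fails(half) else candidates[len(half):]
    return candidates[0]


if __name__ == "__main__":
    # Stand-in for a real build-and-run check, for demonstration only.
    files = ["grid.f90", "flux.f90", "solver.f90", "bc.f90", "io.f90"]
    print(find_culprit(files, lambda subset: "solver.f90" in subset))
```

The same function works at both levels: pass file names first, then the routine names inside the file it returns.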

While this type of issue could be the compiler's fault, I often see cases where the user's program has an error that doesn't get exposed until a particular optimization is used. Out-of-bounds accesses and uninitialized memory are the typical causes.


Quote:
So what is the difference in implementation between Intel OpenMP and PGI OpenMP? Can anybody give me some advice on how to improve PGI OpenMP performance?
Without a detailed performance analysis it's impossible to know exactly what's going on. My first thought is that it's the difference in optimization, so the simplest thing is to try different optimizations and see how they affect your code. A more thorough analysis would be to use a profiler to determine where most of the time is spent. Compare the PGI and Intel profiles to determine where the differences occur. Next, use the compiler feedback messages (-Minfo) to determine which optimizations are being applied and, more importantly, which are not. In particular, look for messages about code not vectorizing (-Mvect).


Hope this helps,
Mat
Michal_Kvasnicka



Joined: 22 Jul 2004
Posts: 7

Posted: Mon Jun 11, 2012 12:00 am    Post subject: Re: poor pgi openmp performance??

steve.xu wrote:
I have a Fortran CFD program parallelized with OpenMP. When it is compiled with Intel Fortran, I can achieve a speedup of almost 10 on two Intel Xeon X5670 CPUs, which have 12 cores in total. But when I compile it with PGI (version 11.8), I can only achieve a speedup of less than 5. I use the -O3 option with both compilers. For the sequential program, I observe that PGI Fortran is about 20% slower than Intel Fortran. More surprisingly, if I use the PGI compiler's -fast option, I cannot get the right result with 12 OpenMP threads, but the result is still correct when the number of threads is less than 12.
So what is the difference in implementation between Intel OpenMP and PGI OpenMP? Can anybody give me some advice on how to improve PGI OpenMP performance?


It looks like the Intel compiler is significantly better than the PGI compiler ... see for example: http://www.pgroup.com/userforum/viewtopic.php?t=2318
mkcolg



Joined: 30 Jun 2004
Posts: 6129
Location: The Portland Group Inc.

Posted: Mon Jun 11, 2012 9:12 am    Post subject:

Hi Michal,

Yes, Intel is currently faster than PGI on the Polyhedron benchmark. However, this may or may not have any bearing on Steve's code, and it's best to do a performance analysis to understand what's happening.

- Mat
steve.xu



Joined: 20 Feb 2012
Posts: 25

Posted: Tue Jun 19, 2012 8:39 pm    Post subject:

Thanks Mat.
Thanks Michal.
Our CFD program contains nearly 20,000 lines of Fortran code. We implemented MPI, OpenMP, and CUDA parallel computing in the program. Since PGI is the only Fortran compiler supporting CUDA Fortran, we currently have to use the PGI compiler. Otherwise we would have to write mixed-language code containing CUDA C and Fortran. Unfortunately, I do find that PGI is sometimes slower than Intel when comparing the original MPI and OpenMP parallel implementations of our CFD code.
As for OpenMP, the performance gap is even bigger for our code. I am not sure whether we have used appropriate compiler optimization flags. Could you give me some general advice about performance optimization when compiling or writing OpenMP code in PGI Fortran?

thanks all
All times are GMT - 7 Hours
Page 1 of 4

 