PGI User Forum

Recommended flags for 64-bit Xeon?

 
mikep213



Joined: 03 Sep 2004
Posts: 2

Posted: Fri Sep 03, 2004 7:09 am    Post subject: Recommended flags for 64-bit Xeon?

Can anyone recommend a good set of optimisation / machine type flags for pgcc under Linux on a 64-bit Xeon? Looking through the compiler flag list, I'm getting a reasonable flops benchmark from "-tp p7-64 -O4", which is the specific arch I'm using. Any improvement on that is appreciated. There are a lot of flags to try, and I suspect flag ordering may affect the results, too!
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Fri Sep 03, 2004 10:08 am    Post subject: "-fastsse -Mipa=fast,inline"

Hi,

In general, the flag set that gives the best performance for C and Fortran with release 5.2 is "-fastsse -Mipa=fast,inline". C++ tends to do better with "-fastsse --no_exceptions -Minline=level:10". Note that "-tp p7-64" is on by default when you compile on a 64-bit Xeon system. "-fastsse" combines all of the most common optimizations under a single flag.

"-fastsse" == "-O2 -Munroll=c:1 -Mnoframe -Mlre -Mscalarsse -Mvect=sse -Mcache_align -Mflushz"

You can also then add "-O3" (-O4 is really -O3), which may or may not help. Also try prefetching with "-Mprefetch" and "-Mvect=prefetch".

Note, if you're using a "stream" benchmark, aggressive optimization doesn't help since the benchmark is memory bound. The best flags for stream are "-O2 -Mvect=sse -Mnontemporal -Munsafe_par_align". Note that "-Mnontemporal" can hurt general applications and "-Munsafe_par_align" is called unsafe for a reason.
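A rough estimate shows why a stream-style loop is memory bound. The sketch below assumes the usual triad form a[i] = b[i] + s*c[i] on 8-byte doubles (the thread does not say which stream kernel is meant) and ignores any extra write-allocate traffic. The bandwidth figure in the usage line is a made-up placeholder, not a measurement of any Xeon.

```python
BYTES_PER_DOUBLE = 8


def triad_intensity():
    """Flops per byte of memory traffic for a[i] = b[i] + s * c[i]."""
    flops = 2                             # one multiply, one add
    bytes_moved = 3 * BYTES_PER_DOUBLE    # load b[i], load c[i], store a[i]
    return flops / bytes_moved


def bandwidth_bound_gflops(bandwidth_gb_per_s):
    """Upper bound on GFLOP/s when the loop is limited by memory bandwidth."""
    return bandwidth_gb_per_s * triad_intensity()


if __name__ == "__main__":
    # 6.0 GB/s is a hypothetical placeholder bandwidth.
    print(round(bandwidth_bound_gflops(6.0), 3))
```

With one flop for every 12 bytes moved, the ceiling is set by the memory system, so unrolling or deeper optimization of the arithmetic has little to work with.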

Another thing to try when tuning for performance is the profiler, pgprof. Compile and link your code with "-Mprof=lines", run the program, and view the results with pgprof. This will give you a better understanding of where your code takes the most time and where you should focus your tuning efforts.

The bottom line is to start with "-fastsse -Mipa=fast,inline". Does this perform well enough? If so, great, you're done. Otherwise, find the parts of your code that are not performing well, determine why, then determine whether other compiler options will work better or whether some code rewriting will help.
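Since there are several flag sets worth comparing, a small driver can save some typing. This is only a sketch: the flag sets are the ones mentioned in this thread, while the file names flops.c and flops and the helper names are hypothetical. If pgcc is not on the path it just prints the commands it would run.

```python
import shlex
import shutil
import subprocess
import time

# Flag sets taken from this thread; the empty string is the no-flags baseline.
FLAG_SETS = [
    "",
    "-fastsse -Mipa=fast,inline",
    "-fastsse -Mipa=fast,inline -O3",
    "-fastsse -Mipa=fast,inline -Mprefetch",
]


def build_command(flags, source="flops.c", output="flops", compiler="pgcc"):
    """Return the argv list for one compile (source/output names are hypothetical)."""
    return [compiler] + shlex.split(flags) + ["-o", output, source]


def time_flag_set(flags):
    """Compile with one flag set, run the binary once, return elapsed seconds."""
    subprocess.run(build_command(flags), check=True)
    start = time.perf_counter()
    subprocess.run(["./flops"], check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


if __name__ == "__main__":
    if shutil.which("pgcc") is None:
        # No compiler on the path: just show what would be run.
        for flags in FLAG_SETS:
            print(" ".join(build_command(flags)))
    else:
        for flags in FLAG_SETS:
            print(f"{time_flag_set(flags):8.2f}s  {flags or '(no flags)'}")
```

A single run per flag set is noisy, so repeating each measurement a few times is advisable before drawing conclusions.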

Flag order does matter in some cases. In general, the last flag overrides any earlier conflicting flag. So "-fastsse -O3" is different from "-O3 -fastsse": -fastsse implies -O2, so in "-O3 -fastsse" the implied -O2 comes last and overrides -O3, while in "-fastsse -O3" the -O3 comes last and wins. The exception to this is "-Mvect", which adds suboptions together, so "-Mvect=sse -Mvect=prefetch" == "-Mvect=sse,prefetch". If you wanted -fastsse with only the prefetch option, you would need to add nosse, as in "-fastsse -Mvect=nosse,prefetch", since -Mvect=sse is part of -fastsse.
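The ordering rules above can be sketched as a toy model. This is not how the PGI driver is implemented; it only encodes the three rules stated in this post (the -fastsse expansion for release 5.2, last -O level wins, -Mvect suboptions accumulate and a "no" prefix removes one). The function name resolve is hypothetical.

```python
# Expansion of -fastsse as quoted in this thread for release 5.2.
FASTSSE = ("-O2 -Munroll=c:1 -Mnoframe -Mlre -Mscalarsse "
           "-Mvect=sse -Mcache_align -Mflushz").split()


def resolve(flags):
    """Toy model: return (optimization level, sorted -Mvect suboptions)."""
    expanded = []
    for flag in flags.split():
        expanded.extend(FASTSSE if flag == "-fastsse" else [flag])

    opt_level = None
    vect = set()
    for flag in expanded:
        if flag.startswith("-O") and flag[2:].isdigit():
            opt_level = int(flag[2:])        # the last -O<n> seen wins
        elif flag.startswith("-Mvect="):
            for sub in flag[len("-Mvect="):].split(","):
                if sub.startswith("no"):
                    vect.discard(sub[2:])    # e.g. nosse removes sse
                else:
                    vect.add(sub)            # suboptions accumulate
    return opt_level, sorted(vect)


if __name__ == "__main__":
    print(resolve("-fastsse -O3"))                   # (3, ['sse'])
    print(resolve("-O3 -fastsse"))                   # (2, ['sse'])
    print(resolve("-fastsse -Mvect=nosse,prefetch")) # (2, ['prefetch'])
```

When in doubt about what a real compile line resolves to, the compiler's own verbose output is the authority, not this model.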

Hope this wasn't too long-winded. Let me know if you need anything clarified.

-Mat
mwolfe



Joined: 13 Jul 2004
Posts: 20

Posted: Fri Sep 03, 2004 4:07 pm

To be perfectly clear, '-tp p7-64' is the default on a 64-bit Xeon system if the 64-bit compiler is on your path. When you install the PGI compiler suite, you can install both 32-bit and 64-bit compilers; the 64-bit compilers will go in
/usr/pgi/linux86-64/5.2/{bin,include,lib,...}
and the 32-bit compilers go in
/usr/pgi/linux86/5.2/{bin,include,lib,...}
Assuming you install in /usr/pgi (your prefix may be different).
If you put /usr/pgi/linux86-64/5.2/bin on your path, you get the 64-bit compilers by default; if you put /usr/pgi/linux86/5.2/bin on your path, you get the 32-bit compilers by default. The -tp switch will override the default, of course.
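The path rule can be checked mechanically. The sketch below assumes the /usr/pgi prefix and 5.2 release directory used in this post, and the helper name default_pgi_target is hypothetical. It only looks at the order of path entries; it does not verify that the compilers are installed, and an explicit -tp switch still overrides the result.

```python
import posixpath


def default_pgi_target(path, prefix="/usr/pgi", release="5.2"):
    """Return 'linux86-64' or 'linux86' for whichever PGI bin directory
    appears first in the colon-separated path, or None if neither does."""
    bins = {
        posixpath.join(prefix, target, release, "bin"): target
        for target in ("linux86-64", "linux86")
    }
    for entry in path.split(":"):
        target = bins.get(entry.rstrip("/"))
        if target is not None:
            return target
    return None


if __name__ == "__main__":
    import os
    print(default_pgi_target(os.environ.get("PATH", "")))
```

If both directories are on the path, the earlier one decides which pgcc is found, which is why the order matters.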
mikep213



Joined: 03 Sep 2004
Posts: 2

Posted: Fri Sep 10, 2004 8:13 am

Thanks for the prompt reply, folks!

I've been trying out the various suggested options, and I'm not seeing a performance improvement (and in some cases I'm understandably seeing performance degradation). The best results I'm generally seeing are when I don't use any compiler flags at all!

FTR, I'm using Al Aburto's flops benchmark. A fairly simple benchmark, but I've analysed it under other architectures, so I know it's possible to optimise it to take advantage of superscalar architectures.

It seems odd that I'm seeing little performance difference with any optimisation flag. I ran a vanilla compile with the -# option, and it doesn't seem to be auto-selecting optimisation flags. Adding -Minfo shows that I'm getting at least loop unrolling, but without any apparent benefit!

I'll try it out on a 'real' piece of code before my trial licence expires, but I wondered if anyone had any suggestions for troubleshooting?

Thanks,
Mike.
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Fri Sep 10, 2004 9:49 am    Post subject: Might be the architecture

I'd need to study FLOPS further to understand what's going on. I tried both gcc and icc 8.1 and got the same result: optimization didn't help. However, when I compiled it with cc on a Sun, I saw an improvement with optimization. Maybe it's an architectural issue?

-Mat
All times are GMT - 7 Hours