PGI User Forum


Remarks / Diagnostic Output?

 
gordan



Joined: 06 Aug 2007
Posts: 5

Posted: Tue Aug 07, 2007 4:58 pm    Post subject: Remarks / Diagnostic Output?

Hi,

Is there a way to get diagnostic output out of the compiler that reports on things like vectorizing loops? Also can you please point me at the documentation about what -O4 does and how the -O? compares to GCC, ICC and other compilers?

Is there a compatibility mode / wrapper that enables pgcc to understand more of the common switches that GCC and ICC understand?

Finally, my home brewed number crunching app doesn't seem to be running very fast with PGCC. It has a number of tight loops with floating point operations that vectorize cleanly under ICC (and to some extent the latest versions of GCC).

The results I'm getting are:
GCC 3.2.2 -march=pentium3 -mcpu=pentium3 -mmmx -msse -mfpmath=sse -malign-double -fpic -O3 -fno-strict-aliasing -ffast-math -foptimize-register-move -frerun-loop-opt -fexpensive-optimizations -fprefetch-loop-arrays -fomit-frame-pointer -funroll-loops -Wall
3m:17s

PGCC 7.0.7 -tp=piii -Mvect=sse -fpic -O4 -Mfprelaxed -Msingle -Mfcon -Mcache_align -Mflushz -Munroll=c:1 -Mnoframe -Mlre -Mipa=align,arg,const,f90ptr,shape,libc,globals,localarg,ptr,pure
2m:57s

ICC 9.1.051 -march=pentium3 -mcpu=pentium3 -mtune=pentium3 -msse -xK -cxxlib-icc -fpic -O3 -ansi-alias -fp-model fast=2 -rcd -align -Zp16 -ipo -fomit-frame-pointer -funroll-loops -w1 -vec-report3
0m:46s

The machine in question is a Pentium 3, as the optimization flags indicate.

That makes ICC over 4x faster. 40% would be a huge difference. 400% makes me think there might be a problem with some of the compiler switches I am using (I know it's still faster than GCC, but GCC in question is 1) 5 years out of date and 2) GCC is known to be quite bad at producing fast code). Is there a problem with any of the compiler switches I listed above for PGCC? Is there any other option that's worth trying?

Finally, I am finding that -Mscalarsse makes the numbers that fall out of my program wildly out. The differences are as big as the 3rd significant figure, and when multiplying things out this leads to massive errors. The problem is reminiscent of a similar issue with GCC (although the numbers are not as far out on GCC) when -mfpmath=sse,387 is used. Is -Mscalarsse known to cause problems?

Many thanks.
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Tue Aug 07, 2007 9:51 pm

Hi Gordan,
Quote:

Is there a way to get diagnostic output out of the compiler that reports on things like vectorizing loops?
Yes, "-Minfo=all" will give you information about what optimizations are performed and "-Mneginfo=all" lists what optimizations were not performed. See page 98 of the PGI User's guide for more detailed information.
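As a rough illustration of how those two flags might be used, here is a minimal sketch. The file name fit.c and the function add_arrays are made up for this example, and the build line only combines flags mentioned in this thread; the exact remarks the compiler prints are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build line (fit.c is an assumed file name):
 *   pgcc -fast -Minfo=all -Mneginfo=all -c fit.c
 * -Minfo=all should report optimizations that were applied,
 * -Mneginfo=all those that were not.
 */

/* Straight-line float arithmetic with no calls in the loop body:
 * the kind of loop one would look for in the compiler's remarks. */
void add_arrays(float *dst, const float *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i];
}
```

Compiling a small file like this first makes it easier to read the remarks before turning the flags on for a whole project.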

Quote:
Also can you please point me at the documentation about what -O4 does and how the -O? compares to GCC, ICC and other compilers?
Chapter 2, page 19 of the PGI User's Guide details "-O" (aka "-O2") and "-O4".

Quote:
Is there a compatibility mode / wrapper that enables pgcc to understand more of the common switches that GCC and ICC understand?
No. While the three compilers do share some flag names, most optimization names are unique and may not be equivalent.

Quote:
Finally, I am finding that -Mscalarsse makes the numbers that fall out of my program wildly out.
-Mscalarsse is not a valid option since it requires a system with SSE2 support. However, what you're seeing is most likely the result of precision differences between using an 80-bit x87 FPU and a 64-bit SSE FPU. See the PGI FAQ pages for a more detailed explanation.

Quote:
PGCC 7.0.7 -tp=piii -Mvect=sse -fpic -O4 -Mfprelaxed -Msingle -Mfcon -Mcache_align -Mflushz -Munroll=c:1 -Mnoframe -Mlre -Mipa=align,arg,const,f90ptr,shape,libc,globals,localarg,ptr,pure
Start with "-fast", then individually add "-O3", "-Mipa=fast", "-Mvect", "-Msingle", and "-Mfcon" to see how each one affects your code. Another flag to try is "-Msafeptr".

"-O3" enables more aggressive global optimization but does not always result in faster code. IPA mostly helps codes with many function calls, so it may not help you. "-Mvect" enables vectorization; however, vectorization support is limited on a PIII.
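On "-Msafeptr": as far as I understand it, the flag amounts to a promise that pointers do not alias each other, which removes one common obstacle to loop optimization. The sketch below expresses the same promise per pointer with the C99 restrict qualifier. The function scale_add is hypothetical, and this is an analogy for what the flag asserts, not a description of how PGI implements it.

```c
#include <assert.h>
#include <stddef.h>

/* restrict tells the compiler that out and in never overlap, so a
 * store to out[i] cannot change any in[j]. Without that promise the
 * compiler has to assume the arrays might alias. */
void scale_add(float *restrict out, const float *restrict in,
               float k, size_t n)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] += k * in[i];
}
```

The promise is only safe if it is true: calling scale_add with overlapping arrays is undefined behaviour, just as a blanket no-alias flag can break code that really does alias.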

Hope this helps,
Mat
gordan



Joined: 06 Aug 2007
Posts: 5

Posted: Wed Aug 08, 2007 12:58 am

Thanks for the info.

Quote:
Yes, "-Minfo=all" will give you information about what optimizations are performed and "-Mneginfo=all" lists what optimizations were not performed.


OK, now that I can see what the compiler is (not) doing on my benchmark sine curve fitting library:

Why do these not vectorize?

#define SEARCHSPACE float(0.1)
#define MIN_SEARCH float(0.001)

unsigned int i;
static float Grid[4];
//(all other variables apart from iterators are floats, most of them static)

//Loop not vectorized: contains call
for (i = 0; i < 4; i++)
    Grid[i] = float(1) + fabsf(Curve[i] * SEARCHSPACE);
...
//Loop not vectorized: loop count too small
for (i = 0; i < 4; i++)
    if (Grid[i] > MIN_SEARCH)
        Grid[i] *= float(0.5);
...
//Loop not vectorized: loop count too small
for (i = 0; i < 4; i++)
    Curve[i] += BestFit[i];
...
//Loop not vectorized: contains call
for (x = 0, xx = 0; x < LocalDataC; x++)
    LocalDataV[x] += Amplitude * sinf (Frequency * xx++ + OffsetX) + OffsetY;
...
//Loop not vectorized: contains call
for (x = 0; x < LocalDataC; x++)
    CacheSin[x] = sinf (CacheXF[x] + OffsetX);

All of these vectorize using ICC. I would have expected 4-pass loops to vectorize because SSE1 can handle 4 packed floats in a vector.

Regarding the loops containing calls, is this to say that non-trivial maths functions don't vectorize (fabsf, sinf)?

I also see no reports of _anything_ getting vectorized in my code. I see reports that loops are unrolled, but not that the same loops are vectorized (not being told they're not vectorized, either).

Quote:
-Mscalarsse is not a valid option since it requires a system with SSE2 support. However, what you're seeing is most likely the result of precision differences between using an 80-bit x87 FPU and a 64-bit SSE FPU.


I accept this is possible, but I think it's unlikely - I don't need that much FP precision in my code. I only use floats, not doubles, so the magnitude of the error surprises me. And shouldn't the compiler silently ignore -Mscalarsse, or at least throw up a warning, when -tp=piii is given, if the two are incompatible? Just generating broken code seems like a poor default...

Quote:
Start with "-fast", then individually add "-O3", "-Mipa=fast", "-Mvect", "-Msingle", and "-Mfcon" to see how each one affects your code. Another flag to try is "-Msafeptr".
...
"-Mvect" enables vectorization however vectorization support is limited on a PIII.


I effectively started with -fast -Mipa=fast, but I broke it down into the specific optimizations it performs so I could switch them in and out one by one if something breaks (such as -Mscalarsse, for example).
-Msafeptr had no effect on performance.
I am aware of P3's limitations regarding SSE but since all my operations are on single precision floats, I don't see why it wouldn't work.
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Wed Aug 08, 2007 10:41 am

Hi Gordan,

For small loops it's better to unroll than to vectorize. For loops which contain single precision functions such as expf, logf, sinf, cosf, powf, etc., we do not currently have versions of these functions that will vectorize (it takes specially coded assembly versions instead of the libm versions). However, we do vectorize double precision versions on 64-bit systems. This is a known deficiency for which we have an open RFE.
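To make the unrolling point concrete, here is a minimal sketch based on the 4-iteration Grid loop from the post above, written once as a loop and once fully unrolled by hand. The function names are made up for illustration and MIN_SEARCH is written as a plain C float constant; it shows what unrolling does to such a loop, not what the compiler actually emits.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_SEARCH 0.001f

/* Loop form, as in the original post. */
void halve_loop(float grid[4])
{
    assert(grid != NULL);
    for (unsigned int i = 0; i < 4; i++)
        if (grid[i] > MIN_SEARCH)
            grid[i] *= 0.5f;
}

/* Fully unrolled form: no loop counter, no loop branch, and the four
 * independent updates are visible to the scheduler at once. */
void halve_unrolled(float grid[4])
{
    assert(grid != NULL);
    if (grid[0] > MIN_SEARCH) grid[0] *= 0.5f;
    if (grid[1] > MIN_SEARCH) grid[1] *= 0.5f;
    if (grid[2] > MIN_SEARCH) grid[2] *= 0.5f;
    if (grid[3] > MIN_SEARCH) grid[3] *= 0.5f;
}
```

With a trip count this small, the setup cost of a vector loop can outweigh the work saved, which is presumably why the compiler reports "loop count too small" and unrolls instead.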

As for your precision issue, all intermediary calculations on an x87 will be performed in 80-bit precision, even on single precision variables. You might try compiling with "-pc 32", which sets the x87 FPU to use only 32-bit precision, and see how it affects your answers.
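The effect of intermediate precision can be sketched in portable C by accumulating the same float value once in a float and once in a wider type. The function names are hypothetical, and long double only stands in for the x87 80-bit format (its real width depends on the platform); this illustrates the mechanism and is not a diagnosis of the program discussed here.

```c
#include <assert.h>

/* Every partial sum is rounded back to single precision. volatile
 * forces a store on each step, so the value cannot be kept in a
 * wider register between iterations. */
float sum_in_float(float x, long n)
{
    assert(n >= 0);
    volatile float acc = 0.0f;
    for (long i = 0; i < n; i++)
        acc = acc + x;
    return acc;
}

/* Partial sums are kept in a wider type, loosely mimicking x87
 * behaviour where intermediates stay in extended precision. */
long double sum_in_extended(float x, long n)
{
    assert(n >= 0);
    long double acc = 0.0L;
    for (long i = 0; i < n; i++)
        acc += x;
    return acc;
}
```

Calling both with x = 0.1f and several million iterations should show the two sums drifting well apart, even though every input is a plain float.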

- Mat
All times are GMT - 7 Hours