PGI User Forum

longer execution time in PGCC 6.0.5 than PGCC 5.2.4

 
shingo



Joined: 31 Jul 2004
Posts: 2

Posted: Sun Jul 24, 2005 1:59 am    Post subject: longer execution time in PGCC 6.0.5 than PGCC 5.2.4

Hi,

I compiled a benchmark program with pgcc 6.0.5 and pgcc 5.2.4
http://w3cic.riken.go.jp/HPC/HimenoBMT/Load_module/cc_himenoBMTxp_l.lzh
on a machine with two AMD Opteron 250 CPUs and ran it.

% pgcc -fastsse -Mconcur -DLARGE himenombmtxps.c

The benchmark's own measurement shows 1364 Mflops for pgcc 6.0.5
versus 1653 Mflops for pgcc 5.2.4, so the 5.2.4 build is about 20% faster. If I use only a single CPU

% pgcc -fastsse -DLARGE himenombmtxps.c

both builds run at about 1160 Mflops, within several percent of each other.

I tested several compiler options described in the User's Guide, but I could not
get the benchmark compiled with pgcc 6.0.5 to run as fast as with 5.2.4.

Do you know why this benchmark program, compiled with -Mconcur by pgcc 6.0.5,
is significantly slower than when compiled by pgcc 5.2.4?
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Mon Jul 25, 2005 12:29 pm

Hi Shingo,

Thank you for the report. I was able to reproduce the issue here and isolate the problem. With the 6.0 compilers we added an optimization that better recognizes idioms. Although this optimization helps most codes, in your case it causes the loop at line 223 to no longer parallelize, since it now contains a call to the "memcopy" idiom. (The compiler won't parallelize loops that contain function calls.)
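
To illustrate the kind of transformation involved, here is a minimal sketch. It is a hypothetical example, not the benchmark's source and not the compiler's actual output: the function names `copy_loops` and `copy_idiom` and the array sizes are made up. The first function is a plain element-by-element copy nest. The second shows what the nest looks like once the contiguous inner loop has been replaced by hand with a library copy call, which puts a function call inside the outer loops.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes, chosen only to keep the example small. */
#define NI 4
#define NJ 4
#define NK 8

/* Element-by-element copy. The innermost loop walks contiguous memory,
   so an idiom recognizer could treat it as a block copy. */
void copy_loops(float dst[NI][NJ][NK], float src[NI][NJ][NK])
{
    for (size_t i = 0; i < NI; i++)
        for (size_t j = 0; j < NJ; j++)
            for (size_t k = 0; k < NK; k++)
                dst[i][j][k] = src[i][j][k];
}

/* The same copy with the inner loop replaced by hand with memcpy.
   The outer loops now contain a function call. */
void copy_idiom(float dst[NI][NJ][NK], float src[NI][NJ][NK])
{
    for (size_t i = 0; i < NI; i++)
        for (size_t j = 0; j < NJ; j++)
            memcpy(dst[i][j], src[i][j], sizeof dst[i][j]);
}
```

Both functions produce the same result; the difference is only in whether the loop body is a simple assignment or a call.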

As part of our current work on auto-parallelization, we have addressed this problem and will have a fix in the 6.1 release. For now, however, you can add the xflag "-Mx,8,0x8000000" to the compilation to remove the idiom. With the xflag, I see the MFlops increase from 1413 to 2235. Xflags can change from release to release, so you should only use this workaround with the 6.0 compilers and this particular benchmark.

FYI, to determine which loops are and are not parallelized, add the flags "-Minfo -Mneginfo=concur" when using "-Mconcur".

Thanks,
Mat
shingo



Joined: 31 Jul 2004
Posts: 2

Posted: Tue Jul 26, 2005 6:01 pm

Thank you for the quick fix.
Another observation about the current PGCC 6.0.5 with the -Mconcur option:
without the -Mx,8,0x8000000 flag you suggested,

% pgcc -Mconcur -DLARGE

runs faster by 10 % than

% pgcc -fastsse -Mconcur -DLARGE

for the same benchmark program. The -fastsse option does not always help, and sometimes seems to slow down execution.

Shingo
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Wed Jul 27, 2005 11:28 am

Hi Shingo,

It appears that cache alignment (-Mcache_align) is causing the problem. Try compiling with "-fast -Mvect=sse" which is -fastsse without -Mcache_align.

"-fastsse" is an aggregate flag composed of the optimizations that help most codes. In some cases, however, certain optimizations can hurt performance. If you notice such a case, try breaking up an aggregate flag into its components to determine which optimizations help and which hurt. To get the component list, use the "-help" flag along with the aggregate flag. Note that the specific component flags can change between releases.

Example:

Code:
% pgcc -help -fastsse
Reading rcfile /usr/pgi/linux86-64/6.0/bin/.pgccrc
-fastsse            == -fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz
-fast               Common optimizations: -O2 -Munroll=c:1 -Mnoframe -Mlre
-M[no]vect[=[no]altcode|[no]assoc|cachesize:<c>|[no]idiom|levels:<n>|nosizelimit|prefetch|[no]recog|smallvect:<n>|[no]sse|[no]transform]
                    Control automatic vector pipelining
    [no]assoc       Allow [disallow] reassociation
    cachesize:<c>   Optimize for cache size c
    [no]idiom       Enable [disable] idiom recognition
    prefetch        Generate prefetch instructions
    [no]sse         Generate [don't generate] SSE instructions
-M[no]scalarsse     Generate scalar sse code with xmm registers; implies -Mflushz
-Mcache_align       Align long objects on cache-line boundaries
-M[no]flushz        Set SSE to flush-to-zero mode
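
When comparing flag sets one component at a time, a small self-timing kernel alongside the full benchmark can make the comparison quicker. The following is a minimal sketch under stated assumptions, not part of the Himeno benchmark: the names `wall_seconds`, `mflops`, and `axpy` and the `DEMO_MAIN` macro are made up for illustration, and `gettimeofday` assumes a POSIX system. It measures wall-clock time, because CPU time is summed across threads and would be misleading for a build using -Mconcur.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>   /* gettimeofday: POSIX, assumed available */

/* Wall-clock time in seconds. */
double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/* MFLOPS = floating-point operations / elapsed seconds / 1e6. */
double mflops(double flop_count, double seconds)
{
    return flop_count / seconds / 1.0e6;
}

/* Toy kernel: y[i] += a * x[i], counted as 2 flops per element.
   Returns the number of floating-point operations performed. */
double axpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
    return 2.0 * (double)n;
}

#ifdef DEMO_MAIN   /* hypothetical macro: define it to build the demo driver */
int main(void)
{
    enum { N = 1 << 20, REPS = 200 };
    float *x = malloc(N * sizeof *x);
    float *y = malloc(N * sizeof *y);
    if (!x || !y)
        return 1;
    for (size_t i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 0.0f;
    }

    double flops = 0.0;
    double t0 = wall_seconds();
    for (int r = 0; r < REPS; r++)
        flops += axpy(N, 0.5f, x, y);
    double t1 = wall_seconds();

    /* Print y[0] as well so the compiler cannot discard the work. */
    printf("y[0] = %g, %.1f MFLOPS\n", y[0], mflops(flops, t1 - t0));
    free(x);
    free(y);
    return 0;
}
#endif
```

Compiled with -DDEMO_MAIN plus each candidate flag set (for example "-fast -Mvect=sse" with and without -Mcache_align), the printed figures can be compared from run to run. A toy kernel like this will not necessarily react to a flag the same way the real benchmark does, so it is only a first check.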



- Mat
All times are GMT - 7 Hours