PGI User Forum

Severe Problem with PGF90

 
SkyWombat
Joined: 02 Aug 2005
Posts: 3

Posted: Tue Aug 02, 2005 8:58 am    Post subject: Severe Problem with PGF90

Hi All,

During the migration of a stellar atmosphere code from 32-bit Linux to an Opteron cluster, we encountered severe problems. Previously we used the Intel ifort compiler without any problems, but on the Opterons only a PGI compiler is available. The code package we have to compile is a mixture of F77 and F90/95 files.

The problem is that the code compiles fine, but at runtime the values of variables that aren't even touched change!
E.g. NATOM, an integer variable in an F77 file, is set to some value and printed:
NATOM 4
After several lines in which this variable is not used, another print gives:
NATOM 538976307
This of course causes the code to crash, because right after this print NATOM is used as an index to get a value from an array. We worked around this by saving the value in a copy, NATOMSAV, and copying it back and forth, which is not desirable, but works.

Now the real problem:
I have an F77 subroutine:
Code:

      SUBROUTINE LEVSEQ_WIND (NATOM,ATW,NION,NL,LTE,IZ,XNUION,XNULTE,NLALL, 
     1                   LTEALL,NIONALL,LTESQ,LSQ,ISQ,IASQ,G,GLTE,     
     1                   LPARIN,IGRUND,EPS)                             

which relies on implicit typing. Inside this subroutine the integer variable NLALL is set to a value, which can be verified with another print:
LEVSEQ_WIND: nlall 217
This subroutine is called from an F90 subroutine with an ordinary call:
Code:

 call LEVSEQ_WIND(NATOM,ATW,NION,NL,LTE,IZ,XNUION,XNULTE,NLALL,LTEALL, &
         NIONALL,LTESQ,LSQ,ISQ,IASQ,G,GLTE,LPARIN,IGRUND,EPS)
    print *, "INIT_ATOM: before ATOMWRK"
    print *, "INIT_ATOM: nlall,lteall,nionall",nlall,lteall,nionall

The print shown here yields:
INIT_ATOM: nlall,lteall,nionall 0 16 858867016
None of the three values is what it was set to in LEVSEQ_WIND!

So my question is: where is the problem here? Is this a compiler bug?
The whole code works without any problems when compiled with the Intel compiler.
Or is there another workaround?
Thanks in advance

Daniel
mkcolg
Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Tue Aug 02, 2005 12:45 pm

Hi Daniel,

Most likely, some other part of your code is stomping on natom's memory and changing its value, but the error doesn't become evident until natom is accessed. The question is why this occurs when you compile with pgf90 on a 64-bit system but not with the ifort build on a 32-bit system. Some possible reasons are:

1) A compiler bug with pgf90.
2) The program contains a bug which is only exposed on this platform.
Uninitialized memory reads (UMR), array-bounds violations, argument
mismatches, etc. can go undetected depending upon how memory is laid out.
3) It's a porting issue between 32 and 64-bits.
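One common source of the "argument mismatch" class of bug is calling an implicitly typed F77 routine from F90 code without an explicit interface, in which case the compiler cannot check the argument list at all. The sketch below is hypothetical: `levseq_demo` is a made-up, three-argument stand-in for a routine like LEVSEQ_WIND, and the module name `levseq_iface` is likewise invented. It is written in free form so that both pieces fit in one file.

```fortran
! Hypothetical stand-in for an F77-style routine such as LEVSEQ_WIND.
! There is no IMPLICIT NONE: NL and NLALL start with N, so they are
! default INTEGER; EPS starts with E, so it is default REAL.
subroutine levseq_demo(nl, nlall, eps)
  dimension nl(3)
  nlall = nl(1) + nl(2) + nl(3)
  eps = 1.0e-5
end subroutine levseq_demo

! Explicit interface for the routine above. F90 code that USEs this
! module gets the argument count, types, and kinds of every call to
! levseq_demo checked at compile time.
module levseq_iface
  implicit none
  interface
     subroutine levseq_demo(nl, nlall, eps)
       integer :: nl(3)
       integer :: nlall
       real    :: eps
     end subroutine levseq_demo
  end interface
end module levseq_iface
```

An F90 caller then writes `use levseq_iface` and calls `levseq_demo(nl, nlall, eps)` as before. If it passed a REAL for NLALL or dropped an argument, the call would be rejected at compile time instead of silently reading or writing the wrong memory. The interface has to be kept in sync with the real routine and compiled with the same type-promotion flags (such as -r8); otherwise it just moves the mismatch somewhere else.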

The first thing to try is to compile the program as 32-bit code by adding the "-tp k8-32" flag. If this works, then it's a porting issue.

Next, try compiling without any optimization and with some debugging flags: "-O0 -Mbounds -Mchkfpstk -Mchkptr". If it still fails, then it's most likely a problem with the code. If it works, then it's more likely a compiler bug; continue adding optimization ("-O1", "-O2", "-fast", "-fastsse") until it fails. Also, please send a report to trs@pgroup.com including the code or an example of the failure.

If you cannot send us the code and it does work at "-O0", try doing a 'binary search' to determine where the actual error is occurring. First compile half the code with optimization and the other half without. If it runs correctly, add optimization to half of the remaining non-optimized files, then recompile and rerun (or start removing the optimization if it still fails). Continue this pattern of adding or removing optimization until you are able to isolate the file which causes the error. The workaround would be to compile this file without optimization and the others with optimization. Of course, we would still like you to send in a report so we can get the bug fixed.
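The 'binary search' over source files is an ordinary bisection. The sketch below is a hypothetical helper, not a PGI tool: `isolate_bad_file` and the callback `fails(lo, hi)` are invented names, and the callback is assumed to rebuild the program with only files lo..hi optimized (everything else at -O0), rerun it, and report whether the wrong result appears. It also assumes exactly one culprit file that breaks whenever it is optimized. This is a slight variant of the procedure above, which grows the optimized set step by step instead of optimizing only the candidate half.

```fortran
! Hypothetical bisection helper: find the single source file whose
! optimized build breaks the program. Files are numbered 1..nfiles.
module bisect_mod
  implicit none
  abstract interface
     ! Assumed callback: rebuild with files lo..hi optimized and all
     ! others at -O0, rerun, and return .true. if the run goes wrong.
     logical function build_fails(lo, hi)
       integer, intent(in) :: lo, hi
     end function build_fails
  end interface
contains
  integer function isolate_bad_file(nfiles, fails) result(bad)
    integer, intent(in) :: nfiles
    procedure(build_fails) :: fails
    integer :: lo, hi, mid
    ! Invariant: the culprit file is always somewhere in lo..hi.
    lo = 1
    hi = nfiles
    do while (lo < hi)
       mid = (lo + hi) / 2
       if (fails(lo, mid)) then
          hi = mid          ! failure reproduced: culprit is in lo..mid
       else
          lo = mid + 1      ! lower half is clean: culprit is in mid+1..hi
       end if
    end do
    bad = lo
  end function isolate_bad_file
end module bisect_mod
```

Each call to `fails` is a full rebuild and run, so isolating one file out of n takes about log2(n) builds, for example at most 7 for 100 files.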

Hope this helps,
Mat
SkyWombat
Joined: 02 Aug 2005
Posts: 3

Posted: Tue Aug 02, 2005 2:36 pm

Hi Mat,

First of all, thanks for your fast reply. Unfortunately none of the solutions you suggested work.

The "-tp k8-32" flag gives a straight segmentation fault, and compilation without any optimisation does not work either. I used the same flags as you suggested, but added a -r8 flag, which is needed for compatibility between some of the routines used.

Another remark: it is unlikely to be a porting issue between 32 and 64 bits; rather, it is a porting issue between ifort and pgf, because pgf on a 32-bit Linux machine gives the same error. Anyhow, I don't understand this problem, nor why a variable keeps the value it had before a subroutine call, even though the call should have changed it (and, inside the subroutine, did change it).

Do you have any suggestions on this one?

Thanks

Daniel
mkcolg
Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Tue Aug 02, 2005 3:21 pm

Hi Daniel,

While it could still be a compiler bug, it's sounding more and more like a problem with the program. Different compilers lay out memory differently, so if, for example, your code was writing off the end of an array, with Intel that memory might be unused, but with pgf90 it might hold the value of natom. Of course, this is just speculation; however, the fact that the 32-bit version seg faults might help.

Compile the program with "-O0 -g -tp k8-32 -r8" (or "-tp piv" on a P4) and run the program under the PGI debugger pgdbg or gdb. If you can determine why the seg fault occurs, there's a good chance that the same problem is causing the 64-bit error.

If this doesn't seem to help, is this code that you could send us? We're happy to take a look and see what we can determine.

- Mat
SkyWombat
Joined: 02 Aug 2005
Posts: 3

Posted: Fri Aug 05, 2005 1:33 am

Hi Mat,

Thanks for your help. While the "-O0 -g -tp k8-32 -r8" flags still produced a segfault on the Opteron, I was able to debug the code with the help of the IFC and NAG compilers. Now the problem seems to be fixed.

Ok, thanks again

Daniel
All times are GMT - 7 Hours