PGI User Forum

Results depend on compiled binary location
 
Ignacio Fdez. Galvan



Joined: 22 Aug 2013
Posts: 10

Posted: Wed Mar 12, 2014 7:40 am    Post subject: Results depend on compiled binary location

I'm having a problem with a program where the results I get are dependent on the location of the program. The binary files are exactly the same, just copied from one path to the other. There is in principle nothing that should cause this, but the program is a messy monster, quite hard to debug (and I haven't been able to create a simple test case so far).

The differences in the results are usually small, if any, but sometimes enough to cause the internal stability tests to fail. So far, I have only detected these differences with optimization level 2 or higher, and only with the PGI compilers (pgfortran in particular), so I suspect this may be a compiler bug or misconfiguration.

Apparently, when I have the program in a path with a short length, it works fine (meaning the results agree with what I get with other compilers), but if I copy the program to a longer path, sometimes the results are different.

Obviously, this is not enough information to solve the problem. But I was wondering if there is any known issue that could be causing this, or if anyone has encountered similar problems. I'm using pgi 13.7-0.
mkcolg



Joined: 30 Jun 2004
Posts: 6124
Location: The Portland Group Inc.

Posted: Wed Mar 12, 2014 11:02 am

Hi Ignacio Fdez. Galvan,

My best guess is that there's some type of memory issue with the program (like a UMR, an uninitialized memory read) and that where the binary is located perturbs this just enough to cause the errors to appear. Granted, I have no way of knowing if this is indeed the case, but whenever I encounter odd behavior that doesn't make sense, it's the first thing I check.

Can you please try running your program under Valgrind (www.valgrind.org) and see if it finds any memory problems? I'd try it a couple of different ways: with and without optimization, and in both locations.

It's possible that there's a compiler optimization issue or there could be a subtle program issue that only gets exposed with a particular optimization. Let's see if Valgrind finds anything, then go from there.

- Mat
Ignacio Fdez. Galvan



Joined: 22 Aug 2013
Posts: 10

Posted: Thu Aug 21, 2014 7:55 am

Thank you for your suggestion. I have been trying to track down this problem and it is quite elusive. I tried using valgrind, but it complains about an unrecognized instruction. When I compile with -tp=x64, valgrind doesn't complain, but then the bug does not appear. That's a hint.

Another hint. In the full program, the first place I could find where the problem appears was just a LAPACK call:

Code:
call dsygv(1,'V','L',n,Tr,n,Bk,n,Work(iW),Work(itmp),lwork,info)


I checked that all the arguments are exactly the same on input, but the output is different depending on where the executable sits (I checked by writing to unformatted files and comparing those). This is with a stock LAPACK/BLAS suite, freshly downloaded from http://www.netlib.org/ and compiled with pgfortran. However, I have been so far unable to create a stand-alone minimal test.
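For readers who want to repeat this kind of check, here is a minimal sketch of dumping an array to an unformatted file and comparing two dumps bit for bit. It is not the code used in this thread: the file names `run_short.bin` and `run_long.bin`, the routine names and the four-element stand-in data are all hypothetical, and stream access is assumed so that the files contain no record markers.

```fortran
program compare_dumps
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: n = 4
  real(dp) :: a(n), b(n)

  ! Stand-in data (hypothetical): b differs from a by one unit
  ! in the last place of a single element.
  a = (/ 1.0_dp, 2.0_dp, 3.0_dp, 4.0_dp /)
  b = a
  b(3) = nearest(b(3), 1.0_dp)

  call dump('run_short.bin', a)
  call dump('run_long.bin', b)
  call compare('run_short.bin', 'run_long.bin', n)

contains

  ! Write the raw bytes of x to a file (no record markers).
  subroutine dump(fname, x)
    character(len=*), intent(in) :: fname
    real(dp), intent(in) :: x(:)
    open(unit=10, file=fname, form='unformatted', access='stream', &
         status='replace', action='write')
    write(10) x
    close(10)
  end subroutine dump

  ! Read size(x) values back from a file written by dump.
  subroutine load(fname, x)
    character(len=*), intent(in) :: fname
    real(dp), intent(out) :: x(:)
    open(unit=10, file=fname, form='unformatted', access='stream', &
         status='old', action='read')
    read(10) x
    close(10)
  end subroutine load

  ! Report every element whose bit pattern differs between the files.
  subroutine compare(f1, f2, m)
    character(len=*), intent(in) :: f1, f2
    integer, intent(in) :: m
    real(dp) :: x(m), y(m)
    integer :: i, ndiff
    call load(f1, x)
    call load(f2, y)
    ndiff = 0
    do i = 1, m
      ! Comparing the integer bit patterns catches one-ulp changes.
      if (transfer(x(i), 0_i8) /= transfer(y(i), 0_i8)) then
        ndiff = ndiff + 1
        print '(a,i0,2(1x,es24.16))', 'element ', i, x(i), y(i)
      end if
    end do
    print '(a,i0)', 'differing elements: ', ndiff
    print '(a,es10.3)', 'max |x - y| = ', maxval(abs(x - y))
  end subroutine compare

end program compare_dumps
```

In a real investigation, `dump` would be called just before and just after the suspect routine in each copy of the executable, and `compare` would be run on the resulting pairs of files.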

Final hint. The bug disappears if I compile just the BLAS routines with -O0 (everything else with -O2).

Any suggestion and help for further debugging this would be appreciated.
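As a possible starting point for a stand-alone test, here is a minimal sketch of a driver that makes the same kind of `dsygv` call and dumps the results. The matrices are synthetic placeholders rather than the `Tr` and `Bk` from the real program, the output file name `dsygv_out.bin` is hypothetical, and the program must be linked against LAPACK and BLAS. In practice the placeholder data would be replaced by the inputs dumped from the full program.

```fortran
program dsygv_driver
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 3
  integer, parameter :: lwork = 3*n - 1   ! minimum workspace for dsygv
  real(dp) :: a(n,n), b(n,n), w(n), work(lwork)
  integer :: i, j, info
  external :: dsygv

  ! Placeholder inputs (assumption): A symmetric, B symmetric
  ! positive definite. Replace with the dumped matrices.
  do j = 1, n
    do i = 1, n
      a(i,j) = 1.0_dp / real(i + j - 1, dp)
      b(i,j) = 0.0_dp
    end do
    b(j,j) = real(j, dp)
  end do

  ! Same argument pattern as the call in the post: problem type 1
  ! (A*x = lambda*B*x), eigenvectors requested, lower triangles used.
  call dsygv(1, 'V', 'L', n, a, n, b, n, w, work, lwork, info)
  if (info /= 0) then
    print '(a,i0)', 'dsygv failed, info = ', info
    stop 1
  end if

  ! Dump eigenvalues and eigenvectors so that runs from different
  ! locations can be compared byte for byte.
  open(unit=10, file='dsygv_out.bin', form='unformatted', &
       access='stream', status='replace', action='write')
  write(10) w, a
  close(10)

  print '(a,3(1x,es24.16))', 'eigenvalues:', w
end program dsygv_driver
```

Running the same binary from a short and a long path and comparing the two output files would show whether the difference can be reproduced outside the full program.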
mkcolg



Joined: 30 Jun 2004
Posts: 6124
Location: The Portland Group Inc.

Posted: Thu Aug 21, 2014 3:03 pm

Hi Ignacio Fdez. Galvan,

From what you describe, I'd say AVX or FMA instructions could be the issue; however, it makes no sense why moving the binary's location would cause this.

A couple of things to try:

1) Use the BLAS libraries that ship with the compilers, i.e. "-lblas" or "-lacml".

2) Add "-Mnofma" and/or "-Mvect=nosimd" to your compile flags.

Note that the valgrind error is because it doesn't understand AVX instructions. Newer versions of valgrind have been updated to support AVX.
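As general background on why SIMD code generation can matter for results at all, here is a minimal sketch showing that the order of additions in a floating-point reduction affects the rounded sum. It is an illustration only, not a diagnosis of the problem in this thread. The four-element data set and the two-lane split are assumptions chosen to make the effect visible, and the inner loop only imitates what a vectorized reduction does.

```fortran
program sum_order
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 4, lanes = 2
  real(dp) :: x(n), partial(lanes), s_seq, s_vec
  integer :: i, k

  ! Illustrative data (assumption): large values that cancel,
  ! mixed with small values that are easily lost to rounding.
  x = (/ 1.0e16_dp, 1.0_dp, -1.0e16_dp, 1.0_dp /)

  ! Scalar-style reduction: add the elements strictly left to right.
  s_seq = 0.0_dp
  do i = 1, n
    s_seq = s_seq + x(i)
  end do

  ! SIMD-style reduction: each lane accumulates every lanes-th
  ! element, and the lane totals are combined at the end.
  partial = 0.0_dp
  do i = 1, n, lanes
    do k = 1, lanes
      partial(k) = partial(k) + x(i + k - 1)
    end do
  end do
  s_vec = sum(partial)

  print '(a,es24.16)', 'left-to-right sum: ', s_seq
  print '(a,es24.16)', 'two-lane sum:      ', s_vec
end program sum_order
```

With IEEE double precision and no reassociation by the compiler, the left-to-right sum is 1 (the first small term is rounded away when added to 1.0e16) while the two-lane sum is 2. An optimizing compiler may reorder either loop, so the printed values can depend on the flags used.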

- Mat
frnkyl004



Joined: 06 Dec 2011
Posts: 50

Posted: Thu Aug 21, 2014 11:37 pm

Hi Mat/Ignacio Fdez. Galvan,

In my experience, even the latest version of valgrind (3.9.0) isn't happy with all sandybridge instructions and I've found that compiling with -tp=nehalem-64 and running on a sandybridge machine allows valgrind to work while still using AVX. @Ignacio Fdez. Galvan: hopefully -tp=nehalem-64 will keep the bug and allow valgrind to work.

Hope this helps,
Kyle
All times are GMT - 7 Hours
Page 1 of 2