PGI User Forum

invalid fortran results - NaN - openmpi 1.6.4 & pgi13.6
QuesarVII



Joined: 13 Sep 2004
Posts: 6

Posted: Thu Jun 13, 2013 2:55 pm    Post subject: invalid fortran results - NaN - openmpi 1.6.4 & pgi13.6

Hi all,

I'm having a problem on a Debian cluster with openmpi and PGI compilers. The cluster has Debian 6 installed on it.

Fortran-based code seems to end up giving NaN results (reported by an end user). We used the LU test from the NAS tests (NPB3.2-MPI) to confirm this behavior and got this output in the verification step:

Verification being performed for class A
Accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
1 NaN 0.7790210760669E+03 NaN
2 NaN 0.6340276525969E+02 NaN
3 NaN 0.1949924972729E+03 NaN
4 NaN 0.1784530116042E+03 NaN
5 NaN 0.1838476034946E+04 NaN
Comparison of RMS-norms of solution error
FAILURE: 1 NaN 0.2996408568547E+02 NaN
FAILURE: 2 NaN 0.2819457636500E+01 NaN
FAILURE: 3 NaN 0.7347341269877E+01 NaN
FAILURE: 4 NaN 0.6713922568778E+01 NaN
FAILURE: 5 NaN 0.7071531568839E+02 NaN
Comparison of surface integral
FAILURE: NaN 0.2603092560489E+02 NaN
Verification failed


If I build with gcc instead of pgi it works and validates.

Open MPI 1.6.4 was built with CC=pgcc, CXX=pgCC, F77=pgf77, F90=pgf90, CFLAGS="-tp=piledriver-64 -O3", and FFLAGS, CXXFLAGS, and FCFLAGS set the same as CFLAGS. I also tried building without specifying -tp=piledriver and with -O2 instead of -O3. Neither change helped.

What is going on here? What additional info should I provide to help diagnose this?

Thanks,
Rick
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Thu Jun 13, 2013 5:26 pm

Hi Rick,

Hmm, I just ran LU with CLASS A using NPB3.3-MPI last week without issue. Granted, I was using a different setup than you. While I'm out of time for today, I'll run it again tomorrow on a Piledriver system using OpenMPI and see if I can recreate your error.

If you could try compiling at "-O0", I'd appreciate it. Also, what compiler version are you using, how many MPI processes are you using, and did you set the number of processes at build time?

Thanks,
Mat
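
The details requested above (compiler version, MPI version, process or thread count) can be gathered in one pass. The following is only a minimal sketch: the helper names `report_tool` and `collect_build_info` are hypothetical, and it assumes that `pgf77 -V` and `mpirun --version` print version banners on the system in question. `OMP_NUM_THREADS` is the standard OpenMP thread-count variable and only matters for the OpenMP build.

```shell
#!/bin/sh
# report_tool NAME [ARGS...]: run NAME with ARGS and show the first lines
# of its output, or print a "not found" line if NAME is not on PATH.
# (Hypothetical helper, not part of NPB, PGI, or Open MPI.)
report_tool() {
    tool="$1"
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $tool $*"
        "$tool" "$@" 2>&1 | head -n 5
    else
        echo "== $tool: not found"
    fi
}

# collect_build_info: print the facts asked for in the reply above.
collect_build_info() {
    report_tool uname -srm          # kernel and architecture
    report_tool pgf77 -V            # assumed: PGI prints its version with -V
    report_tool mpirun --version    # assumed: Open MPI launcher version
    echo "== OMP_NUM_THREADS=${OMP_NUM_THREADS:-unset}"
}
```

Running `collect_build_info > buildinfo.txt` on each machine gives a file that can be pasted into a reply or compared between a failing and a working system.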
QuesarVII



Joined: 13 Sep 2004
Posts: 6

Posted: Fri Jun 14, 2013 9:42 am    Post subject: O0 didn't help, and I have a simpler way to reproduce now

Hi,

I'm using the latest 13.6 version of the compilers. The system originally had 13.2 but I updated to 13.6 to try resolving the problem before posting this.

Recompiling MPI and LU using -O0 didn't help anything.

I then tried using the OpenMP version of the NAS tests instead of the MPI version. I had the same NaN results:

microway@master:~/NPB3.2/NPB3.2-OMP$ make LU CLASS=A
============================================
= NAS PARALLEL BENCHMARKS 3.2 =
= OpenMP Versions =
= F77/C =
============================================

cd LU; make CLASS=A
make[1]: Entering directory `/home/microway/NPB3.2/NPB3.2-OMP/LU'
make[2]: Entering directory `/home/microway/NPB3.2/NPB3.2-OMP/sys'
cc -o setparams setparams.c
make[2]: Leaving directory `/home/microway/NPB3.2/NPB3.2-OMP/sys'
../sys/setparams lu A
pgf77 -c -O0 lu.f
pgf77 -c -O0 read_input.f
pgf77 -c -O0 domain.f
pgf77 -c -O0 setcoeff.f
pgf77 -c -O0 setbv.f
pgf77 -c -O0 exact.f
pgf77 -c -O0 setiv.f
pgf77 -c -O0 erhs.f
pgf77 -c -O0 ssor.f
pgf77 -c -O0 rhs.f
pgf77 -c -O0 l2norm.f
pgf77 -c -O0 jacld.f
pgf77 -c -O0 blts.f
pgf77 -c -O0 jacu.f
pgf77 -c -O0 buts.f
pgf77 -c -O0 error.f
pgf77 -c -O0 pintgr.f
pgf77 -c -O0 verify.f
cd ../common; pgf77 -c -O0 print_results.f
cd ../common; pgf77 -c -O0 timers.f
cd ../common; pgcc -c -O -o wtime.o ../common/wtime.c
pgf77 -O -o ../bin/lu.A lu.o read_input.o domain.o setcoeff.o setbv.o exact.o setiv.o erhs.o ssor.o rhs.o l2norm.o jacld.o blts.o jacu.o buts.o error.o pintgr.o verify.o ../common/print_results.o ../common/timers.o ../common/wtime.o
make[1]: Leaving directory `/home/microway/NPB3.2/NPB3.2-OMP/LU'
microway@master:~/NPB3.2/NPB3.2-OMP$ cd bin/
microway@master:~/NPB3.2/NPB3.2-OMP/bin$ ./lu.A


NAS Parallel Benchmarks (NPB3.2-OMP) - LU Benchmark

Size: 64x 64x 64
Iterations: 250

Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Time step 220
Time step 240
Time step 250

Verification being performed for class A
Accuracy setting for epsilon = 0.1000000000000E-07
Comparison of RMS-norms of residual
FAILURE: 1 NaN 0.7790210760669E+03 NaN
FAILURE: 2 NaN 0.6340276525969E+02 NaN
FAILURE: 3 NaN 0.1949924972729E+03 NaN
FAILURE: 4 NaN 0.1784530116042E+03 NaN
FAILURE: 5 NaN 0.1838476034946E+04 NaN
Comparison of RMS-norms of solution error
FAILURE: 1 NaN 0.2996408568547E+02 NaN
FAILURE: 2 NaN 0.2819457636500E+01 NaN
FAILURE: 3 NaN 0.7347341269877E+01 NaN
FAILURE: 4 NaN 0.6713922568778E+01 NaN
FAILURE: 5 NaN 0.7071531568839E+02 NaN
Comparison of surface integral
FAILURE: NaN 0.2603092560489E+02 NaN
Verification failed

I think this is a simpler case to reproduce.
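
When the same benchmark is rerun on several systems, a saved run log can be classified mechanically instead of by eye. The sketch below is an illustration only: `classify_npb_log` is a hypothetical helper, and the strings it searches for are taken from the output quoted in this thread, so other NPB versions may word them differently.

```shell
#!/bin/sh
# classify_npb_log FILE: print a one-word verdict for a saved NPB run log.
#   pass    - the log contains "Verification Successful"
#   nan     - no success line, and at least one NaN appears in the log
#   fail    - "Verification failed" without any NaN
#   unknown - none of the above (for example, the run was cut short)
# (Hypothetical helper; the match strings come from the logs in this thread.)
classify_npb_log() {
    log="$1"
    if grep -q 'Verification Successful' "$log"; then
        echo "pass"
    elif grep -q 'NaN' "$log"; then
        echo "nan"
    elif grep -q 'Verification failed' "$log"; then
        echo "fail"
    else
        echo "unknown"
    fi
}
```

A possible use is `./lu.A > lu.A.log 2>&1` followed by `classify_npb_log lu.A.log`. Separating "nan" from "fail" distinguishes a computation that broke down entirely from one that merely missed the epsilon tolerance.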
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Fri Jun 14, 2013 10:56 am

Hi Rick,

This is going to be a tough one. I've tried my best but I can't get it to fail. I went back to NPB 3.2 and used PGI pgf77 v13.6 on a Piledriver system, but it works fine. The fact that it fails with no optimization leads me to believe that something other than a compiler issue is going on. What that is, I'm not sure.

Have you made any changes to the source? Can you try running the same experiment on a different system?

- Mat

Code:
piledriver:/tmp/qa/NPB3.2/NPB3.2-OMP% make CLASS=A LU
   ============================================
   =      NAS PARALLEL BENCHMARKS 3.2         =
   =      OpenMP Versions                     =
   =      F77/C                               =
   ============================================

cd LU; make CLASS=A
make[1]: Entering directory `/opt/tmp/qa/NPB3.2/NPB3.2-OMP/LU'
make[2]: Entering directory `/opt/tmp/qa/NPB3.2/NPB3.2-OMP/sys'
cc  -o setparams setparams.c
make[2]: Leaving directory `/opt/tmp/qa/NPB3.2/NPB3.2-OMP/sys'
../sys/setparams lu A
pgf77 -c  -O0 lu.f
pgf77 -c  -O0 read_input.f
pgf77 -c  -O0 domain.f
pgf77 -c  -O0 setcoeff.f
pgf77 -c  -O0 setbv.f
pgf77 -c  -O0 exact.f
pgf77 -c  -O0 setiv.f
pgf77 -c  -O0 erhs.f
pgf77 -c  -O0 ssor.f
pgf77 -c  -O0 rhs.f
pgf77 -c  -O0 l2norm.f
pgf77 -c  -O0 jacld.f
pgf77 -c  -O0 blts.f
pgf77 -c  -O0 jacu.f
pgf77 -c  -O0 buts.f
pgf77 -c  -O0 error.f
pgf77 -c  -O0 pintgr.f
pgf77 -c  -O0 verify.f
cd ../common; pgf77 -c  -O0 print_results.f
cd ../common; pgf77 -c  -O0 timers.f
cd ../common; pgcc  -c  -O  -o wtime.o ../common/wtime.c
pgf77 -O -o ../bin/lu.A lu.o read_input.o domain.o setcoeff.o setbv.o exact.o setiv.o erhs.o ssor.o rhs.o l2norm.o jacld.o blts.o jacu.o buts.o error.o pintgr.o verify.o ../common/print_results.o ../common/timers.o ../common/wtime.o
make[1]: Leaving directory `/opt/tmp/qa/NPB3.2/NPB3.2-OMP/LU'
piledriver:/tmp/qa/NPB3.2/NPB3.2-OMP% bin/lu.A


 NAS Parallel Benchmarks (NPB3.2-OMP) - LU Benchmark

 Size:  64x 64x 64
 Iterations:                    250

 Time step    1
 Time step   20
 Time step   40
 Time step   60
 Time step   80
 Time step  100
 Time step  120
 Time step  140
 Time step  160
 Time step  180
 Time step  200
 Time step  220
 Time step  240
 Time step  250

 Verification being performed for class A
 Accuracy setting for epsilon =  0.1000000000000E-07
 Comparison of RMS-norms of residual
           1   0.7790210760669E+03 0.7790210760669E+03 0.5837420383828E-15
           2   0.6340276525969E+02 0.6340276525969E+02 0.2801702468535E-14
           3   0.1949924972729E+03 0.1949924972729E+03 0.1166063713339E-14
           4   0.1784530116042E+03 0.1784530116042E+03 0.1274137507679E-14
           5   0.1838476034946E+04 0.1838476034946E+04 0.4947003303197E-15
 Comparison of RMS-norms of solution error
           1   0.2996408568547E+02 0.2996408568547E+02 0.0000000000000E+00
           2   0.2819457636500E+01 0.2819457636500E+01 0.1575087364679E-15
           3   0.7347341269877E+01 0.7347341269877E+01 0.3626529871458E-15
           4   0.6713922568778E+01 0.6713922568778E+01 0.1322890472152E-15
           5   0.7071531568839E+02 0.7071531568839E+02 0.2009586548100E-15
 Comparison of surface integral
               0.2603092560489E+02 0.2603092560489E+02 0.1364804975715E-15
 Verification Successful


 LU Benchmark Completed.
 Class           =                        A
 Size            =             64x  64x  64
 Iterations      =                      250
 Time in seconds =                   136.03
 Total threads   =                        1
 Avail threads   =                        1
 Mop/s total     =                   876.96
 Mop/s/thread    =                   876.96
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                      3.2
 Compile date    =              14 Jun 2013

 Compile options:
    F77          = pgf77
    FLINK        = $(F77)
    F_LIB        = (none)
    F_INC        = (none)
    FFLAGS       = -O0
    FLINKFLAGS   = -O
    RAND         = (none)


 Please send all errors/feedbacks to:

 NPB Development Team
 npb@nas.nasa.gov
QuesarVII



Joined: 13 Sep 2004
Posts: 6

Posted: Fri Jun 14, 2013 2:25 pm    Post subject: another system has same problem

Hi Mat,

I just finished setting up another Opteron system with Debian 6 (squeeze) and PGI 13.6. It has the same NaN results on the NAS LU OMP test.

Would you like me to provide remote login access to this system for you to check?

Thanks,
Rick
All times are GMT - 7 Hours