PGI Execution Questions


What can I do about precision problems?

The x86 floating point processor performs all of its computations in extended (80-bit) precision. This may cause problems when porting code to the x86 which successfully executes on other (non-x86) systems. The increased precision of the x86 may result in 'different' answers. Also, the increased precision of the x86 may result in infinite loops if equality tests of floating point data are used to control while loops. Examples of problem cases:


  1.    a = <expression>   ! 'copy propagate' a's right-hand side to its use
       b = a + c          ! 'propagate' b
       if (b .eq. y ) ... ! 'exact equality' check
    

  2.    while ( C.EQ.ONE )
           LT = LT + 1
           A = A*LBETA
           C = DLAMC3( A, ONE )
           C = DLAMC3( C, -A )
       END WHILE
    

To reduce the precision, the compilers options ‑pc 64 (round floating point operations to double precision) or ‑pc 32 (round floating point operations to single precision) may be used.

The ‑Kieee switch may be used to disable propagating floating point values and to round the argument values passed to intrinsics (sin, cos, etc.).


How come we get different answers on one platform versus a Linux x86 platform?

The x86 architecture implements a floating-point stack by using 8 80-bit registers. Each register uses bits 0-63 as the significant, bits 64-78 for the exponent, and bit 79 is the sign bit. This extended 80-bit real format used by floating instructions is the default. When values are loaded into the floating point stack they are automatically converted into the extended real format. The precision of the floating point stack can be controlled, however, by setting the precision control bits (bits 8 and 9) of the floating control word appropriately. In this way, the programmer can explicitly set the precision to standard IEEE double or single precision (the Intel documentation, however, claims that this only affects the operations of add, subtract, multiply, divide, and square root.)

We have also noticed that, although extended precision is supposedly the default which is set for the control word, it is set at double precision in the x86 Linux systems. Thus, we now also have a ‑pc <val> option which can be used on the command line. The values of <val> are:

       32 => single precision
       64 => double precision
       80 => extended precision

At first glance, an extra 16 bits of precision appears to only be a positive asset. However, operations that are performed exclusively on the floating point stack, without storing into (or loading from) memory, can cause problems with accumulated values within those 16 bits. This can lead to answers, when rounded, that do not match expected results.

We briefly look at several examples which have been encountered. First, we have recently implemented the evaluation of most transcendental functions inline, such as sin, cos, tan, and log, since there are x86 instructions for their direct computation. However, as an example, if the argument to sin is the result of previous calculations performed on the floating point stack, then an 80-bit value vs. a 64-bit value can result in slight discrepancies in the answer. With our sin example, we have seen results even change sign due to the sin curve being so close to an x-intercept value when evaluated. Consistency in this case can be maintained by calling a function which, due to the ABI, must push its arguments on the stack (in this way memory is guaranteed to be accessed, even if the argument is an actual constant.) Thus, even if the called function simply performs the inline expansion, using the function call as a wrapper to sin has the effect of trimming the argument precision down to the expected size. Using the ‑Mnobuiltin option on the command line for C accomplishes this task by resolving all math routines in the library libm, thus performing a function call of necessity. The other method of generating a function call for math routines, but one which may still produce the inline instructions, is by using the ‑Kieee switch, described below.

A second example which illustrates the precision control problem can be seen by examining this code fragment adapted from the benchmark "paranoia", used to validate IEEE compliance. This section of code is used to determine machine precision:

        program find_precision  
        w = 1.0
100     w=w+w
        y=w+1
        z=y-w
        if (z .gt. 0) goto 100
C       ... now w is just big enough that |((w+1)-w)-1| >= 1 ...
        print*,w
        end

In this case, where the variables are implicitly real*4, operations are performed on the floating point stack where optimization removed unneeded loads and stores from memory. The general case of copy propagation being performed follows this pattern:

         a = x
         y = 2.0 + a

Instead of storing x into a, then loading a to perform the addition, the value of x can be left on the floating point stack and have 2.0 added to it. Thus, memory accesses in some cases can be avoided, leaving answers in the extended real format. If copy propagation is disabled, stores of all left-hand sides will automatically be performed, and reloaded when needed. This will have the effect of rounding any results to their declared sizes.

For the above program, w has a value of 1.8446744E+19 when executed as is (extended precision.) However, if ‑Kieee is set, the value becomes 1.6777216E+07 (single precision.) This difference is due to the fact that ‑Kieee disables copy propagation, so all intermediate results are stored into memory, then reloaded when needed. (Actually, copy propagation is only disabled for floating point operations, not integer,when the ‑Kieee switch is set.) Of course, with this particular example, setting the ‑pc switch will also adjust the result.

The switch ‑Kieee also has the effect of making function calls to all transcendental routines. Although the routine still produces the machine instruction for computation (unless in C the ‑Mnobuiltin switch is set), arguments are passed on the stack, which results in a memory store and load.

The final effect of the ‑Kieee which we discuss is to disable reciprocal division for constant divisors. That is, for a/b with unknown a and constant b, the expression is converted at compile time to a* 1/b, thus turning an expensive divide into a relatively cheap multiplication. However, small discrepancies can again occur, resulting in differences from expected answers.

Thus, understanding and correctly using the ‑pc, ‑Mnobuiltin, and ‑Kieee switches should enable the user to produce the desired and expected precision for calculations which utilize floating point operations.

Note: Current x86/x86-64 CPUs have SSE1,SSE2, SSE3, etc instruction sets which perform 32-bit and 64-bit floating point operations in a vectorized manner. This greatly reduces any precision discrepancies between x86 and other CPU types.


Why when I execute do I get the error message 'libpgc.so: cannot open shared object file'?

The current Linux releases feature a shared libpgc.so along with libpgthread.so. Before, the libraries libpgc.a and libpgthread.a were used, and they are still available.

The default link will use a shared libpgc and libpghtread, so that users can build one executable for several versions of Linux. Building your application with, for example, pgcc on a RHEL 5.2 system, it can also run on a SuSE 11-3 system. Using gcc, however, the executable created should run on any Linux system with libc installed.

We have done the same thing with libpgc.so and libpgthread.so. If you wish to execute code built on your (for example) Ubuntu 10.04 system, on another RHEL 6.1 system, simply copy libpgc.so and libpgthread.so from $PGI/linux86-64/2012/REDIST to the target system, and add the directory you placed it in to the LD_LIBRARY_PATH path environment variable.

As an example, if we build 'hello world' on a RHEL 6.0 system

% more hello.c
main(){printf("hello\n");}
% pgcc -o hi hello.c
% hi
hello

To run this program on platform B, which is Ubuntu 11.04

% rcp hi  B:/your_B_dir/hi   ! copy executable to B
% rcp $PGI/linux86-64/2012/REDIST/*.so B:/tmp/.
% rsh B 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/tmp
(note if "rsh 'echo $LD_LIBRARY_PATH'" indicates it is not defined, use
% rsh B 'export LD_LIBRARY_PATH=/tmp' )    ! update dynamic lib path on B
% rsh B '/your_B_dir/hi' ! run executable on B
hello

If libpgc.so does not exist in $PGI/linux86-64/2012/libso, the linkage will be performed on the $LD_LIBRARY_PATH contents, if the shared library is present.


Do you have any relative performance numbers?

While performance is a very important reason for using the PGI compilers, typically we do not publish any relative performance numbers. Performance depends upon too many factors to make a credible claim that we are N% faster than a competitor. The only true measure is how your application performs on your system. Please download an evaluation copy of the PGI compilers and try it out.

Two organizations that do publish performance results are Standard Performance Evaluation Corporation (SPEC) and Polyhedron. Polyhedron also allows you to download the benchmark source code so you can do your own performance comparison. Again, when looking at these results remember that the only true benchmark is your application.


Do you have an example of using -byteswapio?

Here is an example of using the -byteswapio switch.

Example

% more rtest.f
      program test
      real*4 ssmi
      OPEN(UNIT=10,FILE='ice.89',FORM='UNFORMATTED')
      read(10) ssmi
      print *,'OK: ',ssmi
      end

% more wtest.f
      program test
      real*4 ssmi
      ssmi = -999
      OPEN(UNIT=10,FILE='ice.89',FORM='UNFORMATTED')
      write(10) ssmi
      print *,'OK: ',ssmi
      end

On your Sun workstation (or other big-endian device)

f77 -o w_sparc wtest.f
f77 -o r_sparc rtest.f

On your PGI workstation.

pgfortran -o w86 wtest.f
pgfortran -o w86_swap -byteswapio wtest.f
pgfortran -o r86 rtest.f
pgfortran -o r86_swap -byteswapio rtest.f

------------------------------------------
If you write the file  |  Then read the file
  ice.89 with          |  ice.89 with

  w_sparc or w86_swap  |  r_sparc or r86_swap

    w86                |  r86


Does the License Manager allow me to execute on other platforms?

Executables created by PGI compilers are not licensed. There are no license requirements for executables created on any platform. If you compiled with a trial license, the executable will stop functioning after the trial period expires. To prevent this, you will need to recompile codes after permanent license keys are installed.


Why does read(9,rec=recnr,end=100,err=101,iostat=ios) act differently on other compilers?

For statements like

    read(9,rec=recnr,end=100,err=101,iostat=ios) buf

some machines will jump to 100 and return IOSTAT of -1 upon getting to the end of the file, while other machines go to 101 or err exits.

The f77 & f90 standards distinguish between 'error conditions' and 'end-of-file'. The 'correct' (standard conforming, portable) way of writing the read statement to capture 'errors' and 'end-of=file' like the following:

    read(card(iarg:),*,err=701,end=701) iskip, irec1

It is true that the SGI treats an end-of-file condition as an error condition if the ERR= specifier is present and the END= specifier is not present. However, this behavior is inconsistent across systems (for example, HP & g77 both abort execution and report an end-of-file).

Another test case shows another inconsistency in various implementations. Consider this test:

        open(unit=10,file='foo',form='unformatted')
        read(10, err=99, iostat=ios) yy
        print *, 'fail1', ios
        stop
99      continue
        print *, 'fail2', ios
        end

According to the standards, if the 'err' branch is taken, the iostat variable will be defined with a positive value. Given that the SGI takes the ERR= branch in the original example, this test should take the ERR= branch as well. But on the SGI, this test executes as:

 fail1          -1

[NOTE that -1 => end-of-file]

The DEC alpha is another system where the ERR= branch is taken for the original example. But, the test above executes as:

 fail2          -1

But in this case, ‑iostat shouldn't be negative since the ERR= branch was taken.

The point of all this is that there are inconsitencies in the way ERR= is handled given an 'end-of-file' condition. Adding the END= specifier to your example guarantees consistent behavior across 'all' systems.


(32-bit Linux compilers) Why can't my compiled code handle even half of the 2GB of memory in my system?

Users now are capable of buying machines with > 4GB of memory in them, so they expect to be able to declare very large arrays. Most understand that the accessible limit ought to really be 2GB for a 32-bit addressable system, when you assume that signed ints may be involved in libraries that work with addresses.

Here are some things we have learned, from users who were more familiar with Linux.

  1. The Linux kernel places shared libraries at 0x40000000 by default,so on x86 you have only about 1GB total for your program code and other elements you provide. It has nothing to do with gcc.

    Possible solutions: (a) link statically, such that there aren't any shared libraries. Or (b) use malloc() to allocate the arrays. That should give you about 3GB total (but note that malloc() can't allocate a single chunk larger than 2GB).

         -Wl,-Bstatic
    

    will force a static link.

  2. If you wish to modify the kernel, in the kernel source, in file mmap.c, there is a line that reads:

         addr = TASK_UNMAPPED_BASE;
    

    This is what sets the default address of the shared libs in the memory mapping, and it's at 0x40000000 (1G) by default. So change it to, for example:

         addr = 0x80000000;
    

    And you should, in theory, have up to ~2GB to use for the codes.

  3. For more info on this, check out the comp.os.linux.development.system newsgroup.

Bottom line is

  1. it is an OS problem, not a compiler problem.
  2. Sometimes, you may have to do a lot of work on Linux to use all of your memory.

When executing, why do I get a stack overflow on Windows?

The error exhibits itself as either stack overflow or sometimes the program just hangs.

To enlarge the stack space, edit the driver file C:\Program Files\PGI\win32\[RELEASE#]/bin/win32rc, and change the line

  LDARGS=""

to something like

  LDARGS="-stack 10000000,50000"

which will enlarge the stack area (maximum size, commit size) of the executable. Relink your application and execute.


When I compile with -Ktrap=inexact, the program gets many exceptions

The PGI compilers do not support exception free execution for ‑Ktrap=inexact. The purpose of the hardware support is for those who have specific uses for its execution, along with the appropriate signal handlers for handling exceptions it produces. It is not meant for normal floating point operation code support.


Why do I get 'illegal instruction' exceptions when I run the program on another platform?

This usually indicates that you either have a CPU that does not support the assembly instruction the compiler generated, probably because the code was generated for a newer CPU than is currently executing. You can force the compiler to choose to generate code for an older CPU, or you can generate a PGI Unified Binary executable that combines a new CPU type with an older CPU type for everywhere else.

The instruction

  pgpfortran -V

will output the type of CPU that the PGI compilers determine your machine has.


Why to I get different answers with -r8 set when I declared all my variables as REAL*8?

The Fortran standard does not define the default size of constants. Our compiler treats constants as REAL*4 unless you compile with ‑r8. So a program like

       real*8  x,y,z
       x=50453.61
       y=29581.28
       z=x*y
       write(*,10)x,y,z, 50453.61*29581.28
 10    format(4f20.5)
       end

will produce different answers with ‑r8 set. Code that will treat constants as REAL*8 everywhere should be written as

       real*8  x,y,z
       x=50453.61D0
       y=29581.28D0
       z=x*y
       write(*,10)x,y,z, 50453.61D0*29581.28D0
 10    format(4f20.5)
       end


How do I detect infs and NaNs in pgfortran?

REAL*4 and REAL*8 numbers have a specific format for their exponent and mantissa. Exponents larger than the format limits are printed as inf. 32-bit and 64-bit numbers with mantissas falling outside the boundaries of the ieee formats are printed as NaN. Overflows or division by zero can result in inf values, while illegal operations such as sqrt(-1.) and log(-10.0) cause NaN.

The x86-64 architecture has status bits set by floating point operations. pgfortran traps some of those status bits ensuring that they are not set as a side effect of compiler generated code. Specifically, the status bits ovf for overflow, divz for divide by zero and inv for invalid. Other status bits, including inexact, align, unf, and denorm are not guaranteed to not be set as a side effect of normal compiler operation.

Some examples follow. To create infs and NaNs at execution time, you may need to code in such a way as to prevent the compiler from catching your improper operation.

program infnan
  real(8) v,w,x,y,z
  v = -1.0d0
  w = 123456789.0d0
  x = v * 10.d0 * log10(v)
  y= exp(w)
  z=1.0d0 /(v + 1.0d0)
  print *,'x (inv ?) =',x
  print *,'y (ovf ?) =',y
  print *,'x (divz?) =',z
end program infnan

% pgfortran -o infnan infnan.f90
% ./infnan
 x (inv ?) =                       NaN
 y (ovf? ) =                       Inf
 x (divz?) =                       Inf

The F2003 standard states the status bits set will be printed after a STOP statement.

program infnan2
  real(8) v,w,x,y,z
  v = -1.0d0
  w = 123456789.0d0
  x = v * 10.d0 * log10(v)
  y= exp(w)
  z=1.0d0 /(v + 1.0d0)
  print *,'x (inv ?) =',x
  print *,'y (ovf? ) =',y
  print *,'x (divz?) =',z
  stop
end program infnan2

% pgfortran -o infnan2 infnan2.f90
% ./infnan2
 x (inv ?) =                       NaN
 y (ovf? ) =                       Inf
 x (divz?) =                       Inf
Warning: ieee_invalid is signaling
Warning: ieee_divide_by_zero is signaling
Warning: ieee_inexact is signaling
FORTRAN STOP

If you wish the program to terminate when a operation causes a status bit change, compile with the following switches: ‑Ktrap=ovf,unv,divz or its shorthand ‑Ktrap=fp. You can also set the environment variable PGI_TERM to determine what happens when the status bit are set. You can even invoke the PGDBG debugger upon the event.

% pgfortran -o infnan infnan.f90 -Ktrap=divz -g
% ./infnan
Floating exception

% setenv PGI_TERM 'signal'
% ./infnan
Error: floating point exception, integer divide by zero

% setenv PGI_TERM 'signal,debug'
% ./infnan
Error: floating point exception, integer divide by zero
PGDBG 14.1-0 x86-64 (Cluster, 256 Process)
The Portland Group - PGI Compilers and Tools
Copyright (c) 2014, NVIDIA CORPORATION.  All rights reserved.
Loading symbols from /my/dir/infnan ...
Loaded: /my/dir/infnan
Stopped at 0x7F229025219A, function __waitpid
0x7F229025219A:  48 3D 0 F0 FF FF       cmpq   $0xFFFFFFFFFFFFF000,%rax

pgdbg> file /my/dir/infnan.f90
"infnan.f90"
pgdbg> list
 #1:     program infnan
 #2:       real(8) v,w,x,y,z
 #3:       v = -1.0d0
 #4:       w = 123456789.0d0
 #5:       x = v * 10.d0 * log10(v)
 #6:       y= exp(w)
 #7:       z=1.0d0 /(v + 1.0d0)
 #8:       print *,'x (inv ?) =',x
 #9:       print *,'y (ovf? ) =',y
 #10:      print *,'x (divz?) =',z

pgdbg>

If you want to determine if any element of a REAL array is a NaN, use the IEEE_ARITHMETIC routine ieee_is_nan(x) which takes real arguments, and because it is elemental, we can feed the entire array to it.

program infnan3
  use,intrinsic::IEEE_ARITHMETIC
  real(8) x(1000),y
  integer i
  do i=1,1000
     x(i)=i
  end do

  y = -1.0d0
  x(500) = y * 10.d0 * log10(y)

  if(any(ieee_is_nan(x))) then 
     print *, "we found a NaN!"
  else 
    print *, "we found NO NaNs!"
  end if
end program infnan3


% pgf90 -o infnan3 infnan3.f90
% ./infnan3
 we found a NaN!

With this technique, we can find whether an array has one or more NaNs, but not the index of the failing element. To do that, we need to use a loop.

program infnan4
  use,intrinsic::IEEE_ARITHMETIC
  real(8) x(1000),y
  integer i
  do i=1,1000
     x(i)=i
  end do

  y = -1.0d0
  x(500) = y * 10.d0 * log10(y)
  x(600) = y /(y + 1.0d0)
  do i=1,1000
   if(ieee_is_nan(x(i))) then
      print *,"X(",i,") is a NaN!"
   endif
   if(.not.(ieee_is_finite(x(i)))) then
      print *,"X(",i,") is an inf!"
   endif
   end do
end program infnan4

% pgfortran -o infnan4 infnan4.f90
% ./infnan4
 X(          500 ) is a NaN!
 X(          500 ) is an inf!
 X(          600 ) is an inf!


Why does date_and_time(DATE) return 20 for the century?

Century is just defined to be a group of 100 years. For 2012, date_and_time(DATE) sets the century and year values in DATE to 20 and 12, respectively. So, for example, 2012 = 20*100 + 12

Here is a Fortran program that uses date_and_time().

program testread
  implicit none
  integer::cen,year,mon,day
  character(len=8) :: date
  call date_and_time(date)
  read(date,'(4i2)')cen,year,mon,day
  print *,cen,year,mon,day
end program

Click me