PGI Execution Questions

Precision problems:

The x86 floating-point unit performs all of its computations in extended (80-bit) precision. This may cause problems when porting code to the x86 that executes successfully on other (non-x86) systems. The increased precision of the x86 may produce 'different' answers, and it may result in infinite loops if equality tests on floating-point data are used to control while loops. Examples of problem cases:


1.

   a =    ! 'copy propagate' a's right-hand side to its use

   b = a + c          ! 'propagate' b

   if (b .eq. y ) ... ! 'exact equality' check



2.

   DO WHILE ( C.EQ.ONE )   ! 'exact equality' controls the loop

       LT = LT + 1

       A = A*LBETA

       C = DLAMC3( A, ONE )

       C = DLAMC3( C, -A )

   END DO

To reduce the precision, the option '-pc 64' (round floating point operations to double precision) or '-pc 32' (round floating point operations to single precision) may be used.

The -Kieee switch may be used to disable propagating floating point values and to round the argument values passed to intrinsics (sin, cos, etc.).
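
For case 1 above, an alternative to reducing the precision is to avoid the exact equality test altogether. The following C sketch compares two values against a relative tolerance. The function name nearly_equal and the choice of tolerance are illustrative assumptions, not something provided by the PGI compilers, and a tolerance test is not appropriate for code such as case 2, which probes machine arithmetic on purpose.

```c
/* Sketch: tolerance-based comparison in place of an exact equality test.
   nearly_equal is a hypothetical helper, not part of any PGI library. */
int nearly_equal(double a, double b, double rel_tol)
{
    double diff  = a > b ? a - b : b - a;   /* |a - b| */
    double mag_a = a < 0 ? -a : a;          /* |a|     */
    double mag_b = b < 0 ? -b : b;          /* |b|     */
    double scale = mag_a > mag_b ? mag_a : mag_b;

    /* Accept the values as equal when they differ by no more than
       rel_tol times the larger magnitude. */
    return diff <= rel_tol * scale;
}
```

A test such as if (b .eq. y) would then become a call like nearly_equal(b, y, 1e-12), with the tolerance chosen to suit the application.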


Why do we get different answers on a Linux x86 platform than on another platform?

The x86 architecture implements a floating-point stack using eight 80-bit registers. Each register uses bits 0-63 as the significand, bits 64-78 for the exponent, and bit 79 as the sign bit. This extended 80-bit real format, used by the floating-point instructions, is the default. When values are loaded onto the floating-point stack, they are automatically converted into the extended real format. The precision of the floating-point stack can be controlled, however, by setting the precision control bits (bits 8 and 9) of the floating-point control word appropriately. In this way, the programmer can explicitly set the precision to standard IEEE double or single precision. (The Intel documentation, however, states that this only affects the add, subtract, multiply, divide, and square root operations.)

We have also noticed that, although extended precision is supposedly the default setting of the control word, it is set to double precision on x86 Linux systems. We therefore also provide a -pc <val> option, which can be used on the command line. The values of <val> are:




       32 => single precision

       64 => double precision

       80 => extended precision
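
As a rough illustration of how the precision control bits (bits 8 and 9 of the control word) relate to the -pc values, the C sketch below decodes a control word value. The bit encodings (00 single, 10 double, 11 extended, 01 reserved) are taken from the Intel architecture documentation rather than from this FAQ, and the function name pc_from_control_word is hypothetical.

```c
/* Sketch: decode the precision-control field (bits 8 and 9) of an x87
   floating-point control word into the equivalent -pc value.
   Returns 32, 64, or 80, or 0 for the reserved encoding. */
int pc_from_control_word(unsigned int cw)
{
    switch ((cw >> 8) & 0x3u) {
    case 0x0: return 32;   /* 00: single precision   */
    case 0x2: return 64;   /* 10: double precision   */
    case 0x3: return 80;   /* 11: extended precision */
    default:  return 0;    /* 01: reserved           */
    }
}
```

For example, a control word of 0x037F decodes to 80 and a control word of 0x027F decodes to 64.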



At first glance, an extra 16 bits of precision appears to only be a positive asset. However, operations that are performed exclusively on the floating point stack, without storing into (or loading from) memory, can cause problems with accumulated values within those 16 bits. This can lead to answers, when rounded, that do not match expected results.

We briefly look at several examples which have been encountered. First, we have recently implemented the evaluation of most transcendental functions inline, such as sin, cos, tan, and log, since there are x86 instructions for their direct computation. However, as an example, if the argument to sin is the result of previous calculations performed on the floating point stack, then an 80-bit value vs. a 64-bit value can result in slight discrepancies in the answer. With our sin example, we have seen results even change sign due to the sin curve being so close to an x-intercept value when evaluated. Consistency in this case can be maintained by calling a function which, due to the ABI, must push its arguments on the stack (in this way memory is guaranteed to be accessed, even if the argument is an actual constant.) Thus, even if the called function simply performs the inline expansion, using the function call as a wrapper to sin has the effect of trimming the argument precision down to the expected size. Using the -Mnobuiltin option on the command line for C accomplishes this task by resolving all math routines in the library libm, thus performing a function call of necessity. The other method of generating a function call for math routines, but one which may still produce the inline instructions, is by using the -Kieee switch, described below.

A second example which illustrates the precision control problem can be seen by examining this code fragment adapted from the benchmark "paranoia", used to validate IEEE compliance. This section of code is used to determine machine precision:




        program find_precision  

 

 

        w = 1.0

100     w=w+w

        y=w+1

        z=y-w

        if (z .gt. 0) goto 100

C       ... now w is just big enough that |((w+1)-w)-1| >= 1 ...

        print*,w

        end



In this case, where the variables are implicitly real*4, operations are performed on the floating point stack where optimization removed unneeded loads and stores from memory. The general case of copy propagation being performed follows this pattern:




         a = x

         y = 2.0 + a



Instead of storing x into a, then loading a to perform the addition, the value of x can be left on the floating point stack and have 2.0 added to it. Thus, memory accesses in some cases can be avoided, leaving answers in the extended real format. If copy propagation is disabled, stores of all left-hand sides will automatically be performed, and reloaded when needed. This will have the effect of rounding any results to their declared sizes.

For the above program, w has a value of 1.8446744E+19 when executed as is (extended precision). If, however, -Kieee is set, the value becomes 1.6777216E+07 (single precision). This difference arises because -Kieee disables copy propagation, so all intermediate results are stored into memory and then reloaded when needed. (Actually, when the -Kieee switch is set, copy propagation is only disabled for floating-point operations, not for integer operations.) Of course, with this particular example, setting the -pc switch will also adjust the result.
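
The effect of storing every intermediate result to memory can be imitated in C. The sketch below is an assumed C rendering of the find_precision loop, not code from the PGI distribution. Declaring the variables volatile float forces a store and a reload on every access, so each result is rounded to single precision no matter how wide the floating-point registers are.

```c
/* Sketch: the find_precision loop with every intermediate forced
   through a 32-bit memory location. */
float find_precision_single(void)
{
    /* volatile prevents the compiler from keeping w, y, or z in a
       register, so each assignment rounds the value to float. */
    volatile float w = 1.0f, y, z;

    do {
        w = w + w;      /* double w                               */
        y = w + 1.0f;   /* rounds back to w once w is big enough  */
        z = y - w;      /* becomes 0 when the +1 was rounded away */
    } while (z > 0.0f);

    return w;
}
```

With IEEE single precision and round-to-nearest, this returns 16777216.0 (2**24), which matches the 1.6777216E+07 reported for the -Kieee build.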

The switch -Kieee also has the effect of making function calls to all transcendental routines. Although the routine still produces the machine instruction for computation (unless in C the -Mnobuiltin switch is set), arguments are passed on the stack, which results in a memory store and load.

The final effect of -Kieee that we discuss is to disable reciprocal division for constant divisors. Normally, for a/b with unknown a and constant b, the expression is converted at compile time to a * (1/b), turning an expensive divide into a relatively cheap multiplication. However, small discrepancies can again occur, resulting in differences from expected answers; with -Kieee set, the division is performed as written.
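
To see why the reciprocal transformation can change an answer, the two forms can be written out by hand. The C sketch below is only an illustration with hypothetical function names; it does not show what the compiler actually emits. The reciprocal is rounded once when it is computed and the product is rounded again, so the result may differ from the true quotient in the last bit for some operands (49.0 is a commonly cited divisor for which this happens in double precision).

```c
/* Sketch: a true division versus multiplication by a precomputed
   reciprocal. */
double divide(double a, double b)
{
    return a / b;                /* one rounding */
}

double multiply_by_reciprocal(double a, double b)
{
    /* volatile forces the reciprocal to be rounded to double and
       stored before it is used. */
    volatile double r = 1.0 / b; /* first rounding  */
    return a * r;                /* second rounding */
}
```

Comparing divide(x, 49.0) with multiply_by_reciprocal(x, 49.0) over a range of x values is a simple way to look for such last-bit differences on a given system.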

Thus, understanding and correctly using the -pc, -Mnobuiltin, and -Kieee switches should enable the user to produce the desired and expected precision for calculations which utilize floating point operations.


When I execute, I get 'libpgc.so: cannot open shared object file'?

The 4.0-x linux releases feature a shared libpgc.so along with libpgthread.so. Before, the libraries libpgc.a and libpgthread.a were used, and they are still available.

The default link uses the shared libpgc and libpgthread, so that users can build one executable for several versions of Linux. Normally, if a user builds an application using, for example, pgcc on a Red Hat 6.2 system, it will only run on a Red Hat 6.2 system. An executable created with gcc, however, should also run on a Red Hat 7.1 system, because libc.so is shared and the version running on Red Hat 7.1 will be the libc.so that works on that release.

We have done the same thing with libpgc.so and libpgthread.so. If you wish to execute code built on your (for example) Red Hat 7.2 system on another Red Hat 7.2 system, simply copy libpgc.so and libpgthread.so from $PGI/linux86/lib or $PGI/linux86/liblf to the target system, and add the directory you placed them in to the LD_LIBRARY_PATH environment variable.

Here is a list of the versions of libpgc.so that come with the install package (the tarfile you unpack), and the versions of Linux that apply (remember, this is for Releases 4.0-2 or later).


                                                                                                      

                               Version of libpgc.so, libpgthread.so

 Linux release                 standard             large file support
 --------------------------    ----------------     ------------------
 Red Hat 6.0 (SuSE 6.1)        lib-linux86-g211     N/A
 Red Hat 6.1 (SuSE 6.2)        lib-linux86-g212     N/A
 Red Hat 6.2 (SuSE 6.3,6.4)    lib-linux86-g212     N/A
 Red Hat 7.0 (SuSE 7.1)        lib-linux86-g22      lib-linux86-g22-lf
 Red Hat 7.1 (SuSE 7.2)        lib-linux86-g22      lib-linux86-g22-lf
 Red Hat 7.2 (SuSE 7.3)        lib-linux86-g224     lib-linux86-g224-lf
 Red Hat 7.3 (SuSE 8.0)        lib-linux86-g225     lib-linux86-g225-lf

                                                                                                       

As an example, suppose I build hi on a Red Hat 6.2 system:


% more hello.c

#include <stdio.h>

int main(void){printf("hello\n"); return 0;}

% pgcc -o hi hello.c

% hi

hello



To run this program on platform B, which runs Red Hat 7.2:




% rcp hi  B:/your_B_dir/hi   ! copy executable to B

% rcp your_install_package/linux86/lib-linux86-g224/libpgc.so  B:/tmp/.

% rsh B 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/tmp'

(note: if "rsh B 'echo $LD_LIBRARY_PATH'" indicates it is not defined, use

% rsh B 'export LD_LIBRARY_PATH=/tmp' )    ! update dynamic lib path on B

% rsh B '/your_B_dir/hi' ! run executable on B

hello



If libpgc.so does not exist in $PGI/linux86/lib or $PGI/linux86/liblf, the linkage will be performed on libpgc.a, and no copying of libpgc.so to target platforms is needed.

Please see the portability package description FAQ. Users without our compilers installed can use this package to set up a Linux platform for execution of programs built with our compilers.


Do you have any relative performance numbers?

While performance is a very important reason for using the PGI compilers, typically we do not publish any relative performance numbers. Performance depends upon too many factors to make a credible claim that we are N% faster than a competitor. The only true measure is how your application performs on your system. Please download an evaluation copy of the PGI compilers and try it out.

Two organizations that do publish performance results are Standard Performance Evaluation Corporation (SPEC) and Polyhedron. Polyhedron also allows you to download the benchmark source code so you can do your own performance comparison. Again, when looking at these results remember that the only true benchmark is your application.


Do you have an example of using -byteswapio?

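Here is an example of using the -byteswapio switch, which swaps the byte order of unformatted Fortran I/O data so that files can be exchanged with big-endian systems.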

 

% more rtest.f

      program test

      real*4 ssmi

      OPEN(UNIT=10,FILE='ice.89',FORM='UNFORMATTED')

      read(10) ssmi

      print *,'OK: ',ssmi

      end

 

% more wtest.f

      program test

      real*4 ssmi

      ssmi = -999

      OPEN(UNIT=10,FILE='ice.89',FORM='UNFORMATTED')

      write(10) ssmi

      print *,'OK: ',ssmi

      end

On your Sun workstation (or other big-endian device):

f77 -o w_sparc wtest.f

f77 -o r_sparc rtest.f

On your PGI workstation (using either pgf77 or pgf90):

pgf77/pgf90 -o w86 wtest.f

pgf77/pgf90 -o w86_swap -byteswapio wtest.f

pgf77/pgf90 -o r86 rtest.f

pgf77/pgf90 -o r86_swap -byteswapio rtest.f

 

If you write the file  |  Then read the file
  ice.89 with          |  ice.89 with
------------------------------------------
  w_sparc or w86_swap  |  r_sparc or r86_swap
  w86                  |  r86
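
Conceptually, the difference between the two rows of the table is the byte order of each stored value. The C sketch below shows the byte reversal for a single 4-byte datum such as the real*4 variable ssmi. It is only an illustration under that assumption: the function names are hypothetical, and how the PGI runtime implements -byteswapio internally is not described here.

```c
#include <stdint.h>
#include <string.h>

/* Sketch: reverse the byte order of one 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}

/* Sketch: swap a 4-byte item in place inside a raw I/O buffer. */
void swap_item4(unsigned char *item)
{
    uint32_t bits;
    memcpy(&bits, item, sizeof bits);   /* read the raw bytes    */
    bits = swap32(bits);                /* reverse their order   */
    memcpy(item, &bits, sizeof bits);   /* write them back       */
}
```

Applying the swap twice restores the original bytes, which is consistent with the table: a file written with byte swapping must be read with byte swapping (or on a big-endian system), and vice versa.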


Does the License Manager allow me to execute on other platforms?

There are no license restrictions on running executables that were created on other nodes. We do limit the number of nodes that can collectively run an HPF program. If your codes were compiled with temporary license keys, you will want to recompile them after the permanent keys are installed.


read(9,rec=recnr,end=100,err=101,iostat=ios) behaves differently with other compilers

Statements like


    read(9,rec=recnr,end=100,err=101,iostat=ios) buf

behave differently from machine to machine. On some machines, the statement jumps to 100 and returns an IOSTAT of -1 upon reaching the end of the file, while on other machines it goes to 101 or exits with an error.

The f77 and f90 standards distinguish between 'error conditions' and 'end-of-file'. The 'correct' (standard-conforming, portable) way of writing the read statement to capture both 'errors' and 'end-of-file' is like the following:


    read(card(iarg:),*,err=701,end=701) iskip, irec1

It is true that the SGI treats an end-of-file condition as an error condition if the ERR= specifier is present and the END= specifier is not present. However, this behavior is inconsistent across systems (for example, HP & g77 both abort execution and report an end-of-file).

Another test case shows another inconsistency in various implementations. Consider this test:


        open(unit=10,file='foo',form='unformatted')

        read(10, err=99, iostat=ios) yy

        print *, 'fail1', ios

        stop

99      continue

        print *, 'fail2', ios

        end

According to the standards, if the 'err' branch is taken, the iostat variable will be defined with a positive value. Given that the SGI takes the ERR= branch in the original example, this test should take the ERR= branch as well. But on the SGI, this test executes as:


 fail1          -1

[NOTE that -1 => end-of-file]

The DEC Alpha is another system where the ERR= branch is taken for the original example. But the test above executes as:


 fail2          -1

But in this case, iostat shouldn't be negative since the ERR= branch was taken.

The point of all this is that there are inconsistencies in the way ERR= is handled given an 'end-of-file' condition. Adding the END= specifier to your example guarantees consistent behavior across 'all' systems.


I have over 2GB of memory, but your compiled code can't handle even half of it?

Users now are capable of buying machines with 4GB of memory in them, so they expect to be able to declare very large arrays. Most understand that the accessible limit ought to really be 2GB for a 32-bit addressable system, when you assume that signed ints may be involved in libraries that work with addresses.

Here are some things we have learned, from users who were more familiar with linux.

  1. The Linux kernel places shared libraries at 0x40000000 by default, so on x86 you have only about 1GB _total_ for your program code and other elements you provide. It has nothing to do with gcc.

    Possible solutions: (a) link statically, such that there aren't any shared libraries. Or (b) use malloc() to allocate the arrays. That should give you about 3GB total (but note that malloc() can't allocate a single chunk larger than 2GB).

    
         -Wl,-Bstatic
    
    
    will force a static link.

  2. If you wish to modify the kernel, in the kernel source, in file mmap.c, there is a line that reads:
    
         addr = TASK_UNMAPPED_BASE;
    
    
    This is what sets the default address of the shared libs in the memory mapping, and it's at 0x40000000 (1G) by default. So change it to, for example:
    
         addr = 0x80000000;
    
    
    And you should, in theory, have up to ~2GB to use for the codes.

  3. For more info on this, check out the comp.os.linux.development.system newsgroup, back around June 7th or so.

The bottom line is:

  1. It is an OS problem.
  2. It is not a compiler problem.
  3. Sometimes you may have to do a lot of work on Linux to use all of your memory.
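
Solution (b) above, allocating large arrays with malloc() instead of declaring them statically, might look like the following C sketch. The function name alloc_array is hypothetical, and the sketch only shows the allocation pattern; whether a given request succeeds still depends on the kernel memory layout described above.

```c
#include <stdlib.h>

/* Sketch: allocate an array of n doubles on the heap.
   Returns NULL if the request overflows size_t or cannot be satisfied. */
double *alloc_array(size_t n)
{
    /* Guard against n * sizeof(double) wrapping around. */
    if (n > (size_t)-1 / sizeof(double))
        return NULL;

    return malloc(n * sizeof(double));
}
```

The caller should always check the result for NULL before using the array, and release it with free() when done.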

PGHPF - when executing with -heapz set, I get a '0: mmap: Not enough space' error

The error looks like the following. I compile and execute as follows:


 % pghpf -Mextend -Msmp -Mstats   abc.hpf -o abc.x

 

 % abc.x -pghpf -heapz 1100m -np 32

 0: mmap: Not enough space

By default, pghpf uses /var/tmp for memory-mapped files (which are used by the -heapz option). This error normally comes up when there is not enough space on /var/tmp. To choose a different directory (one that has sufficient space), use the TMPDIR environment variable to specify the pathname that -heapz should use for allocating its mmap'd file.


NT - When executing, I get a stack overflow

The error exhibits itself as either stack overflow or sometimes the program just hangs.

To enlarge the stack space, edit the driver file $PGI/nt86/[RELEASE#]/bin/nt86rc, and change the line


  LDARGS=""

to something like

  LDARGS="-stack 10000000,50000"

which will enlarge the stack area (maximum size, commit size) of the executable. Relink your application and execute.


When I compile with -Ktrap=inexact, the program gets many exceptions

The PGI compilers do not support exception-free execution with -Ktrap=inexact. An inexact exception is raised whenever a floating-point result has to be rounded, which is the case for most ordinary floating-point operations. The hardware support is intended for those who have specific uses for it, along with the appropriate signal handlers for the exceptions it produces. It is not meant for normal floating-point code.


When I compile with -Mvect=sse, I get 'illegal instruction' exceptions when I run the program

This usually indicates either that your CPU does not have the Pentium III SSE or prefetch instructions, or that your OS (for example, a Linux kernel without the SSE patch) does not support the new instructions.


x.f

      program vector_op

      parameter (n = 99999)

      real x(n),y(n),z(n),w(n)

      real con

      con = 1.0

      do i = 1,n

         y(i) = i

         z(i) = 2*i

         w(i) = 4*i

      enddo

!      do j = 1, 10000

         call loop(x,y,z,w,con,n)

!      enddo

!      print*,x(1),x(771),x(3618),x(23498),x(99999)

      end

      subroutine loop(a,b,c,d,s,n)

      integer i,n

      real a(n),b(n),c(n),d(n),s

      do i = 1,n

         a(i) =  b(i) + c(i) - s * d(i)

      enddo

      end  

Compile the above example in the following four ways:


pgf77 -o x1 x.f -Mvect=sse -fast

pgf77 -o x2 x.f -Mvect=sse -fast -r8

pgf77 -o x3 x.f -Mvect=prefetch -fast

pgf77 -o x4 x.f -Mvect=prefetch -fast -r8

If you have not installed the SSE patch, we believe x1, x2, and x4 will generate an illegal instruction. x3 should run with no problem.

See http://sources.redhat.com/gdb/papers/linux/linux-sse.html for more info. It explains a number of things, but it may be dated. You should contact your Linux provider about Pentium III SSE support in the kernel. When Linux kernel 2.4 is released, SSE support will be included automatically.

Linux 2.4 kernel supports > 2GB files - does pgf77/f90 support them?

Yes. In the 3.3 versions of the software, large file support is available for Fortran. We packaged it so that you can link with either the normal libs or the LF libs.

When linking, add -L$PGI/linux86/liblf for release 3.3 or "-Mlfs" for release 4.0 and higher.

The linker will search the liblf directory before the general lib directory.


I declared all my variables as 'REAL*8', yet I get different answers with '-r8' set - why?

The Fortran standard does not define the default size of constants. Our compiler treats real constants as REAL*4 unless you compile with -r8. So a program like


       real*8  x,y,z

       x=50453.61

       y=29581.28

       z=x*y

       write(*,10)x,y,z, 50453.61*29581.28

 10    format(4f20.5)

       end

will produce different answers with -r8 set. Code that will treat constants as REAL*8 everywhere should be written as


       real*8  x,y,z

       x=50453.61D0

       y=29581.28D0

       z=x*y

       write(*,10)x,y,z, 50453.61D0*29581.28D0

 10    format(4f20.5)

       end
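
The same effect can be reproduced in C, where a constant with an f suffix is single precision (playing the role of a REAL*4 constant) and an unsuffixed constant is double precision (playing the role of a D0 constant). The sketch below is only an analogy with hypothetical function names; it is not how the Fortran compiler is implemented.

```c
/* Sketch: the same product formed from single-precision and from
   double-precision constants. */
double product_single_constants(void)
{
    /* Both constants are rounded to float before the multiply,
       like 50453.61*29581.28 compiled without -r8. */
    return 50453.61f * 29581.28f;
}

double product_double_constants(void)
{
    /* Both constants carry full double precision,
       like 50453.61D0*29581.28D0. */
    return 50453.61 * 29581.28;
}
```

The two functions return different values because the single-precision constants have already lost digits before the multiplication takes place, even though the result is stored in a double.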


Why does date_and_time(DATE) return 20 for the century?

Century is just defined to be a group of 100 years. For 2007, date_and_time(DATE) sets the century and year values in DATE to 20 & 7, respectively.

So, for example, 2007 = 20*100 + 7

Here is a fortran program that uses date_and_time().


program testread
  implicit none
  integer::cen,year,mon,day
  character(len=8) :: date
  call date_and_time(date)
  read(date,'(4i2)')cen,year,mon,day
  print *,cen,year,mon,day
end program
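
For comparison, the same split of a CCYYMMDD string into century, year, month, and day can be sketched in C. The function name split_date and the sample date string used below are assumptions made for illustration only.

```c
#include <stdio.h>

/* Sketch: split a CCYYMMDD string into four two-digit fields,
   mirroring the Fortran format '(4i2)'.
   Returns 1 if all four fields were read, 0 otherwise. */
int split_date(const char *date, int *cen, int *year, int *mon, int *day)
{
    /* %2d reads at most two characters per field. */
    return sscanf(date, "%2d%2d%2d%2d", cen, year, mon, day) == 4;
}
```

For a date string such as "20070315", this yields a century of 20 and a year of 7, so the full year is recovered as 20*100 + 7 = 2007.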