The x86 floating-point processor performs all of its computations in extended (80-bit) precision. This may cause problems when porting code to the x86 that executes successfully on other (non-x86) systems. The increased precision of the x86 may produce 'different' answers. It may also cause infinite loops if equality tests on floating-point data are used to control while loops. Examples of problem cases:
1.    a =                  ! 'copy propagate' a's right-hand side to its use
      b = a + c            ! 'propagate' b
      if (b .eq. y ) ...   ! 'exact equality' check
2.    while ( C.EQ.ONE )
         LT = LT + 1
         A = A*LBETA
         C = DLAMC3( A, ONE )
         C = DLAMC3( C, -A )
      END WHILE
To reduce the precision, the option '-pc 64' (round floating point operations to double precision) or '-pc 32' (round floating point operations to single precision) may be used.
The -Kieee switch may be used to disable propagating floating point values and to round the argument values passed to intrinsics (sin, cos, etc.).
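Where an exact equality test is not essential to the algorithm, one common way to sidestep the problem is to compare within a tolerance. The following is only a minimal C sketch of that idea; the helper names nearly_equal and magnitude and the choice of a relative tolerance are illustrative assumptions, not part of any PGI library.

```c
#include <stdbool.h>

/* Hypothetical helper: absolute value without pulling in libm. */
static double magnitude(double v) { return v < 0.0 ? -v : v; }

/* Hypothetical helper: true when a and b agree to within rel_tol,
   scaled by the larger of the two magnitudes. */
bool nearly_equal(double a, double b, double rel_tol)
{
    double diff  = magnitude(a - b);
    double scale = magnitude(a) > magnitude(b) ? magnitude(a) : magnitude(b);
    return diff <= rel_tol * scale;
}
```

A loop condition such as "b equals y" would then be written as nearly_equal(b, y, 1e-12), with the tolerance chosen to suit the data. Code such as example 2, which deliberately probes machine precision, needs exact comparisons and is better handled with the compiler switches described here.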
The x86 architecture implements a floating-point stack using eight 80-bit registers. Each register uses bits 0-63 as the significand, bits 64-78 for the exponent, and bit 79 as the sign bit. This extended 80-bit real format used by floating-point instructions is the default. When values are loaded into the floating-point stack they are automatically converted into the extended real format. The precision of the floating-point stack can be controlled, however, by setting the precision control bits (bits 8 and 9) of the floating-point control word appropriately. In this way, the programmer can explicitly set the precision to standard IEEE double or single precision. (The Intel documentation, however, claims that this only affects the operations of add, subtract, multiply, divide, and square root.)
We have also noticed that, although extended precision is supposedly the default which is set for the control word, it is set at double precision in the x86 linux systems. Thus, we now also have a -pc <val> option which can be used on the command line. The values of <val> are:
32 => single precision
64 => double precision
80 => extended precision
At first glance, an extra 16 bits of precision appears to only be a positive asset. However, operations that are performed exclusively on the floating point stack, without storing into (or loading from) memory, can cause problems with accumulated values within those 16 bits. This can lead to answers, when rounded, that do not match expected results.
We briefly look at several examples which have been encountered. First, we have recently implemented the evaluation of most transcendental functions inline, such as sin, cos, tan, and log, since there are x86 instructions for their direct computation. However, as an example, if the argument to sin is the result of previous calculations performed on the floating point stack, then an 80-bit value vs. a 64-bit value can result in slight discrepancies in the answer. With our sin example, we have seen results even change sign due to the sin curve being so close to an x-intercept value when evaluated. Consistency in this case can be maintained by calling a function which, due to the ABI, must push its arguments on the stack (in this way memory is guaranteed to be accessed, even if the argument is an actual constant.) Thus, even if the called function simply performs the inline expansion, using the function call as a wrapper to sin has the effect of trimming the argument precision down to the expected size. Using the -Mnobuiltin option on the command line for C accomplishes this task by resolving all math routines in the library libm, thus performing a function call of necessity. The other method of generating a function call for math routines, but one which may still produce the inline instructions, is by using the -Kieee switch, described below.
A second example which illustrates the precision control problem can be seen by examining this code fragment adapted from the benchmark "paranoia", used to validate IEEE compliance. This section of code is used to determine machine precision:
      program find_precision
      w = 1.0
100   w=w+w
      y=w+1
      z=y-w
      if (z .gt. 0) goto 100
C ... now w is just big enough that |((w+1)-w)-1| >= 1 ...
      print*,w
      end
In this case, where the variables are implicitly real*4, operations are performed on the floating point stack where optimization removed unneeded loads and stores from memory. The general case of copy propagation being performed follows this pattern:
a = x
y = 2.0 + a
Instead of storing x into a, then loading a to perform the addition, the value of x can be left on the floating point stack and have 2.0 added to it. Thus, memory accesses in some cases can be avoided, leaving answers in the extended real format. If copy propagation is disabled, stores of all left-hand sides will automatically be performed, and reloaded when needed. This will have the effect of rounding any results to their declared sizes.
For the above program, w has a value of 1.8446744E+19 when executed as is (extended precision). If, however, -Kieee is set, the value becomes 1.6777216E+07 (single precision). This difference is due to the fact that -Kieee disables copy propagation, so all intermediate results are stored into memory, then reloaded when needed. (Actually, copy propagation is only disabled for floating point operations, not integer, when the -Kieee switch is set.) Of course, with this particular example, setting the -pc switch will also adjust the result.
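The "store, then reload" behavior can be imitated by hand. The following C sketch of the same loop is an illustration only: the function name find_precision_stored is hypothetical, and volatile is used here merely to force every intermediate result through a 32-bit memory location. It is not how the compiler implements -Kieee.

```c
/* Paranoia-style precision probe with every intermediate result
   stored to (and reloaded from) a single-precision variable. */
float find_precision_stored(void)
{
    volatile float w = 1.0f;   /* volatile: each assignment is a real store */
    volatile float y, z;

    do {
        w = w + w;             /* double w */
        y = w + 1.0f;          /* rounded to single precision when stored */
        z = y - w;             /* becomes 0 once w + 1 rounds back to w */
    } while (z > 0.0f);

    return w;                  /* 16777216.0 (2**24) with 32-bit IEEE floats */
}
```

Because each sum is rounded to 24 significant bits when it is stored, the loop stops at 2**24, which is the 1.6777216E+07 quoted above for the -Kieee case.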
The switch -Kieee also has the effect of making function calls to all transcendental routines. Although the routine still produces the machine instruction for computation (unless in C the -Mnobuiltin switch is set), arguments are passed on the stack, which results in a memory store and load.
The final effect of the -Kieee switch which we discuss is to disable reciprocal division for constant divisors. Without -Kieee, an expression a/b with unknown a and constant b is converted at compile time to a * (1/b), thus turning an expensive divide into a relatively cheap multiplication. However, small discrepancies can again occur, resulting in differences from expected answers.
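The size of such a discrepancy is easy to see with a constant divisor of 10. The C sketch below is an illustration under stated assumptions: the function names are hypothetical, IEEE double precision is assumed, and volatile is used only to make sure each result is rounded to double before it is compared.

```c
/* a / 10.0 computed with a true divide. */
double divide_by_ten(double a)
{
    volatile double q = a / 10.0;
    return q;
}

/* a * (1.0 / 10.0): the reciprocal is rounded to double first,
   then the product is rounded again. */
double multiply_by_tenth(double a)
{
    volatile double r = 1.0 / 10.0;
    volatile double p = a * r;
    return p;
}
```

For a = 3.0 the two functions return values that differ in the last bit (0.3 versus 0.30000000000000004 when printed to 17 significant digits), while for a = 2.0 they agree. A program that compares such results for exact equality can therefore change behavior when reciprocal division is enabled.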
Thus, understanding and correctly using the -pc, -Mnobuiltin, and -Kieee switches should enable the user to produce the desired and expected precision for calculations which utilize floating point operations.
The 4.0-x linux releases feature a shared libpgc.so along with libpgthread.so. Before, the libraries libpgc.a and libpgthread.a were used, and they are still available.
The default link will use a shared libpgc and libpgthread, so that users can build one executable for several versions of linux. If a user builds an application using, for example, pgcc, on a Red Hat 6.2 system, normally this will only run on a Red Hat 6.2 system. Using gcc, however, the executable created should run on a Red Hat 7.1 system as well, because the libc.so is shared and the version running on Red Hat 7.1 will be the libc.so that works on that release.
We have done the same thing with libpgc.so and libpgthread.so. If you wish to execute code built on your (for example) Red Hat 7.2 system, on another Red Hat 7.2 system, simply copy libpgc.so and libpgthread.so from $PGI/linux86/lib or $PGI/linux86/liblf to the target system, and add the directory you placed them in to the LD_LIBRARY_PATH environment variable.
Here is a list of the versions of libpgc.so that come with the install package (the tarfile you unpack), and the versions of Linux that apply (remember, this is for Releases 4.0-2 or later).
                               Version of libpgc.so, libpgthread.so
                               standard            large file support
                               ----------------    ------------------
Red Hat 6.0 (SuSE 6.1)         lib-linux86-g211    N/A
Red Hat 6.1 (SuSE 6.2)         lib-linux86-g212    N/A
Red Hat 6.2 (SuSE 6.3,6.4)     lib-linux86-g212    N/A
Red Hat 7.0 (SuSE 7.1)         lib-linux86-g22     lib-linux86-g22-lf
Red Hat 7.1 (SuSE 7.2)         lib-linux86-g22     lib-linux86-g22-lf
Red Hat 7.2 (SuSE 7.3)         lib-linux86-g224    lib-linux86-g224-lf
Red Hat 7.3 (SuSE 8.0)         lib-linux86-g225    lib-linux86-g225-lf
As an example, if I build hi on a Red Hat 6.2 system:
% more hello.c
#include <stdio.h>
int main(){printf("hello\n"); return 0;}
% pgcc -o hi hello.c
% hi
hello
To run this program on platform B, which is Red Hat 7.2:
% rcp hi B:/your_B_dir/hi                        ! copy executable to B
% rcp your_install_package/linux86/lib-linux86-g224/libpgc.so B:/tmp/.
% rsh B 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/tmp'
  (note: if "rsh B 'echo $LD_LIBRARY_PATH'" indicates it is not defined, use
% rsh B 'export LD_LIBRARY_PATH=/tmp' )          ! update dynamic lib path on B
% rsh B '/your_B_dir/hi'                         ! run executable on B
hello
If libpgc.so does not exist in $PGI/linux86/lib or $PGI/linux86/liblf, the linkage will be performed on libpgc.a, and no copying of libpgc.so to target platforms is needed.
Please see the portability package description FAQ. Users without our compilers installed can use this to set up a linux platform for execution of programs built with our compilers.
While performance is a very important reason for using the PGI compilers, typically we do not publish any relative performance numbers. Performance depends upon too many factors to make a credible claim that we are N% faster than a competitor. The only true measure is how your application performs on your system. Please download an evaluation copy of the PGI compilers and try it out.
Two organizations that do publish performance results are Standard Performance Evaluation Corporation (SPEC) and Polyhedron. Polyhedron also allows you to download the benchmark source code so you can do your own performance comparison. Again, when looking at these results remember that the only true benchmark is your application.
Example
% more rtest.f
      program test
      real*4 ssmi
      OPEN(UNIT=10,FILE='ice.89',FORM='UNFORMATTED')
      read(10) ssmi
      print *,'OK: ',ssmi
      end
% more wtest.f
      program test
      real*4 ssmi
      ssmi = -999
      OPEN(UNIT=10,FILE='ice.89',FORM='UNFORMATTED')
      write(10) ssmi
      print *,'OK: ',ssmi
      end
On your Sun workstation (or other big-endian device):
f77 -o w_sparc wtest.f
f77 -o r_sparc rtest.f
On your PGI workstation:
pgf77/pgf90 -o w86 wtest.f
pgf77/pgf90 -o w86_swap -byteswapio wtest.f
pgf77/pgf90 -o r86 rtest.f
pgf77/pgf90 -o r86_swap -byteswapio rtest.f
If you write the file     | Then read the file
ice.89 with               | ice.89 with
--------------------------------------------------
w_sparc or w86_swap       | r_sparc or r86_swap
w86                       | r86
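The pairing in the table follows from byte order: a value written big-endian has its bytes in the reverse order from the same value written little-endian. The C sketch below shows that reversal for one 4-byte quantity, such as the bit pattern of a real*4. It is a conceptual illustration only: swap_bytes32 is a hypothetical name, not the PGI runtime's implementation, and real unformatted files typically also carry record-length markers that must be handled the same way.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit quantity. */
uint32_t swap_bytes32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) <<  8) |
           ((v & 0x00FF0000u) >>  8) |
           ((v & 0xFF000000u) >> 24);
}
```

For example, swap_bytes32(0x11223344) yields 0x44332211, and applying the swap twice returns the original value, which is why a file written with byte swapping must also be read with it.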
We have no license problems with executables created on other nodes. We limit the number of nodes that can run an HPF program collectively. You will want to recompile codes after permanent license keys are installed, if they were compiled with temporary keys.
On some machines, statements like
      read(9,rec=recnr,end=100,err=101,iostat=ios) buf
jump to 100 and return an IOSTAT of -1 upon reaching the end of the file, while on other machines they go to 101 or exit with an error.
The f77 & f90 standards distinguish between 'error conditions' and 'end-of-file'. The 'correct' (standard conforming, portable) way of writing the read statement to capture both 'errors' and 'end-of-file' is like the following:
read(card(iarg:),*,err=701,end=701) iskip, irec1
It is true that the SGI treats an end-of-file condition as an error condition if the ERR= specifier is present and the END= specifier is not present. However, this behavior is inconsistent across systems (for example, HP & g77 both abort execution and report an end-of-file).
Another test case shows another inconsistency in various implementations. Consider this test:
      open(unit=10,file='foo',form='unformatted')
      read(10, err=99, iostat=ios) yy
      print *, 'fail1', ios
      stop
 99   continue
      print *, 'fail2', ios
      end
According to the standards, if the 'err' branch is taken, the iostat variable will be defined with a positive value. Given that the SGI takes the ERR= branch in the original example, this test should take the ERR= branch as well. But on the SGI, this test executes as:
fail1 -1     [NOTE that -1 => end-of-file]
The decalpha is another system where the ERR= branch is taken for the original example. But, the test above executes as:
fail2 -1
But in this case, iostat shouldn't be negative since the ERR= branch was taken.
The point of all this is that there are inconsistencies in the way ERR= is handled given an 'end-of-file' condition. Adding the END= specifier to your example guarantees consistent behavior across 'all' systems.
Users now are capable of buying machines with 4GB of memory in them, so they expect to be able to declare very large arrays. Most understand that the accessible limit ought to really be 2GB for a 32-bit addressable system, when you assume that signed ints may be involved in libraries that work with addresses.
Here are some things we have learned, from users who were more familiar with linux.
Possible solutions: (a) link statically, such that there aren't any shared libraries. Or (b) use malloc() to allocate the arrays. That should give you about 3GB total (but note that malloc() can't allocate a single chunk larger than 2GB).
-Wl,-Bstatic
will force a static link.
addr = TASK_UNMAPPED_BASE;
This is what sets the default address of the shared libs in the memory mapping, and it's at 0x40000000 (1G) by default. So change it to, for example:
addr = 0x80000000;
And you should, in theory, have up to ~2GB to use for the codes.
Bottom line is
The error looks like the following. I compile and execute as follows:
% pghpf -Mextend -Msmp -Mstats abc.hpf -o abc.x
% abc.x -pghpf -heapz 1100m -np 32
0: mmap: Not enough space
By default pghpf uses /var/tmp for memory mapped files (which are used by the -heapz option). Normally this error comes up when there is not enough space on /var/tmp. To choose a different directory (one that has sufficient space), set the TMPDIR environment variable to the pathname that -heapz should use for allocating its mmap'd file.
The error exhibits itself as either stack overflow or sometimes the program just hangs.
To enlarge the stack space, edit the driver file $PGI/nt86/[RELEASE#]/bin/nt86rc, and change the line
LDARGS=""
to something like
LDARGS="-stack 10000000,50000"
which will enlarge the stack area (maximum size, commit size) of the executable. Relink your application and execute.
The PGI compilers do not support exception free execution for -Ktrap=inexact. The purpose of the hardware support is for those who have specific uses for its execution, along with the appropriate signal handlers for handling exceptions it produces. It is not meant for normal floating point operation code support.
This usually indicates that you either have a cpu that does not have the Pentium III SSE or prefetch instructions, or you have an OS like linux which does not have the new instructions added to the kernel.
x.f
      program vector_op
      parameter (n = 99999)
      real x(n),y(n),z(n),w(n)
      real con
      con = 1.0
      do i = 1,n
         y(i) = i
         z(i) = 2*i
         w(i) = 4*i
      enddo
!     do j = 1, 10000
      call loop(x,y,z,w,con,n)
!     enddo
!     print*,x(1),x(771),x(3618),x(23498),x(99999)
      end
      subroutine loop(a,b,c,d,s,n)
      integer i,n
      real a(n),b(n),c(n),d(n),s
      do i = 1,n
         a(i) = b(i) + c(i) - s * d(i)
      enddo
      end
Use the above example:
pgf77 -o x1 x.f -Mvect=sse -fast
pgf77 -o x2 x.f -Mvect=sse -fast -r8
pgf77 -o x3 x.f -Mvect=prefetch -fast
pgf77 -o x4 x.f -Mvect=prefetch -fast -r8
If you have not installed the SSE patch, we believe x1, x2, and x4 will generate an illegal instruction. x3 should run with no problem.
See http://sources.redhat.com/gdb/papers/linux/linux-sse.html for more info. It explains a number of things, but it may be dated. You should contact your linux provider about Pentium III SSE support in the kernel. When linux kernel 2.4 is released, SSE will automatically be included.
When linking, add -L$PGI/linux86/liblf for release 3.3 or "-Mlfs" for release 4.0 and higher.
The linker will search the liblf directory before the general lib directory.
The fortran standard does not define the default size of constants. Our compiler treats constants as REAL*4 unless you compile with -r8. So a program like
      real*8 x,y,z
      x=50453.61
      y=29581.28
      z=x*y
      write(*,10)x,y,z, 50453.61*29581.28
 10   format(4f20.5)
      end
will produce different answers with -r8 set. Code that will treat constants as REAL*8 everywhere should be written as
      real*8 x,y,z
      x=50453.61D0
      y=29581.28D0
      z=x*y
      write(*,10)x,y,z, 50453.61D0*29581.28D0
 10   format(4f20.5)
      end
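The same effect can be seen in C, where a literal with an f suffix is single precision and an unsuffixed literal is double precision, roughly analogous to 50453.61 versus 50453.61D0 above. The sketch below is an illustration with hypothetical function names, assuming IEEE single and double formats.

```c
/* The decimal literal 50453.61 rounded to single precision,
   then widened to double (like assigning a REAL*4 constant to a REAL*8). */
double single_literal_widened(void)
{
    return (double)50453.61f;
}

/* The same decimal literal rounded directly to double precision. */
double double_literal(void)
{
    return 50453.61;
}
```

The single precision literal can only hold 50453.609375, the nearest value with 24 significant bits, so widening it to double does not recover the digits that were lost. That is the source of the different answers when constants are left as REAL*4.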
Century is just defined to be a group of 100 years. For 2007, date_and_time(DATE) sets the century and year values in DATE to 20 & 7, respectively.
So, for example, 2007 = 20*100 + 7
Here is a Fortran program that uses date_and_time().
program testread
  implicit none
  integer :: cen, year, mon, day
  character(len=8) :: date
  call date_and_time(date)
  read(date,'(4i2)') cen, year, mon, day
  print *, cen, year, mon, day
end program
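The '(4i2)' edit descriptor simply reads the 8-character CCYYMMDD string as four 2-digit integers. A C sketch of the same split, using the standard sscanf function and a hypothetical helper name split_date, might look like this:

```c
#include <stdio.h>

/* Split an 8-character "CCYYMMDD" string into four 2-digit integers.
   Returns 1 on success, 0 if fewer than four fields could be read. */
int split_date(const char *date, int *cen, int *year, int *mon, int *day)
{
    return sscanf(date, "%2d%2d%2d%2d", cen, year, mon, day) == 4;
}
```

For the string "20070315" this gives cen = 20, year = 7, mon = 3 and day = 15, and the full year is recovered as cen*100 + year = 2007.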