Keywords: Fortran, Parallel, OpenMP, Linux, Solaris86, Windows NT
System Requirements: Pentium Pro/II/III workstation or server with 2 or
more processors and a minimum of 64 MB of memory running Linux 2.0.3 or
higher, Solaris86 2.5 or higher, or Windows NT 4.0 or higher. This
tutorial assumes the PGI Workstation compilers and tools are already
installed on the system. See the
PGI Workstation
Installation & Release Notes
if you need to install the PGI Workstation software prior to beginning
the tutorial. The tutorial was developed on a Linux system with dual
150 MHZ Pentium Pro processors. The execution of the entire example
program on 1 processor of this system, prior to optimization and
parallelization, is just under 1 minute.
This tutorial is based around a serial version of the NAS FT benchmark,
one of the 8 NAS Parallel Benchmarks defined by the NAS Division at NASA
Ames Research Center. By following the 6 steps outlined below, you will
progressively parallelize the program
fftpde.F using the OpenMP parallel
capabilities of the PGI Fortran compiler(s), inserting directives where
necessary. Upon completing this tutorial, you will have a basic working
knowledge of how to use a PGI Fortran compiler (PGF77 or PGF90), how to use
the PGPROF graphical profiler for performance tuning, how to parallelize
a Fortran program, and how to execute a program in parallel on a
multi-processor (SMP) workstation or server.
Basic knowledge of Fortran 77 programming is assumed. The example
command-lines below assume usage of the UNIX "vi" editor to modify the
file "fftpde.F", but any available file editor may be used. If you are
not familiar with "vi" or "emacs", you may want to use "pico" on Linux or
"edit" on Windows NT. They are less powerful but more intuitive. For
an experienced Fortran programmer who is familiar with UNIX command-level
compilers and tools, the tutorial takes 2-4 hours to complete, depending
on how much you try to do on your own without looking at the completed
examples.
You should find the following files in this directory (hereafter referred
to as the "working directory"):
If any of these files is missing or others are present, you have somehow
received a modified or previously used version of the tutorial. You should
retrieve an
unmodified copy
to ensure successful completion of the tutorial.
In the steps that follow, "% " is used to represent the UNIX shell command
prompt (or PGI Workstation for Windows NT command prompt). This tutorial
assumes you are working in a csh command shell.
Finally, re-compile "fftpde.F" adding the "-Minfo" option to the compile
line:
In addition to function-level profiling, PGPROF supports line-level
profiling. Re-compile, re-execute, and profile with the following commands:
Here's where you need to start thinking. When PGPROF comes up, note that
"Time" for the "fft" and "cfft3" program units is higher than for "fftpde".
However, the "Cost" of "fftpde" is higher than that of any other program
unit. "Time" represents the execution time spent within a particular
program unit. "Cost" respresent the execution time spent within a
particular program unit *and* all the subroutines or functions it calls.
Your goal in parallelizing a program is to do so at as high a level
as possible. That means you want to work "top down", i.e. start as high
in the call chain as you can. To do that you want to start with program
units that are most expensive in terms of "Cost" rather than "Time".
Let's look at "fftpde". Single-left-click in the PGPROF window on the
line showing "Time" and "Cost" for "fftpde". That line should now be
highlighted (or "selected"). Now double-left-click the "fftpde" line. A
window containing the source code for "fftpde" should appear on your screen.
You may want to widen the "fftpde" source window. You can do so by placing
the cursor over the right border of the PGPROF window -- so that the
cursor converts to a horizontal double-arrow line -- then left-clicking
and dragging the window to the desired width.
Now scroll down through the source. Note that the call to CFFT3 at line 151
consumes most of the time. To see what percentage of execution time is spent
in CFFT3 (in addition to seconds), select "View -> Cost -> Percent of Cost".
You should see that 75% or more of the execution time is spent within the
call to CFFT3. The loop at line 135, which contains the call to CFFT3,
is not parallelizable (why?). Let's look insided CFFT3 and see if we can
parallelize it. Exit the line-level window for "fftpde" by selecting
"File -> Close".
View the line profile for CFFT3 (see above on viewing the line info for
"fftpde" if you don't remember the sequence of commands). Again, view the
"Cost" of each line as a percentage. Scroll down and note that the
triply-nested loops at lines 399, 415, and 431 consume most of the execution
time. Each of these loops is parallelizable! Why? Each is computing
sequences of 1D FFTs which are completely independent. The loop at line
399 performs FFTs on the columns (i.e. in the 1st dimension), the loop at
line 415 performs FFTs on the rows (second dimension), and and the loop at
line 431 performs FFTs on the rays (3rd dimension) of the 3D complex
array X1, represented by "X1real" and "X2real". Quit PGPROF by left-clicking
on "File -> Quit".
Before we start editing fftpde.F, we should make a point here. PGPROF does
not determine which loops are parallelizable, only which loops consume the
most time. You must determine by inspection or prior knowledge of the
algorithm which loops are parallelizable. That is a topic which we do not
cover in this tutorial. Quit PGPROF by selecting the appropriate command
with your mouse.
Once you have parallelized the loop at line 399, do the same for the loops
around line 415 and around line 430 (the line numbers are changing as you
insert new source lines). Then re-compile the program with the "-mp"
switch, which enables OpenMP directive processing:
Then re-execute the program as follows:
Here you may see differing timings based on which operating system you are
working under. Most operating systems (including Linux and Solaris) return
aggregate CPU time from the "etime" timer call. That means you must divide
the time reported by the number of processors to get approximate elapsed
time. Once you've accounted for this factor, you should see an execution
time that is considerably faster than the single processor times (35% - 50%).
Now recompile with -Mprof=func and compare the 1 and 2 processor profiles
using the following commands:
Be careful in each case to declare variables PRIVATE if necessary. A good
rule of thumb is that any variable being used as a work variable, i.e. which
is re-initialized with a new value upon each iteration of the loop, is
probably an array you want to declare private.
When you have completed parallelizing all of fftpde.F, you should see
execution times on 2 processors that are between 60% and 90% faster than
the single-processor version. Speedup is dependent on both the hardware
you are using and the size of the problem you are running. To run the
actual Class A NAS FT benchmark, you need to set M1 = 8, M2 = 8, and
M3 = 7 in the PARAMETER statements near the top of the file fftpde.F On
a system with sufficient memory bandwidth, you should see speedups close
to 2X on the Class A problem. You will need to be running on a system
with 512MB of main memory in order to run the Class A problem.
Parallelizing the NAS FT Benchmark: A Tutorial
Introduction to the Usage of PGI Fortran and OpenMP
STEP 1: Setting up your environment to use the PGI Workstation compilers
Linux or Solaris
If you are working on a Linux or Solaris system, you must know where the
PGI Workstation compilers are installed in order to initialize your
environment properly. Common choices for the installation are /usr/pgi
or /usr/local/pgi. If you don't know where PGI Workstation is installed
on your system, ask your system administrator or the person who installed
PGI Workstation for you. Assuming the insallation is in /usr/local/pgi
execute the following commands to initialize your environment:
% setenv PGI /usr/local/pgi
% set path=($PGI/solaris/bin $path)
% setenv MANPATH "$MANPATH":$PGI/man
In addition, you may want to bring up the online HTML PGI Workstation manuals.
Windows NT
If you are working on a Windows NT system with a properly installed version
of PGI Workstation, double-click the PGI Workstation icon on your desktop
with the left mouse button. A command window should appear on your desktop.
The environment in the command window will be pre-initialized for usage of
the PGI Workstation compilers and tools. Depending on how PGI Workstation
was installed, the command window will either be a UNIX-like command
shell (bash) or a Microsoft command window. If you wish to use a command
window other than the default, select "Start -> Programs -> PGI Workstation"
with the left mouse button to choose an appropriate command window.
STEP 2: Compile and execute the original serial version of fftpde.F
Use the following commands to make a copy of the original serial version
of fftpde.F in the working directory, compile it using the PGF77 Fortran 77
compiler, execute it on your system, and check the results.
% cp STEPS/fftpde.F fftpde.F
% pgf77 fftpde.F -o fftpde
% fftpde
% diff fft.out results
The file "fft.out" is created and written by "fftpde". It contains results
of the error checking and execution time in CPU seconds. You should see no
differences in the numerical results from run to run, but your CPU times
will vary depending on the system you are using, the number of processors
you use, and the version of the program you are compiling. To see the whole
"fft.out" file, type the following
% cat fft.out
Now re-compile and re-execute fftpde.F using basic optimization with the
"-fast" compiler option:
% pgf77 -fast fftpde.F -o fftpde
% fftpde
% diff fft.out results
"fftpde" should execute a little faster this time. The "-fast" option is
equivalent to specifying the options "-O2 -Munroll -Mnoframe" individually.
This set of options will generally provide near-optimal performance on a
single processor system with a minimum of effort. By experimenting with
other available options, further significant performance gains can sometimes
be realized. If you want to see all of the available pgf77 command-line
options, enter the command:
% man pgf77
or browse the PGI Workstation User's Guide in the online documentation. Note
that the "man" command is not available in the PGI Workstation for Windows NT
command window, so you'll have to browse the manual on Windows NT.
% pgf77 -fast -Minfo fftpde.F -o fftpde
You should see output something like the following:
fftpde:
157, Loop unrolled 4 times
randlc:
233, Loop unrolled 4 times
238, Loop unrolled 4 times
vranlc:
308, Loop unrolled 4 times
313, Loop unrolled 4 times
cfft3:
401, Loop unrolled 5 times
405, Loop unrolled 5 times
417, Loop unrolled 5 times
421, Loop unrolled 5 times
433, Loop unrolled 5 times
437, Loop unrolled 5 times
fft:
481, Invariant if transformation
copies loop at: 485
491, Loop unrolled 2 times
Loop unrolled 2 times
523, Loop unrolled 6 times
tabfft:
535, Loop unrolled 5 times
Linking:
The "-Minfo" switch provides what is essentially a compile-time optimization
listing. Information is printed to the screen as compilation proceeds in
order to convey the types of optimizations or transformations performed,
(by program unit and line number).
STEP 3: Profile the serial version of fftpde.F using PGPROF
The first step in the optimization/parallelization process is to determine
where a program spends most of its execution time. The PGPROF graphical
profiler is designed to convey this information in a simple and intuitive
way. Re-compile and re-execute the program using the following commands:
% pgf77 -fast -Minfo -Mprof=func fftpde.F -o fftpde
% fftpde
% diff fft.out results
You should see very little difference in the execution time from your
previous run. The option -Mprof=func causes PGF77 to instrument "fftpde"
for function level profiling. This allows you to see where execution time
is spent on a function-by-function basis. When you execute the program, a
trace file called "pgprof.out" will be written to the working directory
at the completion of execution. to view the profile, enter the command:
% pgprof
and the PGPROF graphical profiler should appear on your screen. Notice
that most of the time is spent in 4 program units: fft, cfft3, fftpde,
and vranlc. However, we need a way to determine exactly which source
lines in those program units are consuming the most time. Quit PGPROF
with a left mouse click on File -> Quit.
% pgf77 -fast -Minfo -Mprof=lines fftpde.F -o fftpde
% fftpde
% diff fft.out results
% pgprof
Note that with line profiling, the execution time of the program increases
significantly. This is sometimes described by saying that line profiling
is "intrusive", meaning it affects execution time noticeably. Even though
line-level profiling is intrusive, it's generally the case that the added
overhead is spread uniformly throughout the program. Line profiling usually
provides valuable information about which source lines are consuming the most
execution time.
STEP 4: Parallelizing CFFT3
Now edit fftpde.F using the editor of your choice. We assume here that
you are using "vi". Enter the editor by issuing the following command:
% vi fftpde.F
Go to line 399. The loop looks as follows:
do k=1,n3
do j=1,n2
do i=1,n1
z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k))
end do
call fft(z,inverse,w,n1,m1)
do i=1,n1
x1real(i,j,k)=real(z(i))
x1imag(i,j,k)=aimag(z(i))
end do
end do
end do
If you are not at all familiar with OpenMP, stop and read the introduction
to Chapter 10 of the PGI Workstation User's Guide. It describes in detail
how to form OpenMP parallelization directives and how each directive
functions. In particular, you want to parallelize the "do k=1,n3" loop. Can
you figure out what directives to enter to make this happen? Note that the
array "z" is a work array that will need to be used by each processor
separately. How do you specify that "z" is local to each processor? Try
to determine what directives to insert. If you need a hint, take a look at
the file STEPS/fftpde_1.F, which contains a version of fftpde.F after the
necessary modifications.
% pgf77 -fast -Minfo -mp fftpde.F -o fftpde
NOTE: In releases 1.7-6 and prior, there is no -Minfo message to indicate
that OpenMP loops are parallelized. This feature was added in the 3.0
release. However, the loops will be parallelized as expected if the OpenMP
directives are properly formed.
% fftpde
% diff fft.out results
You should see timings very similar to your original results above. Now
to run in parallel execute the following commands if you are in a "csh"
shell:
% setenv OMP_NUM_THREADS 2
% fftpde
Or the following if you are in a bourne shell, korn shell, or PGI Workstation
for Windows NT command shell:
% OMP_NUM_THREADS=2
% export OMP_NUM_THREADS
% fftpde
IMPORTANT NOTE:
If your program core dumps at this stage, it is probably because your
default stack size is not large enough. Try typing the command:
% limit
From within a C shell to see what the stack size limit is. On Solaris,
issue the command
% limit stacksize unlimited
and it should take care of the problem. On Linux, you may have to
issue the following commands in order to work around the problem:
% su root
% csh
% unlimit stacksize
% su
Until versions of Linux are available which allow non-privileged users
to set the stacksize, you will have to use this workaround in order
to run parallel programs with large amounts of data.
On Windows NT, your stacksize should grow automatically as needed.
% pgf77 -fast -Minfo -mp -Mprof=func fftpde.F -o fftpde
% fftpde
% mv pgprof.out pgprof.out.2
% setenv OMP_NUM_THREADS 1
% fftpde
% mv pgprof.out pgprof.out.1
% pgprof pgprof.out.1 &
% pgprof pgprof.out.2 &
Or the following if you are in a bourne shell, korn shell, or PGI Workstation
for Windows NT command shell:
% pgf77 -fast -Minfo -mp -Mprof=func fftpde.F -o fftpde
% fftpde
% mv pgprof.out pgprof.out.2
% OMP_NUM_THREADS=1
% export OMP_NUM_THREADS
% fftpde
% mv pgprof.out pgprof.out.1
% pgprof pgprof.out.1 &
% pgprof pgprof.out.2 &
You should see near-linear speedup on the time spent in "fft". The time
spend in "cfft3" may or may not show linear speed-up. Most of the time in
"cfft3" itself is spent moving data back and forth between the X1 arrays
and the work vector. Systems that are bandwidth-limited won't be able to
sustain both processors at full speed during this operation.
STEP 5: Parallelizing the remainder of fftpde.F
Using steps similar to those outlined in STEP 4, progressively parallelize
the following loops:
STEP 6: Optional: Factoring Out Parallel Regions
OpenMP allows you to call subroutines from within parallel regions and
place "orphan" OpenMP directives within those subroutines. For example
you can have sequences of code such as the following:
PROGRAM TEST
.
.
!$OMP PARALLEL
.
.
CALL TEST_SUBR (...)
.
.
!$OMP END PARALLEL
.
.
END
C
SUBROUTINE TEST_SUBR (...)
.
.
!$OMP DO
DO I = 1, N
.
.
ENDDO
!$OMP ENDDO
.
.
RETURN
END
Can you parallelize fftpde.F with fewer parallel regions than were used in
the steps above? How few parallel regions can be used? Does using fewer
parallel regions significantly affect the performance? If you try this
optional step, beware of the need for synchronization barriers between
parallel loops and the need to define private variables properly.
© 1998-2000 The Portland Group, Inc (PGI). All rights reserved.
This software and documentation are supplied under the terms of a
license agreement with The Portland
Group and may not be copied or disclosed except in accordance with
the terms of that agreement.
©Copyright 1998-2000 The Portland Group, Inc (PGI). All rights reserved.
This software and documentation are supplied under the terms of a
license agreement with The Portland
Group and may not be copied or disclosed except in accordance with
the terms of that agreement.