Parallelizing the NAS FT Benchmark: A Tutorial
Introduction to the Usage of PGI Fortran and OpenMP

Keywords: Fortran, Parallel, OpenMP, Linux, Solaris86, Windows NT

System Requirements: Pentium Pro/II/III workstation or server with 2 or more processors and a minimum of 64 MB of memory running Linux 2.0.3 or higher, Solaris86 2.5 or higher, or Windows NT 4.0 or higher. This tutorial assumes the PGI Workstation compilers and tools are already installed on the system. See the PGI Workstation Installation & Release Notes if you need to install the PGI Workstation software prior to beginning the tutorial. The tutorial was developed on a Linux system with dual 150 MHZ Pentium Pro processors. The execution of the entire example program on 1 processor of this system, prior to optimization and parallelization, is just under 1 minute.

This tutorial is based around a serial version of the NAS FT benchmark, one of the 8 NAS Parallel Benchmarks defined by the NAS Division at NASA Ames Research Center. By following the 6 steps outlined below, you will progressively parallelize the program fftpde.F using the OpenMP parallel capabilities of the PGI Fortran compiler(s), inserting directives where necessary. Upon completing this tutorial, you will have a basic working knowledge of how to use a PGI Fortran compiler (PGF77 or PGF90), how to use the PGPROF graphical profiler for performance tuning, how to parallelize a Fortran program, and how to execute a program in parallel on a multi-processor (SMP) workstation or server.

Basic knowledge of Fortran 77 programming is assumed. The example command-lines below assume usage of the UNIX "vi" editor to modify the file "fftpde.F", but any available file editor may be used. If you are not familiar with "vi" or "emacs", you may want to use "pico" on Linux or "edit" on Windows NT. They are less powerful but more intuitive. For an experienced Fortran programmer who is familiar with UNIX command-level compilers and tools, the tutorial takes 2-4 hours to complete, depending on how much you try to do on your own without looking at the completed examples.

You should find the following files in this directory (hereafter referred to as the "working directory"):

If any of these files is missing or others are present, you have somehow received a modified or previously used version of the tutorial. You should retrieve an unmodified copy to ensure successful completion of the tutorial.

In the steps that follow, "% " is used to represent the UNIX shell command prompt (or PGI Workstation for Windows NT command prompt). This tutorial assumes you are working in a csh command shell.

STEP 1: Setting up your environment to use the PGI Workstation compilers

Linux or Solaris

If you are working on a Linux or Solaris system, you must know where the PGI Workstation compilers are installed in order to initialize your environment properly. Common choices for the installation are /usr/pgi or /usr/local/pgi. If you don't know where PGI Workstation is installed on your system, ask your system administrator or the person who installed PGI Workstation for you. Assuming the insallation is in /usr/local/pgi execute the following commands to initialize your environment:

	% setenv PGI /usr/local/pgi
	% set path=($PGI/solaris/bin $path)
	% setenv MANPATH "$MANPATH":$PGI/man

In addition, you may want to bring up the online HTML PGI Workstation manuals.

Windows NT

If you are working on a Windows NT system with a properly installed version of PGI Workstation, double-click the PGI Workstation icon on your desktop with the left mouse button. A command window should appear on your desktop. The environment in the command window will be pre-initialized for usage of the PGI Workstation compilers and tools. Depending on how PGI Workstation was installed, the command window will either be a UNIX-like command shell (bash) or a Microsoft command window. If you wish to use a command window other than the default, select "Start -> Programs -> PGI Workstation" with the left mouse button to choose an appropriate command window.

STEP 2: Compile and execute the original serial version of fftpde.F

Use the following commands to make a copy of the original serial version of fftpde.F in the working directory, compile it using the PGF77 Fortran 77 compiler, execute it on your system, and check the results.

	% cp STEPS/fftpde.F fftpde.F
	% pgf77 fftpde.F -o fftpde
	% fftpde
	% diff fft.out results

The file "fft.out" is created and written by "fftpde". It contains results of the error checking and execution time in CPU seconds. You should see no differences in the numerical results from run to run, but your CPU times will vary depending on the system you are using, the number of processors you use, and the version of the program you are compiling. To see the whole "fft.out" file, type the following

	% cat fft.out

Now re-compile and re-execute fftpde.F using basic optimization with the "-fast" compiler option:

	% pgf77 -fast fftpde.F -o fftpde
	% fftpde
	% diff fft.out results

"fftpde" should execute a little faster this time. The "-fast" option is equivalent to specifying the options "-O2 -Munroll -Mnoframe" individually. This set of options will generally provide near-optimal performance on a single processor system with a minimum of effort. By experimenting with other available options, further significant performance gains can sometimes be realized. If you want to see all of the available pgf77 command-line options, enter the command:

	% man pgf77

or browse the PGI Workstation User's Guide in the online documentation. Note that the "man" command is not available in the PGI Workstation for Windows NT command window, so you'll have to browse the manual on Windows NT.

Finally, re-compile "fftpde.F" adding the "-Minfo" option to the compile line:


	% pgf77 -fast -Minfo fftpde.F -o fftpde

You should see output something like the following:

	fftpde:
	   157, Loop unrolled 4 times
	randlc:
	   233, Loop unrolled 4 times
	   238, Loop unrolled 4 times
	vranlc:
	   308, Loop unrolled 4 times
	   313, Loop unrolled 4 times
	cfft3:
	   401, Loop unrolled 5 times
	   405, Loop unrolled 5 times
	   417, Loop unrolled 5 times
	   421, Loop unrolled 5 times
	   433, Loop unrolled 5 times
	   437, Loop unrolled 5 times
	fft:
	   481, Invariant if transformation
	           copies loop at: 485
	   491, Loop unrolled 2 times
	        Loop unrolled 2 times
	   523, Loop unrolled 6 times
	tabfft:
	   535, Loop unrolled 5 times
	Linking:

The "-Minfo" switch provides what is essentially a compile-time optimization listing. Information is printed to the screen as compilation proceeds in order to convey the types of optimizations or transformations performed, (by program unit and line number).

STEP 3: Profile the serial version of fftpde.F using PGPROF

The first step in the optimization/parallelization process is to determine where a program spends most of its execution time. The PGPROF graphical profiler is designed to convey this information in a simple and intuitive way. Re-compile and re-execute the program using the following commands:

	% pgf77 -fast -Minfo -Mprof=func fftpde.F -o fftpde
	% fftpde
	% diff fft.out results

You should see very little difference in the execution time from your previous run. The option -Mprof=func causes PGF77 to instrument "fftpde" for function level profiling. This allows you to see where execution time is spent on a function-by-function basis. When you execute the program, a trace file called "pgprof.out" will be written to the working directory at the completion of execution. to view the profile, enter the command:

	% pgprof 

and the PGPROF graphical profiler should appear on your screen. Notice that most of the time is spent in 4 program units: fft, cfft3, fftpde, and vranlc. However, we need a way to determine exactly which source lines in those program units are consuming the most time. Quit PGPROF with a left mouse click on File -> Quit.

In addition to function-level profiling, PGPROF supports line-level profiling. Re-compile, re-execute, and profile with the following commands:


	% pgf77 -fast -Minfo -Mprof=lines fftpde.F -o fftpde
	% fftpde
	% diff fft.out results
	% pgprof

Note that with line profiling, the execution time of the program increases significantly. This is sometimes described by saying that line profiling is "intrusive", meaning it affects execution time noticeably. Even though line-level profiling is intrusive, it's generally the case that the added overhead is spread uniformly throughout the program. Line profiling usually provides valuable information about which source lines are consuming the most execution time.

Here's where you need to start thinking. When PGPROF comes up, note that "Time" for the "fft" and "cfft3" program units is higher than for "fftpde". However, the "Cost" of "fftpde" is higher than that of any other program unit. "Time" represents the execution time spent within a particular program unit. "Cost" respresent the execution time spent within a particular program unit *and* all the subroutines or functions it calls. Your goal in parallelizing a program is to do so at as high a level as possible. That means you want to work "top down", i.e. start as high in the call chain as you can. To do that you want to start with program units that are most expensive in terms of "Cost" rather than "Time".

Let's look at "fftpde". Single-left-click in the PGPROF window on the line showing "Time" and "Cost" for "fftpde". That line should now be highlighted (or "selected"). Now double-left-click the "fftpde" line. A window containing the source code for "fftpde" should appear on your screen. You may want to widen the "fftpde" source window. You can do so by placing the cursor over the right border of the PGPROF window -- so that the cursor converts to a horizontal double-arrow line -- then left-clicking and dragging the window to the desired width.

Now scroll down through the source. Note that the call to CFFT3 at line 151 consumes most of the time. To see what percentage of execution time is spent in CFFT3 (in addition to seconds), select "View -> Cost -> Percent of Cost". You should see that 75% or more of the execution time is spent within the call to CFFT3. The loop at line 135, which contains the call to CFFT3, is not parallelizable (why?). Let's look insided CFFT3 and see if we can parallelize it. Exit the line-level window for "fftpde" by selecting "File -> Close".

View the line profile for CFFT3 (see above on viewing the line info for "fftpde" if you don't remember the sequence of commands). Again, view the "Cost" of each line as a percentage. Scroll down and note that the triply-nested loops at lines 399, 415, and 431 consume most of the execution time. Each of these loops is parallelizable! Why? Each is computing sequences of 1D FFTs which are completely independent. The loop at line 399 performs FFTs on the columns (i.e. in the 1st dimension), the loop at line 415 performs FFTs on the rows (second dimension), and and the loop at line 431 performs FFTs on the rays (3rd dimension) of the 3D complex array X1, represented by "X1real" and "X2real". Quit PGPROF by left-clicking on "File -> Quit".

Before we start editing fftpde.F, we should make a point here. PGPROF does not determine which loops are parallelizable, only which loops consume the most time. You must determine by inspection or prior knowledge of the algorithm which loops are parallelizable. That is a topic which we do not cover in this tutorial. Quit PGPROF by selecting the appropriate command with your mouse.

STEP 4: Parallelizing CFFT3

Now edit fftpde.F using the editor of your choice. We assume here that you are using "vi". Enter the editor by issuing the following command:

	% vi fftpde.F

Go to line 399. The loop looks as follows:

          do k=1,n3
                do j=1,n2
                        do i=1,n1
                                z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k))
                        end do
                        call fft(z,inverse,w,n1,m1)
                        do i=1,n1
                                x1real(i,j,k)=real(z(i))
                                x1imag(i,j,k)=aimag(z(i))
                        end do
                end do
          end do

If you are not at all familiar with OpenMP, stop and read the introduction to Chapter 10 of the PGI Workstation User's Guide. It describes in detail how to form OpenMP parallelization directives and how each directive functions. In particular, you want to parallelize the "do k=1,n3" loop. Can you figure out what directives to enter to make this happen? Note that the array "z" is a work array that will need to be used by each processor separately. How do you specify that "z" is local to each processor? Try to determine what directives to insert. If you need a hint, take a look at the file STEPS/fftpde_1.F, which contains a version of fftpde.F after the necessary modifications.

Once you have parallelized the loop at line 399, do the same for the loops around line 415 and around line 430 (the line numbers are changing as you insert new source lines). Then re-compile the program with the "-mp" switch, which enables OpenMP directive processing:


	% pgf77 -fast -Minfo -mp fftpde.F -o fftpde

NOTE: In releases 1.7-6 and prior, there is no -Minfo message to indicate that OpenMP loops are parallelized. This feature was added in the 3.0 release. However, the loops will be parallelized as expected if the OpenMP directives are properly formed.

Then re-execute the program as follows:


	% fftpde
	% diff fft.out results

You should see timings very similar to your original results above. Now to run in parallel execute the following commands if you are in a "csh" shell:

	% setenv OMP_NUM_THREADS 2
	% fftpde

Or the following if you are in a bourne shell, korn shell, or PGI Workstation for Windows NT command shell:

	% OMP_NUM_THREADS=2
	% export OMP_NUM_THREADS
	% fftpde


IMPORTANT NOTE: If your program core dumps at this stage, it is probably because your default stack size is not large enough. Try typing the command:

	% limit

From within a C shell to see what the stack size limit is. On Solaris, issue the command

	% limit stacksize unlimited

and it should take care of the problem. On Linux, you may have to issue the following commands in order to work around the problem:

	% su root
	% csh
	% unlimit stacksize
	% su 

Until versions of Linux are available which allow non-privileged users to set the stacksize, you will have to use this workaround in order to run parallel programs with large amounts of data. On Windows NT, your stacksize should grow automatically as needed.

Here you may see differing timings based on which operating system you are working under. Most operating systems (including Linux and Solaris) return aggregate CPU time from the "etime" timer call. That means you must divide the time reported by the number of processors to get approximate elapsed time. Once you've accounted for this factor, you should see an execution time that is considerably faster than the single processor times (35% - 50%).

Now recompile with -Mprof=func and compare the 1 and 2 processor profiles using the following commands:


	% pgf77 -fast -Minfo -mp -Mprof=func fftpde.F -o fftpde
	% fftpde
	% mv pgprof.out pgprof.out.2
	% setenv OMP_NUM_THREADS 1
	% fftpde
	% mv pgprof.out pgprof.out.1
	% pgprof pgprof.out.1 &
	% pgprof pgprof.out.2 &

Or the following if you are in a bourne shell, korn shell, or PGI Workstation for Windows NT command shell:

	% pgf77 -fast -Minfo -mp -Mprof=func fftpde.F -o fftpde
	% fftpde
	% mv pgprof.out pgprof.out.2
	% OMP_NUM_THREADS=1
	% export OMP_NUM_THREADS
	% fftpde
	% mv pgprof.out pgprof.out.1
	% pgprof pgprof.out.1 &
	% pgprof pgprof.out.2 &

You should see near-linear speedup on the time spent in "fft". The time spend in "cfft3" may or may not show linear speed-up. Most of the time in "cfft3" itself is spent moving data back and forth between the X1 arrays and the work vector. Systems that are bandwidth-limited won't be able to sustain both processors at full speed during this operation.

STEP 5: Parallelizing the remainder of fftpde.F

Using steps similar to those outlined in STEP 4, progressively parallelize the following loops:

Be careful in each case to declare variables PRIVATE if necessary. A good rule of thumb is that any variable being used as a work variable, i.e. which is re-initialized with a new value upon each iteration of the loop, is probably an array you want to declare private.

When you have completed parallelizing all of fftpde.F, you should see execution times on 2 processors that are between 60% and 90% faster than the single-processor version. Speedup is dependent on both the hardware you are using and the size of the problem you are running. To run the actual Class A NAS FT benchmark, you need to set M1 = 8, M2 = 8, and M3 = 7 in the PARAMETER statements near the top of the file fftpde.F On a system with sufficient memory bandwidth, you should see speedups close to 2X on the Class A problem. You will need to be running on a system with 512MB of main memory in order to run the Class A problem.

STEP 6: Optional: Factoring Out Parallel Regions

OpenMP allows you to call subroutines from within parallel regions and place "orphan" OpenMP directives within those subroutines. For example you can have sequences of code such as the following:

	PROGRAM TEST
         .
         .
!$OMP PARALLEL
         .
         .
        CALL TEST_SUBR (...)
         .
         .
!$OMP END PARALLEL
         .
         .
        END
C
        SUBROUTINE TEST_SUBR (...)
         .
         .
!$OMP DO
        DO I = 1, N
         .
         .
        ENDDO
!$OMP ENDDO
         .
         .
        RETURN
        END

Can you parallelize fftpde.F with fewer parallel regions than were used in the steps above? How few parallel regions can be used? Does using fewer parallel regions significantly affect the performance? If you try this optional step, beware of the need for synchronization barriers between parallel loops and the need to define private variables properly.


© 1998-2000 The Portland Group, Inc (PGI). All rights reserved. This software and documentation are supplied under the terms of a license agreement with The Portland Group and may not be copied or disclosed except in accordance with the terms of that agreement.


©Copyright 1998-2000 The Portland Group, Inc (PGI). All rights reserved. This software and documentation are supplied under the terms of a license agreement with The Portland Group and may not be copied or disclosed except in accordance with the terms of that agreement.