Parallelizing the NAS FT Benchmark: A Tutorial Introduction to the Usage of PGI Fortran and OpenMP Keywords: Fortran, Parallel, OpenMP, Linux, Solaris86, Windows NT System Requirements: Pentium Pro/II workstation or server with 2 or more processors and a minimum of 64 MB of memory running Linux 2.0.3 or higher, Solaris86 2.5 or higher, or Windows NT 4.0 or higher. This tutorial assumes the PGI Workstation compilers and tools are already installed on the system. See the PGI Workstation installation notes if you need to install the PGI Workstation software prior to beginning the tutorial. The tutorial was developed on a Linux system with dual 150 MHZ Pentium Pro processors. The execution of the entire example program on 1 processor of this system, prior to optimization and parallelization, is just under 1 minute. This tutorial is based around a serial version of the NAS FT benchmark, one of the 8 NAS Parallel Benchmarks defined by the NAS Division at NASA Ames Research Center. By following the 6 steps outlined below, you will progressively parallelize the program "fftpde.F" using the OpenMP parallel capabilities of the PGI Fortran compiler(s), inserting directives where necessary. Upon completing this tutorial, you will have a basic working knowledge of how to use a PGI Fortran compiler (PGF77 or PGF90), how to use the PGPROF graphical profiler for performance tuning, how to parallelize a Fortran program, and how to execute a program in parallel on a multi-processor (SMP) workstation or server. Basic knowledge of Fortran 77 programming is assumed. The example command-lines below assume usage of the UNIX "vi" editor to modify the file "fftpde.F", but any available file editor may be used. If you are not familiar with "vi" or "emacs", you may want to use "pico" on Linux or "edit" on Windows NT. They are less powerful but more intuitive. For an experienced Fortran programmer who is familiar with UNIX command-level compilers and tools, the tutorial takes 2-4 hours to complete, depending on how much you try to do on your own without looking at the completed examples. You should find the following files in this directory (hereafter referred to as the "working directory"): README This File STEPS A directory containing examples of how fftpde.F should look after each step is completed fftpde.F The working copy of fftpde.F results A file containing the output from a run of fftpde.F which gets correct results -- used for checking answers as each step is completed If any of these files is missing or others are present, you have somehow received a modified or previously used version of the tutorial. You should retrieve an unmodified copy to ensure successful completion of the tutorial. You can find this tutorial on the PGI Web site at ftp://ftp.pgroup.com/pub/SMP/fftpde.tar.gz In the steps that follow, "% " is used to represent the UNIX shell command prompt (or PGI Workstation for Windows NT command prompt). This README file assumes you are working in a "csh" command shell. ******* STEP 1: Setting up your environment to use the PGI Workstation compilers ******* Linux or Solaris ---------------- If you are working on a Linux or Solaris system, you must know where the PGI Workstation compilers are installed in order to initialize your environment properly. Common choices for the installation are "/usr/pgi" or "/usr/local/pgi". If you don't know where PGI Workstation is installed on your system, ask your system administrator or the person who installed PGI Workstation for you. Assuming the insallation is in /usr/local/pgi execute the following commands to initialize your environment: % setenv PGI /usr/local/pgi % set path=($PGI/solaris/bin $path) % setenv MANPATH "$MANPATH":$PGI/man In addition, you may want to bring up the online HTML PGI Workstation manuals by opening the file /usr/local/pgi/doc in a web browser. Windows NT ---------- If you are working on a Windows NT system with a properly installed version of PGI Workstation, double-click the PGI Workstation icon on your desktop with the left mouse button. A command window should appear on your desktop. The environment in the command window will be pre-initialized for usage of the PGI Workstation compilers and tools. Depending on how PGI Workstation was installed, the command window will either be a UNIX-like command shell (bash) or a Microsoft command window. If you wish to use a command window other than the default, select "Start -> Programs -> PGI Workstation" with the left mouse button to choose an appropriate command window. ******* STEP 2: Compile and execute the original serial version of fftpde.F. ******* Use the following commands to make a copy of the original serial version of fftpde.F in the working directory, compile it using the PGF77 Fortran 77 compiler, execute it on your system, and check the results. % cp STEPS/fftpde.F fftpde.F % pgf77 fftpde.F -o fftpde % fftpde % diff fft.out results The file "fft.out" is created and written by "fftpde". It contains results of the error checking and execution time in CPU seconds. You should see no differences in the numerical results from run to run, but your CPU times will vary depending on the system you are using, the number of processors you use, and the version of the program you are compiling. To see the whole "fft.out" file, type the following % cat fft.out Now re-compile and re-execute fftpde.F using basic optimization with the "-fast" compiler option: % pgf77 -fast fftpde.F -o fftpde % fftpde % diff fft.out results "fftpde" should execute a little faster this time. The "-fast" option is equivalent to specifying the options "-O2 -Munroll -Mnoframe" individually. This set of options will generally provide near-optimal performance on a single processor system with a minimum of effort. By experimenting with other available options, further significant performance gains can sometimes be realized. If you want to see all of the available pgf77 command-line options, enter the command: % man pgf77 or browse the PGI Workstation User's Guide in the online documentation. Note that the "man" command is not available in the PGI Workstation for Windows NT command window, so you'll have to browse the manual on Windows NT. Finally, re-compile "fftpde.F" adding the "-Minfo" option to the compile line: % pgf77 -fast -Minfo fftpde.F -o fftpde You should see output something like the following: fftpde: 157, Loop unrolled 4 times randlc: 233, Loop unrolled 4 times 238, Loop unrolled 4 times vranlc: 308, Loop unrolled 4 times 313, Loop unrolled 4 times cfft3: 401, Loop unrolled 5 times 405, Loop unrolled 5 times 417, Loop unrolled 5 times 421, Loop unrolled 5 times 433, Loop unrolled 5 times 437, Loop unrolled 5 times fft: 481, Invariant if transformation copies loop at: 485 491, Loop unrolled 2 times Loop unrolled 2 times 523, Loop unrolled 6 times tabfft: 535, Loop unrolled 5 times Linking: The "-Minfo" switch provides what is essentially a compile-time optimization listing. Information is printed to the screen as compilation proceeds in order to convey the types of optimizations or transformations performed, (by program unit and line number). ******* STEP 3: Profile the serial version of fftpde.F using PGPROF. ******* The first step in the optimization/parallelization process is to determine where a program spends most of its execution time. The PGPROF graphical profiler is designed to convey this information in a simple and intuitive way. Re-compile and re-execute the program using the following commands: % pgf77 -fast -Minfo -Mprof=func fftpde.F -o fftpde % fftpde % diff fft.out results You should see very little difference in the execution time from your previous run. The option -Mprof=func causes PGF77 to instrument "fftpde" for function level profiling. This allows you to see where execution time is spent on a function-by-function basis. When you execute the program, a trace file called "pgprof.out" will be written to the working directory at the completion of execution. to view the profile, enter the command: % pgprof and the PGPROF graphical profiler should appear on your screen. Notice that most of the time is spent in 4 program units: fft, cfft3, fftpde, and vranlc. However, we need a way to determine exactly which source lines in those program units are consuming the most time. Quit PGPROF with a left mouse click on File -> Quit. In addition to function-level profiling, PGPROF supports line-level profiling. Re-compile, re-execute, and profile with the following commands: % pgf77 -fast -Minfo -Mprof=lines fftpde.F -o fftpde % fftpde % diff fft.out results % pgprof Note that with line profiling, the execution time of the program increases significantly. This is sometimes described by saying that line profiling is "intrusive", meaning it affects execution time noticeably. Even though line-level profiling is intrusive, it's generally the case that the added overhead is spread uniformly throughout the program. Line profiling usually provides valuable information about which source lines are consuming the most execution time. Here's where you need to start thinking. When PGPROF comes up, note that "Time" for the "fft" and "cfft3" program units is higher than for "fftpde". However, the "Cost" of "fftpde" is higher than that of any other program unit. "Time" represents the execution time spent within a particular program unit. "Cost" respresent the execution time spent within a particular program unit *and* all the subroutines or functions it calls. Your goal in parallelizing a program is to do so at as high a level as possible. That means you want to work "top down", i.e. start as high in the call chain as you can. To do that you want to start with program units that are most expensive in terms of "Cost" rather than "Time". Let's look at "fftpde". Single-left-click in the PGPROF window on the line showing "Time" and "Cost" for "fftpde". That line should now be highlighted (or "selected"). Now double-left-click the "fftpde" line. A window containing the source code for "fftpde" should appear on your screen. You may want to widen the "fftpde" source window. You can do so by placing the cursor over the right border of the PGPROF window -- so that the cursor converts to a horizontal double-arrow line -- then left-clicking and dragging the window to the desired width. Now scroll down through the source. Note that the call to CFFT3 at line 151 consumes most of the time. To see what percentage of execution time is spent in CFFT3 (in addition to seconds), select "View -> Cost -> Percent of Cost". You should see that 75% or more of the execution time is spent within the call to CFFT3. The loop at line 135, which contains the call to CFFT3, is not parallelizable (why?). Let's look insided CFFT3 and see if we can parallelize it. Exit the line-level window for "fftpde" by selecting "File -> Close". View the line profile for CFFT3 (see above on viewing the line info for "fftpde" if you don't remember the sequence of commands). Again, view the "Cost" of each line as a percentage. Scroll down and note that the triply-nested loops at lines 399, 415, and 431 consume most of the execution time. Each of these loops is parallelizable! Why? Each is computing sequences of 1D FFTs which are completely independent. The loop at line 399 performs FFTs on the columns (i.e. in the 1st dimension), the loop at line 415 performs FFTs on the rows (second dimension), and and the loop at line 431 performs FFTs on the rays (3rd dimension) of the 3D complex array X1, represented by "X1real" and "X2real". Quit PGPROF by left-clicking on "File -> Quit". Before we start editing fftpde.F, we should make a point here. PGPROF does not determine which loops are parallelizable, only which loops consume the most time. You must determine by inspection or prior knowledge of the algorithm which loops are parallelizable. That is a topic which we do not cover in this tutorial. Quit PGPROF by selecting the appropriate command with your mouse. ******* STEP 4: Parallelizing CFFT3 ******* Now edit fftpde.F using the editor of your choice. We assume here that you are using "vi". Enter the editor by issuing the following command: % vi fftpde.F Go to line 399. The loop looks as follows: do k=1,n3 do j=1,n2 do i=1,n1 z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k)) end do call fft(z,inverse,w,n1,m1) do i=1,n1 x1real(i,j,k)=real(z(i)) x1imag(i,j,k)=aimag(z(i)) end do end do end do If you are not at all familiar with OpenMP, stop and read the introduction to Chapter 10 of the PGI Workstation User's Guide. It describes in detail how to form OpenMP parallelization directives and how each directive functions. In particular, you want to parallelize the "do k=1,n3" loop. Can you figure out what directives to enter to make this happen? Note that the array "z" is a work array that will need to be used by each processor separately. How do you specify that "z" is local to each processor? Try to determine what directives to insert. If you need a hint, take a look at the file STEPS/fftpde_1.F, which contains a version of fftpde.F after the necessary modifications. Once you have parallelized the loop at line 399, do the same for the loops around line 415 and around line 430 (the line numbers are changing as you insert new source lines). Then re-compile the program with the "-mp" switch, which enables OpenMP directive processing: % pgf77 -fast -Minfo -mp fftpde.F -o fftpde NOTE: In releases 1.7-6 and prior, there is no -Minfo message to indicate that OpenMP loops are parallelized. This feature was added in the 3.0 release. However, the loops will be parallelized as expected if the OpenMP directives are properly formed. Then re-execute the program as follows: % fftpde % diff fft.out results You should see timings very similar to your original results above. Now to run in parallel execute the following commands if you are in a "csh" shell: % setenv OMP_NUM_THREADS 2 % fftpde Or the following if you are in a bourne shell, korn shell, or PGI Workstation for Windows NT command shell: % OMP_NUM_THREADS=2 % export OMP_NUM_THREADS % fftpde NOTE: If your program core dumps at this stage, it is probably because your NOTE: default stack size is not large enough. Try typing the command: NOTE: NOTE: % limit NOTE: NOTE: From within a C shell to see what the stack size limit is. On Solaris, NOTE: issue the command NOTE: NOTE: % limit stacksize unlimited NOTE: NOTE: and it should take care of the problem. On Linux, you may have to NOTE: issue the following commands in order to work around the problem: NOTE: NOTE: % su root NOTE: % csh NOTE: % unlimit stacksize NOTE: % su NOTE: NOTE: Until versions of Linux are available which allow non-privileged users NOTE: to set the stacksize, you will have to use this workaround in order NOTE: to run parallel programs with large amounts of data. NOTE: NOTE: On Windows NT, your stacksize should grow automatically as needed. Here you may see differing timings based on which operating system you are working under. Most operating systems (including Linux and Solaris) return aggregate CPU time from the "etime" timer call. That means you must divide the time reported by the number of processors to get approximate elapsed time. Once you've accounted for this factor, you should see an execution time that is considerably faster than the single processor times (35% - 50%). Now recompile with -Mprof=func and compare the 1 and 2 processor profiles using the following commands: % pgf77 -fast -Minfo -mp -Mprof=func fftpde.F -o fftpde % fftpde % mv pgprof.out pgprof.out.2 % setenv OMP_NUM_THREADS 1 % fftpde % mv pgprof.out pgprof.out.1 % pgprof pgprof.out.1 & % pgprof pgprof.out.2 & Or the following if you are in a bourne shell, korn shell, or PGI Workstation for Windows NT command shell: % pgf77 -fast -Minfo -mp -Mprof=func fftpde.F -o fftpde % fftpde % mv pgprof.out pgprof.out.2 % OMP_NUM_THREADS=1 % export OMP_NUM_THREADS % fftpde % mv pgprof.out pgprof.out.1 % pgprof pgprof.out.1 & % pgprof pgprof.out.2 & You should see near-linear speedup on the time spent in "fft". The time spend in "cfft3" may or may not show linear speed-up. Most of the time in "cfft3" itself is spent moving data back and forth between the X1 arrays and the work vector. Systems that are bandwidth-limited won't be able to sustain both processors at full speed during this operation. ******* STEP 5: Parallelizing the remainder of fftpde.F ******* Using steps similar to those outlined in STEP 4, progressively parallelize the following loops: 1) The DO 130 loop near line 64 (STEPS/fftpde_2.F) 2) The DO 190 loop near line 115 (STEPS/fftpde_3.F) 3) The DO 220 loop near line 139 (STEPS/fftpde_4.F) 4) The DO 250 loop near line 155 (STEPS/fftpde_5.F) Be careful in each case to declare variables PRIVATE if necessary. A good rule of thumb is that any variable being used as a work variable, i.e. which is re-initialized with a new value upon each iteration of the loop, is probably an array you want to declare private. When you have completed parallelizing all of fftpde.F, you should see execution times on 2 processors that are between 60% and 90% faster than the single-processor version. Speedup is dependent on both the hardware you are using and the size of the problem you are running. To run the actual Class A NAS FT benchmark, you need to set M1 = 8, M2 = 8, and M3 = 7 in the PARAMETER statements near the top of the file fftpde.F On a system with sufficient memory bandwidth, you should see speedups close to 2X on the Class A problem. You will need to be running on a system with 512MB of main memory in order to run the Class A problem. ******* STEP 6: Optional: Factoring Out Parallel Regions ******* OpenMP allows you to call subroutines from within parallel regions and place "orphan" OpenMP directives within those subroutines. For example you can have sequences of code such as the following: PROGRAM TEST . . !$OMP PARALLEL . . CALL TEST_SUBR (...) . . !$OMP END PARALLEL . . END C SUBROUTINE TEST_SUBR (...) . . !$OMP DO DO I = 1, N . . ENDDO !$OMP ENDDO . . RETURN END Can you parallelize fftpde.F with fewer parallel regions than were used in the steps above? How few parallel regions can be used? Does using fewer parallel regions significantly affect the performance? If you try this optional step, beware of the need for synchronization barriers between parallel loops and the need to define private variables properly.