Technical News from The Portland Group

Building Cactus BenchADM with PGI Accelerator Compilers

1. Introduction

The following is a step-by-step tutorial on using the PGI 9.0 Accelerator Fortran compiler. The tutorial assumes you have installed and configured the PGI 9.0 compilers and a suitable version of the NVIDIA CUDA software (see the PGI Installation Guide for details). It also assumes you have a basic understanding of the PGI Accelerator programming model, sometimes referred to as the "kernels" programming model. If not, before continuing you should read Part 1 of Michael Wolfe's The PGI Accelerator Programming Model on NVIDIA GPUs. Complete documentation of the PGI Accelerator programming model is available in the PGI User's Guide and the PGI 9.0 Release Notes.

The tutorial uses the Cactus BenchADM Benchmark. Although it is an older benchmark, we chose BenchADM because of our familiarity with the SPEC® CPU2006 version (436.cactusADM). The source used here was derived from the same original source used by 436.cactusADM, though the workload used is not the same. Note that BenchADM uses double-precision arithmetic.

To get started, first download the tutorial source file package from the PGI website and unpack the source in a local directory. Next, bring up a shell command window (we use csh), cd to the source directory "PGI_Acc_benchADM", and ensure your environment is initialized for use of the PGI 9.0 compilers. For example:

 % which pgfortran
 /opt/pgi/linux86-64/9.0-3/bin/pgfortran

If your environment is not yet initialized, execute the following commands to do so:

 % setenv PGI /opt/pgi
 % set path=($PGI/linux86-64/9.0-3/bin $path)
 % setenv LM_LICENSE_FILE $PGI/license.dat

You may have to ask your system administrator for the pathname to the FLEXnet license file containing your PGI 9.0 license keys.

Note that a text version of this tutorial is included in the tar package.

2. Profiling the Application

The first step to effectively working with the PGI Accelerator compilers is to understand which parts of your code can benefit from GPU acceleration. The most efficient way to determine this is to profile your code using the PGPROF performance profiler.

First, build BenchADM with options that enable the Common Compiler Feedback Format (CCFF) and profile-guided feedback, then run it using the PGPROF hardware performance sampling utility, PGCOLLECT.

Method 1: Profile-Guided Feedback and Hardware Counter Sampling

  % make SIZE=80 OPT="-fast -Mpfi -Minfo=ccff" build_base 
  % make SIZE=80 OPT="-fast -Mpfi -Minfo=ccff" run_base
  % make clean
  % make SIZE=80 OPT="-fast -Mpfo -Minfo=ccff" build_base 
  % make SIZE=80 OPT="-fast -Mpfo -Minfo=ccff" pgcollect_base

PGCOLLECT is supported on all Linux and MacOS systems; support may be limited on certain types of Windows systems. If for some reason PGCOLLECT does not work for you, you can use the alternate instrumentation profiling method below.

Method 2: Profiling with Instrumentation

  % make SIZE=80 OPT="-fast -Mpfi -Minfo=ccff" build_base 
  % make SIZE=80 OPT="-fast -Mpfi -Minfo=ccff" run_base
  % make clean
  % make SIZE=80 OPT="-fast -Mpfo -Mprof=lines -Minfo=ccff" build_base 
  % make SIZE=80 OPT="-fast -Mpfo -Mprof=lines -Minfo=ccff" run_base

3. Reviewing the Profile

With a profile generated, run the PGI performance profiling tool, PGPROF:

  % pgprof -exe bin/benchADM_base

BenchADM has a fairly simple profile with ~98% of the time spent in a single routine called "bench_staggeredleapfrog2".

BenchADM Profile with PGPROF

Double-left-click on this routine in the main profile window, and you will see the source code for the routine. It is actually quite large (~800 lines) and performs a complex series of calculations. It's broken up into four main loops at lines 300, 319, 338, and 366. Scrolling through the source, you can see that the loops at lines 338 and 366 account for about 21% and 78% of the time in this routine, respectively. Timings may vary depending on the host x64 processor type.

BenchADM Profile Detail

The second item to look for is the compute intensity of each loop. The compute intensity of a loop is the ratio of floating-point computations to memory accesses in the body of the loop. The higher the compute intensity, the more likely you will see speed-up on a GPU accelerator. Left-clicking the "(i)" symbol next to each loop, we can see the intensities are very low (less than 1) for the first three loops. However, the loop at line 366 has an intensity of about 40. Given the high intensity and percentage of time spent in this loop, it seems to be a good candidate for GPU acceleration.

BenchADM Profile Info
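
To make the notion of compute intensity concrete, consider the following small sketch (these loops are illustrative only and are not taken from BenchADM; the arrays and the bound n are made up):

      program intensity_example
      integer, parameter :: n = 1000
      double precision a(n), b(n), c(n), d(n)
      integer i
      a = 1.0d0
      b = 2.0d0
!     Low compute intensity: roughly 1 floating-point operation per
!     3 memory references (2 loads + 1 store), so intensity < 1.
      do i = 1, n
         c(i) = a(i) + b(i)
      end do
!     Higher compute intensity: several floating-point operations for
!     the same 3 memory references.
      do i = 1, n
         d(i) = (a(i)+b(i))*(a(i)-b(i)) + a(i)*b(i) + a(i)*a(i)
      end do
      print *, c(1), d(1)
      end program

The loop at line 366 of bench_staggeredleapfrog2 is much closer to the second case: it performs many floating-point operations for each array element it loads or stores, which is why its reported intensity is around 40.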

4. Baseline Performance

The point of using a GPU accelerator is to speed up your code, so we first need to create a baseline performance measurement.

Note: SIZE refers to the grid size set in the "BenchADM_40l.par" input file. The larger the grid, the longer the runtime.

  % make SIZE=120 OPT="-fast" clean
  % make SIZE=120 OPT="-fast" build_base 
  % make SIZE=120 OPT="-fast" run_base

On our 2.67GHz Intel Core i7 processor, the baseline run time is about 58 seconds for size 120. Your times will vary!

5. Adding PGI Accelerator !$ACC Directives

Now that we have identified a likely candidate loop for acceleration and created a performance baseline, let's add our first PGI Accelerator directives to the code. Open the file "src/StaggeredLeapfrog2.F" in a text editor. Insert !$ACC REGION on a new line right above line 366, then add a closing directive !$ACC END REGION on a new line right below line 815 (the end do corresponding to the Do loop at line 366). Write the modified code out into a file named "src/StaggeredLeapfrog2_acc1.F".
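
If you would like to see the directive pair in a self-contained setting first, here is a minimal sketch of the same pattern around a triply nested loop. This toy program is ours, not BenchADM code; compile it with -ta=nvidia to offload the region, or without that option to run entirely on the host:

      program region_example
      integer, parameter :: nx=64, ny=64, nz=64
      double precision a(nx,ny,nz)
      integer i, j, k
!$ACC REGION
      do k = 2,nz-1
         do j = 2,ny-1
            do i = 2,nx-1
               a(i,j,k) = dble(i+j+k)
            end do
         end do
      end do
!$ACC END REGION
      print *, a(2,2,2), a(nx-1,ny-1,nz-1)
      end program

In StaggeredLeapfrog2_acc1.F the pattern is the same; the only difference is that the loop body between the two directives is several hundred lines long.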

Re-build the code using the command:

  % make OPT="-fast -ta=nvidia,time -Minfo=accel" build_acc1

The -ta=nvidia,time target accelerator option instructs the compiler to interpret PGI Accelerator directives and try to offload accelerator regions to an NVIDIA GPU. The ,time sub-option instructs the compiler to measure the amount of time spent in each compute kernel, as well as time spent performing data movement between host memory and GPU device memory.

The -Minfo=accel option instructs the compiler to emit CCFF messages to stdout during compilation. The messages indicate whether an accelerator region is generated, which arrays are copied between host memory and GPU accelerator memory, and which schedule (parallel and/or vector) is used to map the kernel onto the GPU accelerator processors.

Scrolling back in your command window, you should see output from the PGI compiler that looks something like the following:

pgfortran  -fast -ta=nvidia,time -Minfo=accel   -c -o \
 objdir/StaggeredLeapfrog2_acc1.o ./src/StaggeredLeapfrog2_acc1.F
bench_staggeredleapfrog2:
    366, Generating copyout(adm_kzz_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyout(adm_kyz_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(lalp(1:nx-2+2,1:ny-2+2,1:nz-2+2))
         Generating copyout(adm_kyy_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyout(adm_kxz_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyout(adm_kxy_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyout(adm_kxx_stag(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(lgzz(1:nx-2+2,1:ny-2+2,1:nz-2+2))
         Generating copyin(lgyz(1:nx-2+2,1:ny-2+2,1:nz-2+2))
         Generating copyin(lgyy(1:nx-2+2,1:ny-2+2,1:nz-2+2))
         Generating copyin(lgxz(1:nx-2+2,1:ny-2+2,1:nz-2+2))
         Generating copyin(lgxy(1:nx-2+2,1:ny-2+2,1:nz-2+2))
         Generating copyin(lgxx(1:nx-2+2,1:ny-2+2,1:nz-2+2))
         Generating copyin(adm_kzz_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(adm_kzz_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(adm_kyz_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(adm_kyz_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(adm_kyy_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(adm_kyy_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(adm_kxz_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(adm_kxz_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(adm_kxy_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(adm_kxy_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(adm_kxx_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
         Generating copyin(adm_kxx_stag_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))
    367, Loop is parallelizable
    371, Loop is parallelizable
    375, Loop is parallelizable
         Accelerator kernel generated
        367, !$acc do parallel, vector(8)
        371, !$acc do vector(8)
        375, !$acc do parallel
             Non-stride-1 accesses for array 'adm_kxx_stag_p'
             Non-stride-1 accesses for array 'adm_kxx_stag_p_p'
             Non-stride-1 accesses for array 'adm_kxy_stag_p'
             Non-stride-1 accesses for array 'adm_kxy_stag_p_p'
             Non-stride-1 accesses for array 'adm_kxz_stag_p'
             Non-stride-1 accesses for array 'adm_kxz_stag_p_p'
             Non-stride-1 accesses for array 'adm_kyy_stag_p'
             Non-stride-1 accesses for array 'adm_kyy_stag_p_p'
             Non-stride-1 accesses for array 'adm_kyz_stag_p'
             Non-stride-1 accesses for array 'adm_kyz_stag_p_p'
             Non-stride-1 accesses for array 'adm_kzz_stag_p'
             Non-stride-1 accesses for array 'adm_kzz_stag_p_p'
             Non-stride-1 accesses for array 'lgxx'
             Non-stride-1 accesses for array 'lgxy'
             Non-stride-1 accesses for array 'lgxz'
             Non-stride-1 accesses for array 'lgyy'
             Non-stride-1 accesses for array 'lgyz'
             Non-stride-1 accesses for array 'lgzz'
             Non-stride-1 accesses for array 'adm_kxx_stag'
             Non-stride-1 accesses for array 'adm_kxy_stag'
             Non-stride-1 accesses for array 'adm_kxz_stag'
             Non-stride-1 accesses for array 'adm_kyy_stag'
             Non-stride-1 accesses for array 'lalp'
             Non-stride-1 accesses for array 'adm_kyz_stag'
             Non-stride-1 accesses for array 'adm_kzz_stag'

You've just succeeded in mapping a 450-line loop onto a GPU accelerator! If you don't see such output, you may have made an error editing the source. You will find a version pre-edited with the directives and known to work in the file "src/STEPS/StaggeredLeapfrog2_acc1.F". Feel free to compare to that version, or to simply copy it to "src/StaggeredLeapfrog2_acc1.F" and try again. There are similar pre-edited versions in "src/STEPS" for each step of this tutorial.

6. Executing Using the GPU Accelerator

To run the generated executable on the GPU, use the following command:

  % make SIZE=120 OPT="-fast -ta=nvidia,time -Minfo=accel" run_acc1

The timing output generated by the ",time" suboption looks something like the following. Time spent in kernel execution and on data movement is reported in microseconds. The total time is about the same as the host-only run, with about 27 seconds spent executing on the GPU accelerator (in this case a Tesla C1060 with 240 cores and 4 GBytes of device main memory). About 16 seconds is spent transferring data back and forth between host memory and GPU device memory:

Accelerator Kernel Timing data
./src/StaggeredLeapfrog2_acc1.F
  bench_staggeredleapfrog2
    366: region entered 100 times
        time(us): total=43019899 init=68 region=43019831
                  kernels=27349613 data=15670218
        w/o init: total=43019831 max=433198 min=429363 avg=430198
        375: kernel launched 100 times
            grid: [118x15]  block: [8x8]
            time(us): total=27349613 max=274270 min=272809 avg=273496
acc_init.c
  acc_init
    1: region entered 1 time
        time(us): init=1395632
53.11user 4.43system 0:57.75elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2205916minor)pagefaults 0swaps

Side note: The GPU accelerator card needs to be initialized before use. This is done by default the first time an accelerator region is entered. Alternatively, you can add a call to the function "acc_init()" to initialize the device. We have added the following code to BenchADM's initialization routine in the file "src/Cactus/InitialiseCactus_acc.c":

#ifdef _ACCEL
#  include "accel.h"
#endif
....
#ifdef _ACCEL
  acc_init(acc_device_nvidia);
#endif

The _ACCEL macro name is defined if the PGI compilers are invoked using the -ta option, and will be undefined otherwise. This is an example of a portable method for factoring GPU initialization out of accelerator regions, to ensure the timings for the regions do not include initialization overhead.
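
BenchADM happens to make this call from C, but the same runtime routine is available from Fortran. The following minimal sketch assumes the accel_lib module described in the PGI User's Guide; the toy program itself is ours, not part of BenchADM:

      program init_example
#ifdef _ACCEL
      use accel_lib
#endif
      implicit none
#ifdef _ACCEL
!     Initialize the NVIDIA device up front so later accelerator
!     regions are not charged for the one-time startup cost.
      call acc_init(acc_device_nvidia)
#endif
      print *, 'initialization complete'
      end program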

7. Improving Data Transfer Speed

In order to minimize data transfers between the host memory and GPU accelerator device memory, the PGI Accelerator compilers compute the minimal amount of data that must be copied. The following -Minfo message shows the compiler has determined that only a section of the array "adm_kzz_stag_p_p" must be copied to the GPU.

   Generating copyin(adm_kzz_stag_p_p(2:nx-2+1,2:ny-2+1,2:nz-2+1))

However, this array section is not contiguous; because the data transfer is done via DMA (direct memory access) operations, the transfer will send each contiguous subarray separately. Often it is more efficient to copy an entire array: even though more data is transferred, it can be done in one large DMA operation.
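
To make the contrast concrete, here is what the two choices look like for one of the arrays from the -Minfo output above, written as explicit clauses. This is a sketch only, assuming the array is dimensioned (nx,ny,nz) like its relatives; it is not the complete clause list you will need:

!     Interior section (the compiler's default choice written out):
!     non-contiguous, so the transfer becomes one DMA operation per
!     contiguous subarray
!$ACC REGION COPYIN(adm_kzz_stag_p_p(2:nx-1,2:ny-1,2:nz-1))

!     Whole array: more data is moved, but in a single large DMA
!$ACC REGION COPYIN(adm_kzz_stag_p_p(1:nx,1:ny,1:nz))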

Copy "src/StaggeredLeapfrog2_acc1.F" to "src/StaggerredLeapfrog2_acc2.F" and bring up the latter in a text editor. Modify the accelerator region directive by inserting data copy clauses that will cause whole arrays to be copied, rather than array sections. Alternatively, just copy the file "src/STEPS/StaggeredLeapfrog2_acc2.F" to the "src" directory.

Hint: look at the -Minfo messages given by the compiler to determine which arrays are copied in/out, and insert directives that look as follows to explicitly specify the whole-array copies:

!$ACC REGION
!$ACC& COPYOUT (
!$ACC& ADM_kxx_stag(1:nx,1:ny,1:nz),
!$ACC& ADM_kxy_stag(1:nx,1:ny,1:nz),
...

Build and execute the revised code using the following command:

  % make SIZE=120 OPT="-fast -ta=nvidia,time -Minfo=accel" build_acc2 
  % make SIZE=120 OPT="-fast -ta=nvidia,time -Minfo=accel" run_acc2
  ...
Accelerator Kernel Timing data
./src/StaggeredLeapfrog2_acc2.F
  bench_staggeredleapfrog2
    369: region entered 100 times
        time(us): total=34135122 init=67 region=34135055
                  kernels=27004177 data=7130878
        w/o init: total=34135055 max=343213 min=340418 avg=341350
        408: kernel launched 100 times
            grid: [118x15]  block: [8x8]
            time(us): total=27004177 max=270796 min=269179 avg=270041
acc_init.c
  acc_init
    1: region entered 1 time
        time(us): init=1396119
44.27user 4.36system 0:48.84elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2205906minor)pagefaults 0swaps

The revised code with directives to help the compiler optimize data movement reduces host-to-GPU data transfer time by almost nine seconds.

8. Improving the GPU Accelerator Kernel Schedule

We succeeded in improving data transfer performance by adding directives. Can we also improve on the GPU accelerator kernel schedule that was chosen automatically by the compiler? The schedule the compiler generated is listed in the -Minfo messages:

    ...    
    408, Loop is parallelizable
         Accelerator kernel generated
        400, !$acc do parallel, vector(8)
        404, !$acc do vector(8)
        408, !$acc do parallel

This set of messages tells you that the body of the loop nest at line 400 was translated into an accelerator kernel. The kernel is contained in three nested loops, at lines 400 (do k), 404 (do j) and 408 (do i). The compiler created a schedule in which an 8x8 block of iterations from the 'k' and 'j' loops executes in vector or synchronous mode. This 8x8 block of iterations corresponds to a 'thread block' in GPU terms. The 8x8 blocks are then scheduled across the GPU multiprocessors by spreading iterations of the 'k' and 'i' loops in parallel.

This turns out to be an unfortunate choice of schedule. Most of the arrays in the loop use the 'i' loop index in the stride-1 dimension. For best performance, the GPU wants the stride-1 dimension to be a vector index, and perhaps the only vector index. The 'Non-stride-1 accesses for array' messages point out the arrays with non-stride-1 accesses in the vector loop indices.
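
To see why, recall that Fortran arrays are stored in column-major order, so the leftmost subscript is the stride-1 dimension. A simplified sketch of the access pattern (not the actual BenchADM source; assume a and b are dimensioned (nx,ny,nz)):

!     Column-major layout: consecutive 'i' iterations touch adjacent
!     memory locations (stride-1), while consecutive 'j' or 'k'
!     iterations jump by nx or nx*ny elements.
      do k = 2,nz-1
         do j = 2,ny-1
            do i = 2,nx-1
               b(i,j,k) = a(i-1,j,k) + a(i+1,j,k)
            end do
         end do
      end do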

By changing the schedule to parallelize only the outer loop and vectorize the middle loop, we can eliminate several Non-stride-1 memory accesses. Copy "src/StaggeredLeapfrog2_acc2.F" to "src/StaggeredLeapfrog2_acc3.F" and bring up the latter in a text editor. Starting at line 400, insert the PGI Accelerator directives to specify this schedule. (Again, the pre-edited source can be found in "src/STEPS/StaggeredLeapfrog2_acc3.F")

Lines 400–406 of "src/StaggeredLeapfrog2_acc3.F" should look as follows:

!$ACC DO PARALLEL
      do k = 2,nz-1
         km = k-1
         k0 = k
         kp = k+1
!$ACC DO VECTOR(32)
         do j=2,ny-1

Rebuild and re-run as follows:

  % make SIZE=120 OPT="-fast -ta=nvidia,time -Minfo=accel" build_acc3 run_acc3

The -Minfo messages now show that the compiler generates significantly fewer non-stride-1 accesses, is able to store several array elements in registers, and is able to store one array in the high-speed low-latency software-managed cache memory attached to each multi-processor:

    410, Loop is parallelizable
         Accelerator kernel generated
        401, !$acc do parallel
        406, !$acc do vector(32)
             Using register for 'adm_kxx_stag_p'
             Using register for 'adm_kxy_stag_p'
             Using register for 'adm_kxz_stag_p'
             Using register for 'adm_kyy_stag_p'
             Using register for 'adm_kyz_stag_p'
             Using register for 'adm_kzz_stag_p'
             Non-stride-1 accesses for array 'lgxx'
             Non-stride-1 accesses for array 'lgxy'
             Non-stride-1 accesses for array 'lgxz'
             Non-stride-1 accesses for array 'lgyy'
             Non-stride-1 accesses for array 'lgyz'
             Non-stride-1 accesses for array 'lgzz'
             Cached references to size [10x34x3] block of 'lalp'
        410, !$acc do parallel, vector(8)

The new timing information shows a 19 second improvement in the kernel time.

Accelerator Kernel Timing data
./src/StaggeredLeapfrog2_acc3.F
  bench_staggeredleapfrog2
    369: region entered 100 times
        time(us): total=15380541 init=64 region=15380477
                  kernels=8267902 data=7112575
        w/o init: total=15380477 max=155189 min=153277 avg=153804
        410: kernel launched 100 times
            grid: [118x15]  block: [8x32]
            time(us): total=8267902 max=83164 min=82197 avg=82679
acc_init.c
  acc_init
    1: region entered 1 time
        time(us): init=1394681
25.50user 4.33system 0:30.03elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2205922minor)pagefaults 0swaps

Can you optimize the kernel schedule further? The answer is yes, but it usually takes some experimentation with the DO PARALLEL, VECTOR and SEQ directives to find an optimal combination. Keep in mind these guidelines (one possible combination is sketched after the list):

  • Schedule outermost loops as PARALLEL whenever possible.
  • At most two loops in a nest can have PARALLEL or PARALLEL(width) directives.
  • Experiment with parallelizing multiple loops in a nest, and also with scheduling certain of the loops using SEQ for sequential execution on the thread processors.
  • Try to arrange for stride one innermost loops, and schedule them for VECTOR execution.
  • Try to make the 'width' in VECTOR(width) directives a multiple of 32 to match the NVIDIA CUDA Warp size.
  • At most three loops can have VECTOR(width) directives, and the product of all the 'width' arguments must be less than or equal to 256.
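
One combination to try, shown schematically with only the directive placement (this illustrates the guidelines and is not necessarily the schedule used in "src/STEPS/StaggeredLeapfrog2_acc4.F"; intervening statements are omitted and the inner loop bound is assumed):

!$ACC DO PARALLEL
      do k = 2,nz-1
         km = k-1
         k0 = k
         kp = k+1
!$ACC DO SEQ
         do j = 2,ny-1
!$ACC DO VECTOR(128)
            do i = 2,nx-1

With a single VECTOR loop, a width of 128 satisfies both the multiple-of-32 guideline and the limit of 256 on the product of all vector widths.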

Spend some time trying to optimize the schedule for the loop nest at line 400. See if you can improve on the schedule we used above.

A somewhat more optimal schedule is included in the version at "src/STEPS/StaggeredLeapfrog2_acc4.F". Try copying it up to the "src" directory and then rebuild and re-run as follows:

  % make SIZE=120 OPT="-fast -ta=nvidia,time -Minfo=accel" build_acc4 run_acc4
...
Accelerator Kernel Timing data
./src/StaggeredLeapfrog2_acc4.F
  bench_staggeredleapfrog2
    369: region entered 100 times
        time(us): total=14214738 init=68 region=14214670
                  kernels=6970749 data=7243921
        w/o init: total=14214670 max=145236 min=140804 avg=142146
        409: kernel launched 100 times
            grid: [118]  block: [128]
            time(us): total=6970749 max=70011 min=69591 avg=69707
acc_init.c
  acc_init
    1: region entered 1 time
        time(us): init=1383911
24.07user 4.45system 0:28.72elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+2205915minor)pagefaults 0swaps

Were you able to find a better schedule than this one? If so, let us know! Is the schedule in "StaggeredLeapfrog2_acc4.F" optimal? We don't know the answer to that, but it's the best one we've found as of this writing. If you can find a schedule that eliminates all of the non-coalesced memory accesses, it will probably improve performance of the kernel even further.

9. What More Can be Done?

In the final example file "src/StaggeredLeapfrog2_acc5.F", we improve BenchADM further by expanding the accelerator region to include the entire function. Although the first three loops have low compute intensity and would not be good candidates for acceleration by themselves, most of the data transfer cost has already been paid, so it is beneficial to offload them to the GPU as well.

Try copying "src/StaggeredLeapfrog2_acc4.F" to "src/StaggeredLeapfrog2_acc5.F" and bring up the latter in your editor. Move the !$ACC REGION directive to encompass all four loops. Which arrays must be added to the COPYIN or COPYOUT clauses? Which arrays can be allocated locally using the LOCAL clause? Feel free to take a look at "src/STEPS/StaggeredLeapfrog2_acc5.F" for hints, or copy it up to "src".

Note that the LOCAL clause on several of the arrays that are only used within the accelerator region eliminates the need to copy them at all, but some arrays must be added as inputs or outputs for the first three loops in the routine. With these changes, the data transfer and aggregate kernel execution times increase slightly, but overall wall-clock time drops an additional 10 seconds.
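
Schematically, the expanded region directive has the same form as before with a LOCAL clause added. The array names below are placeholders rather than BenchADM's actual lists; see the STEPS version of the file for the real clauses:

!$ACC REGION
!$ACC& COPYIN  (input_array(1:nx,1:ny,1:nz))
!$ACC& COPYOUT (result_array(1:nx,1:ny,1:nz))
!$ACC& LOCAL   (scratch_array(1:nx,1:ny,1:nz))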

  % make SIZE=120 OPT="-fast -ta=nvidia,time -Minfo=accel" build_acc5 run_acc5 
   ...
Accelerator Kernel Timing data
./src/StaggeredLeapfrog2_acc5.F
  bench_staggeredleapfrog2
    303: region entered 100 times
        time(us): total=16556041 init=42 region=16555999
                  kernels=7566102 data=8989897
        w/o init: total=16555999 max=188030 min=165031 avg=165559
        348: kernel launched 100 times
            grid: [8x8]  block: [16x16]
            time(us): total=5815 max=68 min=57 avg=58
        368: kernel launched 100 times
            grid: [8x8]  block: [16x16]
            time(us): total=5329 max=57 min=52 avg=53
        389: kernel launched 100 times
            grid: [30x30]  block: [16x4x4]
            time(us): total=574262 max=5913 min=5503 avg=5742
        425: kernel launched 100 times
            grid: [118]  block: [128]
            time(us): total=6980696 max=70073 min=69664 avg=69806
acc_init.c
  acc_init
    1: region entered 1 time
        time(us): init=1408909
16.10user 2.30system 0:18.61elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+160673minor)pagefaults 0swaps

Feedback? Questions? Please post your comments on the PGI User's Forum.

SPEC®, SPECfp®, and SPECint® are registered trademarks of the Standard Performance Evaluation Corporation (SPEC).