Technical News from The Portland Group

Accessing PGI Compiler Performance Advice

Traditional methods of performance tuning can only get you so far. Using PGI compiler feedback with the PGPROF profiler can ease the task of improving application performance.

The Tuning Problem

The PGI compilers do a good job of providing "compile-and-go" optimization with a few basic options: -fast for commonly-used optimizations, -mp to enable OpenMP pragmas and directives, -tp for processor-specific optimizations, and perhaps -Mipa for inter-procedural analysis. Often that is enough to get the performance you want. Sometimes, however, the resulting performance just isn't quite what you need. Or performance may be good, but it is still worthwhile to try to squeeze some more speed out of your application.
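
For instance, a typical compile line combining some of these options might look like the following. The program name and the -tp target are illustrative; choose the -tp value that matches your processor (the compiler's -help output and the PGI User's Guide list the supported targets):

  $ pgf90 -fast -Mipa=fast -tp core2-64 -o myprogram myprogram.f90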

There are a number of old tried-and-true techniques and aphorisms for the diagnosis and elimination of performance bottlenecks. "Tune serial performance before tuning parallel performance." "Use a profiler to find where the time is spent." "If it is your application, you probably know where the time is spent." "Try some favorite compiler options on the compute-intensive parts of the code." And so on.

These approaches only get you so far. At some point you may be reduced to (a) examining source code and trying to find a way to make it faster, or (b) using trial and error to find better compiler options. In either scenario, you must try to deduce what the compiler will produce from your sources, and how that code will run on whatever processors you are using. Performance profiling often leaves users saying "I see where the time is spent in my program, but what do I do about it?"

Compiler Feedback

Starting in release 8.0, PGI provides a performance tuning feature that bridges the gap between the compiler and the profiler. During compilation, compilers discover and generate a significant amount of information about how the program can (or cannot) be optimized and why, how data is accessed, relationships between procedures, and much more. The PGI compilers can save this information in a form known as the Common Compiler Feedback Format (CCFF). PGI's performance profiler, PGPROF, can correlate this compiler feedback with application source code and associated performance profile data.

This means that you can look at your source code, see which lines of code are taking the most time or using the most resources, and then look at information provided by compiler analysis for insight into how the compiler optimized (or didn't optimize) those lines of code. You can look at when the program was built and what compiler options were used. In addition, optimization hints are sometimes associated with CCFF information, providing suggestions for things to try in addressing performance issues.

Using Compiler Feedback

You can view compiler feedback using the PGPROF profiler. In general, you must create a performance profile using one of the PGI profiling methods to do this. However, on Linux you can use the PGPROF -browse option to view CCFF information without a performance profile.
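
For example, a browse-only session on Linux might be started as shown below, assuming the executable was built with CCFF information; consult the PGPROF Profiler Guide for the exact argument form of -browse:

  $ pgprof -browse ./myprogram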

A Quick Overview of PGI Profiling Methods

Today, PGI provides a variety of ways of generating performance data. However, we are moving in a direction where a simple tool, PGCOLLECT, will be the primary mechanism used to create performance profiles. In release 9.0 it will be supported on Linux and Macintosh, with Windows support following soon. Using its most basic mode to create a time-based profile, you just run your program using PGCOLLECT:

  $ pgcollect myprogram

Then you can view the profile data using PGPROF:

  $ pgprof -exe myprogram

On 64-bit Linux, PGCOLLECT can also be used to generate hardware counter profiles using OProfile. In addition, PGI continues to support all the methods of creating performance profiles that we have provided in the past: -Mprof=func, -Mprof=lines, -Mprof=time, and -Mprof=hwcts. The -Mprof=func and -Mprof=lines options are supported on all platforms; -Mprof=time and -Mprof=hwcts are supported only on Linux. See the PGPROF Profiler Guide for more information about these options.
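
As an example, a traditional instrumented workflow might look like the following; the program name is illustrative, and this assumes the default profile data file name, pgprof.out:

  $ pgf90 -fast -Mprof=time -o myprogram myprogram.f90
  $ ./myprogram
  $ pgprof -exe myprogram pgprof.out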

Generating Compiler Feedback for Yourself

Compiling and linking with any of the PGI performance instrumentation options (-Mprof=func, -Mprof=lines, -Mprof=time, -Mprof=hwcts) automatically includes generation of CCFF information.

If you are generating profile data using PGCOLLECT, you can tell the compiler to generate CCFF information by using the -Minfo=ccff compiler option. This will increase the size of your application binary on disk, but will not affect its execution performance or increase its memory footprint.

Example

Let's look at a simple, contrived example. We take a basic Fortran matrix multiply routine and insert a call to a subroutine in the innermost loop (see the source code in the screenshot below).
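
The actual source appears in the screenshot; as a rough, illustrative sketch only (the array names and loop bounds are assumptions, and only the mymonitor name comes from the example), the kernel looks something like this:

  ! Illustrative sketch -- the real source appears in the screenshot below
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k) * b(k,j)
           call mymonitor(c(i,j))   ! subroutine call inside the innermost loop
        end do
     end do
  end do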

We then compile the program:

  $ pgf90 -fast -Minfo=ccff -o mm mm.f90

Run it using PGCOLLECT:

  $ pgcollect ./mm

And display the resulting profile using PGPROF:

  $ pgprof -exe ./mm

Screenshot: PGPROF showing compiler feedback (CCFF data)

Forty-seven percent of the time was spent in the inner loop (line 10). In the screenshot above, we have clicked the round blue button labeled (I) next to line 10. This opens the Compiler Feedback panel in the lower portion of the PGPROF display.

The Compiler Feedback panel provides several pieces of information:

  • The compute intensity of that line is 0.67. Compute intensity is roughly the ratio of computational operations to memory operations; lower values suggest the code may be memory-bound (that is, memory bandwidth may be the factor limiting performance). See the note following this list.
  • The loop was not vectorized because it contains a call. A suggestion is made to compile with inlining options or to modify the code so the loop doesn't contain a call. (Inlining options are described in the PGI User's Guide and in the compiler's -help output.)
  • The options used to compile the program. If you scrolled the Compiler Feedback panel down you would see more information about the object file.
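
As a rough, back-of-the-envelope illustration of where a value like 0.67 can come from (this is not the compiler's exact accounting): a matrix multiply inner-loop statement such as c(i,j) = c(i,j) + a(i,k)*b(k,j) performs two floating-point operations (a multiply and an add) against roughly three memory references (loads of c, a, and b), a ratio of about 2/3, or 0.67.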

There is a clear course of action for optimizing this code: move the call to mymonitor out of the loop and, if necessary, call it in its own loop after the matrix multiply is complete, as sketched below. An alternative might be to compile with -Minline, but since the call to mymonitor isn't critical to the computation, moving it out of the loop is the simpler fix.
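
A minimal sketch of the restructured version, using the same illustrative names as the sketch above, might look like this:

  ! Hot loop now contains no calls, so the compiler can vectorize it
  do j = 1, n
     do k = 1, n
        do i = 1, n
           c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
     end do
  end do
  ! Monitoring moved into its own loop after the multiply completes
  do j = 1, n
     do i = 1, n
        call mymonitor(c(i,j))
     end do
  end do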

This is just a simple example. CCFF can help not only with vectorization but with other optimizations including auto-parallelization, OpenMP, and even the upcoming accelerator support.

Summary

We have looked at the basics of how to use PGI compiler feedback when tuning your application's performance. PGI continues to work on improving and simplifying the tools we provide to assist with this.

We will look at using PGPROF and CCFF to optimize a real-world application in a future issue of this newsletter.

If you are a compiler or tool developer, or you are interested in more detail about how CCFF works, please see the CCFF web page.