Technical News from The Portland Group
A New Direction for PGI Performance Profiling
This article introduces a new PGI performance tools capability, pgcollect -time. After a brief overview, we will survey the most common performance data collection techniques, then compare and contrast those techniques with pgcollect -time.
In the last few PGI software releases, new performance profiling features have been introduced as part of the implementation of a new performance tools strategy. These features include Compiler Feedback (CCFF) described in Part 1 of this article series and the new performance data collection method (pgcollect -time) described here.
PGI has provided the PGCOLLECT tool on the linux86-64 platform for some time, as a simple way to collect performance data using the OProfile tools. In Release 9.0, PGI redefined the role of this tool by implementing a new performance profile data collector that is invoked using PGCOLLECT. This collector supports the PGI performance tools strategy of making performance tuning (a) easier, and (b) the same on every platform, whether the operating system is Linux, Mac OS X, or Windows, and whether the processor is Intel or AMD.
To use the new data collector, you simply start your application with pgcollect -time. For example:
$ pgcollect -time myprog arg1 arg2
To display the results, you can use the command:
$ pgprof -exe myprog
Note that we are working to make performance tuning "easier", not "easy". Like any programming problem, some performance problems are easy to find and fix, while others take time, patience and expertise. Many of the performance tools available for HPC developers today are complicated to install, configure, use, and understand. PGI's intent with PGPROF and PGCOLLECT is to provide some useful and actionable information simply and easily.
Also note that the existing pgcollect functionality, providing an interface to OProfile on 64-bit Linux, continues to be supported. Profiling using the cycle counter is now accessed with the -hwtime option, since -time is used for the new sample collector.
Performance Profiling Methods
There are a variety of methods of collecting performance information from computer applications. Each has advantages and disadvantages, and in the HPC community this field has been a rich area of research for at least twenty years. In this section we'll take a look at some of the more commonly used methods, and later compare them to PGI's new collector.
Instrumentation means that "something" inserts timer calls at key points in your program and does the bookkeeping necessary to track execution time and execution counts for routines and source lines. The "something" that inserts the timer calls may be:
- the compiler, which inserts the calls at compile time
- a library, which inserts the calls by intercepting known function calls (e.g. MPI) at link time (static or dynamic link)
- an external tool which modifies the code of the program in memory at runtime
- the programmer, who inserts timer calls and print statements to report timings of various code segments
The TAU performance tool from the University of Oregon and the Paradyn tool from the University of Wisconsin (Madison) have both used instrumentation successfully. TAU is fully interoperable with the PGI compilers, and can even be used to profile PGI Accelerator Fortran and C99 programs that offload compute-intensive code segments to CUDA-enabled NVIDIA GPUs. PGI compilers insert instrumentation when invoked with -Mprof=func, -Mprof=time, and/or one of the -Mprof= MPI options (e.g. -Mprof=mpich2).
Advantages:
- Provides exact call counts and/or exact line/block execution counts.
- Reports time attributable only to the code in a routine (exclusive time).
- Reports time attributable to the code in a routine and all the routines it calls (inclusive time).
- Can be used for precise path coverage profiling.

Disadvantages:
- May require a recompile and/or relink.
- Instrumented runs may take a long time due to instrumentation overhead (time spent in performance instrumentation code).
- Performance results may be skewed by instrumentation overhead.
- Granularity of information may be low (to avoid overhead).
- Dynamic (runtime) instrumentation may be complex.
Bottom line: instrumentation will get you some performance data, but there may be a price in either accuracy or difficulty of use. PGI instrumentation options will likely be streamlined in a future release to support only those features which aren't readily available through PGCOLLECT.
Sample-based profiling uses statistical methods to determine the execution time and resource utilization of the routines, source lines, and assembly instructions of the program. It does this by periodically stopping the application and recording the program counter (instruction address). Sample-based profiling can be divided into two categories: time-based sampling and event-based sampling.
With time-based sampling, the program's current instruction address (program counter) is recorded at regular time intervals. Instruction addresses where a lot of execution time is spent are sampled many times. The profiler can map these addresses to the source lines and functions of your program, providing an easy way to navigate from the function where the most time is spent, to the line, to the assembly instruction.
A well-known profiler that uses this method on Linux is gprof. You can use gcc -pg and gprof to get a time-based profile. However, gprof is somewhat dated: it doesn't support multi-threaded programs, OpenMP, MPI, or shared objects. The PGI compilers also support -pg.
Advantages:
- easy to use
- reasonably accurate
- low overhead
- many Linux developers are already familiar with it

Disadvantages:
- no cross-platform solution
- some existing solutions only support single-threaded apps
Bottom line: time-based sampling is a good way to get basic timing information, but common methods haven't kept pace with advances like multicore and cluster computing.
In event-based sampling, the current instruction address is recorded not based on time intervals, but on the number of occurrences of a particular event, usually a processor event. Modern x86 processors contain several event counters that can be configured to monitor a wide variety of processor events like memory access, cache miss, etc. There is also usually a processor cycle counter, so event-based sampling can be used to monitor time. An example of event-based sampling is to record the instruction address every 10,000 L2 data cache misses.
Commonly used performance tools that use event-based sampling include OProfile, PAPI, Apple Shark, Intel VTune, and AMD CodeAnalyst.
Advantages:
- Provides a variety of precise, accurate information about the time and resource utilization of an application (a significant advantage).
- Multi-thread and shared object support.

Disadvantages:
- Some tools require special privileges (e.g. running as root).
- Some require special drivers or kernel patches.
- Processor-specific features make some tools platform-specific, even for the same processor vendor (e.g. Core 2 vs. Nehalem).
- Some tools lack the ability to correlate instruction addresses with source code.
- No single cross-platform solution.
Bottom line: you can get some very accurate information about what your application is doing, but someone will have to spend some time getting your system set up so you can do so. If you use multiple platforms (OS or processor) you will have to learn about new tools and/or hardware features on each platform to make progress.
PGI's New Data Collector
The performance data collector now supported by the PGCOLLECT tool uses time-based sampling. It uses operating system interfaces (timers, debugger interfaces, etc.) to do this, so it can be run without any special operating system software or system privileges. Here is a rundown of the advantages of using PGCOLLECT's new sample collector:
- no need to recompile or relink
- handles dynamic libraries/shared objects
- handles multi-threaded applications
- no need for special privileges (unlike OProfile)
- works the same on all supported processors
- works the same on all supported operating systems
- no need to install special drivers (as VTune and CodeAnalyst require) or kernel patches (as PAPI may require)
- easy to run in batch mode for apps that are started by scripts
- reasonably scalable, doesn't create huge trace files
- provides correlation between source code, performance data and information about how the compiler optimized the code (CCFF)
As with any technology, trade-offs were made to achieve PGCOLLECT's goals. To keep it simple to use and interoperable across multiple operating systems and processors, it lacks some of the specialized features found in platform-specific tools.
In addition, pgcollect -time may be sensitive to system load. This is true to varying degrees for all performance tools; to get valid results with any performance tool you should run performance experiments on a lightly loaded or unloaded system.
In PGI Release 9.0 pgcollect -time is available on Linux and Apple Mac OS X. Support for the Windows operating system is coming soon.
Let's take a look at a simple example. We will use the Himeno benchmark to show how PGCOLLECT and PGPROF can be used to easily examine multicore application performance. We will look at how the PGI compiler can auto-parallelize this program.
First we must build the application. No special build options are necessary to use PGCOLLECT, but if you compile with -Minfo=ccff you will have access to information and advice from the compiler, as described in Part 1 of this article series. We will build for auto-parallelization, run on a single core, then run on four cores. Note that the serial version of the application (the version not built for auto-parallelization) achieves roughly the same performance as the auto-parallelized version on a single core; for brevity we omit that demonstration here.
After building the program, we'll run it under the control of PGCOLLECT on a single core. In this example, the program prints some performance information of its own. PGCOLLECT itself prints only the final line of the transcript, which reads "target process has terminated, writing profile data"; everything else is output from the program.
$ pgf90 -fast -Mconcur -Minfo=ccff -o himeno himeno.F90
$ unset NCPUS
$ pgcollect ./himeno
 Select Grid-size:
 Grid-size=
            XS (64x32x32)
            S  (128x64x64)
            M  (256x128x128)
            L  (512x256x256)
            XL (1024x512x512)
 mimax= 513 mjmax= 257 mkmax= 257
 imax= 512 jmax= 256 kmax= 256
 Time measurement accuracy : .10000E-05
 Start rehearsal measurement process.
 Measure the performance in 3 times.
 MFLOPS: 901.4296 time(s): 3.723111000000000 8.5663196E-04
 Now, start the actual measurement process.
 The loop will be excuted in 48 times.
 This will take about one minute.
 Wait for a while.
 Loop executed for 48 times
 Gosa : 8.0512121E-04
 MFLOPS: 901.4086 time(s): 59.57115800000000
 Score based on Pentium III 600MHz : 10.88132
FORTRAN PAUSE: continuing...
target process has terminated, writing profile data
Next we bring up the PGPROF performance profiler:
$ pgprof -exe ./himeno
Note that 79% of the time is spent in routine "jacobi". This is the clear candidate for optimization, so we drill down by double-clicking on the routine name.
Here we see a nested, compute-intensive loop. The round blue buttons labeled 'i' to the left of the source code denote compiler feedback availability for those source lines. When we browse this information for the loops, we find that the loop at line 293 will run in parallel if the "trip count" is greater than or equal to 33. The loop runs from 2 to kmax-1. Looking back at the output from our serial run, we see that the value of "kmax" is 256, so the "trip count" will definitely be greater than 33 for this data set.
We set NCPUS to 4 to run on four cores, and run the program under the control of PGCOLLECT once again. We see a definite speed-up (again with the caveat that the "serial" numbers come from the auto-parallelized build run on one core; a purely serial build might do slightly better). Here are the performance results as printed by the Himeno application:
Serial:   MFLOPS: 901.4086  time(s): 59.57115800000000
Parallel: MFLOPS: 1603.714  time(s): 59.29370800000000
Note that the Himeno benchmark is designed to run for one minute regardless of the number of processors or cores used, so speed-up should only be judged by looking at the MFLOPS rating.
If we take a look at the performance using PGPROF, we see that the routine "jacobi" runs with almost perfect load-balancing across the four cores. The speed-up of 1.78 is less than the theoretical maximum for several reasons, including that the data set does not fit entirely in the processor's data cache and that part of the work is serial.
In summary, this example illustrates how PGCOLLECT and PGPROF can be combined into a simple but powerful method of analyzing the performance of your application. Without any special system software, system privileges, or particular expertise, we were able to obtain useful information about the program that we could act upon.
Planning for upcoming releases is not yet final, but some items high on PGI's list for next steps in the evolution of our performance profiling tools include:
- Accelerator support—provide information to assist in tuning applications that use the PGI Accelerator programming model or CUDA Fortran to execute code on accelerators or GPGPUs
- MPI support—using PGCOLLECT (MPI profiling is currently supported using instrumentation only)
- As PGCOLLECT sampling matures and sees wider use, we plan to reduce or eliminate support for some PGI performance data collection features that will have become redundant.
There are lots of other options we are considering—let us know if there are features you'd like added to the PGI performance profiling tools.