November 2010
Tuning Application Performance Using Hardware Event Counters in the PGPROF Profiler
Introduction
Application tuning can be challenging. Performance profilers can help identify which routines and loops in your program are taking the most time, but how do you determine whether performance can be improved, and if so, how?
This article describes a couple of ways to use the PGPROF profiler to answer those questions. We assume that the reader has a basic understanding of computer architecture, including what a data cache is. We will walk through the tuning process for an OpenMP program: finding where the time is spent, discovering why it is spent there, and finding a way to improve performance.
Preliminary Tuning Steps
Figure 1 shows a basic profile of our example, the SPEC OMP benchmark 314.mgrid. (Note: we describe how to generate such profiles later in this article.) The following command was used to compile the mgrid executable (see the PGI User's Guide for a description of the compiler options used):
% pgfortran -mp -Minfo=ccff -fast -Mipa=fast,inline -o mgrid mgrid.f
We ran this application on a Core i7 (Nehalem) system using eight OpenMP threads. The CPU_CLK_UNHALTED column shows the CPU cycles (i.e., time) spent in each routine in our program. 50% of the time is spent in a routine named resid, so that's a good place to start.
Figure 1: Time-based mgrid Profile
Since this program is using eight OpenMP threads, we should check that it is well balanced. In Figure 1 the panels labeled "Process/Thread Browser for application './mgrid'" and "Process/Thread Viewer for routine 'resid'" both show the per-thread times. It appears that in our example the work is evenly distributed among the threads.
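The parallel loops in mgrid have roughly the following shape (a minimal sketch with hypothetical arrays, not the actual source; compile with -mp). An OpenMP PARALLEL DO with a static schedule hands each thread an equal share of the outer iterations, which is consistent with the balanced per-thread times we see:

      PROGRAM OMPSKETCH
C     Hedged sketch of the kind of OpenMP loop nest found in mgrid
C     (hypothetical arrays U and R, not the actual source). The
C     PARALLEL DO directive with a static schedule divides the outer
C     K iterations evenly among the threads, which is why the
C     per-thread times in Figure 1 come out well balanced.
      INTEGER N
      PARAMETER (N=64)
      DOUBLE PRECISION U(N,N,N), R(N,N,N)
      INTEGER I, J, K
      DO K = 1, N
         DO J = 1, N
            DO I = 1, N
               U(I,J,K) = 1.0D0
            ENDDO
         ENDDO
      ENDDO
!$OMP PARALLEL DO PRIVATE(I,J) SCHEDULE(STATIC)
      DO K = 2, N-1
         DO J = 2, N-1
            DO I = 2, N-1
               R(I,J,K) = U(I,J,K) - 0.5D0*U(I-1,J,K)
            ENDDO
         ENDDO
      ENDDO
!$OMP END PARALLEL DO
      PRINT *, R(2,2,2)
      END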
We built the program using -Minfo=ccff. When you are tuning, you should use at least one of the following two options:
- -Minfo prints optimization info to stdout
- -Minfo=ccff embeds optimization info in the program for use by the profiler
In either case, when you bring up the profile, double-click on the routine that takes the most time to see the source code. In our example, this is resid. Next, navigate to the line (probably a loop) that takes the most time by clicking on the "hottest spot" button in the toolbar. Take a look at any info messages for that line, either in the captured -Minfo output or by clicking on the information icon to the left of the source line in the profiler.
Figure 2: PGPROF CCFF Output
Look for messages that describe optimizations that could not be made, like "loop was not vectorized", for clues about how you might restructure your code to help the compiler improve performance. If your longest-running loops are already vectorized, you may already be getting the best performance possible.
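For example, a loop whose iterations depend on one another typically cannot be vectorized, while a loop with independent iterations can. The following sketch (hypothetical arrays, not code from mgrid) shows both cases:

      PROGRAM VECSKETCH
C     Hedged sketch with hypothetical arrays X and Y. The first loop
C     carries a dependence: each iteration reads the result of the
C     previous one, so a compiler will typically report that the loop
C     was not vectorized. The second loop's iterations are
C     independent, so it is a candidate for vectorization.
      INTEGER N
      PARAMETER (N=1000)
      DOUBLE PRECISION X(N), Y(N)
      INTEGER I
      DO I = 1, N
         X(I) = 1.0D0
         Y(I) = 2.0D0
      ENDDO
C     Loop-carried dependence: X(I) depends on X(I-1).
      DO I = 2, N
         X(I) = X(I-1) + Y(I)
      ENDDO
C     Independent iterations: vectorizable.
      DO I = 1, N
         Y(I) = 2.0D0*X(I)
      ENDDO
      PRINT *, X(N), Y(N)
      END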
In our mgrid example, there is a lot of information about what the compiler did to optimize this loop, but it is all positive. The compiler did not complain about anything hindering optimization. So we must look elsewhere for more performance.
You can also experiment with different compiler options to see if those improve performance; see the optimization chapter in the PGI User's Guide for descriptions of the basic optimization options. In our case, we will look for more performance by examining how the program utilizes the level 2 data cache.
Tuning With Hardware Counters (e.g., Cache Misses)
In general, hardware counters are not useful as an absolute measurement. There is no way to look at a particular profile and say "there are too many cache misses in this program". There are too many variables; what might be a good cache miss rate for one program might be terrible for another. It depends on the algorithms and data structures used in the program.
Also, cache tuning can be very processor-specific. Code modifications that result in better performance on one processor may degrade performance on another processor.
Where hardware counters are useful is in measuring the effects of changes to the code. For caches to operate efficiently, access to data should be from contiguous (adjacent) memory locations. For a given loop, it may be possible to examine the source to determine if memory is accessed from contiguous locations, and whether the algorithm comes back to access the same memory locations repeatedly. Restructuring your code for efficient memory access can result in a big performance payoff.
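For example, Fortran stores arrays in column-major order, so an inner loop over the first subscript touches adjacent memory locations, while an inner loop over the second subscript strides through memory. The following sketch (hypothetical array A, not code from mgrid) shows both access patterns:

      PROGRAM STRIDESKETCH
C     Hedged sketch with a hypothetical array A. Fortran arrays are
C     column-major, so the first loop nest jumps N*8 bytes between
C     consecutive references and tends to miss in cache, while the
C     second touches adjacent memory locations and is cache-friendly.
      INTEGER N
      PARAMETER (N=2048)
      DOUBLE PRECISION A(N,N), S
      INTEGER I, J
      A = 1.0D0
      S = 0.0D0
C     Strided access: the inner loop varies the SECOND subscript.
      DO I = 1, N
         DO J = 1, N
            S = S + A(I,J)
         ENDDO
      ENDDO
C     Contiguous access: the inner loop varies the FIRST subscript.
      DO J = 1, N
         DO I = 1, N
            S = S + A(I,J)
         ENDDO
      ENDDO
      PRINT *, S
      END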
Another approach to using hardware counters derives measurements from combinations of hardware counter data. This approach requires defining your own performance events (see the pgcollect documentation in the PGI Tools Guide; it's not that difficult). AMD has published a paper on how to combine performance counter data to get more informative profiling results. It is a good reference regardless of whether you are using AMD or Intel processors, though you may need to change the counter names used in the examples depending on your processor.
Example Using Hardware Counters (e.g., Cache Misses)
To demonstrate the usefulness of the hardware counters, we'll continue to use the SPEC OMP benchmark, 314.mgrid. Use the performance data collection tool pgcollect with the -dcache option to gather hardware counter information on data cache misses. The following commands were used:
% export OMP_NUM_THREADS=8
% echo "/usr/bin/time ./mgrid <mgrid.in" > tmp
% chmod +x ./tmp
% pgcollect -dcache -exe ./mgrid ./tmp > mgrid.out
The total runtime was 195 seconds. Use the following command to view the profile from this run:
% pgprof -exe mgrid pgprof.out
Figure 3: Data Cache Profile
The data cache profile shows data for counters named LLC_MISSES and LLC_REFS. One way to find out what these counters are measuring is to use the pgcollect -list-events option. On our Nehalem system, these counters are defined as follows:
- LLC_MISSES: L2 cache demand requests from this core that missed the L2
- LLC_REFS: L2 cache demand requests from this core
A quick web search shows that cache demand requests are those cache requests that are not prefetch requests.
The LLC_MISSES for the function resid account for 57% of the total. Using the LLC_REFS counter, we can compute the cache miss rate as LLC_MISSES divided by LLC_REFS; for resid, this yields a cache miss rate of 93%. Decreasing the cache miss rate (i.e., increasing cache utilization) will make the code run faster.
The next step is to examine the resid function to determine the cause of the significant number of cache misses. Drilling down shows that the majority of the LLC_MISSES are on line 368, the innermost loop.
Figure 4: Source-level Data Cache Profile
It is not the intent of this article to explore why this loop nest has such a high cache miss rate, or the potential solutions for decreasing it; there are a number of good references available that discuss these issues. A good starting point is "Tiling Optimizations for 3D Scientific Computations" from the Proceedings of the 2000 ACM/IEEE Conference on Supercomputing.
The use of loop tiling can help to decrease the number of cache misses. The PGI compiler can perform loop tiling using the -Mvect=tile option. The recompiled executable was run on the same eight-core system, specifying eight OpenMP threads, using pgcollect with the -dcache option. The total runtime was 162 seconds, a 17% decrease in runtime.
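For illustration, here is a hand-tiled sketch of a stencil-style loop nest (hypothetical arrays, not the mgrid source; -Mvect=tile performs an equivalent transformation automatically):

      PROGRAM TILESKETCH
C     Hedged, hand-tiled sketch with hypothetical arrays U and R.
C     Blocking the J and K loops shrinks the working set so that the
C     neighboring columns and planes of U touched by the 7-point
C     stencil stay resident in cache across iterations, instead of
C     being evicted and refetched from main memory.
      INTEGER N, JB, KB
      PARAMETER (N=66, JB=16, KB=16)
      DOUBLE PRECISION U(N,N,N), R(N,N,N)
      INTEGER I, J, K, JJ, KK
      DO K = 1, N
         DO J = 1, N
            DO I = 1, N
               U(I,J,K) = 1.0D0
            ENDDO
         ENDDO
      ENDDO
C     Outer tile loops step by the block sizes; inner loops cover
C     one JB x KB slab at a time, streaming contiguously over I.
      DO KK = 2, N-1, KB
         DO JJ = 2, N-1, JB
            DO K = KK, MIN(KK+KB-1, N-1)
               DO J = JJ, MIN(JJ+JB-1, N-1)
                  DO I = 2, N-1
                     R(I,J,K) = U(I,J,K) - 0.125D0*
     &                  (U(I-1,J,K) + U(I+1,J,K) +
     &                   U(I,J-1,K) + U(I,J+1,K) +
     &                   U(I,J,K-1) + U(I,J,K+1))
                  ENDDO
               ENDDO
            ENDDO
         ENDDO
      ENDDO
      PRINT *, R(2,2,2)
      END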
The profile from this run follows:
Figure 5: Data Cache Profile After Tiling
PGPROF shows us that with the -Mvect=tile option, the number of LLC_MISSES for the resid function has been reduced by 54% and the overall runtime has decreased by 17%.
Further Reading
- Basic Performance Measurements for AMD Processors
- Intel Performance Analysis Guide for Core i7 and Intel Xeon 5500 Processors
- Intel Optimization Reference Manual
- Intel Software Developer's Manual, Vol. 3B, Chapter 30 and Appendix B
- AMD Software Optimization Guide
- AMD64 Programmer's Manual, Vol. 2, Chapter 13 and Appendix A
- PGI Tools Guide
- A New Direction for PGI Performance Profiling from the PGInsider
- Using PGPROF and the CUDA Visual Profiler to Profile GPU Applications from the PGInsider
- Accessing Compiler Performance Advice from the PGInsider