Technical News from The Portland Group
Using PGPROF and the CUDA Visual Profiler
to Profile GPU Applications
The PGI 2010 compilers and tools support two GPU programming models: the directive-based PGI Accelerator model and the explicit CUDA Fortran model. The PGI performance profiler, PGPROF, has been enhanced to support profiling applications that use the PGI Accelerator model. At the time of this writing, profiling support for GPU execution with the CUDA Fortran model is limited to the NVIDIA CUDA Visual Profiler tool (a.k.a. cudaprof). The following two sections describe how to build, run, and profile programs built with each of these two GPU programming models.
Profiling PGI Accelerator Programs with PGPROF
With this model, be sure to compile with the -ta=nvidia compiler option. While no other compilation flags are needed, one option that can be useful is -Minfo=ccff. This option saves information about how the application was compiled in the Common Compiler Feedback Format (CCFF); PGPROF can display this information to help you analyze application performance.
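As an illustrative sketch (the source file name myprog.f90 is an assumption; the executable name myprog matches the pgprof command shown later), a build for profiling might look like:

```shell
# Compile with the PGI Accelerator model targeting NVIDIA GPUs.
# -Minfo=ccff embeds compiler feedback (CCFF) that PGPROF can display.
# myprog.f90 is a placeholder source file name.
pgfortran -ta=nvidia -Minfo=ccff myprog.f90 -o myprog
```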
To collect performance data from the application, PGI recommends the use of the pgcollect tool. This tool runs the application and enables sample-based profiling for the host code and profile data collection for the GPU code. The tool is invoked in the following manner:
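Assuming the executable is named myprog, as in the pgprof command below, a typical invocation is:

```shell
# Run the application under pgcollect: host code is sample-profiled,
# and GPU performance data is collected as the program runs.
# The executable name is a placeholder.
pgcollect ./myprog
```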
To display the results of the profile run, use the PGPROF GUI.
pgprof -exe myprog
The startup screen should look like the following:
On this initial screen, notice that there are two columns for accelerator times: Accelerator Region Time and Accelerator Kernel Time. Accelerator Region Time represents the total time spent in the accelerator region, including data transfer time. Accelerator Kernel Time represents just the time spent computing in the kernel on the GPU, excluding data transfer time. In this example, three functions executed on the GPU: compute2, compute1, and wsm52d. If we drill down into wsm52d, we see the following:
Selecting the Accelerator tab near the bottom of the window reveals information about one of the GPU kernels:
This shows that this GPU kernel was executed 10 times, and that the kernel compute time summed over these 10 invocations was 0.352597 seconds. The total time to transfer data to and from the GPU for this kernel was 0.149706 seconds.
Scrolling down to line 490 and highlighting it, we see the grid size and block size used for this GPU kernel.
For more information on tuning Accelerator programs, refer to Chapter 7, Using an Accelerator, of the PGI Compiler User's Guide. For more information on using the PGI profiler, refer to the PGPROF Profiler Guide.
Profiling CUDA Fortran Programs with the CUDA Visual Profiler
As with PGI Accelerator model programs, no special compiler flags are needed to profile a CUDA Fortran program. NVIDIA provides a graphical tool called the CUDA Visual Profiler that can be used to gather information about the GPU code. This tool does not show any profile data for code run on the host.
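As a minimal sketch (the file name myprog.cuf is an assumption), building a CUDA Fortran program for profiling is an ordinary compile; the .cuf suffix tells the PGI compiler to treat the source as CUDA Fortran:

```shell
# No special profiling flags are required; just build the executable
# and run it under the CUDA Visual Profiler.
# myprog.cuf is a placeholder source file name.
pgfortran myprog.cuf -o myprog
```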
When you first start the CUDA Visual Profiler, you will want to create a new project (File:New). This pops up a dialog box for naming the project and specifying its location. Next, create a session (Session:Session settings ...). This also pops up a dialog box where you can name the session and specify the executable, the working directory, any command-line arguments, and the maximum execution time allowed. To start execution, click the Start button at the bottom of this window. By default, the CUDA Visual Profiler runs the program five times and aggregates the data; because at most four profile counters can be sampled during an individual run, gathering all 20 profile counters requires five runs.
After completing the steps above, the display might look like the following:
A summary table of the profile data can be accessed from the toolbar or under the View menu. It might appear as follows:
This view shows a summary of the code that executed on the GPU. For this example, two compute kernels executed on the GPU: compute2_kernel and compute1_kernel. Each kernel was invoked 200 times, and the total compute time for these 200 calls was 4.38 seconds and 4.13 seconds, respectively. Also included in the summary table is information about data transfers between the Host and the Device. It shows that there were 800 calls to the routine that copies data from the Host to the Device (memcpyHtoD). Likewise, there were 400 calls to the routine that copies data from the Device to the Host (memcpyDtoH).
The CUDA Visual Profiler can also be used to profile PGI Accelerator programs running on GPUs. For more detailed information on using the CUDA Visual Profiler, refer to the documentation included with the CUDA SDK. In addition, you can find more information about using PGPROF, pgcollect, and the CUDA Visual Profiler in these other PGInsider articles:
Accessing PGI Compiler Performance Advice