October 2011
Debugging PGI CUDA Fortran with Allinea DDT
Introduction
GPUs have rapidly changed the landscape of software development for high performance computing (HPC), thanks to their favourable performance-to-power and performance-to-cost ratios. Many developers are working to incorporate the hardware into new projects or existing software. Unlike some processor innovations, this one comes with a development cost: GPUs can be a challenge to program, as the user must embrace hierarchical memory and fine-grained parallelism. But the performance benefit makes this worthwhile.
For Fortran users, PGI's CUDA Fortran represents a straightforward way to port CPU-intensive parts of Fortran applications to GPUs. Existing Fortran code can be converted by restructuring the computational kernels of an application to take into account the massively parallel GPU execution environment.
Necessarily, this involves some development. For as long as software development has existed, so has the software bug, and so has the need to eradicate it. Programming with CUDA Fortran is no different. In this article we will look at how to debug CUDA Fortran applications, using Allinea Software's Allinea DDT to do it.
About DDT
Allinea DDT is a graphical debugger for finding and fixing bugs on a variety of platforms including workstations, GPUs, clusters, and some of the world's largest supercomputers. DDT features not typically found in other debuggers include memory debugging, data visualization and support for parallel debugging of MPI and OpenMP applications.
DDT Debugging Parallel MPI Code
DDT was written with concurrency and parallelism in mind. The user interface makes debugging easy at any scale: with DDT, it's no more difficult to debug 1,000 threads than it is to debug one. Given the number of threads possible on a GPU, this is an important consideration.
For users intending to run on multiple processes—for example in a cluster environment—Allinea DDT can debug the whole application across all processes and kernels simultaneously. DDT supports debugging CUDA C and C++ running natively on the GPU (see References) as well. And as we'll show here, Allinea DDT can also debug CUDA Fortran.
Debugging CUDA Fortran
Developers targeting GPU accelerators typically have to work at a very low level of abstraction. They must understand and specify data usage and manually construct sequences of calls to manage all movement of data between the x64 host and GPU. In addition, they often need to restructure loop dimensions to exploit the available hardware properties. PGI CUDA Fortran makes things easier than CUDA C and C++, but as a developer you still need to know about the GPU execution model, the distinction between host (CPU) and device (GPU) memory, and on the GPU, the differences between shared memory and local memory.
For this article we'll be using an example from an earlier PGInsider article Introduction to CUDA Fortran (October 2009) which performs matrix-matrix multiplication on a GPU. That article describes the changes needed to use the GPU. They can be summarized as:
- Identify candidates for GPU usage: assess the CPU-intensive loops and whether they are suitably data parallel, or can be made so.
- Convert loop nests into parallel grids/blocks and separate them out into a kernel function.
- Launch the GPU kernel function using CUDA grids/blocks.
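As a sketch of those steps, here is a simple data-parallel loop restructured into a CUDA Fortran kernel and indexed over a grid of blocks. The names here (saxpy_kernel, x, y) are illustrative and not taken from the matrix-multiply example:

```fortran
module kernels_mod
contains
  ! The loop "y = a*x + y" restructured as a kernel:
  ! each GPU thread handles one loop iteration.
  attributes(global) subroutine saxpy_kernel(n, a, x, y)
    integer, value :: n
    real, value :: a
    real :: x(n), y(n)
    integer :: i
    ! Map the thread's block/grid position back to a loop index.
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) y(i) = a * x(i) + y(i)
  end subroutine saxpy_kernel
end module kernels_mod
```

The host then launches the kernel with the chevron syntax, specifying the grid and block dimensions, for example call saxpy_kernel<<<grid, tBlock>>>(n, a, xdev, ydev), where grid and tBlock are of type(dim3).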
CUDA Fortran is in many ways easier than CUDA C. For example, device-allocated arrays are specified by a simple "device" attribute and allocated in the same way as normal Fortran allocations. Data can be copied between the host and the GPU by a simple assignment. For example:
real :: A(M,N) ! A in host memory
real,device :: Adev(M,N) ! Adev in GPU memory
...
Adev = A ! Copy data from host to GPU
...
A = Adev ! Copy data from GPU to host
The Inevitable Bugs
Sometimes the bug will be in the GPU code itself, but it could just as easily be in the host code. This is why it's useful to use one debugger for the whole of the application. In addition to bugs that might be found in the existing code, adding CUDA can bring new complications. For example, synchronization is a new and previously unnecessary concern in loops. Where a GPU thread relies on another thread to provide values that it subsequently uses, synchronization is necessary. CUDA does not provide global synchronization across the whole grid, only synchronization within limited-size blocks of threads.
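To illustrate the block-level synchronization requirement, consider a shared-memory tile filled cooperatively by the threads of a block. This fragment is a hypothetical sketch, not the article's kernel; tx, ty, k, row, and col stand for thread-local indices:

```fortran
! Each thread in a 16x16 block writes one element of a shared tile,
! then reads elements written by *other* threads. Without the
! syncthreads barriers, a thread may read an element before the
! thread responsible for it has written it.
real, shared :: tile(16, 16)
tile(tx, ty) = Adev(row, col)   ! this thread's contribution
call syncthreads()              ! wait until every thread has written
temp = tile(tx, k)              ! now safe: written by another thread
call syncthreads()              ! wait again before the tile is reused
```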
Preparing for the debugger
Unfortunately it is not at present possible to debug inside CUDA Fortran kernels running on the device (unlike, for example, CUDA C), but there is still a way to debug CUDA Fortran.
The PGI CUDA Fortran compiler is capable of making kernels that run on the host processor instead of the GPU. PGI calls this "emulated" mode. In emulated mode, the code runs on as many cores as your CPU has, so you can get genuinely concurrent execution on a dual- or quad-core workstation or laptop. It also means you can debug CUDA Fortran in the same straightforward way you probably already debug your other code. To enable emulated mode, use the compiler flag "-Mcuda=emu"; to enable debugging support, use the "-g" flag.
Concurrent execution on multiple threads is a good way to introduce the non-determinism that can reproduce GPU kernel synchronization bugs (although often even a single thread is sufficient to expose synchronization bugs within a block).
pgfortran -g -Mcuda=emu matmat.cuf -o matmat
With this complete, we're now ready to debug.
Starting the Debugger
Running an application through Allinea DDT is quite straightforward. Download DDT from www.allinea.com; the download includes a 30-day evaluation licence. DDT runs on almost every version of Linux.
As a debugger, DDT is able to quickly help in situations like a simple crash. When a job crashes, this is what you see:
Figure 1. Typical Application Crash
(click to enlarge)
It's immediately clear where the crash happened, down to the exact line, and which processes were involved. The error message conveys exactly what the problem is, and the source code is highlighted to show where things happened. This will probably be familiar if you've used a debugger before, although in this particular example DDT is actually debugging over a thousand processes simultaneously.
Not all bugs are as simple as a crash. There could instead be incorrect results, say. In these cases it is helpful to step through an application and to watch progress unfold by manually controlling the application with the debugger—perhaps running first to a breakpoint set at a known good location.
What we're focusing on here are the more subtle problems you might come across in CUDA Fortran code. Let's run the example, which has been altered by the addition of a bug: we've removed the first syncthreads call from the main kernel. The code has some self-checks at the end that alert us to the problem.
C(9, 1) = 6.5912390E+10 should be 3.5796101E+11 error= 0.8158671
C(10, 1) = 9.8720858E+10 should be 3.5796596E+11 error= 0.7242172
262144 ERRORS FOUND!!!
After completing DDT's installation process, start DDT on the executable we built in the previous section:
ddt ./matmat
Click Run and your application will start running under the control of the debugger.
Exploring Arrays
The bug we have found is not a regular crash; it's an incorrect result. This can be an error in the data or in the program. DDT lets you look at data quite easily. We'll run the program to just after the data has been initialized; it's always worth checking for garbage in! Scroll to line 189 (or press Ctrl-G to jump directly), then right-click and select "Run to here". DDT runs the program to that line and then pauses it.
On the panel to the right of the source code, click on the "locals" tab. This brings up the list of local variables for the current location.
Figure 2. Overview at Breakpoint on Line 189
(click to enlarge)
Although you can see the scalar variables that define the dimensions of the arrays (such as n and m), here we're interested in looking at the arrays themselves. Scroll through the locals until you see "a" (see Figure 3). Right click and select "View array".
DDT displays a spreadsheet-like view of arrays, with the ability to search or filter for NaN, Inf, or outlying values. This is a handy way of detecting rogue data values. You can even use this for arrays distributed across multiple processes: it is simple to stitch together arrays distributed over a regular arrangement of processes, such as a 1, 2, or 3D process grid. DDT also has built-in export to the HDF5 and CSV formats.
The data looks quite sensible at this point, and you can repeat the exercise for the other arrays, B and C.
Figure 3. Checking Variable "A" Prior to Launching Kernel
(click to enlarge)
Set a further breakpoint at line 112 (double-click on the line) and press play. At this point the (emulated) device memory has been filled. You can check this for consistency too: the device arrays should be copies of the host arrays.
After all this, it looks like the input data is correct, so this means our problem is probably inside the kernel. Let's debug that.
Running the kernel
The best way to start looking at what's happening inside the kernel is to put a breakpoint in the kernel, run the program until it pauses at the breakpoint, and then step through the kernel, watching its progress.
Double-click to set a breakpoint on line 35 and run the program. Depending on your hardware, you may see one thread or several.
Figure 4. Stacks Showing for Multiple Threads
The "Stacks" panel at the bottom of DDT shows you where all the threads are. You will also see the source code is colour highlighted.
To understand what is happening inside the kernel, use the step feature to watch how execution progresses. Variables change colour as their value changes. When stepping through a kernel, you are looking at local variables and arrays from one emulated GPU thread.
Figure 5. Local Variables for a CUDA Thread
You will also see some other unfamiliar internal variables; ignore these for now. There are also special variables such as threadIdx. This is the CUDA thread index; it tells you which thread you are currently viewing. Similarly, blockIdx shows the current block. If you have multiple (CPU) threads running, you can switch between them by clicking on the thread selector displayed above the source code window, and then examine each thread's data.
Choosing a CUDA thread
We know from the self-check error that a number of data values are incorrect, so let's pick one and see how it's calculated. In emulation mode, kernel threads are executed by CPU threads, which use OpenMP tasks to iterate over all the kernel threads. At some point, one of the CPU threads will execute the kernel thread we are interested in; only then can that kernel thread be examined. Until then it doesn't exist, so you won't be able to see what it's doing.
The easiest way to stop at and follow a particular kernel thread is to set a conditional breakpoint at the start of the kernel. If, for example, we want to see how cell C(10,1) is computed, we could set a condition on the thread identifiers threadIdx and blockIdx. It is easier, and just as valid, to pause the program when i is 10 and j is 1.
So let's try it. Go to line 31, double-click to add a breakpoint, and then right-click to edit it. Enter a condition in the source language (here Fortran), such as i == 10 .and. j == 1. Now press play and continue. The program pauses at exactly the right thread.
As we now know, the calculation of Cij is incorrect. That means the initialization of Cij, or the input arrays used in the calculation, are probably at fault. We can see that Cij is initialized to zero, so let's take a look at the array Asub: a block of the original matrix that was copied into shared memory for efficiency.
Right click on Asub and select "View Array". Most of the data looks correct. Let's view it in 3D just in case.
Figure 6. 3D Display of Array Asub
(click to enlarge)
The problem is quite literally pointing right at us: this surface should not be spiked like this. The application's comments tell us that each thread reads a portion of the array, but it's clear we've reached a location where Asub should already be correct and yet it is not. No wonder the output is wrong!
It's easy to see what the fix should be. Re-enable the synchronization after Asub is copied into shared memory (line 45), recompile the program, and run it again. This time it works.
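For reference, the restored barrier sits between filling the shared tiles and consuming them. The fragment below is our reading of the tiled loop body from the earlier matrix-multiply article; variable names may differ slightly in your copy:

```fortran
Asub(tx, ty) = A(i, kb + ty - 1)   ! fill shared tiles of A and B
Bsub(tx, ty) = B(kb + tx - 1, j)
call syncthreads()                  ! the restored barrier: tiles are complete
do k = 1, 16
    Cij = Cij + Asub(tx, k) * Bsub(k, ty)
end do
call syncthreads()                  ! before the next iteration overwrites the tiles
```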
Summary
Using debuggers is among the most effective ways of fixing software problems. We've shown that applications written in CUDA Fortran can be debugged with Allinea DDT while retaining full access to the usual debugger features such as breakpoints and variable inspection.
References
PGI Accelerator Model
PGI CUDA Fortran
Debugging CUDA with Allinea DDT
Video of DDT with CUDA



