Technical News from The Portland Group

An Introduction to Debugging CUDA-x86 Applications

NVIDIA first introduced the C for CUDA programming language in 2007. CUDA was developed to enable offloading of compute-intensive code to GPUs. The recently released PGI CUDA C/C++ compiler for x86 lets developers build and optimize CUDA C/C++ applications to run on multi-core x86 systems with or without an NVIDIA GPU present. In particular, the PGI CUDA-x86 compilers support multi-core x86 processors as a CUDA compute device. Why does this feature matter? It gives CUDA developers a uniform many-core parallel programming model that, for applications where it's appropriate, is portable to virtually any multi-core x86 processor-based system. It also turns out to be a very good vehicle for debugging CUDA C/C++ programs, as we will see in the sections that follow.

To run CUDA-x86 applications you'll want a multi-core system, but a GPU is not required. The examples in this article were run on a dual-core Intel Core i7 laptop (four logical cores with hyper-threading enabled) running Windows 7 x64.

PGI Release 2011 version 11.6 or later is required to build and debug the examples shown in this article. The PGI CUDA-x86 compilers and tools are supported on Linux, Windows and MacOS.

Getting Started

Let's start our exploration of CUDA-x86 debugging with a trivial example—computing squares—so we can focus on how things look under debug without getting too bogged down with CUDA programming details right away. You can download the complete source code from the PGI website.

In our example the square_array routine is a global function, which means it runs on a CUDA device but is callable only from a host function:

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x + blockDim.x + threadIdx.x;
  if (idx<N) 
     a[idx] = a[idx]*a[idx];
}

In contrast, device functions (marked by __device__) also run on a CUDA device, and are callable from other device functions and global functions, but are not callable from host functions.
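
For illustration only (this helper is not part of the squares example), a __device__ function that square_array could call might look like this:

// Device function: callable from square_array or other device code, but not from the host
__device__ float square_val(float x)
{
  return x * x;
}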

CUDA global functions are called from host functions using CUDA C chevron syntax:

// Do calculation on device:
square_array <<< n_blocks, block_size >>> (a_d, N);

When invoking a global CUDA kernel, the first argument in the chevron syntax is the grid size and the second argument is the thread block size. For a complete description of the CUDA threading model, see the earlier PGInsider article on Understanding the CUDA Data Parallel Threading Model.
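
For context, the host code surrounding this launch looks roughly like the following. This is a condensed sketch rather than the exact downloadable source, and it assumes the square_array kernel shown above is defined in the same file:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
  const int N = 10;
  const size_t size = N * sizeof(float);
  float a_h[N], *a_d;

  for (int i = 0; i < N; i++) a_h[i] = (float)i;     // initial data 0..9

  cudaMalloc((void **)&a_d, size);
  printf("copying to device\n");
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

  int block_size = 4;
  int n_blocks = (N + block_size - 1) / block_size;  // 3 blocks of 4 threads for N = 10
  printf("launching kernel <<<%d,%d>>>\n", n_blocks, block_size);
  square_array <<< n_blocks, block_size >>> (a_d, N);

  printf("retrieving data\n");
  cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
  for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);

  cudaFree(a_d);
  return 0;
}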

Compiling and Running for CUDA-x86

The PGI C++ compiler uses the -Mcudax86 option to enable compilation of programs targeting multi-core x86 as a CUDA compute device. This option is required unless compiling a file with a .cu extension directly to an executable. We'll add -g to our compilation because we want to debug the executable.

PGI$ pgcpp -Mcudax86 -g squares.cu -o squares

By default, CUDA-x86 executables will use all available host cores to effect parallel execution of CUDA kernels.

When we execute our example, we get the following output:

PGI$ squares
copying to device
launching kernel <<<3,4>>>
retrieving data
0 0.000000
1 1.000000
2 2.000000
3 3.000000
4 16.000000
5 625.000000
6 1679616.000000
7 5764801.000000
8 4096.000000
9 81.000000

It becomes clear that we have a bug in our code.

Debugging CUDA-x86

The squares in our example are not being calculated correctly for most of the input values. Let's bring up the executable in the debugger and try to figure out what's going wrong:

PGI$ pgdbg squares

Because the computation occurs in the CUDA global routine, we'll start by setting a breakpoint in square_array and running the program:

[Screenshot: PGDBG stopped at a breakpoint in square_array]

It is important to note that when debugging CUDA-x86, breakpoints set in device code will not work properly if the number of cores on your machine is greater than the number of CUDA threads being run. Additionally, if the number of CUDA threads is not evenly divisible by the number of cores, some cores will not hit device code breakpoints on the last iteration.

OK, let's take a look at the call stack:

[Screenshot: the PGDBG call stack, showing two square_array frames]

You'll notice there are two calls to square_array listed in the call stack. The extra call layer is inserted during compilation, but this implementation detail can be safely ignored while debugging. Similarly, stepping into a global routine from host code requires stepping through this layer. To avoid this, simply set a breakpoint in the global routine and run. A future PGI release will likely hide this implementation layer altogether.

The Built-in Variables

In CUDA programming, the entire collection of CUDA threads that will participate during the execution of a kernel is called a grid. A grid is divided into blocks. These blocks are indexed in two dimensions, x and y. Each block is made up of threads. The threads in each thread block are uniquely identified by their thread index.

These constructs are described by several built-in variables of type dim3, defined as:

typedef struct {
    unsigned int x, y, z;
} dim3;

The dimensions of the grid are held in gridDim; the value of gridDim.z is always one. A block's position within the grid is given by blockIdx. Because indices are zero-based, blockIdx.z is always zero, and blockIdx.y is zero when the grid is one-dimensional. All of the blocks in a given grid have the same dimensions, which are held in blockDim. A thread's position within its block is given by threadIdx. Together, these variables uniquely identify each thread.
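
To make this concrete, here are the values these built-in variables take inside square_array for the <<<3,4>>> launch in our example:

// For the launch square_array<<<3,4>>>(a_d, N):
//   gridDim.x  = 3,  gridDim.y  = 1,  gridDim.z  = 1
//   blockDim.x = 4,  blockDim.y = 1,  blockDim.z = 1
//   blockIdx.x is 0, 1, or 2  (blockIdx.y and blockIdx.z are 0)
//   threadIdx.x is 0, 1, 2, or 3 within each block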

NVIDIA's CUDA Toolkit includes detailed documentation on CUDA programming. These documents are updated with each CUDA Toolkit release. The examples used in this article are based on the NVIDIA CUDA C Programming Guide v3.2.

Scheduling CUDA Threads

CUDA threads in a thread block can be executed in any order. At any given moment, the number of CUDA threads executing in a CUDA-x86 application is the same as the number of cores on the host. Use the built-in variables to see which CUDA threads are executing at any given point. When stopped in square_array, for example, examine the values of threadIdx for all four of our cores.

[Screenshot: threadIdx values for the four CUDA threads executing in square_array]

Back to the Bug

Let's step through where a[idx] is assigned the value of its square:

[Screenshot: stepping through the assignment to a[idx] in PGDBG]

Let's print the array a:

[Screenshot: printing array a after the first set of CUDA threads]

The array looks OK so far. We're running with four cores, so four CUDA threads will execute at once. Let's change our breakpoint placement and continue to this same spot for the next set of CUDA threads:

[Screenshot: stopped at the breakpoint for the next set of CUDA threads]

Let's print the array again:

[Screenshot: printing array a after the second set of CUDA threads]

Oops. Instead of filling out four new elements of array a, we've used (and overwritten) three of the elements that we filled in last time. We must be doing something wrong in the calculation of our array index variable.

We intended the index variable to be a unique number identifying each CUDA thread. To get this identifier, one uses blockIdx, blockDim, and threadIdx, but we've used them incorrectly.

Instead of adding together the blockIdx and blockDim:

  int idx = blockIdx.x + blockDim.x + threadIdx.x;

We should have multiplied them:

  int idx = blockIdx.x * blockDim.x + threadIdx.x;
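
With the corrected expression, each block of four threads covers its own range of array indices; with the buggy expression, the ranges overlapped, which is exactly the overwriting we observed in the debugger:

  // Corrected: idx = blockIdx.x * blockDim.x + threadIdx.x  (blockDim.x = 4)
  //   block 0: idx = 0*4 + {0,1,2,3} = 0..3
  //   block 1: idx = 1*4 + {0,1,2,3} = 4..7
  //   block 2: idx = 2*4 + {0,1,2,3} = 8..11  (10 and 11 are screened out by idx < N)
  // Buggy:     idx = blockIdx.x + blockDim.x + threadIdx.x
  //   block 0: idx = 4..7,  block 1: idx = 5..8,  block 2: idx = 6..9
  //   (overlapping ranges, and a[0] through a[3] are never touched)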

Let's rebuild and rerun to check our work.

PGI$ pgcpp -g -Mcudax86 squares_corrected.cu -o squares_corrected
PGI$ squares_corrected
copying to device
launching kernel <<<3,4>>>
retrieving data
0 0.000000
1 1.000000
2 4.000000
3 9.000000
4 16.000000
5 25.000000
6 36.000000
7 49.000000
8 64.000000
9 81.000000

The answers are as expected. You may wish to download the corrected source code as well.

PGI CUDA-x86 SDK Examples

The PGI CUDA-x86 compilers ship with a set of sample programs from NVIDIA's CUDA SDK. These examples are available on all supported platforms. A few modifications have been made to the original source code; for example, the PGI CUDA-x86 compilers do not support warp synchronous programming, so its use has been removed from the SDK examples that relied on it.

The location of the CUDA-x86 SDK varies by platform:

/usr/pgi/linux86/2011/cuda/cudaX86SDK
/usr/pgi/linux86-64/2011/cuda/cudaX86SDK

/usr/pgi/osx86/2011/cuda/cudaX86SDK
/usr/pgi/osx86-64/2011/cuda/cudaX86SDK

C:\Program Files\PGI\win32\2011\cuda\CUDA X86 SDK
C:\Program Files (x86)\PGI\win32\2011\cuda\CUDA X86 SDK
C:\Program Files\PGI\win64\2011\cuda\CUDA X86 SDK

To build and run all of the CUDA-x86 SDK examples, simply type 'make' in the top-level directory. To build an individual example, build the utility libraries in the top-level directory first:

PGI$ make -C common
PGI$ make -C shared

Then navigate to an example directory (located in the src directory) and type 'make' to build it there.

Shared Memory with CUDA Threads

A combination of shared memory and synchronization can be used to enable threads within a block to share data. The __shared__ type qualifier places a variable in memory shared by all threads of a block, and the __syncthreads() intrinsic function provides synchronization points.
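
As a simple illustration of this pattern (a sketch of our own, not code from the SDK; the kernel name, the 256-thread limit, and the power-of-two block size are assumptions), each block below accumulates a partial sum in shared memory, synchronizing so that no thread reads s_data before every thread in the block has written it:

// Each block sums its slice of 'in' into out[blockIdx.x] using shared memory.
__global__ void block_sum(const int *in, int *out, int n)
{
  __shared__ int s_data[256];              // assumes blockDim.x <= 256
  int tid = threadIdx.x;
  int idx = blockIdx.x * blockDim.x + tid;

  s_data[tid] = (idx < n) ? in[idx] : 0;
  __syncthreads();                         // all writes to s_data complete before any reads

  // Tree reduction within the block (assumes blockDim.x is a power of two)
  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride)
      s_data[tid] += s_data[tid + stride];
    __syncthreads();
  }
  if (tid == 0)
    out[blockIdx.x] = s_data[0];
}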

For this next example, we'll use the CUDA SDK histogram example to look at debugging a program using shared memory. Start by building the debug version of the histogram example:

PGI$ cd src/histogram
PGI$ make dbg=1

Load the histogram executable into pgdbg, set a breakpoint in histogram64Kernel() in histogram64.cu and run to it:

[Screenshot: PGDBG stopped at a breakpoint in histogram64Kernel]

After its declaration, the shared array is initialized to zero and then, before anything else is done with it, a call is made to the __syncthreads() intrinsic:

[Screenshot: the shared array initialization followed by the __syncthreads() call]

The __syncthreads() call serves to synchronize the CUDA threads; execution will not proceed until all threads in the block have reached this point.

Let's proceed through one iteration of the loop that is filling in s_Hist:

[Screenshot: stepping through one iteration of the loop that fills s_Hist]

Take a look at s_Hist—because of its size, only portions of it are shown here—and note that multiple CUDA threads update values on each iteration of the loop:

[Screenshot: portions of s_Hist as multiple CUDA threads update it]

After proceeding through the final __syncthreads() call in this global routine:

[Screenshot: stopped after the final __syncthreads() call in histogram64Kernel]

View the completely-filled-in s_Hist:

[Screenshot: the completely filled-in s_Hist]

Viewing shared memory while it's being accessed by CUDA threads is a good way to diagnose shared memory problems.

Future Directions

As discussed in this article, the PGI debugger can debug CUDA-x86 programs. PGDBG can help identify improper indexing using the built-in variables and misuse of shared memory, two common types of CUDA programming bugs. PGI development of CUDA-specific features in the debugger is just beginning, though. As CUDA-x86 compiler development moves from functional to performance releases, the feature set of the CUDA-x86 debugger will mature along with it. We are exploring a number of possible enhancements, including improved access to built-in variables and CUDA thread block visualization. Stay tuned.