Technical News from The Portland Group

Understanding the CUDA Data Parallel Threading Model
A Primer

General purpose parallel programming on GPUs is a relatively recent phenomenon. GPUs were originally hardware blocks optimized for a small set of graphics operations. As demand arose for more flexibility, GPUs became ever more programmable. Early approaches to computing on GPUs cast computations into a graphics framework, allocating buffers (arrays) and writing shaders (kernel functions). Several research projects looked at designing languages to simplify this task; in late 2006, NVIDIA introduced its CUDA architecture and tools to make data parallel computing on a GPU more straightforward. Not surprisingly, the data parallel features of CUDA map pretty well to the data parallelism available on NVIDIA GPUs. Here, we'll describe the data parallelism model supported by CUDA and the underlying GPU hardware.

Why should PGI users want to understand the CUDA threading model? PGI CUDA Fortran users clearly need to know enough to tune their kernels. Programmers using the directive-based PGI Accelerator programming model will also find it helpful for interpreting the compiler feedback (-Minfo messages) about which loops were run in parallel or vector mode on the GPU, and for tuning performance with the loop mapping clauses.

So let's start with an overview of the hardware in today's NVIDIA Tesla and Fermi GPUs.

[Figure: NVIDIA Tesla Block Diagram]

[Figure: NVIDIA Fermi Block Diagram]

GPU Hardware

A GPU is connected to a host through a high speed IO bus slot, typically PCI-Express in current high-performance systems. The GPU has its own device memory, up to several gigabytes in current configurations. Data is usually transferred between the GPU and host memories using programmed DMA, which can operate concurrently with both the host and GPU compute units, though there is some support for direct access to host memory from the GPU under certain restrictions. Because a GPU is designed for stream or throughput computing, it does not depend on a deep cache memory hierarchy for memory performance. The device memory supports very high data bandwidth using a wide data path; on NVIDIA GPUs it is 512 bits wide, allowing sixteen consecutive 32-bit words to be fetched from memory in a single cycle. The wide path also means there is severe bandwidth degradation for strided accesses.
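
As a minimal sketch of how such a transfer is typically programmed (using the CUDA C runtime API; the array name and size here are just placeholders), the host allocates device memory and copies data across the IO bus explicitly:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 1 << 20;                      /* 1M floats, about 4MB */
        size_t bytes = n * sizeof(float);

        float *h_a = (float *)malloc(bytes);  /* host array */
        float *d_a;                           /* device array */
        for (int i = 0; i < n; ++i) h_a[i] = (float)i;

        cudaMalloc((void **)&d_a, bytes);     /* allocate device memory */

        /* copy host data to device memory across the IO bus */
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

        /* ... launch kernels that operate on d_a here ... */

        /* copy results back to host memory */
        cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_a);
        free(h_a);
        return 0;
    }

For the concurrent, programmed-DMA transfers mentioned above, the host buffer would normally be page-locked (allocated with cudaMallocHost) and the copy issued asynchronously with cudaMemcpyAsync.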

NVIDIA GPUs have a number of multiprocessors, each of which executes in parallel with the others. On Tesla, each multiprocessor has a group of 8 stream processors; a Fermi multiprocessor has two groups of 16 stream processors. We'll use the more common term core to refer to a stream processor. The high-end Tesla accelerators have 30 multiprocessors, for a total of 240 cores; a high-end Fermi has 16 multiprocessors, for 512 cores. Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion; all cores in the same group execute the same instruction at the same time, much like classical SIMD processors. SIMT handles conditionals somewhat differently than SIMD, though the effect is much the same: some cores are disabled for conditional operations.

The code is actually executed in groups of 32 threads, what NVIDIA calls a warp. On a Tesla, the 8 cores in a group are quad-pumped to execute one instruction for an entire warp, 32 threads, in four clock cycles. Each Tesla core has integer and single-precision floating point functional units; a shared special function unit in each multiprocessor handles transcendentals and double-precision operations at 1/8 the compute bandwidth. A Fermi multiprocessor double-pumps each group of 16 cores to execute one instruction for each of two warps in two clock cycles, for integer or single-precision floating point. For double-precision instructions, a Fermi multiprocessor combines the two groups of cores to look like a single 16-core double-precision multiprocessor; this means the peak double-precision throughput is 1/2 of the single-precision throughput.
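
To see what SIMT execution means for conditionals, consider this small kernel sketch (the names are hypothetical): threads in the same warp take different branches, so the warp executes both paths in turn, with the cores whose threads did not take the current path disabled:

    __global__ void branchy(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        /* Lanes 0-15 of each warp take one path, lanes 16-31 the other;
           the warp serializes the two paths, disabling inactive cores. */
        if ((threadIdx.x & 31) < 16)
            x[i] = x[i] * 2.0f;
        else
            x[i] = x[i] + 1.0f;
    }

If the condition depended only on values that are uniform across a warp (blockIdx.x, for example), every warp would take a single path and no serialization would occur.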

There is also a small software-managed data cache attached to each multiprocessor, shared among the cores; NVIDIA calls this the shared memory. This is a low-latency, high-bandwidth, indexable memory which runs essentially at register speeds. On Tesla, the shared memory is 16KB. On Fermi, the shared memory is actually 64KB, and can be configured as a 48KB software-managed data cache with a 16KB hardware data cache, or the other way around (16KB SW, 48KB HW cache).
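
On Fermi, the split can be requested per kernel through the CUDA runtime; here is a small sketch, assuming a Fermi-capable CUDA toolkit (the kernel name is hypothetical):

    __global__ void my_kernel(float *x)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] = 2.0f * x[i];
    }

    void configure_cache(void)
    {
        /* Request the 48KB shared memory / 16KB L1 split for this kernel;
           on Tesla-class parts the shared memory is fixed at 16KB. */
        cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

        /* The other configuration (16KB shared, 48KB L1) would be:
           cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1); */
    }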

When the threads in a warp issue a device memory operation, that instruction will take a very long time, perhaps hundreds of clock cycles, due to the long memory latency. Mainstream architectures would add a cache memory hierarchy to reduce the latency, and Fermi does include some hardware caches, but mostly GPUs are designed for stream or throughput computing, where cache memories are ineffective. Instead, these GPUs tolerate memory latency by using a high degree of multithreading. A Tesla supports up to 32 active warps on each multiprocessor, and a Fermi supports up to 48. When one warp stalls on a memory operation, the multiprocessor selects another ready warp and switches to that one. In this way, the cores can be productive as long as there is enough parallelism to keep them busy.

Programming

GPUs are programmed as a sequence of kernels; typically, each kernel completes execution before the next kernel begins, with an implicit barrier synchronization between kernels. Fermi has some support for multiple, independent kernels to execute simultaneously, but most kernels are large enough to fill the entire machine. As mentioned, the multiprocessors execute in parallel, asynchronously. However, GPUs do not support a fully coherent memory model that allows the multiprocessors to synchronize with each other. Classical parallel programming techniques can't be used here. Threads can't spawn more threads; threads on one multiprocessor can't send results to threads on another multiprocessor; there's no facility for a critical section among all the threads across the whole system. Trying to use a Pthreads or OpenMP programming model will lead to pain, frustration, and failure.

CUDA offers a data parallel programming model that is supported on NVIDIA GPUs. In this model, the host program launches a sequence of kernels. A kernel is organized as a hierarchy of threads. Threads are grouped into blocks, and blocks are grouped into a grid. Each thread has a unique local index in its block, and each block has a unique index in the grid. Kernels can use these indices to compute array subscripts, for instance.
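
A minimal CUDA C sketch of that hierarchy (the array and kernel names are placeholders): each thread forms a global subscript from its block index and its local thread index, and the host launches a grid of blocks sized to cover the array:

    __global__ void scale(float *x, float a, int n)
    {
        /* blockIdx.x  : this block's index within the grid
           blockDim.x  : number of threads in each block
           threadIdx.x : this thread's local index within its block */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = a * x[i];
    }

    void launch_scale(float *d_x, float a, int n)
    {
        int threadsPerBlock = 256;   /* 8 warps per block */
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocksPerGrid, threadsPerBlock>>>(d_x, a, n);
    }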

Threads in a single block will be executed on a single multiprocessor, sharing the software data cache, and can synchronize and share data with threads in the same block; a warp will always be a subset of threads from a single block. Threads in different blocks may be assigned to different multiprocessors concurrently, to the same multiprocessor concurrently (using multithreading), or may be assigned to the same or different multiprocessors at different times, depending on how the blocks are scheduled dynamically.
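
The following sketch (hypothetical names) shows block-level sharing and synchronization: each block stages its tile of the array in the software data cache, waits at __syncthreads(), which synchronizes only the threads of that block, and then reads elements written by other threads in the same block:

    #define TILE 256   /* launch with exactly TILE threads per block */

    __global__ void reverse_within_block(float *x)
    {
        __shared__ float tile[TILE];          /* per-block software data cache */

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = x[i];             /* each thread loads one element */

        __syncthreads();                      /* wait for every thread in the block */

        /* read an element that another thread in this block stored */
        x[i] = tile[blockDim.x - 1 - threadIdx.x];
    }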

There is a hard limit on the size of a thread block: 512 threads or 16 warps for Tesla, 1024 threads or 32 warps for Fermi. Thread blocks are always created in warp units, so there is no point in creating a thread block whose size is not a multiple of 32 threads; all thread blocks in the whole grid have the same size and shape. A Tesla multiprocessor can have 1024 threads simultaneously active, or 32 warps. These can come from 2 thread blocks of 16 warps, 3 thread blocks of 10 warps (which leaves two warps unused), 4 thread blocks of 8 warps, and so on up to 8 blocks of 4 warps; there is another hard limit of 8 thread blocks simultaneously active on a single multiprocessor. As mentioned, Fermi can have 48 simultaneously active warps, equivalent to 1536 threads, from up to 8 thread blocks.
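
Here is a worked example of that arithmetic, using the Tesla limits quoted above (a sketch only; real occupancy also depends on register and shared memory usage):

    #include <stdio.h>

    int main(void)
    {
        const int warpSize       = 32;
        const int maxWarpsPerSM  = 32;   /* 1024 threads per Tesla multiprocessor */
        const int maxBlocksPerSM = 8;

        int threadsPerBlock = 256;                        /* 8 warps per block */
        int warpsPerBlock   = threadsPerBlock / warpSize;

        int blocksPerSM = maxWarpsPerSM / warpsPerBlock;  /* 32 / 8 = 4 blocks */
        if (blocksPerSM > maxBlocksPerSM)
            blocksPerSM = maxBlocksPerSM;

        printf("%d blocks of %d warps = %d active warps per multiprocessor\n",
               blocksPerSM, warpsPerBlock, blocksPerSM * warpsPerBlock);
        return 0;
    }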

Performance tuning on the GPU requires optimizing all these architectural features:

  • Finding and exposing enough parallelism to populate all the multiprocessors.
  • Finding and exposing enough additional parallelism to allow multithreading to keep the cores busy.
  • Optimizing device memory accesses for contiguous data, essentially optimizing for stride-1 memory accesses (see the sketch following this list).
  • Utilizing the software data cache to store intermediate results or to reorganize data that would otherwise require non-stride-1 device memory accesses.
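
To make the stride-1 point concrete, here is a sketch (hypothetical names, assuming a row-major 2-D array whose dimensions are covered exactly by the launch): in the first kernel, consecutive threads of a warp touch consecutive words; in the second, they are a full row apart.

    __global__ void copy_coalesced(float *dst, const float *src, int width)
    {
        int row = blockIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        /* consecutive threads access consecutive words: stride-1 */
        dst[row * width + col] = src[row * width + col];
    }

    __global__ void copy_strided(float *dst, const float *src, int width)
    {
        int col = blockIdx.y;
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        /* consecutive threads are 'width' words apart: strided, far lower bandwidth */
        dst[row * width + col] = src[row * width + col];
    }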

This is the challenge for the CUDA programmer, and for the PGI Accelerator compilers. If you are a CUDA Fortran programmer, we hope this gives you a basic understanding that you can use to tune your kernels and launch configurations for efficient execution.

If you program with the PGI Accelerator model, this article should help you understand the compiler feedback messages and make efficient use of the loop mapping directives. The PGI Accelerator directives are designed to let you write concise, efficient and portable X64+GPU programs. However, writing efficient programs still requires you to understand the target architecture and how your program is mapped onto it. We hope this article is one step towards that understanding.