Technical News from The Portland Group

Understanding the CUDA Data Parallel Threading Model
A Primer

General purpose parallel programming on GPUs is a relatively recent phenomenon. GPUs were originally hardware blocks optimized for a small set of graphics operations. As demand arose for more flexibility, GPUs became increasingly programmable. Early approaches to computing on GPUs cast computations into a graphics framework, allocating buffers (arrays) and writing shaders (kernel functions). Several research projects looked at designing languages to simplify this task; in late 2006, NVIDIA introduced its CUDA architecture and tools to make data parallel computing on a GPU more straightforward. Not surprisingly, the data parallel features of CUDA map pretty well to the data parallelism available on NVIDIA GPUs. Here, we'll describe the data parallelism model supported by CUDA and by the latest NVIDIA Kepler GPUs.

Why should PGI users want to understand the CUDA threading model? PGI CUDA Fortran users need to know enough to tune their kernels. Programmers using the directive-based PGI Accelerator compilers with OpenACC will also find it instructive, both to understand the compiler feedback (-Minfo messages) indicating which loops were scheduled to run in parallel or vector mode on the GPU, and to tune performance using the loop mapping clauses. So, let's start with an overview of the hardware in today's NVIDIA GPUs.

NVIDIA Kepler Block Diagram

GPU Hardware

A GPU is connected to a host through a high-speed I/O bus slot, typically PCI-Express in current high performance systems. The GPU has its own device memory, up to several gigabytes in current configurations. Data is usually transferred between the GPU and host memories using programmed DMA, which operates concurrently with both the host and GPU compute units, though there is some support for direct access to host memory from the GPU under certain restrictions. As a GPU is designed for stream or throughput computing, it does not depend on a deep cache memory hierarchy for memory performance. Instead, the device memory supports very high data bandwidth using a wide data path. On NVIDIA GPUs, it's 512 bits wide, allowing sixteen consecutive 32-bit words to be fetched from memory in a single cycle. The wide path also means that effective bandwidth degrades severely for strided accesses. A stride-two access, for instance, will fetch those 512 bits but use only half of them, suffering a 50% bandwidth penalty.
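
The cost of strided access can be estimated with a little arithmetic. The sketch below is a simplified model, not a measurement: it assumes every fetch moves a full 512 bits, ignores caches and alignment effects, and the name useful_fraction is ours, chosen for illustration.

```python
BUS_BITS = 512     # data path width stated in the text
WORD_BITS = 32     # word size stated in the text
WORDS_PER_FETCH = BUS_BITS // WORD_BITS   # sixteen words per fetch


def useful_fraction(stride):
    """Average fraction of each 512-bit fetch that is actually used when
    consecutive threads read 32-bit words `stride` words apart.

    Assumed model: for strides up to 16, one word in every `stride` is
    used; beyond that, each fetch delivers a single useful word.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return 1.0 / min(stride, WORDS_PER_FETCH)


for s in (1, 2, 4, 16, 32):
    print(f"stride {s:2d}: {useful_fraction(s):.4f} of the fetched bits are used")
```

Under these assumptions, stride 1 uses every fetched bit, stride 2 uses half (the 50% penalty described above), and any stride of sixteen or more uses only one word in sixteen.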

NVIDIA GPUs have a number of multiprocessors, each of which executes in parallel with the others. A Kepler multiprocessor has 12 groups of 16 stream processors, or 192 in all. We'll use the more common term core to refer to a stream processor. A high-end Kepler has 15 multiprocessors and 2880 cores. Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion; all cores in the same group execute the same instruction at the same time, much like classical SIMD processors. SIMT handles conditionals somewhat differently than SIMD, though the effect is much the same: some cores are disabled for conditional operations.

The code is actually executed in groups of 32 threads, what NVIDIA calls a warp. A Kepler multiprocessor takes two cycles to execute one instruction for an entire warp on each group of 16 cores, for integer and single-precision operations. At most 4 of the 12 groups of cores in a multiprocessor can be executing double-precision instructions concurrently. This means the double-precision peak speed is one-third of the single-precision peak. However, the other 8 groups can be executing either single-precision or integer instructions, so total throughput of double-precision programs can be higher than one-third of single-precision programs, depending on the percentage of double-precision floating-point instructions in a given loop. At most 2 of the 12 groups of cores can concurrently execute intrinsic functions and transcendentals, such as sine, tangent and exponential.
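
The figures quoted above follow from a few multiplications, and the mixed-instruction claim can be approximated with a simple bound. The sketch below assumes ideal scheduling, in which double-precision instructions are confined to 4 groups and every other instruction can use any free group. The function relative_throughput is a hypothetical model of ours, not an NVIDIA formula.

```python
from fractions import Fraction

# Figures taken from the text above.
GROUPS_PER_SM = 12       # groups of cores per multiprocessor
CORES_PER_GROUP = 16
MULTIPROCESSORS = 15     # high-end Kepler
WARP_SIZE = 32
DP_GROUPS = 4            # groups that can run double precision at once

CORES_PER_SM = GROUPS_PER_SM * CORES_PER_GROUP               # 192
TOTAL_CORES = MULTIPROCESSORS * CORES_PER_SM                 # 2880
CYCLES_PER_WARP_INSTRUCTION = WARP_SIZE // CORES_PER_GROUP   # 2


def relative_throughput(dp_fraction):
    """Instruction throughput relative to an all-single-precision program.

    Assumed model (ours): at most 12 groups are busy at a time, of which
    at most 4 may be executing double-precision instructions.
    """
    dp_fraction = Fraction(dp_fraction)
    if not 0 <= dp_fraction <= 1:
        raise ValueError("dp_fraction must be between 0 and 1")
    if dp_fraction == 0:
        return Fraction(1)
    # The double-precision share of the issue rate may not exceed 4 groups.
    rate = min(Fraction(GROUPS_PER_SM), Fraction(DP_GROUPS) / dp_fraction)
    return rate / GROUPS_PER_SM


print(CORES_PER_SM, TOTAL_CORES, CYCLES_PER_WARP_INSTRUCTION)  # 192 2880 2
print(relative_throughput(1))                # 1/3: all double precision
print(relative_throughput(Fraction(1, 2)))   # 2/3: half double precision
```

Under this model, a loop in which half the instructions are double precision runs at two-thirds of the single-precision rate rather than one-third, which is the effect described above.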

There is also a small software-managed data cache attached to each multiprocessor, shared among the cores; NVIDIA calls this the shared memory. This is a low-latency, high-bandwidth, indexable memory which runs close to register speeds. On Kepler, the shared memory is 64KB. It can be configured to a 48KB software-managed data cache with a 16KB hardware data cache, or 32KB SW cache with 32KB HW cache, or 16KB SW cache and 48KB HW cache.

When the threads in a warp issue a device memory operation, that instruction will take a very long time, perhaps hundreds of clock cycles, due to the long memory latency. Mainstream architectures include a two-level or three-level cache memory hierarchy to reduce the average memory latency. Kepler does include some hardware caches, but GPUs are mostly designed for stream or throughput computing, where cache memories are ineffective. Instead, GPUs tolerate memory latency by using a high degree of multithreading. Kepler supports up to 64 active warps on each multiprocessor. When one warp stalls on a memory operation, the multiprocessor control unit selects another ready warp and switches to that one. In this way, the cores can be productive as long as there is enough parallelism to keep them busy.
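
A rough feel for "enough parallelism" comes from a back-of-the-envelope model. The sketch below assumes, purely for illustration, that every warp alternates between a fixed number of compute cycles and one memory operation of fixed latency. The cycle counts in the example are hypothetical, and the function names are ours.

```python
import math

MAX_ACTIVE_WARPS = 64   # per Kepler multiprocessor, from the text


def warps_to_hide_latency(latency_cycles, compute_cycles):
    """Warps needed so that some warp is always ready to run.

    Assumed model: while one warp waits `latency_cycles` on memory, the
    other warps each contribute `compute_cycles` of useful work.
    """
    if latency_cycles < 0 or compute_cycles <= 0:
        raise ValueError("latency must be >= 0 and compute cycles > 0")
    return 1 + math.ceil(latency_cycles / compute_cycles)


def can_hide_latency(latency_cycles, compute_cycles):
    """True if the required warps fit within the 64-warp limit."""
    needed = warps_to_hide_latency(latency_cycles, compute_cycles)
    return needed <= MAX_ACTIVE_WARPS


# Hypothetical numbers: a 400-cycle memory latency.
print(warps_to_hide_latency(400, 20), can_hide_latency(400, 20))  # 21 True
print(warps_to_hide_latency(400, 4), can_hide_latency(400, 4))    # 101 False
```

In this model, a kernel that does little work between memory operations cannot keep the cores busy even with all 64 warps active, which is one reason the memory access pattern matters as much as the thread count.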

Programming

NVIDIA GPUs are programmed as a sequence of kernels. Typically, each kernel completes execution before the next kernel begins, with an implicit barrier synchronization between kernels. Kepler has support for multiple, independent kernels to execute simultaneously, but many kernels are large enough to fill the entire machine. As mentioned, the multiprocessors execute in parallel, asynchronously. However, GPUs do not support a fully coherent memory model that would allow the multiprocessors to synchronize with each other, so classical parallel programming techniques that rely on such synchronization can't be used here. Threads can spawn more threads on Kepler GPUs, so nested parallelism is supported. Even so, threads on one multiprocessor can't send results to threads on another multiprocessor; there's no facility for a critical section among all the threads of the whole system.

CUDA offers a data parallel programming model that is supported on NVIDIA GPUs. In this model, the host program launches a sequence of kernels, and those kernels can spawn sub-kernels.

Threads are grouped into blocks, and blocks are grouped into a grid. Each thread has a unique local index in its block, and each block has a unique index in the grid. Kernels can use these indices to compute array subscripts, for instance.
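
The usual way to turn these indices into an array subscript can be sketched by emulating a one-dimensional launch sequentially. This is only an illustration: a real GPU runs the threads concurrently and in no particular order, the helper names launch and vector_add are ours, and the indices are zero-based as in CUDA C (CUDA Fortran counts from 1).

```python
def global_index(block_index, block_size, thread_index):
    """Array subscript for one thread: blocks are laid end to end."""
    return block_index * block_size + thread_index


def launch(n, block_size, kernel):
    """Emulate a 1-D grid with enough blocks to cover n elements."""
    grid_size = -(-n // block_size)   # ceiling division
    for block_index in range(grid_size):
        for thread_index in range(block_size):
            kernel(block_index, block_size, thread_index)


def vector_add(a, b, block_size=4):
    """Add two equal-length lists, one element per emulated thread."""
    n = len(a)
    c = [0] * n

    def kernel(block_index, block_size, thread_index):
        i = global_index(block_index, block_size, thread_index)
        # The last block may extend past the end of the array, so each
        # thread checks its index before touching memory.
        if i < n:
            c[i] = a[i] + b[i]

    launch(n, block_size, kernel)
    return c


print(vector_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
# [11, 22, 33, 44, 55]
```

With five elements and a block size of four, the emulated grid has two blocks and eight threads; the bounds check keeps the last three threads idle.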

Threads in a single block will be executed on a single multiprocessor, sharing the software data cache, and can synchronize and share data with threads in the same block; a warp will always be a subset of threads from a single block. Threads in different blocks may be assigned to different multiprocessors concurrently, to the same multiprocessor concurrently (using multithreading), or may be assigned to the same or different multiprocessors at different times, depending on how the blocks are scheduled dynamically.

There is a hard upper limit on the size of a thread block, 1,024 threads or 32 warps for Kepler. All thread blocks in the whole grid will have the same size and shape. Thread blocks are always created in warp-sized units, so there is little point in trying to create a thread block whose size is not a multiple of 32 threads. A Kepler multiprocessor can have 2,048 threads simultaneously active, or 64 warps. These can come from 2 thread blocks of 32 warps, 3 thread blocks of 21 warps, 4 thread blocks of 16 warps, and so on up to 16 blocks of 4 warps; there is another hard upper limit of 16 thread blocks simultaneously active on a single multiprocessor.
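
These limits combine into a small calculation of how many blocks can be resident on one multiprocessor. The sketch below uses only the three Kepler limits quoted above; it deliberately ignores register and shared memory usage, which can reduce the count further on real hardware, and the function names are ours.

```python
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 1024   # hard limit on block size
MAX_WARPS_PER_SM = 64          # 2,048 threads per multiprocessor
MAX_BLOCKS_PER_SM = 16         # hard limit on resident blocks


def warps_per_block(threads_per_block):
    """Blocks are created in whole warps, so round up."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size must be between 1 and 1024 threads")
    return -(-threads_per_block // WARP_SIZE)   # ceiling division


def resident_blocks(threads_per_block):
    """Blocks that fit on one multiprocessor under the warp and block limits."""
    by_warps = MAX_WARPS_PER_SM // warps_per_block(threads_per_block)
    return min(MAX_BLOCKS_PER_SM, by_warps)


def occupancy(threads_per_block):
    """Fraction of the 64 warp slots that are filled."""
    active = resident_blocks(threads_per_block) * warps_per_block(threads_per_block)
    return active / MAX_WARPS_PER_SM


for threads in (1024, 672, 512, 128, 64):
    print(threads, resident_blocks(threads), occupancy(threads))
```

The first four block sizes reproduce the cases listed above (2, 3, 4 and 16 resident blocks). A block of 64 threads hits the 16-block limit first and fills only half of the warp slots, and a block of 100 threads still occupies 4 full warps, leaving 28 thread slots unused in each block.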

Performance tuning on NVIDIA GPUs requires optimizing for all of these architectural features:

  • Finding and exposing enough parallelism to populate all the multiprocessors.
  • Finding and exposing enough additional parallelism to allow multithreading to keep the cores busy.
  • Optimizing device memory accesses for contiguous data, essentially optimizing for stride-1 memory accesses.
  • Utilizing the software data cache to store intermediate results or to reorganize data that would otherwise require non-stride-1 device memory accesses.
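
The last point can be illustrated with a matrix transpose, where either the reads or the writes would naturally be strided. The sketch below emulates the idea sequentially: a small tile stands in for the software data cache, the tile edge of 4 is a hypothetical choice, and the function name transpose_tiled is ours. It shows the access pattern only and is not a GPU implementation.

```python
TILE = 4   # hypothetical tile edge, kept small for illustration


def transpose_tiled(src, rows, cols):
    """Transpose a row-major rows x cols matrix held in a flat list.

    Each tile is read row by row (stride-1 reads from src) into a small
    staging buffer, then written out so that each output row is also
    filled with stride-1 writes. The strided step happens only inside
    the staging buffer, which stands in for the fast on-chip memory.
    """
    dst = [None] * (rows * cols)
    for tile_r in range(0, rows, TILE):
        for tile_c in range(0, cols, TILE):
            # Stage 1: contiguous reads of each row segment in the tile.
            tile = [
                [src[r * cols + c] for c in range(tile_c, min(tile_c + TILE, cols))]
                for r in range(tile_r, min(tile_r + TILE, rows))
            ]
            # Stage 2: walk the tile by columns; consecutive values of i
            # land in consecutive positions of one output row.
            for j in range(len(tile[0])):
                out_row = tile_c + j
                for i in range(len(tile)):
                    out_col = tile_r + i
                    dst[out_row * rows + out_col] = tile[i][j]
    return dst


print(transpose_tiled([0, 1, 2, 3, 4, 5], rows=2, cols=3))
# [0, 3, 1, 4, 2, 5]
```

Each of the four points above involves this kind of trade-off between the most natural form of the code and the form the hardware prefers.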

This is the challenge for the CUDA programmer, and for the PGI Accelerator compilers. If you are a CUDA Fortran programmer, we hope this gives you a basic understanding that you can use to tune your kernels and launch configurations for efficient execution.

If you are an OpenACC programmer, this article should help you understand the compiler feedback messages and make efficient use of the loop mapping directives. The OpenACC directives are designed to allow you to write concise, efficient and portable x64+GPU programs; they reduce the cost of writing simple GPU programs and make it much less tedious to write large efficient programs. However, writing efficient programs still requires you to understand the target architecture and how your program is mapped onto the target. We hope this article is one step towards that understanding.