Technical News from The Portland Group

Thread- and Instruction-Level Parallelism in CUDA Fortran

One key aspect of achieving good performance on GPUs is exposing enough parallelism to keep the hardware busy. Even if a kernel has been optimized so that all global memory accesses are perfectly coalesced, one still has to deal with the issue that such memory accesses have a latency of several hundred cycles. To achieve good overall performance, one has to ensure that there is enough parallelism on a multiprocessor so stalls for memory accesses are hidden as much as possible. There are two ways to achieve this parallelism: through the number of concurrent threads on a multiprocessor, and through the number of independent operations issued per thread. The first of these we call thread-level parallelism and the second we call instruction-level parallelism.

Thread-level Parallelism

Thread-level parallelism can be controlled to some degree by the execution configuration specified in the host code used to launch kernels. In the execution configuration one specifies the number of threads per block and the number of blocks in the kernel launch. The number of thread blocks that can reside on a multiprocessor for a given kernel is then an important consideration, and can be limited by a variety of factors.
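For reference, the execution configuration appears in the chevron syntax of a kernel launch in the host code. The following generic launch is purely illustrative; the names are placeholders, not code from this article:

   ! launch numBlocks thread blocks of threadsPerBlock threads each
   call someKernel<<<numBlocks, threadsPerBlock>>>(args)

Some of the factors that limit the number of resident thread blocks are given in the table below: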

                                       Tesla C870   Tesla C1060   Tesla C2050
Compute Capability                         1.0          1.3           2.0
Max threads per thread block               512          512          1024
Max thread blocks per multiprocessor         8            8             8
Max threads per multiprocessor             768         1024          1536
32-bit registers per multiprocessor       8192        16384         32768
Shared memory per multiprocessor (KB)       16           16            48

In addition to the hard limits on the number of threads and thread blocks given in the second through fourth rows, the number of resident thread blocks can be limited by resource utilization, such as the number of registers required per thread and the amount of shared memory used per thread block.

Occupancy is a metric used to help assess the thread-level parallelism of a kernel on a multiprocessor. It is the ratio of the actual number of active threads per multiprocessor to the maximum possible number of active threads (given in the table above). A higher occupancy does not necessarily lead to higher performance, as one can express a fair amount of instruction-level parallelism in kernel code. But if one relies on thread-level parallelism alone to hide latencies, then a low occupancy can hinder performance.
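As a concrete example using the C2050 limits in the table above, eight resident thread blocks of 64 threads give 8*64 = 512 active threads, for an occupancy of 512/1536, or one third.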

We examine the effects of thread block size on occupancy and bandwidth using the simple copy kernel below:

  attributes(global) subroutine copy(odata, idata)
    implicit none
    real(fp_kind) :: odata(*), idata(*)
    integer :: i
    
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    odata(i) = idata(i)
  end subroutine copy

The complete code is also available. The kind type parameter fp_kind used here results in double precision. We compile this code for the Fermi architecture (compute capability 2.0) using the CUDA 4.1 Toolkit as follows:

% pgfortran -Mcuda=cc20,cuda4.1 -O3 par_samp.cuf

In this code the kernel is launched using a variety of thread-block sizes, specified in the chevron launch syntax, and the effective bandwidth is measured using CUDA events. We obtain the occupancy of each kernel launch using the command-line profiler, which is enabled in a bash shell via:

% export COMPUTE_PROFILE=1

Executing the code then creates a cuda_profile_0.log file containing the profiler output. Each kernel call produces one line in this file, which by default includes the occupancy.
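For readers who want to see how the timing is done, the following is a minimal host-side sketch rather than the complete par_samp.cuf source; the module names precision_m and kernels_m, the array size, and the single fixed block size of 128 are illustrative assumptions.

  ! Minimal sketch: time one launch of the copy kernel with CUDA events.
  ! precision_m (defining fp_kind) and kernels_m (containing copy) are
  ! assumed modules; the complete code loops over several block sizes.
  program copy_timing
    use cudafor
    use precision_m
    use kernels_m
    implicit none
    integer, parameter :: N = 4*1024*1024
    real(fp_kind), device :: a_d(N), b_d(N)
    real(fp_kind) :: a(N)
    type(cudaEvent) :: startEvent, stopEvent
    real :: time
    integer :: istat, blockSize

    istat = cudaEventCreate(startEvent)
    istat = cudaEventCreate(stopEvent)

    a = 1.0_fp_kind
    a_d = a
    blockSize = 128

    istat = cudaEventRecord(startEvent, 0)
    call copy<<<N/blockSize, blockSize>>>(b_d, a_d)
    istat = cudaEventRecord(stopEvent, 0)
    istat = cudaEventSynchronize(stopEvent)
    istat = cudaEventElapsedTime(time, startEvent, stopEvent)

    ! N elements read and N written, 8 bytes each in double precision;
    ! the elapsed time is reported in milliseconds
    print *, 'Bandwidth (GB/s): ', 2.0*8.0*N/(time*1.0e6)

    istat = cudaEventDestroy(startEvent)
    istat = cudaEventDestroy(stopEvent)
  end program copy_timing

Running the complete code on a C2050 we obtain: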

Threads per Block    Occupancy    Bandwidth (GB/s)
        32             0.167             54
        64             0.333             82
       128             0.667            104
       256             1.0              102
       512             1.0              104
      1024             0.667             97

Before we're satisfied with these results we should mention that the copy kernel is very light on resource utilization: it uses no shared memory, and compiling with -Mcuda=ptxinfo indicates that only 10 registers per thread are used. For small thread blocks, the occupancy is limited only by the maximum of eight thread blocks per multiprocessor. We see this for thread blocks of 32, 64, and 128 threads, where eight such blocks result in 256, 512, and 1024 concurrent threads per multiprocessor and occupancies of one sixth, one third, and two thirds, respectively. Note that peak performance is already reached at a block size of 128, which has two-thirds occupancy; occupancy does not equate to performance, rather it is one aspect that can play a role in overall performance. With thread blocks of 256 and 512 threads, full occupancy is reached with six and three thread blocks per multiprocessor, respectively. We cannot achieve full occupancy with a thread block of 1024 threads, because only whole thread blocks can be resident and the limit of 1536 threads per multiprocessor on this architecture is not a multiple of 1024.

The next step is to look at kernels with higher resource utilization. One way we can do this without modifying the kernel code is to specify the dynamic shared memory used by each thread block as the third parameter of the execution configuration. This dynamically reserved shared memory need not be used by our example kernel code, but it is reserved per thread block nonetheless, and it stands in for real-world CUDA kernels that do use shared memory. If smBytes is the total amount of shared memory per multiprocessor, then launching our copy kernel with:

call copy<<<grid, threadBlock, 0.9*smBytes>>>(b_d, a_d)

results in only a single concurrent thread block per multiprocessor.
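One way the value of smBytes might be obtained at run time is by querying the device properties. The following host-side fragment is a sketch; the variable names are assumptions rather than code from the article.

    ! query the device to set smBytes (illustrative fragment)
    type(cudaDeviceProp) :: prop
    integer :: istat, smBytes

    istat = cudaGetDeviceProperties(prop, 0)
    ! on compute capability 2.0 the 48 KB of shared memory per multiprocessor
    ! is also the maximum available to a single thread block
    smBytes = prop%sharedMemPerBlock

Running the same thread-block configurations while specifying this dynamic shared memory, we can extend our results table: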

                       No Shared Memory         Shared Memory Load
Threads per Block     Occup.    BW (GB/s)      Occup.    BW (GB/s)
        32            0.167        54          0.021         7
        64            0.333        82          0.042        15
       128            0.667       104          0.083        28
       256            1.0         102          0.167        50
       512            1.0         104          0.333        75
      1024            0.667        97          0.667        97

With shared memory limiting each multiprocessor to a single thread block, the occupancy drops along with the bandwidth. For the smallest thread block of 32 threads, both occupancy and bandwidth are reduced by a factor of eight from the case without shared memory use. As the thread block size increases, the performance discrepancy between the two cases decreases.

In the above cases, where shared memory use restricts the number of thread blocks concurrently executing on a multiprocessor but register use is low, reasonably good performance can be obtained by choosing large thread blocks. This behavior reinforces a lesson beginning CUDA programmers learn (choose large thread blocks), but in the end it is the amount of parallelism on a multiprocessor that hides the latency of global memory loads and stores, and whether this parallelism comes from many threads in a single thread block or from fewer threads in many thread blocks is irrelevant.

This situation changes when a kernel has significant register utilization. In such cases one cannot pick an arbitrary thread block size, as a kernel will not launch if the number of registers required by a single block of threads exceeds the total register count on the multiprocessor. If, in the above example, shared memory usage limits a multiprocessor to one thread block and register usage limits a block to 32 threads, are we stuck with the poor performance indicated in the table above? The answer is no: we can improve performance by utilizing instruction-level parallelism.

Instruction-level Parallelism

When resource utilization in kernels limits the number of concurrent threads to the point where global memory latencies are not being hidden, one needs to augment the low thread-level parallelism with higher instruction-level parallelism to obtain good performance. We can rewrite the copy kernel above to add instruction-level parallelism as follows:

  attributes(global) subroutine copy_ILP(odata, idata)
    implicit none
    real(fp_kind) :: odata(*), idata(*)
    integer :: i,j
    
    i = (blockIdx%x-1)*blockDim%x*ILP + threadIdx%x
    do j = 1, ILP
       odata(i+(j-1)*blockDim%x) = idata(i+(j-1)*blockDim%x)
    enddo
  end subroutine copy_ILP

In this kernel, each thread copies ILP array elements, so a thread block of blockDim%x threads copies ILP*blockDim%x elements; this is why the factor ILP appears in the calculation of the index i.
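As a concrete illustration, reusing the hypothetical N, blockSize, and smBytes from the sketches above and assuming ILP is a module parameter, a launch of this kernel for the shared-memory-limited runs might look like:

    ! the grid shrinks by a factor of ILP since each thread now copies ILP
    ! elements; the third chevron argument reserves dynamic shared memory
    call copy_ILP<<<N/(ILP*blockSize), blockSize, int(0.9*smBytes)>>>(b_d, a_d)

We can extend our table with the results for this kernel with ILP=4 and compare them to the original kernel, which corresponds to ILP=1: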

                       No Shared Memory         Shared Memory Load (BW in GB/s)
Threads per Block     Occup.    BW (GB/s)      Occup.    ILP=1 BW    ILP=4 BW
        32            0.167        54          0.021         7          11
        64            0.333        82          0.042        15          21
       128            0.667       104          0.083        28          39
       256            1.0         102          0.167        50          63
       512            1.0         104          0.333        75          89
      1024            0.667        97          0.667        97          93

For the runs with instruction-level parallelism we once again specify enough dynamic shared memory in the execution configuration so that only a single thread block can reside on a multiprocessor at a time. While we do see some improvement in the ILP=4 case relative to the previous column, the results are far behind the case where no shared memory is used.

Load-Use Separation and Load Batching

We can see how to improve the performance further by looking at how loads and stores are performed in the following loop of kernel code:

    do j = 1, ILP
       odata(i+(j-1)*blockDim%x) = idata(i+(j-1)*blockDim%x)
    enddo

In CUDA, stores are "fire and forget", meaning that there is no blocking or waiting for a store to complete unless an explicit synchronization is given. A load command will also not block further independent execution, however the first use of the data requested by a load will block until that load completes. The term load-use separation is used to describe the amount of time (or the number of instructions) between when a load is issued and when the requested data is used. The larger the load-use separation, the better in terms of hiding load latencies. In the above case the load-use separation is small, the load issued for an element of idata is immediately used in the store of an element of odata. Even though the store is "fire and forget", the store will not issue until the load completes. As a result, only a single load will be in flight at any moment in time. We can improve this situation by using load batching, where we issue all the loads before issuing a store. Using load batching, the modified version of our copy kernel becomes:

  attributes(global) subroutine copy_ILP2(odata, idata)
    implicit none
    real(fp_kind) :: odata(*), idata(*), tmp(ILP)
    integer :: i,j
    
    i = (blockIdx%x-1)*blockDim%x*ILP + threadIdx%x
    do j = 1, ILP
       tmp(j) = idata(i+(j-1)*blockDim%x)
    enddo
    do j = 1, ILP
       odata(i+(j-1)*blockDim%x) = tmp(j)
    enddo
  end subroutine copy_ILP2

In the above section of code we define a thread-local array tmp, which resides in registers, and use this array to facilitate load batching. The body of the kernel is split into two loops: the first issues all the loads from global memory, and the second issues all the stores. In this kernel, all ILP loads issued by a thread are in flight before the first store in the second loop blocks waiting for its load to complete. The results for this load batching approach are given below:

                       No Shared Memory              Shared Memory Load (BW in GB/s)
Threads per Block     Occup.    BW (GB/s)      Occup.    ILP=1 BW    ILP=4 BW    ILP=4, Load Batched BW
        32            0.167        54          0.021         7          11               25
        64            0.333        82          0.042        15          21               45
       128            0.667       104          0.083        28          39               70
       256            1.0         102          0.167        50          63               97
       512            1.0         104          0.333        75          89              102
      1024            0.667        97          0.667        97          93               94

Here we observe far better performance when load batching is used relative to using instruction-level parallelism without load batching. We observe over 100 GB/s with one-third occupancy when four-way instruction-level parallelism is used.

We can increase the amount of instruction-level parallelism to improve performance further, with the caveat that the thread-local array tmp(ILP) resides in registers, so a larger ILP further increases register pressure. Compiling with -Mcuda=ptxinfo indicates that the kernels with ILP of four and eight use 19 and 35 registers per thread, compared to the 10 registers per thread used by the kernel without instruction-level parallelism. With 35 registers per thread, the ILP=8 case cannot be run with a 1024-thread block, because the 35,840 registers required by a single such block exceed the limit of 32,768 registers per multiprocessor on a C2050. This can be avoided by limiting register use to at most 32 per thread via the option -Mcuda=maxregcount:32, at the expense of register spilling, which adversely affects performance for all thread block sizes.
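For reference, the per-thread register counts quoted above can be checked, and optionally capped, with compile lines such as the following, which simply combine the sub-options mentioned so far:

% pgfortran -Mcuda=cc20,cuda4.1,ptxinfo -O3 par_samp.cuf
% pgfortran -Mcuda=cc20,cuda4.1,maxregcount:32 -O3 par_samp.cuf

We present the results for ILP=8, without any restriction on register use, for all but the 1024-thread block: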

                       No Shared Memory              Shared Memory Load (BW in GB/s)
Threads per Block     Occup.    BW (GB/s)      Occup.    ILP=1 BW    ILP=4 BW    ILP=4, Load Batched BW    ILP=8, Load Batched BW
        32            0.167        54          0.021         7          11               25                        32
        64            0.333        82          0.042        15          21               45                        57
       128            0.667       104          0.083        28          39               70                        86
       256            1.0         102          0.167        50          63               97                       103
       512            1.0         104          0.333        75          89              102                        95
      1024            0.667        97          0.667        97          93               94                        **

Here we observe over 100 GB/s with one-sixth occupancy when ILP=8. With a thread block of 32 threads, or 1/48 occupancy, we achieve a bit under one third of the highest observed bandwidth, roughly five times the performance of the kernel without any instruction-level parallelism.

Summary

In this article we discussed two basic techniques of exposing enough parallelism in CUDA kernels to keep the hardware busy and achieve good performance. The first technique, using thread-level parallelism, is the easiest because it is specified in the host code through the execution configuration and does not require any kernel modification. However, there are situations when resource consumption of a kernel is so large as to greatly restrict the number of concurrent threads on a multiprocessor. In these cases we need to look at instruction-level parallelism.

Use of instruction-level parallelism is not reserved only for cases where resource consumption dictates it; it can be used as a general optimization technique. Many of the highly optimized routines in the CUBLAS and CUFFT libraries run at low occupancy due to instruction-level parallelism. There is a trade-off with register pressure when implementing instruction-level parallelism: it typically allows one to run at lower occupancy, which increases the number of registers available per thread, but load batching requires temporary arrays that reside in registers and therefore increases register pressure. How much instruction-level parallelism yields optimal performance needs to be determined on a case-by-case basis.

For more information on instruction-level parallelism, see the slides for the GTC 2010 presentation Better Performance at Lower Occupancy.