August 2012
Thread- and Instruction-Level Parallelism in CUDA Fortran
One key aspect of achieving good performance on GPUs is exposing enough parallelism to keep the hardware busy. Even if a kernel has been optimized so that all global memory accesses are perfectly coalesced, one still has to deal with the issue that such memory accesses have a latency of several hundred cycles. To achieve good overall performance, one has to ensure that there is enough parallelism on a multiprocessor so stalls for memory accesses are hidden as much as possible. There are two ways to achieve this parallelism: through the number of concurrent threads on a multiprocessor, and through the number of independent operations issued per thread. The first of these we call thread-level parallelism and the second we call instruction-level parallelism.
Thread-level Parallelism
Thread-level parallelism can be controlled to some degree by the execution configuration specified in the host code used to launch kernels. In the execution configuration one specifies the number of threads per block and the number of blocks in the kernel launch. The number of thread blocks that can reside on a multiprocessor for a given kernel is then an important consideration, and can be limited by a variety of factors, some of which are given below:
| | Tesla C870 | Tesla C1060 | Tesla C2050 |
| Compute Capability | 1.0 | 1.3 | 2.0 |
| Max threads per thread block | 512 | 512 | 1024 |
| Max thread blocks per multiprocessor | 8 | 8 | 8 |
| Max threads per multiprocessor | 768 | 1024 | 1536 |
| 32-bit registers per multiprocessor | 8192 | 16384 | 32768 |
| Shared memory per multiprocessor (KB) | 16 | 16 | 48 |
In addition to the hard limits on the number of threads and thread blocks given in the second through fourth rows, the number of resident thread blocks can be limited by resource utilization, such as the number of registers required per thread and the amount of shared memory used per thread block.
The metric occupancy is used to help assess the thread-level parallelism of a kernel on a multiprocessor. Occupancy is the ratio of the actual number of active threads per multiprocessor to the maximum number of possible active threads (given in the table above). A higher occupancy does not necessarily lead to higher performance, as one can express a fair amount of instruction-level parallelism in kernel code. But if one relies on thread-level parallelism alone to hide latencies, then a low occupancy can hinder performance.
We examine the effects of thread block size on occupancy and bandwidth using the simple copy kernel below:
attributes(global) subroutine copy(odata, idata)
  implicit none
  real(fp_kind) :: odata(*), idata(*)
  integer :: i
  ! global 1D index: each thread copies a single element
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  odata(i) = idata(i)
end subroutine copy
The complete code is also available. The kind type parameter fp_kind I'm using results in double precision. We compile this code for the Fermi architecture (compute capability 2.0) using the CUDA 4.1 Toolkit as follows:
% pgfortran -Mcuda=cc20,cuda4.1 -O3 par_samp.cuf
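For reference, here is a minimal sketch of what the host-side driver might look like. The module name, array size, warm-up launch, and output format below are illustrative assumptions rather than the actual downloadable source, but they show the pattern of launching the kernel over a range of thread-block sizes and timing each launch with CUDA events.

module copy_m
  integer, parameter :: fp_kind = kind(1.0d0)   ! double precision
contains
  attributes(global) subroutine copy(odata, idata)
    implicit none
    real(fp_kind) :: odata(*), idata(*)
    integer :: i
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    odata(i) = idata(i)
  end subroutine copy
end module copy_m

program parSample
  use cudafor
  use copy_m
  implicit none
  integer, parameter :: N = 4*1024*1024          ! assumed array size
  integer :: blockSizes(6) = (/32, 64, 128, 256, 512, 1024/)
  real(fp_kind), device, allocatable :: a_d(:), b_d(:)
  type(cudaEvent) :: startEvent, stopEvent
  real :: time
  integer :: istat, i, tPB

  allocate(a_d(N), b_d(N))
  a_d = 1.0_fp_kind

  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)

  ! warm-up launch so the first timed run does not include start-up overhead
  call copy<<<N/128, 128>>>(b_d, a_d)
  istat = cudaDeviceSynchronize()

  do i = 1, size(blockSizes)
     tPB = blockSizes(i)
     istat = cudaEventRecord(startEvent, 0)
     call copy<<<N/tPB, tPB>>>(b_d, a_d)
     istat = cudaEventRecord(stopEvent, 0)
     istat = cudaEventSynchronize(stopEvent)
     istat = cudaEventElapsedTime(time, startEvent, stopEvent)
     ! effective bandwidth: N elements read plus N written, 8 bytes each,
     ! divided by the elapsed time (reported in ms)
     print '(i6, f8.1, a)', tPB, 2.0*N*8/(time*1.0e6), ' GB/s'
  enddo

  istat = cudaEventDestroy(startEvent)
  istat = cudaEventDestroy(stopEvent)
  deallocate(a_d, b_d)
end program parSample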
In this code the kernel is launched using a variety of thread-block sizes, specified in the chevron launch syntax, and the effective bandwidth is measured using CUDA events. We obtain the occupancy of each kernel launch using the command-line profiler, which is enabled in a bash shell via:
% export COMPUTE_PROFILE=1
Executing the code then creates a cuda_profile_0.log file containing the profiler output. Each kernel call results in a line in this file, which by default includes the occupancy. Running this code on a C2050 we obtain:
| Threads per Block | Occupancy | Bandwidth (GB/s) |
| 32 | 0.167 | 54 |
| 64 | 0.333 | 82 |
| 128 | 0.667 | 104 |
| 256 | 1.0 | 102 |
| 512 | 1.0 | 104 |
| 1024 | 0.667 | 97 |
Before we analyze these results, we should mention that the copy kernel is very light on resource utilization. The kernel uses no shared memory, and compiling with ‑Mcuda=ptxinfo indicates that only 10 registers per thread are used. For small thread blocks, the occupancy is therefore limited only by the maximum of eight thread blocks per multiprocessor. We see this for thread blocks of 32, 64, and 128 threads, where eight such blocks result in 256, 512, and 1024 concurrent threads per multiprocessor and occupancies of one sixth, one third, and two thirds, respectively. Note that the peak bandwidth is already reached at a block size of 128, which has two-thirds occupancy: occupancy does not equate to performance; rather, it is one factor that contributes to overall performance. With thread blocks of 256 and 512 threads, full occupancy is reached with six and three thread blocks per multiprocessor, respectively. We cannot achieve full occupancy with a thread block of 1024 threads because partial thread blocks cannot be launched, and the 1536-thread limit per multiprocessor on this architecture accommodates only one such block.
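The occupancy figures in the table above can be reproduced with a few lines of arithmetic. The short program below is a sketch rather than part of the code accompanying this article: it hard-codes the C2050 limits and models only the blocks-per-multiprocessor and threads-per-multiprocessor limits, which suffices here because register and shared memory usage are negligible.

program occupancy_sketch
  implicit none
  integer, parameter :: maxThreadsPerSM = 1536   ! C2050 limit
  integer, parameter :: maxBlocksPerSM  = 8      ! C2050 limit
  integer :: blockSizes(6) = (/32, 64, 128, 256, 512, 1024/)
  integer :: i, residentBlocks, residentThreads
  do i = 1, size(blockSizes)
     ! whole blocks that fit, capped at eight per multiprocessor
     residentBlocks  = min(maxBlocksPerSM, maxThreadsPerSM/blockSizes(i))
     residentThreads = residentBlocks*blockSizes(i)
     print '(i6, f8.3)', blockSizes(i), real(residentThreads)/maxThreadsPerSM
  enddo
end program occupancy_sketch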
The next step is to look at kernels with higher resource utilization. One way we can do this without modifying the kernel code is to specify dynamic shared memory for each thread block as the third parameter of the execution configuration. This dynamically reserved shared memory need not be used in our example kernel code, but it is reserved per thread block nonetheless, and it mimics real-world CUDA kernels in which shared memory is used. If smBytes is the total amount of shared memory per multiprocessor (a sketch of how this value might be queried at run time appears after the table below), then launching our copy kernel with:
call copy<<<grid, threadBlock, 0.9*smBytes>>>(b_d, a_d)
results in only a single concurrent thread block per multiprocessor. By running the same thread block configuration while specifying dynamic shared memory we can extend our results table:
| Threads per Block | No Shared Memory: Occupancy | No Shared Memory: BW (GB/s) | Shared Memory Load: Occupancy | Shared Memory Load: BW (GB/s) |
| 32 | 0.167 | 54 | 0.021 | 7 |
| 64 | 0.333 | 82 | 0.042 | 15 |
| 128 | 0.667 | 104 | 0.083 | 28 |
| 256 | 1.0 | 102 | 0.167 | 50 |
| 512 | 1.0 | 104 | 0.333 | 75 |
| 1024 | 0.667 | 97 | 0.667 | 97 |
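The smBytes value used in the launch above can be obtained at run time rather than hard-coded. The sketch below is an assumption on our part, not part of the code accompanying this article: cudaDeviceProp reports only a per-thread-block shared memory limit, but on Fermi that limit (48 KB on the C2050) equals the per-multiprocessor capacity, so it can stand in for smBytes.

program query_shared
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat, smBytes
  istat = cudaGetDeviceProperties(prop, 0)
  ! on Fermi the per-block limit equals the per-multiprocessor capacity
  smBytes = prop%sharedMemPerBlock
  write(*,*) 'Shared memory per multiprocessor (bytes):', smBytes
end program query_shared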
With shared memory limiting each multiprocessor to a single thread block, the occupancy drops along with the bandwidth. For the smallest thread block of 32 threads, both occupancy and bandwidth are reduced by a factor of eight from the case without shared memory use. As the thread block size increases, the performance discrepancy between the two cases decreases.
In the above cases, where shared memory use restricts the number of thread blocks concurrently executing on a multiprocessor but register use is low, reasonably good performance can be obtained by choosing large thread blocks. This reinforces a lesson beginning CUDA programmers learn: in the end it is the total amount of parallelism on a multiprocessor that hides the latency of global memory loads and stores, and whether that parallelism comes from many threads in a single thread block or from fewer threads in each of many thread blocks is irrelevant.
This situation changes when a kernel makes significant use of registers. In such cases one cannot pick an arbitrary thread block size, as a kernel will not launch if the number of registers required by a single block of threads exceeds the total register count on the multiprocessor. If, in the above example, shared memory usage limits a multiprocessor to one thread block and register usage limits a block to 32 threads, are we stuck with the poor performance indicated in the table above? The answer is no: we can improve performance by utilizing instruction-level parallelism.
Instruction-level Parallelism
When resource utilization in kernels limits the number of concurrent threads to the point where global memory latencies are not being hidden, for good performance one needs to augment low thread-level parallelism with higher instruction-level parallelism. We can rewrite the copy kernel above to add instruction-level parallelism as follows:
attributes(global) subroutine copy_ILP(odata, idata)
  implicit none
  real(fp_kind) :: odata(*), idata(*)
  integer :: i, j
  ! each thread copies ILP elements, separated by a stride of blockDim%x
  i = (blockIdx%x-1)*blockDim%x*ILP + threadIdx%x
  do j = 1, ILP
     odata(i+(j-1)*blockDim%x) = idata(i+(j-1)*blockDim%x)
  enddo
end subroutine copy_ILP
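Because each thread now handles ILP elements, the number of thread blocks in the launch must shrink by the same factor. Continuing the hypothetical driver sketched earlier (N, tPB, a_d, b_d, and smBytes are the assumed names from that sketch, and ILP is assumed to be a module-level parameter such as integer, parameter :: ILP = 4), a matching launch might look like:

! each block now covers ILP*tPB elements, so the grid has N/(ILP*tPB) blocks;
! the third argument again reserves most of the shared memory so that only
! one thread block can be resident per multiprocessor
call copy_ILP<<<N/(ILP*tPB), tPB, int(0.9*smBytes)>>>(b_d, a_d)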
In this kernel, each thread copies ILP array elements, so a thread block of blockDim%x threads copies ILP*blockDim%x elements; it is for this reason that this factor appears in the calculation of the index i. We can extend our table with the results for this kernel with ILP=4 and compare them to the original kernel, which corresponds to ILP=1:
| Threads per Block | No Shared Memory: Occupancy | No Shared Memory: BW (GB/s) | Shared Memory Load: Occupancy | Shared Memory Load: ILP=1 BW (GB/s) | Shared Memory Load: ILP=4 BW (GB/s) |
| 32 | 0.167 | 54 | 0.021 | 7 | 11 |
| 64 | 0.333 | 82 | 0.042 | 15 | 21 |
| 128 | 0.667 | 104 | 0.083 | 28 | 39 |
| 256 | 1.0 | 102 | 0.167 | 50 | 63 |
| 512 | 1.0 | 104 | 0.333 | 75 | 89 |
| 1024 | 0.667 | 97 | 0.667 | 97 | 93 |
For the runs with instruction-level parallelism we once again specify enough dynamic shared memory in the execution configuration so that only a single thread block can reside on a multiprocessor at a time. While we do see some improvement in the ILP=4 case relative to the previous column, the results are far behind the case where no shared memory is used.
Load-Use Separation and Load Batching
We can see how to improve the performance further by looking at how loads and stores are performed in the following loop of kernel code:
do j = 1, ILP
odata(i+(j-1)*blockDim%x) = idata(i+(j-1)*blockDim%x)
enddo
In CUDA, stores are "fire and forget", meaning that the issuing thread does not block or wait for a store to complete unless an explicit synchronization is given. A load will also not block further independent execution; however, the first use of the data requested by a load will block until that load completes. The term load-use separation describes the amount of time (or the number of instructions) between when a load is issued and when the requested data is used; the larger the load-use separation, the better the load latency is hidden. In the loop above the load-use separation is small: the load issued for an element of idata is immediately used in the store of an element of odata. Even though the store is "fire and forget", it cannot issue until the load completes, so only a single load per thread is in flight at any moment. We can improve this situation by using load batching, where we issue all the loads before issuing any store. Using load batching, the modified version of our copy kernel becomes:
attributes(global) subroutine copy_ILP2(odata, idata)
  implicit none
  real(fp_kind) :: odata(*), idata(*)
  real(fp_kind) :: tmp(ILP)      ! thread-local array, held in registers
  integer :: i, j
  i = (blockIdx%x-1)*blockDim%x*ILP + threadIdx%x
  ! batch the loads: issue all ILP loads before any of the data is used
  do j = 1, ILP
     tmp(j) = idata(i+(j-1)*blockDim%x)
  enddo
  ! then issue all the stores
  do j = 1, ILP
     odata(i+(j-1)*blockDim%x) = tmp(j)
  enddo
end subroutine copy_ILP2
In the above code we define a thread-local array tmp, which resides in register memory, and use it to facilitate load batching. The body of the kernel is split into two loops: the first issues all the loads from global memory, and the second issues all the stores. In this kernel, all ILP loads issued by a thread are in flight before the first store in the second loop blocks waiting for its load to complete. The results for this load-batching approach are given below:
| Threads per Block | No Shared Memory: Occupancy | No Shared Memory: BW (GB/s) | Shared Memory Load: Occupancy | Shared Memory Load: ILP=1 BW (GB/s) | Shared Memory Load: ILP=4 BW (GB/s) | Shared Memory Load: ILP=4, Load Batched BW (GB/s) |
| 32 | 0.167 | 54 | 0.021 | 7 | 11 | 25 |
| 64 | 0.333 | 82 | 0.042 | 15 | 21 | 45 |
| 128 | 0.667 | 104 | 0.083 | 28 | 39 | 70 |
| 256 | 1.0 | 102 | 0.167 | 50 | 63 | 97 |
| 512 | 1.0 | 104 | 0.333 | 75 | 89 | 102 |
| 1024 | 0.667 | 97 | 0.667 | 97 | 93 | 94 |
Here we observe far better performance when load batching is used relative to using instruction-level parallelism without load batching. We observe over 100 GB/s with one-third occupancy when four-way instruction-level parallelism is used.
We can increase the amount of instruction-level parallelism to improve performance further, with the caveat that the thread-local array tmp(ILP) resides in register memory, so a larger ILP further increases register pressure. Compiling with ‑Mcuda=ptxinfo indicates that the kernels with ILP of four and eight use 19 and 35 registers per thread, compared to the 10 registers per thread used in the kernel without instruction-level parallelism. With 35 registers per thread, the ILP=8 case cannot be run with a 1024-thread block, as the 35,840 registers required by a single such block exceed the limit of 32,768 registers per multiprocessor on a C2050. This can be avoided by limiting register use to 32 per thread via the option ‑Mcuda=maxregcount:32, but at the expense of register spilling, which adversely affects performance for all thread block sizes. We therefore present the results for ILP=8 without any restriction on register use, for all but the 1024-thread block:
| Threads per Block | No Shared Memory: Occupancy | No Shared Memory: BW (GB/s) | Shared Memory Load: Occupancy | Shared Memory Load: ILP=1 BW (GB/s) | Shared Memory Load: ILP=4 BW (GB/s) | Shared Memory Load: ILP=4, Load Batched BW (GB/s) | Shared Memory Load: ILP=8, Load Batched BW (GB/s) |
| 32 | 0.167 | 54 | 0.021 | 7 | 11 | 25 | 32 |
| 64 | 0.333 | 82 | 0.042 | 15 | 21 | 45 | 57 |
| 128 | 0.667 | 104 | 0.083 | 28 | 39 | 70 | 86 |
| 256 | 1.0 | 102 | 0.167 | 50 | 63 | 97 | 103 |
| 512 | 1.0 | 104 | 0.333 | 75 | 89 | 102 | 95 |
| 1024 | 0.667 | 97 | 0.667 | 97 | 93 | 94 | ** |
Here we observe over 100 GB/s with one-sixth occupancy when ILP=8. With a thread block of 32 threads, or 1/48 occupancy, we achieve a bit under one third of the highest observed bandwidth, roughly five times the performance of the kernel without any instruction-level parallelism.
Summary
In this article we discussed two basic techniques of exposing enough parallelism in CUDA kernels to keep the hardware busy and achieve good performance. The first technique, using thread-level parallelism, is the easiest because it is specified in the host code through the execution configuration and does not require any kernel modification. However, there are situations when resource consumption of a kernel is so large as to greatly restrict the number of concurrent threads on a multiprocessor. In these cases we need to look at instruction-level parallelism.
Use of instruction-level parallelism is not reserved only for cases where resource consumption dictates it; it can be used as a general optimization technique. Many of the highly optimized routines in the CUBLAS and CUFFT libraries run at low occupancy due to instruction-level parallelism. There is a trade-off with register pressure when implementing instruction-level parallelism: it typically allows one to run at lower occupancy, which increases the number of registers available per thread, but load batching requires temporary arrays that reside in registers and therefore increases the register pressure. How much instruction-level parallelism results in optimal performance needs to be determined on a case-by-case basis.
For more information on instruction-level parallelism, see the slides for the GTC 2010 presentation Better Performance at Lower Occupancy.