Using Cooperative Groups in CUDA Fortran
Prior to the CUDA 9.0 Toolkit, synchronization between device threads was limited to using the syncthreads() subroutine to perform a barrier synchronization amongst all threads in a thread block. To synchronize larger groups of threads, one had to break a kernel into multiple smaller kernels, where the completion of each smaller kernel acted, in effect, as a synchronization barrier for all the threads in a grid. With the cooperative groups feature introduced in the CUDA 9.0 Toolkit, one can now synchronize groups of threads both larger and smaller than a thread block on supported hardware, making coding easier and the resulting code more efficient for cases that naturally map to such synchronization. This post is the first in a series that describes the features of cooperative groups as implemented in CUDA Fortran in the PGI compilers.
The Cooperative Groups API
Access to the derived types and functions associated with cooperative groups in device code is available through the cooperative_groups module. The derived types currently available are type(grid_group), type(thread_group), and type(coalesced_group). On devices of compute capability 6.0 and higher (currently Pascal and Volta), the grid_group type can be used to form a cooperative group amongst all threads in a grid using the this_grid() function:
use cooperative_groups
type(grid_group) :: gg
gg = this_grid()
while at the other end of the scale, a group consisting of (at most) all threads in a warp can be created with:
use cooperative_groups
type(coalesced_group) :: cg
cg = this_warp()
The thread_group is the most generic cooperative group type; it can be used to form a group of all threads in a thread block with the this_thread_block() function:
use cooperative_groups
type(thread_group) :: tg
tg = this_thread_block()
Both thread_group and coalesced_group are available on all supported CUDA hardware. All of these cooperative groups have two public members, size and rank, which give the number of threads in the group and the rank or index of the current thread (using Fortran’s unit offset convention), respectively. The size and rank members are analogous to the blockDim and threadIdx automatically defined variables in device code — for a cooperative group of a thread block they are essentially the same, as seen in the code below where the two kernels, one of which uses cooperative groups, both perform the same task of incrementing each array element:
attributes(global) subroutine incrementKernel(a,n)
  implicit none
  integer, device :: a(n)
  integer, value :: n
  integer :: tid
  tid = (blockIdx%x-1)*blockDim%x + threadIdx%x
  if (tid <= n) a(tid) = a(tid) + 1
end subroutine incrementKernel

attributes(global) subroutine incrementCGKernel(a,n)
  use cooperative_groups
  implicit none
  integer, device :: a(n)
  integer, value :: n
  integer :: tid
  type(thread_group) :: tg
  tg = this_thread_block()
  tid = (blockIdx%x-1)*tg%size + tg%rank
  if (tid <= n) a(tid) = a(tid) + 1
end subroutine incrementCGKernel
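As a quick check, the two kernels can be launched back to back and should produce identical results. The host driver below is a minimal sketch rather than part of the original example; it assumes the two kernels above are contained in a module (here called m) so they can be invoked with the chevron syntax:

program testIncrement
  use cudafor
  use m                       ! hypothetical module holding the two kernels above
  implicit none
  integer, parameter :: n = 1024*1024
  integer :: a(n)
  integer, device :: a_d(n)

  a_d = 0
  call incrementKernel<<<(n+255)/256, 256>>>(a_d, n)     ! increments each element once
  call incrementCGKernel<<<(n+255)/256, 256>>>(a_d, n)   ! and once more
  a = a_d
  write(*,*) 'Expected 2, got min/max: ', minval(a), maxval(a)
end program testIncrement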
The main purpose of having cooperative groups is to allow synchronization amongst all threads within a group. The cooperative_groups module overloads the syncthreads() subroutine to accept a cooperative group instance as an argument, which then synchronizes all threads in that group. One can safely synchronize all threads in a warp with
use cooperative_groups
type(coalesced_group) :: cg
cg = this_warp()
...
call syncthreads(cg)
and without having to block activity occurring in other warps, as a thread-block synchronization would. Synchronization across all threads in a grid requires some additional features, which we discuss in the next section; first, the short kernel sketch below shows warp-level synchronization in context.
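Each warp in this sketch reverses its own 32-element tile of an array through shared memory, so only a warp-level barrier is needed between the store and the load. This is a minimal example written for this discussion rather than code from the original post; it assumes 256 threads per block and an array length that is a multiple of the block size:

attributes(global) subroutine warpReverse(a, n)
  use cooperative_groups
  implicit none
  integer, device :: a(n)
  integer, value :: n                  ! assumed to be a multiple of blockDim%x
  integer, shared :: s(256)            ! assumes blockDim%x == 256 at launch
  type(coalesced_group) :: cg
  integer :: tid, warpBase

  cg = this_warp()
  tid = (blockIdx%x-1)*blockDim%x + threadIdx%x
  warpBase = ((threadIdx%x-1)/32)*32   ! first shared-memory slot of this warp's tile

  s(warpBase + cg%rank) = a(tid)                  ! stage this warp's 32 values
  call syncthreads(cg)                            ! barrier for this warp only
  a(tid) = s(warpBase + cg%size - cg%rank + 1)    ! read them back reversed
end subroutine warpReverse

Only the threads of the calling warp wait at the call to syncthreads(cg); the remaining warps in the block proceed independently. The kernel could be launched, for example, as call warpReverse<<<n/256, 256>>>(a_d, n).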
Grid Synchronization
The CUDA runtime allows one to launch a kernel with more blocks than can simultaneously fit on a device. Given the thread-block independence of the CUDA programming model, this approach in general poses no problem: the scheduler issues thread blocks to multiprocessors as resources become available. To synchronize across all the threads in a grid, however, all of the threads launched in a kernel must actively reside on the device, a model of computation called persistent threads. To use the persistent thread approach, one needs to determine, for a specific kernel, the number of thread blocks that can concurrently reside on a multiprocessor, multiply that by the number of multiprocessors on the device, and launch the kernel with the resulting number of thread blocks. The routine cudaOccupancyMaxActiveBlocksPerMultiprocessor() can be used to determine the number of active blocks of a given kernel that can reside on a multiprocessor; a sketch of this manual calculation appears at the end of this section. CUDA Fortran, however, offers a simpler approach: by using a wildcard '*' as the first parameter in the kernel execution configuration in host code, the compiler determines the number of blocks to launch on your behalf. Aside from the wildcard used as the first execution configuration parameter, launching a kernel with grid synchronization is no different from launching any other kernel. There are some differences in a kernel that utilizes grid synchronization, which we will explain using the following example:
module m
contains
  attributes(grid_global) subroutine smooth(a,b,n,radius)
    use cooperative_groups
    implicit none
    real, device :: a(n), b(n)
    integer, value :: n, radius
    integer :: i, j, jj
    type(grid_group) :: gg
    gg = this_grid()
    do i = gg%rank, n, gg%size
       a(i) = i
    end do
    call syncthreads(gg)
    do i = gg%rank, n, gg%size
       b(i) = 0.0
       do j = i-radius, i+radius
          jj = j
          if (j < 1) jj = jj + n
          if (j > n) jj = jj - n
          b(i) = b(i) + a(jj)
       enddo
       b(i) = b(i)/(2*radius+1)
    enddo
  end subroutine smooth
end module m

program main
  use cudafor
  use m
  implicit none
  integer, parameter :: n = 1024*1024
  real :: a(n), b(n)
  real, device :: a_d(n), b_d(n)
  integer :: i, radius

  radius = 2
  call smooth<<<*,256>>>(a_d, b_d, n, radius)
  a = a_d
  b = b_d
  write(*,*) 'Filter radius: ', radius
  do i = 1, n
     if (abs(b(i)-a(i)) > 0.00010) write(*,*) i, a(i), b(i)
  enddo
end program main
This code initializes values in one array and, in a separate array, records the values after smoothing by averaging each element with its nearest neighbors (assuming periodicity) within a specified radius. Given the linear initialization, the only values affected are those near the (discontinuous) beginning and end of the array, which are printed out in host code.
Kernels that use grid synchronization must be declared with the grid_global attribute, as was done here with:
attributes(grid_global) subroutine smooth(a,b,n,radius)
The other features of the code, aside from the types and functions defined in the cooperative_groups module previously discussed, are steps necessary to implement the persistent thread model of computing. It is common for the number of persistent threads in a kernel to be fewer than the number of elements in an array, so to process all the elements of an array one must use grid-stride loops, as is done in initializing the array:
gg = this_grid()
do i = gg%rank, n, gg%size
   a(i) = i
end do
If the array size n is equal to or smaller than the number of threads launched by the kernel, then the statement a(i) = i is executed at most once by each thread; but n can be arbitrarily large, and this loop will still initialize all elements of the array. Before these array values can be used in the smoothing operation, where each thread uses data written by other threads, we must synchronize across the entire grid to ensure the initialization step has been completed by all threads:
call syncthreads(gg)
Without the ability to synchronize across all threads in the grid, one would have to break this kernel into two separate kernels running sequentially. After the synchronization, the smoothing operation can be performed using another grid-stride loop:
do i = gg%rank, n, gg%size
   b(i) = 0.0
   do j = i-radius, i+radius
      jj = j
      if (j < 1) jj = jj + n
      if (j > n) jj = jj - n
      b(i) = b(i) + a(jj)
   enddo
   b(i) = b(i)/(2*radius+1)
enddo
As mentioned previously, the host code used to launch a kernel that uses grid synchronization is similar to host code that launches generic kernels, with the exception that an asterisk is used as the first argument in the execution configuration:
radius = 2
call smooth<<<*,256>>>(a_d, b_d, n, radius)
a = a_d
b = b_d
Compiling and executing this code we have:
% pgf90 -Mcuda=cc70,9.0 gridGlobal.cuf
% ./a.out
 Filter radius:  2
        1   1.000000   419431.4
        2   2.000000   209717.2
  1048575   1048575.   838859.8
  1048576   1048576.   629145.6
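As promised above, here is a sketch of the manual launch-configuration calculation using cudaOccupancyMaxActiveBlocksPerMultiprocessor(). This is illustrative rather than definitive: the argument order (block count returned through the first argument, followed by the kernel, the block size, and the dynamic shared memory size) is assumed to mirror the CUDA C API, and a grid_global kernel is assumed to accept an explicit block count no larger than the number of concurrently resident blocks; consult the CUDA Fortran Programming Guide for the exact interface.

program manualLaunch
  use cudafor
  use m                        ! module containing the smooth kernel above
  implicit none
  integer, parameter :: n = 1024*1024
  real, device :: a_d(n), b_d(n)
  type(cudaDeviceProp) :: prop
  integer :: istat, blockSize, blocksPerSM, numBlocks

  blockSize = 256
  istat = cudaGetDeviceProperties(prop, 0)
  ! assumed interface: resident blocks per multiprocessor for smooth at this
  ! block size, with no dynamic shared memory, returned in blocksPerSM
  istat = cudaOccupancyMaxActiveBlocksPerMultiprocessor( &
          blocksPerSM, smooth, blockSize, 0)
  numBlocks = blocksPerSM * prop%multiProcessorCount

  ! launch with exactly the number of co-resident blocks instead of '*'
  call smooth<<<numBlocks, blockSize>>>(a_d, b_d, n, 2)
  istat = cudaDeviceSynchronize()
end program manualLaunch

With the wildcard execution configuration, all of this bookkeeping is handled by the compiler and runtime.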
Summary
This post provided a basic introduction to cooperative groups as implemented in the PGI 18.1 CUDA Fortran compiler, focusing primarily on grid synchronization. By overloading the syncthreads() subroutine to take arguments of cooperative group derived types, and by allowing a wildcard in the execution configuration to specify the number of blocks needed for persistent threads, CUDA Fortran makes synchronization across an entire grid of threads a straightforward extension of the thread-block synchronization notation. In a follow-up post, I'll look at the other end of the spectrum and discuss synchronization of thread groups smaller than a thread block.