September 2010

## CUDA Fortran: The Next Level

### Overview

CUDA Fortran is the Fortran analog of NVIDIA's CUDA C language and compiler. It takes advantage of higher-level Fortran language features, such as using type attributes to distinguish host data from device data and using array assignments to move data between the host and the GPU. This article presents an important enhancement and an exciting new high-level feature in the PGI CUDA Fortran implementation.

### Fortran Modules

Fortran has a very nice **module** language feature which allows procedures and data to be grouped together, and allows other program units to import the module definitions and interfaces with a simple **use** statement. This avoids the maintenance overhead of keeping header files consistent with source code, simplifying life for the programmer.

CUDA Fortran allows **device** data and **global** and **device** subprograms to be defined in a module, and these will be accessible to any host routine that **uses** the module. However, because there is no linker for the device code, there was no way to access module data or subprograms in device code that wasn't in the same module. For example, the following is not supported using PGI 10.x versions of CUDA Fortran:

module a real,allocatable,dimension(:),device :: ga end module module b use a contains attributes(global) subroutine sub( s ) real,dimension(:) :: s integer :: i i = threadidx%x + (blockidx%x-1)*blockdim%x ga(i) = s(i) ! access to 'ga' from another module end subroutine end module

This is a severe restriction, and reduces program modularity. The PGI 11.0 CUDA Fortran implementation will allow seamless access to module data in any device subprogram that **uses** the module. The compiler will generate code to dynamically link the module data to the subprogram, using indirection, the first time a device kernel is launched. There is a little overhead because this is done at runtime, and only the first time the kernel is launched. Note that this feature will only work on Fermi-class GPUs, which support a unified address space and flexible pointer dereferencing.

The obvious next step would be to allow global or device subprograms to call device subprograms contained in another module. This is quite a bit harder, because function pointers are not as flexible as data pointers in the CUDA architecture and software environment. A possible implementation is being explored, but support for this feature will not be included in the PGI 2011 release.

### Kernel Loops

Most CUDA kernels in either C or Fortran correspond directly to nested parallel loops. For instance, a matrix add on the host looks like:

module madd_module contains subroutine madd(a,b,c,n1,n2) real, dimension(:,:) :: a,b,c integer :: n1,n2 integer :: i,j do j = 1,n2 do i = 1,n1 a(i,j) = b(i,j) + c(i,j) enddo enddo end subroutine end module

Writing even this simple operation on device arrays in CUDA Fortran requires splitting out the kernel (the loop body) to a global subroutine, adding checks to not run over the bounds of the array, and adding the kernel launch; in this example, we use a 32x4 thread block:

module madd_device_module use cudafor contains attributes(global) subroutine madd_kernel(a,b,c,n1,n2) real, dimension(:,:) :: a,b,c integer, value :: n1,n2 integer :: i,j i = threadidx%x + (blockidx%x-1)*blockdim%x j = threadidx%y + (blockidx%y-1)*blockdim%y if( i <= n1 .and. j <= n2 ) & a(i,j) = b(i,j) + c(i,j) end subroutine subroutine madd_dev(a,b,c,n1,n2) real, dimension(:,:), device :: a,b,c integer :: n1,n2 type(dim3) :: grid, block grid = dim3((n1+31)/32, (n2+3)/4, 1) block = dim3(32,4,1) call madd_kernel<<< grid, block >>>(a,b,c,n1,n2) end subroutine end module

This gets even more complex if we want to handle large matrices. CUDA supports grids of 65535x65535; if the matrices are large enough to overflow that limit, we have to handle that explicitly:

module madd_device_module use cudafor contains attributes(global) subroutine madd_kernel(a,b,c,n1,n2) real, dimension(:,:) :: a,b,c integer, value :: n1,n2 integer :: i,j do j = threadidx%y + (blockidx%y-1)*blockdim%y, n2, blockdim%y*griddim%y do i = threadidx%x + (blockidx%x-1)*blockdim%x, n1, blockdim%x*griddim%x a(i,j) = b(i,j) + c(i,j) enddo enddo end subroutine subroutine madd_dev(a,b,c,n1,n2) real, dimension(:,:), device :: a,b,c integer :: n1,n2 type(dim3) :: grid, block grid = dim3(min(65535,(n1+31)/32), min(65535,(n2+3)/4), 1) block = dim3(32,4,1) call madd_kernel<<< grid, block >>>(a,b,c,n1,n2) end subroutine end module

### Kernel Loop Directive

We have designed a new directive for CUDA Fortran to simplify the task of writing such trivial kernels. The directive appears before a loop or tightly nested loops, and tells the compiler to *kernelize* the loop or loops, and gives the thread block shape and size. The grid size can be specified, or can be automatically computed by dividing the loop trip count by the thread block size for that loop. Our matrix addition example would be written using this kernel directive as:

module madd_device_module use cudafor contains subroutine madd_dev(a,b,c,n1,n2) real, dimension(:,:), device :: a,b,c integer :: n1,n2 !$cuf kernel do(2) <<< (*,*), (32,4) >>> do j = 1,n2 do i = 1,n1 a(i,j) = b(i,j) + c(i,j) enddo enddo end subroutine end module

Here, the kernel do directive defines a two-dimensional thread block (32x4). The body of the immediately following two tightly nested loops are turned into the kernel body; the inner i loop will be mapped to threadidx%x with blockdim%x=32, and the outer j loop will be mapped to threadidx%y with blockdim%y=4. The grid shape, specified as (*,*), will be computed by dividing the loop trip counts by the thread block sizes.

The kernel directive feature won't replace all CUDA Fortran kernels, by any means. However, it will greatly simplify the job of writing many trivial kernels, which can be the bulk of the coding work in a large program. There are many restrictions on the code that can appear in a loop subject to the kernel directive:

- If the directive specifies two or more dimensions, it must be immediately followed by that many tightly nested DO loops.
- The DO loops must have invariant loop limits: the lower limit, upper limit, and stride must be invariant.
- There can be no GOTO from inside the inner loop to any statement outside the inner loop, nor can there be an EXIT statement from the inner loop.
- The body of the loops may contain assignment statements, IF statements, loops, and GOTO statements.
- CUDA Fortran datatypes are allowed.
- CUDA Fortran intrinsic functions are allowed, if they are allowed in device code, but the device-specific intrinsics (syncthreads, allthreads, atomic functions, etc.) are not.
- Subroutine and function calls to attributes(device) subprograms in the same module as this routine are allowed.
- Any arrays used or assigned in the loop must have the device attribute.
- Any scalars used in the loop must either have the device attribute, or the compiler will make a device copy of that variable for this kernel.
- As these loops appear in host subprograms, there is no option to explicitly use shared memory, though the compiler may do so automatically.

Most importantly, the loops will be executed in parallel, across the thread blocks and grid; it is the programmer's responsibility to ensure that parallel execution is legal and will produce the correct answer. The only exception is for a scalar reduction operation, such as summing the values in a vector or matrix; in that case, the compiler will handle the generation of the final reduction kernel, inserting synchronization into the kernel as appropriate. As with any kernel, the launch is asynchronous; the host program will continue without waiting; the program can use cudaThreadSynchronize() or CUDA Events to wait for the completion of the kernel.

The directive uses the CUDA chevron syntax to specify the thread block and grid shape. The number of loops is given just after the do keyword. The first entry in the chevrons is the grid shape; the directive will allow you to specify the grid shape, but specifying asterisks here will tell the compiler to compute the grid shape from the loop limits and thread block shape. The second entry is the thread block shape. The first entry corresponds to the inner loop, the next entry to its outer loop, and so on. An entry can be a constant expression or a runtime expression, and there must be an entry for each loop. The inner loop will map to the threadidx%x index, the next outer loop to threadidx%y, and again for threadidx%z. The compiler essentially strip-mines the loops to the thread block size.

### Examples

Here we show the matrix add using kernel loop directive three ways, with corresponding CUDA Fortran code. The previous section showed a two-dimensional thread block of size 32x4, with a two-dimensional grid of size N1/32 x N2/4. We could change this to a one-dimensional thread block as follows:

module madd_device_module use cudafor contains subroutine madd_dev(a,b,c,n1,n2) real, dimension(:,:), device :: a,b,c integer :: n1,n2 type(dim3) :: grid, block !$cuf kernel do(1) <<< *, 256 >>> do i = 1,n1 do j = 1,n2 a(i,j) = b(i,j) + c(i,j) enddo enddo end subroutine end module

The kernel directive specifies that only the i loop will be run in parallel with a the thread block of size 256; the j dimension is run sequentially. The corresponding CUDA Fortran code is:

module madd_device_module use cudafor contains attributes(global) subroutine madd_kernel(a,b,c,n1,n2) real, dimension(:,:) :: a,b,c integer, value :: n1,n2 integer :: i,j do i = threadidx%x + (blockidx%x-1)*blockdim%x, n1, blockdim%x*griddim%x do j = 1, n2 a(i,j) = b(i,j) + c(i,j) enddo enddo end subroutine subroutine madd_dev(a,b,c,n1,n2) real, dimension(:,:), device :: a,b,c integer :: n1,n2 integer grid grid = min(65535,(n1+31)/256) call madd_kernel<<< grid, 256 >>>(a,b,c,n1,n2) end subroutine end module

Alternatively, we could change this to map the j loop onto a blockidx dimension, using the following:

module madd_device_module use cudafor contains subroutine madd_dev(a,b,c,n1,n2) real, dimension(:,:), device :: a,b,c integer :: n1,n2 type(dim3) :: grid, block !$cuf kernel do(2) <<< (*,n2), (256,1) >>> do j = 1,n2 do i = 1,n1 a(i,j) = b(i,j) + c(i,j) enddo enddo end subroutine end module

This would correspond to the CUDA Fortran code:

module madd_device_module use cudafor contains attributes(global) subroutine madd_kernel(a,b,c,n1,n2) real, dimension(:,:) :: a,b,c integer, value :: n1,n2 integer :: i,j do j = blockidx%y, n2, griddim%y do i = threadidx%x + (blockidx%x-1)*blockdim%x, n1, blockdim%x*griddim%x a(i,j) = b(i,j) + c(i,j) enddo enddo end subroutine subroutine madd_dev(a,b,c,n1,n2) real, dimension(:,:), device :: a,b,c integer :: n1,n2 type(dim3) :: grid, block grid = dim3(min(65535,(n1+31)/256), min(65535,n2), 1) call madd_kernel<<< grid, 256 >>>(a,b,c,n1,n2) end subroutine end module

Finally, we could go back to a two dimensional thread block, but add code to compute a sum of all the elements of the resulting matrix as well using the following loop:

module madd_device_module use cudafor contains subroutine madd_dev(a,b,c,sum,n1,n2) real, dimension(:,:), device :: a,b,c real :: sum integer :: n1,n2 type(dim3) :: grid, block !$cuf kernel do (2) <<< (*,*), (32,4) >>> do j = 1,n2 do i = 1,n1 a(i,j) = b(i,j) + c(i,j) sum = sum + a(i,j) enddo enddo end subroutine end module

As anyone who has written a reduction using CUDA, there are many details to get right. The corresponding CUDA Fortran to this code is the following:

module madd_device_module use cudafor implicit none contains attributes(global) subroutine madd_kernel(a,b,c,blocksum,n1,n2) real, dimension(:,:) :: a,b,c real, dimension(:) :: blocksum integer, value :: n1,n2 integer :: i,j,tindex,tneighbor,bindex real :: mysum real, shared :: bsum(256) ! Do this thread's work mysum = 0.0 do j = threadidx%y + (blockidx%y-1)*blockdim%y, n2, blockdim%y*griddim%y do i = threadidx%x + (blockidx%x-1)*blockdim%x, n1, blockdim%x*griddim%x a(i,j) = b(i,j) + c(i,j) mysum = mysum + a(i,j) ! accumulates partial sum per thread enddo enddo ! Now add up all partial sums for the whole thread block ! Compute this thread's linear index in the thread block ! We assume 256 threads in the thread block tindex = threadidx%x + (threadidx%y-1)*blockdim%x ! Store this thread's partial sum in the shared memory block bsum(tindex) = mysum call syncthreads() ! Accumulate all the partial sums for this thread block to a single value tneighbor = 128 do while( tneighbor >= 1 ) if( tindex <= tneighbor ) & bsum(tindex) = bsum(tindex) + bsum(tindex+tneighbor) tneighbor = tneighbor / 2 call syncthreads() enddo ! Store the partial sum for the thread block bindex = blockidx%x + (blockidx%y-1)*griddim%x if( tindex == 1 ) blocksum(bindex) = bsum(1) end subroutine ! Add up partial sums for all thread blocks to a single cumulative sum attributes(global) subroutine madd_sum_kernel(blocksum,dsum,nb) real, dimension(:) :: blocksum real :: dsum integer, value :: nb real, shared :: bsum(256) integer :: tindex,tneighbor,i ! Again, we assume 256 threads in the thread block ! Accumulate a partial sum for each thread tindex = threadidx%x bsum(tindex) = 0.0 do i = tindex, nb, blockdim%x bsum(tindex) = bsum(tindex) + blocksum(i) enddo call syncthreads() ! This code is copied from the previous kernel ! Accumulate all the partial sums for this thread block to a single value ! Since there is only one thread block, this single value is the final result tneighbor = 128 do while( tneighbor >= 1 ) if( tindex <= tneighbor ) & bsum(tindex) = bsum(tindex) + bsum(tindex+tneighbor) tneighbor = tneighbor / 2 call syncthreads() enddo if( tindex == 1 ) dsum = bsum(1) end subroutine subroutine madd_dev(a,b,c,dsum,n1,n2) real, dimension(:,:), device :: a,b,c real, device :: dsum real, dimension(:), allocatable, device :: blocksum integer :: n1,n2,nb type(dim3) :: grid, block integer :: r ! Compute grid/block size; block size must be 256 threads grid = dim3((n1+31)/32, (n2+7)/8, 1) block = dim3(32,8,1) nb = grid%x * grid%y allocate(blocksum(1:nb)) call madd_kernel<<< grid, block >>>(a,b,c,blocksum,n1,n2) call madd_sum_kernel<<< 1, 256 >>>(blocksum,dsum,nb) r = cudaThreadSynchronize() ! don't deallocate too early deallocate(blocksum) end subroutine end module

Those familiar with the PGI Accelerator programming model will see a similarity between the kernel loop and the Accelerator compute region, and in fact some of the same technology is used in the compiler to generate the kernel code. However, the compiler doesn't add data movement for kernel loops, because they work directly on CUDA Fortran device data.

The details of the kernel directive syntax may change somewhat before the 2011 release, but the concepts are well defined. The kernel directive will allow more straightforward porting to the CUDA programming model, and allow developers to concentrate their efforts on the time-critical portions of the application.