# A CUDA Fortran Interface to Tensor Cores

## 1 Introduction

One of the defining features of recent NVIDIA GPUs including the Telsa V100 is the introduction of Tensor Cores, which are programmable matrix multiply and accumulate units that operate on half-precision (16-bit) multiplicands. Access to programming Tensor Cores in CUDA C became available in the CUDA 9.0 release through the WMMA (Warp Matrix Multiply and Accumulate) API. This paper describes a CUDA Fortran interface to this same functionality. Note that the WMMA interface is a preview feature in CUDA C and subject to change, as is the CUDA Fortran interface described in what follows.

With the WMMA interface, a single warp of 32 threads performs *D = A∗B+C* where *C* and *D* are 256-element matrices. The multiplicands *A* and *B* are matrices of half-precision (16-bit) floating point values, whereas *C* and *D* are matrices of either both half-precision or both full-precision (32-bit) floating point values. Each Tensor Core actually performs a 4×4 matrix multiply, so multiple Tensor Cores are used in each WMMA operation. Before the WWMA operation can take place the operand matrices must be loaded into registers, distributed amongst the threads in the warp. The mapping of threads to matrix elements is opaque, where the WMMASubmatrix datatype (equivalent to the *fragment* in CUDA C), is used to represent the elements each thread holds of the matrix represented by the warp of threads, along with other metadata.

This post will first discuss the support for 16-bit floating point values using real(2) declarations, as well as the wmma CUDA Fortran module, in particular how variables of the WMMASubmatrix type are used to perform matrix multiplications. That section is followed by a series of progressively complex examples illustrating the use of WMMA API from CUDA Fortran, starting with a simple 16×16 matrix multiply. The final section discusses the performance aspects of using tensor cores.

## 2 Representation of 16–bit values

CUDA Fortran supports 16-bit floating point variables using the real(2) declarations, which are available on both the host and device. Arithmetic, input/output, and implicit conversions between types are all supported.

The following code illustrates use of real(2) variables, as well as accuracy implications, in host and device code by performing implicit conversions and some simple arithmetic:

1 program main 2 implicit none 3 integer, parameter :: n=3 4 real(2) :: a2(n), b2(n), a2d(n), c2(n), c2d(n) 5 real(2), device :: a2_d(n), b2_d(n), c2_d(n) 6 real(4) :: a4(n), a4n(n), b4(n), c4(n), c4d(n), & 7 d4(n), d4d(n), e4(n), e4d(n) 8 real(4), device :: a4_d(n), b4_d(n), c4_d(n), d4_d(n), e4_d(n) 9 integer :: i, j 10 11 do i = 1, n 12 a4(i) = real(i)/n 13 b4(i) = real(i)/n 14 enddo 15 16 a2 = a4 17 a4n=a2 18 write(*,*) ’Host conversion:’ 19 do i = 1, n 20 write(*,*) i, a4(i), a4n(i), a2(i) 21 enddo 22 23 b2 = b4; c2 = 0.0_2; c4 = 0.0_4; d4 = 0.0_4; e4 = 0.0_4 24 a2_d = a4; b2_d = b4; c2_d = 0.0_2 25 a4_d = a4; b4_d = b4; c4_d = 0.0_4; d4_d = 0.0_4; e4_d = 0.0_4 26 27 do i = 1, n 28 do j = 1, 100! sum up product 100 times29 c2(i) = c2(i) + a2(i)*b2(i) 30 c4(i) = c4(i) + a4(i)*b4(i) 31 d4(i) = d4(i) + a2(i)*b2(i) 32 e4(i) = e4(i) + real(a2(i), kind=4)*real(b2(i), kind=4) 33 enddo 34 enddo 35!$cuf kernel do <<<*,*>>>36 do i = 1, n 37 do j = 1, 100 ! sum up product 100 times 38 c2_d (i) = c2_d(i) + a2_d(i) * b2_d(i) 39 c4_d (i) = c4_d(i) + a4_d(i) * b4_d(i) 40 d4_d (i) = d4_d(i) + a2_d(i) * b2_d(i) 41 e4_d (i) = e4_d(i) & 42 + real(a2_d(i), kind =4)* real(b2_d(i), kind =4) 43 enddo 44 enddo 45 c2d = c2_d 46 c4d = c4_d 47 d4d = d4_d 48 e4d = e4_d 49 write(*,*) 'Host / device arithmetic :' 50 do i = 1, n 51 write(*,*) i 52 write(*,*) c2(i), c4(i), d4(i), e4(i) 53 write(*,*) c2d(i), c4d(i), d4d(i), e4d(i) 54 write(*,*) 55 enddo 56 end program main

which when compiled and run results in the output:

% pgf90 -Mcuda =cc70, cuda10 .1 -O3 -o real2Accuracy real2Accuracy .cuf % ./ real2Accuracy Host conversion : 1 0.3333333 0.3332520 0.3333 2 0.6666667 0.6665039 0.6665 3 1.000000 1.000000 1.000 Host / device arithmetic : 1 11.01 11.11110 11.10840 11.10569 11.01 11.11110 11.10840 11.10569 2 44.03 44.44441 44.43359 44.42278 44.03 44.44441 44.43359 44.42278 3 100.0 100.0000 100.0000 100.0000 100.0 100.0000 100.0000 100.0000

For the host and device arithmetic examples, we have done accumulation of multiplicands in several ways: 16-bit multiplicands and accumulation, 32-bit multiplicands and accumulation, 16-bit multiplicands and 32-bit accumulation, and 16-bit multiplicands promoted to 32-bits before the multiplication and 32-bit accumulation. We should note that on the host arithmetic on real(2) data is actually performed in 32 bits and then converted back to 16 bits at each store.

## 3 The CUDA Fortran wmma module

Use of Tensor Cores through the WMMA API in CUDA Fortran requires the wmma module as well as the cuf_macros.CUF macro file. These provide tensor-core specific data types, along with routines to load and store data and perform warp-based matrix multiplications using these data types.

### 3.1 The WMMASubMatrix datatype

Tiles of matrices used by a warp of threads to perform matrix multiplication are stored in variables of the WMMASubMatrix datatype in device code. For those familiar with the CUDA C API to Tensor Cores, WMMASubmatrix corresponds to the fragment template. There are different WMMASubMatrix types based on use (i.e. which operand in *D = A B + C*), precision, storage order, and dimensions, which are specified as type parameters in CUDA Fortran. Typical declarations of WMMASubMatrix variables used in device code are:

WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: a WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: b WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: c

where the first parameter indicates the operand, corresponding to the A matrix, B matrix, and the accumulator. The following three parameters are the tile sizes, in this case (*m, n, k*) = (16, 16, 16). Allowable values for (*m, n, k*) are (16, 16, 16), (32, 8, 16), and (8, 32, 16). As a result, the accumulator submatrix (*m* × *n*) can be 8 × 32, 32 × 8, or 16 × 16 and will always contain 256 elements. The A submatrix can be 8 × 16, 16 × 16, or 32×16 and the B submatrices can be 16×32, 16×8, or 16×8 and therefore can contain 128, 256, or 512 elements. The datatype Real is specified next, and is currently the only allowed value. For WMMAMatrixA and WMMAMatrixB matrices one needs to specify row- or column-major ordering, whereas the only precision currently supported for these is 16-bit. The accumulator, WMMAMatrixC, can currently be either 16- or 32-bit precision as specified in the last parameter. The storage order for accumulators is not specified at declaration.

### 3.2 Compilation

When compiling code utilizing Tensor Cores, a compute capability of at least 7.0 and a CUDA runtime version of at least 9.0 should be used, either implicitly or explitily via -Mcuda=cc70,cuda9.0. In addition, the preprocessor must be invoked due to the included macro file cuf_macros.CUF, either explicitly through the -Mpreprocess compiler option or implicitly by using an uppercase file extension such as .CUF or .F90.

## 4 WMMA Programming Basics

In this section we present a sequence of small example programs illustrating the concepts of WMMA programming in CUDA Fortran. We begin with a code that launches a kernel using a single warp of threads that performs a matrix multiplication of two 16×16 matrices. We then add complexity with each example until we can address performance issues, which is the next section.

### 4.1 Example 1: 16×16×16 matrix multiply

The simplest possible matrix multiply using WMMA is the operation *C = A ∗ B* where all the matrices have the dimensions 16×16. The kernel code for this is:

1 # include "cuf_macros .CUF" 2 3 module m 4 integer, parameter :: wmma_m = 16 5 integer, parameter :: wmma_n = 16 6 integer, parameter :: wmma_k = 16 7 contains 8 9! kernel for 16 x16 matrices (a,b, and c) using wmma10! Should be launched with one block of 32 threads11 12 attributes(global) subroutine wmma_single(a, b, c) 13 use wmma 14 implicit none 15 real(2), intent(in) :: a(wmma_m,*), b(wmma_k,*) 16 real(4) :: c(wmma_m,*) 17 WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa 18 WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb 19 WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc 20 integer :: lda, ldb, ldc 21 22 lda = wmma_m 23 ldb = wmma_k 24 ldc = wmma_m 25 26 sc = 0.0_4 27 call wmmaLoadMatrix(sa, a(1,1), lda) 28 call wmmaLoadMatrix(sb, b(1,1), ldb) 29 call wmmaMatMul(sc, sa, sb, sc) 30 call wmmaStoreMatrix(c(1,1), sc, ldc) 31 32 end subroutine wmma_single 33 34 end module m

which is launched with a single warp of threads:

62 call wmma_single<<<1,32>>>(ah_d, bh_d, c_d)

In this example we use a 32-bit accumulator for the C matrix, in contrast to the 16-bit representations for the A and B matrices. For this case the matrices passed in as arguments to the kernel are the same size as the WMMA submatrices, so to perform the matrix multiplication we simply initialize the C WMMA submatrix to 0.0, load the A and B matrices from global memory to WMMA submatrices, perform the matrix multiplication on the submatrices, and store the result in the WMMA submatix to global memory.

You may have noticed that the thread index threadIdx does not appear at all in this code. This underlies the important concept to take away from this example: the threads in a warp work collectively to accomplish these tasks. So when dealing with the WMMA submatrices, we are doing warp-level programming rather than thread-level programming. This kernel is launched with a single warp of 32 threads, yet each of our WMMA submatrices has 16x16 or 256 elements. So when the initialization statement:

26 sc = 0.0_4

is executed, each thread sets 8 elements in the 16x16 submatrix to zero. (For those familiar with the CUDA C Tensor Core API, assignment to WMMA submatrices is overloaded to invoke the fill_fragment() method.) The mapping of threads to submatrix elements is opaque for this and other operations involving WMMA submatrices, from a programming standpoint we only address what happens collectively by a warp of threads on WMMA submatrices.

The statements that load the A and B from global memory to WMMA submatrices:

27 call wmmaLoadMatrix(sa, a(1,1), lda) 28 call wmmaLoadMatrix(sb, b(1,1), ldb)

also work collectively. In these calls the WMMA submatrices are specified as the first argument, and the second arguments contain the addresses of the upper left element of the tiles in global (or shared) memory to be loaded to the WMMA submatrices. The leading dimension of the the matrices in global (or shared) memory is the third argument. Note that the arguments passed to wmmaLoadMatrix() are the same for all threads in the warp. Because the mapping of elements to threads in a warp is opaque, each thread just passes the address of the first element in the 16x16 matrix along with the leading dimension as the third parameter, and the load operation is distributed amongst the threads in the warp.

The matrix multiplication on the WMMA submatrices is performed by the statement:

29 call wmmaMatMul(sc, sa, sb, sc)

which is again performed collectively by a warp of threads. Here we have used the same accumulator submatrix for the first and last arguments in the wmmaMatMul() call, which is why its initialization to zero was required.

The wmmaStoreMatrix() call:

30 call wmmaStoreMatrix(c(1,1), sc, ldc)

is analogous to the prior wmmaLoadMatrix calls, but here the first argument is the address of the upper left element of the tile in global (or shared) memory and the second argument is the WMMA submatrix whose values are stored. When both wmmaLoadMatrix() and wmmaStoreMatrix() are called with accumulator (WMMAMatrixC) arguments, there is an optional fourth argument that specifies the storage order — recall that declarations for WMMA accumulators do not specify storage order. In CUDA Fortran, the default is the WMMAColMajor or column-major storage order.

One final note on arguments to the wmmaLoadMatrix() and wmmaStoreMatrix() routines. There is a requirement that the leading dimension of the matrices, specified by the third argument of these routines, must be a multiple of 16 bytes (e.g. 8 real(2) words or 4 real(4) words). This will become important when we use shared memory for the input arrays, or more specifically when we pad shared memory arrays.

#### 4.1.1 Accuracy revisited

Before moving onto our second tensor core example, it is worthwhile adding some code that performs the matrix multiplication in loops in the kernel, in addition to using the tensor cores, and compare the results. This new kernel is:

14 attributes(global) subroutine wmma_accuracy(a, b, c, d) 15 use wmma 16 implicit none 17 real(2), intent(in) :: a(wmma_m,*), b(wmma_k,*) 18 real(4) :: c(wmma_m,*), d(wmma_m,*) 19 WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa 20 WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb 21 WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc 22 integer :: lda, ldb, ldc, m, n, k, colOffset 23 24 lda = wmma_m 25 ldb = wmma_k 26 ldc = wmma_m 27 28 sc = 0.0_4 29 call wmmaLoadMatrix(sa, a(1,1), lda) 30 call wmmaLoadMatrix(sb, b(1,1), ldb) 31 call wmmaMatMul(sc, sa, sb, sc) 32 call wmmaStoreMatrix(c(1,1), sc, ldc) 33 34! matrix multiplication with explicit loops35 m = threadIdx %x 36 colOffset = (m -1)/16 37 m = m - colOffset * 16 38 39 do n = 1+ colOffset, 16, 2 40 do k = 1, 16 41 d(m,n) = d(m,n) + real(a(m,k),4) * real(b(k,n),4) 42 enddo 43 enddo 44 45 end subroutine wmma_accuracy

where lines 35–43 perform the matrix multiplication of 16x16 matrices using the 32 threads in the warp, two columns of the resultant matrix at a time. Note that the real(2) multiplicands on line 41 are promoted to real(4) before the multiplication. In the host code we launch the kernel and then compare the c() and d() arrays from the kernel:

75 call wmma_accuracy <<<1,32>>>(ah_d, bh_d, c_d, d_d) 76 c = c_d 77 d = d_d 78 err = sum(abs(c-d)) 79 if (err /= 0.0) then 80 write(*,*) 'L1 error = ', err 81 else 82 write(*,*) 'Test Passed ' 83 end if

Compiling and running the code we have:

% pgf90 -Mcuda =cc70, cuda10 .1 -O3 tensorCoreAccuracy .CUF % ./a.out Test Passed

So the WMMA matrix multiplication effectively promotes the real(2) multiplicands to real(4) before performing the multiplication.

### 4.2 Example 2: 16x16x64 matrix multiply

Our second example performs the matrix multiplication *C = A * B* where *A* is 16x64 and *B* is 64x16. A common theme in all of these examples of this section is that a warp of threads calculates a 16x16 tile of the resultant matrix. As such, this kernel is still launched with a single warp of threads. The kernel for this code is:

13 attributes(global) subroutine wmma_row(a, b, c, k) 14 use wmma 15 implicit none 16 real(2), intent(in) :: a(wmma_m,*), b(k,*) 17 real(4) :: c(wmma_m,*) 18 integer, value :: k 19 integer :: i 20 21 WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa 22 WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb 23 WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc 24 integer :: lda, ldb, ldc 25 26 lda = wmma_m 27 ldb = k 28 ldc = wmma_m 29 30 sc = 0.0_4 31 do i = 1, k, wmma_k 32 call wmmaLoadMatrix(sa, a(1,i), lda) 33 call wmmaLoadMatrix(sb, b(i,1), ldb) 34 call wmmaMatMul(sc, sa, sb, sc) 35 enddo 36 37 call wmmaStoreMatrix(c(1,1), sc, ldc) 38 39 end subroutine wmma_row

The only difference between this kernel and the kernel from the previous example is that the loading of *A* and *B* submatrices and the matrix multiplication on these submatrices occur inside a loop, where for each iteration a matrix multiplication on 16x16 tiles is performed. Note once again that the second argument to wmmaLoadMatrix() is the same for all threads in the warp, it is the address of the first element of the 16x16 tile used in that iteration. The results are accumulated in the submatrix sc since the first and last arguments in wmmaMatMul() are the same.

### 4.3 Example 3: 32x32x32 matrix multiply

For this example we move past launching a kernel with a single warp of threads. We still use a single warp to calculate each 16x16 tile of the resultant matrix, but for a 32x32 resultant matrix we now launch the kernel with four warps of threads, for now with one warp per thread block:

54 integer, parameter :: m_blocks = 2, n_blocks = 2, k_blocks = 2 55 integer, parameter :: m = m_blocks * wmma_m, & 56 n = n_blocks * wmma_n, & 57 k = k_blocks * wmma_k

80 grid = dim3(m_blocks, n_blocks, 1) 81 call wmma_16x16<<>>(ah_d, bh_d, c_d, m, n, k)

The kernel code for this example is

11 attributes(global) subroutine wmma_16x16(a, b, c, m, n, k) 12 use wmma 13 implicit none 14 real(2), intent(in) :: a(m,*), b(k,*) 15 real(4) :: c(m,*) 16 integer , value :: m, n, k 17 integer :: i, row_t, col_t 18 19 WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa 20 WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb 21 WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc 22 integer :: lda, ldb, ldc 23 24 lda = m 25 ldb = k 26 ldc = m 27 28 row_t = (blockIdx%x-1) * wmma_m+1 29 col_t = (blockIdx%y-1) * wmma_n+1 30 31 sc = 0.0_4 32 33 do i = 1, k, wmma_k 34 call wmmaLoadMatrix(sa, a(row_t, i), lda) 35 call wmmaLoadMatrix(sb, b(i, col_t), ldb) 36 call wmmaMatMul(sc, sa, sb, sc) 37 enddo 38 39 call wmmaStoreMatrix(c(row_t, col_t), sc, ldc) 40 41 end subroutine wmma_16x16

Each thread block (consisting of a single warp of threads) calculates the results in the tile c(row_t:row)t+15, col_t:col_t+15). Since we launch this kernel with one warp per thread block, the row_t and col_t indices can be calculated from the blockIdx values:

28 row_t = (blockIdx%x-1) * wmma_m+1 29 col_t = (blockIdx%y-1) * wmma_n+1

While this kernel code is general and can be used for large matrices, with only 32 threads per block it is inefficient. In the next example we address this inefficientciency.

### 4.4 Example 4: 32x32x32 single block matrix multiply

In this example we perform the same matrix multiplication as in example 3, but using a single block of four warps of threads rather than four blocks of one warp each. The kernel is launched with:

87 tpb = dim3(64,2,1) 88 grid = dim3((m-1)/tile_r+1,(n-1)/tile_c+1,1) 89 call wmma_32x32<<>>(ah_d, bh_d, c_d, m, n, k)

and the module containing the kernel is:

3 module m 4 integer, parameter :: wmma_m = 16 5 integer, parameter :: wmma_n = 16 6 integer, parameter :: wmma_k = 16 7 8! tile_r, tile_c are size of submatrix of C that9! gets calculated per block10 integer, parameter :: tile_r = 32, tile_c = 32 11 12 contains 13 14! kernel where each block performs gemm for a 32 x32 submatrix of C15! launch with four warps per block with blocksize of dim3(64 ,2)16 17 attributes(global) subroutine wmma_32x32(a, b, c, m, n, k) 18 use wmma 19 use cudadevice 20 implicit none 21 real(2), intent(in) :: a(m,*) , b(k,*) 22 real(4) :: c(m ,*) 23 integer, value :: m, n, k 24 25 WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa 26 WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb 27 WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc 28 integer :: lda, ldb, ldc 29 type(dim3) :: warpIdx 30 integer :: i, row_t, col_t 31 32 lda = m 33 ldb = k 34 ldc = m 35 36 warpIdx%x = (threadIdx%x-1) / warpsize+1 37 warpIdx%y = threadIdx%y 38 39 row_t = (blockIdx%x-1) * tile_r + (warpIdx%x-1) * wmma_m+1 40 col_t = (blockIdx%y-1) * tile_c + (warpIdx%y-1) * wmma_n+1 41 42 sc = 0.0_4 43 44 do i = 1, k, wmma_k 45 call wmmaLoadMatrix(sa, a(row_t, i), lda) 46 call wmmaLoadMatrix(sb, b(i, col_t), ldb) 47 call wmmaMatMul(sc, sa, sb, sc) 48 enddo 49 50 call wmmaStoreMatrix(c(row_t, col_t), sc, ldc) 51 52 end subroutine wmma_32x32 53 54 end module m

The parameters tile r=32 and tile c=32 denote the size of the tile of C that gets calculated per thread block. It is convenient to define a warpIdx variable of type(dim3) that is analogous to threadIdx, since all of the WMMA operations are done on a per-warp basis. The threadIdx variable appears in the code, but is used only used to calculate the warpIdx values. The row_t and col_t indices now depend on the warpIdx values as well as the blockIdx values, but aside from the the code is similar to the code in Example 3.

While this code addresses the low occupancy of previous examples, it is inecient in
that it loads each 16x16 tile of *A* and *B* twice. This brings us to our next section on
performance.

## 5. Performance of WMMA codes

In the section on Example 4 we noted the redundant loads of *A* and *B* submatrices. Before going further we should quantify the cost of performing loads from global memory to the the WMMA submatrices (which live in registers), as well as the cost of the store operation from WMMA submatrices to global memory. The following code is an artificial benchmark that uses the same 16×16 *A* and *B* matrices to calculate their product in a tile of a larger *C* matrix. The kernel code for this benchmark is:

20 attributes(global) subroutine wmma_peak(a, b, c, m, n, niter) 21 use cudadevice 22 use wmma 23 implicit none 24 real(2) :: a(wmma_m,wmma_k), b(wmma_k,wmma_n) 25 real(4) :: c(m,n) 26 integer , value :: m, n, niter 27 28 WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa 29 WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb 30 WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc 31 type(dim3) :: warpIdx 32 integer :: i, row_t, col_t 33 34 warpIdx%x = (threadIdx%x-1) / warpsize+1 35 warpIdx%y = threadIdx%y 36 row_t = (blockIdx%x-1) * tile_r + (warpIdx%x-1) * wmma_m+1 37 col_t = (blockIdx%y-1) * tile_c + (warpIdx%y-1) * wmma_n+1 38 39 sc = 0.0_4 40 call wmmaLoadMatrix(sa, a, wmma_m) 41 call wmmaLoadMatrix(sb, b, wmma_k) 42 do i = 1, niter 43 call wmmaMatMul(sc, sa , sb , sc) 44 enddo 45 call wmmaStoreMatrix(c(row_t, col_t), sc, m) 46 47 end subroutine wmma_peak 48 49 end module m

where we launch this kernel multiple times with different values of niter, the number of times a matrix multiplication occurs using a single load/store of the input/output matrices. CUDA events are used to time the kernel and the resulting teraflops are reported. Compiling and running the code we have:

% pgf90 -Mcuda =cc70,cuda10.1 -O3 -o tensorCorePeak tensorCorePeak .CUF % ./ tensorCorePeak Device : Tesla V100 -PCIE -16 GB niter, Tensor Core TFlops 1 2.433460 2 5.626374 4 11.41901 8 22.18852 16 40.35468 32 66.48002 64 77.92628 128 85.07854 256 89.51171 512 91.10131 1024 92.56416 2048 93.14177 4096 93.44764 8192 93.58527 16384 93.65218 32768 102.6219

For large values of niter the cost of loading and storing matrices is largely amortized away and we observe that the wmmaMatMul() is performing at above 100 TFlops. However, if there is no reuse of the *A* and *B* submatrices, so each load is used in only one wmmaMatMul() call, the kernel operates at under 3 TFlops. As a result, the first step in achieving good performance is to reuse the input matrices as much as possible, which brings us to the next example.

### 5.1 Example 5: Shared memory GEMM

From this point on in our discussion, since we are measuring performance, we are no longer using small example matrices as in the previous section. We choose values of m, n, and k to be 6,400. To illustrate our shared memory strategy we start with a kernel where each thread block calculates a 64×32 element tile of *C*, still using 16×16 WMMA submatrices. The parameters for this case are:

4 integer, parameter :: wmma_m = 16 5 integer, parameter :: wmma_n = 16 6 integer, parameter :: wmma_k = 16 7 8! tile_r, tile_c are size of submatrix of C9! that gets calculated per block10 integer, parameter :: tile_r = 64, tile_c = (tile_r/32) * wmma_n 11 12! Number of wmma_m block rows in A shared memory tile13 integer, parameter :: blockRows_as = tile_r / wmma_m

where the expression for tile_c evaluates to 32.

Our strategy for optimizing reuse of data in WMMA submatrices is to load data used for *A* submatrices into shared memory, and for each warp of threads to load a single 16×16 submatrix of *B* and use that to calculate a 64×16 column of the *C* tile that is calculated in the thread block, as depicted in Figure 1. As such, for the 64×32 tile used here we launch this kernel with two warps of threads. In the kernel, our shared memory and WMMA submatrix declarations are:

31 real(2), shared :: a_s(tile_r, wmma_k) 32 WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa 33 WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb 34 WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: & 35 sc(blockRows_as)

Since each warp of threads is calculating a 64×16 tile of matrix *C*, we need an array of 4 (blockRows_as) submatrix accumulators. The kernel code performs the following index calculations prior to entering the main loop over the k dimension:

46! row and column indices to the first element47! in the tile_r x tile_c tile of C48 row_t = ( blockIdx %x -1)* tile_r + 1 49 col_t = ( blockIdx %y -1)* tile_c + 1 50 51! C column index for each warp52 col_w = col_t + ( warpIdx -1)* wmma_n

where col_w is the column index used to load the warp’s *B* matrix. After initializing the array of accumulator submatrices, the main loop over the k dimension and loop to store the resultant *C* submatrices are:

58 do i = 1, k, wmma_k 59! load the tile_r x 16 tile of A into shared memory ,60! the block of tile_r threads loads one column per61! iteration in the j- loop62 call syncthreads() 63 do j=1, wmma_k 64 a_s(threadIdx %x,j) = a(row_t-1 + threadIdx%x, i - 1 + j) 65 enddo 66 67! load B into wmma submatrices ,68! each warp gets a different 16 x16 block69 call wmmaLoadMatrix(sb, b(i,col_w), ldb) 70 71! for a_s72 call syncthreads () 73 74 ! loop loading A wmma submatrices from shared memory 75 ! and performing matmul 76 do j = 1, blockRows_as 77 call wmmaLoadMatrix(sa, a_s(1+(j-1)*wmma_m, 1), tile_r) 78 call wmmaMatMul(sc(j), sa, sb, sc(j)) 79 enddo 80 end do 81 82 do j = 1, blockRows_as 83 call wmmaStoreMatrix(c(row_t+(j -1)*wmma_m, col_w), sc(j), ldc ) 84 end do

where for each iteration over the k dimension the 64×16 *A* tile is loaded into shared memory (lines 63-65). Because we want to keep the indexing for loading *A* into shared memory memory as simple as possible, namely one column of values is loaded into the shared memory tile per block of threads during each iteration in the loop on lines 63-65, we are essentially fixing the thread block size to be tile_r (which is how the block size is determined in host code). Consequently, this arrangement determines the number of warps launched per thread block and thus determines the value of tile_c:

10 integer, parameter :: tile_r = 64, tile_c = (tile_r/32)*wmma_n

rather than being chosen independent of other parameters. Back in the main loop of the kernel, after issuing loads for the shared memory *A* tile each warp of threads loads a *B* WMMA submatrix (using the col_w index). A thread synchronization ensures all the elements in a_s are available to all threads, the code then loops over block rows, loading the WMMA submatrices from shared memory and performing the matrix multiplication. After the main loop over the k dimension completes, the results in the sc(j) tile are stored to global memory (lines 82-84).

When we compile and run this code we obtain the following:

% pgf90 -Mcuda=cc70,cuda10.1 -O3 -o tensorCoreShared \ tensorCoreShared .CUF % ./ tensorCoreShared Device: Tesla V100-PCIE-16GB M = 6400, N = 6400, K = 6400 tile_r = 64, tile_c = 32 Test passed, TFlops: 16.14157

Note that for this case the kernel is launched with a thread block of 64 threads. We can get better utilization by changing tile_r to 128, which effectively doubles the thread block size and quadruples the size of the *C* tile calculated per thread block. Running the code with these parameters we obtain:

% ./tensorCoreShared Device: Tesla V100-PCIE-16GB M = 6400, N = 6400, K = 6400 tile_r = 128, tile_c = 64 Test passed, TFlops: 19.06848

which is a bit better performance. But a more substantial increase can be obtained by addressing memory bank conflicts.

#### 5.1.1 Shared memory padding

A common optimization technique when using shared memory in CUDA is to ensure that shared memory bank conflicts are avoided. Shared memory is divided into equally sized banks that can be accesses simultaneously, in the best case, but if mutltiple addresses of a memory request map to the same bank, then these accesses are serialized. The fix for such a case can be relatively simple — just pad the leading dimension of the shared memory array. For the case of shared memory arrays used to load WMMA submatrices, there is an alignment requirement that any padding must be a multiple of 16 bytes, or 8 real(2) words.

Changes to the device code are minimal, and include declaring a padding parameter:

15! Shared memory padding16 integer, parameter :: pad_s = 16

and using the parameter in the declaration of the shared memory array:

34 real(2), shared :: a_s(tile_r+pad_s, wmma_k)

as well as adjusting the last argument (the leading dimension) to the wmmaLoadMatrix() call:

80 call wmmaLoadMatrix(sa, a_s(1+(j -1)*wmma_m, 1), & 81 tile_r+pad_s)

After implementing the above changes (in the tensorCoreSharedPad.CUF file), this code achieves the following

% ./tensorCoreSharedPad Device: Tesla V100-PCIE-16GB M = 6400, N = 6400, K = 6400 tile_r = 128, tile_c = 64 pad_s = 16 Test passed, TFlops: 41.02449

which represents a substantial increase in performance over the previous case.

Since the mapping of device threads to elements in the WMMA submatrix is opaque, we cannot use access patterns to determine how much shared memory padding is optimal, and need to resort to an empirical approach. Padding by 16 half-precision elements is optimal for this kernel with these parameters, but when changing the kernel or parameters the choice of the amount of padding should be revisited.

The use of padding is one reason why we chose to place the *A* tile rather than the *B* tile in shared memory. A value of pad_s=16 would double the amount of shared memory used by a block of threads if the *B* tile were places in shared memory, whereas padding increases the shared memory usage for the *A* tile by only 16×16 or 256 half-precision words.

### 5.2 Example 6: Vectorized Memory Access

The load and store operations used in wmmaLoadMatrix() and wmmaStoreMatrix() are well optimized. The transfer of half-precision data from global to shared memory, however, can be improved using vectorized memory access.

For every copy operation, several instructions are issued which calculate the source and destination addresses as well as perform the load and stores. We can amortize the cost of these instructions by having the loads and stores operation on a larger word size. In CUDA C, there are vector types like float2 that facilitate vectorized memory access. In CUDA Fortran we can use several techniques to essentially "cast" or "equivalence" data to larger word sizes.

The following device subroutine copies an *m* × *n* tile of real(2) data from global to shared memory using real(4) words:

28 attributes ( device ) subroutine vectorCopyHalfAsync & 29 (a_s, lda_s, a, lda, m, n, cg) 30 use cooperative_groups 31 implicit none 32 33!pgi$ ignore_tkr a, a_s34! destination real(2) shared memory array "cast" as real (4)35 real(4) :: a_s(lda_s /vecLen, *) 36 37! source real(2) global array "cast" as real(4)38 real(4), intent (in) :: a(lda /vecLen,*) 39 40! leading dimension of destination array (units of real(2) words)41 integer :: lda_s 42 43! leading dimension of source array (units of real(2) words)44 integer :: lda 45 46! mxn block to be transferred (units of real(2) words)47 integer :: m, n 48 49! currently thread block,50! in the future this can be a more general cooperative group51 type(thread_group ) :: cg 52 53 integer :: ir, ic 54 55 if (m/vecLen == cg%size) then 56 ir = cg%rank 57 do ic = 1, n 58 a_s(ir, ic) = a(ir, ic) 59 enddo 60 end if 61 62 end subroutine vectorCopyHalfAsync

This routine is called with real(2) arrays for a() and a_s(), which is permissible due to the !pgi$ ingore_tkr directive. The leading dimensions are adjusted in the declarations by the parameter vecLen=2. To keep the indexing as simple as possible (for performance reasons), a block of threads copies a column of real(4) words each iteration of the loop. This means that for the same thread block size (128) the *A* shared memory tile doubles in terms of half-precision words (256×64), as reflected in the parameter declarations:

8! vector size used in loading shared memory. If you change this,9! change type/kind of a() and a_s() in vectorLoadHalfAsync10 integer, parameter :: vecLen = 2 11 12! tile_r, tile_c are size of submatrix of C13! that gets calculated per block14 integer, parameter :: tile_r = 256, & 15 tile_c = (tile_r/vecLen/32)*wmma_n 16 17! Number of wmma_m block rows in A shared memory tile18 integer, parameter :: blockRows_as = tile_r/wmma_m 19 20! Shared memory padding21 integer, parameter :: pad_s = 16

The kernel code remains unchanged except for the call to vectorCopyHalfAsync() which replaces the loop which previously performed the data copy from global to shared memory:

107 do i = 1, k, wmma_k 108! load A into shared memory,109 call syncthreads(tg) 110 call vectorCopyHalfAsync(a_s, tile_r+pad_s, a(row_t,i), m, & 111 tile_r, wmma_k, tg) 112 113! load B into wmma submatrices,114! each warp gets a different 16x16 block115 call wmmaLoadMatrix(sb, b(i,col_w), ldb) 116 117! for a_s118 call syncthreads(tg) 119 120! loop loading A wmma submatrices from shared memory121! and performing matmul122 do j = 1, blockRows_as 123 call wmmaLoadMatrix(sa, a_s(1+(j-1)*wmma_m, 1), & 124 tile_r+pad_s) 125 call wmmaMatMul(sc(j), sa, sb, sc(j)) 126 enddo 127 end do

Compiling and running this code we obtain:

% pgf90 -Mcuda=cc70,cuda10.1 -O3 -o tensorCoreVector \ tensorCoreVector.CUF % ./ tensorCoreVector Device: Tesla V100-PCIE-16GB M = 6400, N = 6400, K = 6400 tile_r = 256, tile_c = 64 pad_s = 16 vecLen = 2 Test passed, TFlops: 52.53718

which is a significant improvement over the nonvectorized memory access performance.

#### 5.2.1 Vectorized memory access with Cray pointers

Another way to accomplish vectorized memory access in CUDA Fortran is through Cray pointers. By declaring real(4) pointers:

39! Cray pointers for vectorized access of a and a_s40 real(4) :: a4(m/vecLen,*) 41 real(4) , shared :: a4_s((tile_r+pad_s)/vecLen, *) 42 pointer(xa4, a4) 43 pointer(xa4_s, a4_s)

and associating them with our real(2) arrays:

59! associate Cray pointers60 xa4 = loc(a(1,1)) 61 xa4_s = loc(a_s(1,1))

we can copy data from global to shared memory using real(4) words. After declaring an addition indexing parameter row_tv to accommodate the vector length:

63! row and column indices to the first element64! in the tile_r x tile_c tile of C65 row_t = (blockIdx%x-1)*tile_r + 1 66 col_t = (blockIdx%y-1)*tile_c + 1 67 68! vector version of row_t index69 row_tv = (blockIdx%x-1)*tile_r/vecLen + 1

we can perform the vector copy to shared memory within the main loop of the kernel without needing to call a device function:

78 do i = 1, k, wmma_k 79! load A into shared memory,80 call syncthreads() 81 do j = 1, wmma_k 82 a4_s(threadIdx%x,j) = & 83 a4(row_tv-1+threadIdx%x, i-1+j) 84 enddo 85 86! load B into wmma submatrices,87! each warp gets a different 16x16 block88 call wmmaLoadMatrix(sb, b(i,col_w ), ldb ) 89 90! for a_s91 call syncthreads() 92 93! loop loading A wmma submatrices from shared memory94! and performing matmul95 do j = 1, blockRows_as 96 call wmmaLoadMatrix(sa, a_s(1+(j-1)*wmma_m, 1), & 97 tile_r+pad_s) 98 call wmmaMatMul(sc(j), sa, sb, sc(j)) 99 enddo 100 end do

Note that access to the *A* tile is though the real(4) pointers only when copying the data from global to shared memory, elsewhere the real(2) arrays are used.

Compiling and running this code we obtain the same performance, within run-to-run variations:

% pgf90 -Mcuda=cc70,cuda10.1 -O3 -o tensorCoreVectorCP \ tensorCoreVectorCP .CUF % ./ tensorCoreVectorCP Device: Tesla V100-PCIE-16GB M = 6400, N = 6400, K = 6400 tile_r = 256, tile_c = 64 pad_s = 16 vecLen = 2 Test passed, TFlops: 51.76673

## Next steps

The sequence of example codes used in this paper ranged from those illustrating basic concepts of tensor core programming to those performing GEMM on half-precision data at above 50 TFlops. The performance emphasis in the latter part of the paper is not meant to result in a optimal code, rather to demonstrate what issues need to be addressed (e.g. data reuse, shared memory, vector access) as you to develop tensor core codes. WMMA tile sizes, vector lengths, thread block sizes, are just a few of the parameter spaces that can be further explored. There are a few things to be aware of when exploring these parameter space. When changing the WMMA tile sizes, one needs to change both the wmma_m and wmma_n parameters (wmma_k is always 16) and in addition the tile size parameters in the wmmaSubMatrix declarations. When changing tile_r and tile_c, be aware of register utilization and register spills. Compiling with -Mcuda=ptxinfo will indicate register usage and register spills, although the presence of the latter will be very obvious from the resulting performance.

One issue we did not touch on is whether 16 bits of floating-point precision is sufficient for your needs, which is an assessment you will need to make. This can be assessed with real(2) support in host code, before jumping into tensor core programming.

Example Source Code