1 Introduction

One of the defining features of recent NVIDIA GPUs, including the Tesla V100, is the introduction of Tensor Cores, which are programmable matrix multiply and accumulate units that operate on half-precision (16-bit) multiplicands. Access to programming Tensor Cores in CUDA C became available in the CUDA 9.0 release through the WMMA (Warp Matrix Multiply and Accumulate) API. This paper describes a CUDA Fortran interface to this same functionality. Note that the WMMA interface is a preview feature in CUDA C and subject to change, as is the CUDA Fortran interface described in what follows.

With the WMMA interface, a single warp of 32 threads performs D = A∗B+C, where C and D are 256-element matrices. The multiplicands A and B are matrices of half-precision (16-bit) floating-point values, whereas C and D are matrices of either both half-precision or both full-precision (32-bit) floating-point values. Each Tensor Core actually performs a 4×4 matrix multiply, so multiple Tensor Cores are used in each WMMA operation. Before the WMMA operation can take place, the operand matrices must be loaded into registers and distributed amongst the threads in the warp. The mapping of threads to matrix elements is opaque; the WMMASubMatrix datatype (equivalent to the fragment in CUDA C) is used to represent the elements each thread holds of the matrix represented by the warp of threads, along with other metadata.

This paper first discusses the support for 16-bit floating-point values using real(2) declarations, as well as the wmma CUDA Fortran module, in particular how variables of the WMMASubMatrix type are used to perform matrix multiplications. That section is followed by a series of progressively more complex examples illustrating the use of the WMMA API from CUDA Fortran, starting with a simple 16×16 matrix multiply. The final section discusses the performance aspects of using Tensor Cores.

2 Representation of 16-bit values

CUDA Fortran supports 16-bit floating point variables using the real(2) declarations, which are available on both the host and device. Arithmetic, input/output, and implicit conversions between types are all supported.

The following code illustrates use of real(2) variables, as well as accuracy implications, in host and device code by performing implicit conversions and some simple arithmetic:

 1  program main
 2    implicit none
 3    integer, parameter :: n=3
 4    real(2) :: a2(n), b2(n), a2d(n), c2(n), c2d(n)
 5    real(2), device :: a2_d(n), b2_d(n), c2_d(n)
 6    real(4) :: a4(n), a4n(n), b4(n), c4(n), c4d(n), &
 7         d4(n), d4d(n), e4(n), e4d(n)
 8    real(4), device :: a4_d(n), b4_d(n), c4_d(n), d4_d(n), e4_d(n)
 9    integer :: i, j
10
11    do i = 1, n
12       a4(i) = real(i)/n 
13       b4(i) = real(i)/n 
14    enddo
15
16    a2 = a4
17    a4n=a2
18    write(*,*) 'Host conversion:'
19    do i = 1, n
20       write(*,*) i, a4(i), a4n(i), a2(i)
21    enddo
22
23    b2 = b4; c2 = 0.0_2; c4 = 0.0_4; d4 = 0.0_4; e4 = 0.0_4
24    a2_d = a4; b2_d = b4; c2_d = 0.0_2
25    a4_d = a4; b4_d = b4; c4_d = 0.0_4; d4_d = 0.0_4; e4_d = 0.0_4
26
27    do i = 1, n
28       do j = 1, 100 ! sum up product 100 times
29          c2(i) = c2(i) + a2(i)*b2(i) 
30          c4(i) = c4(i) + a4(i)*b4(i)
31          d4(i) = d4(i) + a2(i)*b2(i)
32          e4(i) = e4(i) + real(a2(i), kind=4)*real(b2(i), kind=4)
33       enddo
34    enddo
35    !$cuf kernel do <<<*,*>>>
36    do i = 1, n
37       do j = 1, 100 ! sum up product 100 times
38          c2_d(i) = c2_d(i) + a2_d(i) * b2_d(i)
39          c4_d(i) = c4_d(i) + a4_d(i) * b4_d(i)
40          d4_d(i) = d4_d(i) + a2_d(i) * b2_d(i)
41          e4_d(i) = e4_d(i) &
42                + real(a2_d(i), kind=4) * real(b2_d(i), kind=4)
43       enddo
44    enddo
45    c2d = c2_d
46    c4d = c4_d
47    d4d = d4_d
48    e4d = e4_d
49    write(*,*) 'Host / device arithmetic :'
50    do i = 1, n
51       write(*,*) i
52       write(*,*) c2(i), c4(i), d4(i), e4(i)
53       write(*,*) c2d(i), c4d(i), d4d(i), e4d(i)
54       write(*,*)
55    enddo
56 end program main

which when compiled and run results in the output:

% pgf90 -Mcuda=cc70,cuda10.1 -O3 -o real2Accuracy real2Accuracy.cuf
% ./real2Accuracy
 Host conversion:
            1    0.3333333         0.3332520         0.3333
            2    0.6666667         0.6665039         0.6665
            3    1.000000          1.000000          1.000
 Host / device arithmetic :
            1
    11.01       11.11110          11.10840          11.10569
    11.01       11.11110          11.10840          11.10569
            2
    44.03       44.44441          44.43359          44.42278
    44.03       44.44441          44.43359          44.42278
            3
    100.0       100.0000          100.0000          100.0000
    100.0       100.0000          100.0000          100.0000

For the host and device arithmetic examples, we accumulate the products in several ways: 16-bit multiplicands with 16-bit accumulation, 32-bit multiplicands with 32-bit accumulation, 16-bit multiplicands with 32-bit accumulation, and 16-bit multiplicands explicitly promoted to 32 bits before the multiplication with 32-bit accumulation. We should note that on the host, arithmetic on real(2) data is actually performed in 32 bits and then converted back to 16 bits at each store.

3 The CUDA Fortran wmma module

Use of Tensor Cores through the WMMA API in CUDA Fortran requires the wmma module as well as the cuf_macros.CUF macro file. These provide tensor-core specific data types, along with routines to load and store data and perform warp-based matrix multiplications using these data types.

3.1 The WMMASubMatrix datatype

Tiles of matrices used by a warp of threads to perform matrix multiplication are stored in variables of the WMMASubMatrix datatype in device code. For those familiar with the CUDA C API to Tensor Cores, WMMASubMatrix corresponds to the fragment template. There are different WMMASubMatrix types based on use (i.e. which operand in D = A∗B+C), precision, storage order, and dimensions, which are specified as type parameters in CUDA Fortran. Typical declarations of WMMASubMatrix variables used in device code are:

      WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: a
      WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: b
      WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: c

where the first parameter indicates the operand, corresponding to the A matrix, B matrix, or the accumulator. The following three parameters are the tile sizes, in this case (m, n, k) = (16, 16, 16). Allowable values for (m, n, k) are (16, 16, 16), (32, 8, 16), and (8, 32, 16). As a result, the accumulator submatrix (m × n) can be 16 × 16, 32 × 8, or 8 × 32 and always contains 256 elements. The A submatrix (m × k) can be 16 × 16, 32 × 16, or 8 × 16, and the B submatrix (k × n) can be 16 × 16, 16 × 8, or 16 × 32, so each can contain 128, 256, or 512 elements. The datatype Real is specified next, and is currently the only allowed value. For WMMAMatrixA and WMMAMatrixB matrices one needs to specify row- or column-major ordering, and the only precision currently supported for these is 16-bit. The accumulator, WMMAMatrixC, can currently be either 16- or 32-bit precision, as specified in the last parameter. The storage order for accumulators is not specified at declaration.
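
For example, if we wanted to use the (32, 8, 16) tile shape instead, the declarations would change accordingly. The following is a sketch of such declarations (this shape is not used in any of the examples below):

      WMMASubMatrix(WMMAMatrixA, 32, 8, 16, Real, WMMAColMajor) :: a
      WMMASubMatrix(WMMAMatrixB, 32, 8, 16, Real, WMMAColMajor) :: b
      WMMASubMatrix(WMMAMatrixC, 32, 8, 16, Real, WMMAKind4) :: c

Here the A submatrix is 32×16, the B submatrix is 16×8, and the accumulator is 32×8.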

3.2 Compilation

When compiling code utilizing Tensor Cores, a compute capability of at least 7.0 and a CUDA runtime version of at least 9.0 should be used, either implicitly or explicitly via -Mcuda=cc70,cuda9.0. In addition, the preprocessor must be invoked due to the included macro file cuf_macros.CUF, either explicitly through the -Mpreprocess compiler option or implicitly by using an uppercase file extension such as .CUF or .F90.
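
For example, a source file with a lowercase extension might be compiled with a command line along the following lines (the file name here is illustrative):

% pgf90 -Mcuda=cc70,cuda10.1 -Mpreprocess -O3 -o wmmaExample wmmaExample.cuf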

4 WMMA Programming Basics

In this section we present a sequence of small example programs illustrating the concepts of WMMA programming in CUDA Fortran. We begin with a code that launches a kernel using a single warp of threads that performs a matrix multiplication of two 16×16 matrices. We then add complexity with each example until we are in a position to address performance issues, which is the subject of the next section.

4.1 Example 1: 16×16×16 matrix multiply

The simplest possible matrix multiply using WMMA is the operation C = A ∗ B where all the matrices have the dimensions 16×16. The kernel code for this is:

 1  #include "cuf_macros.CUF"
 2
 3  module m
 4    integer, parameter :: wmma_m = 16
 5    integer, parameter :: wmma_n = 16
 6    integer, parameter :: wmma_k = 16
 7  contains
 8
 9    ! kernel for 16x16 matrices (a, b, and c) using wmma
10    ! Should be launched with one block of 32 threads
11
12    attributes(global) subroutine wmma_single(a, b, c)
13      use wmma
14      implicit none
15      real(2), intent(in) :: a(wmma_m,*), b(wmma_k,*)
16      real(4) :: c(wmma_m,*)
17      WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa
18      WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb
19      WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc
20      integer :: lda, ldb, ldc
21
22      lda = wmma_m
23      ldb = wmma_k
24      ldc = wmma_m
25
26      sc = 0.0_4
27      call wmmaLoadMatrix(sa, a(1,1), lda)
28      call wmmaLoadMatrix(sb, b(1,1), ldb)
29      call wmmaMatMul(sc, sa, sb, sc)
30      call wmmaStoreMatrix(c(1,1), sc, ldc)
31
32    end subroutine wmma_single
33
34  end module m

which is launched with a single warp of threads:

62      call wmma_single<<<1,32>>>(ah_d, bh_d, c_d)

In this example we use a 32-bit accumulator for the C matrix, in contrast to the 16-bit representations for the A and B matrices. For this case the matrices passed in as arguments to the kernel are the same size as the WMMA submatrices, so to perform the matrix multiplication we simply initialize the C WMMA submatrix to 0.0, load the A and B matrices from global memory into WMMA submatrices, perform the matrix multiplication on the submatrices, and store the result from the WMMA submatrix to global memory.

You may have noticed that the thread index threadIdx does not appear at all in this code. This underscores the important concept to take away from this example: the threads in a warp work collectively to accomplish these tasks. So when dealing with the WMMA submatrices, we are doing warp-level programming rather than thread-level programming. This kernel is launched with a single warp of 32 threads, yet each of our WMMA submatrices has 16x16 or 256 elements. So when the initialization statement:

26      sc = 0.0_4

is executed, each thread sets 8 elements in the 16x16 submatrix to zero. (For those familiar with the CUDA C Tensor Core API, assignment to WMMA submatrices is overloaded to invoke the fill_fragment() method.) The mapping of threads to submatrix elements is opaque for this and other operations involving WMMA submatrices; from a programming standpoint we only address what happens collectively by a warp of threads on WMMA submatrices.

The statements that load the A and B tiles from global memory into WMMA submatrices:

27      call wmmaLoadMatrix(sa, a(1,1), lda)
28      call wmmaLoadMatrix(sb, b(1,1), ldb)

also work collectively. In these calls the WMMA submatrices are specified as the first argument, and the second arguments contain the addresses of the upper left element of the tiles in global (or shared) memory to be loaded into the WMMA submatrices. The leading dimension of the matrices in global (or shared) memory is the third argument. Note that the arguments passed to wmmaLoadMatrix() are the same for all threads in the warp. Because the mapping of elements to threads in a warp is opaque, each thread just passes the address of the first element of the 16x16 tile along with the leading dimension as the third parameter, and the load operation is distributed amongst the threads in the warp.

The matrix multiplication on the WMMA submatrices is performed by the statement:

29      call wmmaMatMul(sc, sa, sb, sc)

which is again performed collectively by a warp of threads. Here we have used the same accumulator submatrix for the first and last arguments in the wmmaMatMul() call, which is why its initialization to zero was required.

The wmmaStoreMatrix() call:

30      call wmmaStoreMatrix(c(1,1), sc, ldc)

is analogous to the prior wmmaLoadMatrix() calls, but here the first argument is the address of the upper left element of the tile in global (or shared) memory and the second argument is the WMMA submatrix whose values are stored. When both wmmaLoadMatrix() and wmmaStoreMatrix() are called with accumulator (WMMAMatrixC) arguments, there is an optional fourth argument that specifies the storage order; recall that declarations for WMMA accumulators do not specify storage order. In CUDA Fortran the default is WMMAColMajor, or column-major storage order.
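
For illustration, a hypothetical variant of the store on line 30 that writes the accumulator in row-major order would pass the layout explicitly. This is a sketch, assuming a WMMARowMajor constant analogous to the WMMAColMajor constant used in the declarations:

      ! hypothetical: store the accumulator tile in row-major order
      call wmmaStoreMatrix(c(1,1), sc, ldc, WMMARowMajor)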

One final note on arguments to the wmmaLoadMatrix() and wmmaStoreMatrix() routines: the leading dimension of the matrices, specified by the third argument of these routines, must be a multiple of 16 bytes (e.g. 8 real(2) words or 4 real(4) words). This will become important when we use shared memory for the input arrays, or more specifically when we pad shared memory arrays.

4.1.1 Accuracy revisited

Before moving on to our second Tensor Core example, it is worthwhile to add some code that performs the matrix multiplication in loops in the kernel, in addition to using the Tensor Cores, and compare the results. This new kernel is:

14    attributes(global) subroutine wmma_accuracy(a, b, c, d)
15      use wmma
16      implicit none
17      real(2), intent(in) :: a(wmma_m,*), b(wmma_k,*)
18      real(4) :: c(wmma_m,*), d(wmma_m,*)
19      WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa
20      WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb
21      WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc
22      integer :: lda, ldb, ldc, m, n, k, colOffset
23
24      lda = wmma_m
25      ldb = wmma_k
26      ldc = wmma_m
27
28      sc = 0.0_4
29      call wmmaLoadMatrix(sa, a(1,1), lda)
30      call wmmaLoadMatrix(sb, b(1,1), ldb)
31      call wmmaMatMul(sc, sa, sb, sc)
32      call wmmaStoreMatrix(c(1,1), sc, ldc)
33
34      ! matrix multiplication with explicit loops
35      m = threadIdx%x
36      colOffset = (m-1)/16
37      m = m - colOffset*16
38
39      do n = 1+colOffset, 16, 2
40         do k = 1, 16
41            d(m,n) = d(m,n) + real(a(m,k),4) * real(b(k,n),4)
42         enddo
43      enddo
44
45    end subroutine wmma_accuracy

where lines 35–43 perform the matrix multiplication of 16x16 matrices using the 32 threads in the warp, two columns of the resultant matrix at a time. Note that the real(2) multiplicands on line 41 are promoted to real(4) before the multiplication. In the host code we launch the kernel and then compare the c() and d() arrays from the kernel:

75      call wmma_accuracy<<<1,32>>>(ah_d, bh_d, c_d, d_d)
76      c = c_d
77      d = d_d
78      err = sum(abs(c-d))
79      if (err /= 0.0) then
80         write(*,*) 'L1 error = ', err
81      else
82         write(*,*) 'Test Passed '
83      end if

Compiling and running the code we have:

% pgf90 -Mcuda=cc70,cuda10.1 -O3 tensorCoreAccuracy.CUF
% ./a.out
Test Passed

So the WMMA matrix multiplication effectively promotes the real(2) multiplicands to real(4) before performing the multiplication.

4.2 Example 2: 16x16x64 matrix multiply

Our second example performs the matrix multiplication C = A * B where A is 16x64 and B is 64x16. A common theme in all of the examples in this section is that a warp of threads calculates a 16x16 tile of the resultant matrix. As such, this kernel is still launched with a single warp of threads. The kernel for this code is:

13    attributes(global) subroutine wmma_row(a, b, c, k)
14      use wmma
15      implicit none
16      real(2), intent(in) :: a(wmma_m,*), b(k,*)
17      real(4) :: c(wmma_m,*)
18      integer, value :: k
19      integer :: i
20
21      WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa
22      WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb
23      WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc
24      integer :: lda, ldb, ldc
25
26      lda = wmma_m
27      ldb = k
28      ldc = wmma_m
29
30      sc = 0.0_4
31      do i = 1, k, wmma_k
32         call wmmaLoadMatrix(sa, a(1,i), lda)
33         call wmmaLoadMatrix(sb, b(i,1), ldb)
34         call wmmaMatMul(sc, sa, sb, sc)
35      enddo
36
37      call wmmaStoreMatrix(c(1,1), sc, ldc)
38
39    end subroutine wmma_row

The only difference between this kernel and the kernel from the previous example is that the loading of the A and B submatrices and the matrix multiplication on these submatrices occur inside a loop, where each iteration performs a matrix multiplication on 16x16 tiles. Note once again that the second argument to wmmaLoadMatrix() is the same for all threads in the warp: it is the address of the first element of the 16x16 tile used in that iteration. The results are accumulated in the submatrix sc since the first and last arguments in wmmaMatMul() are the same.

4.3 Example 3: 32x32x32 matrix multiply

For this example we move past launching a kernel with a single warp of threads. We still use a single warp to calculate each 16x16 tile of the resultant matrix, but for a 32x32 resultant matrix we now launch the kernel with four warps of threads, for now with one warp per thread block:

54      integer, parameter :: m_blocks = 2, n_blocks = 2, k_blocks = 2
55      integer, parameter :: m = m_blocks * wmma_m, &
56           n = n_blocks * wmma_n, &
57           k = k_blocks * wmma_k
80      grid = dim3(m_blocks, n_blocks, 1)
81      call wmma_16x16<<<grid,32>>>(ah_d, bh_d, c_d, m, n, k)

The kernel code for this example is

11    attributes(global) subroutine wmma_16x16(a, b, c, m, n, k)
12      use wmma
13      implicit none
14      real(2), intent(in) :: a(m,*), b(k,*)
15      real(4) :: c(m,*)
16      integer , value :: m, n, k
17      integer :: i, row_t, col_t
18
19      WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa
20      WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb
21      WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc
22      integer :: lda, ldb, ldc
23
24      lda = m
25      ldb = k
26      ldc = m
27
28      row_t = (blockIdx%x-1) * wmma_m+1
29      col_t = (blockIdx%y-1) * wmma_n+1
30
31      sc = 0.0_4
32
33      do i = 1, k, wmma_k
34         call wmmaLoadMatrix(sa, a(row_t, i), lda)
35         call wmmaLoadMatrix(sb, b(i, col_t), ldb)
36         call wmmaMatMul(sc, sa, sb, sc)
37      enddo
38
39      call wmmaStoreMatrix(c(row_t, col_t), sc, ldc)
40
41    end subroutine wmma_16x16

Each thread block (consisting of a single warp of threads) calculates the results in the tile c(row_t:row_t+15, col_t:col_t+15). Since we launch this kernel with one warp per thread block, the row_t and col_t indices can be calculated from the blockIdx values:

28     row_t = (blockIdx%x-1) * wmma_m+1
29     col_t = (blockIdx%y-1) * wmma_n+1

While this kernel code is general and can be used for large matrices, with only 32 threads per block it is inefficient. The next example addresses this inefficiency.

4.4 Example 4: 32x32x32 single block matrix multiply

In this example we perform the same matrix multiplication as in example 3, but using a single block of four warps of threads rather than four blocks of one warp each. The kernel is launched with:

87    tpb = dim3(64,2,1)
88    grid = dim3((m-1)/tile_r+1,(n-1)/tile_c+1,1)
89    call wmma_32x32<<<grid,tpb>>>(ah_d, bh_d, c_d, m, n, k)

and the module containing the kernel is:

 3  module m
 4    integer, parameter :: wmma_m = 16
 5    integer, parameter :: wmma_n = 16
 6    integer, parameter :: wmma_k = 16
 7
 8    ! tile_r, tile_c are size of submatrix of C that
 9    !   gets calculated per block
10    integer, parameter :: tile_r = 32, tile_c = 32
11
12  contains
13
14    ! kernel where each block performs gemm for a 32 x32 submatrix of C
15    ! launch with four warps per block with blocksize of dim3(64 ,2)
16
17    attributes(global) subroutine wmma_32x32(a, b, c, m, n, k)
18      use wmma
19      use cudadevice
20      implicit none
21      real(2), intent(in) :: a(m,*) , b(k,*)
22      real(4) :: c(m ,*)
23      integer, value :: m, n, k
24
25      WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa
26      WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb
27      WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc
28      integer :: lda, ldb, ldc
29      type(dim3) :: warpIdx
30      integer :: i, row_t, col_t
31
32      lda = m
33      ldb = k
34      ldc = m
35
36      warpIdx%x = (threadIdx%x-1) / warpsize+1
37      warpIdx%y = threadIdx%y
38
39      row_t = (blockIdx%x-1) * tile_r + (warpIdx%x-1) * wmma_m+1
40      col_t = (blockIdx%y-1) * tile_c + (warpIdx%y-1) * wmma_n+1
41
42      sc = 0.0_4
43
44      do i = 1, k, wmma_k
45         call wmmaLoadMatrix(sa, a(row_t, i), lda)
46         call wmmaLoadMatrix(sb, b(i, col_t), ldb)
47         call wmmaMatMul(sc, sa, sb, sc)
48      enddo
49
50      call wmmaStoreMatrix(c(row_t, col_t), sc, ldc)
51
52    end subroutine wmma_32x32
53
54  end module m

The parameters tile_r=32 and tile_c=32 denote the size of the tile of C that gets calculated per thread block. It is convenient to define a warpIdx variable of type(dim3) that is analogous to threadIdx, since all of the WMMA operations are done on a per-warp basis. The threadIdx variable appears in the code, but is used only to calculate the warpIdx values. The row_t and col_t indices now depend on the warpIdx values as well as the blockIdx values, but aside from that the code is similar to the code in Example 3.

While this code addresses the low occupancy of the previous examples, it is inefficient in that it loads each 16x16 tile of A and B twice. This brings us to our next section on performance.

5 Performance of WMMA codes

In the section on Example 4 we noted the redundant loads of A and B submatrices. Before going further we should quantify the cost of performing loads from global memory to the WMMA submatrices (which live in registers), as well as the cost of the store operation from WMMA submatrices to global memory. The following code is an artificial benchmark that uses the same 16×16 A and B matrices to calculate their product in a tile of a larger C matrix. The kernel code for this benchmark is:

20    attributes(global) subroutine wmma_peak(a, b, c, m, n, niter)
21      use cudadevice
22      use wmma
23      implicit none
24      real(2) :: a(wmma_m,wmma_k), b(wmma_k,wmma_n)
25      real(4) :: c(m,n)
26      integer , value :: m, n, niter
27
28      WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa
29      WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb
30      WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: sc
31      type(dim3) :: warpIdx
32      integer :: i, row_t, col_t
33
34      warpIdx%x = (threadIdx%x-1) / warpsize+1
35      warpIdx%y = threadIdx%y
36      row_t = (blockIdx%x-1) * tile_r + (warpIdx%x-1) * wmma_m+1
37      col_t = (blockIdx%y-1) * tile_c + (warpIdx%y-1) * wmma_n+1
38
39      sc = 0.0_4
40      call wmmaLoadMatrix(sa, a, wmma_m)
41      call wmmaLoadMatrix(sb, b, wmma_k)
42      do i = 1, niter
43         call wmmaMatMul(sc, sa, sb, sc)
44      enddo
45      call wmmaStoreMatrix(c(row_t, col_t), sc, m)
46
47    end subroutine wmma_peak
48
49  end module m

where we launch this kernel multiple times with different values of niter, the number of matrix multiplications performed per load/store of the input/output matrices. CUDA events are used to time the kernel and the resulting teraflops are reported. Compiling and running the code we have:

% pgf90 -Mcuda=cc70,cuda10.1 -O3 -o tensorCorePeak tensorCorePeak.CUF
% ./tensorCorePeak
 Device: Tesla V100-PCIE-16GB
  
 niter, Tensor Core TFlops
            1    2.433460
            2    5.626374
            4    11.41901
            8    22.18852
           16    40.35468
           32    66.48002
           64    77.92628
          128    85.07854
          256    89.51171
          512    91.10131
         1024    92.56416
         2048    93.14177
         4096    93.44764
         8192    93.58527
        16384    93.65218
        32768    102.6219

For large values of niter the cost of loading and storing matrices is largely amortized away, and we observe wmmaMatMul() performing at above 100 TFlops. However, if there is no reuse of the A and B submatrices, so that each load is used in only one wmmaMatMul() call, the kernel operates at under 3 TFlops. As a result, the first step in achieving good performance is to reuse the input matrices as much as possible, which brings us to the next example.
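
To make the timing procedure concrete, the following is a sketch of a host-side driver for this benchmark. Only the wmma_peak kernel and the module parameters are taken from the listing above; the matrix sizes, array names, and launch configuration here are illustrative assumptions rather than the actual tensorCorePeak.CUF host code:

  program peak_timing
    ! illustrative driver: time wmma_peak for increasing niter with CUDA
    ! events and report TFlops (sizes and names here are assumptions)
    use cudafor
    use m               ! module containing wmma_peak and the wmma_*/tile_* parameters
    implicit none
    integer, parameter :: mm = 1024, nn = 1024   ! assumed size of the C matrix
    real(2), device :: a_d(wmma_m,wmma_k), b_d(wmma_k,wmma_n)
    real(4), device :: c_d(mm,nn)
    type(dim3) :: grid, tpb
    type(cudaEvent) :: startEvent, stopEvent
    real :: time_ms, tflops
    integer :: istat, niter

    a_d = 1.0_2; b_d = 1.0_2
    tpb  = dim3(64, 2, 1)                        ! four warps per block
    grid = dim3(mm/tile_r, nn/tile_c, 1)
    istat = cudaEventCreate(startEvent)
    istat = cudaEventCreate(stopEvent)

    niter = 1
    do while (niter <= 32768)
       istat = cudaEventRecord(startEvent, 0)
       call wmma_peak<<<grid,tpb>>>(a_d, b_d, c_d, mm, nn, niter)
       istat = cudaEventRecord(stopEvent, 0)
       istat = cudaEventSynchronize(stopEvent)
       istat = cudaEventElapsedTime(time_ms, startEvent, stopEvent)
       ! each pass over the C matrix performs 2*mm*nn*wmma_k flops
       tflops = 2.0*real(mm)*real(nn)*real(wmma_k)*real(niter) &
            / (time_ms*1.0e-3) / 1.0e12
       write(*,*) niter, tflops
       niter = niter*2
    enddo
  end program peak_timing

Error checking of the istat return codes and verification of c_d are omitted here for brevity.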

5.1 Example 5: Shared memory GEMM

From this point on in our discussion, since we are measuring performance, we are no longer using small example matrices as in the previous section. We choose values of m, n, and k to be 6,400. To illustrate our shared memory strategy we start with a kernel where each thread block calculates a 64×32 element tile of C, still using 16×16 WMMA submatrices. The parameters for this case are:

 4      integer, parameter :: wmma_m = 16
 5      integer, parameter :: wmma_n = 16
 6      integer, parameter :: wmma_k = 16
 7
 8      ! tile_r, tile_c are size of submatrix of C
 9      ! that gets calculated per block
10      integer, parameter :: tile_r = 64, tile_c = (tile_r/32) * wmma_n
11
12      ! Number of wmma_m block rows in A shared memory tile
13      integer, parameter :: blockRows_as = tile_r / wmma_m

where the expression for tile_c evaluates to 32.


Figure 1: Depiction of the tiles used by a thread block to calculate a 64×32 tile of the C matrix. The 64×16 tile of A is stored in shared memory, and each warp of threads loads a 16×16 submatrix of B, and uses its B submatrix to calculate a 64×16 tile of C. As a result, the submatrices of A and B are loaded from global memory only once for the thread block.

Our strategy for optimizing reuse of data in WMMA submatrices is to load data used for A submatrices into shared memory, and for each warp of threads to load a single 16×16 submatrix of B and use that to calculate a 64×16 column of the C tile that is calculated in the thread block, as depicted in Figure 1. As such, for the 64×32 tile used here we launch this kernel with two warps of threads. In the kernel, our shared memory and WMMA submatrix declarations are:

31      real(2), shared :: a_s(tile_r, wmma_k)
32      WMMASubMatrix(WMMAMatrixA, 16, 16, 16, Real, WMMAColMajor) :: sa
33      WMMASubMatrix(WMMAMatrixB, 16, 16, 16, Real, WMMAColMajor) :: sb
34      WMMASubMatrix(WMMAMatrixC, 16, 16, 16, Real, WMMAKind4) :: &
35      sc(blockRows_as)

Since each warp of threads is calculating a 64×16 tile of matrix C, we need an array of 4 (blockRows_as) submatrix accumulators. The kernel code performs the following index calculations prior to entering the main loop over the k dimension:

46      ! row and column indices to the first element
47      !   in the tile_r x tile_c tile of C
48      row_t = (blockIdx%x-1)*tile_r + 1
49      col_t = (blockIdx%y-1)*tile_c + 1
50
51      ! C column index for each warp
52      col_w = col_t + (warpIdx-1)*wmma_n

where col_w is the column index used to load the warp’s B matrix. After initializing the array of accumulator submatrices, the main loop over the k dimension and loop to store the resultant C submatrices are:

58      do i = 1, k, wmma_k
59         ! load the tile_r x 16 tile of A into shared memory,
60         !   the block of tile_r threads loads one column per
61         !   iteration in the j-loop
62         call syncthreads()
63         do j=1, wmma_k
64            a_s(threadIdx%x,j) = a(row_t-1+threadIdx%x, i-1+j)
65         enddo
66
67         ! load B into wmma submatrices,
68         !   each warp gets a different 16x16 block
69         call wmmaLoadMatrix(sb, b(i,col_w), ldb)
70
71         ! for a_s
72         call syncthreads()
73
74         ! loop loading A wmma submatrices from shared memory
75         !   and performing matmul
76         do j = 1, blockRows_as
77            call wmmaLoadMatrix(sa, a_s(1+(j-1)*wmma_m, 1), tile_r)
78            call wmmaMatMul(sc(j), sa, sb, sc(j))
79         enddo
80      end do
81
82      do j = 1, blockRows_as
83         call wmmaStoreMatrix(c(row_t+(j-1)*wmma_m, col_w), sc(j), ldc)
84      end do

where for each iteration over the k dimension the 64×16 A tile is loaded into shared memory (lines 63-65). Because we want to keep the indexing for loading A into shared memory as simple as possible, namely that the block of threads loads one column of values into the shared memory tile during each iteration of the loop on lines 63-65, we are essentially fixing the thread block size to be tile_r (which is how the block size is determined in host code). Consequently, this arrangement determines the number of warps launched per thread block and thus determines the value of tile_c:

10      integer, parameter :: tile_r = 64, tile_c = (tile_r/32)*wmma_n

rather than being chosen independently of the other parameters. Back in the main loop of the kernel, after issuing loads for the shared memory A tile, each warp of threads loads a B WMMA submatrix (using the col_w index). A thread synchronization ensures all the elements in a_s are available to all threads, after which the code loops over block rows, loading the A WMMA submatrices from shared memory and performing the matrix multiplications. After the main loop over the k dimension completes, the results in the sc(j) submatrices are stored to global memory (lines 82-84).
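
To make the launch configuration concrete, a host-side launch for this kernel might look like the following sketch, in which the kernel name wmma_shared is a placeholder for the name used in tensorCoreShared.CUF and the device arrays are named as in the earlier examples:

        ! one block of tile_r threads (tile_r/32 warps) per tile_r x tile_c tile of C
        tpb  = dim3(tile_r, 1, 1)
        grid = dim3((m-1)/tile_r+1, (n-1)/tile_c+1, 1)
        call wmma_shared<<<grid,tpb>>>(ah_d, bh_d, c_d, m, n, k)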

When we compile and run this code we obtain the following:

% pgf90 -Mcuda=cc70,cuda10.1 -O3 -o tensorCoreShared \
  tensorCoreShared.CUF
% ./tensorCoreShared
 Device: Tesla V100-PCIE-16GB
 M = 6400, N = 6400, K = 6400
 tile_r = 64, tile_c = 32

 Test passed, TFlops:      16.14157

Note that for this case the kernel is launched with a thread block of 64 threads. We can get better utilization by changing tile_r to 128, which effectively doubles the thread block size and quadruples the size of the C tile calculated per thread block. Running the code with these parameters we obtain:

% ./tensorCoreShared
 Device: Tesla V100-PCIE-16GB
 M = 6400, N = 6400, K = 6400
 tile_r = 128, tile_c = 64

 Test passed, TFlops:     19.06848

which is a bit better performance. But a more substantial increase can be obtained by addressing memory bank conflicts.

5.1.1 Shared memory padding

A common optimization technique when using shared memory in CUDA is to ensure that shared memory bank conflicts are avoided. Shared memory is divided into equally sized banks that, in the best case, can be accessed simultaneously; if multiple addresses of a memory request map to the same bank, those accesses are serialized. The fix for such a case can be relatively simple: just pad the leading dimension of the shared memory array. For the case of shared memory arrays used to load WMMA submatrices, there is an alignment requirement that any padding must be a multiple of 16 bytes, or 8 real(2) words.

Changes to the device code are minimal, and include declaring a padding parameter:

15      ! Shared memory padding
16      integer, parameter :: pad_s = 16

and using the parameter in the declaration of the shared memory array:

34      real(2), shared :: a_s(tile_r+pad_s, wmma_k)

as well as adjusting the last argument (the leading dimension) to the wmmaLoadMatrix() call:

80         call wmmaLoadMatrix(sa, a_s(1+(j-1)*wmma_m, 1), &
81              tile_r+pad_s)

After implementing the above changes (in the tensorCoreSharedPad.CUF file), this code achieves the following:

% ./tensorCoreSharedPad
 Device: Tesla V100-PCIE-16GB
 M = 6400, N = 6400, K = 6400
 tile_r = 128, tile_c = 64
 pad_s = 16

 Test passed, TFlops:     41.02449

which represents a substantial increase in performance over the previous case.

Since the mapping of device threads to elements in the WMMA submatrix is opaque, we cannot use access patterns to determine how much shared memory padding is optimal, and need to resort to an empirical approach. Padding by 16 half-precision elements is optimal for this kernel with these parameters, but when changing the kernel or parameters the choice of the amount of padding should be revisited.

The use of padding is one reason why we chose to place the A tile rather than the B tile in shared memory. A value of pad_s=16 would double the amount of shared memory used by a block of threads if the B tile were placed in shared memory, whereas padding increases the shared memory usage for the A tile by only 16×16 or 256 half-precision words.

5.2 Example 6: Vectorized Memory Access

The load and store operations used in wmmaLoadMatrix() and wmmaStoreMatrix() are well optimized. The transfer of half-precision data from global to shared memory, however, can be improved using vectorized memory access.

For every copy operation, several instructions are issued which calculate the source and destination addresses as well as perform the loads and stores. We can amortize the cost of these instructions by having the loads and stores operate on a larger word size. In CUDA C, there are vector types such as float2 that facilitate vectorized memory access. In CUDA Fortran we can use several techniques to essentially "cast" or "equivalence" the data to larger word sizes.

The following device subroutine copies an m × n tile of real(2) data from global to shared memory using real(4) words:

28    attributes(device) subroutine vectorCopyHalfAsync &
29         (a_s, lda_s, a, lda, m, n, cg)
30      use cooperative_groups
31      implicit none
32
33      !pgi$ ignore_tkr a, a_s
34      ! destination real(2) shared memory array "cast" as real(4)
35      real(4) :: a_s(lda_s/vecLen, *)
36
37      ! source real(2) global array "cast" as real(4)
38      real(4), intent(in) :: a(lda/vecLen,*)
39
40      ! leading dimension of destination array (units of real(2) words)
41      integer :: lda_s
42
43      ! leading dimension of source array (units of real(2) words)
44      integer :: lda
45
46      ! mxn block to be transferred (units of real(2) words)
47      integer :: m, n
48
49      ! currently thread block,
50      !   in the future this can be a more general cooperative group
51      type(thread_group) :: cg
52
53      integer :: ir, ic
54
55      if (m/vecLen == cg%size) then
56         ir = cg%rank
57         do ic = 1, n
58            a_s(ir, ic) = a(ir, ic)
59         enddo
60      end if
61
62    end subroutine vectorCopyHalfAsync

This routine is called with real(2) arrays for a() and a_s(), which is permissible due to the !pgi$ ignore_tkr directive. The leading dimensions are adjusted in the declarations by the parameter vecLen=2. To keep the indexing as simple as possible (for performance reasons), a block of threads copies a column of real(4) words during each iteration of the loop. This means that for the same thread block size (128) the A shared memory tile doubles in terms of half-precision words, to 256×16, as reflected in the parameter declarations:

 8      ! vector size used in loading shared memory. If you change this,
 9      !   change type/kind of a() and a_s() in vectorCopyHalfAsync
10      integer, parameter :: vecLen = 2
11
12      ! tile_r, tile_c are size of submatrix of C
13      !   that gets calculated per block
14      integer, parameter :: tile_r = 256, &
15           tile_c = (tile_r/vecLen/32)*wmma_n
16
17      ! Number of wmma_m block rows in A shared memory tile
18      integer, parameter :: blockRows_as = tile_r/wmma_m
19
20      ! Shared memory padding
21      integer, parameter :: pad_s = 16

The kernel code remains unchanged except for the call to vectorCopyHalfAsync(), which replaces the loop that previously performed the data copy from global to shared memory:

107     do i = 1, k, wmma_k
108        ! load A into shared memory,
109        call syncthreads(tg)
110        call vectorCopyHalfAsync(a_s, tile_r+pad_s, a(row_t,i), m, &
111             tile_r, wmma_k, tg)
112
113        ! load B into wmma submatrices,
114        !   each warp gets a different 16x16 block
115        call wmmaLoadMatrix(sb, b(i,col_w), ldb)
116
117        ! for a_s
118        call syncthreads(tg)
119
120        ! loop loading A wmma submatrices from shared memory
121        !   and performing matmul
122        do j = 1, blockRows_as
123           call wmmaLoadMatrix(sa, a_s(1+(j-1)*wmma_m, 1), &
124                tile_r+pad_s)
125           call wmmaMatMul(sc(j), sa, sb, sc(j))
126        enddo
127     end do

Compiling and running this code we obtain:

% pgf90 -Mcuda=cc70,cuda10.1 -O3 -o tensorCoreVector \
  tensorCoreVector.CUF
% ./tensorCoreVector
 Device: Tesla V100-PCIE-16GB
 M = 6400, N = 6400, K = 6400
 tile_r = 256, tile_c = 64
 pad_s = 16
 vecLen = 2
 Test passed, TFlops:      52.53718

which is a significant improvement over the nonvectorized memory access performance.

5.2.1 Vectorized memory access with Cray pointers

Another way to accomplish vectorized memory access in CUDA Fortran is through Cray pointers. By declaring real(4) pointers:

39      ! Cray pointers for vectorized access of a and a_s
40      real(4) :: a4(m/vecLen,*)
41      real(4) , shared :: a4_s((tile_r+pad_s)/vecLen, *)
42      pointer(xa4, a4)
43      pointer(xa4_s, a4_s)

and associating them with our real(2) arrays:

59      ! associate Cray pointers
60      xa4 = loc(a(1,1))
61      xa4_s = loc(a_s(1,1))

we can copy data from global to shared memory using real(4) words. After declaring an additional index row_tv to accommodate the vector length:

63      ! row and column indices to the first element
64      ! in the tile_r x tile_c tile of C
65      row_t = (blockIdx%x-1)*tile_r + 1
66      col_t = (blockIdx%y-1)*tile_c + 1
67
68      ! vector version of row_t index
69      row_tv = (blockIdx%x-1)*tile_r/vecLen + 1

we can perform the vector copy to shared memory within the main loop of the kernel without needing to call a device function:

 78     do i = 1, k, wmma_k
 79        ! load A into shared memory,
 80        call syncthreads()
 81        do j = 1, wmma_k
 82           a4_s(threadIdx%x,j) = &
 83                a4(row_tv-1+threadIdx%x, i-1+j)
 84        enddo
 85
 86        ! load B into wmma submatrices,
 87        !   each warp gets a different 16x16 block
 88        call wmmaLoadMatrix(sb, b(i,col_w), ldb)
 89
 90        ! for a_s
 91        call syncthreads()
 92
 93        ! loop loading A wmma submatrices from shared memory
 94        !   and performing matmul
 95        do j = 1, blockRows_as
 96           call wmmaLoadMatrix(sa, a_s(1+(j-1)*wmma_m, 1), &
 97                tile_r+pad_s)
 98           call wmmaMatMul(sc(j), sa, sb, sc(j))
 99        enddo
100     end do

Note that access to the A tile is through the real(4) pointers only when copying the data from global to shared memory; elsewhere the real(2) arrays are used.

Compiling and running this code we obtain the same performance, within run-to-run variations:

% pgf90 -Mcuda=cc70,cuda10.1 -O3 -o tensorCoreVectorCP \
  tensorCoreVectorCP.CUF
% ./tensorCoreVectorCP
 Device: Tesla V100-PCIE-16GB
 M = 6400, N = 6400, K = 6400
 tile_r = 256, tile_c = 64
 pad_s = 16
 vecLen = 2

 Test passed, TFlops:      51.76673

6 Next steps

The sequence of example codes used in this paper ranged from those illustrating basic concepts of Tensor Core programming to those performing GEMM on half-precision data at above 50 TFlops. The performance emphasis in the latter part of the paper is not meant to result in an optimal code, but rather to demonstrate what issues need to be addressed (e.g. data reuse, shared memory, vector access) as you develop Tensor Core codes. WMMA tile sizes, vector lengths, and thread block sizes are just a few of the parameters that can be further explored. There are a few things to be aware of when exploring this parameter space. When changing the WMMA tile sizes, one needs to change both the wmma_m and wmma_n parameters (wmma_k is always 16) and, in addition, the tile size parameters in the WMMASubMatrix declarations. When changing tile_r and tile_c, be aware of register utilization and register spills. Compiling with -Mcuda=ptxinfo will report register usage and register spills, although the presence of the latter will be very obvious from the resulting performance.

One issue we did not touch on is whether 16 bits of floating-point precision is sufficient for your needs; that assessment can be made using the real(2) support in host code before jumping into Tensor Core programming.
