Technical News from The Portland Group

CUDA 5 Features in PGI CUDA Fortran 2013

The 2013 release of PGI CUDA Fortran introduces support for many of the new CUDA 5.0 features. This article briefly describes how you can take advantage of these features. Over the course of the article, we'll show several examples of matrix multiplication using double-precision data types.

Separate Compilation of Device Code in CUDA

An excellent set of software packages I've used over the years is the collection of multiple-precision libraries from David Bailey et al. With CUDA 5.0, it is now easy to incorporate community Fortran libraries like these into your CUDA Fortran programs.

Here is a simple matrix multiply example that we've shown many times in the past. The global subroutine in CUDA Fortran looks like this:

    attributes(global) subroutine dgemm16(a, b, c, m, n, k)
        integer, value :: m, n, k
        real(8) :: a(m,*), b(k,*), c(m,*)
        real(8), shared, dimension(17,16) :: bs
        real(8), device  :: cloc(16), ax

        inx = threadidx%x
        iny = threadidx%y
        ibx = (blockidx%x-1) * 16
        iby = (blockidx%y-1) * 16

        ia = ibx + (iny-1)*16 + inx
        ib = inx
        ic = ia

        jb = iby + iny
        jc = iby + 1

        do i = 1, 16
            cloc(i) = 0.0d0
        end do

        do ik = 1, k, 16
            bs(iny,inx) = b(ib,jb)
            call syncthreads()

            do j = 1, 16
                ax = a(ia,ik+j-1)
                do i = 1, 16
                    cloc(i) = cloc(i) + ax * bs(i,j)
                end do
            end do

            ib = ib + 16
            call syncthreads()
        end do

        do i = 1, 16
            c(ic,jc+i-1) = cloc(i)
        end do
        call syncthreads()
    end subroutine

This kernel uses a small (padded) shared memory array to cache a tile of B among the threads in the thread block, and each thread reads a value of A from global memory and reuses it 16 times in the inner loop. It is a simple implementation and performs reasonably well.

For this article, I've stacked the deck against this kernel and used poorly conditioned data: randomly ordered sets of positive and negative values with magnitudes between 2**(-127) and 2**128 (roughly 10**(+/-38)). With exact arithmetic, the numbers should cancel out and give a result of 0.0d0 for each element of C. Instead, for a set of 512x512 matrices, we see that about half the values are nonzero:

% ./a.out
       137157  errors were encountered
 Max error was    7.5557859E+22
 Ave error was    6.2187533E+20
512x512 * 512x512:         1.135 ms      236.470 GFlops/s
 ### C(1,1)=   5.3362915329711724E-039

Compared to the largest number in the dataset, the results are good to 16 digits, but still, they are quite a ways from the 0.0 we want.
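The underlying effect is language independent. Here is a minimal Python sketch (my own illustration, not part of the CUDA code) of the same problem: summing values of widely varying magnitude in plain double precision loses the small contributions, so a sum that should be exactly zero isn't.

```python
# Catastrophic cancellation in plain double precision.
# The four values sum to exactly 0 in real arithmetic.
values = [1e16, 1.0, -1e16, -1.0]

total = 0.0
for v in values:
    total += v  # 1e16 + 1.0 rounds back to 1e16, losing the 1.0

print(total)  # prints -1.0, even though the exact answer is 0
```

Scale this up to a 512x512 dot product and you get exactly the kind of residue shown in the output above.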

Now, let's get back to separate compilation. I put together a very quick port of David Bailey's ddfun90 library to CUDA Fortran. The ddfun90 library performs double-double arithmetic utilizing two real*8 values held in a derived type. The library uses generic interfaces, and once built, can be enabled very quickly in your F90 code.

There are certain constructs in the library (notably, error handling using Fortran I/O) that I've left for another day. For now, I'm concentrating on double-double add, subtract, multiply, and divide. For these routines, I modified the library file ddfun90 to create the device entry points, adding attributes(device) to the functions and subroutines I want to call from device code:

attributes(device) subroutine ddadd (dda, ddb, ddc)

!   This subroutine computes ddc = dda + ddb.

implicit none
real*8 dda(2), ddb(2), ddc(2)
real*8 e, t1, t2

!   Compute dda + ddb using Knuth's trick.

t1 = dda(1) + ddb(1)
e = t1 - dda(1)
t2 = ((ddb(1) - e) + (dda(1) - (t1 - e))) + dda(2) + ddb(2)

!   The result is t1 + t2, after normalization.

ddc(1) = t1 + t2
ddc(2) = t2 - (ddc(1) - t1)
end subroutine
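Knuth's trick itself is language independent. Here is a small Python sketch (the function name is mine) of the same two-sum computation ddadd performs on the high words, showing how the error term recovers exactly what rounding discards:

```python
def two_sum(a, b):
    """Knuth's two-sum: returns (s, err) where s = fl(a + b)
    and a + b == s + err exactly (for finite doubles)."""
    s = a + b
    e = s - a
    err = (b - e) + (a - (s - e))
    return s, err

# 1e16 + 1.0 rounds to 1e16; the error term captures the lost 1.0
s, err = two_sum(1e16, 1.0)
print(s, err)  # s == 1e16, err == 1.0
```

ddadd applies this to the high words, then folds the low words into the error term and renormalizes the pair.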

Similarly, in ddmod90, I added the device attribute to the overloaded procedures I wanted to use:

interface operator (+)
  module procedure dd_addqq
  . . .
end interface

  attributes(device) function dd_addqq (qa, qb)
    implicit real*8 (d), &
      type (dd_real) (q), complex (kdb) (x), type (dd_complex) (z)
    type (dd_real):: dd_addqq
    intent (in):: qa, qb
    call ddadd (qa%ddr, qb%ddr, dd_addqq%ddr)
  end function

To take advantage of separate compilation with CUDA Fortran, use the rdc flag. This flag, carried over from nvcc, stands for "relocatable device code".

% pgf90 -c -O2 -Mcuda=rdc ddfun90.cuf ddmod90.cuf

The generated .o files can be used like any other object file, or even put into a library:

% ar rc ddfunc.a ddfun90.o ddmod90.o

Now I need to modify my original program to use the extended-precision arithmetic. Thanks to the overloaded operators, this is pretty easy. First, add a use statement for ddmodule:

MODULE simple_dgemm

use ddmodule ! <- defines entry points, data types, and global data
             !    for use in contained subprograms

And then change the type of the shared and local variables:

    type(dd_real), shared, dimension(17,16) :: bs
    type(dd_real), device  :: cloc(16), ax

That's it. Arithmetic previously done in double precision, such as:

    cloc(i) = cloc(i) + ax * bs(i,j)

is now done in double-double precision due to the type changes. You can build the driver program and link in either the .o files or .a archive as usual. Just remember to add the rdc option to ‑Mcuda.

% pgf90 -O2 -Mcuda=rdc dgemmpaper.cuf ddfun90.o ddmod90.o 


% pgf90 -O2 -Mcuda=rdc dgemmpaper.cuf ddfunc.a           

As you might suspect, now when we run it the performance is much slower.

% ./a.out
 Test passed!
512x512 * 512x512:       113.290 ms        2.369 GFlops/s
 ### C(1,1)=    0.000000000000000     
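To build intuition for why the result is now exactly zero, here is a toy Python analogue of dd_real (my own sketch, not PGI or Bailey code): a double-double type that overloads +, so existing summation code works unchanged while the earlier cancellation example now comes out exact.

```python
class DD:
    """Toy double-double value represented as an unevaluated sum hi + lo."""
    def __init__(self, hi, lo=0.0):
        self.hi, self.lo = hi, lo

    def __add__(self, other):
        # Same scheme as ddadd: two-sum on the high words, then
        # fold in the low words and renormalize the pair.
        t1 = self.hi + other.hi
        e = t1 - self.hi
        t2 = ((other.hi - e) + (self.hi - (t1 - e))) + self.lo + other.lo
        hi = t1 + t2
        lo = t2 - (hi - t1)
        return DD(hi, lo)

# The sum that was nonzero in plain double precision:
acc = DD(0.0)
for v in [1e16, 1.0, -1e16, -1.0]:
    acc = acc + DD(v)
print(acc.hi, acc.lo)  # 0.0 0.0
```

Because + is overloaded on the type, the accumulation loop did not change at all; only the declarations did, exactly as in the CUDA Fortran kernel above.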

But look: instead of generating bad answers quickly, we are now generating reliably good answers, and in the process we've learned just how easy it is to take advantage of third-party libraries with CUDA 5.0. Of course, the input data could be changed so that even the double-double arithmetic produces incorrect results. Bailey's page has a link to another package, mpfun90, that could probably address most of those issues too. The interfaces in that library are similar in nature to ddfun90.

The rdc flag by itself implies CUDA 5.0. In other words, ‑Mcuda=rdc is equivalent to ‑Mcuda=cuda5.0,rdc; rdc is not supported with CUDA 4.2. Another CUDA restriction is that separate compilation only works on compute capability 2.0 (Fermi) and above.

If you're interested, you can download this entire code (minus ddfun90) from the PGI website.

Dynamic Parallelism of Device Code in CUDA

PGI 2013 also adds support for dynamic parallelism in CUDA Fortran. This means a GPU kernel can launch one or more sub-kernels, and has access to a limited set of the CUDA API by which to control them.

The original 512x512 matrix multiply above can be broken up into eight sub-matrix multiplies:

        | C11 C12 |  = | A11 A12 |  *  | B11 B12 |
        | C21 C22 |    | A21 A22 |     | B21 B22 |

    C11 = A11*B11 + A12*B21
    C12 = A11*B12 + A12*B22
    C21 = A21*B11 + A22*B21
    C22 = A21*B12 + A22*B22
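As a quick sanity check on the decomposition, here is a plain Python sketch (helper names are mine; it is independent of the CUDA code) verifying that a half-size block product combined per the formulas above reproduces the corresponding quadrant of the full product:

```python
def matmul(A, B):
    """Naive dense matrix multiply on lists of lists."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block(M, i, j, h):
    """h-by-h sub-block of M starting at row i, column j."""
    return [row[j:j + h] for row in M[i:i + h]]

n, h = 4, 2
A = [[float(i * n + j + 1) for j in range(n)] for i in range(n)]
B = [[float((i + j) % n) for j in range(n)] for i in range(n)]

# C11 = A11*B11 + A12*B21, and likewise for the other three quadrants
C11 = madd(matmul(block(A, 0, 0, h), block(B, 0, 0, h)),
           matmul(block(A, 0, h, h), block(B, h, 0, h)))
C = matmul(A, B)
assert C11 == block(C, 0, 0, h)  # quadrant matches the full product
```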

The code for this decomposition can now all be contained in a CUDA Fortran global subroutine, as long as you are using CUDA 5.0 on a compute capability 3.5 (Kepler K20) card:

  real(8), device, allocatable :: m1(:,:), m2(:,:), m3(:,:), m4(:,:)
  real(8), device, allocatable :: m5(:,:), m6(:,:), m7(:,:), m8(:,:)
  type(dim3), device :: devthreads, devblocks
  . . .

      newn = n / 2  ! For convenience, now assume square matrices
      devblocks = dim3(newn/256, newn/16, 1)
      devthreads = dim3(16, 16, 1)

      call dgemm16<<<devblocks,devthreads>>>(a(1,1), m, &
                                            b(1,1), k, &
                                            m1(1,1), newn, newn, newn, newn)
      call dgemm16<<<devblocks,devthreads>>>(a(1,1+k/2), m, &
                                            b(1+k/2,1), k, &
                                            m2(1,1), newn, newn, newn, newn)
      call dgemm16<<<devblocks,devthreads>>>(a(1,1), m, &
                                            b(1,1+n/2), k, &
                                            m3(1,1), newn, newn, newn, newn)
      call dgemm16<<<devblocks,devthreads>>>(a(1,1+k/2), m, &
                                            b(1+k/2,1+n/2), k, &
                                            m4(1,1), newn, newn, newn, newn)
      call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1), m, &
                                            b(1,1), k, &
                                            m5(1,1), newn, newn, newn, newn)
      call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1+k/2), m, &
                                            b(1+k/2,1), k, &
                                            m6(1,1), newn, newn, newn, newn)
      call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1), m, &
                                            b(1,1+n/2), k, &
                                            m7(1,1), newn, newn, newn, newn)
      call dgemm16<<<devblocks,devthreads>>>(a(1+m/2,1+k/2), m, &
                                            b(1+k/2,1+n/2), k, &
                                            m8(1,1), newn, newn, newn, newn)
      istat = cudaDeviceSynchronize()
      call add16<<<1,devthreads>>>(m1, newn, m2, newn, c(1,1), m, newn)
      call add16<<<1,devthreads>>>(m3, newn, m4, newn, c(1,1+n/2), m, newn)
      call add16<<<1,devthreads>>>(m5, newn, m6, newn, c(1+m/2,1), m, newn)
      call add16<<<1,devthreads>>>(m7, newn, m8, newn, c(1+m/2,1+n/2), m, newn)

      istat = cudaDeviceSynchronize()

Just as in CUDA Fortran host code, the F90 allocate statement in device code wraps the cudaMalloc() calls and F90 deallocate wraps cudaFree(). The entire set of CUDA device API calls is available in CUDA Fortran.

This code is compiled using these flags:

% pgf90 -Mcuda=cuda5.0,cc35,rdc dgemmdynamic.cuf 

This code doesn't perform very well because all the launches are serialized on the same stream. CUDA 5.0 adds the ability to create streams from device code. Use it like this:

      integer :: flags
      integer(kind=cuda_stream_kind) :: istreams(8)
      . . .

      flags = cudaStreamNonBlocking
      do i = 1, 8
         istat = cudaStreamCreateWithFlags(istreams(i), flags)
      end do
      . . .

      call dgemm16<<<devblocks,devthreads,0,istreams(1)>>>(a(1,1), m, &
                                            b(1,1), k, &
                                            m1(1,1), newn, newn, newn, newn)
      . . .

Now the performance you'll see from this version is comparable to the original. Note that you really shouldn't expect a speedup by launching routines from the device relative to launching code from the host. Dynamic parallelism is intended to ease programming; it's not a performance optimization in itself.

That said, there is still room for improvement here. As a further optimization, you can allocate the new matrices and create the streams just once. Typically, in CUDA Fortran you move the declarations out from the global subroutine to the module level to ensure their state is saved across multiple kernel invocations.

I've also put together one level of the Strassen matrix multiply algorithm using basically the same techniques as those shown above. The Strassen algorithm uses one fewer sub-matrix multiply at the cost of more additions.
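For reference, here is a hedged Python sketch of one level of Strassen on a 2x2 block partition (my own illustration, not the CUDA Fortran version in the tar file): seven half-size multiplies plus extra additions, checked against the naive product.

```python
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def madd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def msub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block(M, i, j, h):
    return [row[j:j + h] for row in M[i:i + h]]

def strassen_one_level(A, B):
    """One level of Strassen: 7 sub-matrix multiplies instead of 8."""
    h = len(A) // 2
    A11, A12 = block(A, 0, 0, h), block(A, 0, h, h)
    A21, A22 = block(A, h, 0, h), block(A, h, h, h)
    B11, B12 = block(B, 0, 0, h), block(B, 0, h, h)
    B21, B22 = block(B, h, 0, h), block(B, h, h, h)

    M1 = matmul(madd(A11, A22), madd(B11, B22))
    M2 = matmul(madd(A21, A22), B11)
    M3 = matmul(A11, msub(B12, B22))
    M4 = matmul(A22, msub(B21, B11))
    M5 = matmul(madd(A11, A12), B22)
    M6 = matmul(msub(A21, A11), madd(B11, B12))
    M7 = matmul(msub(A12, A22), madd(B21, B22))

    C11 = madd(msub(madd(M1, M4), M5), M7)
    C12 = madd(M3, M5)
    C21 = madd(M2, M4)
    C22 = madd(madd(msub(M1, M2), M3), M6)

    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot

A = [[float(i * 4 + j + 1) for j in range(4)] for i in range(4)]
B = [[float((i * j) % 5) for j in range(4)] for i in range(4)]
assert strassen_one_level(A, B) == matmul(A, B)
```

In the CUDA Fortran version, each of the seven M products maps naturally onto one of the device-launched dgemm16 calls shown earlier, each on its own stream.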

All three versions of the code are included in my example tar file.

Dynamic Parallelism through CUBLAS

One easy way to take advantage of dynamic parallelism is through NVIDIA's newly available device libraries. The first such library from NVIDIA is CUBLAS, and PGI provides device-code wrappers for it, just as we have done in the past for the host-callable CUBLAS library.

To create a dgemm routine equivalent to the previous examples, you could code a single-threaded global subroutine like this:

  attributes(global) subroutine dgemm16(a, b, c, m, n, k)
    use cublas_device
    integer, value :: m, n, k
    double precision, device :: a(m,*), b(k,*), c(m,*)
    double precision, device :: alpha, beta
    type(cublasHandle) :: ch1
    integer transa, transb
    i = threadIdx%x
    if (i.eq.1) then
        istat = cublasCreate_v2(ch1)
        alpha = 1.0d0
        beta  = 0.0d0
        transa = cublas_op_n
        transb = cublas_op_n
        istat = cublasDgemm_v2(ch1, transa, transb, m, n, k, alpha, &
                                  a, m, b, k, beta, c, m)
        istat = cublasDestroy_v2(ch1)
    end if
    end subroutine

Targeting New CUDA Runtime and Hardware with CUBLAS and PGI Compilers

Make sure you compile correctly for your new target architectures. For now, the default CUDA version with PGI 13.x is CUDA 4.2:

bash-4.1$ pgf90 -O2 dgemmhostcublas.cuf -lcublas
bash-4.1$ ./a.out
       258212  errors were encountered
 Max error was    7.5557859E+22
 Ave error was    1.1575536E+21
512x512 * 512x512:         0.602 ms      446.151 GFlops/s
 ### C(1,1)=   5.3362915329711724E-039

You can always use the latest supported CUDA version by specifying it on the command line explicitly:

bash-4.1$ pgf90 -O2 -Mcuda=cuda5.0,cc35 dgemmhostcublas.cuf -lcublas
bash-4.1$ ./a.out
       258212  errors were encountered
 Max error was    7.5557859E+22
 Ave error was    1.1575536E+21
512x512 * 512x512:         0.461 ms      582.123 GFlops/s
 ### C(1,1)=   5.3362915329711724E-039

I'm certain you can top 582 GFlops/s by making the problem bigger, but I'll leave that as an exercise for the reader. I probably didn't compute that many flops total in the first 10 years of my career, so I'm not that greedy.

Final Remarks

There are potentially many more cool opportunities to take advantage of with CUDA Fortran and CUDA 5.0, more than I can show here. PGI will be showing these demos and more at GTC this month. Come by and chat.