Technical News from The Portland Group

CUDA Fortran Device Kernels

In the PGInsider article, CUDA Fortran Data Management, I described the CUDA Fortran extensions for managing device data, and showed how adding the device attribute to host code declarations enables simple data transfers between the host and device using Fortran array syntax in assignment statements.

In this article we'll go through some of the considerations a programmer should keep in mind when targeting an NVIDIA GPU using either CUDA Fortran or CUDA C. Starting with some simple kernels from the STREAM benchmark, which I introduced in the last article, we'll move on to more complicated kernels that use shared memory, and show how a more complex problem decomposition can improve performance.

The Basics

The STREAM benchmark tests execution time for four memory-bound kernels: copy, scale, add, and triad. These are very simple loops; for instance the call to stream_triad looks like this:

          call stream_triad (a, b, c, scalar, n)

and the code for the subroutine looks like this:

          subroutine stream_triad (a, b, c, scalar, n)
          real*8 a(*), b(*), c(*), scalar
          integer n
          do j = 1,n
              a(j) = b(j) + scalar*c(j)
          end do
          end 

In my previous article, I showed how with a single addition of the device attribute you can declare and initialize CUDA Fortran device arrays adev, bdev, and cdev.

          DOUBLE PRECISION, DEVICE, ALLOCATABLE :: ADEV(:), BDEV(:), CDEV(:)

          ALLOCATE(ADEV(ndim))
          ALLOCATE(BDEV(ndim))
          ALLOCATE(CDEV(ndim))

          ADEV = 2.0d0
          BDEV = 0.5D0
          CDEV = 0.0D0

So, what about the arguments scalar and n? For scalars, you have two choices: you can create device copies of these variables and assign values to them using the same syntax we used for adev, bdev, and cdev, or you can pass the scalar arguments by value. The latter is probably more familiar to C programmers than to Fortran programmers, but it is now supported equally well in both languages. Passing by value bypasses the need to create a device copy of the variable; the value is placed on the argument list on the host and read off the argument list on the device.
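
For completeness, here is a minimal sketch of the first choice, the device-copy approach. The names scalar_dev and n_dev are hypothetical, not part of the STREAM code, and the declarations would live where both the host code and the kernels can see them, for example in the module described below:

          real*8, device  :: scalar_dev
          integer, device :: n_dev
          ...
          ! in host code, a simple assignment copies the value to the device
          scalar_dev = scale
          n_dev      = N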

To pass these arguments properly from host to device, the interface for the device subroutine must be explicit, that is, known to the compiler at the call site. Fortran has two ways to provide an explicit interface: either add a Fortran interface block to the host code, or put the device code in a module that the host code then uses.
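
An interface block for stream_triad would look something like the sketch below; this is the alternative we do not use, since our code takes the module route described next:

      interface
        attributes(global) subroutine stream_triad(a, b, c, scalar, n)
          real*8, device :: a(*), b(*), c(*)
          real*8, value  :: scalar
          integer, value :: n
        end subroutine stream_triad
      end interface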

Our STREAM code uses the module approach. We create a Fortran module named stream_cudafor which contains the device subroutines:

      module stream_cudafor
      ...
      contains
        ...
        attributes(global) subroutine stream_triad(a, b, c, scalar, n)
          real*8, device :: a(*), b(*), c(*)
          real*8, value  :: scalar
          integer, value :: n
          ...

Then in the program unit that makes the call you can say:

      use stream_cudafor

and the compiler will know to set up the call to stream_triad correctly. In this case, we want to pass in scalar and n from the host, by value, because we do not need these variables on the device otherwise.

Now, let's discuss writing the actual kernel body. Current launch configurations (those things passed inside the chevrons) for CUDA kernels allow up to 1024 threads per block on Fermi architecture cards (512 on a Tesla C1060), and up to 64K blocks in each of the x and y grid dimensions. In our STREAM example, the array sizes are less than 32M, so that gives us the flexibility to write fairly simple kernel programs. So, for starters, we'll just have each thread compute one element of the output array. Here is the entire kernel written in CUDA Fortran:

        attributes(global) subroutine stream_triad(a, b, c, scalar, n)
          real*8, device :: a(*), b(*), c(*)
          real*8, value  :: scalar
          integer, value :: n
          integer :: i
          ! each thread computes a unique global index and handles one element
          i = (blockidx%x-1)*blockdim%x + threadidx%x
          if (i .le. n) then
            c(i) = a(i) + b(i) * scalar
          end if
          end subroutine stream_triad

When this is called from this call site:

       call stream_triad <<<(N-1)/512+1,512>>> (ADEV, BDEV, CDEV, scale, N)

each instance of the thread kernel gets a unique (blockidx, threadidx) value. For instance, if N is one million, the blockidx values will range from 1 to 1954, and the threadidx values will range from 1 to 512. Note that the product of these two is greater than 1 million. The CUDA runtime system will actually spawn more threads than necessary; we must add the conditional check to our code and only do the computation if our computed index is less than or equal to n.
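
To see where those numbers come from, here is the launch arithmetic worked out for N equal to one million:

       blocks  = (1000000-1)/512 + 1   ! integer division gives 1953, so 1954 blocks
       threads = 1954 * 512            ! = 1,000,448, which is 448 more threads than elements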

There are certainly alternatives to this decomposition approach, but for simple operations like those in STREAM, which have no real data reuse or non-sequential reordering, the simplest approach is usually the best. Global memory operates most efficiently when the threads in a thread block access sequential, packed locations. The CUDA documentation refers to this as coalesced access to global memory.

Contrast that kernel with something like the following, where you might be tempted to give each thread more work to do, perhaps applying something you learned from x86 SSE performance tuning:

          i = 4*((blockidx%x-1)*blockdim%x + threadidx%x)
          if (i .le. n) then
            c(i-3) = a(i-3) + b(i-3) * scalar
            c(i-2) = a(i-2) + b(i-2) * scalar
            c(i-1) = a(i-1) + b(i-1) * scalar
            c(i)   = a(i)   + b(i)   * scalar
          end if

Remember, there are potentially hundreds of threads per block, in potentially thousands of thread blocks, each running the same line of code "at the same time". Not literally, of course, but it is useful to think of it that way. Also, think of the DRAM bus as a big, wide (think 256 or 512 bits) interface that works most efficiently on contiguous data. In this example, the first access, to a(i-3), is issued by all threads at the same time, and because i differs by 4 between neighboring threads, those accesses are not contiguous. The same applies to the fetches of b(i-3), to the other elements of a and b, and to the writes to c. It is easy to see how coding the operation this way gets only about 25% of the memory throughput of the first approach. In general, rule #1 when starting out writing CUDA kernels is to use contiguous memory across the threads in a thread block for each read or write to global memory, whenever possible.
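
That said, giving each thread more work is not inherently bad; the trick is to keep each individual access contiguous across the threads of a block. Below is a hedged sketch of how that might look for the triad kernel; the name stream_triad4 and the striding scheme are mine, not part of the STREAM code. Each thread handles four elements separated by the total thread count, so neighboring threads still touch neighboring elements on every access:

        attributes(global) subroutine stream_triad4(a, b, c, scalar, n)
          real*8, device :: a(*), b(*), c(*)
          real*8, value  :: scalar
          integer, value :: n
          integer :: i, k, stride
          ! total number of threads in the launch
          stride = blockdim%x * griddim%x
          i = (blockidx%x-1)*blockdim%x + threadidx%x
          do k = 1, 4
            if (i .le. n) c(i) = a(i) + b(i) * scalar
            i = i + stride
          end do
          end subroutine stream_triad4

A matching launch would use one quarter as many blocks, for example <<<(N-1)/(4*512)+1,512>>>.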

Example 2

Now let's look at an operation that is a little more involved, but still simple in theory: matrix transpose. Of course, Fortran has an intrinsic for this operation: given two matrices a and b, you can say:

    a = transpose(b)

and your job is done. Its efficiency, of course, depends on the compiler's implementation. If you want to code it yourself, it looks something like:

    do j = 1, m
      do i = 1, n
        b(i,j) = a(j,i)
      end do
    end do

You can see that the programmer has a choice: either the source matrix can be read sequentially, or the destination matrix can be written sequentially. Not both, at least not in a simple implementation.

Here is a naïve implementation of a CUDA device kernel to perform matrix transpose. By using a 2-dimensional decomposition, we can collapse both the do j and do i loops into the runtime thread and block configuration. This implementation assumes that m and n are multiples of the x and y block dimensions:

  attributes(global) subroutine transpose1(a, b, n, m)
    real, device  :: a(n, m), b(m, n)
    integer, value :: n, m
    integer :: ix, iy
    ix = (blockIdx%x-1) * blockDim%x + threadIdx%x
    iy = (blockIdx%y-1) * blockDim%y + threadIdx%y
    ! the write to b is contiguous across threadIdx%x; the read from a is strided
    b(ix,iy) = a(iy,ix)
  end subroutine transpose1

and to invoke this kernel, we'll use 16x16 thread blocks:

   dimGrid  = dim3(M/16, N/16, 1)
   dimBlock = dim3(16, 16, 1)
   call transpose1 <<<dimgrid, dimblock>>> (ADEV, BDEV, N, M)

This kernel performs poorly, unfortunately. As we noted above, the non-sequential reads from the matrix a perform at just a small fraction of the potential peak. In practice, this runs on a GTX 280 at a rate of about 2.7 GB/sec. Remember from my last article that STREAM was running at around 100 GB/sec on this card. So, we have some room to improve.
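
As an aside, bandwidth numbers like these can be measured with CUDA events from the cudafor module. The sketch below shows one way to do it; the variable names are mine, and a warm-up launch and averaging over several calls are omitted:

   type(cudaEvent) :: tstart, tstop
   real :: ms
   integer :: istat

   istat = cudaEventCreate(tstart)
   istat = cudaEventCreate(tstop)

   istat = cudaEventRecord(tstart, 0)
   call transpose1 <<<dimgrid, dimblock>>> (ADEV, BDEV, N, M)
   istat = cudaEventRecord(tstop, 0)
   istat = cudaEventSynchronize(tstop)
   istat = cudaEventElapsedTime(ms, tstart, tstop)

   ! one 4-byte read and one 4-byte write per element; ms is in milliseconds
   print *, 'GB/sec: ', (2.0 * N * M * 4.0 / 1.0e9) / (ms / 1000.0)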

The first thing to try is to use shared memory to cache a set of data from a thread block. This might allow us to use sequential accesses for both reading and writing to global memory. So, we set up a shared memory array equal to the thread block size, and use that. We have to set up three sets of indices to handle the block operation, those for a, b, and the shared memory tile:

  attributes(global) subroutine transpose2(a, b, n, m)
    real, device  :: a(n, m), b(m, n)
    integer, value :: n, m
    integer :: ix, iy, jx, jy, kx, ky
    real, shared :: tile(16, 16)

    ! ix, iy index the shared memory tile
    ix = threadIdx%x
    iy = threadIdx%y

    ! jx, jy index the element of a read by this thread
    jx = (blockIdx%x-1) * 16 + ix
    jy = (blockIdx%y-1) * 16 + iy

    ! kx, ky index the element of b written by this thread
    kx = (blockIdx%y-1) * 16 + ix
    ky = (blockIdx%x-1) * 16 + iy

    ! both the global read and the global write vary with threadIdx%x in the
    ! leading dimension, so both are contiguous across the threads of a warp
    tile(ix,iy) = a(jx,jy)
    call syncthreads()
    b(kx,ky) = tile(iy,ix)

  end subroutine transpose2

Note that we have to use a syncthreads call to guarantee that all threads have written to the shared array before any thread reads from it. This kernel is about 7x faster than the previous one, and gets a little over 18 GB/sec on our GTX 280. That's a big step forward, but we still have work to do to approach the STREAM benchmark numbers.

In the final version of the transpose kernel below, I've made many changes to get closer to peak performance. The most important of these is to renumber the blocks in the grid. It turns out that besides restructuring reads and writes within a thread block for sequential accesses, it is also important to ensure that different thread blocks are accessing different partitions of memory, of which there are either 6 or 8, depending on your hardware. The mirrored nature of matrix transpose results in poor usage of these partitions, and the workaround is to use a diagonal numbering scheme. An excellent source of information on what NVIDIA refers to as "Partition Camping" can be found in the NVIDIA SDK under the transpose example.

Besides that, we pad the shared memory tile to avoid shared memory bank conflicts. Bank conflicts usually show up as a performance problem only in highly tuned kernels; if the padding were applied to transpose2, you would probably see no net gain. Finally, we perform the complicated block remapping only on the first thread in the thread block, and store the results in the shared variables ibx and iby. This gives a slight performance boost as well, because it removes redundant operations from the other threads in the thread block. It does require a syncthreads call before those threads may read the shared variables, but it is still almost always a small net gain. Other than that, the operations are the same:

  attributes(global) subroutine transpose3(a, b, n, m)
    real, device  :: a(n, m), b(m, n)
    integer, value :: n, m
    integer :: ix, iy, jx, jy, kx, ky, ibid
    real, shared :: tile(17, 16)   ! first dimension padded to avoid bank conflicts
    integer, shared :: ibx, iby

    ix = threadIdx%x
    iy = threadIdx%y

    if ((ix == 1) .and. (iy == 1)) then
        ! thread 1 of the thread block computes new block numbers
        if (n == m) then
           ! square matrices are simpler
           iby = blockIdx%x-1
           ibx = blockIdx%x + blockIdx%y - 2
           if (ibx .ge. gridDim%x) ibx = ibx - gridDim%x
        else
           ibid = gridDim%x*(blockIdx%y-1) + blockIdx%x - 1
           iby = mod(ibid, gridDim%y)
           ibx = mod(ibid/gridDim%y+iby, gridDim%x)
        end if
        ibx = ibx * 16
        iby = iby * 16
    end if
    call syncthreads()

    jx = ibx + ix
    jy = iby + iy

    kx = iby + ix
    ky = ibx + iy

    tile(ix,iy) = a(jx,jy)
    call syncthreads()
    b(kx,ky) = tile(iy,ix)

  end subroutine transpose3

This version of the transpose runs over twice as fast as version 2. We've added complexity to the setup code so that the memory operations will perform better. Remember, each thread in this implementation is still doing just one global memory read and one global memory write. If we time just the last three statements in this routine, memory-to-shared, syncthreads, shared-to-memory, we find that this is running at about 100 GB/sec, which is on par with our STREAM numbers. If you care to improve performance even more, you can amortize the setup code by running over multiple data blocks with each thread block. Of course, you will need to make corresponding changes to the launch configuration.
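
To illustrate that idea, here is a hedged sketch of a multi-tile variant. For clarity it is based on transpose2 rather than the tuned transpose3, and the name transpose2_multi and the parameter NTILE are mine. Each thread block walks NTILE tiles along the y grid dimension, so the index setup is done once per NTILE tiles, and the corresponding launch would divide the y grid dimension by NTILE while keeping the same 16x16 thread block:

  attributes(global) subroutine transpose2_multi(a, b, n, m)
    real, device  :: a(n, m), b(m, n)
    integer, value :: n, m
    integer, parameter :: NTILE = 4
    integer :: ix, iy, jx, jy, kx, ky, it
    real, shared :: tile(16, 16)

    ix = threadIdx%x
    iy = threadIdx%y
    ! these two depend only on blockIdx%x, so they are computed once
    jx = (blockIdx%x-1) * 16 + ix
    ky = (blockIdx%x-1) * 16 + iy

    do it = 0, NTILE-1
      jy = ((blockIdx%y-1)*NTILE + it) * 16 + iy
      kx = ((blockIdx%y-1)*NTILE + it) * 16 + ix

      tile(ix,iy) = a(jx,jy)
      call syncthreads()
      b(kx,ky) = tile(iy,ix)
      ! make sure all reads of the tile finish before the next trip overwrites it
      call syncthreads()
    end do
  end subroutine transpose2_multi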

Readers might notice that I've written two articles and not quoted any GFlops numbers, only GB/sec. My experience shows that, ultimately, scientific codes run only as fast as the memory system allows. If you pay attention to memory performance, your codes will run faster in the end.

Thoughts and feedback are always welcome. Please email me at brent.leback@pgroup.com.