This post contains material that is helpful in porting applications to CUDA Fortran. The techniques presented here are not optimization techniques per se; rather, they are coding techniques and styles that can reduce the effort involved in porting applications to CUDA Fortran.

One goal we pursue when porting code to CUDA Fortran is to leave the CPU code intact, so that the same source code can be compiled to run on either the CPU or the GPU, depending on the absence or presence of the -Mcuda compiler option. This can easily be done with preprocessor macros, but we strive to minimize both the code that gets duplicated and the maintenance headaches that arise from such code duplication.
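For instance, guarding duplicated host and device code paths with a preprocessor symbol works, but each branch repeats logic that must then be maintained twice. The following is a minimal standalone sketch of the duplication we want to avoid (it uses the _CUDA symbol introduced later in this post, and the program and variable names are hypothetical):

program duplicatedPaths
  implicit none
  real :: a(4), b(4)
#ifdef _CUDA
  real, device :: a_d(4), b_d(4)
#endif
  a = [1.0, 2.0, 3.0, 4.0]
#ifdef _CUDA
  ! GPU path: transfer to device, copy on device, transfer back
  a_d = a
  b_d = a_d
  b = b_d
#else
  ! CPU path: the same copy operation, written a second time
  b = a
#endif
  print *, b
end program duplicatedPaths

The techniques below avoid most of this duplication.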

We will demonstrate the porting techniques using the following code that solves the Laplace equation in 2D:

module parameters
  integer, parameter :: nx = 4096, ny = 4096
  integer, parameter :: iterMax = 100
  integer, parameter :: reportInterval = 10
  integer, parameter :: fp_kind = kind(1.0)
  real(fp_kind), parameter :: tol = 1.0e-5_fp_kind
end module parameters

module arrays
  use parameters
  real(fp_kind) :: a(nx,ny), aNew(nx,ny), absResidual(2:nx-1,2:ny-1)
end module arrays

module laplaceRoutines
contains
  subroutine initialize()
    use parameters
    use arrays
    implicit none
    real(fp_kind), parameter :: &
         pi = 2.0_fp_kind*asin(1.0_fp_kind)
    real(fp_kind) :: y0(nx)
    integer :: i
    do i = 1, nx
       y0(i) = sin(pi*(i-1)/(nx-1))
    enddo
    a = 0.0_fp_kind
    a(:,1) = y0
    a(:,ny) = y0*exp(-pi)
    aNew = a
  end subroutine initialize


  subroutine laplaceSolution()
    use parameters
    use arrays
    implicit none
    real(fp_kind) :: maxResidual = 2*tol
    integer :: iter

    iter=0
    do while ( maxResidual > tol .and. iter <= iterMax )
       iter = iter + 1
       call jacobiIteration()
       maxResidual = maxval(absResidual)
       if(mod(iter,reportInterval) == 0) &
            write(*,'(i8,3x,f10.6)') iter, maxResidual
       a = aNew
    end do
  end subroutine laplaceSolution


  subroutine jacobiIteration()
    use parameters
    use arrays
    implicit none
    integer :: i, j

    do j=2,ny-1
       do i=2,nx-1
          aNew(i,j) = 0.2_fp_kind * &
               (a(i,j-1)+a(i-1,j)+a(i+1,j)+a(i,j+1)) + &
               0.05_fp_kind * &
               (a(i-1,j-1)+a(i+1,j-1)+a(i-1,j+1)+a(i+1,j+1))
          absResidual(i,j) = abs(aNew(i,j)-a(i,j))
       end do
    end do
  end subroutine jacobiIteration

end module laplaceRoutines


program main
  use parameters
  use arrays
  use laplaceRoutines

  implicit none
  real :: startTime, stopTime
 
  write(*,'(/,a,i0,a,i0,a)') &
       'Relaxation calculation on ', nx, ' x ', ny, ' mesh'
  write(*,*) 'Iteration   Max Residual'

  call initialize()

  call cpu_time(startTime)
  call laplaceSolution()

  call cpu_time(stopTime)
  write(*,'(a,f10.3,a)')  '  Completed in ', &
       stopTime-startTime, ' seconds'
end program main

We’ll go through this code routine by routine, but links to the complete ports of this code are included at the end of the post.

The parameters module, which contains array dimensions, iteration limits, numerical tolerances, and type kind declarations, requires no changes for use with CUDA Fortran.

Conditional inclusion of code with the !@cuf sentinel

The arrays module:

module arrays
  use parameters
  real(fp_kind) :: a(nx,ny), aNew(nx,ny), absResidual(2:nx-1,2:ny-1)
end module arrays

becomes:

module arrays
  use parameters
  real(fp_kind) :: a(nx,ny), aNew(nx,ny), absResidual(2:nx-1,2:ny-1)
  !@cuf real(fp_kind), device :: a_d(nx,ny), aNew_d(nx,ny)
  !@cuf attributes(device) :: absResidual
end module arrays

Here we have used the !@cuf sentinel to conditionally include the declarations of the device arrays a_d and aNew_d, as well as to give the absResidual array the device attribute. If CUDA Fortran is enabled during compilation, either by specifying -Mcuda on the command line or by renaming the file with the .cuf or .CUF extension, then the remainder of any line beginning with the !@cuf sentinel is treated as a statement; otherwise, the entire line is a comment. Note the different treatment of these arrays when porting. When compiling for CUDA Fortran, both host and device copies of a() and aNew() are needed, hence the additional declarations of the device arrays, whereas absResidual() is needed only on the device, hence the addition of the device attribute.
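As a minimal standalone illustration of the sentinel's behavior (a sketch, not part of the Laplace code):

program sentinelDemo
  implicit none
  write(*,*) 'This line is compiled in both CPU and GPU builds'
  !@cuf write(*,*) 'This line is a statement only when CUDA Fortran is enabled'
end program sentinelDemo

Compiled without -Mcuda, the second line is simply a comment; with -Mcuda (or a .cuf extension), both lines execute.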

The laplaceRoutines module contains three subroutines: initialize(), laplaceSolution(), and jacobiIteration(). We still perform the initialization on the host, as is common in many CUDA Fortran codes. As a result, the initialize() subroutine requires only two additional lines, which use the !@cuf sentinel to perform host-to-device transfers:

subroutine initialize()
  use parameters
  use arrays
  implicit none
  real(fp_kind), parameter :: &
       pi = 2.0_fp_kind*asin(1.0_fp_kind)
  real(fp_kind) :: y0(nx)
  integer :: i

  do i = 1, nx
     y0(i) = sin(pi*(i-1)/(nx-1))
  enddo
  a = 0.0_fp_kind
  a(:,1) = y0
  a(:,ny) = y0*exp(-pi)
  aNew = a
  !@cuf aNew_d = aNew
  !@cuf a_d = a
end subroutine initialize
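Each of these sentinel lines is an ordinary assignment between host and device arrays, which CUDA Fortran translates into a host-to-device transfer. For reference, the assignments are equivalent to explicit runtime API calls (a sketch, assuming use cudafor and an integer istat are in scope):

!@cuf istat = cudaMemcpy(aNew_d, aNew, nx*ny)
!@cuf istat = cudaMemcpy(a_d, a, nx*ny)

The assignment form is used here for its brevity.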

Conditional inclusion of code with the _CUDA symbol

Changes to the laplaceSolution() subroutine occur only in the use statements; the main body of the subroutine, the iteration do loop, is unchanged:

subroutine laplaceSolution()
    use parameters
#ifdef _CUDA
    use arrays, only: a => a_d, aNew => aNew_d, absResidual
#else
    use arrays
#endif
    !@cuf use cudafor
    implicit none
    real(fp_kind) :: maxResidual = 2*tol
    integer :: iter

    iter=0
    do while ( maxResidual > tol .and. iter <= iterMax )
       iter = iter + 1
       call jacobiIteration()
       maxResidual = maxval(absResidual)
       if(mod(iter,reportInterval) == 0) &
          write(*,'(i8,3x,f10.6)') iter, maxResidual
       a = aNew
    end do
end subroutine laplaceSolution

Here we use two methods of conditional compilation: the !@cuf sentinel and the #ifdef _CUDA block. The _CUDA symbol is automatically defined whenever CUDA Fortran is enabled, similar to the !@cuf sentinel. (Because the code now contains preprocessor directives, the file must be preprocessed during compilation, which happens automatically for files with an uppercase extension such as .F90 or .CUF.) The #ifdef _CUDA block toggles between two versions of the use arrays statement. When CUDA Fortran is enabled, the renaming in the only: clause (localName => moduleVariableName) makes the names a and aNew refer to the device arrays a_d and aNew_d. We do this so that no changes to the executable lines of the subroutine are needed: at the end of the do-while loop, the update:

    a = aNew

performs a device-to-device data transfer when CUDA Fortran is enabled at compilation. For this particular code, the changes to the executable statements would be minor, but this renaming approach can save major effort when porting larger projects. The use cudafor statement is required to call the overloaded maxval() routine that operates on device data.
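As a minimal standalone sketch of this overloading (not part of the Laplace code):

program maxvalDemo
  use cudafor
  implicit none
  real :: h(4), mx
  real, device :: d(4)
  h = [1.0, 4.0, 2.0, 3.0]
  d = h              ! host-to-device transfer
  mx = maxval(d)     ! overloaded reduction on device data
  print *, mx        ! prints 4.0
end program maxvalDemo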

Automatic code generation with the !$cuf kernel directive

The port of the jacobiIteration() subroutine also uses the use ... only: construct, so the device variables can be referenced by their host counterparts' names. In addition, the CUF kernel directive !$cuf kernel is used to automatically generate and launch kernel code for the contents of the tightly nested do loops:

subroutine jacobiIteration()
    use parameters
#ifdef _CUDA
    use arrays, only: a => a_d, aNew => aNew_d, absResidual
#else
    use arrays
#endif
    implicit none
    integer :: i, j

    !$cuf kernel do(2) <<<*,*>>>
    do j=2,ny-1
       do i=2,nx-1
          aNew(i,j) = 0.2_fp_kind * &
               (a(i,j-1)+a(i-1,j)+a(i+1,j)+a(i,j+1)) + &
               0.05_fp_kind * &
               (a(i-1,j-1)+a(i+1,j-1)+a(i-1,j+1)+a(i+1,j+1))
          absResidual(i,j) = abs(aNew(i,j)-a(i,j))
       end do
    end do
end subroutine jacobiIteration

As with the laplaceSolution() routine, the majority of the code remains unchanged due to the renaming on the use statement, and there is no duplication of code for host and device variants.
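In the directive above, the <<<*,*>>> execution configuration leaves the choice of grid and thread block to the compiler. Both can also be specified explicitly; for example, the following requests a 32x4 thread block while still letting the compiler compute the grid (a sketch with an untuned, purely illustrative configuration):

    !$cuf kernel do(2) <<<(*,*),(32,4)>>>
    do j=2,ny-1
       do i=2,nx-1
          aNew(i,j) = 0.2_fp_kind * &
               (a(i,j-1)+a(i-1,j)+a(i+1,j)+a(i,j+1)) + &
               0.05_fp_kind * &
               (a(i-1,j-1)+a(i+1,j-1)+a(i-1,j+1)+a(i+1,j+1))
          absResidual(i,j) = abs(aNew(i,j)-a(i,j))
       end do
    end do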

Finally, we have the main routine, which remains relatively unchanged, except for the !@cuf sentinel used to conditionally print a message indicating that execution is taking place on the GPU:

program main
  use parameters
  use arrays
  use laplaceRoutines
  implicit none

  real :: startTime, stopTime

  !@cuf write(*,"(/,a,/)") 'GPU version'
  write(*,'(/,a,i0,a,i0,a)') &
       'Relaxation calculation on ', nx, ' x ', ny, ' mesh'

  write(*,*) 'Iteration   Max Residual'

  call initialize()

  call cpu_time(startTime)
  call laplaceSolution()

  call cpu_time(stopTime)
  write(*,'(a,f10.3,a)')  '  Completed in ', &
       stopTime-startTime, ' seconds'
end program main

Compilation and execution

We can compile this code to run either on the CPU or GPU as follows:

% pgf90 -O3 laplace2DUse.F90
% ./a.out

Relaxation calculation on 4096 x 4096 mesh
 Iteration   Max Residual
      10     0.023564
      20     0.011931
      30     0.008061
      40     0.006065
      50     0.004811
      60     0.004040
      70     0.003442
      80     0.003029
      90     0.002685
     100     0.002420
  Completed in      4.554 seconds
% pgf90 -O3 laplace2DUse.F90 -Mcuda
% ./a.out

GPU version


Relaxation calculation on 4096 x 4096 mesh
 Iteration   Max Residual
      10     0.023564
      20     0.011931
      30     0.008061
      40     0.006065
      50     0.004811
      60     0.004040
      70     0.003442
      80     0.003029
      90     0.002685
     100     0.002420
  Completed in      0.435 seconds

Note that the strategy of renaming variables used in this port requires that the arrays reside in a module. There is another approach to renaming variables that is more general, using associate blocks.

Renaming variables using associate blocks

The associate construct permits one to associate a name with a variable for the duration of the associate block.
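As a simple host-only illustration of the construct (a standalone sketch, unrelated to the Laplace code):

program assocDemo
  implicit none
  real :: temperatureField(3)
  temperatureField = [1.0, 2.0, 3.0]
  associate(t => temperatureField)
    t = 2.0*t                 ! updates temperatureField through the name t
  end associate
  print *, temperatureField   ! prints 2.0 4.0 6.0
end program assocDemo

Using associate blocks to rename variables in our jacobiIteration() routine, we have: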

subroutine jacobiIteration()
  use parameters
  use arrays
  implicit none
  integer :: i, j

  !@cuf associate(a=>a_d, aNew=>aNew_d)
  !$cuf kernel do(2) <<<*,*>>>
  do j=2,ny-1
     do i=2,nx-1
        aNew(i,j) = 0.2_fp_kind * &
             (a(i,j-1)+a(i-1,j)+a(i+1,j)+a(i,j+1)) + &
             0.05_fp_kind * &
             (a(i-1,j-1)+a(i+1,j-1)+a(i-1,j+1)+a(i+1,j+1))
        absResidual(i,j) = abs(aNew(i,j)-a(i,j))
     end do
  end do
  !@cuf end associate
end subroutine jacobiIteration

where the !@cuf sentinel is used in combination with the associate() and end associate statements to conditionally rename variables. Within the associate block, the names a and aNew are associated with the variables a_d and aNew_d, and the host arrays a and aNew are unavailable. A similar approach can be used for the laplaceSolution() subroutine:

subroutine laplaceSolution()
  use parameters
  !@cuf use cudafor
  use arrays
  implicit none
  real(fp_kind) :: maxResidual = 2*tol
  integer :: iter

  iter=0
  !@cuf associate(a=>a_d, aNew=>aNew_d)
  do while ( maxResidual > tol .and. iter <= iterMax )
     iter = iter + 1
     call jacobiIteration()
     maxResidual = maxval(absResidual)
     if(mod(iter,reportInterval) == 0) &
          write(*,'(i8,3x,f10.6)') iter, maxResidual
     a = aNew
  end do
  !@cuf end associate
end subroutine laplaceSolution

We should spend some time discussing the relative merits of use statement renaming versus associate block renaming. First of all, use statement renaming is restricted to variables declared in modules, whereas associate block renaming can be used to rename any variable that is in scope. Associate block renaming also allows more control over the extent of the renaming, since the end associate statement can be placed anywhere, whereas use statement renaming remains in effect for the entire program unit in which the use statement appears. The one downside of associate block renaming is that its use can incur some overhead on the host, similar to the overhead associated with pointers. In our Laplace example, however, we observe no such overhead:

% pgf90 -O3 laplace2DAssoc.f90 -Mcuda=cc35
% ./a.out

GPU associate version


Relaxation calculation on 4096 x 4096 mesh
 Iteration   Max Residual
      10     0.023564
      20     0.011931
      30     0.008061
      40     0.006065
      50     0.004811
      60     0.004040
      70     0.003442
      80     0.003029
      90     0.002685
     100     0.002420
  Completed in      0.431 seconds

Summary

This post presented several techniques one can use when porting code to CUDA Fortran so that the resulting code can be compiled for execution on either the CPU or the GPU, in a manner where the "guts" of the code remain unmodified. While these techniques have been demonstrated here on a simple code, they can be, and have been, used on large-scale projects. An example is the AFiD computational fluid dynamics code, available at https://github.com/PhysicsofFluids/AFiD_GPU_opensource. I hope you find these techniques useful in porting your applications.

Example Source Code
