Minimizing Changes When Porting Codes to CUDA Fortran
This post presents material that is helpful when porting applications to CUDA Fortran. The techniques described are not optimizations per se; rather, they are coding practices and styles that can reduce the effort involved in a port.
One goal we pursue when porting code to CUDA Fortran is to leave the CPU code intact, so that the same source can be compiled to run on either the CPU or the GPU, selected by the absence or presence of the -Mcuda compiler option. This can easily be done with preprocessor macros, but we strive to minimize both the code that gets duplicated and the maintenance headaches that arise from such duplication.
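For example, a port relying solely on the preprocessor might look like the following hypothetical fragment (the array names anticipate those used below), in which every ported statement appears twice, once per branch:

#ifdef _CUDA
   ! GPU branch: operate on the device arrays
   aNew_d = a_d
#else
   ! CPU branch: the same assignment, duplicated for the host arrays
   aNew = a
#endif

The techniques below achieve the same effect while keeping a single copy of nearly all executable code.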
We will demonstrate the porting techniques using the following code that solves the Laplace equation in 2D:
module parameters
  integer, parameter :: nx = 4096, ny = 4096
  integer, parameter :: iterMax = 100
  integer, parameter :: reportInterval = 10
  integer, parameter :: fp_kind = kind(1.0)
  real(fp_kind), parameter :: tol = 1.0e-5_fp_kind
end module parameters

module arrays
  use parameters
  real(fp_kind) :: a(nx,ny), aNew(nx,ny), absResidual(2:nx-1,2:ny-1)
end module arrays

module laplaceRoutines
contains
  subroutine initialize()
    use parameters
    use arrays
    implicit none
    real(fp_kind), parameter :: &
         pi = 2.0_fp_kind*asin(1.0_fp_kind)
    real(fp_kind) :: y0(nx)
    integer :: i

    do i = 1, nx
       y0(i) = sin(pi*(i-1)/(nx-1))
    enddo
    a = 0.0_fp_kind
    a(:,1) = y0
    a(:,ny) = y0*exp(-pi)
    aNew = a
  end subroutine initialize

  subroutine laplaceSolution()
    use parameters
    use arrays
    implicit none
    real(fp_kind) :: maxResidual = 2*tol
    integer :: iter

    iter = 0
    do while ( maxResidual > tol .and. iter <= iterMax )
       iter = iter + 1
       call jacobiIteration()
       maxResidual = maxval(absResidual)
       if (mod(iter,reportInterval) == 0) &
            write(*,'(i8,3x,f10.6)') iter, maxResidual
       a = aNew
    end do
  end subroutine laplaceSolution

  subroutine jacobiIteration()
    use parameters
    use arrays
    implicit none
    integer :: i, j

    do j = 2, ny-1
       do i = 2, nx-1
          aNew(i,j) = 0.2_fp_kind * &
               (a(i,j-1)+a(i-1,j)+a(i+1,j)+a(i,j+1)) + &
               0.05_fp_kind * &
               (a(i-1,j-1)+a(i+1,j-1)+a(i-1,j+1)+a(i+1,j+1))
          absResidual(i,j) = abs(aNew(i,j)-a(i,j))
       end do
    end do
  end subroutine jacobiIteration
end module laplaceRoutines

program main
  use parameters
  use arrays
  use laplaceRoutines
  implicit none
  real :: startTime, stopTime

  write(*,'(/,a,i0,a,i0,a)') &
       'Relaxation calculation on ', nx, ' x ', ny, ' mesh'
  write(*,*) 'Iteration  Max Residual'

  call initialize()
  call cpu_time(startTime)
  call laplaceSolution()
  call cpu_time(stopTime)

  write(*,'(a,f10.3,a)') ' Completed in ', &
       stopTime-startTime, ' seconds'
end program main
We’ll go through this code routine by routine, but links to the complete ports of this code are included at the end of the post.
The parameters module, which contains array dimensions, iteration limits, numerical tolerances, and type kind declarations, requires no changes for use with CUDA Fortran.
Conditional inclusion of code with the !@cuf sentinel
The arrays module:
module arrays
  use parameters
  real(fp_kind) :: a(nx,ny), aNew(nx,ny), absResidual(2:nx-1,2:ny-1)
end module arrays
becomes:
module arrays
  use parameters
  real(fp_kind) :: a(nx,ny), aNew(nx,ny), absResidual(2:nx-1,2:ny-1)
  !@cuf real(fp_kind), device :: a_d(nx,ny), aNew_d(nx,ny)
  !@cuf attributes(device) :: absResidual
end module arrays
Here the !@cuf sentinel is used to conditionally include the declarations of the device arrays a_d and aNew_d, and to declare absResidual as a device array. When CUDA Fortran is enabled in compilation, either by specifying -Mcuda on the command line or by giving the file a .cuf or .CUF extension, the remainder of any line beginning with the !@cuf sentinel is compiled as a statement; otherwise the entire line is treated as a comment. Note the different treatment of these arrays. When compiling for CUDA Fortran, a() and aNew() are needed on both host and device, hence the additional device-array declarations, whereas absResidual() is needed only on the device, hence the added device attribute.
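As a minimal, standalone illustration of the sentinel, consider the hypothetical routine below (not part of the Laplace code). When CUDA Fortran is enabled, all three sentinel lines become statements; otherwise the routine compiles to an empty subroutine:

subroutine sentinelDemo()
  !@cuf use cudafor
  implicit none
  !@cuf integer :: istat

  ! With CUDA Fortran enabled, the sentinel lines in this routine are
  ! compiled as statements; otherwise each is an ordinary comment.
  !@cuf istat = cudaDeviceSynchronize()
end subroutine sentinelDemo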
The laplaceRoutines module contains three subroutines: initialize(), laplaceSolution(), and jacobiIteration(). We still perform the initialization on the host, as is common in many CUDA Fortran codes. As a result, the initialize() subroutine requires only two additional lines, which use the !@cuf sentinel to perform host-to-device transfers:
subroutine initialize()
  use parameters
  use arrays
  implicit none
  real(fp_kind), parameter :: &
       pi = 2.0_fp_kind*asin(1.0_fp_kind)
  real(fp_kind) :: y0(nx)
  integer :: i

  do i = 1, nx
     y0(i) = sin(pi*(i-1)/(nx-1))
  enddo
  a = 0.0_fp_kind
  a(:,1) = y0
  a(:,ny) = y0*exp(-pi)
  aNew = a
  !@cuf aNew_d = aNew
  !@cuf a_d = a
end subroutine initialize
Conditional inclusion of code with the _CUDA symbol
Changes to the laplaceSolution() subroutine occur only in the use statements, and the main body of the subroutine, the iteration do loop, is unchanged.
subroutine laplaceSolution()
  use parameters
#ifdef _CUDA
  use arrays, only: a => a_d, aNew => aNew_d, absResidual
#else
  use arrays
#endif
  !@cuf use cudafor
  implicit none
  real(fp_kind) :: maxResidual = 2*tol
  integer :: iter

  iter = 0
  do while ( maxResidual > tol .and. iter <= iterMax )
     iter = iter + 1
     call jacobiIteration()
     maxResidual = maxval(absResidual)
     if (mod(iter,reportInterval) == 0) &
          write(*,'(i8,3x,f10.6)') iter, maxResidual
     a = aNew
  end do
end subroutine laplaceSolution
Here we use two methods of conditionally compiling code: the !@cuf sentinel and the #ifdef _CUDA block. The _CUDA preprocessor symbol is automatically defined whenever CUDA Fortran is enabled, just as the !@cuf sentinel becomes active. (Using the preprocessor requires the file to have an uppercase .F90 or .CUF extension, or compilation with -Mpreprocess.) The #ifdef _CUDA block toggles between two versions of the use arrays statement. When CUDA Fortran is enabled, the renaming form of the only: clause, localName => moduleVariableName, is used so that the names a and aNew refer to the device arrays a_d and aNew_d. We do this so that no changes are needed to the executable lines of the subroutine: at the end of the do while loop, the update:
a = aNew
performs a device-to-device data transfer when CUDA Fortran is enabled at compilation. For this particular code, changes to the executable statements would be minor, but this renaming approach can save major effort when porting larger projects. The use cudafor statement is required so that the version of maxval() overloaded to operate on device data is available.
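As an aside, these overloaded reductions can be exercised on their own. The following minimal sketch, with hypothetical names, applies maxval() to a device array:

subroutine reductionDemo()
  use cudafor              ! provides maxval() overloaded for device data
  implicit none
  integer, parameter :: n = 1024
  real :: r(n), rMax
  real, device :: r_d(n)

  call random_number(r)
  r_d = r                  ! host-to-device transfer
  rMax = maxval(r_d)       ! reduction performed on the device
  print *, 'max = ', rMax
end subroutine reductionDemo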
Automatic code generation with the !$cuf kernel directive
Porting the jacobiIteration() subroutine also uses the use ..., only: construct to allow referencing the device variables by their host counterparts' names. In addition, the CUF kernel directive !$cuf kernel is used to automatically generate and launch kernel code for the contents of the tightly nested do loops:
subroutine jacobiIteration()
  use parameters
#ifdef _CUDA
  use arrays, only: a => a_d, aNew => aNew_d, absResidual
#else
  use arrays
#endif
  implicit none
  integer :: i, j

  !$cuf kernel do(2) <<<*,*>>>
  do j = 2, ny-1
     do i = 2, nx-1
        aNew(i,j) = 0.2_fp_kind * &
             (a(i,j-1)+a(i-1,j)+a(i+1,j)+a(i,j+1)) + &
             0.05_fp_kind * &
             (a(i-1,j-1)+a(i+1,j-1)+a(i-1,j+1)+a(i+1,j+1))
        absResidual(i,j) = abs(aNew(i,j)-a(i,j))
     end do
  end do
end subroutine jacobiIteration
As with the laplaceSolution() routine, the majority of the code remains unchanged due to the renaming on the use statement, and there is no duplication of code for host and device variants.
Finally, we have the main routine, which remains relatively unchanged, with the exception of the !@cuf sentinel used to conditionally print a statement indicating that execution is taking place on the GPU:
program main
  use parameters
  use arrays
  use laplaceRoutines
  implicit none
  real :: startTime, stopTime

  !@cuf write(*,"(/,a,/)") 'GPU version'
  write(*,'(/,a,i0,a,i0,a)') &
       'Relaxation calculation on ', nx, ' x ', ny, ' mesh'
  write(*,*) 'Iteration  Max Residual'

  call initialize()
  call cpu_time(startTime)
  call laplaceSolution()
  call cpu_time(stopTime)

  write(*,'(a,f10.3,a)') ' Completed in ', &
       stopTime-startTime, ' seconds'
end program main
Compilation and execution
We can compile this code to run either on the CPU or GPU as follows:
% pgf90 -O3 laplace2DUse.F90
% ./a.out

Relaxation calculation on 4096 x 4096 mesh
 Iteration  Max Residual
      10     0.023564
      20     0.011931
      30     0.008061
      40     0.006065
      50     0.004811
      60     0.004040
      70     0.003442
      80     0.003029
      90     0.002685
     100     0.002420
 Completed in      4.554 seconds

% pgf90 -O3 laplace2DUse.F90 -Mcuda
% ./a.out

GPU version

Relaxation calculation on 4096 x 4096 mesh
 Iteration  Max Residual
      10     0.023564
      20     0.011931
      30     0.008061
      40     0.006065
      50     0.004811
      60     0.004040
      70     0.003442
      80     0.003029
      90     0.002685
     100     0.002420
 Completed in      0.435 seconds
Note that the renaming strategy used in this port requires that the arrays reside in a module. There is another, more general approach to renaming variables: associate blocks.
Renaming variables using associate blocks
The associate construct permits one to associate a name with a variable for the duration of the associate block. Using associate blocks to rename the variables in our jacobiIteration() routine, we have:
subroutine jacobiIteration()
  use parameters
  use arrays
  implicit none
  integer :: i, j

  !@cuf associate(a => a_d, aNew => aNew_d)
  !$cuf kernel do(2) <<<*,*>>>
  do j = 2, ny-1
     do i = 2, nx-1
        aNew(i,j) = 0.2_fp_kind * &
             (a(i,j-1)+a(i-1,j)+a(i+1,j)+a(i,j+1)) + &
             0.05_fp_kind * &
             (a(i-1,j-1)+a(i+1,j-1)+a(i-1,j+1)+a(i+1,j+1))
        absResidual(i,j) = abs(aNew(i,j)-a(i,j))
     end do
  end do
  !@cuf end associate
end subroutine jacobiIteration
where the !@cuf sentinel is used in combination with the associate and end associate statements to conditionally rename variables. Within the associate block, the names a and aNew are associated with the variables a_d and aNew_d, and the host variables a and aNew are unavailable. A similar approach can be used for the laplaceSolution() subroutine:
subroutine laplaceSolution()
  use parameters
  !@cuf use cudafor
  use arrays
  implicit none
  real(fp_kind) :: maxResidual = 2*tol
  integer :: iter

  iter = 0
  !@cuf associate(a => a_d, aNew => aNew_d)
  do while ( maxResidual > tol .and. iter <= iterMax )
     iter = iter + 1
     call jacobiIteration()
     maxResidual = maxval(absResidual)
     if (mod(iter,reportInterval) == 0) &
          write(*,'(i8,3x,f10.6)') iter, maxResidual
     a = aNew
  end do
  !@cuf end associate
end subroutine laplaceSolution
It is worth spending some time discussing the relative merits of use-statement renaming versus associate-block renaming. First, use-statement renaming is restricted to variables declared in modules, whereas associate-block renaming can be applied to any variable in scope. Associate blocks also allow finer control over the extent of the renaming, since the end associate statement can be placed anywhere, whereas use-statement renaming remains in effect for the entire program unit containing the use statement. The one downside of associate-block renaming is that it can incur some overhead on the host, similar to the overhead associated with pointers. For our Laplace example, however, we observe no such overhead:
% pgf90 -O3 laplace2DAssoc.f90 -Mcuda=cc35
% ./a.out

GPU associate version

Relaxation calculation on 4096 x 4096 mesh
 Iteration  Max Residual
      10     0.023564
      20     0.011931
      30     0.008061
      40     0.006065
      50     0.004811
      60     0.004040
      70     0.003442
      80     0.003029
      90     0.002685
     100     0.002420
 Completed in      0.431 seconds
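As a final illustration of the generality of associate-block renaming, the hypothetical routine below (not part of the Laplace code) renames a device dummy argument, something use-statement renaming cannot accomplish because the arrays involved are not module variables:

subroutine scaleArray(x, x_d, s)
  use parameters
  implicit none
  real(fp_kind) :: x(nx), x_d(nx), s
  !@cuf attributes(device) :: x_d
  integer :: i

  !@cuf associate(x => x_d)
  ! With CUDA Fortran enabled, x refers to the device array x_d and the
  ! directive below generates and launches a kernel; otherwise the
  ! directive is a comment and the loop runs on the host array x.
  !$cuf kernel do(1) <<<*,*>>>
  do i = 1, nx
     x(i) = s*x(i)
  end do
  !@cuf end associate
end subroutine scaleArray

Here a caller would pass both the host and device copies of the array; with CUDA Fortran disabled, the sentinel and directive lines are comments and the loop simply runs on the host array.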
Summary
This post presented several techniques one can use when porting code to CUDA Fortran so that the resulting source can be compiled for execution on either the CPU or the GPU while the "guts" of the code remain unmodified. While these techniques have been demonstrated here on a simple code, they can be, and have been, used on large-scale projects. An example is the AFiD computational fluid dynamics code, available at https://github.com/PhysicsofFluids/AFiD_GPU_opensource. I hope you find these techniques useful in porting your applications.