Technical News from The Portland Group

Using GPU-enabled Math Libraries with PGI Fortran

Overview

In each of the last two issues of the PGInsider, separate articles covered using specialized GPU linear algebra math libraries to increase the performance of applications that rely on these types of functions. The libraries covered were CULA, a GPU-accelerated numerical linear algebra library, and MAGMA (Matrix Algebra on GPU and Multicore Architectures). This article introduces a third library, CUBLAS (CUDA BLAS), developed by NVIDIA, and describes what is required from a user's perspective to make optimum use of each of these libraries.

CULA is a commercially supported GPU library that implements much of LAPACK in a way that leverages both very high performance multi-core x64 and GPU-accelerated versions of the routines. CULA is available in three versions: Basic (a no-cost, limited set of single precision routines), Premium (a for-fee, expanded set of routines in both single and double precision), and Commercial (for redistribution). MAGMA is similar to LAPACK, but targets heterogeneous/hybrid architectures, starting with current multi-core+GPU systems. MAGMA is available at no cost. CUBLAS is a small library containing a subset of the BLAS Level 3 routines; it too is available at no cost. The intent of this article is not to single out any one library as preferred. Our experience is that each one has its place.

For the purposes of this article, a routine common to all three libraries was chosen as the basis for the examples: the Basic Linear Algebra Subprograms (BLAS) Level 3 routine SGEMM, which performs single precision matrix multiplication.
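In BLAS terms, SGEMM computes C := alpha*op(A)*op(B) + beta*C, where op(X) is either X or its transpose. With no transposes, alpha = 1.0 and beta = 0.0, the call reduces to a plain matrix product. The host-only sketch below (not from the original article) illustrates this by checking against the matmul intrinsic:

```fortran
program sgemm_semantics
  implicit none
  real, dimension(2,2) :: a, b, c

  ! Column-major initialization: a = | 1 2 |
  !                                  | 3 4 |
  a = reshape([1.0, 3.0, 2.0, 4.0], [2,2])
  b = a

  ! With transa = transb = 'n', alpha = 1.0 and beta = 0.0, SGEMM's
  ! result C := alpha*A*B + beta*C equals the matmul intrinsic:
  c = matmul(a, b)
  print *, c   ! prints 7. 15. 10. 22. (column-major order)
end program
```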

Host Interface

The host interface in this context is defined as a math library calling convention that can be used when all data resides in CPU memory. It is the responsibility of the library routine to perform any data transfer to and from the GPU. In our example, the library routine is responsible for transferring matrices A and B to the GPU, and transferring the matrix C back to the CPU memory after computing the matrix multiply. For our SGEMM example, only the CULA library supports a host interface. However, for several other LAPACK routines such as SGETRF, both the CULA and MAGMA libraries contain both host interfaces and device interfaces (discussed in the next section).

Consider the following code snippet. This call to SGEMM performs the matrix multiplication of A x B and stores the result in C.

real, dimension(1000,1000) :: a, b, c
real :: alpha, beta
integer :: m,n,k,lda,ldb,ldc

m = 1000; n = 1000; k = 1000
lda = 1000; ldb = 1000; ldc = 1000
alpha = 1.0
beta = 0.0 

call sgemm('n','n',m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)

In addition to improved performance from using the GPU, another important goal of using these libraries is to minimize the amount of code that needs to be modified or ported. Fortunately this can be accomplished by using Fortran 2003 features to handle the details specific to each library and/or GPU.

For the CULA library, a "CULA" module file might look something like this:

module cula

! Generic interfaces

interface sgemm
  module procedure sgemmHost
  module procedure sgemmDev
end interface

! External interfaces

interface
  integer function culasgemmdev(transa, transb, m, n, k, alpha, a, lda, &
                                b, ldb, beta, c, ldc) bind(c,name='CULA_DEVICE_SGEMM')
   use iso_c_binding
   integer(c_int) :: m, n, k, lda, ldb, ldc
   real(c_float), device, dimension(lda,k) :: a
   real(c_float), device, dimension(ldb,n) :: b
   real(c_float), device, dimension(ldc,n) :: c
   real(c_float) :: alpha, beta
   character(kind=c_char) :: transa, transb
  end function culasgemmdev

  integer function culasgemmhost(transa, transb, m, n, k, alpha, a, lda, &
                                 b, ldb, beta, c, ldc) bind(c,name='CULA_SGEMM')
   use iso_c_binding
   integer(c_int) :: m, n, k, lda, ldb, ldc
   real(c_float), dimension(lda,k) :: a
   real(c_float), dimension(ldb,n) :: b
   real(c_float), dimension(ldc,n) :: c
   real(c_float) :: alpha, beta
   character(kind=c_char) :: transa, transb
  end function culasgemmhost
end interface

contains

subroutine sgemmHost(transa, transb, m, n, k, alpha, a, lda, &
                     b, ldb, beta, c, ldc)
!dir$ ignore_tkr (r) a, b, c
integer :: m, n, k, lda, ldb, ldc
real, dimension(lda,k) :: a
real, dimension(ldb,n) :: b
real, dimension(ldc,n) :: c
real :: alpha, beta
character :: transa, transb

integer :: istatus

istatus = culasgemmhost(transa,transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
call cula_check_status(istatus)

end

subroutine sgemmDev(transa, transb, m, n, k, alpha, a, lda, &
                    b, ldb, beta, c, ldc)
!dir$ ignore_tkr (r) a, b, c
integer :: m, n, k, lda, ldb, ldc
real, device, dimension(lda,k) :: a
real, device, dimension(ldb,n) :: b
real, device, dimension(ldc,n) :: c
real :: alpha, beta
character :: transa, transb

integer :: istatus

istatus = culasgemmdev(transa,transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
call cula_check_status(istatus)

end

subroutine cula_check_status(status)
integer :: status

integer :: info

if (status .ne. 0) then
  info = cula_geterrorinfo()
  if (status .eq. 7) then
! culaArgumentError
    write(*,*) 'Invalid value for parameter ', info
  else if (status .eq. 8) then
! culaDataError
    write(*,*) 'Data error (', info ,')'
  else if (status .eq. 9) then
! culaBlasError
    write(*,*) 'Blas error (', info ,')'
  else if (status .eq. 10) then
! culaRuntimeError
    write(*,*) 'Runtime error (', info ,')'
  else
! others
    call CULA_GETSTATUSSTRING(status)
  endif

  stop
endif

end

end module cula

To use this Fortran module, one needs to make only the following changes to existing source code:

  1. Add a "use cula" statement
  2. Call the CULA library initialization routine

An example of using this module in a Fortran program might look as follows:

use cula

real, dimension(1000,1000) :: a, b, c
real :: alpha, beta
integer :: m,n,k,lda,ldb,ldc
integer :: istatus

m = 1000; n = 1000; k = 1000
lda = 1000; ldb = 1000; ldc = 1000
alpha = 1.0
beta = 0.0 

istatus = cula_initialize()
call cula_check_status(istatus)

call sgemm('n','n',m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)

Sample compilation and linking commands using 64-bit Linux might look like:

pgfortran -Mcuda -c <file>.f90

pgfortran -Mcuda -o <file> <file>.o -L <CULA install path>/CULA/lib64 \
    -R <CULA install path>/CULA/lib64 -lcula_pgfortran

Device Interface

The device interface is the math library calling convention used to perform library routines on the GPU. The device interface requires the user to manage all data allocation and coherency between the CPU and GPU. In the following examples, CUDA Fortran is used to manage the data.
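For readers new to CUDA Fortran, the device attribute plus ordinary array assignment is all that is needed to stage data on the GPU. Below is a minimal sketch (illustrative only, not from the original examples; it requires compilation with pgfortran -Mcuda and a CUDA-capable GPU):

```fortran
program device_data
  use cudafor
  implicit none
  real, dimension(1000,1000) :: a, c                 ! host arrays
  real, device, allocatable, dimension(:,:) :: dA    ! device array

  allocate(dA(1000,1000))   ! allocated in GPU memory
  a  = 1.0
  dA = a                    ! host-to-device copy via array assignment
  ! ... call device-interface library routines on dA here ...
  c  = dA                   ! device-to-host copy
  deallocate(dA)
end program
```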

CUBLAS

Below is the same code snippet used previously, but modified to make use of the CUBLAS SGEMM routine. Fortran array syntax is used to transfer data between GPU memory and CPU memory. Much like CULA, a call to the CUBLAS initialization routine cublasinit is required before using routines from this library.

use cudafor

interface
  subroutine sgemm(transa, transb, m, n, k, alpha, a, lda, &
                   b, ldb, beta, c, ldc) bind(c,name='cublasSgemm')
   use iso_c_binding
   integer(c_int), value :: m, n, k, lda, ldb, ldc
   real(c_float), device, dimension(lda,k) :: a
   real(c_float), device, dimension(ldb,n) :: b
   real(c_float), device, dimension(ldc,n) :: c
   real(c_float), value :: alpha, beta
   character(kind=c_char), value :: transa, transb
  end subroutine sgemm

  subroutine cublasinit() bind(c,name='cublasInit')
  end subroutine cublasinit

end interface


real, dimension(1000,1000) :: a, b, c
real, device, dimension(1000,1000) :: dA, dB, dC
real :: alpha, beta
integer :: m,n,k,lda,ldb,ldc

m = 1000; n = 1000; k = 1000
lda = 1000; ldb = 1000; ldc = 1000
alpha = 1.0
beta = 0.0 

call cublasinit()

dA = a
dB = b
call sgemm('n','n',m,n,k,alpha,dA,lda,dB,ldb,beta,dC,ldc)
c = dC
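A simple sanity check (not part of the original snippet) is to compare the GPU result against a host reference. Assuming the a, b and c arrays from the snippet above, plus a new host array c_ref, a hypothetical check might look like:

```fortran
! Hypothetical check: with alpha = 1.0 and beta = 0.0 the GPU result
! should match the matmul intrinsic to within floating-point rounding.
real, dimension(1000,1000) :: c_ref

c_ref = matmul(a, b)
print *, 'max abs difference = ', maxval(abs(c - c_ref))
```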

The sample compilation and linking commands using 64-bit Linux are:

pgfortran -Mcuda -c <file>.f90

pgfortran -Mcuda -o <file> <file>.o -lcublas 

MAGMA

Next is the same code snippet again, this time modified to make use of the MAGMA SGEMM routine. As in the CUBLAS example, Fortran array syntax is used to transfer data between GPU memory and CPU memory.

use cudafor

interface sgemm
  subroutine sgemmDev(transa, transb, m, n, k, alpha, a, lda, &
                      b, ldb, beta, c, ldc) bind(c,name='magmablas_sgemm')
   use iso_c_binding
   integer(c_int), value :: m, n, k, lda, ldb, ldc
   real(c_float), device, dimension(lda,k) :: a
   real(c_float), device, dimension(ldb,n) :: b
   real(c_float), device, dimension(ldc,n) :: c
   real(c_float), value :: alpha, beta
   character(kind=c_char), value :: transa, transb
  end subroutine sgemmDev
end interface

real, dimension(1000,1000) :: a, b, c
real, device, dimension(1000,1000) :: dA, dB, dC
real :: alpha, beta
integer :: m,n,k,lda,ldb,ldc

m = 1000; n = 1000; k = 1000
lda = 1000; ldb = 1000; ldc = 1000
alpha = 1.0
beta = 0.0 


dA = a
dB = b
call sgemm('n','n',m,n,k,alpha,dA,lda,dB,ldb,beta,dC,ldc)
c = dC

The sample compilation and linking commands using 64-bit Linux are:

pgfortran -Mcuda -c <file>.f90

pgfortran -Mcuda -o <file> <file>.o -L <MAGMA install path>/lib \
    -R <MAGMA install path>/lib -lmagma -lmagmablas -lcublas

CULA

Finally, here is the same code snippet modified to make use of the CULA SGEMM routine. Once again, Fortran array syntax is used to transfer data between GPU memory and CPU memory. Note that the call to cula_initialize is required when using routines from this library. The "cula" module is the one defined above in the host interface section.

use cula

real, dimension(1000,1000) :: a, b, c
real, device, dimension(1000,1000) :: dA, dB, dC
real :: alpha, beta
integer :: m,n,k,lda,ldb,ldc
integer :: istatus

m = 1000; n = 1000; k = 1000
lda = 1000; ldb = 1000; ldc = 1000
alpha = 1.0
beta = 0.0 

istatus = cula_initialize()
call cula_check_status(istatus)

dA = a
dB = b
call sgemm('n','n',m,n,k,alpha,dA,lda,dB,ldb,beta,dC,ldc)
c = dC

The sample compilation and linking commands, again using 64-bit Linux:

pgfortran -Mcuda=cuda3.2 -c <file>.f90

pgfortran -Mcuda=cuda3.2 -o <file> <file>.o -L <CULA install path>/CULA/lib64 \
    -R <CULA install path>/CULA/lib64 -lcula_pgfortran

Conclusion

The rapid rise in the use of GPU accelerators in HPC has brought with it the need for GPU-optimized versions of many of the stalwart HPC libraries such as LAPACK. Fortunately, GPU programmers have a number of math library options. By making use of Fortran modules, programmers can easily use these libraries without the need to make significant changes to existing source code.

More information about each of these libraries can be found on their publishers' web sites:

CUBLAS: http://developer.nvidia.com/object/cuda_3_2_downloads.html
CULA: http://www.culatools.com
MAGMA: http://icl.cs.utk.edu/magma/