February 2011

## Using GPU-enabled Math Libraries with PGI Fortran

### Overview

In each of the last two issues of the *PGInsider*, separate articles covered using specialized GPU linear algebra libraries to increase the performance of applications that rely on these types of functions. The libraries covered were CULA, a GPU-accelerated numerical linear algebra library, and MAGMA (Matrix Algebra on GPU and Multicore Architectures). This article introduces a third library, CUBLAS (CUDA BLAS), developed by NVIDIA, and describes what is required from a user's perspective to make optimum use of each of these libraries.

CULA is a commercially supported GPU library that implements much of LAPACK, leveraging both very high performance multi-core x64 and GPU-accelerated versions of the routines. CULA is available in three versions: Basic (a no-cost, limited set of single precision routines), Premium (a for-fee, expanded set of routines in both single and double precision), and Commercial (licensed for redistribution). MAGMA is similar to LAPACK, but targets heterogeneous/hybrid architectures, starting with current multi-core+GPU systems. MAGMA is available at no cost. CUBLAS is a small library containing a subset of the BLAS Level 3 routines, and is also available at no cost. The intent of this article is not to single out any one library as preferred. Our experience is that each one has its place.

For the purposes of this article, a routine common to all three libraries was chosen as the basis for the examples: the Basic Linear Algebra Subprograms (BLAS) Level 3 routine SGEMM, which performs single precision matrix multiplication, computing C = alpha*op(A)*op(B) + beta*C.
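To make the semantics of the calls in the examples below concrete, here is a minimal pure-Python reference sketch of what SGEMM computes for the 'n','n' (no-transpose) case. This is an illustration of the BLAS definition only, not part of any of the libraries discussed; the function name `sgemm_ref` is invented for this sketch.

```python
def sgemm_ref(m, n, k, alpha, a, b, beta, c):
    """Reference SGEMM, 'n','n' case: C = alpha*A*B + beta*C.

    a is m x k, b is k x n, c is m x n, all lists of lists.
    The GPU libraries in this article compute the same result in
    optimized single precision; this loop nest is just the definition.
    """
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = alpha * acc + beta * c[i][j]
    return c

# Small example: multiplying by the 2x2 identity with alpha=1, beta=0
# leaves B unchanged in C.
ident = [[1.0, 0.0], [0.0, 1.0]]
bmat = [[1.0, 2.0], [3.0, 4.0]]
cmat = [[0.0, 0.0], [0.0, 0.0]]
sgemm_ref(2, 2, 2, 1.0, ident, bmat, 0.0, cmat)
```

With `beta = 0.0`, as in the examples that follow, the initial contents of C do not contribute to the result.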

### Host Interface

The host interface in this context is defined as a math library calling convention that can be used when all data resides in CPU memory. It is the responsibility of the library routine to perform any data transfer to and from the GPU. In our example, the library routine is responsible for transferring matrices A and B to the GPU, and transferring the matrix C back to the CPU memory after computing the matrix multiply. For our SGEMM example, only the CULA library supports a host interface. However, for several other LAPACK routines such as SGETRF, both the CULA and MAGMA libraries contain both host interfaces and device interfaces (discussed in the next section).

Consider the following code snippet. This call to SGEMM performs the matrix multiplication of A x B and stores the result in C.

```fortran
real, dimension(1000,1000) :: a, b, c
real :: alpha, beta
integer :: m, n, k, lda, ldb, ldc
m = 1000; n = 1000; k = 1000
lda = 1000; ldb = 1000; ldc = 1000
alpha = 1.0
beta = 0.0
call sgemm('n','n',m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
```

In addition to improved performance from using the GPU, another important goal of using these libraries is to minimize the amount of code that needs to be modified or ported. Fortunately this can be accomplished by using Fortran 2003 features to handle the details specific to each library and/or GPU.

For the CULA library, a "CULA" module file might look something like this:

```fortran
module cula

  ! Generic interfaces
  interface sgemm
    module procedure sgemmHost
    module procedure sgemmDev
  end interface

  ! External interfaces
  interface
    integer function culasgemmdev(transa, transb, m, n, k, alpha, &
        a, lda, b, ldb, beta, c, ldc) bind(c,name='CULA_DEVICE_SGEMM')
      use iso_c_binding
      integer(c_int) :: m, n, k, lda, ldb, ldc
      real(c_float), device, dimension(lda,k) :: a
      real(c_float), device, dimension(ldb,n) :: b
      real(c_float), device, dimension(ldc,n) :: c
      real(c_float) :: alpha, beta
      character(kind=c_char) :: transa, transb
    end function culasgemmdev

    integer function culasgemmhost(transa, transb, m, n, k, alpha, &
        a, lda, b, ldb, beta, c, ldc) bind(c,name='CULA_SGEMM')
      use iso_c_binding
      integer(c_int) :: m, n, k, lda, ldb, ldc
      real(c_float), dimension(lda,k) :: a
      real(c_float), dimension(ldb,n) :: b
      real(c_float), dimension(ldc,n) :: c
      real(c_float) :: alpha, beta
      character(kind=c_char) :: transa, transb
    end function culasgemmhost
  end interface

contains

  subroutine sgemmHost(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
    !dir$ ignore_tkr (r) a, b, c
    integer :: m, n, k, lda, ldb, ldc
    real, dimension(lda,k) :: a
    real, dimension(ldb,n) :: b
    real, dimension(ldc,n) :: c
    real :: alpha, beta
    character :: transa, transb
    integer :: istatus
    istatus = culasgemmhost(transa,transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
    call cula_check_status(istatus)
  end subroutine

  subroutine sgemmDev(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc)
    !dir$ ignore_tkr (r) a, b, c
    integer :: m, n, k, lda, ldb, ldc
    real, device, dimension(lda,k) :: a
    real, device, dimension(ldb,n) :: b
    real, device, dimension(ldc,n) :: c
    real :: alpha, beta
    character :: transa, transb
    integer :: istatus
    istatus = culasgemmdev(transa,transb,m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
    call cula_check_status(istatus)
  end subroutine

  subroutine cula_check_status(status)
    integer :: status
    integer :: info
    if (status .ne. 0) then
      info = cula_geterrorinfo()
      if (status .eq. 7) then        ! culaArgumentError
        write(*,*) 'Invalid value for parameter ', info
      else if (status .eq. 8) then   ! culaDataError
        write(*,*) 'Data error (', info, ')'
      else if (status .eq. 9) then   ! culaBlasError
        write(*,*) 'Blas error (', info, ')'
      else if (status .eq. 10) then  ! culaRuntimeError
        write(*,*) 'Runtime error (', info, ')'
      else                           ! others
        call CULA_GETSTATUSSTRING(status)
      endif
      stop
    endif
  end subroutine

end module cula
```

To use this Fortran module, one need only make the following changes to existing source code:

- Add a "use cula" statement
- Call the CULA library initialization routine

An example of using this module in a Fortran program might look as follows:

```fortran
use cula
real, dimension(1000,1000) :: a, b, c
real :: alpha, beta
integer :: m, n, k, lda, ldb, ldc, istatus
m = 1000; n = 1000; k = 1000
lda = 1000; ldb = 1000; ldc = 1000
alpha = 1.0
beta = 0.0
istatus = cula_initialize()
call cula_check_status(istatus)
call sgemm('n','n',m,n,k,alpha,a,lda,b,ldb,beta,c,ldc)
```

Sample compilation and linking commands using 64-bit Linux might look like:

```shell
pgfortran -Mcuda -c <file>.f90
pgfortran -Mcuda -o <file> <file>.o -L <CULA install path>/CULA/lib64 -R <CULA install path>/CULA/lib64 -lcula_pgfortran
```

### Device Interface

The device interface is the math library calling convention used to perform library routines on the GPU. The device interface requires the user to manage all data allocation and coherency between the CPU and GPU. In the following examples, CUDA Fortran is used to manage the data.

#### CUBLAS

Below is the same code snippet used previously, modified to make use of the CUBLAS SGEMM routine. Fortran array assignment syntax is used to transfer data between CPU and GPU memory. Much like CULA, a call to the CUBLAS initialization routine cublasinit is required before using routines from this library.

```fortran
use cudafor

interface
  subroutine sgemm(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc) &
      bind(c,name='cublasSgemm')
    use iso_c_binding
    integer(c_int), value :: m, n, k, lda, ldb, ldc
    real(c_float), device, dimension(m,n) :: a, b, c
    real(c_float), value :: alpha, beta
    character(kind=c_char), value :: transa, transb
  end subroutine sgemm
  subroutine cublasinit() bind(c,name='cublasInit')
  end subroutine cublasinit
end interface

real, dimension(1000,1000) :: a, b, c
real, device, dimension(1000,1000) :: dA, dB, dC
real :: alpha, beta
integer :: m, n, k, lda, ldb, ldc
m = 1000; n = 1000; k = 1000
lda = 1000; ldb = 1000; ldc = 1000
alpha = 1.0
beta = 0.0
call cublasinit()
dA = a
dB = b
call sgemm('n','n',m,n,k,alpha,dA,lda,dB,ldb,beta,dC,ldc)
c = dC
```

The sample compilation and linking commands using 64-bit Linux are:

```shell
pgfortran -Mcuda -c <file>.f90
pgfortran -Mcuda -o <file> <file>.o -lcublas
```

#### MAGMA

Next is the same code snippet again, this time modified to make use of the MAGMA SGEMM routine. As in the CUBLAS example, Fortran array assignment syntax is used to transfer data between CPU and GPU memory.

```fortran
use cudafor

interface sgemm
  subroutine sgemmDev(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc) &
      bind(c,name='magmablas_sgemm')
    use iso_c_binding
    integer(c_int), value :: m, n, k, lda, ldb, ldc
    real(c_float), device, dimension(m,n) :: a, b, c
    real(c_float), value :: alpha, beta
    character(kind=c_char), value :: transa, transb
  end subroutine sgemmDev
end interface

real, dimension(1000,1000) :: a, b, c
real, device, dimension(1000,1000) :: dA, dB, dC
real :: alpha, beta
integer :: m, n, k, lda, ldb, ldc
m = 1000; n = 1000; k = 1000
lda = 1000; ldb = 1000; ldc = 1000
alpha = 1.0
beta = 0.0
dA = a
dB = b
call sgemm('n','n',m,n,k,alpha,dA,lda,dB,ldb,beta,dC,ldc)
c = dC
```

The sample compilation and linking commands using 64-bit Linux are:

```shell
pgfortran -Mcuda -c <file>.f90
pgfortran -Mcuda -o <file> <file>.o -L <MAGMA install path>/lib -R <MAGMA install path>/lib -lmagma -lmagmablas -lcublas
```

#### CULA

Finally, here is the same code snippet modified to make use of the CULA SGEMM routine. Once again, Fortran array assignment syntax is used to transfer data between CPU and GPU memory. Note that the call to cula_initialize is again required when using routines from this library. The "cula" module is the one defined above in the Host Interface section.

```fortran
use cula
real, dimension(1000,1000) :: a, b, c
real, device, dimension(1000,1000) :: dA, dB, dC
real :: alpha, beta
integer :: m, n, k, lda, ldb, ldc, istatus
m = 1000; n = 1000; k = 1000
lda = 1000; ldb = 1000; ldc = 1000
alpha = 1.0
beta = 0.0
istatus = cula_initialize()
call cula_check_status(istatus)
dA = a
dB = b
call sgemm('n','n',m,n,k,alpha,dA,lda,dB,ldb,beta,dC,ldc)
c = dC
```

The sample compilation and linking commands, again using 64-bit Linux:

```shell
pgfortran -Mcuda=cuda3.2 -c <file>.f90
pgfortran -Mcuda=cuda3.2 -o <file> <file>.o -L <CULA install path>/CULA/lib64 -R <CULA install path>/CULA/lib64 -lcula_pgfortran
```

### Conclusion

The rapid rise in the use of GPU accelerators in HPC has brought with it the need for GPU-optimized versions of many of the stalwart HPC libraries such as LAPACK. Fortunately, GPU programmers have a number of math library options. By making use of Fortran modules, programmers can easily use these libraries without the need to make significant changes to existing source code.

More information about each of these libraries can be found on their publishers' web sites:

CUBLAS: http://developer.nvidia.com/object/cuda_3_2_downloads.html

CULA: http://www.culatools.com

MAGMA: http://icl.cs.utk.edu/magma/