Using the cuSOLVER Library from CUDA Fortran
by Josh Romero and Greg Ruetsch, NVIDIA Corp.
The cuSOLVER library was first included with the CUDA 7.0 toolkit. It . As a collection of libraries, cuSOLVER provides useful LAPACK-like features. Its list of features is growing with each release. The current version offers common matrix factorization and triangular solve routines for dense matrices, a sparse least-squares solver and an eigenvalue solver. In addition, cuSOLVER provides a new refactorization library useful for solving sequences of matrices with a shared sparsity pattern.
Currently, CUDA Fortran does not have a ready-to-use interface module to the cuSOLVER library. This article shows how to write such an interface and presents two examples of its use: first, a simple Cholesky factorization, and second, the reduction of a generalized eigenproblem to a standard one.
Using cuSOLVER from CUDA Fortran
Writing a module containing definitions and interfaces to the cuSOLVER routines required for our code is straightforward. cuSOLVER routines come in two flavors, for dense and sparse matrices, and in our case we are concerned only with the dense version. In particular, in our cusolverDn module we define cusolver_status_* parameters, the cusolverDnHandle type, and interfaces to the utility routines cusolverDnCreate(), cusolverDnDestroy(), cusolverDnSetStream(), as well as the routines associated with performing Cholesky factorization on complex(8) data: cusolverDnZpotrf_bufferSize() and cusolverDnZpotrf(). cusolverDnHandle is a derived type with a member of the c_ptr type defined in the iso_c_binding module:
type cusolverDnHandle type(c_ptr) :: handle end type cusolverDnHandle
This handle is used as a first argument to all of the cusolverDn routines, including the routines to create and destroy the handle, whose interfaces are:
interface integer(c_int) function cusolverDnCreate(handle) & bind(C,name='cusolverDnCreate') import cusolverDnHandle type(cusolverDnHandle) :: handle end function cusolverDnCreate end interface interface integer(c_int) function cusolverDnDestroy(handle) & bind(C,name='cusolverDnDestroy') import cusolverDnHandle type(cusolverDnHandle), value :: handle end function cusolverDnDestroy end interface
Here the iso_c_binding type c_int is used as at the return type which corresponds to the cusolver_status_* enums declared in the module. The bind() keyword is used to bind the routine in the interface to the respective cuSOLVER library routine. Note that due to the C language’s pass-by-value convention, the value variable attribute is used in declaring handle to the cusolverDnDestroy interface, whereas handle is returned in the cusolverDnCreate interface and therefore value is not needed.
By default, all kernels launched by the dense cuSOLVER API functions will execute in the default (blocking) stream. To allow overlap with other device activity, the stream in which the dense cuSOLVER device code executes can be set by the cusolverDnSetStream() routine:
interface integer(c_int) function cusolverDnSetStream(handle, stream) & bind(C,name='cusolverDnSetStream') use cudafor import cusolverDnHandle type(cusolverDnHandle), value :: handle integer(cuda_stream_kind), value :: stream end function cusolverDnSetStream end interface
where here the cudafor module is required for the definition of cuda_stream_kind. Finally, we have the two routines associated with the Cholesky decomposition:
interface integer(c_int) function cusolverDnZpotrf_bufferSize( & handle, uplo, n, A, lda, Lwork) & bind(C, name='cusolverDnZpotrf_bufferSize') use iso_c_binding import cusolverDnHandle type(cusolverDnHandle), value :: handle integer(c_int), value :: uplo integer(c_int), value :: n complex(8), device :: A(*) integer(c_int), value :: lda integer(c_int) :: Lwork end function cusolverDnZpotrf_bufferSize end interface interface integer(c_int) function cusolverDnZpotrf( & handle, uplo, n, A, lda, Workspace, Lwork, devInfo) & bind(C,name='cusolverDnZpotrf') use iso_c_binding import cusolverDnHandle type(cusolverDnHandle), value :: handle integer(c_int), value :: uplo integer(c_int), value :: n complex(8), device :: A(*) integer(c_int), value :: lda !pgi$ ignore_tkr (r) Workspace complex(8), device :: Workspace(*) integer(c_int), value :: Lwork integer(c_int), device :: devInfo end function cusolverDnZpotrf end interface
The first of these two routines returns in Lwork (in units of words, in this case complex(8) words) the amount of space required in a scratch array Workspace used in the routine cusolverDnZpotrf() that performs the Cholesky factorization. This factorization requires the matrix to be Hermitian, and as a result either the upper or lower part of the matrix is used as specified by the uplo argument (one of CUBLAS_FILL_MODE_UPPER or CUBLAS_FILL_MODE_LOWER defined in the CUBLAS module). In these routines n is the size of the Hermitian positive-definite nxn matrix whose values are stored in the argument A with a leading dimension of lda. The integer devInfo will return a value of zero upon successful completion.
An example of how this module is used to perform a Cholesky factorization on a small matrix is listed below:
program main use cublas use cusolverDn use cudafor implicit none integer, parameter :: n=3 complex(8) :: a(n,n) complex(8), device :: a_d(n,n) complex(8), device, allocatable :: workspace_d(:) integer, device :: devInfo_d integer :: istat, Lwork type(cusolverDnHandle) :: h a(1,1) = 25.0; a(1,2) = 15.0; a(1,3) = -5.0 a(2,1) = a(1,2); a(2,2) = 18.0; a(2,3) = 0.0 a(3,1) = a(1,3); a(3,2) = a(2,3); a(3,3) = 11.0 a_d = a istat = cusolverDnCreate(h) if (istat /= CUSOLVER_STATUS_SUCCESS) & write(*,*) 'handle creation failed' istat = cusolverDnZpotrf_bufferSize(h, & CUBLAS_FILL_MODE_LOWER, n, a_d, n, Lwork) if (istat /= CUSOLVER_STATUS_SUCCESS) & write(*,*) 'cusolverDnZpotrf_buffersize failed' allocate(workspace_d(Lwork)) istat = cusolverDnZpotrf(h, CUBLAS_FILL_MODE_LOWER, & n, a_d, n, workspace_d, Lwork, devInfo_d) if (istat /= CUSOLVER_STATUS_SUCCESS) & write(*,*) 'cusolverDnZpotrf failed' istat = devInfo_d if (istat /= 0) write(*,*) 'Cholesky factorization failed' istat = cusolverDnDestroy(h) if (istat /= CUSOLVER_STATUS_SUCCESS) & write(*,*) 'handle destruction failed' a = a_d write(*,"(3(f0.0,SP,f0.0,'i',2x))") a(1,:) write(*,"(3(f0.0,SP,f0.0,'i',2x))") a(2,:) write(*,"(3(f0.0,SP,f0.0,'i',2x))") a(3,:) end program main
Here the return values are checked for each API call, as well as the devInfo_d argument to the cusolverDnZpotrf call.
The code can be compiled and executed with:
$ pgfortran -c -Mcuda=8.0 cusolverDn_m.cuf $ pgfortran -o testDn -Mcuda=8.0 testDn.cuf cusolverDn_m.o \ -Mcudalib=cublas,cusolver testDn.cuf: $ ./testDn 5.+0.i +15.+0.i -5.+0.i 3.+0.i +3.+0.i +0.+0.i -1.+0.i +1.+0.i +3.+0.i
where because CUBLAS_FILL_MODE_LOWER was specified only the lower triangular part of the input matrix is modified by the factorization.
Reducing Generalized Eigenproblems using CUDA Fortran
Now that we have a working interface to cuSOLVER, we are going to address a more complex problem. A common operation that can be found in the context of many physics computations is taking a generalized Hermitian-definite eigenproblem of the form,
Ax = λBx
where A is a Hermitian nxn matrix, B is a Hermitian positive-definite nxn matrix, λ and and x are the eigenvalue and associated eigenvector respectively, and reducing it to a Hermitian eigenproblem in standard form,
Cx = λx
where C is a Hermitian nxn matrix. This reduction allows for the computation of the eigenvalues and eigenvectors of the original system via the computation of the eigenvalues and eigenvectors of the Hermitian matrix C, a task for which many algorithms exist.
The procedure to reduce the original eigenproblem to standard form is as follows:
Compute the Cholesky factorization of the matrix B. For this example, assume that we have factored B as B = LLH. Substitution into the original eigenproblem results in the form,
Ax = λLLHx
To reduce to a standard eigenproblem, the equation is manipulated so that
L-1AL-Hx = λx
This corresponds to the standard Hermitian eigenproblem with C = L-1AL-H
With the Cholesky factorization routine from cuSOLVER now available to us, the first step of our reduction procedure can be computed easily. To factor B as B = LLH, we can call the previously covered cusolverDnZpotrf routine with the uplo argument equal to CUBLAS_FILL_MODE_LOWER.
We first note that our desired computation corresponds exactly to an existing LAPACK routine, zhegst, with many available CPU implementations. A corresponding GPU routine, magmaf_zhegst_gpu, can be found within the open-source MAGMA library. One option available to us then is to simply use this routine. With that being said, there exists good reason to continue our investigation. One interesting property of many routines within the MAGMA library (including our routine of interest) is a dependence on existing CPU implementations. It is common for routines to be implemented using a hybrid matrix blocking strategy, with small sub-block computations carried out on the CPU, and bulk matrix updates carried out on the GPU. This requires active communication between the GPU and CPU, typically implemented using asynchronous memcopy routines to enable concurrent operation. For best performance, these types of hybrid implementations require a good matching of available CPU and GPU resources.
However, it is not atypical to see systems with several GPUs associated with a single CPU socket. In these cases, the available CPU resources per GPU are limited, leading to potential bottlenecks in performance. Furthermore, it may be desirable to have an implementation that requires no host to device communication, particularly in applications that require host to device transfers of other data during the kernel runtime. To sidestep these issues, we can develop our own implementation of zhegst that uses only GPU resources.
As a first step, we can develop a basic GPU implementation. This basic implementation uses two triangular solves (performed via calls to cublasZtrsm) on the full matrices to compute matrix C. The first call to cublasZtrsm solves the problem:
X1LH = A → X1 = AL-H, A := X1
The second call to cublasZtrsm solves the problem:
LX2 = A → X2 = L-1A, A := X2
After these operations, the original A matrix now contains the product L-1AL-H, which is exactly our desired matrix C.
A subroutine zhegst_gpu_v1 that implements this procedure is listed below:
! zhegst completed with 2 global triangular solves on GPU subroutine zhegst_gpu_v1(itype, uplo, N, A, lda, B, ldb) implicit none integer, intent(in) :: itype, N, lda, ldb character, intent(in) :: uplo complex(8), device, dimension(1:ldb,1:N), intent(in) :: B complex(8), device, dimension(1:lda,1:N) :: A complex(8), parameter :: cone = cmplx(1.,0,8) integer :: istat type(cublasHandle) :: H1 if (itype .ne. 1 .or. uplo == 'U') then write(*,*), "Provided problem configuration not supported!" return endif istat = cublasCreate(H1) ! Populate A matrix with complete Hermitian entries !$cuf kernel do(2) <<<*,*>>> do j = 1, N do i = 1, N if (j > i) then A(i,j) = conjg(A(j,i)) endif end do end do istat = cublasztrsm_v2(H1, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER, & CUBLAS_OP_C, CUBLAS_OP_N, N, N, & cone, B, ldb, A, lda) istat = cublasztrsm_v2(H1, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER, & CUBLAS_OP_N, CUBLAS_OP_N, N, N, & cone, B, ldb, A, lda) istat = cublasDestroy(H1) end subroutine zhegst_gpu_V1
With a basic GPU implementation defined, we can now develop an improved version by pairing our basic implementation with the blocking strategy used by MAGMA. The main idea is to replace the sub-block computation that MAGMA offloads to the CPU with our basic GPU implementation, which keeps all computation resident on the GPU. To preserve the benefits of concurrent computation of the sub-block and bulk updates, we introduce two CUDA streams, one to perform the sub-block computations, and another to perform the available concurrent portion of the bulk update. In a sense, one CUDA stream emulates the behavior of the CPU in the existing MAGMA routine, while the other performs all the tasks already associated with the GPU. A subroutine zhegst_gpu_v2 that implements this is listed below:
! zhegst completed in blocks, using 2 ztrsms to solve subblock problem subroutine zhegst_gpu_v2(itype, uplo, N, A, lda, B, ldb, nb) implicit none integer, intent(in) :: itype, N, lda, ldb, nb character, intent(in) :: uplo complex(8), device, dimension(1:ldb, 1:N), intent(in) :: B complex(8), device, dimension(1:lda, 1:N) :: A complex(8), parameter :: cone = cmplx(1.0,0.0,8) complex(8), parameter :: chalf = cmplx(0.5,0.0,8) real(8), parameter :: one = 1.0_8 integer :: i, j, k, kb, istat integer(kind = cuda_stream_kind) :: stream1, stream2 type(cublasHandle) :: H1 if (itype .ne. 1 .or. uplo == 'U') then write(*,*), "Provided problem configuration not supported!" return endif ! Setup cublas handle and streams istat = cublasCreate(H1) istat = cudaStreamCreate(stream1) ! "CPU" stream istat = cudaStreamCreate(stream2) ! "GPU" stream do k = 1, N, nb kb = min(N-k+1, nb) istat = cublasSetStream(H1, stream1) ! Populate subblock with complete Hermitian entries !$cuf kernel do(2) <<<*,*, 0, stream1>>> do j = k,k+kb-1 do i = k,k+kb-1 if (j > i) then A(i,j) = conjg(A(j,i)) endif end do end do ! Solve subblock problem (results in fully populated A subblock) istat = cublasztrsm_v2(H1, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER, & CUBLAS_OP_C, CUBLAS_OP_N, kb, kb, & cone, B(k,k), ldb, A(k,k), lda) istat = cublasztrsm_v2(H1, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER, & CUBLAS_OP_N, CUBLAS_OP_N, kb, kb, & cone, B(k,k), ldb, A(k,k), lda) if (k + kb .le. N) then ! Perform bulk matrix update istat = cublasSetStream(H1, stream2) istat = cublasztrsm_v2(H1, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER, & CUBLAS_OP_C, CUBLAS_OP_N, N-k-kb+1, kb, cone, & B(k, k), ldb, A(k+kb, k), lda) istat = cudaStreamSynchronize(stream1) ! Since the A subblock is fully populated, use zgemm instead of ! zhemm here istat = cublaszgemm_v2(H1, CUBLAS_OP_N, CUBLAS_OP_N, N-k-kb+1, kb, & kb, -chalf, B(k+kb, k), ldb, A(k, k), lda, & cone, A(k+kb, k), lda) istat = cublaszher2k_v2(H1, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, & N-k-kb+1, kb, -cone, A(k+kb, k), lda, & B(k+kb, k), ldb, one, A(k+kb, k+kb), lda) istat = cudaStreamSynchronize(stream2) ! Again, we use zgemm instead of zhemm here istat = cublaszgemm_v2(H1, CUBLAS_OP_N, CUBLAS_OP_N, N-k-kb+1, kb, & kb, -chalf, B(k+kb, k), ldb, A(k, k), lda, & cone, A(k+kb, k), lda) istat = cublasztrsm_v2(H1, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER, & CUBLAS_OP_N, CUBLAS_OP_N, N-k-kb+1, kb, cone, & B(k+kb, k+kb), ldb, A(k+kb, k), lda) end if end do ! Cleanup istat = cublasDestroy(H1) istat = cudaStreamDestroy(stream1) istat = cudaStreamDestroy(stream2) end subroutine zhegst_gpu_v2
For comparison, see the MAGMA implementation of this routine on GitHub at: https://github.com/maxhutch/magma/blob/master/src/zhegst_gpu.cpp
A couple aspects of our developed GPU routines are worth noting. First, our basic GPU implementation differs from the standard LAPACK zhegst routine in that it requires a fully populated Hermitian matrix A to operate properly. As such, before performing the triangular solves, the transposed entries of A are populated naively using a simple CUF kernel. We found this contributed negligibly to the performance of both the global and blocked implementations. Second, our basic GPU implementation results in a fully populated matrix C, while the standard routine populates only the lower or upper triangular part of C, corresponding with the provided uplo argument. In our blocked implementation, this gives us the option of replacing calls to cublasZhemm with cublasZgemm, used during the bulk matrix update. This provided a small performance improvement for the matrix sizes we tested.
We tested our developed GPU routines, zhegst_gpu_v1 and zhegst_gpu_v2, on matrix sizes ranging from 512x512 up to 8192x8192 and compared performance against the zhegst routine available in Intel MKL Math Kernel Library, and the magmaf_zhegst_gpu routine available in MAGMA, linked with the Intel MKL. The MKL library is configured to use OpenMP threads for parallelization. We performed our testing on a system with an Intel Xeon E5-2698 v3 CPU (2.3 GHz, 16 cores) and a single NVIDIA P100-PCIE. For all cases shown, the sub-block size (nb) used by our blocked GPU implementation is set to 448. For completeness, we are also including performance results for the Cholesky factorization routines used, zpotrf from MKL, magmaf_zpotrf_gpu from MAGMA, and cusolverDnZpotrf.
Figure 1. Performance Results for Cholesky Factorization Routines
Figure 2. Performance Results for zhegst Routines
Figure 1 plots the runtime in milliseconds versus the matrix size N and speedup relative to MKL for the various Cholesky implementations. Similarly, Figure 2 plots runtime in milliseconds versus the matrix size N and speedup relative to MKL for the various zhegst implementations.
The performance results for the various zhegst routines indicate a few interesting trends. First, for smaller N, both the basic and blocked GPU implementations outperform MAGMA, with the blocked GPU implementation maintaining an edge for up to a larger N than the basic GPU implementation. The performance advantage here can be attributed to MAGMA suffering from an imbalance of CPU and GPU work at smaller matrix sizes. When N is small, the sub-block calculations occurring on the CPU comprise a large percentage of the total computational work. This leads to difficulties completely hiding the CPU sub-block computation and associated data transfer behind a concurrent GPU computation of the bulk matrix update. In addition to this, profiling reveals that the first sub-block computation on the CPU occurs sequentially, with concurrent operation starting with computation of the second sub-block and beyond. For small N, fewer total sub-block computations occur and the performance impact of the sequential computation of the first sub-block is greater.
Figure 3. NVVP Timelines for N = 1024
These observations can be verified in the NVVP timeline results shown in Figure 3, which correspond to a case where N = 1024, within the regime that MAGMA suffers from poor balance of CPU and GPU work. For additional clarity, operations associated with the sub-block computations are marked using NVTX regions. This is particularly useful in the MAGMA case where time spent in CPU functions that is not normally represented within NVVP can be clearly identified. In the MAGMA timeline, we can observe several large vacant gaps where the GPU is completely idle. These gaps are largely associated with the CPU sub-block computation time (magenta markers) exceeding the amount of time required by the GPU for the bulk matrix update from the previous block. We also see the large impact associated with the first sub-block computation occurring fully sequentially. To make matters worse, at these small matrix sizes, the allocation and deallocation of pinned workspace (blue and cyan markers respectively) for the sub-block computations on the CPU are significant contributors to the total routine runtime. If we consider the timeline in the same case for the blocked GPU implementation, we see significantly improved concurrency of the sub-block computation and bulk update, split between the two depicted CUDA streams.
With increasing N, these issues are mitigated and the performance of MAGMA improves. For larger N, the performance of MAGMA and our blocked GPU implementation are basically equal, while the performance of our basic GPU implementation lags behind. This indicates a clear advantage of using a blocking strategy. Additionally, the parity in performance of both MAGMA’s implementation and our blocked GPU implementation for larger matrices is not too surprising. At these sizes, MAGMA can effectively hide all CPU computation and host to device data transfers behind concurrent GPU work and the impact of the first sub-block is marginal. At this point, the sub-block computations for both MAGMA and our blocked GPU implementation are fully hidden behind concurrent computation of the bulk matrix updates. Because the procedure to perform the bulk update, save for some very small modifications, is essentially equivalent between MAGMA and our blocked implementation, it follows that the performance is equivalent as well.
Figure 4. NVVP Timelines for N = 4096
In Figure 4 which contains the NVVP timelines associated with N = 4096, we can see that MAGMA achieves improved concurrency and more effectively hides the impact of the offloaded CPU sub-block computations, leading to reduced idle time on the GPU and better performance.
Implementation in complete generalized eigensolver
The Cholesky factorization and blocked zhegst implementation described here comprise the first major operations of a larger generalized eigensolver on the GPU that has been developed for Quantum ESPRESSO, an open-source quantum chemistry code. Similar strategies to reduce dependencies on CPU resources were applied to the remaining operations of the solver to achieve a well performaing GPU solution with limited dependence on the CPU. Figure 5 shows the performance comparisons of the complete solver.
Figure 5. Performance Results for complete eigensolver routine.
More in-depth discussion of the solver and these performance results can be found in our GTC presentation. The source code for the eigensolver is available GitHub at https://github.com/NVIDIA/Eigensolver_gpu.
The availability of the cuSOLVER library within CUDA Fortran is a welcome development for the scientific computing community and broadens the reach of GPUs in tackling new problems. Additionally, our investigation of the zhegst routine on Pascal hardware revealed some performance benefits from switching to a pure GPU implementation over an existing hybrid implementation available in MAGMA. Moving forward, with the increased availability and use of higher performing Volta GPUs on the horizon, it is a worthwhile exercise to investigate existing hybrid computational routines that may see a performance benefit from new pure GPU implementations.
Source files for this article can be found on GitHub at: https://github.com/romerojosh/PGInsider