March 2013
Using CULA with PGI Fortran
CULA from EM Photonics is a family of GPU-accelerated linear algebra libraries for NVIDIA CUDA-enabled GPUs. CULA Dense is complementary to the BLAS1, BLAS2 and BLAS3 functions that are included in NVIDIA's CUBLAS library, and features over 350 additional higher-level LAPACK functions including system solutions, least squares solutions, eigenproblem analysis, singular value decomposition and many others. A separate library for sparse matrix solutions, CULA Sparse, is available as well.
CULA makes it easier to harness the power of NVIDIA GPUs for the fundamental linear algebra operations that are common to many science and engineering applications. The most popular functions are available for free to all in the CULA Dense Free version (registration required). Even better, academics can receive a single free license of the complete versions of CULA Dense and CULA Sparse for their personal research!
Using CULA Dense is a simple transition from using the LAPACK dense linear algebra routines found in AMD's ACML math library, in the compiled Fortran version of LAPACK (both bundled with PGI products for Linux and Windows), or in Intel's MKL math library that is supported for use with PGI compilers.
This article will primarily cover the use of CULA libraries with the PGI Fortran compiler.
Using CULA
To start accelerating LAPACK routines using NVIDIA GPUs, CULA Dense offers developers three different data manipulation interfaces: the link-compatible interface, the host interface, and the device interface. Simply put, the Link interface requires no source changes at all, the Host interface works with data located in host memory, and the Device interface performs work on data already located on the GPU accelerator.
Link Interface
The Link interface is a special link-time drop-in replacement for your existing LAPACK library. The Link interface handles all the details involved in using a GPU accelerator: memory management, memory transfer, querying for the presence of a GPU, and so on. Re-linking any existing Fortran program that uses LAPACK is a simple matter of changing your link line entry, from something like ‑lacml_mp to ‑lcula_link. There is zero programming required to try this out!
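As a sketch, the relink might look like the following (the compile line, library path, and CULA_ROOT variable are assumptions for a typical installation; adjust for your setup):

```shell
# before: link against the host LAPACK in ACML
pgfortran myapp.f90 -o myapp -lacml_mp

# after: drop in the CULA Link interface instead
pgfortran myapp.f90 -o myapp -L$CULA_ROOT/lib64 -lcula_link
```

No source changes are needed; only the link line differs between the two builds.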
The Link interface will intercept each LAPACK library call and examine it to determine whether it is a good candidate for execution by CULA. Good candidates are calls with medium to large data sizes that fit within the GPU's memory, on a system where a functional CUDA GPU is present, and for routines that CULA has implemented. If all these conditions are met, the call is forwarded on to CULA Dense for execution. If not, or if any errors are encountered, your existing host LAPACK library is invoked instead! This decision-making process can be steered by the user via environment variables, as described in the documentation.
The performance increase you'll experience from the Link interface will be modest, because every LAPACK routine run through CUDA requires data allocation and transfer; the goal of this interface is simplicity rather than the best possible speed-up.
A complete working program that demonstrates the Link interface can be found in the CULA Dense toolkit, in the folders examples/linkInterface (C language) and examples/linkInterfaceFortran (Fortran language). There is a longer treatment of this topic on the CULA site.
Host Interface
CULA's Host interface is a programmer-oriented interface that uses the GPU to operate on matrix data located in your computer's main memory. The Host interface is guaranteed to attempt to run every function on the GPU (unlike the Link interface), and the routines have slightly different names to avoid conflicting with traditional CPU LAPACK libraries. The CULA equivalent to the routine DGESVD, for example, is cula_dgesvd. For a PGI Fortran compilation, the CULA Host interface can be integrated by a three-step process:
- Compile the necessary modules, cula_status.f90 and cula_lapack_fortran.f90, located in the include/ folder.
- Add the statement "use cula_lapack" to your code, and call CULA routines in your code
- Add CULA to the link line with ‑lcula_core ‑lcula_lapack
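A minimal build sketch of the three steps above (the module file names come from the list; the include and library paths and the CULA_ROOT variable are assumptions for a typical CULA installation):

```shell
# step 1: compile the CULA interface modules shipped in include/
pgfortran -c $CULA_ROOT/include/cula_status.f90
pgfortran -c $CULA_ROOT/include/cula_lapack_fortran.f90

# steps 2 and 3: compile your program (which contains "use cula_lapack")
# and add CULA to the link line
pgfortran myapp.f90 cula_status.o cula_lapack_fortran.o -o myapp \
    -L$CULA_ROOT/lib64 -lcula_core -lcula_lapack
```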
The Host interface is simple to use and requires no GPU programming experience. All memory management and kernel invocations are handled by the CULA library routines; no user interaction with the GPU is required. CULA efficiently transfers data to the GPU, processes it, and transfers the results back to the host. For users concerned with transfer times, note that most LAPACK functions have O(n³) computational complexity, so for a sufficiently large matrix the transfer time is very small compared to the compute time.
The following example shows the traditional CPU method of calling the LAPACK function sgesv. This popular routine solves a system of linear equations using LU (triangular) factorization with partial pivoting. The example solves the system A*X=B, where A is an N x N coefficient matrix and X and B are N x NRHS matrices, essentially multiple independent column vectors packed into a single matrix and solved simultaneously.
! call lapack solver
call sgesv(n,nrhs,a,n,ipiv,b,n,info)
In this example, a, b, and ipiv are data arrays in CPU memory. To implement a GPU accelerated version, change the function name to the CULA routine cula_sgesv.
use cula_lapack

! call cula solver (host memory)
status = cula_sgesv(n,nrhs,a,n,ipiv,b,n)
That's all there is to using GPU accelerated LAPACK calls in your PGI Fortran programs. For a complete program and build scripts, please see the examples/fortranInterface folder in the CULA toolkit.
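Putting the pieces together, a minimal host-interface program might look like the following sketch. The initialization and shutdown calls (cula_initialize, cula_shutdown) and the cula_status module follow the pattern of the toolkit's own Fortran examples, but treat the exact names as assumptions to verify against your CULA version:

```fortran
program solve_with_cula
  use cula_status      ! status codes and initialization (assumed module name)
  use cula_lapack      ! host-interface routines such as cula_sgesv
  implicit none
  integer, parameter :: n = 1024, nrhs = 1
  real :: a(n,n), b(n,nrhs)
  integer :: ipiv(n), status

  call random_number(a)
  call random_number(b)

  status = cula_initialize()              ! bring up the GPU context once
  if (status /= 0) stop 'CULA failed to initialize'

  ! solve A*X = B on the GPU; the solution overwrites b
  status = cula_sgesv(n, nrhs, a, n, ipiv, b, n)
  if (status /= 0) stop 'cula_sgesv reported an error'

  call cula_shutdown()                    ! release GPU resources
end program solve_with_cula
```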
Device Interface
In addition to the Host interface, CULA also offers a Device interface that uses the PGI CUDA Fortran extensions. In this interface, the user is required to allocate and manage GPU memory, but in exchange gains flexibility. This interface is useful, for example, in a closed-loop GPU program where data is used on the GPU before or after a CULA routine. The following example illustrates how to use the CULA Fortran Device interface with the PGI CUDA Fortran feature:
! allocate device memory
real, device, dimension(:,:), allocatable :: a_dev, b_dev
integer, device, dimension(:), allocatable :: ipiv_dev
allocate( a_dev(n,n), b_dev(n,nrhs), ipiv_dev(n) )

! copy inputs to device memory (B holds the right-hand sides, so it is an input too)
a_dev = a
b_dev = b

! call cula solver (device memory)
status = cula_device_sgesv(n,nrhs,a_dev,n,ipiv_dev,b_dev,n)

! do more work here if desired

! copy only the desired output arrays back to host memory
b = b_dev
This example demonstrates the added flexibility of the CULA device interface. Where the Host interface must necessarily assume that the user might be interested in the A and ipiv values after solving (as these contain the LU decomposition and pivoting information from the original matrix), the Device interface allows the user to control all data transfers between host and device memory. In this case, only the final solution to the system was desired and the other outputs from the solver were not copied back to host main memory.
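When the LU factors or pivot indices are also needed on the host, the same control applies; a short sketch continuing the device-memory example above:

```fortran
! copy the factorization outputs back only when they are actually needed
a = a_dev        ! A now holds the combined L and U factors from the solve
ipiv = ipiv_dev  ! pivot indices from partial pivoting
```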
A complete timer program is available in the CULA Dense installation (R16a and newer) examples/fortranPgiCufInterface folder. The user links the CULA library as in all other examples (‑lcula_core ‑lcula_lapack), and must also compile the special PGI CUDA Fortran module located at include/cula_lapack_device_pgfortran.cuf.
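Building that example might look like the following sketch (the -Mcuda flag and the paths are assumptions for a typical PGI setup; the module file name comes from the text above):

```shell
# compile the CUDA Fortran module and the program with CUDA Fortran enabled
pgfortran -Mcuda -c $CULA_ROOT/include/cula_lapack_device_pgfortran.cuf
pgfortran -Mcuda myapp.cuf cula_lapack_device_pgfortran.o -o myapp \
    -L$CULA_ROOT/lib64 -lcula_core -lcula_lapack
```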
Device Interface (without CUDA Fortran)
For the sake of completeness, there is a Fortran Device interface option that does not leverage the PGI CUDA Fortran extensions. This method of GPU coding most closely matches the traditional C-language CUDA programming method, involving the use of raw pointers and low-level memory allocations. While the use of this method in Fortran is possible, it is not recommended. We mention it in this article for those who have existing programs that use this workflow. An example of calling CULA in this context can be found in the CULA Dense toolkit, in the folder examples/fortranDeviceInterface.
Conclusion
If you're interested in seeing how CULA can help your PGI Fortran project, please check out the CULA webpage for more information, software downloads, documentation, support forums and the CULA developers' blog. The CULA Dense Free version is a great way to get started. Contact us directly by sending e-mail to info@culatools.com.