Technical News from The Portland Group

Using the CULA GPU-enabled LAPACK Library with CUDA Fortran

Overview

CULA from EM Photonics is a GPU-accelerated linear algebra library for NVIDIA CUDA-enabled GPUs. CULA complements the BLAS1, BLAS2 and BLAS3 functions included in NVIDIA's CUBLAS library with over 120 higher-level LAPACK functions, including system solutions, least squares solutions, eigenproblem analysis, singular value decomposition and many others. CULA makes it easier to harness the power of NVIDIA GPUs for the fundamental linear algebra operations that are common to many science and engineering applications. The most popular functions are available for free in the CULA Basic version. (Registration is required.) CULA supports most generations of NVIDIA CUDA-enabled accelerators, including the newest Tesla C2050/C2070 (Fermi architecture) devices. Working in cooperation with PGI, EM Photonics has released a CULA library and SDK compatible with CUDA Fortran.

Moving to CULA is a simple transition from the LAPACK dense linear algebra routines found in AMD's optimized ACML math library, in the compiled Fortran version of LAPACK (both bundled with PGI products for Linux and Windows), or in Intel's optimized MKL math library, which is supported for use with PGI compilers.

Using CULA

To start accelerating LAPACK routines using NVIDIA GPUs, CULA offers developers two different data manipulation interfaces: the host interface and the device interface. Simply put, the host interface works with data located in host memory, and the device interface performs work on data already located on the GPU accelerator.

Host Interface

The host interface is simple to use and requires no GPU programming experience. All initialization, memory management, and kernel invocation are managed by the CULA library routines; no user interaction is required. CULA efficiently transfers data to the GPU, processes it and transfers the results back to the host. For users concerned with transfer times, consider that the work in most LAPACK functions grows as O(n^3) while the data transferred grows only as O(n^2), making the transfer times nearly negligible at even moderate problem sizes.
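To make the scaling argument concrete, the following sketch tabulates the operation count against the bytes moved for a single-precision sgesv-style solve. It assumes the standard (2/3)n^3 operation count for LU factorization plus 2n^2 operations per right-hand side for the triangular solves. It also assumes that the host interface copies A and B to the device and copies A, ipiv and B back; the actual traffic inside CULA may differ. The program name and problem sizes are illustrative.

```fortran
! Rough estimate of arithmetic work versus host/device traffic for a
! single-precision dense solve. The transfer model is an assumption.
program transfer_vs_compute
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: sizes(4) = [256, 1024, 4096, 16384]
  integer, parameter :: nrhs = 1
  real(dp) :: n, flops, bytes
  integer :: i

  print '(a8, a16, a16, a14)', 'n', 'flops', 'bytes moved', 'flops/byte'
  do i = 1, size(sizes)
    n = real(sizes(i), dp)
    ! LU factorization plus forward and back substitution
    flops = (2.0_dp / 3.0_dp) * n**3 + 2.0_dp * n**2 * nrhs
    ! 4-byte elements: A and B in, then A, ipiv and B out
    bytes = 4.0_dp * (2.0_dp * n**2 + 2.0_dp * n * nrhs + n)
    print '(i8, es16.4, es16.4, f14.1)', sizes(i), flops, bytes, flops / bytes
  end do
end program transfer_vs_compute
```

The last column grows roughly linearly with n, which is why the fixed cost of moving the matrices matters less as the problem gets larger.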

The following example shows the traditional CPU method of calling the LAPACK function sgesv. This popular routine solves a system of linear equations using LU factorization with partial pivoting. The example solves the system A*X=B, where A is an N x N coefficient matrix and X and B are N x NRHS matrices: essentially NRHS independent column vectors packed into a single matrix and solved simultaneously.

! call lapack solver
call sgesv(n,nrhs,a,n,ipiv,b,n,info)

In this example, a, b, and ipiv are data arrays in CPU memory. To implement a GPU-accelerated version, replace the call with the CULA function cula_sgesv, which returns a status value in place of the info argument.

! call culapack solver (host memory)
status = cula_sgesv(n,nrhs,a,n,ipiv,b,n)

That's all there is to using GPU-accelerated LAPACK calls in your PGI Fortran programs. We've made available the source code for a complete timer program that shows how to use the CULA host interface, how to initialize CULA, how to process potential error returns and how to benchmark against MKL, ACML, and LAPACK.

The following command lines show how to compile and execute the timer code using PGI CUDA Fortran on 64-bit systems. Future versions of CULA will include this timer program in the examples/ directory of the SDK.

MKL

pgfortran -mp -L$MKL_LIB_PATH_64 -L/usr/local/cula/lib64 cula.cuf -lmkl_core \
-lmkl_intel_lp64 -lmkl_gnu_thread -lgomp -lcula_pgfortran

ACML

pgfortran -mp -L/usr/local/cula/lib64 cula.cuf -lacml_mp -lcula_pgfortran

LAPACK (single-threaded non-optimized)

pgfortran -mp -L/usr/local/cula/lib64 cula.cuf -llapack -lcula_pgfortran

The only difference between a traditional LAPACK call and the CULA host interface call is in error reporting. The CULA function returns a status value that describes GPU errors such as "no GPU" or "out of GPU memory" in addition to the standard LAPACK errors, such as "bad parameter" or "singular matrix", that traditional LAPACK returns in the info parameter. CULA modified the original LAPACK error handling system to allow for the broader range of error returns possible within hybrid computing architectures.
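One way to handle the returned status is to route every call through a small checking helper. The sketch below is plain Fortran and does not call CULA itself. The module status_util and the function call_succeeded are hypothetical names, the status values are made up for the demonstration, and the test against zero assumes that a zero status means success, as the culaNoError code does in the CULA C interface. The meaning of specific nonzero codes should be taken from the CULA documentation.

```fortran
module status_util
  use, intrinsic :: iso_fortran_env, only: error_unit
  implicit none
contains
  ! Return .true. when status signals success (assumed to be zero);
  ! otherwise report the failing routine on the error unit.
  logical function call_succeeded(routine, status)
    character(len=*), intent(in) :: routine
    integer, intent(in) :: status
    call_succeeded = (status == 0)
    if (.not. call_succeeded) then
      write (error_unit, '(3a, i0)') 'error: ', routine, &
        ' returned status ', status
    end if
  end function call_succeeded
end module status_util

program status_demo
  use status_util
  implicit none
  integer :: status

  ! In a real program: status = cula_sgesv(n,nrhs,a,n,ipiv,b,n)
  status = 0
  if (call_succeeded('cula_sgesv', status)) print '(a)', 'solve succeeded'

  ! Made-up nonzero value standing in for a GPU or LAPACK error
  status = 3
  if (.not. call_succeeded('cula_sgesv', status)) then
    print '(a)', 'solve failed; a CPU LAPACK call could be used instead'
  end if
end program status_demo
```

Returning a logical rather than stopping inside the helper leaves the caller free to abort, retry, or fall back to a CPU solver.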

Device Interface

In addition to the host interface, CULA also offers a device interface that uses CUDA Fortran extensions. In this interface, the user is required to allocate and manage memory, but in exchange gains flexibility. This interface is useful, for example, in a closed-loop GPU program where data is used on the GPU before or after a CULA routine. The following example illustrates how to use the CULA Fortran device interface:

! allocate device memory
real, device, dimension(:,:), allocatable :: a_dev, b_dev
integer, device, dimension(:), allocatable :: ipiv_dev

allocate( a_dev(n,n), b_dev(n,nrhs), ipiv_dev(n) )

! copy input to device memory
a_dev = a
b_dev = b

! call culapack solver (device memory)
status = cula_sgesv(n,nrhs,a_dev,n,ipiv_dev,b_dev,n)

! do more work here if desired

! copy output to host memory
b = b_dev

This example demonstrates the added flexibility of the CULA device interface. The host interface must assume that the user might want the A and ipiv values after solving, since these contain the LU decomposition and pivoting information from the original matrix. The device interface instead lets the user control all data transfers between host and device memory. In this case, only the final solution to the system was desired, so the other outputs from the solver were not copied back to host main memory. The example timer program referenced above shows in more detail how to use CULA device interfaces with CUDA Fortran.
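As a sketch of the closed-loop case, the following CUDA Fortran program modifies the coefficient matrix on the device before the solve, so the modified matrix never travels back to the host. The kernel add_diagonal, the module device_prep and the problem size are hypothetical names and values chosen for illustration. The solver call is left as a comment, copied from the example above, so that the program builds without the CULA library; with the call commented out, b comes back unchanged.

```fortran
module device_prep
  use cudafor
  implicit none
contains
  ! Add shift to each diagonal element of an n x n matrix in device memory.
  attributes(global) subroutine add_diagonal(a, n, shift)
    integer, value :: n
    real, value :: shift
    real :: a(n, n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i, i) = a(i, i) + shift
  end subroutine add_diagonal
end module device_prep

program device_pipeline
  use cudafor
  use device_prep
  implicit none
  integer, parameter :: n = 512, nrhs = 1, tpb = 128
  real, allocatable :: a(:,:), b(:,:)
  real, device, allocatable :: a_dev(:,:), b_dev(:,:)
  integer, device, allocatable :: ipiv_dev(:)

  allocate( a(n,n), b(n,nrhs) )
  allocate( a_dev(n,n), b_dev(n,nrhs), ipiv_dev(n) )
  call random_number(a)
  call random_number(b)

  ! copy input to device memory
  a_dev = a
  b_dev = b

  ! work on the GPU before the solve: shift the diagonal of A in place
  call add_diagonal<<<(n + tpb - 1) / tpb, tpb>>>(a_dev, n, real(n))

  ! the device-interface solve would go here:
  ! status = cula_sgesv(n,nrhs,a_dev,n,ipiv_dev,b_dev,n)

  ! copy only the solution back to host memory
  b = b_dev
  print *, 'b(1,1) =', b(1,1)
end program device_pipeline
```

Only b crosses the bus on the way back; a_dev and ipiv_dev stay on the device and remain available to later kernels.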

Compiling for CULA functionality with CUDA Fortran is easy as well. Simply link the special CULA PGI Fortran library as follows:

pgfortran -L/usr/local/cula/lib64 file.cuf -lcula_pgfortran

Performance Comparisons

CULA runs orders of magnitude faster than compiled Fortran LAPACK on a host CPU. It provides as much as a factor of 10 speed-up over the versions of LAPACK included in the highly tuned ACML and MKL libraries running on all the cores of the latest multi-core AMD Opteron and Intel Core i7 Nehalem-class CPUs. Performance graphs generated by EM Photonics compare the various LAPACK routines on NVIDIA Tesla GPUs using CULA versus an Intel Core i7 930 using the MKL.

Other CULA Features

For C++ users, CULA provides another special interface called the bridge interface. This interface simplifies porting existing LAPACK codes to CULA by exactly mimicking the LAPACK interface. Recall that the host interface replaced the info parameter with a status return to allow richer error reporting; the bridge interface leaves the LAPACK interface intact and adds convenient options such as customizable crossovers (that is, if the CPU is better suited to a given problem size, the CPU solves it) and CPU fallback when a GPU is not present or is not compatible with CULA. In an upcoming version of CULA, the bridge interface will be extended for use from PGI Fortran programs. More information on the bridge interface is available on the EM Photonics website.

Learn More About CULA

If you're interested in seeing how CULA can help your PGI Fortran project, please check out the CULA webpage for more information, software downloads, documentation, support forums and the CULA developers blog. The free CULA Basic version is a great way to get started. Contact us directly by sending e-mail to info@culatools.com or by calling (302) 456-9003.