PGI CUDA Fortran Compiler
Graphics processing units (GPUs) have evolved into programmable, highly parallel computational units with very high memory bandwidth. GPU designs are optimized for the computations found in graphics rendering, but are general enough to be useful in many of the data-parallel, compute-intensive programs common in high-performance computing (HPC).
CUDA™ is a parallel computing platform and programming model for graphics processing units (GPUs). The original CUDA programming environment consisted of an extended C compiler and tool chain, known as CUDA C. CUDA C allowed direct programming of the GPU from a high-level language.
In mid-2009, PGI and NVIDIA cooperated to develop CUDA Fortran. CUDA Fortran includes a Fortran 2003 compiler and tool chain for programming NVIDIA GPUs using Fortran. Available in PGI 2010 and later releases, CUDA Fortran is supported on Linux, macOS and Windows.
A free trial of CUDA Fortran is available as part of the standard PGI Fortran download packages. These packages include installation and configuration information, along with the CUDA Fortran Programming Guide and Reference.
CUDA supports four key abstractions: cooperating threads organized into thread groups, shared memory, barrier synchronization within thread groups, and coordinated independent thread groups organized into a grid. A CUDA programmer must partition the program into coarse-grained blocks that can be executed in parallel. Each block is partitioned into fine-grained threads, which can cooperate using shared memory and barrier synchronization. A properly designed CUDA program will run on any CUDA-enabled GPU, regardless of the number of available processor cores.
When called from the host Fortran program, subroutines defined in CUDA Fortran execute in parallel on the GPU. Calls to such subroutines, also known as kernels, specify how many parallel instances of the kernel to execute. Each instance is executed by a CUDA thread. CUDA threads are organized into thread blocks, so each thread has a thread block index that locates its block within the grid and a local thread index within that block.
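As a minimal sketch of the kernel model described above, the following CUDA Fortran program defines one kernel and launches it from the host. The names `saxpy_mod`, `saxpy`, `x_d`, `y_d` and the choice of 256 threads per block are illustrative assumptions, not taken from the PGI documentation. It assumes the source is saved with a `.cuf` extension (or compiled with CUDA Fortran enabled, for example `-Mcuda` with the PGI compilers).

```fortran
module saxpy_mod
  use cudafor
  implicit none
contains
  ! Kernel: attributes(global) marks a subroutine that runs on the GPU.
  ! Each CUDA thread updates at most one element of y.
  attributes(global) subroutine saxpy(x, y, a)
    real :: x(:), y(:)        ! dummy arrays of a kernel reside in device memory
    real, value :: a          ! scalars are passed by value
    integer :: i, n
    n = size(x)
    ! Combine the thread block index and the local thread index
    ! (both 1-based in CUDA Fortran) into one global element index.
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) y(i) = a * x(i) + y(i)   ! guard threads past the array end
  end subroutine saxpy
end module saxpy_mod

program main
  use cudafor
  use saxpy_mod
  implicit none
  integer, parameter :: n = 100000     ! problem size (assumed)
  integer, parameter :: tpb = 256      ! threads per block (assumed)
  real :: x(n), y(n)                   ! host arrays
  real, device :: x_d(n), y_d(n)       ! device arrays
  integer :: nblocks

  x = 1.0
  y = 2.0
  x_d = x                              ! host-to-device copies by assignment
  y_d = y

  ! Enough blocks to cover all n elements.
  nblocks = (n + tpb - 1) / tpb

  ! The chevron syntax states how many blocks and threads per block to run.
  call saxpy<<<nblocks, tpb>>>(x_d, y_d, 2.0)

  y = y_d                              ! device-to-host copy
  print *, 'max error: ', maxval(abs(y - 4.0))
end program main
```

The launch requests `nblocks * tpb` threads, which can exceed `n`, so the `if (i <= n)` guard in the kernel keeps the surplus threads from writing out of bounds.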
- CUDA Fortran Programming Guide and Reference (1.3MB PDF)
- CUDA Fortran Quick Reference Card (updated Nov 2016, 144KB PDF)
- CUDA Fortran Porting Guide (updated Nov 2016, 173KB PDF)
- CUDA Fortran for Scientists and Engineers by Ruetsch & Fatica, 2013
- Introduction to CUDA Fortran article
- CUDA Fortran Data Management article
- CUDA Fortran Device Kernels article
- CUDA Fortran Asynchronous Data Transfers article
- Tuning a Monte Carlo Algorithm on GPUs tutorial article
- Porting the SPEC Benchmark BWAVES to GPUs with CUDA Fortran tutorial article
- Using the CULA GPU-enabled LAPACK Library with CUDA Fortran article
- Using GPU-enabled Math Libraries with PGI Fortran article
- Calling Thrust from CUDA Fortran article
- User study: UQAC Leverages CUDA Fortran to Optimize Innovative Aluminum Welding Techniques (Nov. 2015, 630KB)
- Greg Ruetsch's CUDA Fortran posts on the NVIDIA Parallel Forall blog
- Fortran on GPUs, Dr. Lars Koesterke, Texas Advanced Computing Center, Feb. 2011
- NVIDIA CUDA Zone website
Q What is the difference between OpenACC and CUDA Fortran? Why do we need both models?
A The OpenACC specification is for a high-level implicit programming model for host+accelerator systems, similar to OpenMP for multi-core systems. OpenACC:
- Enables offloading of compute-intensive loops and code regions from a host CPU to a GPU accelerator using simple compiler directives
- Implements directives as Fortran comments and C pragmas, so programs can remain 100% standard-compliant and portable
- Makes GPGPU programming and optimization incremental and accessible to application domain experts
- Is supported in the PGI Fortran, C and C++ compilers
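To illustrate the directive-based approach, here is a minimal sketch of a loop offloaded with OpenACC. The program name, array sizes and data clauses are illustrative assumptions. Because the directive is a Fortran comment, a compiler without OpenACC enabled simply runs the loop on the host; with the PGI compilers, OpenACC is typically enabled with the `-acc` flag.

```fortran
program acc_saxpy
  implicit none
  integer, parameter :: n = 100000   ! problem size (assumed)
  real :: x(n), y(n)
  real :: a
  integer :: i

  a = 2.0
  x = 1.0
  y = 2.0

  ! Ask the compiler to offload the loop: x is copied to the accelerator,
  ! y is copied there and back. To a compiler without OpenACC support,
  ! the next line is an ordinary comment.
  !$acc parallel loop copyin(x) copy(y)
  do i = 1, n
     y(i) = a * x(i) + y(i)
  end do

  print *, 'y(1) =', y(1), ' y(n) =', y(n)
end program acc_saxpy
```

No kernel, launch configuration or device array appears in the source; the compiler derives them from the directive, which is what makes the model implicit.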
CUDA Fortran is an analog to NVIDIA's CUDA C compiler. CUDA C and CUDA Fortran are lower-level explicit programming models with substantial runtime library components that give expert programmers direct control of all aspects of GPGPU programming. For example:
- Initialization of a CUDA-enabled NVIDIA GPU
- Splitting up of source code into host CPU code and GPU compute kernel code in appropriately defined functions and subprograms
- Allocation of page-locked host memory, GPU device main memory, GPU constant memory and GPU shared memory
- All data movement between host main memory and the various types of GPU device memory
- Definition of (multi-dimensional) thread/block grids and launching of compute kernels for execution on CUDA GPUs
- Synchronization of threads within a CUDA thread group
- Asynchronous launch of GPU compute kernels and synchronization with host CPU execution
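The items above can be sketched in one small CUDA Fortran program. This is an illustrative example under stated assumptions rather than PGI's reference code: the names `block_sum_mod`, `block_sum`, `scale_c` and `partial_d`, the device number 0, and the block size of 256 threads (a power of two, which the reduction loop relies on) are all hypothetical choices.

```fortran
module block_sum_mod
  use cudafor
  implicit none
  integer, parameter :: tpb = 256     ! threads per block (assumed, power of two)
  real, constant :: scale_c           ! GPU constant memory, set from the host
contains
  ! GPU kernel code: each block sums up to tpb scaled elements of x
  ! into one element of partial.
  attributes(global) subroutine block_sum(x, partial)
    real :: x(:), partial(:)
    real, shared :: s(tpb)            ! GPU shared memory, one copy per block
    integer :: t, i, stride

    t = threadIdx%x
    i = (blockIdx%x - 1) * blockDim%x + t
    if (i <= size(x)) then
       s(t) = scale_c * x(i)
    else
       s(t) = 0.0
    end if
    call syncthreads()                ! barrier across the thread block

    ! Tree reduction in shared memory; every thread runs every iteration,
    ! so all threads reach each barrier.
    stride = tpb / 2
    do while (stride >= 1)
       if (t <= stride) s(t) = s(t) + s(t + stride)
       call syncthreads()
       stride = stride / 2
    end do
    if (t == 1) partial(blockIdx%x) = s(1)
  end subroutine block_sum
end module block_sum_mod

program main                          ! host CPU code
  use cudafor
  use block_sum_mod
  implicit none
  integer, parameter :: n = 1024 * 1024
  integer, parameter :: nblocks = (n + tpb - 1) / tpb
  real, pinned, allocatable :: x(:)                   ! page-locked host memory
  real, device, allocatable :: x_d(:), partial_d(:)   ! GPU device main memory
  real :: partial(nblocks)
  type(dim3) :: grid, tBlock
  integer :: istat

  istat = cudaSetDevice(0)            ! select and initialize a GPU (device 0 assumed)
  allocate(x(n), x_d(n), partial_d(nblocks))

  x = 1.0
  x_d = x                             ! host-to-device transfer
  scale_c = 0.5                       ! host write to constant memory

  grid = dim3(nblocks, 1, 1)          ! grid and block shapes can be multi-dimensional
  tBlock = dim3(tpb, 1, 1)
  call block_sum<<<grid, tBlock>>>(x_d, partial_d)

  ! The launch returns immediately; the host could do other work here
  ! before waiting for the GPU.
  istat = cudaDeviceSynchronize()

  partial = partial_d                 ! device-to-host transfer
  print *, 'sum =', sum(partial), ' expected =', 0.5 * real(n)
  deallocate(x, x_d, partial_d)
end program main
```

Every step that OpenACC would infer from a directive is written out here: where each array lives, when data moves, how the grid is shaped, and where threads and the host synchronize.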
Together, OpenACC and CUDA Fortran let programmers accelerate applications using a high-level implicit model and drop into the lower-level explicit model of CUDA Fortran where needed.
OpenACC is included with all PGI Fortran, C and C++ compilers. CUDA Fortran is included with all PGI Fortran compilers.