Graphics processing units (GPUs) have evolved into programmable, highly parallel computational units with very high memory bandwidth. GPU designs are optimized for the computations found in graphics rendering, but are general enough to be useful in many data-parallel, compute-intensive programs common in high-performance computing (HPC).
CUDA is the architecture of the NVIDIA line of GPUs. Currently, the CUDA programming environment comprises an extended C compiler and tool chain, known as CUDA C. CUDA C allows direct programming of the GPU from a high-level language.
PGI and NVIDIA have worked in cooperation to develop CUDA Fortran. CUDA Fortran includes a Fortran 2003 compiler and tool chain for programming NVIDIA GPUs using Fortran. Available in PGI 2010 and later releases, CUDA Fortran is supported on Linux, Mac OS X and Windows.
A free 15-day trial of CUDA Fortran is available as part of the standard PGI Fortran download packages. Installation and configuration information, along with the CUDA Fortran Programming Guide and Reference, are included in the download packages. Either a current PGI license or a PGI trial license is required to enable the software.
CUDA supports four key abstractions: cooperating threads organized into thread groups, shared memory within thread groups, barrier synchronization within thread groups, and coordinated independent thread groups organized into a grid. A CUDA programmer partitions the program into coarse-grained blocks that can be executed in parallel. Each block is partitioned into fine-grained threads, which can cooperate using shared memory and barrier synchronization. A properly designed CUDA program will run on any CUDA-enabled GPU, regardless of the number of available processor cores.
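For illustration, the sketch below (written in the CUDA Fortran syntax described in the next paragraph) shows threads within one block staging data in shared memory and synchronizing at a barrier before combining their results. It is a minimal sketch, not code from the CUDA Fortran documentation; the kernel name, its arguments, and the fixed block size of 256 are assumptions of this example.

    ! Illustrative sketch: threads within one block cooperate through
    ! shared memory and barrier synchronization to sum one 256-element
    ! slice of x. Assumes blockdim%x == 256 and that size(x) is a
    ! multiple of 256; blocksum, x, and partial are hypothetical names.
    attributes(global) subroutine blocksum(x, partial)
      real :: x(*), partial(*)
      real, shared :: s(256)            ! visible to every thread in the block
      integer :: t, i, stride
      t = threadidx%x                   ! local thread index within the block
      i = (blockidx%x - 1) * blockdim%x + t
      s(t) = x(i)
      call syncthreads()                ! barrier: all loads are complete
      stride = blockdim%x / 2
      do while (stride >= 1)            ! tree reduction within the block
        if (t <= stride) s(t) = s(t) + s(t + stride)
        call syncthreads()
        stride = stride / 2
      end do
      if (t == 1) partial(blockidx%x) = s(1)
    end subroutine blocksum

Each block writes one partial sum, independently of every other block, so the same code runs unchanged whether the GPU executes the blocks one at a time or many at once.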
CUDA Fortran allows the definition of Fortran subroutines that execute in parallel on the GPU when called from a Fortran program running on the host. Such a subroutine is called a device kernel, or simply a kernel. A call to a kernel specifies how many parallel instances of it must be executed; each instance is executed by a different CUDA thread. The CUDA threads are organized into thread blocks; each thread has a global thread block index and a local thread index within its thread block.
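As a minimal sketch of a kernel definition and its launch (the name saxpy, the arguments, and the 256-thread block size are illustrative assumptions, not from the PGI documentation):

    module kernels
    contains
      ! Each parallel instance of this kernel is executed by one CUDA thread.
      attributes(global) subroutine saxpy(n, a, x, y)
        integer, value :: n
        real, value :: a
        real :: x(*), y(*)
        integer :: i
        ! Combine the global thread block index with the local thread index
        ! to compute this thread's element (both indices are 1-based).
        i = (blockidx%x - 1) * blockdim%x + threadidx%x
        if (i <= n) y(i) = a * x(i) + y(i)
      end subroutine saxpy
    end module kernels

A host program that uses this module can then launch the kernel; the chevron syntax between the kernel name and its argument list specifies the launch configuration, here enough 256-thread blocks to cover all n elements:

    call saxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x_d, y_d)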
Please also see the PGI Accelerator Programming user forum for additional questions and answers.
Q What is the difference between the PGI Accelerator programming model and CUDA Fortran? Why do we need both models?
A The PGI Accelerator programming model is a high-level implicit programming model for x64+GPU systems, similar to OpenMP for multi-core x64 systems. In the PGI Accelerator model, the programmer marks compute-intensive regions of a standard program with directives, and the compiler generates the GPU kernels and manages data movement between host and GPU memory.
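As a sketch of the implicit style (the loop body and variable names are illustrative), a compute region can be offloaded with directives alone:

    ! Implicit model sketch: the directives ask the compiler to generate
    ! a GPU kernel for this loop and to handle the host-GPU data
    ! movement for x and y.
    !$acc region
    do i = 1, n
       y(i) = a * x(i) + y(i)
    end do
    !$acc end region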
CUDA Fortran is an analog to NVIDIA's CUDA C compiler. CUDA C and CUDA Fortran are lower-level explicit programming models with substantial runtime library components that give expert programmers direct control of all aspects of GPGPU programming. For example, the programmer explicitly declares data in GPU device memory, manages transfers between host and device memory, and configures and launches kernels.
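By contrast, a CUDA Fortran sketch of the same operation makes each of those steps explicit. This assumes the saxpy kernel module sketched earlier; the array names are illustrative:

    ! Explicit model sketch: the programmer declares device arrays,
    ! copies data with array assignment, and launches the kernel.
    real, device, allocatable :: x_d(:), y_d(:)
    allocate(x_d(n), y_d(n))
    x_d = x                                 ! host-to-device copy
    y_d = y
    call saxpy<<<(n + 255) / 256, 256>>>(n, a, x_d, y_d)
    y = y_d                                 ! device-to-host copy
    deallocate(x_d, y_d)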
Together, the PGI Accelerator model and CUDA Fortran enable PGI users to port applications using a high-level implicit model, and to drop into the lower-level explicit model of CUDA Fortran where needed.
The PGI Accelerator programming model is included with the PGI Accelerator Fortran and C99 compilers. CUDA Fortran is included with the PGI Accelerator Fortran compilers.