One of the defining features of recent NVIDIA GPUs, including the Tesla V100, is the introduction of Tensor Cores, which are programmable matrix multiply and accumulate units that operate on half-precision (16-bit) multiplicands. Access to programming Tensor Cores in CUDA C became available in the CUDA 9.0 release through the WMMA (Warp Matrix Multiply and Accumulate) API. This paper describes a CUDA Fortran interface to this same functionality. Note that the WMMA interface is a preview feature in CUDA C and subject to change, as is the CUDA Fortran interface described in what follows. Continue reading ⇒
The CUDA Fortran compiler from PGI now supports programming Tensor Cores with NVIDIA’s Volta V100 and Turing GPUs. This enables scientific programmers using Fortran to take advantage of FP16 matrix operations accelerated by Tensor Cores. Let’s take a look at how Fortran supports Tensor Cores. Continue reading ⇒
PGI Community Edition compilers and tools for Linux/x86-64 are available as an Amazon Machine Image (AMI) on the AWS Marketplace, providing a low-cost option for those interested in doing GPU-accelerated computing using Amazon's extensive cloud computing resources. For as little as $3 per hour, you can create your own personal virtualized NVIDIA Volta V100 GPU-enabled system on Amazon's cloud. Just upload your application's source code, build it using the PGI compilers, and run it. This article guides you through the steps necessary to build and run an application using PGI compilers, and demonstrates how GPU-accelerated computing can be cost-effective on Amazon's cloud infrastructure. Continue reading ⇒
This article describes PCAST, a new feature available in the PGI Fortran and C compilers that has been updated in the 19.1 release, with more improvements to come in the future. PCAST has two use cases. The first is designed to be used when testing changes to parts of a program, or testing new compile-time flags, or testing a port to a new compiler or to a new processor. You might want to test whether a new library gives the same result, or whether adding OpenMP parallelism, enabling autovectorization (-Mvect=simd), or porting from an x86 system to OpenPOWER is safe. This use case works by adding pgi_compare calls to your application at places where you want intermediate results to be compared. Those results are saved in a file in the initial golden run, where you know the results are correct. During the test runs, the same calls will compare the computed intermediate results to the saved results and report differences. This is described in more detail in the first section below. Continue reading ⇒
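The golden-run idea can be sketched in a few lines of plain C. This is only an illustration under stated assumptions: `golden_compare` is a hypothetical helper written for this sketch, not the PGI `pgi_compare` routine, and the file layout (raw doubles) and the tolerance argument are likewise assumptions.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, NOT the PGI pgi_compare API: it only illustrates
 * the golden-run idea. If `path` does not exist, the values are saved
 * (golden run) and 0 is returned. Otherwise the saved values are read
 * back and the number of elements that differ by more than `tol` is
 * returned (test run). Returns -1 on an I/O error or a short file. */
int golden_compare(const char *path, const double *data, size_t n, double tol)
{
    assert(path != NULL && data != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        /* Golden run: no reference file yet, so write one. */
        f = fopen(path, "wb");
        if (f == NULL)
            return -1;
        size_t written = fwrite(data, sizeof data[0], n, f);
        fclose(f);
        return written == n ? 0 : -1;
    }

    /* Test run: compare against the saved reference values. */
    int mismatches = 0;
    for (size_t i = 0; i < n; ++i) {
        double ref;
        if (fread(&ref, sizeof ref, 1, f) != 1) {
            fclose(f);
            return -1;
        }
        double diff = ref - data[i];
        if (diff < 0.0)
            diff = -diff;
        if (diff > tol)
            ++mismatches;
    }
    fclose(f);
    return mismatches;
}
```

The first call with a given file name records the reference data; later calls with the same name report how many values drifted, which is the pattern the comparison calls follow between the golden run and the test runs.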
This article will describe a new feature in the PGI OpenACC compilers for GPU computing to support deep copy. The compilers already support manual deep copy, and the PGI Fortran compilers already support automatic full deep copy. Now, with PGI 18.7, we are delivering a first implementation of true deep copy, which allows the flexibility of manual deep copy with less work by the programmer. Continue reading ⇒
Prior to the CUDA 9.0 Toolkit, synchronization between device threads was limited to using the syncthreads() subroutine to perform a barrier synchronization amongst all threads in a thread block. To synchronize larger groups of threads, one would have to break a kernel into multiple smaller kernels where the completion of these smaller kernels was in effect a synchronization barrier for all the threads in a grid. With the cooperative groups feature introduced in the CUDA 9.0 toolkit, one can now synchronize groups of threads both larger and smaller than thread blocks on supported hardware, making coding easier and the resultant code more efficient for cases that naturally map to such synchronization. This post is the first in a series that describes the features of cooperative groups as implemented in CUDA Fortran in the PGI compilers. Continue reading ⇒
One of the most important issues when programming a system with a GPU or any attached accelerator is managing data movement between host memory and device memory. This is a special case of managing a memory hierarchy, a problem that has been with us since the dawn of computing. The very first computer I programmed was an IBM 360/75 at the University of Illinois which had a whole megabyte of magnetic core memory, composed of 400KB of fast core and 600KB of slow core. Systems in those days didn't support virtual memory, so programmers were responsible for staging data from disk (or tape) storage to memory and back, and for telling the operating system the maximum amount of memory the program would use (or on this system, the maximum amount of fast and of slow core that the program would use). Continue reading ⇒
Any software developer knows that developing and maintaining a large software project can be a difficult task, particularly when documentation is sparse and the original developers are unavailable. Sometimes a small change causes unexpected problems elsewhere in a program, and they can be very difficult to track down in a large codebase. Other times, one might want to explain how a program works to a new developer, and it would be very useful to have a visual representation of the program’s internal structure. When adapting existing HPC applications to OpenMP or OpenACC, sometimes it is difficult to prioritize what exactly should be parallelized first. These tasks can be difficult, labor intensive, or even impossible to do by hand. Continue reading ⇒
General GPU Programming
The PGI 18.7 release has a new feature intended to improve your experience for common cases when building programs for GPU computing. This affects both OpenACC (-acc and -ta=tesla) and CUDA Fortran. With PGI 18.7, the compiler will detect the CUDA driver and GPU compute capabilities on your system, and use the CUDA toolkit corresponding to that driver and generate code for the GPU (or GPUs) installed on your system. This matches the behavior that the PGI compilers have had for years with respect to the CPU version, where the compiler detects the type of CPU on which you are building your program (Haswell, Broadwell, Skylake, Zen, …) and optimizes and generates code for that CPU.
Continue reading ⇒
General purpose parallel programming on GPUs is a relatively recent phenomenon. GPUs were originally hardware blocks optimized for a small set of graphics operations. As demand arose for more flexibility, GPUs became increasingly programmable. Early approaches to computing on GPUs cast computations into a graphics framework, allocating buffers (arrays) and writing shaders (kernel functions). Several research projects looked at designing languages to simplify this task; in late 2006, NVIDIA introduced its CUDA architecture and tools to make data parallel computing on a GPU more straightforward. Not surprisingly, the data parallel features of CUDA map pretty well to the data parallelism available on NVIDIA GPUs. Here, we'll describe the data parallelism model supported in CUDA and the latest NVIDIA Kepler GPUs.
Why should PGI users want to understand the CUDA threading model? Clearly, PGI CUDA Fortran users should want to learn enough to tune their kernels. Programmers using the directive-based PGI Accelerator Compilers with OpenACC will also find it instructive in order to understand and use the compiler feedback (-Minfo messages) indicating which loops were scheduled to run in parallel or vector mode on the GPU; it's also important to know how to tune performance using the loop mapping clauses. So, let's start with an overview of the hardware in today's NVIDIA GPUs. Continue reading ⇒
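As a rough picture of that threading model, the sketch below emulates a one-dimensional CUDA launch on the host in plain C: the outer loop stands in for thread blocks and the inner loop for the threads of one block. The function name `saxpy_emulated` and its parameters are made up for this illustration, and on a real GPU the iterations run in parallel rather than in loop order.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side emulation of a 1-D CUDA launch: every (block, thread) pair
 * handles at most one element, located through its global index. */
void saxpy_emulated(size_t n, float a, const float *x, float *y,
                    size_t threads_per_block)
{
    assert(threads_per_block > 0);

    /* Round up so the last, partly filled block is still launched. */
    size_t num_blocks = (n + threads_per_block - 1) / threads_per_block;

    for (size_t block = 0; block < num_blocks; ++block) {
        for (size_t thread = 0; thread < threads_per_block; ++thread) {
            /* Zero-based global index, as in CUDA C. */
            size_t i = block * threads_per_block + thread;
            if (i < n)          /* guard: surplus threads do nothing */
                y[i] = a * x[i] + y[i];
        }
    }
}
```

In CUDA Fortran the block and thread indices start at 1, so the equivalent index is `(blockIdx%x - 1) * blockDim%x + threadIdx%x`, compared against `n` with `<=`.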
Today's high performance systems are trending towards using highly parallel accelerators to meet performance goals and power and price limits. The most popular compute accelerators today are NVIDIA GPUs. Intel Xeon Phi coprocessors and AMD Radeon GPUs are competing for that same market, meaning we will soon be programming and tuning for a wider variety of host + accelerator systems.
We want to avoid writing a different program for each type of accelerator. There are at least three current options for writing a single program that targets multiple accelerator types. One is to use a library, which works really well if the library contains all the primitives your application needs. Solutions built on class libraries with managed data structures are really another method to implement libraries, and again work well if the primitives suit your application. The potential downside is that you depend on the library implementer to support each of your targets now and in the future. Continue reading ⇒
The OpenACC API has two compute constructs, the kernels construct and the parallel construct. This article describes the differences between the two and use cases for each. To simplify the discussion, I will focus this article only on OpenACC gang and vector parallelism. OpenACC also supports worker parallelism, but that will be the subject of yet another article. I will also ignore any data movement by assuming that all required data is
One of the key features of high level language programming is modularity, including support for procedures and separate compilation. It's hard to imagine modern programming without functions and libraries. The term compiler was originally used to define the software that compiled separately created external objects into a single binary, what we now call a linker. Yet, until recently, OpenACC programs could only support procedures through inlining, more or less preventing any use of libraries or procedure calls across multiple files.
With the latest releases, PGI now supports procedure calls, separate compilation and linking for OpenACC programs targeting NVIDIA GPU accelerators. This article introduces this very important feature and how to use the acc routine directive to enable it. I will also present hints on how to use the clauses on the routine directive, including reasons for why the clauses are necessary, and some caveats and current limitations. Continue reading ⇒
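A minimal sketch of the directive in C, assuming a simple function that needs no parallelism of its own and is therefore marked `seq`; the names `scale_and_shift` and `transform` are invented for this example. A compiler without OpenACC support ignores the pragmas and builds ordinary host code.

```c
#include <assert.h>
#include <stddef.h>

/* Ask an OpenACC compiler to also build a sequential device version of
 * this function so it can be called from inside a compute region. */
#pragma acc routine seq
float scale_and_shift(float v, float scale, float shift)
{
    return scale * v + shift;
}

void transform(size_t n, const float *in, float *out)
{
    assert(in != NULL && out != NULL);

    /* Each iteration makes a real procedure call rather than relying
     * on the compiler to inline the callee. */
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (size_t i = 0; i < n; ++i)
        out[i] = scale_and_shift(in[i], 2.0f, 1.0f);
}
```

If `scale_and_shift` lived in a different source file, its prototype in the calling file would need the same `acc routine seq` directive so that caller and callee agree on how the device version was built.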
In Part 1 I introduced the OpenACC routine directive and its use to enable true procedure calls and separate compilation in OpenACC programs. This article will discuss a few more advanced issues: support for global variables and the acc declare directive, interfacing to CUDA C and CUDA Fortran device functions using acc routine declarations, and using acc routine in C++ class member functions. Continue reading ⇒
OpenACC is designed as a parallel programming model that can deliver high performance on a wide range of systems, including accelerated systems with GPUs, multicore CPUs, and manycore processors. Until recently, PGI has focused its OpenACC development efforts on the NVIDIA Tesla and AMD Radeon GPU targets. Performance on these two different GPUs is comparable using OpenACC, but until now performance portability of OpenACC on conventional multicore microprocessors has not been demonstrated. That changes with the latest release of the PGI Accelerator compilers. Continue reading ⇒
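The kind of loop this portability applies to can be sketched as follows. The function `dot` is an illustrative example, not code from the article; the assumption is that the same source is built either for a GPU target or for the host cores (with the PGI compilers, by choosing a different `-ta=` target such as `-ta=tesla` or `-ta=multicore`).

```c
#include <assert.h>
#include <stddef.h>

/* Dot product written once with an OpenACC directive. The reduction
 * clause lets the compiler combine per-thread partial sums safely on
 * whichever target the code is compiled for. */
double dot(size_t n, const double *x, const double *y)
{
    assert(x != NULL && y != NULL);

    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n])
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

On a multicore target the data clauses involve no actual copying, since host and "device" share the same memory, so the directive costs nothing there.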
This post contains material that is helpful in porting applications to CUDA Fortran. The techniques presented here are not optimization techniques per se, rather they are coding techniques and styles that can reduce the effort involved in porting applications to CUDA Fortran.
One goal we pursue when porting code to CUDA Fortran is to leave the CPU code intact, so that the same source code can be compiled to run on either the CPU or GPU by the absence or presence of the -Mcuda compiler option. This can be easily done with preprocessor macros, but we strive to minimize both the code that gets duplicated and the maintenance headaches that arise from such code duplication.
Continue reading ⇒
The cuSOLVER library was first included with the CUDA 7.0 toolkit. As a collection of libraries, cuSOLVER provides useful LAPACK-like features. Its list of features is growing with each release. The current version offers common matrix factorization and triangular solve routines for dense matrices, a sparse least-squares solver and an eigenvalue solver. In addition, cuSOLVER provides a new refactorization library useful for solving sequences of matrices with a shared sparsity pattern.
Currently, CUDA Fortran does not have a ready-to-use interface module to the cuSOLVER library. This article shows how to write such an interface and presents two examples of its use: first, a simple Cholesky factorization, and second, the reduction of a generalized eigenproblem to a standard one. Continue reading ⇒
NVIDIA CUDA is a general purpose parallel programming architecture with compilers and libraries to support programming of NVIDIA GPUs. The CUDA SDK includes an extended C compiler, here called CUDA C, allowing GPU programming from a high level language. The CUDA programming model supports four key abstractions: cooperating threads organized into thread groups, shared memory and barrier synchronization within thread groups, and coordinated independent thread groups organized into a grid.
PGI and NVIDIA defined CUDA Fortran, which is supported in the upcoming PGI 2010 release, to enable CUDA programming directly in Fortran. CUDA Fortran is a small set of extensions to Fortran that supports and is built upon CUDA. The extensions allow the following actions in a Fortran program:
- Declaration of variables that reside in GPU device memory
- Dynamic allocation of data in GPU device memory
- Copying of data from host memory to GPU memory, and back
- Invocation of GPU subroutines from the host
A CUDA programmer partitions a program into coarse grain blocks that can be executed in parallel. Each block is partitioned into fine grain threads, which can cooperate using shared memory and barrier synchronization. A properly designed CUDA program will run on any CUDA-enabled GPU, regardless of the number of available processor cores. This article will teach you the basics of CUDA Fortran programming and enable you to quickly begin writing your own CUDA Fortran programs. Continue reading ⇒
Using managed memory simplifies many coding tasks, makes source code cleaner, and enables a unified view of complicated data structures across host and device memories.
A good explanation of Unified Memory can be found on the NVIDIA Parallel Forall blog, and Appendix J of the CUDA C Programming Guide gives a detailed explanation. Unified Memory provides a way to obtain a single pointer to user allocated data that can be used in both host and device code. Unlike zero-copy memory, managed memory is not pinned and static, but migrates between the host and device on access. The system (driver and OS) controls the physical page movement. Specifically, in CUDA C/C++, managed memory usage consists of replacing calls to cudaMalloc() with cudaMallocManaged(), and removing explicit uses of cudaMemcpy() to transfer data between host and device.
In CUDA Fortran, we've added the managed keyword to the language, which can be used in host code similarly to the device keyword. Here's an example of how we've used it to simplify the host code in our sgemm test, a CUDA Fortran example we've used frequently here at PGI: Continue reading ⇒
In CUDA Fortran, data transfers in either direction between the host and device using Fortran assignment statements or the function cudaMemcpy() are blocking transfers; that is, control is returned to the host thread only after the data transfer is complete. The cudaMemcpyAsync() function is a non-blocking variant in which control is returned immediately to the host thread. In contrast with assignment statements or cudaMemcpy(), the asynchronous transfer version requires pinned host memory, and it contains an additional argument, a stream ID. A stream is simply a sequence of operations that are performed in order on the device. Operations in different streams can be interleaved and in some cases overlapped—a property that can be used to hide data transfers between the host and the device. Continue reading ⇒
This short post summarizes changes in the behavior of Fortran Allocatable arrays and array assignments in the PGI 18.7 release. You may be particularly interested in this if you use OpenACC and Fortran Allocatable array assignments in your device code.
Continue reading ⇒
Polymorphism is a term used in software development to describe a variety of techniques employed by programmers to create flexible and reusable software components. The term is Greek and it loosely translates to "many forms".
In programming languages, a polymorphic object is an entity, such as a variable or a procedure, that can hold or operate on values of differing types during the program's execution. Because a polymorphic object can operate on a variety of values and types, it can also be used in a variety of programs, sometimes with little or no change by the programmer. The idea of write once, run many, also known as code reusability, is an important characteristic of the programming paradigm known as Object-Oriented Programming (OOP).
OOP describes an approach to programming where a program is viewed as a collection of interacting, but mostly independent software components. These software components are known as objects in OOP and they are typically implemented in a programming language as an entity that encapsulates both data and procedures. Continue reading ⇒
This is the second part of a series of articles that explore Object-Oriented Programming (OOP) in Fortran 2003 (F2003). The first installment introduced the OOP paradigm and three important features of OOP: inheritance, polymorphism, and information hiding. F2003 supports inheritance through type extension, polymorphism through its CLASS keyword, and information hiding through its PUBLIC/PRIVATE keywords/binding-attributes.
There are two basic types of polymorphism: procedure polymorphism and data polymorphism. Part one of the series covered procedure polymorphism, which deals with procedures that can operate on a variety of data types and values. Data polymorphism, a topic for this article, deals with program variables that can store and operate on a variety of data types and values.
In addition to data polymorphism, we will also examine F2003's typed allocation, sourced allocation, unlimited polymorphic objects, generic type-bound procedures, abstract types, and deferred bindings.
Continue reading ⇒
Part three introduces an F2003 feature known as parameterized derived types. Parameterized derived types allow the programmer to create derived types that take one or more values, known as parameters, to specify characteristics of the data encapsulated by the derived type. These parameters are supplied by the user of the derived type to specify the kind and/or amount of data needed by the derived type. We will first look at features and syntax of parameterized derived types. Then we will conclude our discussion with a case study that uses parameterized derived types to create general purpose matrices. Continue reading ⇒
In this installment, we discuss User-Defined Derived Type Input/Output (UDTIO). UDTIO allows the programmer to specify how a derived type is read (or written) from (or to) a file. One of the main purposes for providing UDTIO is so the user of an object can perform Input/Output (I/O) operations without any knowledge of the object's layout. For example, the derived type may have private components which are not directly accessible with traditional I/O operations. Continue reading ⇒
Tutorials and Hands-On
Since the introduction of PGI CUDA Fortran late last year, we've seen a dramatic rise in the number of customers using this new extension to the Fortran language. As the moderator of the PGI User Forum, I have been very busy answering questions about the language, and noting those questions that seem to be asked often or may be of interest to the wider community. For this installment of the PGInsider, I have implemented the Monte Carlo Integration algorithm to highlight some of the tips, tricks, and traps of programming for the GPU. Continue reading ⇒