PGI Compilers with OpenACC Directives
Overview
Using PGI compilers, programmers can accelerate applications on CPU+accelerator platforms by adding OpenACC compiler directives to existing high-level standard-compliant Fortran, C and C++ programs and then recompiling with appropriate compiler options.
Sample Fortran matrix multiplication loop, tagged to be compiled for an accelerator.
```fortran
!$acc kernels
do k = 1, n1
  do i = 1, n3
    c(i,k) = 0.0
    do j = 1, n2
      c(i,k) = c(i,k) + a(i,j) * b(j,k)
    enddo
  enddo
enddo
!$acc end kernels
```
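Assuming the loop above is saved in a file named mm.f90 (a name chosen here for illustration), a typical build for an NVIDIA GPU needs only the OpenACC flags; `-Minfo=accel` makes the compiler report how it parallelized each loop:

```
pgfortran -acc -ta=tesla -Minfo=accel mm.f90
```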
How They Work
Until now, developers targeting HPC accelerators have had to rely on language extensions to their programs. CPU+accelerator programmers have been required to program at a detailed level, including understanding and specifying data usage information and manually constructing sequences of calls to manage all movement of data between the CPU host and the accelerator.

PGI compilers automatically analyze whole program structure and data. Guided by a standard set of user directives, they split portions of the application between the host CPU and the accelerator device, and they define and generate an optimized mapping of loops that automatically uses the parallel cores, hardware threading capabilities and SIMD vector capabilities of modern accelerators.

In addition to directives and pragmas that specify regions of code or functions to be accelerated, other directives give the programmer fine-grained control over the mapping of loops, allocation of memory, and optimization for the accelerator memory hierarchy. The PGI compilers generate unified object files and executables that manage all movement of data to and from the accelerator, leverage all existing host-side utilities (linker, librarians, makefiles), and require no changes to the existing standard HPC Linux programming environment.
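As an illustration of that fine-grained control, here is a minimal Fortran sketch (the array names, shapes and loop bounds are illustrative, not from any particular application) that uses a data construct to keep arrays resident on the device across a loop nest, and loop directives to choose the gang/vector mapping:

```fortran
! Keep a, b and c on the device for the duration of the region,
! then map the outer loop to gangs and the inner loop to vector lanes.
!$acc data copyin(a, b) copyout(c)
!$acc kernels
!$acc loop gang
do k = 1, n1
   !$acc loop vector
   do i = 1, n3
      c(i,k) = a(i,k) + b(i,k)
   enddo
enddo
!$acc end kernels
!$acc end data
```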
Resources
Specifications
- OpenACC Specification (ver. 2.7 November 2018)
Training and Education
- Michael Wolfe's Introduction to Parallel Programming with OpenACC video series
- Your First OpenACC Program (time: 7.5 minutes)
- OpenACC YouTube channel
- OpenACC talks at recent GPU Technology Conferences
- Oak Ridge National Laboratory Leadership Computing Facility GPU Hackathons
Tutorial Presentations and Articles
- Understanding the CUDA Data Parallel Threading Model—A Primer
- Tesla vs. Xeon Phi vs. Radeon
- OpenACC Kernels and Parallel Constructs
- OpenACC Interoperability Tricks
- OpenACC and CUDA Unified Memory
- Using the OpenACC Routine Directive
- Using the OpenACC Routine Directive Part 2
- OpenACC on Multicore CPUs
- Profiling OpenACC Programs with the PGI Profiler
Applications and Programming Information
- OpenACC Programming and Best Practices Guide
- PGI Accelerator Compilers with OpenACC Getting Started Guide
- OpenACC 2.7 API Reference Card
Case Studies
- AWE Demonstrates OpenACC Performance Portability
- Massively Scaling Computational Electromagnetics Code Using OpenACC
- Quantum Chemist Leverages OpenACC to Accelerate Research in One Week’s Effort
- Numeca Taps OpenACC to Accelerate Commercial CFD Application without Rewriting Code
- OpenACC Enables Astrophysics Researchers to Gain Insight into Dark Energy
- Researchers at North Carolina State Use OpenACC to Run a Fully Implicit 3D CFD Solver on a GPU
- Research at University of Illinois Leads to an Advanced Model for MRI Reconstruction Using OpenACC
Other Resources
- OpenACC website
- NVIDIA OpenACC website
FAQ
Please also see the Accelerator Programming user forum for additional questions and answers.
- Which programming languages do the PGI compilers support?
- On which platforms and operating systems do PGI compilers run?
- Which accelerators can be targeted by PGI compilers?
- Do I need to install any 3rd party software?
- Does the compiler support IEEE standard floating-point arithmetic?
- Do PGI OpenACC compilers support double-precision?
- Can I call a CUDA kernel function from my PGI compiled code?
- Does the compiler support two or more accelerators in the same program?
- When do I need to convert from the legacy PGI Accelerator directives syntax to the standard OpenACC syntax?
- Can I run my program on a machine that doesn't have an accelerator on it?
- Do I have to rebuild my application for each different accelerator model?
- In what timeframe will PGI be including OpenMP 4.5 support?
- Can I use function or procedure calls in my GPU code?
- When will you support <my favorite feature> in your compiler?
- Which OpenACC features are supported in which release?
- How much does it cost?
- How can I try it?
- Where do I start?
Which programming languages do the PGI OpenACC compilers support?
PGI supports accelerators from within the PGFORTRAN™ Fortran 2003, PGCC® ANSI C11 and PGC++® GNU-compatible C++17 compilers.
On which platforms and operating systems do PGI OpenACC compilers run?
PGI OpenACC compilers run on 64-bit Linux on x86 and OpenPOWER, and on 64-bit Windows. PGI OpenACC compilers can also target multicore CPUs running 64-bit macOS.
Which accelerators can be targeted by PGI OpenACC compilers?
PGI compilers target all NVIDIA Tesla GPU accelerators with compute capability 2.0 or higher running on Linux or Windows.
In addition to the accelerators listed above, beginning with PGI version 15.10, 64-bit x86 multicore CPUs can also be targeted using Linux, Windows and macOS. See the OpenACC on Multicore CPUs PGInsider article for more information. Support for multicore OpenPOWER CPUs as an OpenACC target was added with PGI version 16.10.
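For example, recompiling the same OpenACC source for the host's cores is just a change of target option (the file name is illustrative):

```
pgfortran -acc -ta=multicore mm.f90
```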
Do I need to install any 3rd party software?
To use NVIDIA CUDA-enabled GPUs, you must first install the CUDA driver for your system. All other necessary 3rd party software is included in the PGI installation packages.
Does the compiler support IEEE standard floating-point arithmetic?
The accelerators available today support most of the IEEE floating-point standard. However, they do not support all the rounding modes, and some operations, notably square root, exponential, logarithm, and other transcendental functions, may not deliver full precision results. This is a hardware limitation that compilers cannot overcome.
Do PGI OpenACC compilers support double-precision?
Yes.
Can I call a CUDA kernel function from my PGI compiled code?
You can call CUDA device functions from PGI-compiled OpenACC compute regions in C, C++ or Fortran. The OpenACC code needs an appropriate acc routine(...) directive to tell the compiler that the given function is available for the device, and the compile line needs -ta=tesla (to override the default -ta=tesla,host), because there is no host version of that function. See the Using the OpenACC Routine Directive Part 2 PGInsider article for more details. To invoke a CUDA kernel from Fortran, you can use the CUDA Fortran extensions. Otherwise, you need a wrapper routine compiled by nvcc to actually launch the kernel, then call that wrapper from the PGI-compiled code. There is no syntax to directly launch a CUDA kernel from PGI-compiled C or C++ code.
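A minimal Fortran sketch of the first approach, assuming a CUDA device function dist2 (a hypothetical name) has been built separately with nvcc into dist2.o:

```fortran
! Interface for a CUDA device function compiled elsewhere with nvcc;
! the routine directive tells the compiler a device version exists.
interface
   function dist2(x, y) bind(c) result(d)
      use iso_c_binding, only: c_float
      !$acc routine seq
      real(c_float), value :: x, y
      real(c_float) :: d
   end function
end interface

!$acc parallel loop copyin(a(1:n)) copyout(r(1:n))
do i = 1, n
   r(i) = dist2(a(i), 0.5)
enddo
! Build with no host fallback, since dist2 has no host version:
!   pgfortran -acc -ta=tesla main.f90 dist2.o
```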
Does the compiler support two or more accelerators in the same program?
As with CUDA, you can use two or more GPUs by using multiple threads, where each thread attaches to a different GPU and runs its kernels on that GPU. The current release does not include support to automatically control two or more GPUs from the same accelerator region.
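A common pattern, sketched below under the assumption that the program is also built with OpenMP (the -mp flag), gives each OpenMP thread its own GPU via the OpenACC runtime API:

```fortran
program multi_gpu   ! illustrative sketch
   use openacc
   use omp_lib
   implicit none
   integer :: ngpus, dev

   ngpus = acc_get_num_devices(acc_device_nvidia)
   !$omp parallel num_threads(ngpus) private(dev)
      ! Device numbering assumed 0-based here, as PGI numbers
      ! NVIDIA devices the same way CUDA does.
      dev = omp_get_thread_num()
      call acc_set_device_num(dev, acc_device_nvidia)
      ! ... compute regions in this thread now run on GPU "dev" ...
   !$omp end parallel
end program
! Build: pgfortran -acc -ta=tesla -mp multi_gpu.f90
```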
When do I need to convert from the legacy PGI Accelerator directives syntax to the standard OpenACC syntax?
PGI deprecated the PGI Accelerator directive syntax in PGI 2018 and removed support for it entirely in PGI 2019.
Can I run my program on a machine that doesn't have an accelerator on it?
Yes. PGI OpenACC compilers can generate PGI Unified Binary™ technology executables that work in the presence or absence of an accelerator.
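For example, compiling with both a device target and a host fallback produces one executable that uses a GPU when present and otherwise runs the accelerator regions on the host:

```
pgfortran -acc -ta=tesla,host prog.f90
```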
Do I have to rebuild my application for each different accelerator model?
The generated accelerator code uses the same technology used for graphics applications and games: the program contains a portable intermediate format that is dynamically translated and re-optimized at run time by the vendor-supplied driver for the particular GPU model in your machine. This preserves your investment by allowing your programs to continue to work when you upgrade your accelerator, or when you run your program on a machine with a different model.
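You can also embed native code tuned for specific device generations up front; for example, the following asks for code for both compute capability 3.5 and 6.0 devices in the same binary:

```
pgfortran -acc -ta=tesla:cc35,cc60 prog.f90
```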
Can I use function or procedure calls in my GPU code?
PGI 2014 and newer include support for procedure calls (the OpenACC routine directive) on NVIDIA GPUs.
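A minimal sketch (names illustrative): placing the routine directive inside a procedure compiles a device-callable copy of it, which accelerated loops can then invoke.

```fortran
module elemops
contains
   subroutine saxpy_elem(y, a, x)
      !$acc routine seq
      real, intent(inout) :: y
      real, value :: a, x
      y = y + a * x
   end subroutine
end module

! In the caller:
!   use elemops
!   !$acc parallel loop copy(y(1:n)) copyin(x(1:n))
!   do i = 1, n
!      call saxpy_elem(y(i), a, x(i))
!   enddo
```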
In what timeframe will PGI be including OpenMP 4.5 support?
OpenMP 4.5 includes many new features: tasking extensions such as task dependences, task groups, task cancellation, task priorities and task loops; thread binding; SIMD constructs and SIMD function compilation; user-defined reductions; additional atomic constructs; doacross-style synchronization between workshared loop iterations; plus a whole host of target/device features. PGI added support for the tasking, binding, SIMD, synchronization, reduction, atomic and other CPU features for Linux/OpenPOWER CPUs in release 17.7. These features are supported in the 18.1 release of the PGI LLVM compilers for Linux x86-64.
When will you support <my favorite feature> in your compiler?
Some features cannot be supported due to limitations of the hardware. Other features are not being supported because they would not deliver satisfactory performance. Still other features are planned for future implementation. Your feedback can affect our priorities.
Which OpenACC features are supported in which release?
PGI introduced support for OpenACC directives with Release 2012 version 12.6, and support for C++ was added with Release 2013. Support for multicore x86 CPUs as an accelerator target was added in PGI Release 2015 version 15.9, and support for multicore OpenPOWER CPUs as an accelerator target in 16.10. PGI dropped support for targeting GPUs from macOS in 17.1.
Following is a list of OpenACC 1.0 features and the PGI version in which they were added.
Feature | Version | Feature | Version |
---|---|---|---|
!$acc kernels | 12.3 | !$acc declare | 12.3 |
clauses: | | clauses: | |
if() | 12.3 | copy()/copyin() | 12.3 |
async() | 12.3 | copyin()/copyout() | 12.3 |
copy() | 12.3 | create() | 12.3 |
copyin() | 12.3 | present() | 12.3 |
copyout() | 12.3 | present_or_copy() | 12.3 |
create() | 12.3 | present_or_copyin() | 12.3 |
present() | 12.3 | present_or_copyout() | 12.3 |
present_or_copy() | 12.3 | present_or_create() | 12.3 |
present_or_copyin() | 12.3 | device_resident() | 12.6 |
present_or_copyout() | 12.3 | deviceptr() | 12.6 |
present_or_create() | 12.3 | | |
deviceptr() | 12.3 | !$acc update | 12.3 |
 | | clauses: | |
!$acc parallel | 12.5 | if() | 12.3 |
clauses: | | async() | 12.3 |
if() | 12.5 | | |
async() | 12.5 | !$acc cache | 12.6 |
num_gangs() | 12.5 | | |
num_workers() | 12.6 | !$acc host_data | 14.1 |
vector_length() | 12.5 | | |
reduction() | 12.6 | !$acc wait | 12.3 |
copyin() | 12.5 | | |
copyout() | 12.5 | Runtime routines: | |
create() | 12.5 | openacc module | 12.3 |
present() | 12.6 | openacc.h C hdr file | 12.3 |
present_or_copy() | 12.6 | openacc_lib.h Ftn hdr file | 12.3 |
present_or_copyin() | 12.6 | | |
present_or_copyout() | 12.6 | acc_get_num_devices() | 12.3 |
present_or_create() | 12.6 | acc_set_device_type() | 12.3 |
deviceptr() | 12.6 | acc_get_device_type() | 12.3 |
private() | 12.6 | acc_set_device_num() | 12.3 |
firstprivate() | 14.4 | acc_get_device_num() | 12.3 |
 | | acc_async_test() | 12.3 |
!$acc data | 12.3 | acc_async_test_all() | 12.3 |
clauses: | | acc_async_wait() | 12.3 |
if() | 12.3 | acc_async_wait_all() | 12.3 |
async() | 12.3 | acc_init() | 12.3 |
copy() | 12.3 | acc_shutdown() | 12.3 |
copyin() | 12.3 | acc_on_device() | 12.3 |
create() | 12.3 | acc_malloc() for C | 12.3 |
present() | 12.3 | acc_free() for C | 12.3 |
present_or_copy() | 12.3 | | |
present_or_copyin() | 12.3 | Preprocessing: | |
present_or_copyout() | 12.3 | _OPENACC | 12.3 |
present_or_create() | 12.3 | | |
deviceptr() in C | 12.3 | Environment variables: | |
deviceptr() in Ftn | 14.1 | ACC_DEVICE_TYPE | 12.3 |
 | | ACC_DEVICE_NUM | 12.3 |
!$acc loop | 12.3 | | |
clauses: | | PGI Extensions: | |
collapse() | 12.6 | acc_copyin | 12.6 |
within kernels region: | | acc_copyout | 12.6 |
gang() | 12.5 | acc_create | 12.6 |
worker() | 12.5 | acc_delete | 12.6 |
vector() | 12.5 | acc_update_host | 12.6 |
seq() | 12.3 | acc_update_device | 12.6 |
private() | 12.3 | acc_updatein | 12.6 |
reduction() | 12.6 | acc_updateout | 12.6 |
within parallel region: | | acc_ispresent | 12.6 |
gang | 12.6 | acc_deviceptr | 12.6 |
worker | 12.6 | | |
vector | 12.6 | | |
Following is a list of OpenACC 2.0 features and the PGI version in which they were added.
Feature | Version | Feature | Version |
---|---|---|---|
Kernels clauses: | | !$acc routine | 14.1 |
wait() | 14.7 | gang | 14.1 |
default(none) | 15.1 | worker | 14.1 |
device_type() | 15.1 | vector | 14.1 |
 | | seq | 14.1 |
Parallel clauses: | | bind(name) | 14.7 |
wait() | 14.7 | bind(string) | 14.7 |
default(none) | 15.1 | device_type() | 15.1 |
device_type() | 15.1 | nohost | 14.7 |
Loop clauses: | | #pragma acc atomic | 14.4 |
tile() | 15.1 | !$acc atomic | 14.4 |
auto | 15.1 | | |
device_type() | 15.1 | Runtime routines: | |
 | | acc_wait() | 14.1 |
Update clauses: | | acc_wait_all() | 14.1 |
wait() | 14.7 | acc_async_wait_all() | 14.1 |
async() | 14.7 | acc_wait_async() | 14.4 |
 | | acc_copyin() | 14.1 |
Declare clauses: | | acc_present_or_copyin() | 14.1 |
link() | -- | acc_create() | 14.1 |
 | | acc_present_or_create() | 14.1 |
!$acc enter data | 14.1 | acc_copyout() | 14.1 |
if() | 14.1 | acc_delete() | 14.1 |
async() | 14.7 | acc_map_data() | 14.1 |
wait() | 14.7 | acc_unmap_data() | 14.1 |
copyin() | 14.1 | acc_deviceptr() | 14.1 |
create() | 14.1 | acc_hostptr() | 14.1 |
pcopy() | 14.1 | acc_is_present() | 14.1 |
pcreate() | 14.1 | acc_memcpy_to_device() | 14.1 |
 | | acc_memcpy_from_device() | 14.1 |
!$acc exit data | 14.1 | acc_update_device() | 14.1 |
if() | 14.1 | acc_update_self() | 14.1 |
async() | 14.7 | | |
wait() | 14.7 | | |
copyout() | 14.1 | | |
delete() | 14.1 | | |
Following is a list of OpenACC 2.5 features and the PGI version in which they were added.
Feature | Version |
---|---|
Change in the behavior of the copy, copyin, copyout and create data clauses. | 15.1 |
Change in the behavior of the acc_copyin, acc_create, acc_copyout and acc_delete API routines. | 15.1 |
New default(present) clause for compute constructs. | 15.7 |
Asynchronous versions of the data API routines. | 15.9 |
New acc_memcpy_device API routine. | 15.7 |
New OpenACC interface for profile and trace tools. | 16.1 |
Change in the behavior of the declare create directive with a Fortran allocatable. | 15.1 |
Reference counting added to device data. | 16.1 |
Change in exit data directive behavior. New optional finalize clause. | 16.7 |
New update directive clause, if_present. | 17.1 |
New init, shutdown, set directives. | 17.1 |
Change in the routine bind clause definition. | 17.1 |
New API routines to get and set the default async queue value. | 17.1 |
num_gangs, num_workers and vector_length clauses allowed on the kernels construct. | 16.7
Following is a list of OpenACC 2.6 features and the PGI version in which they were added.
Feature | Version |
---|---|
serial and serial loop constructs | 18.1 |
Fortran optional arguments support in data clauses | 18.1 |
attach/detach clauses | 18.1 |
implicit attach/detach behavior for other data clauses | 18.1 |
no_create clause in compute and data constructs | 18.1 |
if_present clause in host_data construct | 18.1 |
if clause in host_data construct | 18.3 |
New device API runtime routines: acc_get_property, acc_attach[_async], acc_detach[_finalize][_async] | 18.1
How much does it cost?
PGI OpenACC features are included in the no-cost PGI Community Edition. Those interested can purchase a permanent PGI Professional Edition license, which includes ongoing support, access to the latest updates and other benefits. See the PGI Product Feature Comparison for a summary of the feature differences.
How can I try it?
To try out the PGI compilers with OpenACC, download the free PGI Community Edition.
Where do I start?
We recommend you download the OpenACC Getting Started Guide from this website.