Recompile and Run

PGI for OpenPOWER+Tesla

PGI compilers for OpenPOWER systems include the same language, multicore and accelerator programming features available in PGI for Linux/x86-64: Fortran, C, C++, CUDA Fortran, OpenACC and OpenMP. In most cases, existing PGI-compiled Linux/x86-64 codes will recompile and run with minimal effort on Linux/OpenPOWER systems. Use PGI to migrate GPU-accelerated science and engineering applications across systems in heterogeneous computing environments and from one generation of HPC systems to the next.

Scale Uniformly Across Architectures

Performance Portability

Porting the OpenACC Version of GTC from Xeon+Tesla to OpenPOWER+Tesla

Gyrokinetic Toroidal Code (GTC): Particle Turbulence Simulations for Sustainable Fusion Reactions in ITER. The goal of GTC is to deliver new plasma/fusion energy science at scale. GTC performance scales as expected across different system architectures with different GPUs. In all cases, the same source code base was compiled without modification using PGI.

Dramatically Lower Development Effort

Developer View of CUDA Unified Memory Diagram

OpenACC for CUDA Unified Memory

PGI compilers leverage Pascal and Volta GPU hardware features, NVLink and CUDA Unified Memory to simplify OpenACC programming on GPU-accelerated OpenPOWER processor-based servers. When OpenACC allocatable data is placed in CUDA Unified Memory, no explicit data movement or data directives are needed. This simplifies GPU acceleration of applications that make extensive use of allocatable data, and allows you to focus on parallelization and scalability of your algorithms. See the OpenACC and CUDA Unified Memory PGInsider post for details.
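
As a rough illustration of the point above, the sketch below runs an OpenACC loop over heap-allocated arrays with no data clauses or data directives at all. The function names `saxpy` and `saxpy_demo` are hypothetical, and the build line in the comment (`pgcc -acc -ta=tesla:managed`) is an assumption about how managed memory would be enabled on a given installation. Built without OpenACC support, the pragma is ignored and the loop runs serially, so the result is the same either way.

```c
#include <stdlib.h>

/* y[i] = a * x[i] + y[i] over dynamically allocated arrays.
 * No copy/copyin/copyout clauses appear: the assumption is a build such as
 *   pgcc -acc -ta=tesla:managed saxpy.c
 * so that malloc'd data lives in CUDA Unified Memory and pages migrate
 * on demand. Without OpenACC enabled, the pragma is simply ignored. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Hypothetical helper: allocates the arrays, runs saxpy, and returns the
 * last element of y, or -1.0f if allocation fails. */
float saxpy_demo(int n)
{
    float *x = malloc((size_t)n * sizeof *x);
    float *y = malloc((size_t)n * sizeof *y);
    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1.0f;
    }
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }
    saxpy(n, 3.0f, x, y);   /* each y[i] becomes 3*1 + 2 */
    float last = y[n - 1];
    free(x);
    free(y);
    return last;
}
```

Calling `saxpy_demo(1000)` from a small driver exercises the loop. Without managed memory, the same loop would typically need explicit data clauses giving the array extents.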

OpenMP 4.5 for Multicore CPUs

Support for OpenMP 4.5 syntax and features in the PGI Fortran, C and C++ compilers on Linux/OpenPOWER allows you to compile most OpenMP 4.5 programs for parallel execution across all the cores of a multicore CPU or server. TARGET regions are implemented with default support for the multicore host as the target, and PARALLEL and DISTRIBUTE loops are parallelized across all OpenMP threads.
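
A minimal sketch of what such a program might look like: a dot product written with an OpenMP 4.5 combined TARGET construct. The names `dot` and `dot_demo` are hypothetical, and the `-mp` flag mentioned in the comment is an assumption about the build line. With the multicore host as the target, the loop iterations are shared among the host's OpenMP threads. If OpenMP is not enabled, the pragma is ignored and the loop runs serially.

```c
#include <stdlib.h>

/* Dot product using an OpenMP 4.5 combined target construct.
 * Assumed build: pgcc -mp dot.c
 * The map clauses describe data movement for an attached device; when
 * the target is the multicore host they cost essentially nothing. */
double dot(int n, const double *a, const double *b)
{
    double sum = 0.0;
    #pragma omp target teams distribute parallel for \
            reduction(+:sum) map(to: a[0:n], b[0:n]) map(tofrom: sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Hypothetical helper: a[i] = i + 1, b[i] = 2, so the result is
 * 2 * (1 + 2 + ... + n). Returns -1.0 if allocation fails. */
double dot_demo(int n)
{
    double *a = malloc((size_t)n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }
    for (int i = 0; i < n; ++i) {
        a[i] = (double)(i + 1);
        b[i] = 2.0;
    }
    double result = dot(n, a, b);
    free(a);
    free(b);
    return result;
}
```

The explicit `map(tofrom: sum)` is there because a scalar referenced in a TARGET region is firstprivate by default in OpenMP 4.5, so the reduced value would otherwise not be copied back.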

PGI Unified Binary Performance

PGI Unified Binary Performance Chart

PGI Unified Binary for Tesla and Multicore

Use OpenACC to build applications for both GPU acceleration and parallel execution across all the cores of a multicore server. When you run the application on a GPU-enabled system, the OpenACC regions will offload and execute on the GPU. When the same application executable is run on a system without GPUs installed, the OpenACC regions will be executed in parallel across all CPU cores in the system. If you develop commercial or production applications, now you can accelerate your code with OpenACC and deploy a single binary usable on any system, with or without GPUs.
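
The sketch below shows the kind of source that could be built this way: a single OpenACC reduction loop with an explicit `copyin` clause, so it does not depend on unified memory. The names `sum_squares` and `sum_squares_demo` are hypothetical, and the build line in the comment (`pgcc -acc -ta=tesla,multicore`) is an assumption about how both targets would be requested. The source itself contains nothing target-specific.

```c
#include <stdlib.h>

/* Sum of squares with an OpenACC reduction.
 * Assumed build for a single executable covering both targets:
 *   pgcc -acc -ta=tesla,multicore sumsq.c
 * On a GPU system the region would offload; on a CPU-only system the
 * same loop would run across the host cores. Without OpenACC enabled,
 * the pragma is ignored and the loop runs serially. */
double sum_squares(int n, const double *v)
{
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total) copyin(v[0:n])
    for (int i = 0; i < n; ++i)
        total += v[i] * v[i];
    return total;
}

/* Hypothetical helper: fills v[i] = i and returns the sum of squares
 * 0^2 + 1^2 + ... + (n-1)^2, or -1.0 if allocation fails. */
double sum_squares_demo(int n)
{
    double *v = malloc((size_t)n * sizeof *v);
    if (v == NULL)
        return -1.0;
    for (int i = 0; i < n; ++i)
        v[i] = (double)i;
    double result = sum_squares(n, v);
    free(v);
    return result;
}
```

Because the choice of GPU or multicore execution is made when the program runs, the same call to `sum_squares_demo` should give the same answer on either kind of system.
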
OpenACC Construct View in the PGI Profiler

New Profiling Features for OpenACC and CUDA Unified Memory

The PGI Profiler includes a number of OpenACC profiling features, among them support for multicore CPUs with or without attached GPUs and a summary view that shows the time spent in each OpenACC construct. CUDA Unified Memory features include correlation of CPU page faults with the source code lines where the associated data was allocated, support for CUDA Unified Memory page thrashing, throttling and remote map events, NVLink support and more. Read about the CUDA Unified Memory profiling features in CUDA 9 Features Revealed.

Online Documentation

In addition to the traditional PDF format, all PGI Compilers & Tools documentation is also available in online HTML format, making it easier to search and reference from any Internet-connected system or device.