Tesla V100 GPU Support

PGI OpenACC and CUDA Fortran now support Tesla V100. Based on the new NVIDIA Volta GV100 GPU, Tesla V100 offers more memory bandwidth, more streaming multiprocessors, next generation NVLink and new microarchitectural features that add up to better performance and programmability. For OpenACC and CUDA Fortran programmers, Tesla V100 offers improved hardware support and performance for CUDA Unified Memory features on both x86-64 and OpenPOWER processor-based systems.

Dramatically Lower Development Effort

Developer View of CUDA Unified Memory Diagram

OpenACC for CUDA Unified Memory

PGI compilers now leverage Pascal and Volta GPU hardware features, NVLink and CUDA Unified Memory to simplify OpenACC programming on GPU-accelerated x86-64 and OpenPOWER processor-based servers. When OpenACC allocatable data is placed in CUDA Unified Memory, no explicit data movement or data directives are needed. This simplifies GPU acceleration of applications that make extensive use of allocatable data, and allows you to focus on parallelization and scalability of your algorithms. See the OpenACC and CUDA Unified Memory PGInsider post for details.
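The idea can be sketched in a few lines of OpenACC C++. This is an illustrative example, not taken from the post: the `daxpy2` helper name is hypothetical, and it assumes compilation with PGI's managed-memory option (e.g. `pgc++ -acc -ta=tesla:managed`), which places dynamically allocated data in CUDA Unified Memory so no data clauses are required.

```cpp
#include <vector>

// b[i] += 2*a[i]. With CUDA Unified Memory (pgc++ -acc -ta=tesla:managed),
// the heap storage behind these pointers migrates between host and device
// automatically -- note the absence of any copy/copyin/copyout clauses.
void daxpy2(int n, const double *a, double *b) {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i)
        b[i] += 2.0 * a[i];
}
```

With a non-OpenACC compiler the pragma is simply ignored and the loop runs sequentially on the host, so the same source works everywhere.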
Deep Copy Example

Automatic Deep Copy of Fortran Derived Types

Automatic deep copy of Fortran derived types allows you to port applications with modern deeply nested data structures to Tesla GPUs using OpenACC. The PGI 17.7 compilers allow you to list aggregate Fortran data objects in OpenACC COPY, COPYIN, COPYOUT and UPDATE directives to move them between host and device memory, including traversal and management of pointer-based objects within the aggregate data object.
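A minimal Fortran sketch of what this enables (the `mesh` type and its pointer member are illustrative assumptions, not code from the release):

```fortran
module mesh_mod
  type mesh
     real, pointer :: vals(:)   ! pointer-based member inside the aggregate
  end type
end module

subroutine scale_mesh(m, n)
  use mesh_mod
  type(mesh) :: m
  integer :: i, n
  ! With PGI 17.7 deep copy, listing the aggregate object m in a copy
  ! clause also traverses and moves the pointer-based member m%vals.
  !$acc parallel loop copy(m)
  do i = 1, n
     m%vals(i) = 2.0 * m%vals(i)
  end do
end subroutine
```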

LCALS Benchmark Performance

LCALS Performance Comparison

C++ Enhancements

The updated PGI C++ compiler includes incremental C++17 features and is supported as a CUDA 9.0 NVCC host compiler on both Linux/x86-64 and Linux/OpenPOWER platforms. It delivers an average 20% performance improvement on the LCALS loops benchmarks with no abstraction penalty, supports lambdas with capture in OpenACC GPU-accelerated compute regions, and is interoperable with GNU 6.3.
template <typename Execution_Policy, typename BODY>
double bench_forall(int s, int e, BODY body) {
  StartTimer();
  if (is_same<Execution_Policy, Serial>::value) {
    for (int i = s; i < e; ++i)
      body(i);
  } else if (is_same<Execution_Policy, OpenACC>::value) {
    #pragma acc parallel loop
    for (int i = s; i < e; ++i)
      body(i);
  }
  return EndTimer();
}

using T = double;
void do_bench_saxpy(int N, T *a, T *b, T x) {
  auto saxpy = [=](int i)    /* capture-by-value */
    { b[i] += a[i] * x; };

  double stime = bench_forall<Serial>(0, N, saxpy);
  double time  = bench_forall<OpenACC>(0, N, saxpy);
  printf("OpenACC Speedup %f\n", stime / time);
}

Use C++14 Lambdas with Capture in OpenACC Regions

C++ lambda expressions provide a convenient way to define anonymous function objects at the location where they are invoked or passed as arguments. The auto type specifier can be applied to lambda parameters to create a polymorphic lambda expression. Starting with the PGI 17.7 release, you can use lambdas in OpenACC compute regions in your C++ programs. Lambdas with OpenACC are useful for a variety of reasons; one example is to drive code generation customized to different programming models or platforms. C++14 broadens the range of lambda use cases, especially polymorphic lambdas, and all of those capabilities are now usable in your OpenACC programs.

cusolverDN Speedup Over CPU

cuSOLVER Performance Comparison

PGI Compilers Are Now Interoperable With the cuSOLVER Library

You can now call optimized cuSolverDN routines from CUDA Fortran and OpenACC Fortran using the PGI-supplied interface module and the PGI-compiled version of the cuSOLVER library bundled with PGI 17.7. The same cuSOLVER library is callable from PGI OpenACC C/C++ because it is built with the PGI compilers and is compatible with and uses the PGI OpenMP runtime. Read more about Using the cuSOLVER Library from CUDA Fortran.

PGI Unified Binary Performance

PGI Unified Binary Performance Chart

PGI Unified Binary for Tesla and Multicore

Use OpenACC to build applications for both GPU acceleration and parallel execution across all the cores of a multicore server. When you run the application on a GPU-enabled system, the OpenACC regions will offload and execute on the GPU. When the same application executable is run on a system without GPUs installed, the OpenACC regions will be executed in parallel across all CPU cores in the system. If you develop commercial or production applications, now you can accelerate your code with OpenACC and deploy a single binary usable on any system, with or without GPUs.
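As an illustrative sketch of the workflow (the source and output file names are placeholders; the `-ta=tesla,multicore` target option is the unified-binary feature this section describes):

```shell
# Build one executable targeting both Tesla GPUs and multicore CPUs.
# app.cpp stands in for your OpenACC source file.
pgc++ -fast -acc -ta=tesla,multicore -Minfo=accel -o app app.cpp

# The same binary offloads OpenACC regions to the GPU when one is
# present, and otherwise runs them in parallel across all CPU cores.
./app
```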
OpenACC Construct View in the PGI Profiler

New Profiling Features for OpenACC and CUDA Unified Memory

The PGI Profiler adds new OpenACC profiling features including support on multicore CPUs with or without attached GPUs, and a new summary view that shows time spent in each OpenACC construct. New CUDA Unified Memory features include correlating CPU page faults with the source code lines where the associated data was allocated, support for new CUDA Unified Memory page thrashing, throttling and remote map events, NVLINK support and more. Read about the new CUDA Unified Memory profiling features in CUDA 9 Features Revealed.
PGI Documentation Cover Graphic

Online Documentation

In addition to the traditional PDF format, all PGI Compilers & Tools documentation is now also available at pgicompilers.com in online HTML format, enabling easier searching and referencing from any Internet-connected system or device.