SPEC ACCEL 1.2 Performance

SPEC ACCEL Performance Comparison

Accelerate Your HPC Applications
with Tesla V100 GPUs

PGI OpenACC and CUDA Fortran now support CUDA 9.1 running on Tesla Volta GPUs. Tesla V100 offers more memory bandwidth, more streaming multiprocessors, next generation NVLink and new microarchitectural features that add up to better performance and programmability. For OpenACC and CUDA Fortran programmers, Tesla V100 offers improved hardware support and performance for CUDA Unified Memory features on both x86-64 and OpenPOWER processor-based systems. With PGI 2018, you get the best of both worlds — world-class CPU performance plus comprehensive GPU support.

SPEC CPU 2017 FP Speed

SPEC CPU 2017 FP Speed Results on Skylake, EPYC and Broadwell chart

Support for the Latest CPUs

Multicore CPU performance remains one of the key strengths of the PGI compilers, which now support the latest generation of HPC CPUs including Intel Skylake, IBM POWER9 and AMD Zen. PGI Fortran 2003, C11 and C++14 compilers deliver state-of-the-art SIMD vectorization and benefit from newly optimized single and double precision numerical intrinsic functions on Linux x86, Linux OpenPOWER, and macOS. See the benchmarks section for PGI 2018 performance results on a variety of HPC industry standard benchmarks.

Full OpenACC 2.6

All PGI compilers now support the latest OpenACC features on both Tesla GPUs and multicore CPUs. New OpenACC 2.6 features include manual deep copy directives, the serial compute construct, if_present clause in the host_data construct, no_create data clause, attach/detach clauses, acc_get_property API routines and improved support for Fortran optional arguments. Other OpenACC features added or enhanced include cache directive refinements and support for named constant arrays in Fortran modules.

Dramatically Lower Development Effort

Developer View of CUDA Unified Memory Diagram

OpenACC for CUDA Unified Memory

PGI compilers leverage Pascal and Volta GPU hardware features, NVLink and CUDA Unified Memory to simplify OpenACC programming on GPU-accelerated x86-64 and OpenPOWER processor-based servers. When OpenACC allocatable data is placed in CUDA Unified Memory, no explicit data movement or data directives are needed. This simplifies GPU acceleration of applications that make extensive use of allocatable data, and allows you to focus on parallelization and scalability of your algorithms. See the OpenACC and CUDA Unified Memory PGInsider post for details.
.LB1_444:
        vmovupd (%r11,%r9), %zmm17
        vmovupd 64(%r9,%r11), %zmm18
        subl    $16, %r10d
        vfmadd231pd     (%rbx,%r9), %zmm16, %zmm17
        vmovupd %zmm17, (%rbx,%r9)
        vfmadd231pd     64(%r9,%rbx), %zmm16, %zmm18
        vmovupd %zmm18, 64(%r9,%rbx)
        addq    $128, %r9
        testl   %r10d, %r10d
        jg      .LB1_444

AVX-512 Support

Intel AVX-512 CPU instructions available on the latest generation Skylake CPUs enable twice the number of floating point operations compared to the previous generation AVX2 SIMD instructions. At 512 bits wide, AVX-512 doubles both the register width and the total number of registers, and can help improve the performance of HPC applications.
C++17

New C++17 Features

Release 2018 of the PGI C++ compiler introduces partial support for the C++17 standard when compiling with --c++17 or -std=c++17. Supported C++17 core language features are available on all supported macOS versions and on Linux systems with GCC 5 or newer. New C++ language features include compile-time conditional statements (constexpr if), structured bindings, selection statements with initializers, fold expressions, inline variables, constexpr lambdas, and lambda capture of *this by value.
OpenMP

OpenMP 4.5 for Multicore CPUs

PGI 2018 brings the OpenMP 4.5 syntax and features previously available with the PGI compilers for Linux/OpenPOWER to the PGI Fortran, C and C++ compilers on Linux/x86-64. You can now use PGI to compile OpenMP 4.5 programs for parallel execution across all the cores of a multicore CPU or server. TARGET regions are implemented with default support for the multicore host as the target, and PARALLEL and DISTRIBUTE loops are parallelized across all OpenMP threads.

PGI Unified Binary Performance

PGI Unified Binary Performance Chart

PGI Unified Binary for Tesla and Multicore

Use OpenACC to build applications for both GPU acceleration and parallel execution across all the cores of a multicore server. When you run the application on a GPU-enabled system, the OpenACC regions will offload and execute on the GPU. When the same application executable is run on a system without GPUs installed, the OpenACC regions will be executed in parallel across all CPU cores in the system. If you develop commercial or production applications, now you can accelerate your code with OpenACC and deploy a single binary usable on any system, with or without GPUs.
template <typename Execution_Policy, typename BODY>
double bench_forall ( int s, int e, BODY body ) {
  StartTimer ( );
  if ( is_same<Execution_Policy, Serial> :: value ) {
    for ( int i = s; i < e; ++i )
      body ( i );
  } else if ( is_same<Execution_Policy, OpenACC> :: value ) {
    #pragma acc parallel loop
    for ( int i = s; i < e; ++i )
      body ( i );
  }
  return EndTimer ( );
}

using T = double;
void do_bench_saxpy ( int N, T *a, T *b, T x ) {
  auto saxpy = [=]( int i ) /* Capture-by-Value */
    { b[i] += a[i] * x; };

  double stime = bench_forall<Serial>(0, N, saxpy);
  double time  = bench_forall<OpenACC>(0, N, saxpy);
  printf ( "OpenACC Speedup %f \n", stime / time );
}

Use C++14 Lambdas with Capture in OpenACC Regions

C++ lambda expressions provide a convenient way to define anonymous function objects at the point where they are invoked or passed as arguments. The auto type specifier can be applied to lambda parameters to create a polymorphic lambda expression. With PGI compilers you can use lambdas in OpenACC compute regions in your C++ programs. Using lambdas with OpenACC is useful for a variety of reasons, for example to drive code generation customized to different programming models or platforms. C++14 opened the door to many more lambda use cases, especially polymorphic lambdas, and all of those capabilities can now be used in your OpenACC programs.

Using the Default PGI Code Generator:

 % pgfortran -fast -Minfo -c daxpy.f90
 daxpy:
     5, Generated an alternate version of the loop
        Generated vector simd code for the loop
        Generated 2 prefetch instructions for the loop
        Generated vector simd code for the loop
        Generated 2 prefetch instructions for the loop
        FMA (fused multiply-add) instruction(s) generated
 

Using the LLVM Code Generator:

 % pgfortran -fast -Minfo -c daxpy.f90 -Mllvm
 daxpy:
     5, Generated vector simd code for the loop
        FMA (fused multiply-add) instruction(s) generated

LLVM/x86-64 Code Generator

Release 2018 includes an LLVM code generator for x86-64 fully integrated with the PGI Fortran, C and C++ compilers, including support for OpenACC and CUDA Fortran. This initial release introduces support for OpenMP 4.5 features targeting multicore x86-64 CPUs and delivers performance improvements on many C++ applications. Included as part of the PGI Linux installation package, the LLVM components co-install with the default PGI compilers and are invoked with a simple command-line option.
PGI Profiler

Enhanced Profiling Features

The new CPU Detail view shows a breakdown of the time spent on the CPU for each thread. Three call-tree options let you profile based on caller, callee, or file and line number. View time for all threads together or individually, quickly sort events by minimum or maximum time, and more. Other new features include an option to adjust the program counter sampling frequency, and an enhanced NVLink topology display that shows the NVLink version.

See the What's New section in the release notes for complete details.
