Accelerate Your HPC Applications with Tesla V100 GPUs

PGI OpenACC and CUDA Fortran now support CUDA 10.2 on Tesla Volta GPUs. Tesla V100 memory bandwidth, streaming multiprocessors, next generation NVLink and new microarchitectural features deliver both performance and programmability. For OpenACC and CUDA Fortran programmers, Tesla V100 offers improved hardware support and performance for CUDA Unified Memory on both x86-64 and OpenPOWER processor-based systems. PGI compilers give you the best of both worlds — world-class CPU performance plus comprehensive GPU support.

V100 Tensor Core Support

NVIDIA Tesla V100 Tensor Cores enable fast FP16 matrix multiply and accumulation into FP16 or FP32 results with performance 8x to 16x faster than pure FP32 or FP64 in the same power envelope. Tensor Cores enable scientists and engineers to dramatically accelerate suitable math library routines and applications using mixed-precision. With the PGI Fortran compiler you can now leverage Tensor Cores in your CUDA Fortran scientific applications and through the cuTENSOR library.

cuTENSOR Library Support

cuTENSOR is a high performance CUDA library for tensor primitives with extensive mixed-precision support for tensor contractions, tensor reductions and element-wise tensor operations. With PGI 20.1 you can use cuTENSOR from your Fortran programs, and the PGI Fortran compiler can automatically map selected element-wise array operations and Fortran transformational intrinsics to cuTENSOR including matmul, transpose, spread, and reshape. This enables speed-ups of up to 4x over inline code, and the ability to use V100 Tensor Cores from CUDA Fortran and OpenACC programs through array operations on real(2) data objects.

AMD Rome CPU Support

PGI 20.1 compilers are now fully supported on AMD Rome CPUs, including support for all PGI AVX-256 SIMD vectorization and optimizations applied to other x86-64 processors. The PGI compilers auto-detect when the compilation host is an AMD Rome processor and automatically generate CPU executables optimized for Rome. All supported programming models including OpenACC and OpenMP for multicore CPUs, and OpenACC and CUDA Fortran for GPUs, are supported on AMD Rome CPUs in combination with NVIDIA V100 GPUs.

Fortran 2008 Features

The PGI 20.1 compilers include support for many Fortran 2008 features, including recently added support for submodules, the block construct, the do concurrent construct (executed serially), the g0 edit descriptor, the norm2() intrinsic procedure, polymorphic assignment, procedure pointer initialization and execute_command_line(). These features are all supported in the PGI Fortran compilers for Linux/x86-64 and Linux/OpenPOWER, and all except execute_command_line() are supported on Windows/x86-64 platforms.

Full C++17 Language Support

The PGI C++ compiler includes full support for C++17 language features when compiling with --c++17 or -std=c++17. Supported C++17 core language features are available on all supported Linux versions. New C++ language features include compile-time conditional statements (constexpr if), structured bindings, selection statements with initializers, fold expressions, inline variables, constexpr lambdas, and lambda capture of *this by value.
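
As a brief illustration of several of the features listed above, the following sketch uses a fold expression, constexpr if, an inline variable, and a structured binding inside an if statement with an initializer. The names sum_all, describe, kMaxCores and add_node are hypothetical and chosen only for this example; the code is standard C++17 and is not specific to PGI.

```cpp
#include <map>
#include <string>
#include <type_traits>

// Fold expression: sums any number of arguments.
template <typename... Ts>
constexpr auto sum_all(Ts... values) {
  return (values + ... + 0);
}

// constexpr if: only the branch matching T is instantiated.
template <typename T>
std::string describe(T value) {
  if constexpr (std::is_integral_v<T>) {
    return "integer " + std::to_string(value);
  } else {
    return "non-integer " + std::to_string(value);
  }
}

// Inline variable: a single definition shared by every translation unit.
inline constexpr int kMaxCores = 64;

// Structured binding in an if statement with an initializer.
// Returns true when the name was new, false when it was already present
// (in which case the core count is accumulated instead).
inline bool add_node(std::map<std::string, int>& nodes,
                     const std::string& name, int cores) {
  if (auto [it, inserted] = nodes.try_emplace(name, cores); !inserted) {
    it->second += cores;
    return false;
  }
  return true;
}
```

For example, sum_all(1, 2, 3) evaluates to 6 at compile time, and calling add_node twice with the same name accumulates the core counts under that name.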
#pragma acc kernels copy(x[0:256])
    for (int i = 0; i < 256; i++) {
      /* Clamp x[i] to [small, large] and report each clamped element. */
      if (x[i] < small) {
        printf("x at loc %03d is small: %d\n",i,x[i]);
        x[i] = small;
      } else if (x[i] > large) {
        printf("x at loc %03d is large: %d\n",i,x[i]);
        x[i] = large;
      }
    }

OpenACC printf() Support

The PGI C and C++ compilers include support for formatted output using printf() statements in OpenACC compute regions. Most common format specifiers for flag characters, width, precision, size and type are supported. Using printf() in OpenACC regions is useful for basic debugging and programmer-driven tracing during development and tuning of OpenACC applications on both multicore CPUs and GPUs.

PGI in the Cloud

PGI Community Edition compilers for Linux/x86-64 are now available as a container image on the NVIDIA GPU Cloud (NGC) and as an Amazon Machine Image (AMI) on the AWS Marketplace. These images include OpenACC-enabled Fortran, C and C++ compilers for the latest multicore CPUs and NVIDIA GPUs including Volta V100. NGC users can pull the PGI container to develop HPC applications on Alibaba Cloud, AWS, Google Cloud Platform, the Oracle Cloud Infrastructure or on local workstations and HPC systems. AWS users can run the PGI AMI on a variety of AWS-supported platforms. PGI in the Cloud is ideal for users who want to build, test, benchmark and run their own applications in the cloud using the latest NVIDIA GPUs, and for development and deployment of cloud-based parallel programming education and training.
$ pgcc -ta=tesla:autocompare -o a.o example.c

$ PGI_COMPARE=summary,abs=2,rel=2 ./a.o

Floats at: 22562 FAILED relative test: 0.000672976 and 0.000650764, rel diff=0.0341326, tolerance REL=0.01
Floats at: 23550 FAILED relative test: 0.000672976 and 0.000650764, rel diff=0.0341326, tolerance REL=0.01
Floats at: 11599 FAILED relative test: -0.000798482 and -0.000812978, rel diff=0.0178308, tolerance REL=0.01
Floats at: 16794 FAILED relative test: -0.000403227 and -0.00038901, rel diff=0.0365471, tolerance REL=0.01
Floats at: 29318 FAILED relative test: 0.000403227 and 0.00038901, rel diff=0.0365471, tolerance REL=0.01
compared 3 blocks, 68608 elements, 274432 bytes
5 errors found in 2 blocks
63772 errors tolerated in 2 blocks
 relative tolerance = 0.010000, rel=2
 absolute tolerance = 0.010000, abs=2

PCAST Directives

PGI Compiler Assisted Software Testing (PCAST) is useful for detecting where and why results diverge between CPU and GPU-accelerated versions of code, between successive versions of a program you are optimizing incrementally, or between the same program executing on two different processor architectures. The PCAST API enables you to capture and save selected data during execution of a program, then read it in for comparison to data generated in a separate execution of the program. PGI compilers include a directive-based interface for the PCAST API, enabling you to instrument programs without sacrificing portability to other compilers and platforms. Learn more about PCAST on the PGI Compiler Assisted Software Testing overview page.

PGI Auto-compare for OpenACC

Results can diverge between programs running on a CPU versus a GPU due to programming errors, precision of numerical intrinsics, or variations in compiler optimizations. OpenACC auto-compare runs compute regions redundantly on both the CPU and GPU. When data is copied from the GPU back to the CPU, GPU results are compared with those computed on the CPU. Auto-compare works on both structured and unstructured data regions, with difference reports controlled by environment variables so you can quickly pinpoint where results start to diverge and adapt your program or compiler options as needed.
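
The kind of code auto-compare targets can be sketched as follows. This is a hypothetical example, not the example.c used for the report shown on this page: the function name fill_sines and the loop body are assumptions made for illustration. The data leaves the device at the end of the data region, which is the point where the paragraph above says GPU and CPU results are compared.

```cpp
#include <cmath>
#include <vector>

// Fills 'out' with sine values inside an OpenACC data region.
// Built without OpenACC, the pragmas are ignored and the loop runs on the host.
inline void fill_sines(std::vector<float>& out) {
  float* o = out.data();
  const int n = static_cast<int>(out.size());

  #pragma acc data copyout(o[0:n])
  {
    // Numerical intrinsics such as sin are a typical source of small
    // CPU/GPU differences.
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) {
      o[i] = std::sin(0.001f * static_cast<float>(i));
    }
  }  // o[0:n] is copied back to the host here
}
```

A C++ source like this would be built with pgc++ rather than pgcc; the -ta=tesla:autocompare option and the PGI_COMPARE environment variable are used as in the commands shown on this page.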
typedef struct points {
  float* x;  float* y;  float* z;
  int n;
  float coef, direction;   
  #pragma acc policy<dpmove> copy(x[0:n]) copyin(y[0:n])
} points;

void sub (int n, float* y) {
  points p;

    p.n = n;
    p.x = (float*) malloc (sizeof (float )*n );
    p.y = (float*) malloc (sizeof (float )*n );
    p.z = (float*) malloc (sizeof (float )*n );
    #pragma acc data copy(p<dpmove>)
      #pragma acc parallel loop
      for ( int i =0; i<p.n; ++i ) p.x[i] += p.y[i];
      . . .

OpenACC Deep Copy Directives

Modern HPC applications make extensive use of deeply nested aggregate data structures: Fortran derived types, C++ classes and C structs. PGI compilers include an implementation of the draft OpenACC 3.0 true deep copy directives in Fortran, C and C++, which allow you to specify a subset of members to move between host and device memory within the declaration of an aggregate. Named policies allow distinct sets of members to be copied at different points in a program. Once the deep copy pattern is defined, a single data clause (copy(a)) can be used to copy the selected members of the aggregate, including dynamically allocated members, some of which can themselves be aggregate structures with dynamically allocated members.

Full OpenACC 2.6

PGI compilers support OpenACC 2.6 features on both Tesla GPUs and multicore CPUs, including manual deep copy directives, the serial compute construct, if_present clause in the host_data construct, no_create data clause, attach/detach clauses, acc_get_property API routines and improved support for Fortran optional arguments. Other OpenACC features recently added or enhanced include cache directive refinements and support for named constant arrays in Fortran modules.
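
As one small illustration, the serial compute construct can be sketched as below. The function name scale_and_first is hypothetical, and the example is a minimal sketch rather than a recommended pattern: a parallel loop scales an array on the device, then a serial region (one gang, one worker, vector length one) reads a single element without leaving the data region.

```cpp
#include <vector>

// Scales v by 'factor' on the device and returns the first scaled element.
// Built without OpenACC, the pragmas are ignored and the code runs on the host.
inline double scale_and_first(std::vector<double>& v, double factor) {
  double* data = v.data();
  const int n = static_cast<int>(v.size());
  double first = 0.0;

  #pragma acc data copy(data[0:n])
  {
    #pragma acc parallel loop
    for (int i = 0; i < n; ++i) data[i] *= factor;

    // serial construct: runs the block once on the device, on data that is
    // already present from the enclosing data region.
    #pragma acc serial present(data[0:n]) copyout(first)
    {
      first = (n > 0) ? data[0] : 0.0;
    }
  }
  return first;
}
```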

Dramatically Lower Development Effort

OpenACC for CUDA Unified Memory

PGI compilers leverage Pascal and Volta GPU hardware features, NVLink, and CUDA Unified Memory to simplify OpenACC programming on GPU-accelerated x86-64 and OpenPOWER processor-based servers. When OpenACC allocatable data is placed in CUDA Unified Memory, no explicit data movement or data directives are needed. This simplifies GPU acceleration of applications that make extensive use of allocatable data, and allows you to focus on parallelization and scalability of your algorithms. See the OpenACC and CUDA Unified Memory PGInsider post for details.
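
A minimal sketch of what this looks like in source code follows. It assumes the program is built with the PGI managed-memory option (for example -ta=tesla:managed) so that dynamically allocated data is placed in CUDA Unified Memory; the function name sum_of_squares is hypothetical. Note that the compute region carries no data clauses for the array.

```cpp
#include <cstdlib>

// Sums i*i for i in [0, n) over a dynamically allocated array.
// Assumption: with managed memory enabled, the malloc'd array lives in CUDA
// Unified Memory, so no copy/copyin/copyout clauses are written for it.
inline double sum_of_squares(int n) {
  double* a = static_cast<double*>(std::malloc(sizeof(double) * n));
  if (a == nullptr) return -1.0;  // allocation failed
  for (int i = 0; i < n; ++i) a[i] = static_cast<double>(i);

  double sum = 0.0;
  #pragma acc parallel loop reduction(+:sum)
  for (int i = 0; i < n; ++i) sum += a[i] * a[i];

  std::free(a);
  return sum;
}
```

Without managed memory, the same loop would need a data clause such as copyin(a[0:n]) to make the array available on the device.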

OpenMP 4.5 for Multicore CPUs

PGI Fortran, C and C++ compilers on Linux/x86-64 and Linux/OpenPOWER include support for OpenMP 4.5 syntax and features. You can use PGI to compile OpenMP 4.5 programs for parallel execution across all the cores of a multicore CPU or server. TARGET regions are implemented with default support for the multicore host as the target, and PARALLEL and DISTRIBUTE loops are parallelized across all OpenMP threads.
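
The following sketch shows the shape of an OpenMP 4.5 TARGET region of the kind described above. The function name saxpy_target is hypothetical. With the multicore host as the default target, the loop is expected to be spread across the host's OpenMP threads; without OpenMP enabled the pragma is ignored and the loop runs serially.

```cpp
#include <vector>

// y[i] += a * x[i] for every element of y; x must be at least as long as y.
inline void saxpy_target(float a, const std::vector<float>& x,
                         std::vector<float>& y) {
  const float* xp = x.data();
  float* yp = y.data();
  const int n = static_cast<int>(y.size());

  // The map clauses describe data movement for an offload target; when the
  // target is the host itself, the arrays are simply used in place.
  #pragma omp target teams distribute parallel for \
      map(to: xp[0:n]) map(tofrom: yp[0:n])
  for (int i = 0; i < n; ++i) yp[i] += a * xp[i];
}
```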
        vmovupd (%r11,%r9), %zmm17
        vmovupd 64(%r9,%r11), %zmm18
        subl    $16, %r10d
        vfmadd231pd     (%rbx,%r9), %zmm16, %zmm17
        vmovupd %zmm17, (%rbx,%r9)
        vfmadd231pd     64(%r9,%rbx), %zmm16, %zmm18
        vmovupd %zmm18, 64(%r9,%rbx)
        addq    $128, %r9
        testl   %r10d, %r10d
        jg      .LB1_444

AVX-512 Support

Intel AVX-512 CPU instructions available on the latest generation Skylake CPUs enable twice the number of floating point operations per instruction compared to the previous generation AVX2 SIMD instructions. AVX-512 doubles both the register width (from 256 to 512 bits) and the total number of registers (from 16 to 32), and can help improve the performance of HPC applications.
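
To make the register-width arithmetic concrete, the sketch below computes how many double-precision values fit in one register at each width and shows a simple multiply-add loop of the kind a vectorizing compiler may turn into 512-bit fused multiply-add instructions. The names kAvx2Lanes, kAvx512Lanes and scale_add are hypothetical, and whether a given loop is actually vectorized depends on the compiler and target options.

```cpp
#include <cstddef>

// Doubles per SIMD register: 256 / 64 = 4 for AVX2, 512 / 64 = 8 for AVX-512.
constexpr std::size_t kAvx2Lanes = 256 / (8 * sizeof(double));
constexpr std::size_t kAvx512Lanes = 512 / (8 * sizeof(double));

// y[i] = a * y[i] + x[i]: one multiply-add per element, with no dependence
// between iterations, so consecutive elements can be processed together.
inline void scale_add(std::size_t n, double a, const double* x, double* y) {
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = a * y[i] + x[i];
  }
}
```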
// Serial and OpenACC are tag types; StartTimer() and EndTimer() are timing
// helpers. All four are assumed to be defined elsewhere.
template <typename Execution_Policy, typename BODY>
double bench_forall ( int s, int e, BODY body ) {
  StartTimer ();
  if ( is_same<Execution_Policy, Serial> :: value ) {
    for ( int i = s; i < e; ++i )
      body ( i );
  } else if ( is_same<Execution_Policy, OpenACC> :: value ) {
    #pragma acc parallel loop
    for ( int i = s; i < e; ++i )
      body ( i );
  }
  return EndTimer ( );
}

using T = double;
void do_bench_saxpy ( int N, T*a, T*b, T x) {
  auto saxpy = [=]( int i ) /* Capture-by-Value */
    { b[i] += a[i] * x; };

  double stime = bench_forall<Serial>(0, N, saxpy);
  double time = bench_forall<OpenACC>(0, N, saxpy);
  printf ( "OpenACC Speedup %f \n", stime / time );
}

Use C++14 Lambdas with Capture in OpenACC Regions

C++ lambda expressions provide a convenient way to define anonymous function objects at the location where they are invoked or passed as arguments. The auto type specifier can be applied to lambda parameters to create a polymorphic lambda-expression. With PGI compilers you can use lambdas in OpenACC compute regions. Using lambdas with OpenACC is useful for a variety of reasons. One example is to drive code generation customized to different programming models or platforms. C++14 opens up more lambda use cases, especially for polymorphic lambdas, and all of those capabilities are now usable in your OpenACC programs.
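
A polymorphic lambda used from an OpenACC loop can be sketched as follows. The names forall_acc, axpy and fma_like are hypothetical. The sketch assumes the arrays are accessible on the device, for example through CUDA Unified Memory or an enclosing data region; built without OpenACC, the pragma is ignored and the loop runs on the host.

```cpp
// Calls body(i) for each i in [begin, end) inside an OpenACC parallel loop.
template <typename Body>
void forall_acc(int begin, int end, Body body) {
  #pragma acc parallel loop
  for (int i = begin; i < end; ++i) body(i);
}

// y[i] = a * x[i] + y[i] for any arithmetic element type T.
template <typename T>
void axpy(int n, T a, const T* x, T* y) {
  // C++14 polymorphic lambda: 'auto' parameters let one lambda serve
  // float, double or any other type that supports * and +.
  auto fma_like = [](auto s, auto u, auto v) { return s * u + v; };

  // The pointers, the scalar and fma_like are all captured by value.
  forall_acc(0, n, [=](int i) { y[i] = fma_like(a, x[i], y[i]); });
}
```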

CPU and GPU Profiling Features

The PGI profiler supports performance analysis of OpenACC programs on CPUs and GPUs. CPU Detail View shows a breakdown of the time spent on the CPU for each thread. Three call tree options allow you to profile based on caller, callee or by file and line number. View time for all threads together or individually, quickly sort events by min or max time, and more. Other new features include an option to adjust program counter sampling frequency, and an enhanced display showing the version of the NVLink topology.

See the What's New section in the PGI Release Notes for complete details.
