Selected HPC-related content from the 2018 GPU Technology Conference, held in San Jose, California, March 26–29, 2018.

Talks

  • S8709 - Accelerating Molecular Modeling Tasks on Desktop and Pre-Exascale Supercomputers

This talk will showcase recent successes in the use of GPUs to accelerate challenging molecular simulation analysis tasks on the latest Volta-based Tesla V100 GPUs on both Intel and IBM/OpenPOWER hardware platforms, and with large scale runs on petascale computers such as ORNL Summit. This presentation will highlight the performance benefits obtained from die-stacked memory on Tesla V100, the NVLink interconnect on the IBM OpenPOWER platforms, and the use of advanced features of CUDA, Volta's new Tensor units, and just-in-time (JIT) compilation to increase the performance of key analysis algorithms. We will present results obtained with OpenACC parallel programming directives, current challenges, and future opportunities. Finally, we will describe GPU-accelerated machine learning algorithms for tasks such as clustering of structures resulting from molecular dynamics simulations.

View the recording.
No slides available.

  • S8291 - Acceleration of a Computational Fluid Dynamics Code with GPU Using OpenACC

View the recording.
View the slides.

  • S8848 - Adapting Minisweep, a Proxy Application, on Heterogeneous Systems Using OpenACC Directives

Learn how OpenACC, the widely used, high-level, directive-based programming model, can help port radiation transport codes to large-scale heterogeneous systems built around state-of-the-art accelerators such as GPUs. Architectures are evolving rapidly, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages, and programming models, among other components, to expose enough parallelism to migrate large-scale applications to these massively powerful platforms. This talk will discuss the programming challenges, and their corresponding solutions, encountered while using OpenACC to port a wavefront-based mini-application for Denovo, a production code for nuclear reactor modeling. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU achieves an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation.
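
As a rough illustration of this directive-based approach, here is a minimal OpenACC sketch in C++ of offloading a nested loop; the function, arrays, and loop body are hypothetical and are not taken from the Minisweep source.

    void sweep_step(int ncells, int ngroups,
                    const double *flux_in, double *flux_out)
    {
        // Offload the cell loop to the accelerator: gangs iterate over cells,
        // vector lanes over energy groups within a cell.
        #pragma acc parallel loop gang copyin(flux_in[0:ncells*ngroups]) \
                                       copyout(flux_out[0:ncells*ngroups])
        for (int c = 0; c < ncells; ++c) {
            #pragma acc loop vector
            for (int g = 0; g < ngroups; ++g) {
                flux_out[c*ngroups + g] = 0.5 * flux_in[c*ngroups + g];
            }
        }
    }

A real wavefront sweep would instead parallelize over the independent cells of one wavefront at a time, with a sequential outer loop over wavefronts.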

View the recording.
View the slides.

  • S8811 - An Agile Approach to Building a GPU-enabled and Performance-portable Global Cloud-resolving Atmospheric Model

We will give a high-level overview of the results of these efforts, and how we built a cross-organizational partnership to achieve them. Ours is a directive-based approach using OpenMP and OpenACC to achieve portability. We have focused on achieving good performance on the three main architectural branches available to us: traditional multi-core processors (e.g., Intel Xeons), many-core processors such as the Intel Xeon Phi, and of course NVIDIA GPUs. Our focus has been on creating tools for accelerating the optimization process, techniques for effective cross-platform optimization, and methodologies for characterizing and understanding performance. The results are encouraging, suggesting a path forward based on standard directives for responding to the pressures of future architectures.
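
A minimal sketch, assuming a build-time choice between back ends, of how one loop body can carry OpenMP and OpenACC directives for this kind of portability (the macros, function, and arrays are illustrative, not the model's actual code):

    void update_column(int n, double *t, const double *dt)
    {
    #if defined(USE_OPENACC)
        // NVIDIA GPUs via OpenACC
        #pragma acc parallel loop copy(t[0:n]) copyin(dt[0:n])
    #elif defined(USE_OMP_TARGET)
        // GPUs or other devices via OpenMP offload
        #pragma omp target teams distribute parallel for \
                map(tofrom: t[0:n]) map(to: dt[0:n])
    #else
        // Multi-core Xeon / many-core Xeon Phi fallback
        #pragma omp parallel for
    #endif
        for (int k = 0; k < n; ++k) {
            t[k] += dt[k];
        }
    }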

View the recording.
View the slides.

  • S8637 - Analysis of Performance Gap Between OpenACC and the Native Approach on P100 GPU and SW26010: A Case Study with GTC-P

We'll present our experience using OpenACC to port GTC-P, a real-world plasma turbulence simulation, to the NVIDIA P100 GPU and SW26010, the Chinese home-grown many-core processor. We also developed the GTC-P code with the native approach on the Sunway TaihuLight supercomputer so that we could analyze the performance gap between OpenACC and the native approach on the P100 GPU and SW26010. The experimental results show that the performance gap between OpenACC and CUDA on the P100 GPU is less than 10% with the PGI compiler. However, the gap on SW26010 is more than 50%, because the register-level communication (RLC) supported only by the native approach avoids low-efficiency main-memory accesses. Our case study demonstrates that OpenACC can deliver impressively portable performance on the P100 GPU, but the lack of software caching via RLC in the OpenACC compiler on SW26010 results in a large performance gap between OpenACC and the native approach.

View the recording.
View the slides.

  • S8800 - A Novel Mapped Grid Approach for GPU Acceleration of High-Order Structured Grid CFD Solvers

We'll present the use of state-of-the-art computational fluid dynamics algorithms and their performance on NVIDIA GPUs, including the new DGX-1 Station using multiple Tesla V100 GPU accelerators. A novel mapped-grid approach to implementing high-order stencil-based finite-difference and finite-volume methods is the highlight, but we'll also feature the use of flux reconstruction on GPUs using OpenACC.
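
As a hedged sketch of offloading a structured-grid stencil with OpenACC (a second-order Laplacian for brevity; the actual solver uses high-order stencils and a mapped-grid transformation not shown here):

    void laplacian_2d(int nx, int ny, double inv_h2,
                      const double *u, double *lap)
    {
        // Collapse the two grid loops into one large parallel iteration space.
        #pragma acc parallel loop collapse(2) copyin(u[0:nx*ny]) copy(lap[0:nx*ny])
        for (int j = 1; j < ny - 1; ++j) {
            for (int i = 1; i < nx - 1; ++i) {
                lap[j*nx + i] = inv_h2 * ( u[j*nx + (i-1)] + u[j*nx + (i+1)]
                                         + u[(j-1)*nx + i] + u[(j+1)*nx + i]
                                         - 4.0 * u[j*nx + i] );
            }
        }
    }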

View the recording.
View the slides.

  • S8188 - Application of OpenACC to Computer Aided Drug Discovery software suite "Sanjeevini"

In this session, we demonstrate the features and capabilities of OpenACC for porting and optimizing the ParDOCK docking module of the Sanjeevini suite for computer-aided drug discovery, developed at SCFBio, IIT Delhi, India. We used OpenACC to efficiently port the existing C++ ParDOCK code, with minimal modifications, to run on the latest NVIDIA P100 GPU. With these code modifications and tuning, an average 6x improvement in turnaround time was achieved. With OpenACC, the code can now sample 10 times more ligand conformations, leading to an increase in accuracy. The OpenACC-ported ParDOCK code now predicts the correct pose of a protein-ligand interaction 96.8% of the time, compared to 94.3% earlier (for poses under 1 Å), and 89.9% of the time, compared to 86.7% earlier (for poses under 0.5 Å).

No recording available.
View the slides.

  • S8805 - Managing Memory of Complex Aggregate Data Structures in OpenACC

It is extremely challenging to move data between host and device memories when deeply nested, complex aggregate data structures are used throughout an application. This talk will dive into VASP, ICON, and other real-world applications to show how the deep-copy issue is solved in them with the PGI compiler and OpenACC APIs. The OpenACC 2.6 specification includes directives and rules that enable programmer-controlled manual deep copy, albeit in a form that can be intrusive in terms of the number of directives required. The OpenACC committee is designing new directives to extend explicit data management to aggregate data structures in a form that is more elegant and concise. The talk will also compare unified memory, manual deep copy, full deep copy, and true deep copy.
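
A minimal sketch of the programmer-controlled manual deep copy that OpenACC 2.6 enables, for a hypothetical aggregate type (not the VASP or ICON data structures):

    struct Field {
        int     n;
        double *values;   // dynamically allocated member
    };

    void field_to_device(Field *f)
    {
        // Copy the outer struct first, then its dynamic member; the runtime
        // attaches the device copy of values to the device copy of the struct.
        #pragma acc enter data copyin(f[0:1])
        #pragma acc enter data copyin(f->values[0:f->n])
    }

    void field_from_device(Field *f)
    {
        #pragma acc exit data copyout(f->values[0:f->n])
        #pragma acc exit data delete(f[0:1])
    }

The directive count grows quickly as structures nest, which is the intrusiveness the proposed extensions aim to reduce.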

View the recording.
View the slides.

  • S8506 - Mapping MPI+X Applications to Multi-GPU Architectures: A Performance-Portable Approach

Learn how to map parallel scientific applications onto multi-GPU architectures using a performance-portable approach. The approach is built on three fundamental aspects: first, the memory hierarchy is the primary design consideration; second, there is a global awareness of hybrid programming abstractions such as MPI+CUDA+OpenMP; and third, a framework enables the integration and support of heterogeneous devices. We'll provide example mappings on a CORAL early-access system consisting of IBM Power8+ processors with NVIDIA Pascal GPUs. We'll also discuss the performance of micro-benchmarks and an earthquake ground motion simulation code relative to other mapping approaches.
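
One small, concrete ingredient of such a mapping is binding each MPI rank to a GPU on its node. The sketch below is illustrative rather than the authors' framework; it uses an MPI-3 shared-memory sub-communicator to find the node-local rank:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Rank local to this node, via a shared-memory sub-communicator.
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank,
                            MPI_INFO_NULL, &node_comm);
        int local_rank;
        MPI_Comm_rank(node_comm, &local_rank);

        int ndevices = 0;
        cudaGetDeviceCount(&ndevices);
        if (ndevices > 0)
            cudaSetDevice(local_rank % ndevices);   // round-robin ranks over GPUs

        /* ... MPI+CUDA+OpenMP application work ... */

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }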

View the recording.
View the slides.

  • S8580 - Modernizing OpenMP for an Accelerated World

OpenMP has come a long way in its first 20 years, but the last few have brought by far the most change. With accelerated computing on the rise, OpenMP integrated features for offloading to distributed-memory devices and accelerators. Now, as we prepare for the next generation of supercomputers and GPUs, OpenMP is growing to meet the challenges of productively programming scientific applications in a world of accelerators, unified memory, and explicitly hierarchical memories. This talk will discuss the present and future of OpenMP as we ramp up to version 5.0, presenting some of the new features incorporated so far, how they are shaped by large-scale scientific applications, and how they in turn shape those applications.
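
For readers new to the accelerator features, a minimal OpenMP target-offload sketch (an illustrative SAXPY, not drawn from any particular application):

    void saxpy(int n, float a, const float *x, float *y)
    {
        // Offload the loop to the device and map the arrays explicitly.
        #pragma omp target teams distribute parallel for \
                map(to: x[0:n]) map(tofrom: y[0:n])
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }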

View the recording.
View the slides.

  • S8351 - Multi GPU Made Easy by OmpSs + CUDA/OpenACC

While OpenACC focuses on coding productivity and portability, CUDA enables extracting maximum performance from NVIDIA GPUs. OmpSs, on the other hand, is a GPU-aware, task-based programming model that can be combined with CUDA and, recently, with OpenACC as well. Using OpenACC, we can begin benefiting from GPU computing quickly, obtaining good coding productivity and solid performance improvements. We can then fine-tune the critical parts of the application by developing CUDA kernels to hand-optimize them. OmpSs combined with either OpenACC or CUDA enables seamless task parallelism that leverages all devices in the system.

View the recording.
View the slides.

  • S8314 - Multi GPU Programming with MPI

Learn how to program multi-GPU systems or GPU clusters using the message passing interface (MPI) and OpenACC or NVIDIA CUDA. We'll start with a quick introduction to MPI and how it can be combined with OpenACC or CUDA. Then we'll cover advanced topics like CUDA-aware MPI and how to overlap communication with computation to hide communication times. We'll also cover the latest improvements with CUDA-aware MPI, interaction with unified memory, the multi-process service (MPS, aka Hyper-Q for MPI), and MPI support in NVIDIA performance analysis tools.
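
A minimal sketch of the CUDA-aware MPI idea covered in the talk: device buffers are passed directly to MPI calls, with no explicit staging through host memory (the neighbor ranks and counts are illustrative):

    #include <mpi.h>

    void exchange_halos(double *d_send, double *d_recv, int count,
                        int left, int right, MPI_Comm comm)
    {
        // With a CUDA-aware MPI library, d_send and d_recv may be device pointers.
        MPI_Sendrecv(d_send, count, MPI_DOUBLE, right, 0,
                     d_recv, count, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
    }

Overlap of communication and computation is typically layered on top of this by posting nonblocking MPI_Isend/MPI_Irecv for boundary data while interior work proceeds in a CUDA stream.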

View the recording.
View the slides.

  • S8373 - MVAPICH2-GDR: Pushing the Frontier of Designing MPI Libraries Enabling GPUDirect Technologies

Learn about the latest developments in the high-performance message passing interface (MPI) over InfiniBand, iWARP, and RoCE (MVAPICH2) library that simplify the task of porting MPI applications to supercomputing clusters with NVIDIA GPUs. MVAPICH2 supports MPI communication directly from GPU device memory and optimizes it using various features offered by the CUDA toolkit, providing optimized performance on different GPU node configurations. These optimizations are integrated transparently under the standard MPI API for better programmability. Recent advances in MVAPICH2 include designs for MPI-3 RMA using the GPUDirect RDMA framework, MPI datatype processing using CUDA kernels, support for GPUDirect Async, support for heterogeneous clusters with GPU and non-GPU nodes, and more. We use the popular Ohio State University micro-benchmark suite and example applications to demonstrate how developers can effectively take advantage of MVAPICH2 in applications using MPI and CUDA/OpenACC. We provide guidance on issues like processor affinity to GPU and network that can significantly affect the performance of MPI applications that use MVAPICH2.

View the recording.
View the slides.

  • S8926 - ORNL Summit: Accelerated Simulations of Stellar Explosions with FLASH: Towards Exascale Capability

Multiphysics and multiscale simulations are found in a variety of computational science subfields, but their disparate computational characteristics can make GPU implementations complex and often difficult. Simulations of supernovae are ideal examples of this complexity. We use the scalable FLASH code to model these astrophysical cataclysms, incorporating hydrodynamics, thermonuclear kinetics, and self-gravity across considerable spans in space and time. Using OpenACC and GPU-enabled libraries coupled to new NVIDIA GPU hardware capabilities, we have improved the physical fidelity of these simulations by increasing the number of evolved nuclear species by more than an order of magnitude. I will discuss these and other performance improvements to the FLASH code on the Summit supercomputer at Oak Ridge National Laboratory.

View the recording.
No slides available.

  • S8908 - ORNL Summit: Enabling Large Scale Science on Summit Through the Center for Accelerated Application Readiness

The Center for Accelerated Application Readiness within the Oak Ridge Leadership Computing Facility is a program to prepare scientific applications for next-generation supercomputer architectures. The program currently consists of thirteen domain-science application development projects focused on preparing codes for efficient use on Summit. Over the last three years, these teams have developed and executed a development plan based on detailed information about Summit's architecture and system software stack. This presentation will highlight the progress made by the teams using Titan, the 27 PF Cray XK7 with NVIDIA K20X GPUs; SummitDev, an early-access IBM Power8+ system with NVIDIA P100 GPUs; and, most recently, Summit, OLCF's new IBM Power9 system with NVIDIA V100 GPUs. The program covers a wide range of domain sciences, with applications including ACME, DIRAC, FLASH, GTC, HACC, LSDALTON, NAMD, NUCCOR, NWCHEM, QMCPACK, RAPTOR, SPECFEM, and XGC.

View the recording.
View the slides.

  • S8909 - ORNL Summit: Exposing Particle Parallelism in the XGC PIC code by exploiting GPU memory hierarchy

XGC is a kinetic whole-volume modeling code with unique capabilities to study tokamak edge plasmas in real geometry and answer important questions about the design of ITER and other future fusion reactors. The main technique is the particle-in-cell method, which models the plasma as billions of quasiparticles representing ions and electrons. Ostensibly, the process of advancing each particle in time is embarrassingly parallel. However, the electric and magnetic fields must be known in order to push each particle, which requires an implicit gather operation from XGC's sophisticated unstructured mesh. In this session, we'll show how careful mapping of field and particle data structures to GPU memory allowed us to decouple the performance of the critical electron push routine from the size of the simulation mesh and let the true particle parallelism dominate. This improvement enables performant, high-resolution, ITER-scale simulations on Summit.
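
As a hedged illustration of the kind of data-structure mapping the abstract alludes to, a structure-of-arrays particle layout keeps each attribute contiguous so that neighboring GPU threads pushing neighboring particles make coalesced loads; the field names below are hypothetical, not XGC's actual data structures.

    #include <cuda_runtime.h>

    struct Particles {          // structure of arrays, device-resident
        double *r, *z, *phi;    // positions
        double *v_par;          // parallel velocity
        int     n;              // number of particles
    };

    __global__ void push(Particles p, double dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < p.n) {
            // Illustrative update only; the real push first gathers the E and B
            // fields from the unstructured mesh before advancing the particle.
            p.phi[i] += dt * p.v_par[i];
        }
    }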

View the recording.
View the slides.

  • S8747 - ORNL Summit: Petascale Molecular Dynamics Simulations on the Summit POWER9/Volta Supercomputer

Learn the opportunities and pitfalls of running billion-atom science at scale on a next-generation, pre-exascale, GPU-accelerated supercomputer. The highly parallel molecular dynamics code NAMD has long been used on the GPU-accelerated Cray XK7 Blue Waters and ORNL Titan machines to perform petascale biomolecular simulations, including a 64-million-atom model of the HIV virus capsid. In 2007, NAMD was one of the first codes to run on a GPU cluster, and it is now one of the first on the new ORNL Summit supercomputer, which features IBM POWER9 CPUs, NVIDIA Volta GPUs, and the NVLink CPU-GPU interconnect. This talk will cover the latest NAMD performance improvements and scaling results on Summit and other leading supercomputers.

View the recording.
No slides available.

  • S8190 - Performance Optimization for Scientific Applications

We'll take you on a journey through enabling applications for GPUs; interoperability of different languages (including Fortran, OpenACC, C, and CUDA); CUDA library interfacing; data management, movement, and layout tuning; kernel optimization; tool usage; multi-GPU data transfer; and performance modeling. We'll show how careful optimizations can have a dramatic effect and push application performance towards the maximum possible on the hardware. We'll describe tuning of multi-GPU communications, including efficient exploitation of high-bandwidth NVLink hardware. The applications used in this study are from the domain of numerical weather prediction, and also feature in the ESCAPE European collaborative project, but we'll present widely relevant techniques in a generic and easily transferable way.
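
One concrete interoperability pattern of the kind discussed, sketched here under the assumption of an OpenACC data region handing its device pointers to a CUDA library (an illustrative DAXPY, not the ESCAPE codes themselves):

    #include <cublas_v2.h>

    void scaled_add(cublasHandle_t handle, int n, double alpha,
                    const double *x, double *y)
    {
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            // Expose the device addresses of x and y to the cuBLAS call.
            #pragma acc host_data use_device(x, y)
            {
                cublasDaxpy(handle, n, &alpha, x, 1, y, 1);
            }
        }
    }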

View the recording.
View the slides.

  • S8446 - Porting Quantum ESPRESSO's PWscf Solver to GPUs with CUDA Fortran

Learn how to effectively leverage CUDA Fortran to port scientific applications written in Fortran to GPUs. We'll present in detail the porting effort of Quantum ESPRESSO's Plane-Wave Self-Consistent Field (PWscf) solver, from profiling and identifying time-consuming procedures to performance analysis of the GPU-accelerated solver on several benchmark problems on systems ranging in size from small workstations to large distributed GPU clusters. We'll highlight several tools available in CUDA Fortran to accomplish this, from high-level CUF kernel directives to lower level kernel programming, and provide guidance and best practices in several use cases with detailed examples.

View the recording.
View the slides.

  • S8799 - On Porting Scalable Parallel CFD Application HiFUN on NVIDIA GPU

This study deals with porting the scalable parallel CFD application HiFUN to NVIDIA GPUs using an off-load strategy. The strategy focuses on improving single-node performance of the HiFUN solver with the help of GPUs. This work clearly brings out the efficacy of the off-load strategy using OpenACC directives on GPUs, and it may be considered one of the attractive models for porting legacy CFD codes to GPU-based supercomputing platforms.

View the recording.
View the slides.

  • S8344 - OpenMP on GPUs, First Experiences and Best Practices

OpenMP has a long history on shared memory, CPU-based machines, but has recently begun to support offloading to GPUs and other parallel accelerators. This talk will discuss the current state of compilers for OpenMP on NVIDIA GPUs, showing results and best practices from real applications. Developers interested in writing OpenMP codes for GPUs will learn how best to achieve good performance and portability.

View the recording.
View the slides.

  • S8750 - Porting VASP to GPUs with OpenACC

VASP is a software package for atomic-scale materials modeling. It's one of the most widely used codes for electronic-structure calculations and first-principles molecular dynamics. We'll give an overview and status of porting VASP to GPUs with OpenACC. Parts of VASP were previously ported to CUDA C with good speed-ups on GPUs, but also with an increase in the maintenance workload as VASP is otherwise written wholly in Fortran. We'll discuss OpenACC performance relative to CUDA, the impact of OpenACC on VASP code maintenance, and challenges encountered in the port related to management of aggregate data structures. Finally, we'll discuss possible future solutions for data management that would simplify both new development and maintenance of VASP and similar large production applications on GPUs.

View the recording.
View the slides.

  • S8273 - Programming GPU Supercomputers Ten Years From Now

We'll briefly review how programming for GPU computing has progressed over the past ten years, and where it is going over the next ten years, specifically for data management and parallel compute management. CUDA languages expose all aspects of data and compute management, allowing and sometimes requiring programmers to take control of both. Libraries typically internalize all compute management, and some internalize all data management as well. Directives virtualize both data and compute management, but don't completely hide either. Future hardware and software capabilities will allow programs to enjoy automatic data movement between DDR memory and GPU device memory, and enhanced caching hardware will reduce the need for explicit scratchpad memory programming. As parallel constructs are added to standard programming languages, writing parallel programs for GPU computing will become no more or less difficult than multicore programming.

View the recording.
View the slides.

  • S8489 - Scaling Molecular Dynamics Across 25,000 GPUs on Sierra & Summit

As a part of the Department of Energy/National Cancer Institute pilot programs and the Sierra Institutional Center of Excellence, Lawrence Livermore National Laboratory has developed strong-scaling molecular dynamics codes for atomic-level simulation in physics, materials science, and biology. Our implementation is portable from tablets and laptops to supercomputers, and can efficiently scale up to tens of thousands of GPUs. In particular, we target the Department of Energy leadership computing facilities, Sierra and Summit, at the Livermore and Oak Ridge National Laboratories. These are over-100-petaflops supercomputers powered by IBM and NVIDIA hardware. We'll discuss the performance and scaling of our code, and its application to cancer biology research, materials science, and high-energy physics.

View the recording.
View the slides.

  • S8847 - Solar Storm Modeling using OpenACC: From HPC Cluster to "In-House"

We explore using OpenACC to migrate applications required for modeling solar storms from CPU HPC clusters to an "in-house" multi-GPU system. We describe the software pipeline and the use of OpenACC in the computationally heavy codes. A major step forward is the initial implementation of OpenACC in our magnetohydrodynamics code MAS. Strategies for overcoming some of the difficulties encountered are discussed, including handling Fortran derived types, array reductions, and performance tuning. Production-level time-to-solution results will be shown for multi-CPU and multi-GPU systems of various sizes. The timings show that it is possible to achieve an acceptable time-to-solution on a single multi-GPU server or workstation for problems that previously required multiple HPC CPU nodes.

View the recording.
View the slides.

  • S8241 - Sunny Skies Ahead! Versioning GPU accelerated WRF to 3.7.1

This talk details the inherent challenges in porting a GPU-accelerated community code (WRF) to a newer major version, integrating the community's non-GPU changes with OpenACC directives from the earlier version. This is a non-trivial exercise: this particular version upgrade contained 143,000 modified lines of code, which required reintegration into our accelerator directives. This work is important in providing support for newer features whilst still providing GPU support for users. We also look at efforts to improve the maintainability of GPU-accelerated community codes.

View the recording.
View the slides.

  • S8557 - Tricks, Tips, and Timings: The Data Movement Strategies You Need to Know

Learn the latest strategies to efficiently move complicated data structures between GPUs and CPUs. We'll go beyond basic data movement, showing techniques that have been used in practice to port and optimize large-scale production applications. These include a look at the unique benefits of zero copy, how to set up a deep copy to avoid having to flatten data structures, and how this can be done in OpenMP 4. We'll cover both CUDA and directive approaches using examples written in modern Fortran and applicable in any language.
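
As a minimal sketch of the zero-copy strategy mentioned above (an illustrative kernel; a real code would check the CUDA error returns):

    #include <cuda_runtime.h>

    __global__ void scale(double *data, int n, double s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }

    void run_zero_copy(int n)
    {
        double *h_data = nullptr, *d_alias = nullptr;
        // Pinned host memory mapped into the device address space: the kernel
        // reads and writes it over the interconnect instead of a staged copy.
        cudaHostAlloc((void **)&h_data, n * sizeof(double), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_alias, h_data, 0);

        // ... host code would fill h_data here ...

        scale<<<(n + 255) / 256, 256>>>(d_alias, n, 2.0);
        cudaDeviceSynchronize();

        cudaFreeHost(h_data);
    }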

View the recording.
View the slides.

  • S8470 - Using RAJA for Accelerating LLNL Production Applications on the SIERRA Supercomputer

Top supercomputers in the TOP500 list have transitioned from homogeneous node architectures toward heterogeneous many-core nodes with accelerators and CPUs. These new architectures present significant challenges to developers of large-scale multiphysics applications, especially at Department of Energy laboratories that have invested heavily in scalable message passing interface codes over decades. Preserving developer productivity requires single-source, high-performance code bases while porting to new architectures. We'll introduce RAJA, a C++-based programming model abstraction developed at Lawrence Livermore National Laboratory (LLNL) and used to abstract fine-grained on-node parallelization in multiple production applications. Then, we'll describe how RAJA is used in ARES, a large multiphysics application at LLNL.
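
A minimal RAJA sketch of the single-source, fine-grained on-node parallelism described above (an illustrative loop, not ARES source; data is assumed device-accessible, here via CUDA managed memory):

    #include "RAJA/RAJA.hpp"
    #include <cuda_runtime.h>

    void fill_scaled(int n, double a)
    {
        double *x = nullptr;
        cudaMallocManaged((void **)&x, n * sizeof(double));

        // Swapping the execution policy (e.g., for OpenMP or sequential back
        // ends) retargets the same loop body without changing it.
        using gpu_policy = RAJA::cuda_exec<256>;   // 256 threads per block
        RAJA::forall<gpu_policy>(RAJA::RangeSegment(0, n),
            [=] RAJA_DEVICE (int i) {
                x[i] = a * static_cast<double>(i);
            });

        cudaFree(x);
    }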

View the recording.
View the slides.

  • S8382 - Zero to GPU Hero with OpenACC

GPUs are often the fastest way to obtain your scientific results, but many students and domain scientists don't know how to get started. In this tutorial, we will take an application from simple, serial loops to a fully GPU-enabled application. Students will learn a profile-guided approach to accelerating applications, including how to find hotspots, how to use OpenACC to accelerate important regions of code, and how to get the best performance they can on GPUs. No prior experience in GPU programming or OpenACC is required, but experience with C, C++, or Fortran is a must. Several books will be given away to attendees who complete this tutorial.

View the recording.
No slides available.

Posters

  • P8168 - Efficient GPU Parallelization of MPAS Physics Schemes Using OpenACC Directives

Meetings

The OpenACC User Group meets a few times a year during key HPC events to discuss training, provide feedback on the specification, collaborate on OpenACC-related research and activities, share experiences and best practices, and have a good time in great company! Invited speakers are Dr. John Stone from UIUC, who will talk about VMD progress with OpenACC, and Dr. Randy Allen from Mentor Graphics, who will update us on the GCC implementation. Join us on March 27th at GTC18 - food and drinks are on us.

View John Stone's slides.
View Randy Allen's slides.
