GPU Tutorial Summary

Michael Wolfe from PGI will conduct a half-day tutorial on "Performance Optimization on GPUs" on the afternoon of Tuesday, 31 May 2011, in Tucson, Arizona, in conjunction with the 25th International Conference on Supercomputing (ICS 2011). Registration and cost information is available from the ICS 2011 web site. You may also be interested in the morning tutorial on "Introduction to Programming GPUs".

This tutorial focuses on tuning for and exploiting the critical performance features of a GPU, and on avoiding performance hazards. We use examples from NVIDIA CUDA C, PGI CUDA Fortran, and the PGI Accelerator programming model. We use performance measurement tools to focus attention on the performance-critical issues. The performance issues covered include data traffic between the host and the GPU, data traffic between the GPU memory and the GPU cores, kernel code optimization, and thread-block and grid shape. We motivate and demonstrate each task with live examples.

The target audience is C and Fortran programmers who have taken a GPU programming tutorial, or who have some experience with computing on GPUs, and who want to tune their application performance on the GPU.

Tentative Schedule

Date: Tuesday, 31 May 2011

13:00-13:15 - GPU Architecture Basics
13:15-13:45 - Performance Measurement for CUDA Programs
13:45-14:45 - Performance Tuning for CUDA Programs
14:45-15:00 - Compiler Feedback for Accelerator Programs
15:00-15:30 - Break
15:30-16:00 - Performance Measurement for Accelerator Programs
16:00-17:00 - Performance Tuning for Accelerator Programs

The schedule is subject to minor changes.

Tutorial Syllabus

Part I. Introduction (0:15)

  1. GPU Architecture Basics
    • GPU Architecture features
    • How the GPU is connected to the host
    • Grids, Blocks, Threads, Warps
    • Device memory, hardware cache memory, software cache memory

Part II. CUDA Performance Tuning (1:30)

  1. Performance Measurement
    • Using cudaprof
    • Using pgprof
    • Memory coalescing, divergence
  2. Performance Tuning
    • Optimize data movement
    • Device memory accesses
    • Kernel launch configurations
    • Optimize kernel code
    • Loop unrolling

Part III. PGI Accelerator Model Performance Tuning (1:45)

  1. Compiler Feedback
    • Data movement messages
    • Kernel messages
    • Performance messages
  2. Performance Measurement
    • Runtime profiler
    • Using pgprof
  3. Performance Tuning
    • Appropriate algorithm
    • Optimize data movement
    • Optimize kernel performance
    • Tuning the kernel schedule

About the Presenter

Michael Wolfe has been a compiler engineer at The Portland Group since 1996, where his responsibilities and interests include deep compiler analysis and optimizations, ranging from reducing power consumption for embedded microcores to improving the efficiency of Fortran on parallel clusters. He was an associate professor at the Oregon Graduate Institute from 1988 until 1996, and before that was a cofounder and lead compiler engineer at Kuck and Associates, Inc. He holds a PhD in Computer Science from the University of Illinois and has published a textbook, High Performance Compilers for Parallel Computing, a monograph, Optimizing Supercompilers for Supercomputers, and many technical papers. Dr. Wolfe is also a Fellow with STMicroelectronics.
