The Portland Group (PGI), a leader in GPGPU development tools and technologies for HPC, offers one- and two-day courses on NVIDIA GPU programming with CUDA C, CUDA Fortran and the PGI Accelerator programming model.
Intended for expert domain scientists and engineers, the courses are delivered in an all-day interactive lecture format and cover all aspects of GPU accelerator programming. This includes background information, a summary of programming alternatives including ways you can determine which approach is right for your project, detailed walk-throughs of real-world examples, and in-depth performance analysis and tuning discussions. In addition to the classroom lectures, PGI offers an optional additional one or two days of hands-on collaboration where you and PGI GPU experts can work side-by-side on your code.
The courses are offered on-site at your location and are conducted by members of PGI's staff of experienced GPU compiler and applications engineers.
Price and Availability
A one-day, lecture-only training course includes a CD-ROM containing PGI manuals and documentation, training materials, example programs, and tutorial materials. There are no class size restrictions.
A two-day training course includes lecture, lab and Q&A sessions, and includes a CD-ROM containing PGI manuals and documentation, training materials, example programs, and tutorial materials. There is no class size maximum for the first-day lecture; the maximum class size for day two is 10 students.
For those needing extended training, an additional day of hands-on collaboration can be added to either course option. The additional day can be delivered at your site along with the course, or at a later date via telephone or the Internet. Maximum class size is 10 students.
For pricing information, contact PGI sales. Travel, meals and accommodations are extra.
Classes are scheduled subject to availability. PGI also conducts training in concert with many major HPC industry conferences throughout the year. Contact PGI sales for schedules and related information.
Lecture Syllabus
Part I. Introduction
- CPU Architecture vs. GPU Architecture
  - CPU architecture basics
  - Multicore and multiprocessor basics
  - GPU architecture basics
  - How is the GPU connected to the host?
- Why is parallel programming for GPUs different than for multicore?
- What is a GPU thread and how does it execute?
- How can I identify my GPU?
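The last question above can be answered in code: the CUDA Runtime API can enumerate the GPUs in a system and report their compute capability. The following is a minimal sketch, not an excerpt from the course materials:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA-capable device found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        // Query the name, compute capability and multiprocessor count
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, compute capability %d.%d, %d multiprocessors\n",
               i, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount);
    }
    return 0;
}
```

The compute capability reported here determines which CUDA features a device supports, a recurring theme in the performance-tuning portions of the course.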
Part II. CUDA, C and Fortran
- Low-level GPU Programming with CUDA
  - How does data get to the GPU?
  - How does a program run on the GPU?
  - What kinds of parallelism are appropriate for a GPU?
- The CUDA Programming Model
  - Host code to control the GPU, allocate memory and launch kernels
  - Kernel code to execute on the GPU
    - Scalar routine executed on one thread
    - Launched in parallel on a grid of thread blocks
- The Host Program
  - Declaring and allocating device memory data
  - Moving data to and from the device
  - Launching kernels
- Writing Kernels
  - What is allowed in a kernel vs. what is not allowed
  - Grids, blocks, threads, warps
- Building and Running CUDA Programs
  - Compiler options
  - Running your program
  - The CUDA Runtime API
  - CUDA Fortran vs. CUDA C
- Performance Tuning, Tips and Tricks
  - Measuring performance using cudaprof
  - Occupancy and memory coalescing
  - Optimizing your kernels
    - Optimizing communication between host and GPU
    - Optimizing device memory accesses and shared memory usage
    - Optimizing the kernel code
  - Debugging using emulation
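The host-program steps listed above (declare and allocate device memory, move data to the device, launch a kernel on a grid of thread blocks, copy results back) follow a standard pattern in CUDA C. The following is a minimal sketch for illustration, not an excerpt from the course materials:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel: a scalar routine executed by one thread per element,
// launched in parallel on a grid of thread blocks.
__global__ void add(int n, const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host allocation and initialization
    float *a = (float *)malloc(bytes);
    float *b = (float *)malloc(bytes);
    float *c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Declare and allocate device memory
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);

    // Move data to the device
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel on a grid of thread blocks
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(n, da, db, dc);

    // Move the result back to the host
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}
```

Each of these steps, and how to tune them (block sizing, coalesced memory access, minimizing host-device transfers), is treated in depth in the lecture.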
Part III. PGI Accelerator Model
- High-level GPU Programming Using the PGI Accelerator Model
  - What role does a high-level model play?
  - Basic concepts and directive syntax
  - Accelerator compute and data regions
  - Appropriate algorithms for a GPU
- Building and Running Accelerator Programs
  - Command-line options
  - Enabling compiler feedback
- Accelerator Directive Details
  - Compute regions
    - Clauses on the compute region directive
    - What can appear in a compute region
    - Obstacles to successful acceleration
  - Loop directive
    - Clauses on the loop directive
    - Loop schedules
  - Data regions
    - Clauses on the data region directive
- Interpreting Compiler Feedback
  - Using the pgprof source browser
  - Hindrances to parallelism
  - Data movement feedback
  - Reading kernel schedules
- Performance Tuning, Tips and Tricks
  - Appropriate algorithms
  - Optimizing data movement between host and GPU
  - Optimizing kernel performance
  - Tuning the kernel schedule
  - Choosing the accelerator device
  - PGI Unified Binary for multiple hosts or multiple accelerators
  - Performance profiling information
  - Optimizing initialization time
Part IV. Wrap-up, Questions
- Accelerators in HPC
  - Past, present and future role of accelerators in HPC
  - Past, present and future of programming models for accelerators
  - How to reach an exaflop