MPI Debugging and Profiling

Porting MPI Applications with PGI Compilers and Tools

Today, clusters of x64 and x64+GPU workstations and servers running Linux or Windows are capable of tackling the largest scientific computing applications. If you are among the many users moving to cluster computing to break free of the serial performance limits of today's servers, you need a complete set of cluster-capable development tools to effectively port, debug and tune your Fortran, C and C++ applications. The PGI CDK® Cluster Development Kit® is a suite of compilers and tools for porting existing applications to, or developing new applications on, clusters.

Fundamental components for effective cluster computing include a job scheduler to manage cluster throughput, a means for launching and monitoring compute jobs on a cluster, and an MPI (Message Passing Interface) library for internode communication. Linux users have a number of choices in each category, both open source and commercially supported. Windows users will find most of these capabilities built into the Microsoft HPC Server product line. Whichever OS path you choose, for productive and effective cluster application development you'll quickly come to appreciate the value of a cluster-capable debugger and performance profiler as well.

The PGDBG OpenMP/MPI Debugger

Debugging a cluster MPI application can be extremely challenging. The PGDBG debugger provides a comprehensive set of graphical user interface (GUI) elements to assist you in this process. PGDBG provides the ability to separately debug and control OpenMP threads and MPI processes on both Linux and Windows HPC Server clusters. Perform Step, Break, Run and Halt actions on threads or processes individually or collectively as a group. PGDBG can even display the state of MPI message queues, enabling you to quickly isolate and resolve message-passing deadlock bugs.
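
For example, a deadlock of the kind the message-queue display helps isolate can be as simple as two ranks that each post a blocking send before either posts a receive. The following C sketch is illustrative only and is not taken from the PGI documentation; with sufficiently large messages, both sends stall and the pending messages remain visible in the MPI queues.

/* Illustrative deadlock: both ranks post a blocking MPI_Send before
 * either posts a receive, so with large enough messages neither send
 * can complete. Assumes exactly 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    int rank, peer;
    static double sendbuf[N], recvbuf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                 /* partner rank (2-rank job) */

    /* Both ranks block here: each MPI_Send waits for a matching
     * receive that is never posted, leaving pending sends visible
     * in the MPI message queues. */
    MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    printf("rank %d done\n", rank);
    MPI_Finalize();
    return 0;
}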

Using a single integrated multi-process debugging window, PGDBG provides precise control and feedback on the state of every MPI process and OpenMP thread simultaneously, with fully integrated capabilities for debugging hybrid parallel programs that use MPI message-passing between nodes and OpenMP shared-memory parallelism within a multicore or SMP cluster node.
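
A hybrid program of this kind can be as simple as the following C sketch (illustrative only, with arbitrary loop bounds): OpenMP threads divide the work within each MPI process, and MPI combines the per-process results across nodes.

/* Hybrid MPI + OpenMP sketch: threads within a process, messages
 * between processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, provided;
    double local = 0.0, global = 0.0;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* OpenMP threads share the work within one MPI process... */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0 / (i + 1.0 + rank);

    /* ...and MPI combines the per-process results across nodes. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}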

Tabs in the Main window Source Panel allow you to display source code only, disassembly code showing how the currently executing high-level source code has been compiled into assembly language, or a mix where the assembly code is interleaved with the source code. Assembly language stepping and breakpoint indicators are enabled as well.

On Windows, PGDBG is interoperable with the Microsoft Visual C++ compiler, and together with PGI Visual Fortran gives you the power to port and debug your OpenMP and MS-MPI applications on Windows HPC Server clusters using an easy and intuitive graphical user interface.

The PGPROF OpenMP/MPI Performance Profiler

PGPROF® is a powerful and simple-to-use interactive postmortem statistical analyzer for MPI process-parallel and OpenMP thread-parallel programs as well as programs incorporating PGI Accelerator directives and CUDA Fortran. Use PGPROF to visualize and diagnose the performance of the components of your program. PGPROF associates execution time with the source code and instructions of your program, allowing you to see where and how execution time is spent. Through resource utilization data and compiler feedback information, PGPROF also provides features for helping you to understand why certain parts of your program have high execution times.

PGPROF provides the information necessary to determine which functions and lines in an application are consuming the most execution time. Combined with the feedback features of the PGI compilers, PGPROF will enable you to maximize vectorization and performance on a single x64 processor core. PGPROF exposes performance bottlenecks in a cluster application by presenting the number of calls, aggregate message size and execution time of individual MPI function calls on a line-by-line basis. On GPUs, PGPROF reports performance-critical information including initialization, data transfer and kernel execution times.
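
As a hypothetical illustration of what line-level call counts and aggregate message sizes reveal, compare two ways of moving the same data between two ranks in the C sketch below; the buffer size and tags are arbitrary and the example is not taken from the PGI documentation.

/* Same aggregate data volume, very different line-level profiles:
 * N tiny sends versus one large send. Assumes 2 ranks. */
#include <mpi.h>

#define N 100000

static double buf[N];

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* N tiny sends: a per-line profile shows an enormous call
         * count on this line, with per-call latency dominating. */
        for (int i = 0; i < N; i++)
            MPI_Send(&buf[i], 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);

        /* One large send of the same aggregate size: one call,
         * far less overhead attributed to this line. */
        MPI_Send(buf, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        for (int i = 0; i < N; i++)
            MPI_Recv(&buf[i], 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}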

Using PGPROF, you can merge profiles from multiple runs on different numbers of nodes to perform scalability analysis on your MPI or OpenMP application at the application, function or line level. Scalability analysis allows you to quickly see which parts of your application are barriers to scalable performance, and where your parallel tuning efforts should be focused. PGPROF displays information in easy-to-use formats such as bar charts, percentages, counts or seconds, and presents profiles using graphical histograms.

Putting it All Together

While performance of individual x64 processor cores is still improving, the premium on power efficiency has led processor vendors to push aggressively on multi-core technology rather than increased clock speeds. Significant application performance gains in the next few years will depend directly on your ability to exploit multi-core and cluster platforms. The PGI compilers and tools give you the ability to migrate incrementally from serial to auto-parallel or OpenMP parallel algorithms for multi-core processors. When you are ready to take the next step to cluster-enabled applications using MPI, the PGDBG debugger and PGPROF profiler provide simple and intuitive interfaces to make porting and tuning of applications to MPI more tractable.
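
As a minimal sketch of that incremental path (illustrative only, with arbitrary sizes), a serial loop becomes multicore-parallel with a single OpenMP directive, well before any MPI decomposition is attempted:

/* Incremental parallelization sketch: one OpenMP directive turns a
 * serial loop into a multicore loop on a single node; distributing
 * the work across nodes with MPI can follow as a later step. */
#include <omp.h>
#include <stdio.h>

#define N 10000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* The only change from the serial version is this directive. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i * 0.5;
        sum += a[i];
    }

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}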
