MPI Debugging and Profiling

Porting, Debugging and Profiling MPI Applications on Windows CCS Clusters

Windows Compute Cluster Server 2003 enables clusters of AMD Opteron and Intel Core 2 processor-based workstations and servers to tackle scientific computing applications in a mainstream Windows environment. Windows CCS provides several fundamental components for effective cluster computing—the MSMPI message-passing interface library, a job scheduler to manage cluster throughput, and a means for launching and monitoring compute jobs on a cluster that is nearly as simple as printing a document from a Windows application.

If you are among the many Windows users moving to cluster computing to break free of the serial performance limits of today's servers, you need a complete set of cluster-capable development tools to port, debug and tune your Fortran, C and C++ applications on Windows CCS. PGI® 7.1 compilers and tools from The Portland Group, scheduled for general availability in fall 2007, add three key components that dovetail with Windows CCS and Microsoft Visual C++ to enable effective cluster-based computing:

PGI Visual Fortran

PGI Visual Fortran (PVF) fully integrates the PGI suite of high-performance 64-bit and 32-bit parallel Fortran compilers and tools into Microsoft Visual Studio 2005. Interoperable with Microsoft Visual C++, PVF is well suited to porting computationally intensive science and engineering applications to Windows CCS clusters, and lets you build and execute MPI applications from within Visual Studio 2005.

PGI Visual Fortran offers world-class performance and features including auto-parallelization for multi-core, OpenMP 2.5, and support for the PGI Unified Binary. The PGI Unified Binary streamlines cross-platform support by combining code optimized for x64 processor families from both Intel and AMD into a single executable file. This gives you the assurance that your applications will run correctly and with optimal performance regardless of the type of x64 processor on which they are deployed.

PVF's state-of-the-art Fortran compiler technologies include SSE vectorization, auto-parallelization, interprocedural analysis and optimization, memory hierarchy optimizations, function inlining (including library functions), profile-feedback optimization, CPU-specific optimizations and more. PVF is the ideal solution for migrating existing compute-intensive Windows applications from SMP servers and workstations to Windows CCS clusters.

The PGDBG OpenMP/MPI Debugger for Windows CCS

Debugging a cluster MPI application can be extremely challenging. Starting with PGI release 7.1, the PGDBG debugger provides a comprehensive set of graphical user interface (GUI) elements to assist you in this process. PGDBG 7.1 lets you separately debug and control OpenMP threads and MPI processes on your Windows CCS cluster. You can apply Step, Next, Break, Halt, Wait or Continue to OpenMP threads or MPI processes individually, as a group, or in user-defined process/thread subsets. PGDBG 7.1 can even display the state of MPI message queues, enabling you to quickly isolate and resolve message-passing deadlock bugs.
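To make the deadlock case concrete, the following is a minimal sketch in C of the reasoning a message-queue display supports: if each rank in a chain is blocked in a receive waiting on the next rank, and the chain loops back on itself, no rank in it can make progress. This is not PGDBG's implementation. The waits_on array and the has_wait_cycle function are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: waits_on[r] is the rank that rank r is blocked
 * receiving from, or -1 if rank r is not blocked.
 * Returns true if some chain of blocked ranks loops back on itself. */
bool has_wait_cycle(const int *waits_on, size_t nranks)
{
    assert(waits_on != NULL || nranks == 0);

    for (size_t start = 0; start < nranks; ++start) {
        size_t r = start;
        size_t hops = 0;

        /* Follow the chain of blocked ranks. Taking nranks hops visits
         * nranks + 1 ranks, so at least one rank repeats: a cycle. */
        while (hops < nranks) {
            int next = waits_on[r];
            if (next < 0 || (size_t)next >= nranks) {
                break; /* rank r is not blocked on a valid rank */
            }
            r = (size_t)next;
            ++hops;
        }
        if (hops == nranks) {
            return true;
        }
    }
    return false;
}
```

For example, two ranks that each post a blocking receive from the other before sending would be modeled as waits_on = {1, 0}, which this function reports as a cycle.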

Using a single integrated multi-process debugging window, PGDBG 7.1 provides precise control and feedback on the state of every MPI process and OpenMP thread simultaneously. It includes fully integrated capabilities for debugging hybrid parallel programs that use MPI message-passing between nodes and OpenMP shared-memory parallelism within a multi-core or SMP cluster node.

The main PGDBG 7.1 window displays Fortran, C or C++ program source code, optionally interleaved with the corresponding x64 assembly code. Sub-windows enable watch points, register state dumps, and execution of a sequence of user-defined commands at every break point. The main window includes one-touch buttons for the most common debugging commands. A simple and intuitive process/thread grid makes it easy to change the context of the source window and all sub-windows from one process to another with a single mouse click, greatly simplifying control over individual or collective OpenMP threads and MPI processes.

Application input and output can be displayed in the PGDBG 7.1 main window, or in a separate I/O window that can show more lines of I/O at once.

PGDBG 7.1 is interoperable with the Microsoft Visual C++ compiler, and together with PGI Visual Fortran gives you the power to port and debug your OpenMP and MPI applications on Windows CCS clusters using an easy and intuitive graphical user interface.

The PGPROF OpenMP/MPI Profiler for Windows CCS

PGPROF® 7.1 is an interactive, powerful and easy-to-use postmortem statistical analyzer for MPI parallel and OpenMP thread-parallel programs running on Windows CCS clusters. You can use PGPROF to analyze programs on multi-core SMP servers, distributed-memory clusters and hybrid clusters where each node contains multi-core x64 processors. PGPROF supports profiling at the function, source code line, and assembly instruction level for PGI-compiled Fortran, C and C++ programs.

PGPROF provides the information you need to determine which functions and lines in an application are consuming the most execution time. Combined with the feedback features of the PGI compilers, PGPROF will enable you to maximize vectorization and performance on a single x64 processor core. PGPROF exposes performance bottlenecks in a cluster application by presenting the number of calls, aggregate message size and execution time of individual MPI function calls on a line-by-line basis.
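The three per-line quantities named above (number of calls, aggregate message size, execution time) can be pictured as one accumulating record per source line. The sketch below, in C, is only an illustration of that bookkeeping. It does not describe PGPROF's internal data format, and the site_stats type and the record_call and hottest_site functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-line record for MPI calls. */
typedef struct {
    int    line;    /* source line of the MPI call */
    long   calls;   /* number of calls observed at this line */
    long   bytes;   /* aggregate message size in bytes */
    double seconds; /* accumulated time spent in the call */
} site_stats;

/* Add one observed call to the record for `line`, creating the record
 * if it does not exist yet. Returns NULL if the table is full. */
site_stats *record_call(site_stats *table, size_t capacity, size_t *count,
                        int line, long bytes, double seconds)
{
    assert(table != NULL && count != NULL && *count <= capacity);

    for (size_t i = 0; i < *count; ++i) {
        if (table[i].line == line) {
            table[i].calls += 1;
            table[i].bytes += bytes;
            table[i].seconds += seconds;
            return &table[i];
        }
    }
    if (*count == capacity) {
        return NULL;
    }
    site_stats *s = &table[(*count)++];
    s->line = line;
    s->calls = 1;
    s->bytes = bytes;
    s->seconds = seconds;
    return s;
}

/* Return the record with the largest accumulated time, or NULL if the
 * table is empty. */
const site_stats *hottest_site(const site_stats *table, size_t count)
{
    const site_stats *best = NULL;
    for (size_t i = 0; i < count; ++i) {
        if (best == NULL || table[i].seconds > best->seconds) {
            best = &table[i];
        }
    }
    return best;
}
```

Sorting or scanning such records by accumulated time is what points to the communication call that dominates a run, which is the kind of bottleneck the line-by-line view is meant to surface.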

Using PGPROF, you can merge trace files from multiple runs on different numbers of nodes to perform scalability analysis on your MPI or OpenMP application at the application, function or line level. Scalability analysis lets you quickly see which parts of your application are barriers to scalable performance, and where your parallel tuning efforts should be focused. PGPROF presents this information as bar charts, percentages, counts or seconds, and displays profiles using graphical histograms.
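The arithmetic behind a scalability comparison is simple enough to sketch. The C functions below use the standard definitions of speedup and parallel efficiency for two timed runs of the same code region. They are an illustration under that assumption, not a description of how PGPROF computes its scalability figures, and the function names are hypothetical.

```c
#include <assert.h>

/* Speedup of a run relative to a baseline run of the same region:
 * baseline time divided by the new time. */
double speedup(double t_base, double t_n)
{
    assert(t_base > 0.0 && t_n > 0.0);
    return t_base / t_n;
}

/* Parallel efficiency: the achieved speedup divided by the ideal
 * speedup, where the ideal is the ratio of node counts. A value of
 * 1.0 means perfect scaling between the two runs. */
double parallel_efficiency(double t_base, int base_nodes,
                           double t_n, int n_nodes)
{
    assert(base_nodes > 0 && n_nodes > 0);
    return speedup(t_base, t_n) * (double)base_nodes / (double)n_nodes;
}
```

For instance, a function that takes 100 seconds on 1 node and 50 seconds on 4 nodes has a speedup of 2.0 and an efficiency of 0.5. Functions or lines whose efficiency falls well below 1.0 as nodes are added are the candidates for parallel tuning.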

Putting It All Together

While the performance of individual x64 processor cores is still improving, the premium on power efficiency has led processor vendors to push aggressively on multi-core technology rather than increased clock speeds. Significant application performance gains in the next few years will depend directly on your ability to exploit multi-core and cluster platforms. The PGI 7.1 compilers and tools give you the ability to migrate incrementally from serial to auto-parallel or OpenMP parallel algorithms for multi-core processors. When you are ready to take the next step to cluster-enabled applications using MPI, the PGDBG debugger and PGPROF profiler provide simple and intuitive interfaces that make porting applications to MPI, and tuning them, more tractable.