Technical News from The Portland Group
Accelerating Visual C++ Applications on GPUs
PGI is exploring how to enable Microsoft Visual C++ developers to easily target NVIDIA GPU accelerators. NVIDIA's recently announced Parallel Nsight enables developers to write CUDA C, OpenCL or DirectCompute code for NVIDIA GPUs from within Visual Studio. All of these programming models are explicit models that require re-factoring to split out explicit GPU kernel functions, manage data movement between the host and GPU, configure parallel kernel launches, and manage the GPU memory hierarchy. Is there a way we can provide Visual C++ programmers a higher-level, more incremental and fully portable programming model for accessing the power of GPU accelerators?
The PGI Accelerator Programming Model
The PGI C compiler for Windows supports programming for GPUs using the PGI Accelerator programming model. This model is a directive-based method for guiding the compiler to automatically translate loop nests into code (called kernels) for GPUs. The directives are enabled via a compiler option; without the option, or when compiling the code with another compiler, the directives are treated as comments and ignored. This feature of a directives-based model allows source code to remain compiler-neutral and portable. In C, PGI Accelerator directives are implemented using the pragma mechanism and should look familiar to anyone who has used OpenMP pragmas for multi-core programming.
Creating a PGI Accelerator C Compiler Plug-in for VC++
The PGI C compiler for Windows is not currently available for use within Visual Studio. However, PGI Visual Fortran incorporates the PGI Fortran compiler for Windows, a PGI-custom project system, debug engine and Fortran editing facilities into Visual Studio to create a state-of-the-art Windows Fortran programming environment. PVF projects can be mixed with VC++ projects to create mixed-language solutions, and the PGI debug engine can be used to debug both Fortran and VC++ code.
These capabilities and features give us most of the components we needed to create a Beta-level PGI Accelerator C compiler plug-in for VC++. To create the plug-in, we re-used many elements of PGI Visual Fortran including the PGI debug engine. We incorporated the PGI C compiler, and introduced an option to compile specific files with it in place of Microsoft C++, relying on the VC++ language facility for editing of PGI-compiled C code.
Using the PGI Accelerator C Plug-in within Visual Studio
The first step to programming GPUs using the PGI Accelerator C plug-in for Visual Studio is to find a section of code in an existing VC++ project that is a candidate for offloading onto the GPU. As an example, we’ll use an implementation of the Mersenne Twister pseudorandom number generator originally written by NVIDIA and ported to run on NVIDIA GPUs using PGI Accelerator pragmas by Federico Dal Castello of STMicroelectronics.
The screen shot below shows the serial and OpenMP multi-core versions of the main loop, both of which make calls to the sgenrand_mt() and genrand_mt() functions to do most of the work in the random number generation:
Running the serial and OpenMP versions of the code we see the following:
The performance is measured as "throughput" in megabytes per second, so a larger number is better. Performance scales nearly linearly to two cores of our dual-core Windows 7 system, but does not seem to benefit from hyperthreading.
The system we are working on is an Intel Core i7 quad-core laptop with an NVIDIA Quadro GPU. Using a PGI Workstation command window and the pgaccelinfo utility, we can see it has the following characteristics:
PGI$ pgaccelinfo … Device Name: Quadro FX 880M Device Revision Number: 1.2 Global Memory Size: 1034485760 Number of Multiprocessors: 6 Number of Cores: 48 …
The screen shot below of the MersenneTwister solution and the source file mersenneTwisterCPU.c shows a version of the random number generator main loop that has been modified to run on an NVIDIA GPU using PGI Accelerator pragmas. The sgenrand_mt() and genrand_mt() functions have been inlined, and considerable modifications have been made as described in the Dal Castello article. Note the insertion of two PGI Accelerator directives, region and for, with various clauses:
The #pragma acc region directive marks a structured block that the PGI C compiler will analyze and try to offload to an NVIDIA GPU. The #pragma acc for directive specifies that the iterations of the following for loop are independent, or fully parallelizable regardless of any potential dependencies the compiler might detect. Recall that the VC++ compiler ignores the #pragma acc. To activate the #pragma acc region and for directives, the file containing them must be compiled with the PGI compiler. To accomplish this, select the PGI Accelerator C Compiler as the Tool associated with the mersenneTwisterCPU.c file:
Build the VC++ project as you normally would. When the selected file is compiled, the PGI compiler (pgcc.exe) will be invoked instead of Microsoft’s C/C++ compiler (cl.exe).
Note the messages indicating that data is being moved between the host and the Accelerator (the NVIDIA GPU), and that an Accelerator kernel is being generated for the loop at line 200. These are PGI Common Compiler Feedback Format (CCFF) messages, which can be enabled to highlight optimizations performed by the PGI compilers. CCFF messages are also issued when optimizations are inhibited, with an explanation as to why. For example, if the compiler is unable to generate GPU code for a loop subject to a PGI Accelerator pragma, it will issue messages explaining why. By monitoring and adapting to this feedback, the programmer can incrementally optimize source code to enable GPU code generation and optimization.
The compiler options used in the PGI Accelerator C compiler command are controlled by the Property Pages associated with the selected file. For example, the Target Accelerators property page for mersenneTwisterCPU.c shows that PGI Accelerator directives are enabled for NVIDIA GPUs, but most of the other options are left to their defaults:
Additional Details About MersenneTwister
In this example all files except mersenneTwisterCPU.c, which contains the main compute kernels, were compiled with the VC++ compiler. PGI runtime libraries required at link time were added to the project’s Linker | Input | Additional Dependencies property. It is important to note that the PGI compiler can handle only C99 language files, not C++. Files in your VC++ project or solution that include C++ and are candidates for GPU acceleration must be refactored so the loops to be offloaded to the GPU are in an extern C program unit in a file that contains only C language functions. However, the resulting code will still be 100% portable to other compilers, most notably to VC++ itself.
Running on an NVIDIA GPU
Once your application is built with at least one enabled PGI Accelerator directive, running on the GPU is straightforward. Start Without Debugging (Ctrl+F5) launches the executable as usual:
The kernels designated for the GPU device are offloaded automatically during execution. You can see in the case above that the throughput for the GPU-enabled version of the Mersenne Twister random number generator is about two times faster than the performance on four host cores using OpenMP.
When you are inserting PGI Accelerator directives into source in a VC++ project and changing the targeted compiler to activate the PGI Accelerator pragmas, you will very likely encounter porting issues related to the change in language (C++/C), compiler (Microsoft/PGI), target (CPU/GPU), or all three.
Language and Compiler Compatibility
PGCC is a native OpenMP and auto-parallelizing ANSI C99 compiler. It is not a C++ compiler and consequently will not process C++ source. Code that is heavily reliant on C++ will likely be more difficult to port to PGCC than code that is already C99 or mostly so. In addition to the differences between the C and C++ languages, Microsoft’s C/C++ compiler supports a number of Microsoft-specific language extensions. Many of these extensions are not currently supported by the PGI C compiler.
The PGI compilers are tested and shipped with a subset of the Windows SDK headers and libraries. This subset may need to be expanded when working with code being ported from VC++. Unexpected paths through include files could result as new Microsoft header files are included or preprocessor definitions made.
Algorithmic Compatibility and C Programming Limitations
Although the PGI Accelerator directives have the potential to simplify the introduction to GPU programming, the ease with which code is transitioned from CPU to GPU depends heavily on the structure of the code itself. Some algorithms are easier than others to port, and some won’t (or shouldn’t) be ported at all. As you examine your code to find suitable regions for GPU offloading, keep the following guidelines and limitations in mind:
1) Loop nests to be offloaded to the GPU accelerator must be rectangular. In particular triangular loops or loops where the loop bound(s) of one loop are computed within another loop in the nest are not supported. The following is an example of a triangular loop:
for (j=0; j<n; j++) for (i=0; i<j; i++) <some code>
2) Pointers used to access arrays in loops to be offloaded to the GPU accelerator must be declared with the C99 restrict attribute. Alternatively, safeptr pragmas can be used or the whole file containing the loop can be compiled with the Pointer Safety property set to “All pointers are safe,” but these actions can have unintended side effects.
3) At least some of the loops to be offloaded must be fully data parallel with no synchronization or dependences across iterations; these loops enable distribution of work across the multi-processors in an NVIDIA GPU. One or more loops in the nest can be vectorizable loops that require some synchronization—for example reductions are okay in many cases; these loops can be vectorized across multiple thread processors within a multiprocessor in an NVIDIA GPU. One or more loops in the nest can be sequential, but these loops will be executed serially within a thread processor—e.g. as the inner loop(s).
4) Computed array indices should be avoided. Such expressions within the loop nest result in the compiler detecting dependences that prevent parallelization and vectorization of loops. The independent clause can be used in cases where the programmer asserts that all elements of the computed index are independent.
5) Function calls are not currently allowed within loops to be offloaded to a GPU accelerator. In some cases, the compiler may be able to inline functions if directed to do so. It is recommended, however, that you avoid calls within accelerator regions by manually inlining wherever possible. Improved automatic inlining will be supported in future versions of the PGI Accelerator compilers.
6) Loops that compute on structs can be offloaded, but those that operate on nested structs cannot. This is a limitation of the current release of the compiler, not a limitation of the programming model.
7) Pointer arithmetic is not allowed within loops to be offloaded to the GPU accelerator.
The PGI Accelerator C compiler plug-in currently provides debugging support for code running on the host (CPU) but not on the device (GPU). When starting debugging, you can choose to use either the PGI debug engine or the VC++ debugger to debug the host-side code.
With respect to host-side debugging, the PGI debug engine can debug objects compiled with either Microsoft or PGI compilers. The Visual Studio debugger, however, cannot debug code in PGI-compiled objects. If you want to debug host-side code compiled with the PGI Accelerator C compiler, you will need to invoke PGI Debugging.
The PGI Accelerator programming model provides a path for the incremental and efficient high-level exploration of GPU programming. See Michael Wolfe’s series of articles on PGI Accelerator programming for a complete tutorial introduction. As is usually the case when migrating applications to new hardware platforms, optimizing code for maximum performance on GPU targets usually requires structural or algorithmic changes beyond the addition of PGI Accelerator directives. These changes may be benign with respect to the performance impact on non-GPU targets (specifically a multi-core x64 host), or may improve performance generally, or in some cases will decrease performance when compiled and run on the host. You need to be aware of these possibilities, and structure your code accordingly to maximize portability and performance portability. While they don’t necessarily make GPU programming easy per se, PGI Accelerator directives are often the fastest and easiest way to test whether your code might benefit from GPU acceleration.
Until now, a directive-based GPU programming model has not been accessible to VC++ developers. The PGI Accelerator C compiler which can be used to target specific files for GPU compilation and the PGI debug engine for host-side debugging are now packaged together in the PGI Accelerator C plug-in beta for Visual Studio. A private technical preview program will soon be under way. If you think you are a good candidate to test drive the technical preview version, contact firstname.lastname@example.org. We will continue to develop and refine the plug-in and its capabilities, and information you provide about its use will help shape the form this work takes.