Technical News from The Portland Group

PGI C++ with OpenACC

In an article last year, PGI introduced the first version of its GNU-compatible C++ compilers. The new compilers were link-compatible with code generated by GCC versions 4.1 through 4.5. This year, PGI has extended PGC++ to include support for GCC version 4.7, and we're working as well to complete full C++11 support. But of more interest to many programmers is the added support for OpenACC. Together with GNU ABI compatibility, this allows C++ programmers to experiment with and to port the computationally intensive parts of their C++ applications to GPUs using the PGI OpenACC C++ compilers. This article discusses how OpenACC is integrated into the C++ compilers, how to use it, what to expect, and pitfalls to avoid.

Using OpenACC in C++

The OpenACC API is a specification for directives and runtime routines that enable a programmer to designate loops that should run on an accelerator device, such as a GPU, and the data that needs to be copied to the device memory. The directives in C++ appear as pragmas before a loop or structured block. A simple example is shown below. (The examples used in this article are available for download from the PGI website.)

    void saxpy( int n, float a, float x[], float y[] ){
        #pragma acc parallel loop pcopyin(x[0:n],a,n) pcopy(y[0:n])
        for( int i = 0; i < n; ++i )
            y[i] += a*x[i];
    }

The SAXPY operation is a classical single-precision A times X plus Y. We'll start with this example, though we'll see that data movement is likely to dominate for a computation this trivial. The target accelerator for this article is the NVIDIA Kepler GPU. GPUs are designed with their primary job, graphics, in mind; NVIDIA and AMD have added features to their GPUs that support general purpose computing, but the architectures are tuned to the types of computations that appear in graphics kernels. Graphics kernels are highly parallel and organized around regular data access patterns. In compute terms, this translates to parallel loops with regular, rectangular iteration spaces, and array accesses with mostly stride-1 references in the primary index, that is, the inner loop. Many HPC applications are already designed around nested parallel loops with stride-1 inner loops. These are the applications that are likely to run well on a GPU.

The abstract target machine is shown in the following figure. The key features of the accelerator are its separate physical memory and highly parallel design. The accelerator processing elements (PEs) are organized into SIMD units which execute synchronously, the same instruction at the same time. Different SIMD units run asynchronously, like commodity multicore processors. The device memory can range up to 6GB today. As with host memory, access time to the device memory is quite slow. CPUs have a cache hierarchy to reduce the impact of the memory latency, but GPUs are designed for stream processing, where caches are not effective. Instead, GPUs use multithreading, meaning that when a GPU thread issues a memory access, that GPU thread essentially goes to sleep until the memory responds. In the meantime, the GPU processor switches over very quickly, in hardware, to another GPU thread. In this way, the GPU exploits program parallelism to keep busy while the slow device memory responds. While the memory has long latency, it has very high bandwidth to keep up with the demands of data-intensive applications. Instead of suffering from stalls due to cache misses, the GPU keeps busy, as long as there is enough parallelism to overlap with memory operations.

OpenACC Abstract Machine Block Diagram

Let's study the simple loop above. It uses a pragma to tell the compiler to generate code for the accelerator; the pragma looks much like the classical OpenMP pragmas. The parallel loop pragma tells the compiler to run the iterations of the immediately following loop in parallel across the PEs of the accelerator. The pcopyin and pcopy clauses control data movement. The clause pcopyin(x[0:n]) tells the compiler to copy n elements of the x array starting at x[0] to the GPU device memory before starting the loop, but that the data need not be copied back to the host at the end. The clause pcopy(y[0:n]) tells the compiler to copy the y array to the GPU before the loop and back to the host at the end. The clause pcopyin is actually short for the more verbose name present_or_copyin; we'll see what this means shortly.

As mentioned, the parallel loop pragma explicitly tells the compiler to partition the iterations of the loop across the PEs. The compiler doesn't do any checking that it's legal to execute the loop iterations in parallel, just as with OpenMP. You are taking responsibility for determining the parallelism in your program.

We can compile this file with the ‑acc flag to enable the OpenACC directives using:

    pgc++ -acc saxpy.cpp -c

The default is to generate code for NVIDIA GPUs as well as the host CPU for all OpenACC regions. Because we use accelerators to accelerate our programs, that is, to make them faster, we recommend that you build the program with the ‑Minfo=accel command-line flag to see messages from the compiler. In this case, the compiler informational messages are:

_Z5saxpyifPfS_:
      1, Generating present_or_copy(y[0:n])
         Generating present_or_copyin(n)
         Generating present_or_copyin(a)
         Generating present_or_copyin(x[0:n])
         Accelerator kernel generated
          3, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
      1, Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary

The first four lines give information about the data copied according to the data clauses on the pragma. The two lines below that indicate the compiler generated an accelerator kernel for the region and give the parallelism schedule used; this translates roughly to a CUDA thread block size of 256 and a grid of n/256. Finally, the last four lines state that the compiler generated code for NVIDIA Tesla (compute capability 1.0), Fermi (2.0) and Kepler (3.0) GPUs, so the program will run on any NVIDIA CUDA-enabled GPU.

You can tune this by adding more command-line options, specifically the target accelerator flag -ta. The default behavior is -ta=nvidia,host, that is, to generate code for all NVIDIA GPUs as well as the host. If you are only going to run this on a system with an NVIDIA Kepler, you can specify -ta=nvidia:cc30 or -ta=nvidia:kepler.
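For example, a Kepler-only object file could be built with:

    pgc++ -acc -ta=nvidia:cc30 saxpy.cpp -c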

Now you can link this routine with the rest of your application. We have created an example in which the main program is compiled using gcc and this saxpy function is compiled using pgc++. The final executable will need to be linked with pgc++ -acc to link in all the OpenACC runtime routines; a sketch of these build steps appears below. Running the program is as simple as typing the executable name at the command line, as you would expect. If you've compiled the saxpy routine with the default options, or with -ta=nvidia:kepler,host, the program will test at runtime whether an appropriate CUDA-enabled GPU is available and run on that GPU. If none is found, it will run the OpenACC region on the host.
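Such a build might look like the following; the file names main.c and saxpy.cpp and the executable name saxpy are illustrative:

    gcc -c main.c
    pgc++ -acc -c saxpy.cpp
    pgc++ -acc main.o saxpy.o -o saxpy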

You can also select the device type to use at runtime by setting the environment variable ACC_DEVICE_TYPE to nvidia or host. If you have two or more NVIDIA GPUs, you can select which device to use by setting the environment variable ACC_DEVICE_NUM to the CUDA device number, starting at zero for the first device. OpenACC has other routines to select which device to use dynamically, but that's beyond the scope of this article.
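For example, to run on the second GPU in a system (assuming a bash-style shell and the executable name used above):

    export ACC_DEVICE_TYPE=nvidia   # or host, to run the OpenACC regions on the CPU
    export ACC_DEVICE_NUM=1         # numbering starts at zero, so this selects the second GPU
    ./saxpy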

Performance

The first thing you will likely notice when you run this program on your GPU is that it takes longer on the GPU than it does if just run on the host CPU. I ran this test on a system with a 3.1GHz Intel Sandy Bridge CPU and an NVIDIA GeForce GTX 690 Kepler GPU, and collected the elapsed real time around the call to the saxpy function. The times in microseconds for different vector lengths when running on the GPU and the host are given in the table:

length  GPU (µs)  CPU (µs)
10 475,523 1
100 482,403 1
1,000 478,848 1
10,000 472,319 10
100,000 473,508 92
1,000,000 480,601 990
10,000,000 690,198 10,071

Several things become apparent. First, the times on the GPU don't seem to vary much until you reach a vector length of about 10 million; times on the host don't seem to change much either until the vector length reaches about 10,000. For the GPU, this is at least in part because I was using a Linux system with no display attached. The NVIDIA driver puts the GPU in power-save mode when it's not in use as a display or for compute, so this program was paying the overhead to power the device up each time it ran. There is also some initialization time for the CUDA drivers and the OpenACC runtime. We can factor that out by calling the saxpy routine once outside the timer, to initialize the system; a sketch of that warm-up pattern follows the table below. Timing a second call, we get more reasonable times:

length  GPU (µs)  CPU (µs)
10 872 1
100 880 1
1,000 1,071 1
10,000 1,117 10
100,000 1,592 92
1,000,000 6,806 990
10,000,000 60,517 10,071
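Here is a minimal sketch of the warm-up pattern referenced above, using gettimeofday as in the data-region example later in this article; only the second call is timed:

    struct timeval t0, t1;                 // requires <sys/time.h>
    saxpy( n, a, x, y );                   // untimed warm-up call: powers up the GPU and
                                           // initializes the CUDA driver and OpenACC runtime
    gettimeofday( &t0, NULL );
    saxpy( n, a, x, y );                   // timed call
    gettimeofday( &t1, NULL );
    double usec = (t1.tv_sec - t0.tv_sec)*1.0e6 + (t1.tv_usec - t0.tv_usec);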

But, it's still not great. The host takes about 10 times longer for 10 times as much work, but the GPU doesn't show this behavior until it reaches very long vector lengths. As suggested earlier, data movement will dominate the computation for a simple kernel like this one. The GPU derives its performance advantage from exploiting a high degree of parallelism, but the GPU has a separate physical memory from the host. To perform a computation on the GPU means moving the data from the host memory to the device memory. The data is typically moved over an IO bus, which is a PCI Express bus in current systems. PCI Express is a very fast IO bus, but is many times slower than the host memory interface. In this example, the program copies the vectors x and y to the device, runs the kernel to do the computation, then brings the result vector y back to the host. Computational intensity is a measure of how much computation is done relative to data movement. In this case, we compute computational intensity as the ratio of computation on the GPU relative to the data moved between the host and GPU. There are two vectors of length n moved to the device memory and one moved back to the host for 3n words moved total. The computation has just a multiply and add in the inner loop, for 2n floating point operations. The computational intensity ratio is 2n/3n or 2/3. Since data movement is much slower than computation, a simple kernel like this will be dominated by data traffic. Because data movement involves starting a DMA engine to move the data, there's some startup cost as well, leading to the nonlinear times.

If we set the environment variable PGI_ACC_TIME to 1, the OpenACC runtime will collect statistics about the time spent moving data and executing the kernel. For a vector length of 1,000,000, my system spends about 2 milliseconds moving data and only 100 microseconds executing the kernel.
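For example (again assuming a bash-style shell and the same executable name):

    export PGI_ACC_TIME=1
    ./saxpy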

A more well-structured program would put the data into device memory and leave it there, avoiding moves between host and device. We can do that with a data construct around the call to saxpy, as follows:

    // Keep x and y resident in device memory for the whole structured block, so the
    // pcopyin and pcopy clauses inside saxpy find the data already present on the device.
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        gettimeofday( &t0, NULL );
        saxpy( n, a, x, y );
        gettimeofday( &t1, NULL );
    }

The x and y arrays are moved to the device in the caller. Inside saxpy, the present_or_copyin and present_or_copy (pcopyin and pcopy) clauses will notice that the data is already present in device memory and will not perform any data allocation or copying. Now the times for the saxpy kernel on the GPU are much better:

length  GPU (µs)  CPU (µs)
10 358 1
100 366 1
1,000 551 1
10,000 550 10
100,000 576 92
1,000,000 672 990
10,000,000 1,370 10,071

Now we can guess that the overhead of launching a kernel on this system must be about 300 microseconds. Note that the execution time for the GPU kernel doesn't increase by a factor of ten for ten times more work, even at the lengths we're using. But we start to see the advantages of a GPU for very large datasets, even for this simple kernel, if we can amortize the cost of initialization and data movement.

Pitfalls

However, the data structures in many C++ applications aren't amenable to GPU acceleration. GPUs are designed to operate efficiently on large, regular data; in programming language terms, that translates into vectors, matrices and other arrays. The GPU memory is designed around accesses to contiguous memory locations, what were once called superwords and are now very much like a whole cache line. Essentially, when any word is accessed, the memory system returns the whole cache line. If the program uses that whole cache line, the program runs at full speed. In the saxpy kernel, when the vector element x[i] is fetched, so are x[i+1] and x[i+2]. The whole cache line can be fetched and processed in parallel. If we were only operating on every eighth vector element, the effective memory bandwidth would be reduced by a factor of eight; the memory returns a whole cache line, but the program only uses every eighth word. Similarly, if x were a structure or class with eight members, and we were only operating on one of those members, the memory performance would again be reduced by a factor of eight.
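Below is a sketch of the two access patterns just described; the Particle structure and its eight members are purely illustrative:

    // Array of structures: fetching p[i].x also drags in seven unused members, so only
    // one word in eight of each cache-line-sized memory transaction is useful.
    struct Particle { float x, y, z, vx, vy, vz, m, q; };

    void scale_aos( int n, float a, Particle p[] ){
        #pragma acc parallel loop pcopy(p[0:n])
        for( int i = 0; i < n; ++i )
            p[i].x *= a;
    }

    // Separate arrays: consecutive iterations touch consecutive words of x, so each
    // memory transaction is fully used.
    void scale_soa( int n, float a, float x[] ){
        #pragma acc parallel loop pcopy(x[0:n])
        for( int i = 0; i < n; ++i )
            x[i] *= a;
    }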

Other common data structures include linked lists and trees and other structures that use links or pointers to navigate the structure. Such structures are difficult to port efficiently to a GPU, even when using a low level language such as OpenCL or CUDA C. GPUs aren't designed to handle such sequentially-accessed structures efficiently; a GPU gets all its power from parallelism.

Finally, many C++ language features are not supported at all in compute regions in the current OpenACC, such as IO and dynamic allocation (operator new, operator new[], and malloc). These operations have to run on the host, outside of an OpenACC compute region.

Inlining

The biggest problem that C++ programmers will have porting to a GPU is handling all the explicit and implicit function calls in their code. C++ operators, constructors, destructors, and so on are implemented and appear to the compiler as functions. The current OpenACC implementations only handle function calls using inlining. This means the function source must be available, in the source file or in a header file.

For the PGI C++ compilers, function inlining up to ten levels deep is enabled with the ‑O2 or ‑fast command line flags. Even with inlining, the build may fail if one of those inlined functions includes a system call or operation that the GPU can't handle.
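As a sketch, here the loop body is factored into a hypothetical helper saxpy_op defined in the same file, so the compiler can inline it into the accelerator kernel when the file is built with -O2 or -fast:

    // The helper's source is visible in this translation unit, so PGC++ can inline it
    // into the generated accelerator kernel.
    inline float saxpy_op( float a, float x, float y ) { return y + a*x; }

    void saxpy( int n, float a, float x[], float y[] ){
        #pragma acc parallel loop pcopyin(x[0:n]) pcopy(y[0:n])
        for( int i = 0; i < n; ++i )
            y[i] = saxpy_op( a, x[i], y[i] );
    }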

Summary

The new PGI C++ compilers allow programmers to start porting programs to a GPU using OpenACC directives, without having to recode those parts of their program in C. With Linux GNU compatibility in the PGI C++ compilers, only those parts of the application that are amenable to GPU acceleration need to be updated and recompiled.

However, GPUs are designed to work effectively in parallel on specific data structures and data access patterns. Many C++ applications are not designed around regular, rectangular arrays of data and nested parallel loops. The data structures and algorithms must be organized appropriately regardless of the programming language used, whether OpenACC, CUDA, or OpenCL. If the data and program structures are appropriate, OpenACC can be used successfully to port the application to the GPU.

Function calls cause problems when porting to a GPU. This is not specific to C++, but C++ programs use function calls much more than other languages. GPUs are only starting to include support for real procedure calls and separate compilation. OpenACC is evolving to include support for these as well, and as that evolution progresses C++ will become even more viable as a GPU computing language.