Technical News from The Portland Group

The PGI Accelerator Compilers with OpenACC

Note: This article was revised in August 2012 to bring it up-to-date with the production software release.

The OpenACC Application Program Interface (OpenACC API) is a specification developed by a group of vendors to allow programmers to write applications that offload code from a host CPU to an attached accelerator device. The API defines a set of directives and API routines similar to the well-established OpenMP API. The OpenACC API is also very similar to the PGI Accelerator programming model, from its goals and scope to its design and features. In this article, I will describe the significant differences between the current PGI Accelerator Programming Model v1.3 specification and the OpenACC API, specifically for those programmers who want to port programs from the PGI Accelerator model to OpenACC. I will also describe how procedures written using the PGI Accelerator directives can interface with procedures written using OpenACC, and the future directions of the PGI Accelerator model and compilers.

Differences Between PGI Accelerator Directives and OpenACC

OpenACC has four important features that were not available in the PGI Accelerator model: the parallel construct, the present data clause, support for asynchronous data movement as well as asynchronous computation, and three levels of parallelism (gang, worker and vector). Let's discuss each of these features in more detail, then go on to other important differences.

The Parallel Construct

The PGI Accelerator model has support for data regions (using the data region construct) and compute regions (using the region construct). When designing the OpenACC API, we elected to avoid using the word region, because we wanted to use terminology that would more easily align with OpenMP, looking forward to a time when the two APIs might merge. For that reason, the OpenACC API uses different construct names. The OpenACC data construct serves exactly the same purpose as the PGI Accelerator data region construct. The OpenACC kernels construct serves the same purpose as the PGI Accelerator region construct.

However, the OpenACC API also defines a second type of compute construct, the parallel construct. The parallel construct expresses explicit, user-specified parallelism, much like the OpenMP parallel construct. The OpenACC parallel construct immediately launches a specified number of gangs, where each gang contains a specified number of workers. One worker in each gang starts executing the code in the parallel construct redundantly, just like threads in an OpenMP parallel construct, until they reach a work-sharing loop construct. At that point, the iterations of the work-sharing loop are spread across the gangs, as specified by the construct. It can be confusing to have two constructs for the same job, so it's important to understand the differences and when you might want to use one or the other. The PGI compilers support the parallel construct as of release 12.6. We will have a follow-up article describing the differences and use cases for the parallel and kernels constructs.
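
A minimal sketch of the parallel construct, using illustrative gang and worker counts and a hypothetical array a:

    #pragma acc parallel num_gangs(32) num_workers(8) copy(a[0:n])
    {
        /* one worker per gang executes this code redundantly... */
        #pragma acc loop gang
        for( int i = 0; i < n; ++i )
            a[i] *= 2.0f;   /* ...until the iterations are spread across gangs */
    }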

The Present Data Clause

Both the PGI Accelerator model and the OpenACC API have data clauses that can be placed on the data region or data constructs, or on compute constructs. Both have support for copyin, copyout and copy, which allocate the appropriate memory on the accelerator device and copy data from the host to the accelerator, or back, or both. Each has a clause to allocate memory without data copies, called local in the PGI Accelerator model, and create in the OpenACC API.
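
For example, a single data construct might combine all three behaviors; the arrays here are hypothetical:

    /* copy a to the device at entry, copy b back to the host at exit, and
       allocate tmp on the device with no copies (PGI spelling: local(tmp[0:n])) */
    #pragma acc data copyin(a[0:n]) copyout(b[0:n]) create(tmp[0:n])
    {
        /* compute regions using a, b and tmp */
    }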

Data movement between the host and the accelerator is costly, and is often the bottleneck to performance. Tuning a program often requires minimizing that data movement. This creates the need to leave data on the accelerator as long as possible, and in particular to use it across procedure boundaries. The PGI Accelerator model added the reflected attribute for procedure arguments, and the mirror attribute for global data. The OpenACC API uses the present clause, which can be used for both procedure arguments and global data.

The typical use case is a main routine that allocates all the data, then calls one routine to initialize the data and other routines to process it. Using the OpenACC directives, the main routine includes a data construct to allocate the data on the device. The OpenACC runtime keeps track of the association between the host data and the accelerator copy. All the other procedures use a present clause on that data, which tells the compiler that the data is already present on the accelerator. The OpenACC runtime uses the host data address to access a lookup table, and retrieves the address of the accelerator data copy, which is then used in the compute regions on the accelerator.

    /* caller: allocate a on the device, copy in at entry, copy back at exit */
    #pragma acc data copy( a[0:n] )
    {
        init( a, n );
        process( a, n );
    }
    ...
    /* callee: present(a[0:n]) asserts the caller already placed a on the
       device, so no allocation or copy occurs here */
    void init( float* a, int n ){
        #pragma acc kernels loop present(a[0:n])
        for( int i = 0; i < n; ++i ) a[i] = sinf((float)i);
    }

The advantage of the present clause is that the interface between caller and callee is not changed. It is a runtime error if the data is not present on the accelerator, or if the data is only partially present (only a subarray, say).

To support libraries that may be called whether the data is or is not already present on the accelerator, the OpenACC API also defines four combined data clauses, combining present with copyin, copyout, copy or create. These may be spelled in long form (present_or_copyin, ...) or short form (pcopyin, ...). The behavior when the data is present on the accelerator is that of the present clause; the data on the accelerator is used without any copying, regardless of the combined data clause used. If the data is not already present, then the data is allocated and copied as specified. As with the present clause, it is a runtime error if the data is partially present.
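
For example, a library routine might use the short form pcopy (present_or_copy); the routine itself is hypothetical:

    /* works whether or not the caller placed x on the device: if x is
       already present, the device copy is used with no data movement;
       otherwise x is allocated, copied in, and copied back out */
    void scale( float* x, int n, float s ){
        #pragma acc kernels loop pcopy(x[0:n])
        for( int i = 0; i < n; ++i ) x[i] *= s;
    }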

Asynchronous Execution

OpenACC allows for asynchronous data movement and accelerator computation. The update construct and the compute constructs (parallel and kernels) each have an optional async clause, which itself can have an integer expression argument. The data movement and computation become asynchronous activities that overlap with continuing execution by the host CPU. Two asynchronous activities that were created with an async clause with no argument, or with arguments having the same value, will occur in the order that those activities were created by the host. Two asynchronous activities that were created with different async argument values may occur in either order, or may overlap.
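
A sketch of this queueing behavior, assuming a has been initialized on the host and do_host_work is some hypothetical host routine:

    #pragma acc data create(a[0:n])
    {
        /* the two queue-1 activities execute in the order created */
        #pragma acc update device(a[0:n]) async(1)
        #pragma acc kernels loop async(1)
        for( int i = 0; i < n; ++i ) a[i] = 2.0f * a[i];
        do_host_work();     /* overlaps the transfer and the kernel */
        #pragma acc wait(1) /* host blocks until queue 1 drains */
    }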

PGI has extended the OpenACC asynchronous behavior by allowing the async clause on the data construct (and data region directive) as well. By its nature, data allocation is synchronous, but any data movement will be done asynchronously.

Three Levels of Parallelism

The PGI Accelerator model has a two-level parallelism model on the accelerator, parallel and vector. The model target architecture is a collection of processing elements (PEs), each of which can execute vector operations. This maps pretty well onto most current accelerator hardware. For an NVIDIA GPU, say, the PEs (the parallel dimension) map roughly onto the streaming multiprocessors, or the thread blocks, and the vector dimension maps roughly onto the threads within a thread block.

The OpenACC execution model has three levels: gang, worker and vector. The model target architecture is a collection of processing elements or PEs, where each PE is multithreaded, and each thread on the PE can execute vector instructions. For an NVIDIA GPU, the PEs might map to the streaming multiprocessors, multithreading might map to warps, and the vector dimension might map to the threads within a warp. The gang dimension would map across the PEs, the worker across the multithreading dimension within a PE, and the vector dimension to the vector instructions. There is no support for any synchronization between gangs, since current accelerators typically do not support synchronization across PEs. A program should try to map parallelism that shares data to workers within the same gang, since those workers will be executed by the same PE, and will share resources (such as data caches) that would make access to the shared data more efficient.
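
One plausible mapping of a loop nest onto the three levels; the arrays and the schedule are illustrative, and the compiler may choose differently:

    #pragma acc kernels loop gang          /* rows spread across the PEs */
    for( int j = 0; j < m; ++j ){
        #pragma acc loop worker vector     /* threads and vector lanes within a PE */
        for( int i = 0; i < n; ++i )
            b[j][i] = a[j][i] + 1.0f;
    }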

C Subarrays

There are other differences as well. The PGI Accelerator model for Fortran used standard Fortran subarray notation in the data clauses, using lower bound and upper bound, such as x(1:n). The C language has no standard notation for subarrays, so the PGI directives used notation that matched the Fortran subarrays, with lower and upper bounds. OpenACC adopted Intel's Array Notation for C, which uses a starting index and a length. For instance, the PGI notation for an n-element vector starting at element zero was x[0:n-1]; the OpenACC notation is x[0:n]. The PGI notation for 80 elements starting at element 21 was x[21:100], whereas the OpenACC notation is x[21:80].
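
Side by side, for a hypothetical array x, with the old form shown as a comment:

    /* old PGI form, lowerbound:upperbound (elements 21 through 100):
           #pragma acc data copyin( x[21:100] )
       OpenACC form, start:length (80 elements starting at element 21): */
    #pragma acc data copyin( x[21:80] )
    {
        /* ... */
    }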

Contiguous Subarrays

The PGI Accelerator model allows programs to specify arbitrary rectangular subarrays, such as the interior of a matrix. For instance, with a Fortran array declared as a(100,100), a data clause might move only the 98x98 interior using copyin(a(2:99,2:99)). In the OpenACC specification, only contiguous subarrays may be allocated on the device. For Fortran, this means any leading dimensions would have to specify the full dimension; for C, the same applies to the trailing dimensions. The example above would have to be specified as copyin(a(1:100,2:99)).
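
A C analog of this Fortran example, sketched for a hypothetical 100x100 array; since C arrays are row-major, it is the trailing dimension that must be specified in full, and a multi-dimensional C subarray like this assumes the array bounds are known to the compiler:

    float a[100][100];
    /* rows 2 through 99, all 100 columns: a contiguous block of memory */
    #pragma acc data copyin( a[2:98][0:100] )
    {
        /* ... */
    }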

The PGI Accelerator compilers extend this to allow your program to specify, for example, the interior of a matrix. The behavior has changed from earlier releases, however. With the old PGI Accelerator model, if you specified copyin(a(2:99,2:99)), only the 98x98 interior would be allocated and copied. The new PGI Accelerator compilers will allocate the bounding 100x98 region, but will move only the 98x98 interior of the matrix.

Reductions

The PGI Accelerator compilers automatically recognized a selection of reductions in loops. The OpenACC API instead uses a reduction clause, much like OpenMP. The OpenACC directives also allow more flexible reductions, such as partial reductions across some dimensions of a computation. The PGI Accelerator compilers will continue to auto-detect reductions, and also implement the reduction clause.
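
A sketch of the explicit form, with a hypothetical sum over an array:

    float sum = 0.0f;
    /* the reduction clause makes the sum explicit; the PGI compilers
       would also detect this pattern automatically */
    #pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
    for( int i = 0; i < n; ++i ) sum += a[i];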

Reflected and Mirror

The PGI Accelerator model has two important features not available in the OpenACC specification. The PGI reflected data attribute for Fortran allows a program to specify in the interface that the caller must provide a visible device copy of a specified dummy argument. While the functionality is much the same as that of the OpenACC present clause, the advantage is a compile-time check rather than a runtime error. The PGI mirror data attribute allows dynamically allocated data to be allocated both on the device and the host. The OpenACC API allows for device-resident data, but has no analog to the mirror attribute.

Porting from PGI Accelerator to OpenACC

If you have existing PGI Accelerator model programs that you want to port to the OpenACC specification, most of the conversion requires very little effort. The PGI compilers will continue to accept the older PGI Accelerator directives, but you may want to modify your program to conform to the new specification. You can do this by replacing some of your PGI Accelerator directives with the corresponding OpenACC directives, as in the following table:

PGI Accelerator       OpenACC
---------------       -------
Constructs:
  region              kernels
  region for          kernels loop
  region do           kernels loop
  data region         data
Clauses:
  local()             create()
  parallel()          gang()
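
As an illustration, here is a hypothetical loop in both forms, with the old directive shown as a comment; note the C subarray change described below:

    /* old PGI form:
           #pragma acc region for copyin( a[0:n-1] ) copyout( b[0:n-1] )
       OpenACC form: */
    #pragma acc kernels loop copyin( a[0:n] ) copyout( b[0:n] )
    for( int i = 0; i < n; ++i ) b[i] = a[i] * a[i];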

Some changes will be required, because of incompatibilities between the old PGI Accelerator model and OpenACC; the current PGI release implements the OpenACC specification, so the old behavior is no longer supported.

  • Change C subarrays in data clauses from lowerbound:upperbound to start:length.

There are other changes you might want to make to conform to the OpenACC specification, though these are not required when using the PGI compilers:

  • Change subarrays in data clauses to specify contiguous regions.
  • Add reduction clauses for reductions in your loops or regions.
  • Add the declare keyword for any declarative data directives.
  • Remove any cache clauses from the loop directives, and add cache directives at the top of the loop (see the sketch after this list).
  • Replace any reflected clauses with present clauses.
  • Remove any mirror clauses; there is no equivalent functionality in the OpenACC API. You may be able to use a global data construct.
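
For the cache change in particular, a minimal sketch of the OpenACC form, using a hypothetical three-point stencil:

    #pragma acc kernels loop
    for( int i = 1; i < n-1; ++i ){
        /* the cache directive now sits at the top of the loop body */
        #pragma acc cache( a[i-1:3] )
        b[i] = ( a[i-1] + a[i] + a[i+1] ) / 3.0f;
    }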

Interfacing Between PGI Accelerator and OpenACC

The PGI Accelerator compilers accept both PGI Accelerator model directives and OpenACC directives, so it's worth describing how your existing PGI Accelerator source files will interface with any new OpenACC source files.

The PGI compiler -ta flag enables both the OpenACC and PGI Accelerator model directives, and with suboptions can specify details about the target accelerator. The PGI compiler -acc flag also enables both OpenACC and PGI Accelerator model directives. Either or both flags may be used at compile time.

Old object files already generated using the PGI Accelerator model directives will continue to be supported. Moreover, they can be linked with any object files built with the current compiler release. However, data moved to the GPU by old object files with the older PGI Accelerator data clauses will not be available to the OpenACC present clause, except as noted below.

Source files using PGI Accelerator model directives can be recompiled with the 12.6 (or later) release. When recompiled, the data clauses will be interpreted as OpenACC data clauses. This means that any subarrays allocated on the accelerator will correspond to contiguous subarrays on the host. For instance, with a Fortran array declared as a(100,100), a data clause such as copyin(a(2:99,2:99)) will actually allocate the contiguous 100x98 subarray on the accelerator, but will only move the 98x98 interior of the array. Also, C subarrays will be interpreted using the OpenACC start:length convention. Finally, when recompiled using the -acc flag, the arrays in data clauses will be available to the OpenACC present clauses.

The Future of the PGI Accelerator Model

We are continuing to develop the PGI Accelerator compilers. The next version of the PGI Accelerator model will be a superset of OpenACC, and will be backward compatible with the v1.3 PGI Accelerator model, except for incompatibilities with the OpenACC interpretation noted above.

In addition, the PGI Accelerator compilers provide functionality that enhances your productivity in several ways. The PGI Unified Binary™ feature allows you to generate a single binary that will execute your program using OpenACC or PGI Accelerator directives on an accelerator, or on a system with no accelerator, or that lets the program or the end user select, using API calls or environment variables. The PGI Accelerator compilers also support multiple accelerators, such as the NVIDIA compute capability 1.x (Tesla) and 2.x (Fermi) devices, in a single binary.

The PGI Accelerator Fortran compilers interface seamlessly with PGI CUDA Fortran. CUDA Fortran data with the device attribute may be freely used in PGI Accelerator or OpenACC compute regions. CUDA Fortran attributes(device) subprograms may be called within PGI Accelerator or OpenACC compute regions in the same Fortran module.

Summary

The OpenACC API is a specification for directives and API routines that support productive use of accelerator devices, such as GPUs. The OpenACC design is very similar to the PGI Accelerator model design, and PGI's implementation allows the use of both. PGI will continue to push the envelope of the PGI Accelerator compilers, to make more functionality and performance available to our customers. The current PGI release implements a subset of the OpenACC specification. As more functionality becomes available over the next few months, we will give more details and advice about using the OpenACC directives in subsequent articles.