One of the key features of high-level language programming is modularity, including support for procedures and separate compilation. It's hard to imagine modern programming without functions and libraries. In fact, the term compiler originally referred to the software that compiled separately created external objects into a single binary, what we now call a linker. Yet, until recently, OpenACC programs could only support procedures through inlining, effectively preventing any use of libraries or procedure calls across multiple files.

With the latest releases, PGI now supports procedure calls, separate compilation, and linking for OpenACC programs targeting NVIDIA GPU accelerators. This article introduces this important feature and shows how to use the acc routine directive to enable it. I will also present hints on using the clauses of the routine directive, including the reasons why the clauses are necessary, and some caveats and current limitations.

A Simple Example

Let's start with an example, here using C; I'll start with one file containing a function:

    #include <math.h>
    #pragma acc routine seq
    float sqab(float a){
        return sqrtf(fabsf(a));
    }

The acc routine directive tells the compiler to generate accelerator code for the following routine. The seq clause tells the compiler that the code in the routine, and in any other routines called from within this routine, will run sequentially in one device thread. We'll have more details on this clause in a later section.

We compile this code using the command line pgcc -acc -Minfo -c a1.c; by default, this compiles for NVIDIA Tesla GPUs as well as for host execution. The compiler generates the informational message:

    sqab:
          4, Generating acc routine seq
             Generating Tesla code

This confirms that the routine was compiled for a Tesla accelerator. Now routine sqab can be called from host code or from an OpenACC compute region. For instance, we can call this function from another routine in a separate file:

    #pragma acc routine seq
    extern float sqab(float);

    void test(float* x, int n){
        #pragma acc parallel loop pcopy(x[0:n])
        for( int i = 0; i < n; ++i ) x[i] = sqab(x[i]);
    }

This has a simple OpenACC parallel loop with a call to the function sqab; the prototype for sqab appears in the file, and the acc routine directive is placed just before the prototype. Alternatively, the acc routine directive could have been specified anywhere after the prototype by including the function name, as:

    #pragma acc routine(sqab) seq

This is useful if the prototype appears in a header file, for instance. Compiling this with the same command line flags will give us the expected informational messages:

    test:
          6, Generating present_or_copy(x[:n])
             Accelerator kernel generated
              7, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
          6, Generating Tesla code

Because the program has both host code and device code that must be linked, the PGI compiler will invoke the device linker followed by the host linker, and embed the device binary into the host binary. Then you can run the program as you normally would.

You can encounter link-time errors for the device code, just as you would for host code, if you have a call to a missing routine. For instance, if we compile the routine test as above, but forget to put the acc routine directive on the function sqab, we will get a link-time error:

    nvlink error   : Undefined reference to 'sqab' in 'a2.o'

In this scenario, because the sqab routine was compiled without the acc routine directive, it was compiled only for the host and not for the device, so we get the undefined reference error from the device linker.

We can write the same example in Fortran. Here we put the function in a module:

    module b1
    contains
     real function sqab(a)
        !$acc routine seq
        real :: a
        sqab = sqrt(abs(a))
     end function
    end module

The acc routine directive goes inside the function or subroutine, usually in the declaration part. The caller can then simply use the module:

    subroutine test( x, n )
       use b1
       real, dimension(*) :: x
       integer :: n
       integer :: i
       !$acc parallel loop pcopy(x(1:n))
       do i = 1, n
          x(i) = sqab(x(i))
       enddo
    end subroutine

The Routine Directive

As shown above, the acc routine directive has two uses. First, the directive tells the compiler that the current routine needs to be compiled for device execution, as well as for the host; as we describe here, it also gives some specific information about how the routine may be called. In C or C++, this is done by placing an acc routine directive just before the function; the directive may optionally specify the function name in parentheses after the routine keyword. In a Fortran subroutine or function, this is done by placing an acc routine directive (with or without the procedure name) in the declaration part of the procedure.

The second use for the directive is to convey to the compiler that a called routine is in another file where it will be compiled for the device, and how it will be compiled. In a C or C++ program, if the function appears in the same file above the call site, the acc routine directive just above that routine will suffice; no additional directive is needed.

    #pragma acc routine seq
    float sqab( float a ){
        return sqrtf( fabsf( a ) );
    }
    . . .
    void test( float* x, int n ){
        #pragma acc parallel loop pcopy(x[0:n])
        for( int i = 0; i < n; ++i ) x[i] = sqab( x[i] );
    }

If the function appears later in the file, or is in another file, then place an acc routine directive just before a prototype declaration for the function, or place an acc routine(name) directive anywhere in the file above the call site.

      // routine pragma immediately before the prototype:
    #pragma acc routine seq
    extern float sqab(float);
      // or, after the prototype but before the call site:
    #pragma acc routine(sqab) seq
    . . .
    void test( float* x, int n ){
        #pragma acc parallel loop pcopy(x[0:n])
        for( int i = 0; i < n; ++i ) x[i] = sqab( x[i] );
    }
    . . .
      // repeat routine pragma before the function itself:
    #pragma acc routine seq
    float sqab( float a ){
        return sqrtf( fabsf( a ) );
    }

In a Fortran program, if the called routine is in a module, any subprogram that uses the module will have the necessary information, as in our example above (another reason to use Fortran modules). Otherwise, the caller can include an interface block for the routine, with an acc routine directive that matches the one in the actual routine:

    subroutine test( x, n )
       real, dimension(*) :: x
       integer :: n
       interface
        real function sqab(a)
         !$acc routine seq
         real :: a
        end function
       end interface
       . . .
    end subroutine

Alternatively, the caller can include an external declaration for the routine and an acc routine(name) directive:

    subroutine test( x, n )
       real, dimension(*) :: x
       integer :: n
       real, external :: sqab
       !$acc routine(sqab) seq
       . . .
    end subroutine

For C++, many functions appear as methods in class declarations. Many of these methods are in header files, often in files that can't be modified. Moreover, many of these classes are templates and can be instantiated many times. To address these complexities, the PGI C++ compiler has two additional features. First, an acc routine directive for a templated class member function that appears in the class definition will apply to any instantiation of that function. Second, any function in the same file (or an included file) that is called from within an OpenACC compute region is compiled with an implicit acc routine seq directive. This includes class member functions, as well as functions called from within other functions that have an explicit or implicit acc routine directive. This functionality specifically addresses the prevalent definition of classes in header files, so it is limited to C++ at this time.
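For example, here is a minimal sketch of how this might look; the class and member function names are hypothetical, not taken from the PGI documentation:

    template<typename T> class scaler{
        T factor;
      public:
        scaler( T f ) : factor(f) {}
        #pragma acc routine seq
        T apply( T x ){ return x * factor; }
    };
    . . .
    void test( float* x, int n ){
        scaler<float> s( 2.0f );
        #pragma acc parallel loop pcopy(x[0:n])
        for( int i = 0; i < n; ++i ) x[i] = s.apply( x[i] );
    }

Here each instantiation of apply would be compiled for the device; given the implicit acc routine seq behavior described above, the directive on the member function could even be omitted.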

The Routine Parallelism Clauses

The acc routine directive should include a clause indicating what kind of parallelism is used in the routine. This will be one of gang, worker, vector, or seq. An acc routine gang directive tells the compiler that the routine contains a loop with gang parallelism, or calls another routine that has a gang parallel loop. Such a routine may not itself be called from within a gang parallel loop, because gang parallel loops must not be nested. Because gang parallel loops are executed by sharing the iterations of the loop across the gangs, all the gangs must make the call to that routine. For OpenMP programmers, this is essentially the same as the OpenMP restriction that a procedure with an orphaned for or do directive on a loop must be called by all OpenMP threads.

An acc routine worker directive tells the compiler that the routine contains worker parallelism, but no gang parallelism. Such a routine may be called from within a gang parallel loop, but not within a worker or vector parallel loop. Similarly, an acc routine vector directive tells the compiler that the routine contains vector parallelism, but no gang or worker parallelism. It may be called from within a gang parallel or worker parallel loop, but not from within a vector parallel loop.

Finally, an acc routine seq directive tells the compiler that the routine contains no parallelism; it is a sequential routine that will execute on a single vector lane of the single worker of the single gang that called the routine.
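To make these rules concrete, here is a minimal sketch (the function names are illustrative, not from the examples in this article) of a worker routine called from within a gang parallel loop, which is legal; calling it from within a worker or vector loop would not be:

    #pragma acc routine worker
    void scalerow( float* row, float a, int m ){
        #pragma acc loop worker     /* worker parallelism inside the routine */
        for( int j = 0; j < m; ++j ) row[j] *= a;
    }

    void test( float* x, float a, int n, int m ){
        #pragma acc parallel loop gang pcopy(x[0:n*m])
        for( int i = 0; i < n; ++i ) scalerow( &x[i*m], a, m );
    }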

A savvy programmer will ask why OpenACC needs these clauses when OpenMP does not. It's true that OpenMP 3.1 does not require any directives when compiling routines. This is partly because OpenMP 3.1 programs were compiled for only a single instruction set, so the parallel code runs on the same processors as the sequential code (i.e., on the host). In addition, OpenMP 3.1 has only a single level of parallelism, corresponding to OpenACC gang parallelism. If there is only a single level of parallelism, then a routine either contains a parallel loop or it doesn't. If not, it's a sequential routine. If it does, then it's the programmer's responsibility to make sure all OpenMP threads actually make the call so the parallel loop is properly work-shared across all the threads.

However, the new OpenMP 4 simd capability adds the same complexity, and requires programmers to declare when a procedure must be compiled for SIMD execution. The OpenMP 4 simd functionality corresponds roughly to OpenACC vector mode, and so the two specifications now have similar requirements when it comes to addressing function calls and multiple levels of parallelism.
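For comparison, the OpenMP 4 analogue of our earlier sqab function would be declared with declare simd; this is a sketch in OpenMP 4 syntax, not an OpenACC example:

    #include <math.h>
    #pragma omp declare simd
    float sqab( float a ){
        return sqrtf( fabsf( a ) );
    }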

Routine with Vector Parallelism

Here we show an example of a Fortran module with a routine declared with vector parallelism.

    module j1
    contains
       subroutine saxpy(n,a,x,y)
          !$acc routine vector
          integer, value :: n
          real :: a, x(*), y(*)
          integer :: i
          !$acc loop
          do i = 1, n
             y(i) = a*x(i) + y(i)
          enddo
       end subroutine
       subroutine test( x, y, a, n, gpu )
          real :: x(:,:), y(:,:), a(:)
          integer :: n
          integer :: i
          logical :: gpu
          !$acc parallel loop pcopy(y) pcopyin(x,a) if(gpu)
          do i = 1, n
             call saxpy( n, a(i), x(1,i), y(1,i) )
          enddo
       end subroutine
    end module

The saxpy routine is declared with acc routine vector. This means that vector parallelism will be exploited within the routine, not by the caller. When the compiler sees the orphaned acc loop, it knows to use only vector parallelism for that loop, because of the routine directive. The loop directive is called orphaned because it is not lexically contained in an OpenACC parallel or kernels construct; at run time, it will execute within a compute region generated in the caller.

The parallel loop directive in the caller does not have an explicit vector_length clause. However, because the compiler knows that this construct contains a call to the saxpy subroutine, which was compiled with the routine vector attribute, it knows that vector parallelism should be generated. For NVIDIA devices, the compiler will add an implicit vector_length(32) clause. In this example, I used the conditional if(gpu) to control whether the compute region runs on the device (the GPU) or on the host. We use this internally for testing, to compare GPU results with results generated by the same code on the host.

Routine with Gang Parallelism

Our final example is a modification of the last one, changing the subroutine test to run on the device.

    module jj1
    contains
       subroutine saxpy(n,a,x,y)
          !$acc routine vector
          integer, value :: n
          real :: a, x(*), y(*)
          integer :: i
          !$acc loop vector
          do i = 1, n
             y(i) = a*x(i) + y(i)
          enddo
       end subroutine
       subroutine test( x, y, a, n )
          !$acc routine gang
          real :: x(:,:), y(:,:), a(:)
          integer, value :: n
          integer :: i
          !$acc loop gang
          do i = 1, n
             call saxpy( n, a(i), x(1,i), y(1,i) )
          enddo
       end subroutine
    end module

The vector saxpy routine hasn't changed at all. The subroutine test now has a routine directive. The acc parallel loop in the previous section becomes another orphaned acc loop directive in this example. In this case, the caller must specify the number of gangs, since the compiler won't know the trip count of the gang loop. The caller must give an explicit vector length as well, since the compiler doesn't see the call to the vector routine:

    !$acc parallel pcopy(y) pcopyin(x,a) num_gangs(n) vector_length(32)
    call test( x, y, a, n )
    !$acc end parallel

Notice that I changed the scalar argument n to have the value attribute, so it will be passed efficiently by value. In C and C++, all arguments are passed by value; arrays are supported by passing the value of the pointer to the array. In Fortran, arguments are usually passed by reference, that is, by passing the address of the argument. For scalars, this is inefficient; Fortran now supports the value attribute for scalar dummy arguments. To use this, the caller must have an explicit interface for the procedure, usually by putting the caller and the procedure in a module, or putting the procedure in a module and using that module in the caller.

Disabling Relocatable Code

By default, the PGI compilers will generate relocatable device code for Tesla targets. The driver will call the device linker before the host linker, as described earlier, to generate a device binary at link time, allowing for separate compilation. If all your procedure calls appear in the same source file, you can disable separate compilation by using the nordc (no relocatable device code) suboption as

    -ta=tesla:nordc

In nordc mode, the compiler generates a device binary for each source file. Calls to procedures that don't appear in that source file will result in undefined references at compile time. If you specify -ta=tesla:nordc for the link step, then the device linker will not be invoked. In that case, all the OpenACC files must have been compiled with nordc. In some cases, there is a performance advantage to disabling relocatable code. The device compiler back end will do more function inlining and optimization without relocatable code.
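For illustration, assuming a single source file b.c that contains both the caller and the callee, a build without relocatable device code might look like this:

    pgcc -acc -ta=tesla:nordc -c b.c
    pgcc -acc -ta=tesla:nordc b.o -o b.out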

There is one other reason to disable relocatable code for Tesla targets. With relocation disabled, the CUDA binary includes code for all the specified compute capabilities, and also includes the PTX (portable assembler) code. If you build with -ta=tesla:cc20,nordc, the CUDA binary will include Fermi code as well as the PTX. When you run your program on a Kepler or newer GPU, the PTX will be dynamically recompiled for the newer machine. With the current PGI compilers, the portable PTX code is lost when generating relocatable code and linking. This is being addressed for a future release.

The acc routine directive and procedure calls can be used with the Radeon target as well, but Radeon does not support separate compilation. There is currently no capability to link Radeon objects, though we expect that feature to be available in the future. For Radeon, the caller and callee routines must currently be in the same source file. If you specify both Tesla and Radeon targets, -ta=radeon,tesla, the compiler will assume the nordc suboption for the Tesla target as well.

Compute Capabilities

The current default compute capability suboption for Tesla targets is -ta=tesla:cc2+, which means the compiler will generate code for Fermi (cc20) and Kepler (cc30) targets. Separate compilation is not supported for the older compute capability 1.x (cc1x) targets (the original Tesla architecture). If you specify -ta=tesla:cc1x, the compiler will assume nordc as well. You can still use the routine directive and procedure calls, but only within a single file. Note that support for compute capability 1.x devices will likely be removed in some future major release, when NVIDIA removes support in the CUDA toolkit.

If you specify the compute capability for the compile step, you should use the same compute capabilities at the link step. In particular, if you compile with -ta=tesla:cc30, you won't be able to link without specifying the same option. By default, the compiler will try to link with cc20 and cc30, and it will find no cc20 code in the object file. If you get a link-time error message:

    nvlink fatal   : could not find compatible device code in c.o

then you likely need to specify the proper compute capability at the link step.
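For example, using the same suboption at both the compile and link steps avoids the mismatch (the file names here are illustrative):

    pgcc -acc -ta=tesla:cc30 -c c.c
    pgcc -acc -ta=tesla:cc30 c.o -o c.out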

Current Limitations

There are some limitations in support for procedures. We've already discussed the lack of separate compilation for Radeon targets and Tesla compute capability cc1x. We expect support for separate compilation on Radeon targets to come in a future release.

A procedure compiled with acc routine vector will usually only work if called from a parallel construct with vector_length(32) for NVIDIA devices. The OpenACC vector loops must synchronize across all the vector lanes at the end of the loop. OpenACC workers and vector lanes map to dimensions of a single thread block, and the CUDA execution model doesn't allow synchronizing an arbitrary subset of the thread block, such as those CUDA threads comprising the vector lanes of a single worker. This limitation currently restricts routine vector usage to a vector width that maps to a single warp.

Automatic local arrays in Fortran and VLAs (variable-length arrays) in C are not supported in OpenACC routines with the PGI 14.x releases. Supporting these requires dynamically allocating the arrays in the procedure, and that hasn't been implemented at this time. Fixed-size arrays will work, however. Also, passing Fortran assumed-shape dummy arguments is not supported in OpenACC routines with the PGI 14.x release.
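To illustrate the local array restriction, here are two hypothetical C routines: the first uses a VLA and is not supported in a routine with the 14.x releases, while the second, with a fixed-size local array, works:

    #pragma acc routine seq
    void reverse_vla( float* x, int n ){
        float tmp[n];              /* VLA: not supported in an acc routine */
        for( int i = 0; i < n; ++i ) tmp[i] = x[n-1-i];
        for( int i = 0; i < n; ++i ) x[i] = tmp[i];
    }

    #pragma acc routine seq
    void reverse_fixed( float* x, int n ){
        float tmp[64];             /* fixed size: works, assuming n <= 64 */
        for( int i = 0; i < n; ++i ) tmp[i] = x[n-1-i];
        for( int i = 0; i < n; ++i ) x[i] = tmp[i];
    }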

In addition, loops with reductions in a routine will not work; this affects gang, worker, and vector loops. This will be fixed in an upcoming PGI release.
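For example, a routine like this hypothetical vector sum, with a reduction clause on its orphaned loop, falls under this limitation in the current release:

    #pragma acc routine vector
    float sum( const float* x, int n ){
        float s = 0.0f;
        #pragma acc loop reduction(+:s)
        for( int i = 0; i < n; ++i ) s += x[i];
        return s;
    }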

Closing Comments

With support for separate compilation, OpenACC can now support true high level programming. Object files containing both host and device code can be placed in libraries and the appropriate code will be linked as expected.

Part Two of this article will discuss more advanced issues, including support for global variables and the acc declare directive, and using the acc routine directive to interface with CUDA C and CUDA Fortran functions, and will include examples using acc routine in C++ classes.
