OpenACC Routine Directive Part 2
In Part 1 I introduced the OpenACC routine directive and its use to enable true procedure calls and separate compilation in OpenACC programs. This article will discuss a few more advanced issues: support for global variables and the acc declare directive, interfacing to CUDA C and CUDA Fortran device functions using acc routine declarations, and using acc routine in C++ class member functions.
Global Variables in C and C++
As with external routines, external variables can be referenced in OpenACC accelerator regions and device code. As with routines, a directive is required to inform the compiler that a global variable needs to be allocated in device memory as well as host memory. Let's assume we have a global variable coef containing a global coefficient that we want to use in device code.
```c
float coef = 3.14159265f;
#pragma acc declare copyin(coef)

#pragma acc routine seq
float adjust(float a){
    return a*a/coef;
}
```
The acc declare directs the compiler to generate a global copy of coef in the device code. The copyin clause will cause the value of coef on the host to be copied to the device when the program attaches to the device. This typically occurs when the program executes its first OpenACC data or compute construct, or when acc_init is called.
The declare directive must appear in any file that refers to the variable in device code; it must also appear in the file that actually declares the variable without the extern keyword. Suppose we have this modified example:
```c
extern float coef;
#pragma acc declare copyin(coef)

#pragma acc routine seq
float adjust(float a){
    return a*a/coef;
}
```
Just as the extern coef must be declared in another file for the host code, that file must have a declare directive to put coef on the device. This example used a copyin clause; the acc declare directive can also use the create clause, which will allocate the variable but not initialize it from the host.
Note that the copyin clause will cause the value to be copied from the host to the device when the device is attached. The host program can set or modify the value before that time, and the modified value will be copied to the device. The global variable can be used directly in OpenACC compute regions:
```c
extern float coef;
#pragma acc declare copyin(coef)
...
#pragma acc parallel
{
    float y = sin(coef);
    #pragma acc loop
    for( i = 0; i < n; ++i )
        x[i] *= y;
}
```
Global variables also appear in the OpenACC runtime present table. This means that you can use the update directive to copy updated values of the variable between the host and device memories. Also, global arrays can be passed to procedures that use the array in a present data clause:
```c
void vec_double( float* x, int n ){
    #pragma acc parallel loop present(x[0:n])
    for( int i = 0; i < n; ++i )
        x[i] += x[i];
}
...
float gx[1024];
#pragma acc declare create(gx)
...
vec_double( &gx[1], 1022 );
#pragma acc update host( gx )
```
Currently, C file static variables, function static variables, and C++ class static members can appear in acc declare create directives, and will be allocated on the device, but they will not appear in the present table and cannot appear in update directives.
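A short sketch of that restriction (the variable and function names here are illustrative, not from the article): a file-static variable in a declare create directive gets a device allocation, but because statics are not entered in the present table, the device copy cannot be set with an update directive; assign it from inside a compute construct instead.

```cpp
// Illustrative sketch: a file-static variable in acc declare create.
// The device copy is allocated, but statics are not in the present
// table, so update directives cannot touch them.
static float scale = 1.0f;
#pragma acc declare create(scale)

void set_scale( float s ){
    scale = s;                       // host copy
    #pragma acc parallel num_gangs(1)
    { scale = s; }                   // device copy, set in a compute construct
}

void scale_all( float* x, int n ){
    #pragma acc parallel loop copy(x[0:n])
    for( int i = 0; i < n; ++i )
        x[i] *= scale;
}
```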
There is an additional benefit when the global variable is itself a pointer. Typically, we don't want just the pointer on the device, we want the data that the pointer points to as well. With the PGI implementation, if you have a global pointer in an acc declare directive, then put an array based on that pointer in a data clause, the compiler allocates and copies the array to the device, as directed in the data clause, and fills in the global pointer on the device with the device array address.
In the example that follows, the pointer p is a global variable in both host and device memories. At the enter data directive, the array p[0:n] is copied to device memory, and the global pointer p on the device is updated to point to this array. In the routine addem, the global pointer p will have the correct value.
```c
float* p;
#pragma acc declare create(p)
...
void addem( float* x, int n ){
    #pragma acc parallel loop present(x[0:n])
    for( int i = 0; i < n; ++i )
        x[i] += p[i];
}
...
float *xx;
p = (float*)malloc( n*sizeof(float) );
#pragma acc enter data copyin( p[0:n] )
#pragma acc data copy( xx[0:n] )
{
    ...
    addem( xx, n );
    ...
}
```
Module Variables in Fortran
In Fortran, global variables are in modules or in common blocks. Unfortunately, common blocks are not currently supported in accelerator device code because of device linker limitations. However, module variables are supported. A directive in the module declaration informs the compiler that a variable needs to be allocated in device memory as well as host memory.
```fortran
module globals
    real :: coef = 3.14159265
    !$acc declare copyin(coef)
contains
    real function adjust( a )
        real :: a
        adjust = a*a / coef
    end function
end module
```
The acc declare directs the compiler to generate a copy of coef in device memory. The copyin clause will cause the value of coef on the host to be copied to the device when the program attaches to the device. Because this appears in a module, coef can be used in host or device code in any subprogram in the module, or in any subprogram that uses the module.
```fortran
use globals
...
!$acc parallel loop
do i = 1, n
    x(i) = x(i) * coef
enddo
```
As in C, these variables appear in the present table, and can be used in update directives. There is one other use for acc declare in a module. An allocatable array that appears in an acc declare create clause will be allocated on the device as well as the host when it appears in an allocate statement. If the allocatable array is declared in a module, the global device copy of the allocatable array pointer gets updated with the device address at the allocate statement. Such an array will appear in the present table, and values can be moved between host and device memories with update directives.
In the following example, a module contains an allocatable array that appears in a declare create directive. The allocate statement in the allocx routine allocates x in both host and device memory. The call to muly in the parallel construct will occur on the device, using the device copy of x. Note that the scalar arguments to muly are declared with the value attribute, a more efficient parameter-passing mechanism than the Fortran default of passing by reference.
```fortran
module globals
    real, dimension(:), allocatable :: x
    !$acc declare create(x)
contains
    subroutine muly( y, a, n )
        !$acc routine vector
        real, dimension(*) :: y
        real, value :: a
        integer, value :: n
        integer :: i
        !$acc loop vector
        do i = 1, n
            x(i) = x(i) + a*y(i)
        enddo
    end subroutine
end module

subroutine allocx( n )
    use globals
    integer :: n
    allocate( x(n) )  ! allocates host and device copies
end subroutine

subroutine doem( y, a, n )
    use globals
    real, dimension(:) :: y
    real :: a
    integer :: n
    !$acc data copyin(y)
    !$acc parallel num_gangs(1) vector_length(128)
    call muly( y, a, n )
    !$acc end parallel
    !$acc end data
end subroutine
```
Calling CUDA Device Routines from C
With separate compilation on Tesla accelerators, you might want to write a CUDA device routine that can be called from your OpenACC compute construct or from within a procedure compiled with acc routine. The PGI OpenACC compiler maps parallel loop iterations onto the CUDA blockIdx and threadIdx indices.
The simplest case is a scalar CUDA C device routine, one that doesn't refer to the threadIdx or blockIdx indices, such as:
```c
__device__ __host__ float radians( float f ){
    return f*3.14159265;
}
```
When compiling this device routine with nvcc, you must specify -rdc=true and the compute capability (or capabilities) that you are compiling for:
```
% nvcc -c -rdc=true -gencode arch=compute_35,code=sm_35 radians.cu
```
The next step is to add acc routine seq after the prototype for this device routine in your OpenACC source code. The seq clause tells the compiler that each device thread will call this routine independently of any other thread.
```c
extern float radians( float );
#pragma acc routine(radians) seq
```
The radians routine may then be called within an OpenACC compute region. Because the original routine had both the __device__ and __host__ attributes, it can be called on either the CUDA device or on the host. If the routine didn't have the __host__ attribute, the OpenACC program would have to be compiled with -ta=tesla explicitly, leaving off the -ta=host option. If such a routine were called within another procedure with acc routine, the routine directive should include the nohost clause, because there is no host version of radians to call.
If your CUDA device routine expects all the device threads to call it, in particular if your CUDA device routine includes calls to __syncthreads, the acc routine directive for the prototype should use the worker clause, instead of seq. Such a routine will typically use blockIdx and threadIdx to determine on which indices to work. The worker clause directs the compiler to ensure all threads in the corresponding thread block actually call the procedure simultaneously. The PGI OpenACC compilers typically use threadIdx.x to compute the vector loop index and threadIdx.y to compute the worker loop index, so the CUDA device routine should use that as well.
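As a sketch of that pattern (the blocksum routine and its host stand-in are invented for illustration; the real version would be CUDA C striding by threadIdx.x with __syncthreads between phases), the prototype carries the worker clause and the call sits inside the parallel construct so every thread enters it:

```cpp
// Prototype for a cooperative device routine; the worker clause tells
// the compiler that all threads of the thread block call it together.
#pragma acc routine worker
extern void blocksum( float* x, int n );

// Host stand-in for illustration only; the CUDA version would use
// threadIdx.x and __syncthreads to cooperate across the block.
void blocksum( float* x, int n ){
    float s = 0.0f;
    for( int i = 0; i < n; ++i ) s += x[i];
    x[0] = s;
}

void run_blocksum( float* x, int n ){
    #pragma acc parallel num_gangs(1) vector_length(128) copy(x[0:n])
    {
        blocksum( x, n );   // entered by all workers/vector lanes
    }
}
```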
Calling CUDA Device Routines from Fortran
In CUDA Fortran, a device routine is defined with attributes(device) on the subroutine or function statement. If a scalar device routine appears in a module, and the OpenACC routine uses that module, the device routine may be called directly in an OpenACC compute construct. Device procedures can be called within OpenACC compute constructs, just as device data can be used or assigned; as mentioned earlier, these constructs must be compiled with -ta=tesla explicitly, and in particular without the -ta=host option, because there is no host version of the CUDA device data or device procedure.
The simplest way to use CUDA Fortran device routines is to place them in a module, then call them from the same module or use that module in the caller. In the example below, note again the use of the value attribute on the scalar argument f.
```fortran
module m1
contains
    attributes(device) real function radians( f )
        real, value :: f
        radians = f*3.14159265
    end function
end module

subroutine sub( x )
    use m1
    real, dimension(:) :: x
    integer :: i
    !$acc parallel loop present(x)
    do i = 1, ubound(x,1)
        x(i) = radians(x(i))
    enddo
end subroutine
```
Calling CUDA C routines from Fortran OpenACC compute constructs requires an interface block:
```fortran
subroutine sub( x )
    real, dimension(:) :: x
    interface
        real function radians( f ) bind(c)
            !$acc routine seq
            real, value :: f
        end function
    end interface
    integer :: i
    !$acc parallel loop present(x)
    do i = 1, ubound(x,1)
        x(i) = radians(x(i))
    enddo
end subroutine
```
The routine seq directive tells the compiler that a device version of this routine has been compiled. The bind(c) attribute makes the Fortran compiler use C bindings: rather than decorating the name in the generated code as it would for a Fortran function, it uses the symbol name and calling convention expected for a C function.
C++ Class Member Functions
In a C++ program, many functions, especially class member functions, appear as source code in header files included in the program. The PGI C++ compiler will take note of functions called in compute regions and implicitly add the pragma acc routine seq if there is no explicit routine directive. With optimization, many of these functions will get inlined anyway, but this allows the program to compile without having to modify header files, many of which are read-only system header files.
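A small sketch of this behavior (the names are invented for illustration): a header-style inline function needs no directive of its own when the compiler sees it called from a compute region, though adding an explicit acc routine seq would be harmless.

```cpp
// Header-style inline function with no routine directive; the PGI C++
// compiler implicitly treats it as acc routine seq when it is called
// from a compute region, and will often inline it under optimization.
inline float sq( float f ){ return f*f; }

void square_all( float* x, int n ){
    #pragma acc parallel loop copy(x[0:n])
    for( int i = 0; i < n; ++i )
        x[i] = sq( x[i] );
}
```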
If you have your own class definition and want to add an explicit acc routine directive, do so just above the function definition in the class. This will allow the compiler to generate a device version of the member function, so it can be called on an accelerator device. If the class is templated, the compiler will generate a version for each instantiation of that class. Note that virtual functions are not supported on the device, nor is exception handling. The example below creates a routine incrby that can be called on the device using vector parallelism.
```cpp
template<typename T> class myv{
    T* x;
    int n;
public:
    T& operator[](int i){ return *(this->x+i); }
    int size(){ return n; }
    ...
    #pragma acc routine vector
    void incrby( myv& b ){
        int nn = n;
        #pragma acc loop
        for( int i = 0; i < nn; ++i )
            x[i] += b[i];
    }
    ...
};

void test( int n ){
    myv<float> a(n);
    myv<float> b(n);
    ...
    #pragma acc parallel num_gangs(1) vector_length(100)
    {
        a.incrby( b );
    }
}
```
The instantiation of myv for the variables a and b will create a version of the incrby function for the device. Of equal interest is how to create the copy of the myv variables a and b on the device, along with their data. The OpenACC committee is working on several ways to handle this, but for now you can write class member routines similar to a constructor and destructor, that create or delete the device data, and update the device data or the host data as in the following example.
```cpp
template<typename T> class myv{
    T* x;
    int n;
public:
    void createdev(){
        #pragma acc enter data create( this[0:1], x[0:n] )
        #pragma acc update device(n)
    }
    void deletedev(){
        #pragma acc exit data delete( x[0:n], this[0:1] )
    }
    void updatedev(){
        #pragma acc update device(x[0:n])
    }
    void updatehost(){
        #pragma acc update host(x[0:n])
    }
};
```
The create routine createdev creates the class itself (through the this pointer) and the data, using the OpenACC 2.0 dynamic data enter data directive. The OpenACC runtime will allocate memory for the class, then memory for the data vector x, then fill in or attach the pointer from the class to the data vector. It also explicitly fills in the length field. The delete routine deletes the data vector and the class, in reverse order. The routines to update the data simply update the data vector in one direction or the other.
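Putting the pieces together, here is a self-contained sketch of the class with its device-management members and the typical host-side call sequence: create the device copies, update them, compute, and copy results back. The constructor, destructor, and initialization loop are added for illustration and are not from the article.

```cpp
template<typename T> class myv {
    T* x;
    int n;
public:
    myv( int nn ) : x(new T[nn]), n(nn) {}   // host allocation (added)
    ~myv(){ delete[] x; }
    T& operator[]( int i ){ return x[i]; }
    void createdev(){
        #pragma acc enter data create(this[0:1], x[0:n])
        #pragma acc update device(n)
    }
    void deletedev(){
        #pragma acc exit data delete(x[0:n], this[0:1])
    }
    void updatedev(){
        #pragma acc update device(x[0:n])
    }
    void updatehost(){
        #pragma acc update host(x[0:n])
    }
    #pragma acc routine vector
    void incrby( myv& b ){
        int nn = n;
        #pragma acc loop
        for( int i = 0; i < nn; ++i ) x[i] += b[i];
    }
};

// Typical call sequence around a compute construct.
float runtest( int n ){
    myv<float> a(n), b(n);
    for( int i = 0; i < n; ++i ){ a[i] = 1.0f; b[i] = (float)i; }
    a.createdev();  b.createdev();   // device objects + data vectors
    a.updatedev();  b.updatedev();   // host -> device
    #pragma acc parallel num_gangs(1) vector_length(128)
    { a.incrby( b ); }
    a.updatehost();                  // device -> host results
    a.deletedev();  b.deletedev();
    return a[2];
}
```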
Closing Comments
The routine directive supports a nohost clause that directs the compiler to generate only a device version of a function or procedure. In particular, no host version will be generated. This obviously only works if there are no calls to the routine from host code, so this clause should be used with care. Also supported is a bind clause, which allows you to write a completely different implementation of a routine for use on an accelerator device versus the host. Because this clause is being re-defined for the next revision of the OpenACC specification, I'll leave discussion of it until the definition is settled.
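An illustrative sketch of nohost (the routine names are invented; under a host-only compiler the directives are simply ignored, while an OpenACC compiler honoring nohost emits no host version of the marked routine, so any host-side call would fail to link):

```cpp
// Only a device version of cube is generated when nohost is honored,
// so it should be called only from device code, e.g. from another
// acc routine.
#pragma acc routine seq nohost
float cube( float f ){ return f*f*f; }

#pragma acc routine seq
float half_cube( float f ){
    return 0.5f * cube( f );   // legal: half_cube itself runs on the device
}
```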
The OpenACC routine directive makes it reasonably natural to write parallel programs that use modern modular programming structures, including separate compilation and libraries. I hope this two-part introduction gets you started to more productive use of OpenACC for developing performance portable heterogeneous HPC applications.