11 Tips for Maximizing Performance with OpenACC Directives in Fortran

List of Tips

  1. Privatize Arrays
  2. Make While Loops Parallelizable
  3. Rectangles Are Better Than Triangles
  4. Restructure Linearized Arrays with Computed Indices
  5. Privatize Live-out Scalars
  6. Inline Function Calls in Directive Regions
  7. Watch for Runtime Device Errors
  8. Be Aware of Data Movement
  9. Use Directive Clauses to Optimize Performance
  10. Use Data Regions to Avoid Inefficiencies
  11. Leave Data on GPU Across Procedure Boundaries

Tip #1: Privatize Arrays

Some loops fail to offload because parallelization is inhibited by arrays that must be privatized for correct parallel execution. In an iterative loop, data that is used only within a single iteration can be declared private. More generally, data that is used within a code region, is not initialized before the region, and is re-initialized before any use after the region can be declared private.

For example, if the following code is compiled:

!$acc kernels loop
   do i = 1, M
      do j = 1, N
         do jj = 1, 10
            tmp(jj) = jj
         end do
         A(i,j) = sum(tmp)
      enddo
   enddo

Informational messages similar to the following will be produced:

% pgfortran -acc -Minfo=accel private.f
privatearr:
      4, Generating copyout(a(1:m,1:n))
         Generating copyout(tmp(1:10))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      5, Parallelization would require privatization of array 'tmp(1:10)'
      6, Parallelization would require privatization of array 'tmp(1:10)'
         Accelerator kernel generated
         5, !$acc loop seq
         6, !$acc loop seq
            Non-stride-1 accesses for array 'a'

A kernel is generated, but it will be very inefficient because it is sequential. You can add a private clause to a loop directive to tell the compiler that it is safe to privatize array tmp within the scope of the do j loop:

!$acc kernels loop
   do i = 1, M
   !$acc loop private(tmp)
      do j = 1, N
         do jj = 1, 10
            tmp(jj) = jj
         end do
         A(i,j) = sum(tmp)
      enddo
   enddo

Doing so gives the PGI compiler the information it needs to compile the nested loop into a fully parallel kernel:

% pgfortran -acc -Minfo=accel private2.f
privatearr:
      4, Generating copyout(a(1:m,1:n))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
      5, Loop is parallelizable
      7, Loop is parallelizable
         Accelerator kernel generated
         5, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
         7, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
            CC 1.0 : 9 registers; 56 shared, 8 constant,
                     0 local memory bytes; 100% occupancy
            CC 2.0 : 19 registers; 8 shared, 64 constant,
                     0 local memory bytes; 100% occupancy
         8, Loop is parallelizable
        11, Loop is parallelizable

Note that the compiler will by default generate versions of the kernel that can be executed on CUDA devices with compute capability 1.x or 2.x. You can restrict code generation to a specific compute capability, say 2.0 for Fermi-class GPUs, using the compiler option ‑ta=nvidia:cc20.

Tip #2: Make While Loops Parallelizable

The PGI Accelerator compiler can't automatically convert while loops into a form suitable to run on the GPU. But it is often possible to manually convert a while loop into a countable rectangular do loop. For example, if the following code is compiled:

!$acc kernels
   i = 0
   do while (found .eq. 0)
      i = i + 1
      if (A(i) .eq. 102) then
         found = i
      endif
   enddo
!$acc end kernels

Informational messages similar to the following will be produced:

% pgfortran -acc -Minfo=accel while.f -c
PGF90-W-0155-Accelerator region ignored; see -Minfo messages (while.f: 6)
while:
      6, Accelerator region ignored
      8, Accelerator restriction: invalid loop
   0 inform, 1 warnings, 0 severes, 0 fatal for while

But if the loop is restructured into the following form as a do loop:

!$acc kernels loop
   do i = 1, N
      if (A(i) .eq. 102) then
         found(i) = i
      else
         found(i) = 0
      endif
   enddo
print *, 'Found at ', maxval(found)

This form gives the PGI compiler the information it needs to compile the loop for execution on an NVIDIA GPU:

% pgfortran -acc -Minfo=accel while2.f -c
while:
      5, Generating copyin(a(1:n))
         Generating copyout(found(1:n))
         Generating compute capability 2.0 binary
      6, Loop is parallelizable
         Accelerator kernel generated
         6, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
            Using register for 'found'
            CC 2.0 : 8 registers; 4 shared, 60 constant,
                     0 local memory bytes; 100% occupancy
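
As an alternative to storing one result per element and calling maxval, the same search can be expressed with a scalar max reduction. The following is only a sketch: the array size, the test data (a single match planted at index 417), and the program name are assumptions made for illustration. Like the maxval version, it reports the largest matching index, or 0 if there is no match.

```fortran
program find_match
   implicit none
   integer, parameter :: n = 1000   ! assumed size, for illustration only
   integer :: a(n), i, found

   a = 0
   a(417) = 102      ! hypothetical test data: one match at index 417
   found = 0         ! 0 means "no match"

   ! Each iteration is independent; the reduction combines the results.
!$acc kernels loop reduction(max:found)
   do i = 1, n
      if (a(i) .eq. 102) found = max(found, i)
   enddo

   if (found /= 417) error stop 'unexpected result'
   print *, 'Found at ', found
end program find_match
```

Compiled without OpenACC support, the directive is treated as a comment and the loop runs serially with the same result.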
   

Tip #3: Rectangles Are Better Than Triangles

All loops must be rectangular. For triangular loops, the compiler will serialize the inner loop. For example, if the following triangular loop is compiled:

!$acc kernels loop
   do i = 1, M
      do j = i, N     ! Here's the triangular loop
         A(i,j) = i+j
      enddo
    enddo

Informational messages similar to the following will be produced:

% pgfortran -acc -Minfo=accel -c triangle.f
triangle:
      4, Generating copyout(a(1:m,:))
         Generating compute capability 2.0 binary
      5, Loop is parallelizable
         Accelerator kernel generated
         5, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
            CC 2.0 : 21 registers; 12 shared, 60 constant,
                     0 local memory bytes; 83% occupancy
      6, Loop is parallelizable

While the loops appear to have been parallelized, the resulting code will likely produce incorrect results. Why? Because the compiler copies the entire A array from device to host, and in the process overwrites the lower triangle of the host copy of A with garbage values. However, if a copy clause is specified on the accelerator region, correct code is generated. For example, after compiling the following loop:

!$acc kernels loop copy(A)
   do i = 1, M
      do j = i, N
         A(i,j) = i+j
      enddo
   enddo

Informational messages similar to the following will be produced:

% pgfortran -acc -Minfo=accel -c triangle2.f
triangle:
      4, Generating copy(a(:,:))
         Generating compute capability 2.0 binary
      5, Loop is parallelizable
         Accelerator kernel generated
         5, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
            CC 2.0 : 21 registers; 12 shared, 60 constant,
                     0 local memory bytes; 83% occupancy
      6, Loop is parallelizable
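
Another option, in keeping with the title of this tip, is to give the inner loop rectangular bounds and guard the assignment with a condition. The sketch below is an illustration under assumptions not found in the example above: the sizes (m = n = 64), the integer element type, and the program name are invented for the demonstration. The copy clause is kept so that elements the kernel never writes retain their host values.

```fortran
program triangle_rect
   implicit none
   integer, parameter :: m = 64, n = 64   ! assumed sizes
   integer :: a(m,n), i, j

   a = 0   ! the lower triangle should keep this host value

!$acc kernels loop copy(a)
   do i = 1, m
      do j = 1, n                        ! rectangular bounds
         if (j >= i) a(i,j) = i + j      ! touch only the upper triangle
      enddo
   enddo

   if (a(1,1) /= 2 .or. a(m,n) /= m+n .or. a(m,1) /= 0) error stop 'unexpected result'
   print *, 'upper triangle filled, lower triangle untouched'
end program triangle_rect
```

The guard costs some idle iterations in the lower triangle, but both loops have bounds that do not depend on an outer index.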

Tip #4: Restructure Linearized Arrays with Computed Indices

It is not uncommon for legacy codes to use computed indices for computations on multi-dimensional arrays that have been linearized. For example, if the following loop with a computed index into the linearized array A is compiled:

!$acc kernels loop
   do i = 1, M
      do j = 1, N
         idx = ((i-1)*N)+j   ! each row holds N elements, so every (i,j) maps to a unique index
         A(idx) = B(i,j)
      enddo
   enddo

Informational messages similar to the following will be produced:

% pgfortran -acc -Minfo=accel linearization.f
linear:
      4, Generating copyout(a(:))
         Generating copyin(b(1:m,1:n))
         Generating compute capability 2.0 binary
      5, Parallelization would require privatization of array 'a(:)'
      6, Parallelization would require privatization of array 'a(:)'
         Accelerator kernel generated
         5, !$acc loop seq
         6, !$acc loop seq
            Non-stride-1 accesses for array 'b'
            CC 2.0 : 16 registers; 0 shared, 72 constant,
                     0 local memory bytes; 16% occupancy
   

The code will run on the GPU but it will execute sequentially and run very slowly. You have two options. First, the loop can be restructured to remove linearization:

!$acc kernels loop
   do i = 1, M
      do j = 1, N
         A(i,j) = B(i,j)
      enddo
   enddo

This allows the compiler to generate parallel GPU code:

% pgfortran -acc -Minfo=accel linearization2.f
linear:
      4, Generating copyout(a(1:m,1:n))
         Generating copyin(b(1:m,1:n))
         Generating compute capability 2.0 binary
      5, Loop is parallelizable
      6, Loop is parallelizable
         Accelerator kernel generated
         5, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
         6, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
            CC 2.0 : 11 registers; 16 shared, 64 constant,
                     0 local memory bytes; 100% occupancy
   

Or second, independent clauses can be specified on the do loops to provide the compiler with the information necessary to safely parallelize the loops:

!$acc kernels
!$acc loop independent
   do i = 1, M
!$acc loop independent
      do j = 1, N
         idx = ((i-1)*N)+j   ! each row holds N elements, so every (i,j) maps to a unique index
         A(idx) = B(i,j)
      enddo
   enddo
!$acc end kernels
   

Tip #5: Privatize Live-out Scalars

It is common for loops to initialize scalar work variables, and for those variables to be referenced or re-used after the loop. Such a variable is called a "live out" scalar, because correct execution may depend on its having the last value it was assigned in a serial execution of the loop(s). For example, if the following loop with a live out variable idx is compiled:

!$acc kernels loop
   do i = 1, M
      do j = 1, N
         idx = i+j
         A(i,j) = idx
      enddo
   enddo
   print *, idx, A(1,1), A(M,N)

Informational messages similar to the following will be produced:

% pgfortran liveout.f -acc -Minfo=accel -c
liveout:
      4, Generating copyout(a(1:m,1:n))
         Generating compute capability 1.3 binary
      5, Loop is parallelizable
      6, Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         5, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
         6, !$acc loop seq
            CC 1.3 : 9 registers; 48 shared, 16 constant,
                     0 local memory bytes; 100% occupancy
      7, Accelerator restriction: induction variable live-out from loop: idx
      8, Accelerator restriction: induction variable live-out from loop: idx
   

The code will run on the GPU, but the inner loop is executed sequentially. Looking at the code, the use of idx in the print statement is only for debugging purposes. In this case, you know the computations will still be valid even if idx is privatized, so the code can be modified as follows:

!$acc kernels loop
   do i = 1, M
!$acc loop private(idx)
      do j = 1, N
         idx = i+j
         A(i,j) = idx
      enddo
   enddo
   print *, idx, A(1,1), A(M,N)

A much more efficient fully parallel kernel will be generated:

% pgfortran liveout2.f -acc -Minfo=accel -c
liveout:
      4, Generating copyout(a(1:m,1:n))
         Generating compute capability 1.3 binary
      5, Loop is parallelizable
      7, Loop is parallelizable
         Accelerator kernel generated
         5, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
         7, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
            CC 1.3 : 8 registers; 56 shared, 12 constant,
                     0 local memory bytes; 100% occupancy

Note that the value printed for idx in the print statement will differ from the value printed by a sequential execution of the program.
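
If the final value of the scalar does matter to later code, one possible workaround is to assign it explicitly after the loop nest. The following sketch assumes small array sizes, an integer array, and a program name chosen only for illustration; the final value m + n follows from the loop bounds in the example above.

```fortran
program liveout_fix
   implicit none
   integer, parameter :: m = 32, n = 16   ! assumed sizes
   integer :: a(m,n), i, j, idx

!$acc kernels loop
   do i = 1, m
!$acc loop private(idx)
      do j = 1, n
         idx = i + j
         a(i,j) = idx
      enddo
   enddo

   ! Recompute the value idx would hold after a serial run of the loops.
   idx = m + n
   if (a(m,n) /= idx) error stop 'unexpected result'
   print *, idx, a(1,1), a(m,n)
end program liveout_fix
```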

Tip #6: Inline Function Calls in Directive Regions

One of the most common barriers to maximum GPU performance is the presence of function calls in the region. To run efficiently on the GPU, the compiler must be able to inline function calls. There are two ways to invoke automatic function inlining with the PGI Accelerator compilers:

1. If the function(s) to be inlined are in the same file as the section of code containing the accelerator region, you can use the ‑Minline compiler command-line option to enable automatic procedure inlining. This will enable automatic inlining of functions throughout the file, not only within the accelerator region. If you would like to restrict inlining to specific functions, say func1 and func2, use the option ‑Minline=func1,func2. To learn more about controlling inlining with ‑Minline, see the pgfortran man page, or just type pgfortran ‑help ‑Minline.

2. If the function(s) to be inlined are in a separate file from the code containing the accelerator region, you need to use the inter-procedural optimizer with automatic inlining enabled by specifying ‑Mipa=inline on the compiler command-line. ‑Mipa is both a compile-time and link-time option, so you need to specify it on the command-line when linking your program as well for inlining to occur. As with ‑Minline, you can learn more about controlling inter-procedural optimizations and inlining from the pgfortran man pages, or using pgfortran ‑help ‑Mipa.

In some cases when working with Fortran, procedures can only be inlined automatically by enabling array reshaping with ‑Minline=reshape or ‑Mipa=inline,reshape. This is the case, for example, when a 2D array is passed as an actual argument to a corresponding 1D array dummy argument.

There are several restrictions on automatic inlining. A Fortran subprogram will not be inlined if any of the following applies:

  • It is referenced in a statement function.
  • A common block mismatch exists; in other words, the caller must contain all common blocks specified in the callee, and elements of the common blocks must agree in name, order, and type (except that the caller's common block can have additional members appended to the end of the common block).
  • An argument mismatch exists; in other words, the number and type (size) of actual and formal parameters must be equal.
  • A name clash exists, such as a call to subroutine xyz in the extracted subprogram and a variable named xyz in the caller.

If you encounter these or any other restrictions that prevent automatic inlining of functions called in accelerator regions, the only alternative is to inline them manually.
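
As a rough illustration of manual inlining, the sketch below replaces a call to a small function with the function's body. The function axpb (returning 2.0*x + 1.0), the array size, and the program name are all hypothetical and serve only to show the transformation.

```fortran
program manual_inline
   implicit none
   integer, parameter :: n = 256   ! assumed size
   real :: a(n), b(n)
   integer :: i

   do i = 1, n
      b(i) = real(i)
   enddo

!$acc kernels loop
   do i = 1, n
      ! Before inlining: a(i) = axpb(b(i))
      ! where the hypothetical function axpb(x) returns 2.0*x + 1.0
      a(i) = 2.0*b(i) + 1.0      ! body of axpb substituted by hand
   enddo

   if (a(1) /= 3.0 .or. a(n) /= 2.0*real(n) + 1.0) error stop 'unexpected result'
   print *, 'a(1) =', a(1), ' a(n) =', a(n)
end program manual_inline
```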

Tip #7: Watch for Runtime Device Errors

Once you have successfully offloaded code in an accelerator region for execution on the GPU, you can still encounter errors at runtime due to common porting or coding errors that are not exposed by execution on the host CPU.

You may encounter an error message like this when executing a program:

Call to cuMemcpyDtoH returned error 700: Launch failed

This typically occurs when the device kernel returns an execution error due to an out-of-bounds or other memory access violation. For example the following code will generate such an error:

!$acc kernels loop
   do i = 1, M
      do j = 1, N
         A(i,j) = B(i,j+1)   ! << out-of-bounds
      enddo
   enddo

The only way to isolate such errors currently is through inspection of the code in the accelerator region, or by compiling and executing on the host using the ‑Mbounds command-line option. This option will instrument the executable to print an error message for out-of-bounds array accesses.
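
Once the offending access has been located, the fix is usually a matter of adjusting the loop bounds. The sketch below assumes that B is declared with the same shape as A, B(M,N), which the example above does not state; the sizes, initial values, and program name are likewise invented for illustration.

```fortran
program bounds_fixed
   implicit none
   integer, parameter :: m = 8, n = 16   ! assumed sizes
   real :: a(m,n), b(m,n)
   integer :: i, j

   b = 1.0
   a = 0.0

!$acc kernels loop copy(a) copyin(b)
   do i = 1, m
      ! Stop at n-1 so that b(i,j+1) never exceeds the declared extent.
      do j = 1, n-1
         a(i,j) = b(i,j+1)
      enddo
   enddo

   if (a(1,1) /= 1.0 .or. a(m,n) /= 0.0) error stop 'unexpected result'
   print *, 'no out-of-bounds access'
end program bounds_fixed
```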

If you encounter the following error message when executing a program:

Call to cuMemcpy2D returned error 1: Invalid value

This typically occurs if there is an error copying data to or from the device. For example, the following code will generate such an error:

      parameter(N=1024,M=512)
      real :: A(M,N), B(M,N)
      ...
      ! << Bad bounds for copyin
!$acc kernels loop copyout(A), copyin(B(0:N,1:M+1))
      do i = 1, M
         do j = 1, N
            A(i,j) = B(i,j+1)
         enddo
      enddo

The only way to isolate such errors currently is through inspection of the code in the accelerator region or inspection of the ‑Minfo informational messages at compile time.

Tip #8: Be Aware of Data Movement

Having successfully offloaded a CUDA kernel using PGI Accelerator directives, you should understand and try to optimize data movement between host memory and GPU device memory.

You can see exactly what data movement is occurring for each generated CUDA kernel by looking at the informational messages emitted by the PGI Accelerator compiler:

% pgfortran -acc -Minfo=accel -c jacobi.f90
jacobi:
      18, Generating copyin(a(1:m,1:n)) << Array a being copied from host memory to GPU device memory before CUDA kernel launch
          Generating copyout(a(2:m-1,2:n-1)) << Elements of arrays a and newa copied back 
          Generating copyout(newa(2:m-1,2:n-1)) << to host memory after CUDA kernel execution
      ...

You can see how much execution time is spent moving data between host memory and device memory by setting the environment variable PGI_ACC_TIME when you run the program:

% pgfortran -acc jacobi.f90
% setenv PGI_ACC_TIME 1
% a.out

<output from program>

Accelerator Kernel Timing data
jacobi
      18: region entered 798 times
          time(us): total=5575112 init=4565273 region=1009839  
                    kernels=79825 data=385751
       79 milliseconds spent ^^^   and   ^^^385 milliseconds spent moving data between
       executing kernels                    host memory and GPU device memory
          w/o init: total=1009839 max=12347 min=1191 avg=1265
      20: kernel launched 798 times
          grid: [16x16]  block: [16x16]
          time(us): total=47315 max=70 min=58 avg=59
      24: kernel launched 798 times
          grid: [1]  block: [256]
          time(us): total=9067 max=13 min=11 avg=11
      27: kernel launched 798 times
          grid: [16x16]  block: [16x16]
          time(us): total=23443 max=35 min=28 avg=29
   

Once you have examined and timed the data movement required at accelerator region boundaries, there are several techniques you can use to minimize and optimize data movement.

Tip #9: Use Directive Clauses to Optimize Performance

By default, the PGI Accelerator compilers will move the minimum amount of data required to perform the necessary computations on the GPU. For example, if the following code is compiled:

      change = tolerance + 1 ! get into the while loop
      iters = 0
      do while ( change > tolerance )
         iters = iters + 1
         change = 0
!$acc kernels
!$acc loop reduction(max:change)
         do j = 2, n-1
            do i = 2, m-1
               newa(i,j) = w0 * a(i,j) + &
               w1 * (a(i-1,j) + a(i,j-1) + a(i+1,j) + a(i,j+1) ) + &
               w2 * (a(i-1,j-1) + a(i-1,j+1) + a(i+1,j-1) + a(i+1,j+1) )
               change = max( change, abs( newa(i,j) - a(i,j) ) )
            enddo
         enddo
         a(2:m-1,2:n-1) = newa(2:m-1,2:n-1)
!$acc end kernels
      enddo
   

Feedback messages similar to the following will be produced:

% pgfortran -acc -Minfo=accel -c jacobi.f90
jacobi:
      18, Generating copyin(a(1:m,1:n))
          Generating copyout(a(2:m-1,2:n-1))
          Generating copyout(newa(2:m-1,2:n-1))
          Generating compute capability 2.0 binary
      19, Loop is parallelizable
      20, Loop is parallelizable
          Accelerator kernel generated
          19, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
          20, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
              Cached references to size [18x18] block of 'a'
              CC 2.0 : 18 registers; 1328 shared, 104 constant,
                       0 local memory bytes; 100% occupancy
      24, Max reduction generated for change
      27, Loop is parallelizable
          Accelerator kernel generated
          27, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
              !$acc loop gang, vector(16) ! blockidx%y threadidx%y
              CC 2.0 : 10 registers; 16 shared, 80 constant,
                       0 local memory bytes; 100% occupancy

Some things to note:

  • Only the interior elements of the arrays a and newa are modified, so only those elements are copied out of GPU memory to host memory.
  • Performance degrades dramatically because the data being copied is not contiguous and is moved in many small transfers.
  • Array newa is just a temporary array which does not need to be initialized before kernel execution and is not used after kernel execution.

We can modify the code by adding clauses to the acc kernels directive which specify that the entire array a should be copied in and out, and that the array newa can be treated as GPU-local (i.e., as a scratch array that does not need to be copied):

      change = tolerance + 1 ! get into the while loop
      iters = 0
      do while ( change > tolerance )
         iters = iters + 1
         change = 0
!$acc kernels copy(a) create(newa)
!$acc loop reduction(max:change)
         do j = 2, n-1
            do i = 2, m-1
               newa(i,j) = w0 * a(i,j) + &
               w1 * (a(i-1,j) + a(i,j-1) + a(i+1,j) + a(i,j+1) ) + &
               w2 * (a(i-1,j-1) + a(i-1,j+1) + a(i+1,j-1) + a(i+1,j+1) )
               change = max( change, abs( newa(i,j) - a(i,j) ) )
            enddo
         enddo
         a(2:m-1,2:n-1) = newa(2:m-1,2:n-1)
!$acc end kernels
      enddo
   

When the code is recompiled, the PGI compiler emits the following feedback messages:

% pgfortran -acc -Minfo=accel -c jacobi2.f90 
jacobi: 
     18, Generating copy(a(:,:)) 
         Generating create(newa(:,:)) 
         Generating compute capability 2.0 binary 
     ... 

The copy of array a will be much more efficient, and data movement for array newa has been completely eliminated.

Tip #10: Use Data Regions to Avoid Inefficiencies

OpenACC has two kinds of constructs: compute constructs (kernels and parallel) and a data construct. In a kernels construct, the loops to be executed on the GPU are delineated either with a pair of !$acc kernels and !$acc end kernels directives or with a !$acc kernels loop directive placed before a loop. For example, we might enclose a weighted five-point stencil operation in Fortran as:

!$acc kernels loop
   do i = 2, n-1
      do j = 2, m-1
         b(i,j) = 0.25*w(i)*(a(i-1,j)+a(i,j-1)+ &
                             a(i+1,j)+a(i,j+1)) &
                  +(1.0-w(i))*a(i,j)
      enddo
   enddo

When compiled, the PGI Accelerator compiler emits the following messages:

s1:
     7, Generating copyin(w(2:n-1))            << Compiler is sending a portion of array w 
        Generating copyin(a(1:n,1:m))          << and array a to the GPU, and copying modified 
        Generating copyout(b(2:n-1,2:m-1))     << portion of array b back to host memory 
     8, Loop is parallelizable
     9, Loop is parallelizable
        Accelerator kernel generated 
        8, !$acc loop gang, vector(16)                    << Shows parallelism
           Cached references to size [16] block of 'w'    << schedule: Loops are 
        9, !$acc loop gang, vector(16)                    << executed in 16x16
           Cached references to size [18x18] block of 'a' << blocks.

However, there is a serious problem with this particular program. The loop does roughly 8*n*m operations, but transfers roughly 8*n*m bytes to do it. The data transfer between the host and the GPU will dominate any performance advantage gained from the parallelism on the GPU. We certainly don't want to send data between the host and GPU for each iteration. Enter the data region.

A data region is generated by a data construct. A data construct looks similar to a compute construct, but defines only data movement between host memory and GPU device memory. In an iterative solver, a data construct can be placed outside the iteration loop, and an enclosed compute construct around the computational kernel. A more complete example might look as follows:

!$acc data copy(a(1:n,1:m)) create(b(2:n-1,2:m-1)) copyin(w(2:n-1))
      do while(...)
!$acc kernels
         do i = 2, n-1
            do j = 2, m-1
               b(i,j) = 0.25*w(i)*(a(i-1,j)+a(i,j-1)+ &              
                                   a(i+1,j)+a(i,j+1)) &
                        +(1.0-w(i))*a(i,j)
            enddo
         enddo
         do i = 2, n-1
            do j = 2, m-1
               a(i,j) = b(i,j)
            enddo
         enddo
!$acc end kernels
      enddo 
!$acc end data 

Now, any input data is copied to GPU device memory at entry to the data construct, and the results copied back to host memory at exit of the data construct. Inside the while loop, there is essentially no data movement between the host and the GPU. This will run several times faster than the original program.

The code executed within a data construct, including any procedures called, comprises the data region. A data region will typically contain one or more compute constructs; data used in a compute construct that was moved to the accelerator in an enclosing data construct is not moved at the boundaries of the compute construct. The next section discusses how to extend the reach of a data region across procedure boundaries.

The update directive, with its device and host clauses, allows fine-tuning of data movement inside a data region. Place an update device directive before a compute construct for data that was allocated on the GPU in an enclosing data construct but has been modified on the host since the beginning of the data construct. This tells the compiler which arrays or parts of arrays need to be copied from host memory to GPU device memory before the compute construct executes. Similarly, place an update host directive after a compute construct when data is allocated on the GPU in an enclosing data construct but you want some or all of the host copy of that array refreshed once the compute construct completes.
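
In the OpenACC specification, update is a standalone executable directive that takes device and host clauses. The following is a minimal sketch of refreshing data in both directions inside a data region; the arrays, sizes, values, and program name are assumptions made for illustration.

```fortran
program update_demo
   implicit none
   integer, parameter :: n = 100   ! assumed size
   real :: a(n), b(n)
   integer :: i

   a = 1.0
!$acc data copyin(a) create(b)
   ! The host copy of a changes after the data construct has begun,
   ! so the device copy must be refreshed before it is used.
   a(1) = 5.0
!$acc update device(a)
!$acc kernels
   do i = 1, n
      b(i) = 2.0*a(i)
   enddo
!$acc end kernels
   ! b was only created on the device; copy the result back explicitly.
!$acc update host(b)
!$acc end data

   if (b(1) /= 10.0 .or. b(n) /= 2.0) error stop 'unexpected result'
   print *, 'b(1) =', b(1)
end program update_demo
```

Without the update device directive, the kernel would read the stale device value of a(1); without update host, the create clause would leave the host copy of b unchanged.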

Using data regions, it is often possible to substantially reduce the amount of data movement in program units that include multiple accelerator compute regions.

Tip #11: Leave Data on GPU Across Procedure Boundaries

Data regions enable the programmer to leave data in GPU device memory and re-use it across multiple procedures. However, the compiler must know that the data already exists on the device. For this reason, OpenACC has the present clause.

The present clause on a compute or data construct or declare directive tells the compiler that the specified dummy argument or global variable appears in an enclosing data region, such as a data construct in a calling routine. Consider the example below:

      module glob
      real, dimension(:), allocatable :: x
      contains
         subroutine sub( y )
         real, dimension(:) :: y
!$acc declare present(y,x)
!$acc kernels
         do i = 1, ubound(y,1)
            y(i) = y(i) + x(i)
         enddo
!$acc end kernels
         end subroutine
      end module

      subroutine roo( z )
      use glob
      real :: z(:)
!$acc data copy(z) copyin(x)
      call sub( z )
!$acc end data
      end subroutine

The subroutine sub is contained within a module, and the dummy array y and global array x are declared to be present on the device. The call site is within a data construct, so the code within the subroutine sub is within the data region generated by the construct. Within the kernels construct, the compiler knows that the arrays x and y are already present, so no data motion takes place.

Using data regions along with the present clause, it is possible to allocate and use data in GPU device memory across large portions of an application while minimizing the number of data transfers that must occur to keep the host and device copies coherent.

We can modify this example again so that the subroutine will work whether the data is present on the GPU or not:

      module glob
      real, dimension(:), allocatable :: x
      contains
         subroutine sub( y )
         real, dimension(:) :: y
!$acc kernels present_or_copy(y) present_or_copyin(x)
         do i = 1, ubound(y,1)
            y(i) = y(i) + x(i)
         enddo      
!$acc end kernels
         end subroutine
      end module

      subroutine roo( z, zz )
      use glob
      real :: z(:), zz(:)
!$acc data copy(z) copyin(x)
      call sub( z )
      call sub( zz )
!$acc end data
      end subroutine

In this example, we modified the clauses for x and y to be present_or_copy[in]. At runtime, the program will test whether each array is already present on the GPU; if so, it will use that copy of the array without any data movement. If the data is not already present, it will act according to the copy or copyin clause. At the first call to sub, both x and y will be present due to the data construct in roo. At the second call, x is still present, but y, which is now bound to zz, is not, so y will be copied in to and out from the device.

You can also specify a subarray on a present clause, meaning that at least the specified subarray must be present. It is a runtime error if an array in a present clause is not present, or an array or subarray is only partially present. In a present_or_... clause, if the array is not present, the second part of the clause is honored. If the array is partially present, it is again a runtime error.
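
A subarray in a present clause might look like the sketch below. The array, its size, and the program name are hypothetical; the enclosing data construct places all of y on the device, so the requirement that at least y(1:n/2) be present is satisfied.

```fortran
program present_subarray
   implicit none
   integer, parameter :: n = 100   ! assumed size
   real :: y(n)
   integer :: i

   y = 1.0
!$acc data copy(y(1:n))
   ! At least y(1:n/2) must already be on the device; here all of y is.
!$acc kernels present(y(1:n/2))
   do i = 1, n/2
      y(i) = y(i) + 1.0
   enddo
!$acc end kernels
!$acc end data

   if (y(1) /= 2.0 .or. y(n) /= 1.0) error stop 'unexpected result'
   print *, y(1), y(n)
end program present_subarray
```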
