Technical News from The Portland Group

Optimizing Data Movement in the PGI Accelerator Programming Model

Background

In 2009, PGI introduced the PGI Accelerator programming model, a directive-based model targeting NVIDIA GPUs. About a year and a half ago, we discussed Performance Tuning for the GPU in the PGI Accelerator model. In that article we showed that, after finding an appropriately parallel algorithm, the most important key to generating high performance is managing data traffic between the host and the GPU. This is still the case today, whether you're programming in an implicit model like the PGI Accelerator directives or in an explicit model like CUDA C or CUDA Fortran. With explicit CUDA, you can see each data movement as a cudaMemcpy call or a CUDA Fortran array assignment. With PGI Accelerator directives, data movement is not always so obvious because the compiler manages most of it.

As explained in the earlier performance tuning article, it's important to compile with the -Minfo flag so you can see the copyin (into the GPU) and copyout (out of the GPU) messages. These tell you which arrays are being transferred in each direction. By default, the compiler will move the smallest part of each array needed to execute the loop. One optimization strategy is to tell the compiler to move the whole array using copy, copyin or copyout clauses; whole arrays or contiguous subarrays can be moved with a single, potentially faster, DMA operation. Also by default, the compiler will bring back to the host any arrays that were modified on the GPU. If those arrays hold intermediate results, they may not be needed anymore, so another tuning trick is to use the local clause to tell the compiler that their values need not return to the host.
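
For example, a compute region along these lines (a minimal sketch; the subroutine, arrays and size n are hypothetical) copies a to the GPU, returns only r, and never transfers the intermediate array tmp:

    subroutine scale_add( a, r, n )
      integer :: n, i
      real :: a(n), r(n)
      real :: tmp(n)
    !$acc region copyin(a(1:n)), copyout(r(1:n)), local(tmp(1:n))
      do i = 1, n
        tmp(i) = 2.0 * a(i)              ! intermediate result; never copied back
      enddo
      do i = 1, n
        r(i) = tmp(i) + a(i)             ! only r returns to the host
      enddo
    !$acc end region
    end subroutine

Compiling such a routine with -Minfo shows the resulting copyin and copyout messages, so you can verify that only the intended data is moving.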

Last year, we extended the PGI Accelerator model with the addition of the data region directive. This allows your program to move arrays between the host and GPU at the data region boundaries. If a single data region encloses many compute regions, the compute regions will reuse the data already allocated on the GPU. This can dramatically reduce the frequency of data movement and is important for optimized performance. While this was an improvement, it still wasn't enough; the data region construct had to lexically contain the compute regions. If those compute regions were in different procedures, they had to be inlined to get the benefit. In this article, we present three powerful new methods to manage data traffic without that restriction. Currently, these new tools are available in the 2011 release of the PGI Accelerator Fortran compiler. We expect to release a PGI Accelerator C compiler implementation later this spring as well.

Mirrored Allocatable Data

The new PGI Accelerator Fortran directive mirror applies to Fortran allocatable arrays. This directive tells the compiler that allocate and deallocate statements for this array should allocate copies both on the host and on the GPU. When these arrays appear in host code, the host copy is used; when they appear in PGI Accelerator compute regions, the GPU copy is used. Using the mirror directive with module allocatable arrays gives them global visibility.

    module glob
      real, dimension(:), allocatable :: x
      !$acc mirror( x )
    end module glob
    subroutine sub( y )
      use glob
      real, dimension(:) :: y
      !$acc region
        do i = 1, ubound(y,1)
          y(i) = y(i) + x(i)
        enddo
      !$acc end region
    end subroutine

In this example, when x is allocated, a copy on the GPU will be allocated as well. When the region in subroutine sub is executed, the region will use the values in the GPU copy of x with no data movement.

The update directive can be used to synchronize data between the GPU and host copies of the array. The following example updates a subarray of the host copy from GPU data.

    subroutine sync
      use glob
      !$acc update host( x(2:499) )
    end subroutine
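
The reverse direction works the same way. As a sketch, this routine refreshes the same GPU subarray from host data after the host copy has been modified:

    subroutine push
      use glob
      !$acc update device( x(2:499) )
    end subroutine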

There is also a new mirror clause for use in data regions; the mirror clause is like the mirror directive, except the GPU copy only has the lifetime of the data region. When the data region is entered, if an array in the mirror clause is allocated on the host (or is not allocatable), it will be allocated on the GPU with the same size. If an allocatable array in the mirror clause is not allocated, a GPU copy will not be allocated. If an allocatable array in the mirror clause is allocated or deallocated in the data region, the GPU copy will likewise be allocated or deallocated. It's important to note that the allocation or deallocation does not imply any data movement. If you want the host data copied to the GPU, you should use a copyin clause instead of mirror, or add update device directives.
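
As an illustration (a sketch only; the subroutine and array names are made up), a data region might mirror a temporary lookup table and use update device to fill the GPU copy:

    subroutine scale_by_table( a, n )
      integer :: n, i
      real :: a(n)
      real, allocatable :: table(:)
      allocate( table(n) )
      do i = 1, n
        table(i) = 1.0 / real(i)          ! fill the host copy
      enddo
      !$acc data region mirror( table ), copy( a )
        ! mirror allocates a GPU copy of table for the lifetime of the data region,
        ! but moves no data; copy the host values over explicitly
        !$acc update device( table )
        !$acc region
        do i = 1, n
          a(i) = a(i) * table(i)
        enddo
        !$acc end region
      !$acc end data region
      deallocate( table )
    end subroutine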

Because C doesn't have statements for dynamic allocation, and because it doesn't preserve array bound information, there is no analog to the mirror directive or clause for C.

Reflected Arguments

The second new directive in PGI Accelerator Fortran is the reflected directive, which applies to dummy arguments. This directive tells the compiler that the specified array dummy arguments appear in a data region clause in the caller, or are mirrored on the GPU. Let's expand the previous example as follows:

    module glob
      real, dimension(:), allocatable :: x
      !$acc mirror( x )
    contains
     subroutine sub( y )
      real, dimension(:) :: y
      !$acc reflected(y)
      !$acc region
        do i = 1, ubound(y,1)
          y(i) = y(i) + x(i)
        enddo
      !$acc end region
     end subroutine
    end module
    subroutine roo( z )
      use glob
      real :: z(:)
      !$acc data region copy(z)
       call sub( z )
      !$acc end data region
    end subroutine

Now the subroutine sub is contained within the module, and the dummy array y has the reflected attribute. The caller, roo, uses the module, making the interface to sub explicit; alternatively, we could have left sub as an external subroutine and used an interface block in roo. At the call site, the compiler knows that the array z has been copied to the GPU, and that the subroutine sub needs both the host address and the GPU address for z. Within the subroutine sub, the compiler knows, because of the reflected clause, that the dummy argument y must have been copied to the GPU by the caller, and so the compute region in the subroutine incurs no data movement for either y (because it's reflected) or x (because it's mirrored).

The reflected directive only applies to dummy argument arrays, and can only be used when the interface is explicit, in Fortran terms. This means the subprogram must appear in a module and the caller must either be in the same module, be in a scope where the module has been USEd, or have a matching interface block to the subprogram with the reflected directive. Another restriction in the current implementation is that the whole array must be copied to the GPU to use the reflected directive.
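
To illustrate the interface block alternative, here is a sketch of the same example with sub as an external subroutine (the mirrored array x is omitted to keep it short); the reflected directive is repeated in the interface block so the call site in roo sees it:

    subroutine sub( y )
      real, dimension(:) :: y
      !$acc reflected(y)
      !$acc region
        do i = 1, ubound(y,1)
          y(i) = 2.0 * y(i)
        enddo
      !$acc end region
    end subroutine

    subroutine roo( z )
      interface
        subroutine sub( y )
          real, dimension(:) :: y
          !$acc reflected(y)
        end subroutine
      end interface
      real :: z(:)
      !$acc data region copy(z)
       call sub( z )
      !$acc end data region
    end subroutine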

As with the mirror directive, the reflected directive is currently implemented only in PGI Accelerator Fortran. Fortran has a well-defined concept of explicit interface, using modules and interface blocks, so the compiler can ensure that the caller and callee agree on argument characteristics. The analog in C is the function prototype, which is less strict and doesn't include a scoping mechanism. Nevertheless, we will be including this feature in PGI Accelerator C during this year, as described in the PGI Accelerator Model document, version 1.3.

CUDA Memory Allocation

PGI Accelerator Fortran and CUDA Fortran have always shared an interesting relationship. CUDA is a lower-level, explicit model that gives expert programmers direct control. The PGI Accelerator model is a high-level, implicit model intended to serve the same role for GPU programming that OpenMP serves for multi-core programming. There is a place for each model; our goal for the PGI Accelerator compiler is to approach and eventually match the performance that an expert programmer could get by writing the equivalent CUDA directly. However, we realize that there are often algorithmic kernels that you can write in CUDA but simply can't express in a directive-based model, so it's best if the two models can work together. In PGI 2011, the two models are more cleanly integrated so they can coexist in the same program, even in the same subprograms. In particular, you can now use CUDA Fortran device data in PGI Accelerator regions.

One problem with the mirror or reflected directives is that they require data to be allocated both on the host and the GPU. In many cases, we have data that we want allocated only on the accelerator itself; in simple terms, we don't want to "pay for the memory twice." CUDA Fortran has the device attribute for variables and arrays, so why not use that attribute on data that we want to allocate only on the GPU? Now you can; we can go back to our running example, and use the CUDA Fortran device attribute for the x array:

    module globdata
      real, dimension(:), allocatable, device :: x
    end module
    module globsub
    contains
     subroutine sub( y )
      use globdata
      real, dimension(:) :: y
      !$acc reflected(y)
      !$acc region
        do i = 1, ubound(y,1)
          y(i) = y(i) + x(i)
        enddo
      !$acc end region
     end subroutine
    end module
    subroutine roo( z )
      use globsub
      real :: z(:)
      !$acc data region copy(z)
       call sub( z )
      !$acc end data region
    end subroutine

We've modified this example a little more to highlight another new feature of PGI 2011: access to CUDA Fortran module data across modules. This works in explicit CUDA Fortran kernel routines as well as in PGI Accelerator regions.

In this example, the array x has the device attribute; it will be allocated only on the GPU. When used in the PGI Accelerator region, that GPU data will be used directly. In fact, if we wanted to and it were appropriate, we could add the device attribute to the dummy arguments z and y, removing the need for the reflected directive and the data region.
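
A sketch of that variant follows (assuming roo's callers now pass an array that already resides in GPU memory, since a device dummy argument requires a device actual argument):

    module globsub
    contains
     subroutine sub( y )
      use globdata
      real, dimension(:), device :: y    ! GPU-resident dummy; reflected is no longer needed
      !$acc region
        do i = 1, ubound(y,1)
          y(i) = y(i) + x(i)
        enddo
      !$acc end region
     end subroutine
    end module
    subroutine roo( z )
      use globsub
      real, device :: z(:)               ! device dummy argument; no data region needed
      call sub( z )
    end subroutine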

CUDA Fortran has the device attribute for arrays and variables, so it's easy to extend PGI Accelerator Fortran to recognize this attribute and retrieve the address of the data on the GPU. C doesn't have such an attribute; even in CUDA C, pointers to device data have the same datatype and attributes as pointers to host data. In an upcoming PGI build—most likely in April—we will add the deviceptr clause for data and compute regions in C. This will allow you to specify that a pointer variable points to data on the device, such as data allocated with cudaMalloc, and use that data in the compute region.

There are some considerations to keep in mind when using the CUDA extensions. First, you have to enable them, either by using the .cuf or .CUF CUDA Fortran file extensions, or by compiling with the -Mcuda flag. Second, some PGI Accelerator model features change behavior. The PGI Unified Binary™ compiled with -ta=nvidia,host will not work, because there is no host mode for the CUDA extensions. Also, as with any CUDA program, the program must be run on a machine with an NVIDIA GPU, even if the part of the program that uses the CUDA extensions is never executed. When a CUDA program starts, it connects to the CUDA driver and looks for a GPU, even before any user code is executed. There are some other, subtler behavior changes, but that's a topic for another article.

Summary

Tuning and optimizing data traffic between the host and GPU has been, and continues to be, important to achieving maximum benefit from the massive performance of the GPU. This is true regardless of which model you use to program the GPU. We have been adding features to the PGI Accelerator programming model and to the compilers themselves to let you manage and tune the data traffic more carefully. For PGI Accelerator Fortran users, the mirror and reflected directives effectively let you extend the range of a data region across procedure boundaries. The reflected directive will be available in PGI Accelerator C later this year. You can also combine CUDA Fortran extensions with PGI Accelerator Fortran to manually control data allocation on the GPU and data movement between the GPU and host, while preserving the productivity benefits of PGI Accelerator directives. You will soon be able to do the same in PGI Accelerator C as well. As always, you should measure the performance of your program to find the bottlenecks; we recommend using pgcollect with pgprof, NVIDIA's cudaprof tool, or compiling with -ta=nvidia,time to enable our simple post-mortem GPU profiler.