Technical News from The Portland Group
CUDA Fortran Data Management
Fortran 2003 provides a rich set of syntactic features for allocating, deallocating, and operating on arrays of data resident in the X64 host memory space. In CUDA Fortran, we've extended those features to allow programmers to manage the separate GPU device memory space while still maintaining the clean, readable syntax familiar to Fortran programmers.
In a previous PGInsider article, "An introduction to CUDA Fortran," we showed an example of compiling and executing a simple CUDA Fortran program.
In this article I describe in depth the choices we made and the trade-offs we faced in designing the CUDA Fortran language. I will give a snapshot of the current features we've implemented, and point out areas that we may extend in the future. In my next article, I will discuss programming CUDA Fortran kernels.
With CUDA Fortran, PGI has provided syntax that is familiar to Fortran programmers, along with type checking that is consistent with what Fortran programmers expect. Given that a body of CUDA code and programmers already exist, we also strove to allow interoperability with CUDA C and an interface comfortable to programmers that already have some familiarity with CUDA programming.
In the simplest form, you can declare, allocate, and transfer data to and from the GPU with just a few extensions to Fortran; to the standard list of Fortran attributes that can be applied to data declarations, such as parameter, public, private, dimension, intent, save, etc., we've added four more in CUDA Fortran: device, constant, shared, and pinned. Initially, the device attribute is the one you will care most about.
    real, device, allocatable :: adev(:), bdev(:)  ! device declarations
    real, dimension(100) :: a                      ! normal host declaration

    allocate(adev(100), bdev(10))                  ! allocate device arrays
    adev = a                                       ! host to device
    bdev = 0.0                                     ! initialize device array
    a(1:10) = bdev                                 ! device to host
The device attribute states that the object in the declaration resides in the device global memory, not in the host memory. This also implies that the data in the object is not directly accessible from the host program. A common programming mistake is to mix up the use of device and host arrays. For this reason, we recommend explicit interfaces for calls passing device arrays, so the compiler can catch mismatches in the arguments at compile time. The general utility of Fortran compile-time checking is the reason we chose to extend the language with a device attribute rather than allow declared data to reside either on the host, the device, or both.
The constant attribute specifies that the data should reside in the device's constant memory. Likewise, the shared attribute is used in device subprograms to declare data for the device multiprocessor shared memory. The pinned attribute is for host data, and specifies that allocations should come from host page-locked memory.
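As a sketch of how the four attributes appear in context (all names here are illustrative, not from the text above): constant data is declared at module scope, shared data inside a device subprogram, and pinned data as an allocatable host array.

```fortran
module sketch
  ! Illustrative declarations only; names are made up for this sketch.
  real, constant :: coef(16)          ! resides in device constant memory
  real, device   :: dbuf(1024)        ! resides in device global memory
contains
  attributes(global) subroutine smooth(x)
    real :: x(*)
    real, shared :: tile(256)         ! per-multiprocessor shared memory
    ! ... kernel body omitted; kernels are covered in the next article
  end subroutine
end module sketch

subroutine host_side(n)
  integer :: n
  real, pinned, allocatable :: hbuf(:)  ! page-locked host memory
  allocate(hbuf(n))                     ! pinned data must be allocated
  ! ...
  deallocate(hbuf)
end subroutine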
Typically, your CUDA Fortran program will allocate some device arrays, copy data over to the device using the array syntax shown above or one of the other methods which I will describe later, invoke a device kernel using CUDA chevron syntax, and move results data back to the host.
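Putting those steps together, a typical host-side sequence might look like the following sketch; the kernel name and launch configuration are placeholders, since kernel programming is the subject of the next article.

```fortran
real, device, allocatable :: adev(:), rdev(:)
real :: a(100), r(100)

allocate(adev(100), rdev(100))   ! 1. allocate device arrays
adev = a                         ! 2. copy input data host-to-device
! 3. launch a device kernel with the chevron syntax
!    (mykernel is a placeholder name)
call mykernel<<<1,100>>>(adev, rdev)
r = rdev                         ! 4. copy results device-to-host
deallocate(adev, rdev)
```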
Some Finer Details
Here are some alternatives to the method of allocating device data shown above:
Data can be allocated "automatically", as in Fortran automatic arrays. For example:
    subroutine sub1(a,n)
      real, device :: adev(n)
      real, device :: bdev(10)
      real, device :: cscalar

These arrays and scalars will be created automatically when sub1 is entered, and the space is freed when sub1 returns.
Fixed-size device data can be declared in a Fortran module, and that data can be accessed by CUDA Fortran kernels contained in that module as well as by host programs that use the module. In a future release of CUDA Fortran, we will allow the module device arrays to be allocatable.
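For example, fixed-size module device data might be declared like this sketch (the module and array names are illustrative):

```fortran
module devdata
  ! Fixed-size device array at module scope; visible both to device
  ! kernels contained in this module and to host code that uses it.
  real, device :: mdev(1000)
end module devdata
```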
Data can be allocated using CUDA API interfaces from the host. For those familiar with the CUDA C runtime calls, CUDA Fortran provides interfaces to those routines. In normal Fortran fashion, counts are in units of the data type, not bytes. So for instance
    allocate(adev(100), bdev(10))

could be replaced with

    istat = cudaMalloc(adev, 100)
    istat = cudaMalloc(bdev, 10)
Note that there are some subtle differences between these two methods. Fortran has some semantics associated with allocatable arrays that are not implemented when using cudaMalloc. For instance, the ALLOCATED() intrinsic will work with the former but not with arrays allocated via cudaMalloc. Also, there are rules of deallocation and reallocation for allocatable arrays that are not followed when cudaMalloc is used. Programmers should not mix ALLOCATE with cudaFree, or cudaMalloc with DEALLOCATE.
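The practical rule is to keep the two styles paired, as in this sketch:

```fortran
real, device, allocatable :: adev(:), bdev(:)
integer :: istat

allocate(adev(100))              ! Fortran-style allocation ...
print *, allocated(adev)         ! ... so ALLOCATED() works
deallocate(adev)                 ! ... and pairs with DEALLOCATE

istat = cudaMalloc(bdev, 10)     ! API-style allocation ...
istat = cudaFree(bdev)           ! ... pairs with cudaFree;
                                 ! ALLOCATED(bdev) is not meaningful here
```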
Multi-dimensional arrays can also be allocated using the CUDA API interfaces to cudaMallocPitch and cudaMalloc3D. These interfaces pad the array leading dimension to maximize aligned memory references on the device. PGI's implementation of ALLOCATE will never pad the arrays; we've generally associated ALLOCATABLE arrays with contiguous memory. Plus, Fortran programmers have been adding their own padding to arrays for years; if that is your intent, you know how to do it.
Similarly, there are CUDA API methods for transferring data between the host and device for one, two, and three dimensional data. So, for instance:
    adev = a
    bdev = 0.0
could be replaced with
    istat = cudaMemcpy(adev, a, 100)
    istat = cudaMemset(bdev, 0.0, 10)
The assignment adev = a will run even faster if a is a pinned array, because one copy is skipped during the transfer. Pinned memory can be specified by adding the pinned attribute to the declaration for a. Note that pinned memory must be allocated using the allocate statement, i.e. you cannot create static pinned data areas.
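Declaring and allocating a pinned host array might look like this sketch; the pinned= specifier shown, which reports whether page-locked allocation actually succeeded, is an assumption based on the CUDA Fortran allocate extensions.

```fortran
real, pinned, allocatable :: a(:)  ! pinned host data must be allocatable
logical :: plog
integer :: istat

allocate(a(100), stat=istat, pinned=plog)
if (.not. plog) print *, 'fell back to ordinary host memory'
a = 1.0
! adev = a   ! a host-to-device transfer from a can skip one copy
deallocate(a)
```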
Though many variations of memcpy and memset are provided, the single most important point for good performance is that you should make every attempt, by staging data on either end, to perform your transfers on large, contiguous data regions. There is an example of that below.
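For instance, moving a strided array section element-by-element can generate many small transfers, while staging it into a contiguous host buffer first yields one large transfer. A sketch (array names are illustrative):

```fortran
real, device, allocatable :: adev(:)
real :: a(2,1000), stage(1000)

allocate(adev(1000))
! Slow: a(1,:) is strided in host memory, so this assignment may
! turn into many small host-to-device transfers:
! adev = a(1,1:1000)
! Faster: stage the strided section contiguously on the host,
! then move it in a single large, contiguous transfer:
stage = a(1,1:1000)
adev = stage
```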
Finally, if you are interfacing with existing CUDA C code, or have other requirements outside of typical Fortran usage, we have implemented interfaces which are more C-like. These are built using the TYPE(C_PTR) derived type from the F2003 iso_c_binding module. Most CUDA Fortran API interfaces can take a TYPE(C_PTR) in addition to the standard supported intrinsic data types; in these cases the counts are in terms of bytes.
We've extended the TYPE(C_PTR) concept to implement TYPE(C_DEVPTR), which corresponds to a C pointer to a device memory location. For example, if you want to interface to existing CUDA C code which uses global device arrays, you can access them via the cudaGetSymbolAddress call. This sets a TYPE(C_DEVPTR) argument to the address, which can then be used in CUDA Fortran, either in place of device arrays in API calls, or "cast" into a Fortran device array through an extension to the c_f_pointer subroutine, which is also part of the iso_c_binding module. For example, this code sequence initializes a CUDA C kernel global device array named vx of length 100:
    type(cudaSymbol) :: csvx
    type(c_devptr) :: cdvx
    real, allocatable, device :: vx(:)

    csvx = 'vx'
    istat = cudaGetSymbolAddress(cdvx, csvx)
    call c_f_pointer(cdvx, vx, 100)
    vx = 0.0
Note that this extension to c_f_pointer() may change in the future when we allow Fortran pointers to have the device attribute, and also support Fortran pointers in device kernels. F2003 does not allow allocatable arrays as the second argument to c_f_pointer().
When porting a code to CUDA Fortran, the first step is to have some idea of the portion of the code you expect to run on the GPU. If you are familiar with the code, you probably have an idea of where time is spent, and where exploiting the parallelism and features of a GPU could lead to better performance. The next thing you will need to consider is which data will reside on the host, and which data needs to reside on the GPU. In Fortran, this will likely mean either which arrays, or which portions of the arrays, need to be on the GPU, and for what duration.
When starting the port, it might be easier to duplicate entire arrays on the GPU. While we were gaining experience, we typically added the letters "dev" to array names. As one of our engineers says, "adev equals a" is pretty easy to remember. Once the code is running, we can go back during an optimization phase and determine if entire copies of both arrays are needed.
As an example, we ported the STREAM benchmark, specifically stream_tuned.f, to CUDA Fortran. The original code declares and initializes the arrays like this:
    C     .. Arrays in Common ..
          DOUBLE PRECISION a(ndim),b(ndim),c(ndim)
          ...
    !$OMP PARALLEL DO
          DO 10 j = 1,n
             a(j) = 2.0d0
             b(j) = 0.5D0
             c(j) = 0.0D0
       10 CONTINUE
This can be ported to CUDA Fortran array syntax:
          DOUBLE PRECISION, DEVICE, ALLOCATABLE :: ADEV(:), BDEV(:), CDEV(:)
          DOUBLE PRECISION HOSTTMP(ndim)
          ...
          ALLOCATE(ADEV(ndim))
          ALLOCATE(BDEV(ndim))
          ALLOCATE(CDEV(ndim))
          ADEV = 2.0d0
          BDEV = 0.5D0
          CDEV = 0.0D0
All the benchmark processing is done on the device, so there is no need to duplicate the data on the host, other than for single array operations. The calls to do the actual operations were converted to calling the CUDA device kernels via the chevron syntax, for instance:
    call stream_copy (c, a, n)

becomes

    call stream_copy <<<(N-1)/512+1,512>>> (CDEV, ADEV, n)
As mentioned before, we'll discuss writing kernels in the next article. This CUDA Fortran STREAM benchmark source code is available for download from the PGI website.
As you can see from this output, the NVIDIA GPU has great internal memory bandwidth once you get the data resident on the device: the STREAM rate is many times what you will see on a single X64 socket.
    % pgfortran -Mcuda -Mfixed -O2 stream_cudafor.f
    % ./a.out
    ----------------------------------------------
     Double precision appears to have 16 digits of accuracy
     Assuming 8 bytes per DOUBLE PRECISION word
    ----------------------------------------------
     Array size = 20000000   Offset = 0
     The total memory requirement is 457 MB
     You are running each test 10 times
     --
     The *best* time for each test is used
     *EXCLUDING* the first and last iterations
    ----------------------------------------------------
     Your clock granularity/precision appears to be 1 microseconds
    ----------------------------------------------------
    Function     Rate (MB/s)   Avg time     Min time     Max time
    Copy:       121340.0756      0.0027       0.0026       0.0027
    Scale:      122529.5038      0.0026       0.0026       0.0026
    Add:        108298.5612      0.0045       0.0044       0.0045
    Triad:      107965.7078      0.0045       0.0044       0.0045
    ----------------------------------------------------
     Solution Validates!
    ----------------------------------------------------
With all these data transfer options, it's important to recognize how each actually works and performs. For instance, in the STREAM benchmark the final verification step is to sum up the array elements and check that the results are correct. With the arrays resident on the GPU device, you could choose to code the final step like this:
          SUBROUTINE checksums(a,b,c,n,ntimes)
    C     .. Arguments ..
          DOUBLE PRECISION, DEVICE :: a(n),b(n),c(n)
          INTEGER n,ntimes
    C     ..
    C     .. Local Scalars ..
          DOUBLE PRECISION aa,bb,cc,scalar,suma,sumb,sumc,epsilon
          INTEGER k
          ...
    C     Now sum up the arrays, excluding the first and last
    C     elements, which are modified using the timing results
    C     to confuse aggressive optimizers.
          suma = 0.0D0
          sumb = 0.0D0
          sumc = 0.0D0
          do j = 2, n-1
             suma = suma + a(j)
             sumb = sumb + b(j)
             sumc = sumc + c(j)
          end do
However, given the size of the arrays used in STREAM, you'll find that this loop executes very slowly, as it is implemented as (n-2)*3 separate transfers from the GPU to the host memory. While later releases of CUDA Fortran may optimize this, currently, it is better to create your own temporary array and code the operation like this:
          DOUBLE PRECISION hosttmp(n)  ! create an automatic array on host
    C     Now sum up the arrays, excluding the first and last
    C     elements, which are modified using the timing results
    C     to confuse aggressive optimizers.
          hosttmp = a
          suma = sum(hosttmp(2:n-1))
          hosttmp = b
          sumb = sum(hosttmp(2:n-1))
          hosttmp = c
          sumc = sum(hosttmp(2:n-1))
As language designers with strong applications background, we've also discussed supporting, at some point in the future, transformational and reduction operations using the PGI Accelerator model on device arrays:
    ! This support is under consideration for future implementation:
    !$acc region
          suma = sum(a(2:n-1))
          sumb = sum(b(2:n-1))
          sumc = sum(c(2:n-1))
    !$acc end region
Reductions and other operations that require global memory synchronization on the device can be difficult for scientists and engineers (even long-time programmers!) to write on a GPU. We will work to provide efficient implementations of the F2003 operations wherever possible, to ease the burden on programmers looking for quick paths to peak performance of their code.
Thoughts and feedback on these issues are welcome. Please email me at firstname.lastname@example.org.