Technical News from The Portland Group
The PGI Accelerator Programming Model on NVIDIA GPUs
Part 4: New and Upcoming Features
This article discusses the new features in the PGI Accelerator programming model introduced in the PGI 2010 release, some additions that will be coming with the March (10.3) and April (10.4) updates, and looks at future directions of the model and directives. See the Additional Resources section at the PGI Accelerator Compilers web page for introductory information about the model, including links to the previous three PGInsider articles.
The most important new feature in PGI 2010 is that we now have two kinds of regions in the model: a compute region and a data region. A compute region is what we were just calling an accelerator region. We enclose the loops that we want executed on the accelerator or GPU in a compute region. In C, this is written as a #pragma acc region followed by a structured block in braces; in Fortran, we surround the region by !$acc region and !$acc end region directives. For example, we might enclose a weighted five-point stencil operation in Fortran as:
!$acc region do i = 2, n-1 do j = 2, m-1 b(i,j) = 0.25*w(i)*(a(i-1,j)+a(i,j-1)+ & a(i+1,j)+a(i,j+1)) & +(1.0-w(i))*a(i,j) enddo enddo !$acc end region
The compiler analyzes the loops and array accesses in the region, determines what data to send to the device and bring back to the host, and generates the appropriate code for both the accelerator and the host to implement this. When we compile this with pgfortran -ta=nvidia -Minfo=accel, we see the following informational messages, where line 7 is the region directive:
s1: 7, Generating copyin(w(2:n-1)) Generating copyin(a(1:n,1:m)) Generating copyout(b(2:n-1,2:m-1)) 8, Loop is parallelizable 9, Loop is parallelizable Accelerator kernel generated 8, !$acc do parallel, vector(16) Cached references to size  block of 'w' 9, !$acc do parallel, vector(16) Cached references to size [18x18] block of 'a'
This tells me, the programmer, that the compiler is sending a subvector of w and the array a to the device, and bringing the modified portion of b back to host memory. It also tells me the loop parallelism schedule: that the loops are executed in a 16x16 thread block (do vector(16)), and both loops are parallelized across the grid of thread blocks (do parallel)
However, there is a serious problem with this particular program. The loop does roughly 8*n*m operations, but transfers roughly 8*n*m bytes to do it. The data transfer between the host and the GPU will dominate any performance advantage gained from the parallelism on the GPU. Moreover, such a loop is typically written as part of an iterative solver. We certainly don't want to send data between the host and GPU for each iteration of the solver.
Enter the data region. A data region looks similar to a compute region, but it only defines where to send data between the host and the GPU. In an iterative solver, we would put the data region outside the iteration loop, and the compute region around the computational kernel. The more complete example is:
!$acc data region copy(a(1:n,1:m)) local(b(2:n-1,2:m-1)) copyin(w(2:n-1)) do while(...) !$acc region do i = 2, n-1 do j = 2, m-1 b(i,j) = 0.25*w(i)*(a(i-1,j)+a(i,j-1)+ & a(i+1,j)+a(i,j+1)) & +(1.0-w(i))*a(i,j) enddo enddo do i = 2, n-1 do j = 2, m-1 a(i,j) = b(i,j) enddo enddo !$acc end region enddo !$acc end data region
Now the data are copied to the GPU at entry to the data region, and the results copied back to the host at exit of the data region. Inside the while loop, there is essentially no data movement between the host and the GPU. This will run several times faster than the original program, because the data movement is amortized over many compute region executions.
A data region will typically contain one or more compute regions; data used in a compute region that was moved to the accelerator in an enclosing data region directive are not moved at the boundaries of the compute region. Two new data clauses are implemented in the model: updatein and updateout. You would add an updatein clause to a compute region directive for data that was allocated on the GPU in an enclosing data region, but which has been updated on the host. This tells the compiler the array, or what parts of the arrays need to be copied from the host to the GPU at entry to the compute region. Similarly, you would add an updateout clause to a compute region when you have data allocated on the GPU in an enclosing data region, but you want some or all of the host copy of that array updated at the exit of the compute region.
A data region directive can include any of the data clauses that can appear on a region directive: copy, copyin, copyout, and local. Because a data region can also be contained in other data regions, the directive can also include updatein and updateout clauses.
In addition, there are updatein and updateout directives; these work just like the clauses on a data or compute region directive, but they can occur almost anywhere inside a data region (except it can't be the statement after an if or while or similar construct in C, or after a logical if in Fortran).
Going back to the stencil example above, the outer while loop typically iterates until some convergence criterion is reached. A common criterion is to compute a residual as the sum of the squares of the differences between the old and new values, and to stop when the residual reaches a small value, indicating that the computation has converged. This could be written as follows:
!$acc data region copy(a(1:n,1:m)) local(b(2:n-1,2:m-1)) copyin(w(2:n-1)) do while(resid .gt. tol) resid = 0.0 !$acc region do i = 2, n-1 do j = 2, m-1 b(i,j) = 0.25*w(i)*(a(i-1,j)+a(i,j-1)+ & a(i+1,j)+a(i,j+1)) & +(1.0-w(i))*a(i,j) enddo enddo do i = 2, n-1 do j = 2, m-1 resid = resid + (b(i,j)-a(i,j))**2 a(i,j) = b(i,j) enddo enddo !$acc end region enddo !$acc end data region
In fact, the computation of resid inside the loop involves a loop-carried dependence cycle; it uses the value from the previous iteration of the loop to compute the value for the current iteration. Normally, loop-carried dependence cycles prevent parallelization. However, this is a summation, and mathematically, the order of accumulation doesn't matter. Of course, we all know that computationally the order does matter. Arithmetical addition is commutative and associative, but computer floating point arithmetic is not; see my HPCwire article on this subject if that interests you. However, unless we took great care with the order of the accumulation (which it appears we did not), there's really no reason to believe that the answer computed with the order shown is any better or more accurate than the answer computed with some other order. So, parallelizing and vectorizing compilers have long rearranged the operations for summations and other reduction operations.
Now, the PGI 2010 compilers also recognize and generate optimized code for reductions appearing in PGI Accelerator compute regions. When compiling this program with informational messages enabled, you'll see a line for the reduction:
22, Loop is parallelizable 23, Loop is parallelizable Accelerator kernel generated 22, !$acc do parallel, vector(16) 23, !$acc do parallel, vector(16) Using register for 'b' 24, Sum reduction generated for resid
In fact, an additional kernel is generated for the final accumulation; you'll see this if you run with ‑ta=nvidia,time or with profiling enabled.
Profiling with the PGI Accelerator programming model has been integrated more closely with our other profiling tools. With our 9.0 release, you had to use ‑ta=nvidia,time to get any information about the performance of the GPU parts of the program. Now, you can simply run your program under control of the pgcollect tool:
% pgcollect a.out arg1 arg2...
The pgcollect tool runs your program and enables sample-based profiling for the host code and profile data collection for the GPU code. It writes the profile data to a file named pgprof.out. You can save or rename this file; it will get overwritten each time you run a program with pgcollect. You then display the results with
% pgprof pgprof.out -exe a.out
The PGPROF GUI has a new Accelerator tab that shows the time spent in the region, in transferring data, and in GPU computation.
Currently, PGPROF will not show details about execution on the device. As mentioned in a previous article, you can run your program under cudaprof to get GPU execution details, such as the number of divergent branches, global memory operations, and more. Also see the companion article in this issue Using PGPROF and the CUDA Visual Profiler to Profile GPU Applications.
New API Routines
PGI 2010 includes several new routines, most of which are useful for performance analysis.
Calling acc_shutdown(acc_device_nvidia) will free all memory on the device, destroy the active context, and free up all resources associated with the GPU. This might be used if you have several phases in your program, and the accelerator is only used in one or a few phases. It would free up GPU resources for use by another user. Note that the GPU resources, such as memory, are not virtualized; when you allocate memory, it's real memory on the device, and you can run out.
Memory query routines
The call mem = acc_get_memory() returns the total amount of memory on the GPU, whereas the call mem = acc_get_free_memory() returns the free (available) memory.
Performance analysis routines
We have included several routines to help you track how many operations of different types are executed.
- acc_regions() returns the number of compute regions executed.
- acc_kernels() returns the number of kernels executed; recall that any reductions are implemented with an extra kernel.
- acc_allocs() returns the number of arrays allocated on the GPU.
- acc_frees() returns the number of arrays deallocated on the GPU; this should equal the number of arrays allocated.
- acc_copyins() returns the number of arrays copied into or updated to the GPU.
- acc_copyouts() returns the number of arrays copied out from or updated from the GPU.
- acc_bytesalloc() returns the number of bytes allocated on the GPU.
- acc_bytesin() returns the number of bytes copied into or updated into the GPU.
- acc_bytesout() returns the number of bytes copied out from or updated from the GPU.
Other Minor Enhancements
While we've made several other enhancements and updates, two others are worth mentioning. We now support the Fortran complex datatype, which was not allowed in PGI 9.0. And we now support a new loop directive clause; in Fortran, this is:
!$acc do independent
and it appears in C as:
#pragma acc for independent
The independent clause tells the compiler to ignore its dependence analysis, and to trust you, the programmer, that the loop iterations are in fact completely independent. If you know the loop is fully parallel, but the compiler analysis isn't strong enough, adding this clause should allow the loop to be scheduled in parallel. Caveat user: this is very dangerous.
CUDA C and CUDA Fortran Interoperability
We've engineered the PGI Accelerator compilers to generate code that can coexist with CUDA C and CUDA Fortran routines as well. CUDA supports two programming API libraries; CUDA Fortran and most CUDA C programs use the Runtime API; this supports the CUDA chevron syntax (<<<gridsize,blocksize>>>) and automatic loading of your kernel code to the GPU. CUDA C also supports a lower-level Driver API. Routines that use the Runtime API can't exist in the same program as routines that use the Driver API.
The PGI Accelerator compiler generates code that can be used with either API. The default is to use the lower level Driver API, for control and efficiency. However, if you link the program with the ‑Mcuda flag, as you would when linking CUDA Fortran objects, the program will use the Runtime API. If you are linking your PGI Accelerator programming model routines with other CUDA C routines, you should use the ‑Mcuda flag if the CUDA C routines use the Runtime API.
Make sure to keep your subscription up-to-date to get the latest releases, which come out every month or two. As I write this, we are working on our 10.3 release, scheduled for early March, and planning for what comes after that.
We are continuously working to make minor and occasionally major improvements to the compilers. One of the components of the PGI Accelerator compiler is the Planner, which maps the loop parallelism to the device parallelism. This is the part of the compiler that chooses loops to run in parallel and vector mode, the shape of the thread block, arrays to cache, and so on. We are presenting an article on the Planner at GPGPU-3 in Pittsburgh this March. The Planner is under continual improvement. For instance, one of the goals of the Planner is to optimize memory reference patterns; it tries to map the loop with the most stride-1 memory references to the threadIdx.x dimension, to improve memory bandwidth. It turns out the memory reference patterns also matter to the grid dimensions. Look at the document Optimizing Matrix Transpose in CUDA, available as a PDF file in the CUDA SDK from NVIDIA. The authors explain that the device memory is divided into partitions; the best device memory performance results when different warps and different thread blocks that are simultaneously active access different partitions, spreading the work across the whole device memory. We have recently tuned the Planner to optimize for this, and see some very nice performance improvements in certain cases.
One of the new features for the 10.3 release will affect many PGI Accelerator programming model users. Our compilers can use PGI Unified Binary technology to optimize for different host processors; for instance, we can optimize a version for Intel Nehalem, with SSE3 and SSE4.1 instructions, along with a version for AMD Barcelona, which has its own SSE4A instruction set. At run time, the program will determine the actual CPU type on which it is running and choose the appropriate version. We will soon use similar technology for the NVIDIA GPU code; NVIDIA's nvcc supports something similar, called a fat binary.
Right now, when invoked with ‑ta=nvidia enabled, the compiler will by default generate code for an NVIDIA compute capability 1.3 device, such as an NVIDIA Tesla. Unfortunately, this code will not run on lower capability GPUs, such as those found in laptops or lower-cost cards. You can override the default by adding a compute capability option to the command line, such as ‑ta=nvidia,cc10 or change the default by adding a line set COMPUTECAP=10; to the siterc file in your installation. With the 10.3 compilers, these will no longer be necessary. The compiler will, by default, generate code for compute capability 1.3, and will also generate code for compute capability 1.0, if the code is supported by 1.0. Double-precision requires compute capability 1.3, so any code with double-precision will only generate the 1.3 version. As before, you can override this behavior by setting an explicit compute capability option on the command line, or by setting a single default compute capability in the siterc file.
As an aside, the same feature will be used with CUDA Fortran. By default, every CUDA Fortran kernel will be generated as a compute capability 1.0 and a 1.3 kernel; if there are operations not supported at capability 1.0, the compiler will try compute capability 1.1 or 1.2, or only generate the capability 1.3 kernel. For instance, atomic operations were introduced in compute capability 1.1, and warp vote routines are supported only compute capability 1.2 and above.
Which brings us to NVIDIA Fermi support. As are all of you, we are anxiously awaiting the release of the NVIDIA Fermi GPUs and accelerators. Fermi will require use of the CUDA 3.0 drivers and toolkit. As before, we will provide the required CUDA toolkit components, and we will support both backward compatibility for users still using the CUDA 2.3 toolkit, and immediate support for those who are ready with CUDA 3.0. When Fermi is released, it will be supported by generating an additional version of the GPU code, tuned for the Fermi. The default will be to package all these versions together, so your program will run on Fermi or Tesla, or even lower capability cards as long as you aren't using double-precision.