Technical News from The Portland Group
OpenACC Features in PGI Accelerator Fortran Compilers—Part 1

This is an update of a previous article about the PGI Accelerator programming model for GPUs.
GPUs have a very high compute capacity, and the latest designs have made them even more programmable and useful for tasks other than just graphics. Research on using GPUs for general purpose computing has gone on for several years, but it was only when NVIDIA introduced the CUDA Toolkit in 2007, including a compiler with extensions to C, that GPU computing became useful without heroic effort. Yet, while CUDA is a big step towards GPU programming, it requires significant rewriting and restructuring of your program, and if you want to retain the option of running on an X64 host, you must maintain both the GPU and CPU program versions separately.
In November 2011, PGI, along with Cray, NVIDIA and CAPS Entreprise, introduced the OpenACC API, a directive-based method for host+accelerator programming, allowing us to treat the GPU as an accelerator. The OpenACC specification shares many features of the PGI Accelerator programming model, which we introduced in 2009. This article describes OpenACC features for those of you who have little or no experience with GPU programming, and no experience with the PGI Accelerator programming model. If you have used the PGI Accelerator directives, and specifically want information about the differences between the PGI Accelerator directives and OpenACC directives, please read the companion article instead.
The OpenACC API uses directives and compiler analysis to compile natural Fortran for the GPU; this often allows you to maintain a single source version, since ignoring the directives will compile the same program for the X64 CPU.
PGI is introducing full support for the OpenACC 1.0 specification with the 12.6 release of its compilers. Some OpenACC features are available in earlier releases beginning with version 12.3.
Let's start by looking at accelerators, GPUs, and the current NVIDIA Fermi GPU in particular, because it is the first target for our compiler. An accelerator is typically implemented as a coprocessor to the host; it has its own instruction set and usually (but not always) its own memory. To the hardware, the accelerator looks like another IO unit; it communicates with the CPU using IO commands and DMA memory transfers. To the software, the accelerator is another computer to which your program sends data and routines to execute. Many accelerators have been produced over the years; with current technology, an accelerator fits on a single chip, like a CPU. Besides GPUs, other accelerators being considered today are the forthcoming Intel MIC, the AMD APU, and FPGAs. Here we focus on the NVIDIA family of GPUs; a picture of the relevant architectural features is shown below.
NVIDIA Fermi GPU Accelerator Block Diagram
The key features are the processors, the memory, and the interconnect. The NVIDIA GPUs have (currently) up to 16 multiprocessors; each multiprocessor has two SIMD units, each unit with 16 parallel thread processors. The thread processors run synchronously, meaning all thread processors in a SIMD unit execute the same instruction at the same time. Different multiprocessors run asynchronously, much like commodity multicore processors.
The GPU has its own memory, usually called device memory; this can range up to 6GB today. As with CPUs, access time to the memory is quite slow. CPUs use caches to try to reduce the effect of the long memory latency, by caching recently accessed data in the hopes that the future accesses will hit in the cache. Caches can be quite effective, but they don't solve the basic memory bandwidth problem. GPU programs typically require streaming access to large data sets that would overflow the size of a reasonable cache. To solve this problem, GPUs use multithreading. When the GPU processor issues an access to the device memory, that GPU thread goes to sleep until the memory returns the value. In the meantime, the GPU processor switches over very quickly, in hardware, to another GPU thread, and continues executing that thread. In this way, the GPU exploits program parallelism to keep busy while the slow device memory is responding.
While the device memory has long latency, the interconnect between the memory and the GPU processors supports very high bandwidth. In contrast to a CPU, the memory can keep up with the demands of data-intensive programs; instead of suffering from cache stalls, the GPU can keep busy, as long as there is enough parallelism to keep the processors busy.
Current approaches to programming GPUs include NVIDIA's CUDA and the open standard language OpenCL. Over the past several years, there have been many success stories using CUDA to port programs to NVIDIA GPUs. The goal of OpenCL is to provide a portable mechanism to program different GPUs and other parallel systems. Is there need or room for another programming strategy?
The cost of programming using CUDA or OpenCL is the initial programming effort to convert your program into the host part and the accelerator part. Each routine to run on the accelerator must be extracted to a separate kernel function, and the host code must manage device memory allocation, data movement, and kernel invocation. The kernel itself may have to be carefully optimized for the GPU or accelerator, including unrolling loops and orchestrating device memory fetches and stores. While CUDA and OpenCL are much, much easier programming environments than what was available before 2007, and both allow very detailed low-level optimization for the GPU, they are a long way from making it easy to program and experiment.
To address this, the OpenACC consortium has come up with the OpenACC API, implemented as a set of directives and API runtime routines. Using these, you can more easily start and experiment with porting programs to GPUs, letting the compiler do much of the bookkeeping. The resulting program is more portable, and in fact can run unmodified on the CPU itself. In this first tutorial installment, we will show some initial programs to get you started, and explore some of the features of the model. As we will see in the next installment, this model doesn't make parallel programming easy, but it does reduce the cost of entry; we will also explore using the directives to tune performance.
You need the right hardware and software to start using the PGI Accelerator compilers. First, you need a 64-bit x86 system with a Linux distribution supported both by PGI and NVIDIA; these include recent RHEL, SLES, OpenSUSE, Fedora, and Ubuntu distributions. See the PGI release support page and the Download CUDA page on the NVIDIA web site for currently supported distributions. Your system needs a CUDA-enabled NVIDIA graphics or Tesla card; see the CUDA-Enabled Products page on the NVIDIA web site for a list of appropriate cards. You need to install the PGI compilers, which come with the necessary CUDA toolkit components. From NVIDIA, you'll want a recent CUDA driver, available from the Download CUDA page on the NVIDIA site.
Let's assume you've installed the PGI compilers under the default location /opt/pgi. The PGI installation has two additional directory levels. The first corresponds to the target (linux86 for 32-bit, linux86-64 for 64-bit). The second level has a directory for the PGI version (12.3 or 12.6, for example) and another PGI release directory containing common components (2012). If the compilers are installed, you're ready to test your accelerator connection. Try running the PGI-supplied tool pgaccelinfo. If you have everything set up properly, you should see output similar to this:
CUDA Driver Version:           4010
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  285.05.33  Thu Jan 19 14:07:02 PST 2012

Device Number:                 0
Device Name:                   Quadro 6000
Device Revision Number:        2.0
Global Memory Size:            6441992192
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             1494 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1134669 microseconds
Current free memory:           6367731712
Upload time (4MB):             2605 microseconds (1308 ms pinned)
Download time:                 2929 microseconds (1382 ms pinned)
Upload bandwidth:              1610 MB/sec (3206 MB/sec pinned)
Download bandwidth:            1431 MB/sec (3034 MB/sec pinned)
This first tells you the CUDA driver version information. It tells you that there is a single device, number zero; it's an NVIDIA Quadro 6000 with compute capability 2.0, 6GB of memory and 14 multiprocessors. You might have more than one GPU installed; perhaps you have a small GPU on the motherboard and a larger GPU or Tesla card in a PCI slot. pgaccelinfo will give you information about each one it can find. On the other hand, if you see the message:
No accelerators found. Try pgaccelinfo -v for more information
then you probably don't have the right hardware or drivers installed.
Now you're ready to start your first program.
We're going to show several simple example programs; we encourage you to try each one yourself. These examples are all available for download from the PGI web site.
We'll start with a very simple program; it will send a vector of floats to the GPU, double it, and bring the results back. In Fortran, the whole program is:
program main
  integer :: n        ! size of the vector
  real,dimension(:),allocatable :: a, r, e
  integer :: i
  character(10) :: arg1
  if( command_argument_count() .gt. 0 )then
    call get_command_argument( 1, arg1 )
    read(arg1,'(i10)') n
  else
    n = 100000
  endif
  if( n .le. 0 ) n = 100000
  allocate(a(n), r(n), e(n))
  do i = 1,n
    a(i) = i*2.0
  enddo
  !$acc kernels loop
  do i = 1,n
    r(i) = a(i) * 2.0
  enddo
  do i = 1,n
    e(i) = a(i) * 2.0
  enddo
  ! check the results
  do i = 1,n
    if( r(i) .ne. e(i) )then
      print *, i, r(i), e(i)
      stop 'error found'
    endif
  enddo
  print *, n, 'iterations completed'
end program
We prefixed the loop we want sent to the GPU by a kernels loop directive. This tells the compiler to find the parallelism in the loop, move the data over to the GPU, launch the operation on the GPU, then bring the results back. For this program, it's as simple as that.
Build this with the command:
pgfortran -o acc_f1.exe acc_f1.f90 -acc -Minfo
Note the ‑acc and ‑Minfo flags. The ‑acc enables the OpenACC directives in the compiler; by default, the PGI compilers will target the accelerator regions for the NVIDIA GPU. We'll show other options in later examples. The ‑Minfo flag enables informational messages from the compiler; we'll enable this on all our builds, and explain what the messages mean. You're going to want to understand these messages when you start to tune for performance.
If everything is installed and licensed correctly you should see the following informational messages from pgfortran:
main:
     17, Generating copyin(a(1:n))
         Generating copyout(r(1:n))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     18, Loop is parallelizable
         Accelerator kernel generated
         18, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
Let's explain a few of these messages. The first:

Generating copyin(a(1:n))
tells you that the compiler determined that the array a is used only as input to the loop, so those n elements of array a need to be copied from the CPU memory to the GPU device memory; this is a copyin to the device memory. Because they aren't modified, they don't need to be brought back. The second message:

Generating copyout(r(1:n))
tells you that the array r is assigned, but never read inside the loop; the values from the CPU memory don't need to be sent to the GPU, but the modified values need to be copied back. This is a copyout from the device memory. Below that is the message:

Loop is parallelizable
This tells you that the compiler analyzed the references in the loop and determined that all iterations could be executed in parallel. The next message is the most important:

Accelerator kernel generated
This tells you that the compiler successfully converted the body of that loop to a kernel for the GPU. The kernel is the GPU function itself, created by the compiler, which will be called by the program and executed in parallel on the GPU. We'll discuss the next message, and others that you'll see, in more detail in the next installment.
So now you're ready to run the program. Assuming you're on the machine with the GPU, just type the name of the executable, acc_f1.exe. If you get a message
libcuda.so not found, exiting
then you must not have installed the CUDA software in its default location, /usr/lib. You may have to set the environment variable LD_LIBRARY_PATH. What you should see is just the final output
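If you do need to set LD_LIBRARY_PATH, something like the following works; the directory shown is only an example, and you should substitute wherever your NVIDIA driver actually installed libcuda.so:

```shell
# Point the dynamic loader at the CUDA driver library before running.
# /usr/lib64/nvidia is an assumed location; check where your driver
# installed libcuda.so on your system.
export LD_LIBRARY_PATH=/usr/lib64/nvidia:$LD_LIBRARY_PATH
```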
100000 iterations completed
How do you know that anything executed on the GPU? You can set the environment variable ACC_NOTIFY to 1:
csh:  setenv ACC_NOTIFY 1
bash: export ACC_NOTIFY=1
then run the program; it will then print out a line each time a GPU kernel is launched. In this case, you'll see something like:
launch kernel file=acc_f1.f90 function=main line=18 device=0 grid=391 block=256
which tells you the file, function, and line number of the kernel, and the CUDA grid and thread block dimensions. You probably don't want to leave this set for all your programs, but it's instructive and useful during program development and testing.
Our first program was pretty trivial, just enough to get a test run. Let's take on a slightly more interesting program, one that has more computational intensity on each iteration. In Fortran, the program is:
program main
  use openacc
  integer :: n        ! size of the vector
  real,dimension(:),allocatable :: a  ! the vector
  real,dimension(:),allocatable :: r  ! the results
  real,dimension(:),allocatable :: e  ! expected results
  integer :: i
  integer :: c0, c1, c2, c3, cgpu, chost
  character(10) :: arg1
  if( command_argument_count() .gt. 0 )then
    call get_command_argument( 1, arg1 )
    read(arg1,'(i10)') n
  else
    n = 100000
  endif
  if( n .le. 0 ) n = 100000
  allocate(a(n))
  allocate(r(n))
  allocate(e(n))
  do i = 1,n
    a(i) = i*2.0
  enddo
  !call acc_init( acc_device_nvidia )
  call system_clock( count=c1 )
  !$acc kernels loop
  do i = 1,n
    r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
  enddo
  call system_clock( count=c2 )
  cgpu = c2 - c1
  do i = 1,n
    e(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
  enddo
  call system_clock( count=c3 )
  chost = c3 - c2
  ! check the results
  do i = 1,n
    if( abs(r(i) - e(i)) .gt. 0.000001 )then
      print *, i, r(i), e(i)
    endif
  enddo
  print *, n, ' iterations completed'
  print *, cgpu, ' microseconds on GPU'
  print *, chost, ' microseconds on host'
end program
It reads the first command line argument as the number of elements to compute, allocates arrays, runs a kernel to compute floating point sine and cosine, and compares to the host. Some details to note:
- There is a call to acc_init() commented out; you'll uncomment that shortly.
- There are calls to system_clock() to read the real time clock time.
- The program doesn't compare for equality, it compares against a tolerance. We'll discuss that as well.
Now let's build and run the program; you'll compare the speed of your GPU to the speed of the host. Build as you did before, and you should see messages much like you did before. You can view just the accelerator messages by replacing ‑Minfo by ‑Minfo=accel on the compile line.
The first time you run this program, you'll see output something like:
100000 iterations completed
1367529 microseconds on GPU
1348 microseconds on host
So what's this? Over a second on the GPU? Only 1.3 milliseconds on the host? What's the deal?
Let's explore this a little. If I enclose the program from the first call to the timer to the last print statement in another loop that iterates three times, I'll see something more like the following:
100000 iterations completed
1363251 microseconds on GPU
1897 microseconds on host
100000 iterations completed
621 microseconds on GPU
1374 microseconds on host
100000 iterations completed
554 microseconds on GPU
1230 microseconds on host
The time on the GPU is very long for the first iteration, then is much faster after that. The reason is the overhead of connecting to the GPU. On Linux, if you run on a GPU without a display connected, it can take 1 to 1.5 seconds to make that initial connection, the first time the first kernel executes. You can set the environment variable PGI_ACC_TIME to 1 before running your program; this directs the runtime to collect the time spent in GPU initialization, data movement, and kernel execution. If we set the environment variable, we'll get additional profile information:
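Setting PGI_ACC_TIME follows the same pattern as ACC_NOTIFY above:

```shell
# Enable the runtime's accelerator profile for subsequent runs
# (bash syntax; under csh use: setenv PGI_ACC_TIME 1).
# Unset the variable to turn the profile off again.
export PGI_ACC_TIME=1
```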
100000 iterations completed
1367529 microseconds on GPU
1348 microseconds on host

Accelerator Kernel Timing data
acc_f2.f90
  main
    25: region entered 1 time
        time(us): total=1367517 init=1366590 region=927
                  kernels=60 data=543
        w/o init: total=927 max=927 min=927 avg=927
        26: kernel launched 1 times
            grid:   block: 
            time(us): total=60 max=60 min=60 avg=60
The timing data tells us that the accelerator region at line 25 was entered once and took a total of 1.367 seconds. Of that, 1.366 seconds was spent in initialization. The actual execution time was 0.9 milliseconds. Of that time, 543 microseconds was spent moving data back and forth (copying the a and r arrays to and from the GPU), and only 60 microseconds was spent executing the kernel.
So, let's take the initialization out of the timing code altogether. Uncomment the call to acc_init() in your program, rebuild and then run the program. You should see output more like:
100000 iterations completed
902 microseconds on GPU
1333 microseconds on host
The GPU time still includes the overhead of moving data between the GPU and host. Your times may differ, even substantially, depending on the GPU you have installed (particularly the Number of Multiprocessors reported by pgaccelinfo) and the host processor. These runs were made on a 2.4GHz AMD Athlon 64.
To see some more interesting performance numbers, try increasing the number of loop iterations. The program defaults to 100000; increase this to 1000000. On my machine, I see
1000000 iterations completed
4168 microseconds on GPU
36536 microseconds on host
Note the GPU time increases by about a factor of 5, less than you would expect; the host time increases by quite a bit more. That's because the host is sensitive to cache locality. The GPU has a very high bandwidth device memory; it uses the extra parallelism that comes from the 1,000,000 parallel iterations to tolerate the long memory latency, so there's no performance cliff.
So, let's go back and look at the reason for the tolerance test, instead of an equality test, for correctness. The fact is that the GPU doesn't compute to exactly the same precision as the host. In particular, some transcendentals and trigonometric functions may be different in the low-order bit. You, the programmer, have to be aware of the potential for these differences, and if they are not acceptable, you may need to wait until the GPUs implement full host equivalence. Before the adoption of the IEEE floating point arithmetic standard, every computer used a different floating point format and delivered different precision, so this is not a new problem, just a new manifestation.
Your next assignment is to convert this second program to double precision. You might compare the results to find the maximum difference. We're computing sin^2 + cos^2, which, if I remember my high school geometry, should equal 1.0 for all angles. So, you can compare the GPU and host computed values against the actual correct answer as well.
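A minimal sketch of that double-precision variant might look like the following; the program name, tolerance handling, and printed labels here are our choices for illustration, not part of the original example:

```fortran
! Sketch: double-precision version of the sin**2 + cos**2 test.
! Since sin(x)**2 + cos(x)**2 is exactly 1.0 for every angle, we can
! report both the GPU/host difference and the deviation from 1.0.
program main_dp
  real(8), dimension(:), allocatable :: a, r, e
  real(8) :: maxdiff
  integer :: i, n
  n = 100000
  allocate(a(n), r(n), e(n))
  do i = 1,n
    a(i) = i*2.0d0
  enddo
  !$acc kernels loop
  do i = 1,n
    r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
  enddo
  do i = 1,n
    e(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
  enddo
  ! maximum difference between GPU and host results
  maxdiff = 0.0d0
  do i = 1,n
    maxdiff = max( maxdiff, abs(r(i) - e(i)) )
  enddo
  print *, 'max GPU/host difference: ', maxdiff
  print *, 'max deviation from 1.0:  ', maxval(abs(r - 1.0d0))
end program
```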
Here, we'll explore writing a slightly more complex program, and try some other options, such as building it to run on either the GPU, or on the host if you don't have a GPU installed. We'll look at a simple Jacobi relaxation on a two-dimensional rectangular mesh. In Fortran, the relaxation routine we'll use is:
subroutine smooth( a, b, w0, w1, w2, n, m, niters )
  real, dimension(:,:) :: a,b
  real :: w0, w1, w2
  integer :: n, m, niters
  integer :: i, j, iter
  do iter = 1,niters
    !$acc kernels loop
    do i = 2,n-1
      do j = 2,m-1
        a(i,j) = w0 * b(i,j) + &
                 w1 * (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) + &
                 w2 * (b(i-1,j-1) + b(i-1,j+1) + b(i+1,j-1) + b(i+1,j+1))
      enddo
    enddo
    !$acc kernels loop
    do i = 2,n-1
      do j = 2,m-1
        b(i,j) = w0 * a(i,j) + &
                 w1 * (a(i-1,j) + a(i,j-1) + a(i+1,j) + a(i,j+1)) + &
                 w2 * (a(i-1,j-1) + a(i-1,j+1) + a(i+1,j-1) + a(i+1,j+1))
      enddo
    enddo
  enddo
end subroutine
This particular implementation executes a fixed number of iterations, executing two kernels for each iteration. We can build these routines as before, and we'll get a set of informational messages as before:
smooth:
     24, Generating copyin(b(:n,:m))
         Generating copyout(a(2:n-1,2:m-1))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     25, Loop is parallelizable
     26, Loop is parallelizable
         Accelerator kernel generated
         25, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
         26, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
         CC 1.0 : 15 registers; 100 shared, 8 constant, 0 local memory bytes; 66% occupancy
         CC 2.0 : 20 registers; 16 shared, 100 constant, 0 local memory bytes; 100% occupancy
     32, Generating copyin(a(:n,:m))
         Generating copyout(b(2:n-1,2:m-1))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     33, Loop is parallelizable
     34, Loop is parallelizable
         Accelerator kernel generated
         33, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
         34, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
         CC 1.0 : 15 registers; 100 shared, 8 constant, 0 local memory bytes; 66% occupancy
         CC 2.0 : 20 registers; 16 shared, 100 constant, 0 local memory bytes; 100% occupancy
For each loop nest, the compiler generates a kernel. The compiler produces messages about the data copied into and out of the GPU. The next two lines tell us that the compiler generated two GPU binaries, one suitable for Tesla devices (NVIDIA compute capability 1.0-1.3), and a second for Fermi devices (NVIDIA compute capability 2.0). After the line specifying that an accelerator kernel was generated, the compiler gives us the execution schedule, the mapping of the loop iterations to the device parallelism. The compiler prints this using the directives that you might use if you were going to specify this schedule manually. In this case, the compiler tiles the loop into 16x16 tiles (vector(16) for both loops in each loop nest), and executes the tiles in parallel across the NVIDIA multiprocessors (gang for both loops).
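If you wanted to spell that schedule out yourself, the directives on the first loop nest would look roughly like this sketch; the compiler already chooses this mapping for you, so writing it explicitly is optional:

```fortran
! Sketch: the compiler's reported schedule, written explicitly.
! Each loop is strip-mined by 16 (vector(16)), giving 16x16 tiles;
! the tiles execute in parallel across the multiprocessors (gang).
!$acc kernels
!$acc loop gang, vector(16)
do i = 2,n-1
  !$acc loop gang, vector(16)
  do j = 2,m-1
    a(i,j) = w0 * b(i,j) + &
             w1 * (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) + &
             w2 * (b(i-1,j-1) + b(i-1,j+1) + b(i+1,j-1) + b(i+1,j+1))
  enddo
enddo
!$acc end kernels
```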
We've provided the source for this routine along with a driver routine to call it. Try running the program with a 1000 x 1000 matrix and 50 iterations, using a call to acc_init to isolate any device initialization. On my machine, setting the environment variable PGI_ACC_TIME, I see the performance output:
Accelerator Kernel Timing data
acc_f3.f90
  smooth
    32: region entered 50 times
        time(us): total=130109 init=9 region=130100
                  kernels=18169 data=100746
        w/o init: total=130100 max=2738 min=2581 avg=2602
        34: kernel launched 50 times
            grid: [63x63]  block: [16x16]
            time(us): total=18169 max=375 min=355 avg=363
acc_f3.f90
  smooth
    24: region entered 50 times
        time(us): total=132923 init=9 region=132914
                  kernels=18184 data=103381
        w/o init: total=132914 max=3064 min=2636 avg=2658
        26: kernel launched 50 times
            grid: [63x63]  block: [16x16]
            time(us): total=18184 max=380 min=355 avg=363
acc_init.c
  acc_init
    42: region entered 1 time
        time(us): init=1427521
The data for the two kernels is essentially identical, which we would expect. As before, we see that about 80% of the GPU time is spent moving data to and from the GPU memory. What if, instead of moving the data back and forth for each iteration, we could move the data once and leave it there? Let's modify the program so that in the routine that calls smooth, we surround that call with a data construct:
!$acc data copy(a(:,:),b(:,:))
call smooth( a, b, w0, w1, w2, n, m, iters )
!$acc end data
This tells the compiler to copy both arrays to the GPU before the call, and bring the results back to the host memory after the call. Inside the function, replace the copyin and copy data clauses by a present clause:
!$acc kernels loop present(a,b)
This tells the compiler that the data is already present on the GPU, so rather than copying the data it should just use the copy that is already present. Now the performance profile looks like:
Accelerator Kernel Timing data
acc_f3a.f90
  smooth
    32: region entered 50 times
        time(us): total=18309 init=6 region=18303
                  kernels=17786 data=0
        w/o init: total=18303 max=466 min=349 avg=366
        34: kernel launched 50 times
            grid: [63x63]  block: [16x16]
            time(us): total=17786 max=373 min=341 avg=355
acc_f3a.f90
  smooth
    24: region entered 50 times
        time(us): total=18389 init=2 region=18387
                  kernels=17823 data=0
        w/o init: total=18387 max=530 min=348 avg=367
        26: kernel launched 50 times
            grid: [63x63]  block: [16x16]
            time(us): total=17823 max=383 min=340 avg=356
acc_f3a.f90
  main
    128: region entered 1 time
         time(us): total=41423
                   data=4402
acc_init.c
  acc_init
    42: region entered 1 time
        time(us): init=1386799

We see no data movement in the smooth function, just the roughly 18 milliseconds of total kernel execution time for each region. The data movement happens only in the main routine, and only once, taking roughly 4 milliseconds instead of 200 as before.
But suppose we want a program that will run on the GPU when it's available, or on the host when it's not. The PGI compilers provide that functionality using the PGI Unified Binary technology. To enable it, we build with the target accelerator option ‑ta=nvidia,host. This generates two versions of the smooth function, one that runs on the host and one on the GPU. At run time, the program will determine whether there is a GPU attached and run that version if there is, or run the host version if there is not. You should see compiler messages like:
smooth:
     18, PGI Unified Binary version for -tp=nehalem-64 -ta=host
...
smooth:
     18, PGI Unified Binary version for -tp=nehalem-64 -ta=nvidia
     24, Generating present(b(:,:))
         Generating present(a(:,:))
         Generating compute capability 1.0 binary
...
where the printed ‑tp value depends on the host on which you are running. Now you should be able to run this on the machine with the GPU and see the GPU performance, and then move the same binary to a machine with no GPU, where the host (‑ta=host) copy will run.
By default, the runtime system will use the GPU version if the GPU is available, and will use the host version if it is not. You can manually select the host version two ways. One way is to set the environment variable ACC_DEVICE_TYPE before running the program:
csh:  setenv ACC_DEVICE_TYPE host
bash: export ACC_DEVICE_TYPE=host
You can go back to the default by unsetting this variable. Alternatively, the program can select the device by calling an API routine:
use openacc
...
call acc_set_device( acc_device_host )
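For instance, a program could fall back to the host based on its own command-line flag. This is only a sketch; the '-host' flag name is our invention for illustration, not part of the OpenACC API:

```fortran
! Sketch: select the host device when the user passes "-host"
! (an illustrative flag name, not part of OpenACC).
program choose_device
  use openacc
  character(10) :: arg
  if( command_argument_count() .gt. 0 )then
    call get_command_argument( 1, arg )
    if( trim(arg) .eq. '-host' )then
      call acc_set_device( acc_device_host )
    endif
  endif
  ! ... rest of the program, including its accelerator regions ...
end program
```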
This installment introduced the OpenACC features in the PGI Accelerator compilers for NVIDIA GPUs and presented three simple programs. We enabled the OpenACC directives with the ‑acc flag, used the simple accelerator profile library via the PGI_ACC_TIME environment variable, used ‑ta=nvidia,host to generate a unified host+GPU binary, and introduced the data construct and the present clause. We hope you have a chance to try our simple examples and can start putting some simple programs on the GPU. We will follow up this introductory article with more details about OpenACC, tuning OpenACC programs, and using OpenACC in larger programs.