Technical News from The Portland Group

OpenACC Features in PGI Accelerator Fortran Compilers—Part 1

This is an update of a previous article about the PGI Accelerator programming model for GPUs

GPUs have a very high compute capacity, and the latest designs have made them even more programmable and useful for tasks other than just graphics. Research on using GPUs for general purpose computing has gone on for several years, but it was only when NVIDIA introduced the CUDA Toolkit in 2007, including a compiler with extensions to C, that GPU computing became useful without heroic effort. Yet, while CUDA is a big step forward for GPU programming, it requires significant rewriting and restructuring of your program, and if you want to retain the option of running on an X64 host, you must maintain the GPU and CPU versions of the program separately.

In November 2011, PGI, along with Cray, NVIDIA and CAPS Entreprise, introduced the OpenACC API, a directive-based method for host+accelerator programming, allowing us to treat the GPU as an accelerator. The OpenACC specification shares many features of the PGI Accelerator programming model, which we introduced in 2009. This article describes OpenACC features for those of you who have little or no experience with GPU programming, and no experience with the PGI Accelerator programming model. If you have used the PGI Accelerator directives, and specifically want information about the differences between the PGI Accelerator directives and OpenACC directives, please read the companion article instead.

The OpenACC API uses directives and compiler analysis to compile natural Fortran for the GPU; this often allows you to maintain a single source version, since ignoring the directives will compile the same program for the X64 CPU.
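As a quick preview of the examples later in this article, an OpenACC-annotated loop looks like the sketch below; compiled without the OpenACC option, the directive is treated as a comment and the same loop simply runs on the host.

    !$acc kernels loop
    do i = 1, n
        r(i) = a(i) * 2.0    ! the compiler builds and launches a GPU kernel for this loop
    enddo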

PGI is introducing full support for the OpenACC 1.0 specification with the 12.6 release of its compilers. Some OpenACC features are available in earlier releases, beginning with version 12.3.

GPU Architecture

Let's start by looking at accelerators, GPUs, and the current NVIDIA Fermi GPU in particular, because it is the first target for our compiler. An accelerator is typically implemented as a coprocessor to the host; it has its own instruction set and usually (but not always) its own memory. To the hardware, the accelerator looks like another IO unit; it communicates with the CPU using IO commands and DMA memory transfers. To the software, the accelerator is another computer to which your program sends data and routines to execute. Many accelerators have been produced over the years; with current technology, an accelerator fits on a single chip, like a CPU. Besides GPUs, other accelerators being considered today are the forthcoming Intel MIC, the AMD APU, and FPGAs. Here we focus on the NVIDIA family of GPUs; a picture of the relevant architectural features is shown below.

NVIDIA Fermi GPU Accelerator Block Diagram

The key features are the processors, the memory, and the interconnect. The NVIDIA GPUs have (currently) up to 16 multiprocessors; each multiprocessor has two SIMD units, each unit with 16 parallel thread processors. The thread processors run synchronously, meaning all thread processors in a SIMD unit execute the same instruction at the same time. Different multiprocessors run asynchronously, much like commodity multicore processors.

The GPU has its own memory, usually called device memory; this can range up to 6GB today. As with CPUs, the latency to device memory is quite long. CPUs use caches to try to reduce the effect of the long memory latency, by caching recently accessed data in the hope that future accesses will hit in the cache. Caches can be quite effective, but they don't solve the basic memory bandwidth problem. GPU programs typically require streaming access to large data sets that would overflow the size of a reasonable cache. To solve this problem, GPUs use multithreading. When the GPU processor issues an access to the device memory, that GPU thread goes to sleep until the memory returns the value. In the meantime, the GPU processor switches over very quickly, in hardware, to another GPU thread, and continues executing that thread. In this way, the GPU exploits program parallelism to keep busy while the slow device memory is responding.

While the device memory has long latency, the interconnect between the memory and the GPU processors supports very high bandwidth. In contrast to a CPU, the memory can keep up with the demands of data-intensive programs; instead of suffering from cache stalls, the GPU can keep busy, as long as there is enough parallelism to keep the processors busy.

Programming

Current approaches to programming GPUs include NVIDIA's CUDA and the open standard language OpenCL. Over the past several years, there have been many success stories using CUDA to port programs to NVIDIA GPUs. The goal of OpenCL is to provide a portable mechanism to program different GPUs and other parallel systems. Is there need or room for another programming strategy?

The cost of programming using CUDA or OpenCL is the initial programming effort to convert your program into the host part and the accelerator part. Each routine to run on the accelerator must be extracted to a separate kernel function, and the host code must manage device memory allocation, data movement, and kernel invocation. The kernel itself may have to be carefully optimized for the GPU or accelerator, including unrolling loops and orchestrating device memory fetches and stores. While CUDA and OpenCL are much, much easier programming environments than what was available before 2007, and both allow very detailed low-level optimization for the GPU, they are a long way from making it easy to program and experiment.
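To make that concrete, below is a minimal CUDA Fortran sketch (not OpenACC, and not taken from this article's examples) of a trivial vector-doubling computation; the module, kernel name, and launch configuration are our own illustrative choices. Even for this simple loop, you must write a separate kernel, declare device copies of the arrays, move the data explicitly, and launch the kernel yourself.

module double_m
    use cudafor
contains
    attributes(global) subroutine double_kernel( r, a )
        real :: r(:), a(:)           ! device arrays passed from the host
        integer :: i
        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
        if( i <= size(a) ) r(i) = a(i) * 2.0
    end subroutine
end module

program cuda_double
    use double_m
    integer, parameter :: n = 100000
    real :: a(n), r(n)               ! host arrays
    real, device :: a_d(n), r_d(n)   ! device copies, managed by hand
    integer :: i
    do i = 1, n
        a(i) = i * 2.0
    enddo
    a_d = a                                                 ! explicit copy host -> device
    call double_kernel<<< (n+255)/256, 256 >>>( r_d, a_d )  ! explicit kernel launch
    r = r_d                                                 ! explicit copy device -> host
    print *, r(1), r(n)
end program

Keeping a version like this and a plain host version in step is exactly the kind of maintenance the directive approach is meant to avoid.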

To address this, the OpenACC consortium has come up with the OpenACC API, implemented as a set of directives and API runtime routines. Using these, you can more easily start and experiment with porting programs to GPUs, letting the compiler do much of the bookkeeping. The resulting program is more portable, and in fact can run unmodified on the CPU itself. In this first tutorial installment, we will show some initial programs to get you started, and explore some of the features of the model. As we will see in the next installment, this model doesn't make parallel programming easy, but it does reduce the cost of entry; we will also explore using the directives to tune performance.

Setting Up

You need the right hardware and software to start using the PGI Accelerator compilers. First, you need a 64-bit x86 system with a Linux distribution supported both by PGI and NVIDIA; these include recent RHEL, SLES, OpenSUSE, Fedora, and Ubuntu distributions. See the PGI release support page and the Download CUDA page on the NVIDIA web site for currently supported distributions. Your system needs a CUDA-enabled NVIDIA graphics or Tesla card; see the CUDA-Enabled Products page on the NVIDIA web site for a list of appropriate cards. You need to install the PGI compilers, which come with the necessary CUDA toolkit components. From NVIDIA, you'll want a recent CUDA driver, available from the Download CUDA page on the NVIDIA site.

Let's assume you've installed the PGI compilers under the default location /opt/pgi. The PGI installation has two additional directory levels. The first corresponds to the target (linux86 for 32-bit, linux86-64 for 64-bit). The second level has a directory for the PGI version (12.3 or 12.6, for example) and another PGI release directory containing common components (2012). If the compilers are installed, you're ready to test your accelerator connection. Try running the PGI-supplied tool pgaccelinfo. If you have everything set up properly, you should see output similar to this:

CUDA Driver Version:           4010
NVRM version: NVIDIA UNIX x86_64 Kernel Module  285.05.33  Thu Jan 19 
14:07:02 PST 2012

Device Number:                 0
Device Name:                   Quadro 6000
Device Revision Number:        2.0
Global Memory Size:            6441992192
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             1494 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1134669 microseconds
Current free memory:           6367731712
Upload time (4MB):             2605 microseconds (1308 ms pinned)
Download time:                 2929 microseconds (1382 ms pinned)
Upload bandwidth:              1610 MB/sec (3206 MB/sec pinned)
Download bandwidth:            1431 MB/sec (3034 MB/sec pinned)

This first tells you the CUDA driver version information. It tells you that there is a single device, number zero; it's an NVIDIA Quadro 6000 with compute capability 2.0, 6GB of memory, and 14 multiprocessors. You might have more than one GPU installed; perhaps you have a small GPU on the motherboard and a larger GPU or Tesla card in a PCI slot. The pgaccelinfo utility will give you information about each one it can find. On the other hand, if you see the message:

No accelerators found.
Try pgaccelinfo -v for more information

then you probably don't have the right hardware or drivers installed.

Now you're ready to start your first program.

First Program

We're going to show several simple example programs; we encourage you to try each one yourself. These examples are all available for download from the PGI web site.

We'll start with a very simple program; it will send a vector of floats to the GPU, double it, and bring the results back. In Fortran, the whole program is:

program main
    integer :: n        ! size of the vector
    real,dimension(:),allocatable :: a, r, e
    integer :: i
    character(10) :: arg1
    if( command_argument_count() .gt. 0 )then
        call get_command_argument( 1, arg1 )
        read(arg1,'(i10)') n
    else
        n = 100000
    endif
    if( n .le. 0 ) n = 100000
    allocate(a(n), r(n), e(n))
    do i = 1,n
        a(i) = i*2.0
    enddo
    !$acc kernels loop
        do i = 1,n
            r(i) = a(i) * 2.0
        enddo
    ! compute the expected results on the host for comparison
    do i = 1,n
        e(i) = a(i) * 2.0
    enddo
    ! check the results
    do i = 1,n
        if( r(i) .ne. e(i) )then
            print *, i, r(i), e(i)
            stop 'error found'
        endif
    enddo
    print *, n, 'iterations completed'
end program

We prefixed the loop we want to run on the GPU with a kernels loop directive. This tells the compiler to find the parallelism in the loop, move the data over to the GPU, launch the operation on the GPU, then bring the results back. For this program, it's as simple as that.

Build this with the command:

pgfortran -o acc_f1.exe acc_f1.f90 -acc -Minfo

Note the ‑acc and ‑Minfo flags. The ‑acc enables the OpenACC directives in the compiler; by default, the PGI compilers will target the accelerator regions for the NVIDIA GPU. We'll show other options in later examples. The ‑Minfo flag enables informational messages from the compiler; we'll enable this on all our builds, and explain what the messages mean. You're going to want to understand these messages when you start to tune for performance.

If everything is installed and licensed correctly you should see the following informational messages from pgfortran:

main:
     17, Generating copyin(a(1:n))
         Generating copyout(r(1:n))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     18, Loop is parallelizable
         Accelerator kernel generated
         18, !$acc loop gang, vector(256) ! blockidx%x threadidx%x

Let's explain a few of these messages. The first:

Generating copyin(a(1:n))

tells you that the compiler determined that the array a is used only as input to the loop, so those n elements of array a need to be copied over from the CPU memory to the GPU device memory; this is a copyin to the device memory. Because they aren't modified, they don't need to be brought back. The second message

Generating copyout(r(1:n))

tells you that the array r is assigned, but never read inside the loop; the values from the CPU memory don't need to be sent to the GPU, but the modified values need to be copied back. This is a copyout from the device memory. Below that is the message:

Loop is parallelizable

This tells you that the compiler analyzed the references in the loop and determined that all iterations could be executed in parallel. The next message is the key one:

Accelerator kernel generated

This tells you that the compiler successfully converted the body of that loop to a kernel for the GPU. The kernel is the GPU function created by the compiler; it will be called by the program and executed in parallel on the GPU. We'll discuss the next message, and others that you'll see, in more detail in the next installment.
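Incidentally, if you would rather state the data movement explicitly than rely on the compiler's analysis, OpenACC lets you write the data clauses on the directive yourself. A sketch of the same loop with the clauses spelled out to match what the compiler chose above:

    !$acc kernels loop copyin(a(1:n)) copyout(r(1:n))
    do i = 1,n
        r(i) = a(i) * 2.0
    enddo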

So now you're ready to run the program. Assuming you're on the machine with the GPU, just type the name of the executable, acc_f1.exe. If you get a message

libcuda.so not found, exiting

then the CUDA driver software is probably not installed in its default location, /usr/lib; you may have to set the environment variable LD_LIBRARY_PATH so the runtime can find libcuda.so. What you should see is just the final output

100000 iterations completed

How do you know that anything executed on the GPU? You can set the environment variable ACC_NOTIFY to 1:

csh:   setenv ACC_NOTIFY 1
bash:  export ACC_NOTIFY=1

then run the program; it will then print out a line each time a GPU kernel is launched. In this case, you'll see something like:

launch kernel  file=acc_f1.f90 function=main line=18 device=0 grid=391 
               block=256

which tells you the file, function, and line number of the kernel, and the CUDA grid and thread block dimensions. You probably don't want to leave this set for all your programs, but it's instructive and useful during program development and testing.
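When you're done experimenting, just unset the variable to turn the notification off:

csh:   unsetenv ACC_NOTIFY
bash:  unset ACC_NOTIFY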

Second Program

Our first program was pretty trivial, just enough to get a test run. Let's take on a slightly more interesting program, one that has more computational intensity on each iteration. In Fortran, the program is:

program main
    use openacc
    integer :: n        ! size of the vector
    real,dimension(:),allocatable :: a  ! the vector
    real,dimension(:),allocatable :: r  ! the results
    real,dimension(:),allocatable :: e  ! expected results
    integer :: i
    integer :: c0, c1, c2, c3, cgpu, chost
    character(10) :: arg1
    if( command_argument_count() .gt. 0 )then
        call get_command_argument( 1, arg1 )
        read(arg1,'(i10)') n
    else
        n = 100000
    endif
    if( n .le. 0 ) n = 100000
    allocate(a(n))
    allocate(r(n))
    allocate(e(n))
    do i = 1,n
        a(i) = i*2.0
    enddo
    !call acc_init( acc_device_nvidia )
    call system_clock( count=c1 )
    !$acc kernels loop
        do i = 1,n
            r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
        enddo
    call system_clock( count=c2 )
    cgpu = c2 - c1
    ! host version of the computation, for comparison
    do i = 1,n
        e(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
    enddo
    call system_clock( count=c3 )
    chost = c3 - c2
    ! check the results
    do i = 1,n
        if( abs(r(i) - e(i)) .gt. 0.000001 )then
            print *, i, r(i), e(i)
        endif
    enddo
    print *, n, ' iterations completed'
    print *, cgpu, ' microseconds on GPU'
    print *, chost, ' microseconds on host'
end program

It reads the first command line argument as the number of elements to compute, allocates arrays, runs a kernel to compute floating point sine and cosine, and compares to the host. Some details to note:

  • There is a call to acc_init() commented out; you'll uncomment that shortly.
  • There are calls to system_clock() to read the real-time clock (see the timer note following this list).
  • The program doesn't compare for equality; it compares against a tolerance. We'll discuss that as well.
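One note about the timers, which is our addition rather than something from the program above: system_clock counts ticks of a system-dependent clock, and the program prints the raw count difference as microseconds, which assumes one tick per microsecond. If your system reports a different count rate, a sketch of a rate-independent conversion looks like this (variable names are illustrative):

    integer :: crate
    call system_clock( count_rate=crate )    ! ticks per second
    call system_clock( count=c1 )
    ! ... timed code ...
    call system_clock( count=c2 )
    cgpu = int( (real(c2) - real(c1)) * 1.0e6 / real(crate) )   ! elapsed microseconds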

Now let's build and run the program; you'll compare the speed of your GPU to the speed of the host. Build as you did before, and you should see messages much like you did before. You can view just the accelerator messages by replacing ‑Minfo by ‑Minfo=accel on the compile line.
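For example, a build line like this (the executable name simply follows the pattern of the first example):

pgfortran -o acc_f2.exe acc_f2.f90 -acc -Minfo=accel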

The first time you run this program, you'll see output something like:

       100000 iterations completed
      1367529  microseconds on GPU
         1348  microseconds on host

So what's this? Over a second on the GPU? Only 1.3 milliseconds on the host? What's the deal?

Let's explore this a little. If I enclose the program, from the first call to the timer through the last print statement, in another loop that iterates three times (a sketch of that change follows the output), I'll see something more like the following:

       100000 iterations completed
      1363251  microseconds on GPU
         1897  microseconds on host
       100000  iterations completed
          621  microseconds on GPU
         1374  microseconds on host
       100000  iterations completed
          554  microseconds on GPU
         1230  microseconds on host
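For reference, here is a sketch of that modification; the only changes are the added outer loop and its integer loop variable, and the result check is omitted here for brevity:

    do itimes = 1, 3
        call system_clock( count=c1 )
        !$acc kernels loop
        do i = 1,n
            r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
        enddo
        call system_clock( count=c2 )
        cgpu = c2 - c1
        do i = 1,n
            e(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
        enddo
        call system_clock( count=c3 )
        chost = c3 - c2
        print *, n, ' iterations completed'
        print *, cgpu, ' microseconds on GPU'
        print *, chost, ' microseconds on host'
    enddo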

The time on the GPU is very long for the first iteration, then is much faster after that. The reason is the overhead of connecting to the GPU. On Linux, if you run on a GPU without a display connected, it can take 1 to 1.5 seconds to make that initial connection the first time the first kernel executes. You can set the environment variable PGI_ACC_TIME to 1 before running your program; this directs the runtime to collect the time spent in GPU initialization, data movement, and kernel execution. If we set the environment variable, we'll get additional profile information:

       100000 iterations completed
      1367529  microseconds on GPU
         1348  microseconds on host

Accelerator Kernel Timing data
acc_f2.f90
  main
    25: region entered 1 time
        time(us): total=1367517 init=1366590 region=927
                  kernels=60 data=543
        w/o init: total=927 max=927 min=927 avg=927
        26: kernel launched 1 times
            grid: [391]  block: [256]
            time(us): total=60 max=60 min=60 avg=60

The timing data tells us that the accelerator region at line 25 was entered once and took a total of 1.367 seconds. Of that, about 1.366 seconds was spent in initialization. The actual execution time was about 0.9 milliseconds. Of that time, 543 microseconds was spent moving data back and forth (copying the a and r arrays to and from the GPU), and only 60 microseconds was spent executing the kernel.

So, let's take the initialization out of the timing code altogether. Uncomment the call to acc_init() in your program, rebuild and then run the program. You should see output more like:

       100000 iterations completed
          902  microseconds on GPU
         1333  microseconds on host

The GPU time still includes the overhead of moving data between the GPU and host. Your times may differ, even substantially, depending on the GPU you have installed (particularly the Number of Multiprocessors reported by pgaccelinfo) and the host processor. These runs were made on a 2.4GHz AMD Athlon 64.

To see some more interesting performance numbers, try increasing the number of loop iterations. The program defaults to 100000; increase this to 1000000. On my machine, I see

      1000000 iterations completed
         4168  microseconds on GPU
        36536  microseconds on host

Note the GPU time increases by about a factor of 5, less than the factor of 10 you might expect from ten times as many iterations; the host time increases by a factor of more than 25. That's because the host is sensitive to cache locality. The GPU has a very high bandwidth device memory; it uses the extra parallelism that comes from the 1,000,000 parallel iterations to tolerate the long memory latency, so there's no performance cliff.

So, let's go back and look at the reason for the tolerance test, instead of an equality test, for correctness. The fact is that the GPU doesn't compute to exactly the same precision as the host. In particular, some transcendentals and trigonometric functions may be different in the low-order bit. You, the programmer, have to be aware of the potential for these differences, and if they are not acceptable, you may need to wait until the GPUs implement full host equivalence. Before the adoption of the IEEE floating point arithmetic standard, every computer used a different floating point format and delivered different precision, so this is not a new problem, just a new manifestation.

Your next assignment is to convert this second program to double precision. You might compare the results to find the maximum difference. We're computing sin^2 + cos^2, which, if I remember my high school geometry, should equal 1.0 for all angles. So, you can compare the GPU and host computed values against the actual correct answer as well.
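A minimal sketch of one way to approach that assignment; it mirrors the structure of the program above, with the timing and command-line handling stripped out, and the variable names and use of maxval are our own choices:

program main_dp
    integer, parameter :: n = 1000000
    double precision, dimension(:), allocatable :: a, r, e
    double precision :: maxdiff
    integer :: i
    allocate(a(n), r(n), e(n))
    do i = 1,n
        a(i) = i*2.0d0
    enddo
    !$acc kernels loop
    do i = 1,n
        r(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
    enddo
    ! host version for comparison
    do i = 1,n
        e(i) = sin(a(i)) ** 2 + cos(a(i)) ** 2
    enddo
    ! largest GPU-vs-host difference, and largest deviation from the exact answer 1.0
    maxdiff = 0.0d0
    do i = 1,n
        maxdiff = max( maxdiff, abs(r(i) - e(i)) )
    enddo
    print *, 'max GPU/host difference: ', maxdiff
    print *, 'max deviation from 1.0:  ', maxval(abs(r - 1.0d0)), maxval(abs(e - 1.0d0))
end program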

Third Program

Here, we'll explore writing a slightly more complex program, and try some other options, such as building it to run on either the GPU, or on the host if you don't have a GPU installed. We'll look at a simple Jacobi relaxation on a two-dimensional rectangular mesh. In Fortran, the relaxation routine we'll use is:

subroutine smooth( a, b, w0, w1, w2, n, m, niters )
 real, dimension(:,:) :: a,b
 real :: w0, w1, w2
 integer :: n, m, niters
 integer :: i, j, iter
  do iter = 1,niters
   !$acc kernels loop
   do i = 2,n-1
    do j = 2,m-1
     a(i,j) = w0 * b(i,j) + &
              w1 * (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) + &
              w2 * (b(i-1,j-1) + b(i-1,j+1) + b(i+1,j-1) + b(i+1,j+1))
    enddo
   enddo
   !$acc kernels loop
   do i = 2,n-1
    do j = 2,m-1
     b(i,j) = w0 * a(i,j) + &
              w1 * (a(i-1,j) + a(i,j-1) + a(i+1,j) + a(i,j+1)) + &
              w2 * (a(i-1,j-1) + a(i-1,j+1) + a(i+1,j-1) + a(i+1,j+1))
    enddo
   enddo
  enddo
end subroutine

This particular implementation executes a fixed number of iterations, executing two kernels for each iteration. We can build these routines as before, and we'll get a set of informational messages as before:

 smooth:
     24, Generating copyin(b(:n,:m))
         Generating copyout(a(2:n-1,2:m-1))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     25, Loop is parallelizable
     26, Loop is parallelizable
         Accelerator kernel generated
         25, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
         26, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
             CC 1.0 : 15 registers; 100 shared, 8 constant, 0 local memory
                      bytes; 66% occupancy
             CC 2.0 : 20 registers; 16 shared, 100 constant, 0 local memory 
                       bytes; 100% occupancy
     32, Generating copyin(a(:n,:m))
         Generating copyout(b(2:n-1,2:m-1))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     33, Loop is parallelizable
     34, Loop is parallelizable
         Accelerator kernel generated
         33, !$acc loop gang, vector(16) ! blockidx%x threadidx%x
         34, !$acc loop gang, vector(16) ! blockidx%y threadidx%y
             CC 1.0 : 15 registers; 100 shared, 8 constant, 0 local memory 
                       bytes; 66% occupancy
             CC 2.0 : 20 registers; 16 shared, 100 constant, 0 local memory  
                       bytes; 100% occupancy

For each loop nest, the compiler generates a kernel. The compiler produces messages about the data copied into and out of the GPU. The next two lines tell us that the compiler generated two GPU binaries, one suitable for Tesla devices (NVIDIA compute capability 1.0-1.3), and a second for Fermi devices (NVIDIA compute capability 2.0). After the line specifying that an accelerator kernel was generated, the compiler gives us the execution schedule, the mapping of the loop iterations to the device parallelism. The compiler prints this using the directives that you might use if you were going to specify this schedule manually. In this case, the compiler tiles the loop into 16x16 tiles (vector(16) for both loops in each loop nest), and executes the tiles in parallel across the NVIDIA multiprocessors (gang for both loops).
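If you wanted to write that schedule out yourself, the loop directives for the first loop nest would look something like the sketch below; this simply matches the schedule the compiler reported, and choosing and tuning these clauses is a topic for the next installment:

   !$acc kernels
   !$acc loop gang, vector(16)
   do i = 2,n-1
    !$acc loop gang, vector(16)
    do j = 2,m-1
     a(i,j) = w0 * b(i,j) + &
              w1 * (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) + &
              w2 * (b(i-1,j-1) + b(i-1,j+1) + b(i+1,j-1) + b(i+1,j+1))
    enddo
   enddo
   !$acc end kernels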

We've provided the source for this routine along with a driver routine to call it. Try running the program with a 1000 x 1000 matrix and 50 iterations, using a call to acc_init to isolate any device initialization. On my machine, setting the environment variable PGI_ACC_TIME, I see the performance output:

Accelerator Kernel Timing data
acc_f3.f90
  smooth
    32: region entered 50 times
        time(us): total=130109 init=9 region=130100
          kernels=18169 data=100746
        w/o init: total=130100 max=2738 min=2581 avg=2602
        34: kernel launched 50 times
            grid: [63x63]  block: [16x16]
            time(us): total=18169 max=375 min=355 avg=363
acc_f3.f90
  smooth
    24: region entered 50 times
        time(us): total=132923 init=9 region=132914
                  kernels=18184 data=103381
        w/o init: total=132914 max=3064 min=2636 avg=2658
        26: kernel launched 50 times
            grid: [63x63]  block: [16x16]
            time(us): total=18184 max=380 min=355 avg=363
acc_init.c
  acc_init
    42: region entered 1 time
        time(us): init=1427521

The data for the two kernels is essentially identical, which we would expect. As before, we see that roughly 80% of the GPU time is spent moving data to and from the GPU memory. What if, instead of moving the data back and forth for each iteration, we could move the data once and leave it there? Let's modify the program so that, in the routine that calls smooth, we surround the call with a data construct:

!$acc data copy(a(:,:),b(:,:))
call smooth( a, b, w0, w1, w2, n, m, iters );
!$acc end data

This tells the compiler to copy both arrays to the GPU before the call, and bring the results back to the host memory after the call. Inside the subroutine, add a present clause to each kernels loop directive:

!$acc kernels loop present(a,b)

This tells the compiler that the data is already present on the GPU, so rather than copying the data it should just use the copy that is already present. Now the performance profile looks like:

Accelerator Kernel Timing data
acc_f3a.f90
  smooth
    32: region entered 50 times
        time(us): total=18309 init=6 region=18303
                  kernels=17786 data=0
        w/o init: total=18303 max=466 min=349 avg=366
        34: kernel launched 50 times
            grid: [63x63]  block: [16x16]
            time(us): total=17786 max=373 min=341 avg=355
acc_f3a.f90
  smooth
    24: region entered 50 times
        time(us): total=18389 init=2 region=18387
                  kernels=17823 data=0
        w/o init: total=18387 max=530 min=348 avg=367
        26: kernel launched 50 times
            grid: [63x63]  block: [16x16]
            time(us): total=17823 max=383 min=340 avg=356
acc_f3a.f90
  main
    128: region entered 1 time
        time(us): total=41423
                  data=4402
acc_init.c
  acc_init
    42: region entered 1 time
        time(us): init=1386799

We see no data movement in the smooth subroutine, just the roughly 18 milliseconds total for each kernel's 50 launches. The data movement now happens only in the main routine, and only once, taking roughly 4 milliseconds instead of roughly 200 as before.

But suppose we want a program that will run on the GPU when one is available, or on the host when one is not. The PGI compilers provide that functionality using the PGI Unified Binary™ technology. To enable it, we build with the target accelerator option ‑ta=nvidia,host (a sample build line appears after the compiler output below). This generates two versions of the smooth subroutine, one that runs on the host and one on the GPU. At run time, the program will determine whether there is a GPU attached, and run the GPU version if there is or the host version if there is not. You should see compiler messages like:

smooth:
     18, PGI Unified Binary version for -tp=nehalem-64 -ta=host
         ...
smooth:
     18, PGI Unified Binary version for -tp=nehalem-64 -ta=nvidia
     24, Generating present(b(:,:))
         Generating present(a(:,:))
         Generating compute capability 1.0 binary
         ...

where the printed ‑tp value depends on the host on which you are running. Now you should be able to run this on the machine with the GPU and see the GPU performance, and then move the same binary to a machine with no GPU, where the host (‑ta=host) copy will run.
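For reference, the build line for the unified version looks much like before, just with the new target option (the file and executable names here are simply the ones from our example):

pgfortran -o acc_f3a.exe acc_f3a.f90 -acc -ta=nvidia,host -Minfo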

By default, the runtime system will use the GPU version if the GPU is available, and will use the host version if it is not. You can manually select the host version in two ways. One way is to set the environment variable ACC_DEVICE_TYPE before running the program:

csh:   setenv ACC_DEVICE_TYPE host
bash:  export ACC_DEVICE_TYPE=host

You can go back to the default by unsetting this variable. Alternatively, the program can select the device by calling an API routine:

      use openacc
      ...
      call acc_set_device_type( acc_device_host )

Summary

This installment introduced the OpenACC features in the PGI Accelerator compilers for NVIDIA GPUs, and presented three simple programs. We enabled the OpenACC directives with the ‑acc flag, used the simple accelerator profile facility via the PGI_ACC_TIME environment variable, used ‑ta=nvidia,host to generate a unified host+GPU binary, and introduced the data construct and the present clause. We hope you have a chance to try our simple examples and can start putting some simple programs on the GPU. We will follow up this introductory article with more details about OpenACC, tuning OpenACC programs, and using OpenACC in larger programs.