Technical News from The Portland Group

OpenACC Features in the PGI Accelerator C Compiler—Part 1

This is an update of a previous article about the PGI Accelerator programming model for GPUs

GPUs have a very high compute capacity, and the latest designs have made them even more programmable and useful for tasks other than just graphics. Research on using GPUs for general purpose computing has gone on for several years, but it was only when NVIDIA introduced the CUDA Toolkit in 2007, including a compiler with extensions to C, that GPU computing became useful without heroic effort. Yet, while CUDA is a big step forward for GPU programming, it requires significant rewriting and restructuring of your program, and if you want to retain the option of running on an X64 host, you must maintain both the GPU and CPU program versions separately.

In November 2011, PGI, along with Cray, NVIDIA and CAPS Entreprise, introduced the OpenACC API, a directive-based method for host+accelerator programming, allowing us to treat the GPU as an accelerator. The OpenACC specification shares many features with the PGI Accelerator programming model, which we introduced in 2009. This article describes OpenACC features for those of you who have little or no experience with GPU programming, and no experience with the PGI Accelerator programming model. If you have used the PGI Accelerator directives, and specifically want information about the differences between the PGI Accelerator directives and OpenACC directives, please read the companion article instead.

The OpenACC API uses directives and compiler analysis to compile natural C for the GPU; this often allows you to maintain a single source version, since ignoring the directives will compile the same program for the X64 CPU.

PGI is introducing full support for the OpenACC 1.0 specification with the 12.6 release of its compilers. Some OpenACC features are available in earlier releases, beginning with version 12.3.

GPU Architecture

Let's start by looking at accelerators, GPUs, and the current NVIDIA Fermi GPU in particular, because it is the first target for our compiler. An accelerator is typically implemented as a coprocessor to the host; it has its own instruction set and usually (but not always) its own memory. To the hardware, the accelerator looks like another IO unit; it communicates with the CPU using IO commands and DMA memory transfers. To the software, the accelerator is another computer to which your program sends data and routines to execute. Many accelerators have been produced over the years; with current technology, an accelerator fits on a single chip, like a CPU. Besides GPUs, other accelerators being considered today are the forthcoming Intel MIC, the AMD APU, and FPGAs. Here we focus on the NVIDIA family of GPUs; a picture of the relevant architectural features is shown below.

[Figure: NVIDIA Fermi GPU Accelerator Block Diagram]

The key features are the processors, the memory, and the interconnect. The NVIDIA GPUs have (currently) up to 16 multiprocessors; each multiprocessor has two SIMD units, each unit with 16 parallel thread processors. The thread processors run synchronously, meaning all thread processors in a SIMD unit execute the same instruction at the same time. Different multiprocessors run asynchronously, much like commodity multicore processors.

The GPU has its own memory, usually called device memory; this can range up to 6GB today. As with CPUs, access time to the memory is quite slow. CPUs use caches to try to reduce the effect of the long memory latency, by caching recently accessed data in the hopes that the future accesses will hit in the cache. Caches can be quite effective, but they don't solve the basic memory bandwidth problem. GPU programs typically require streaming access to large data sets that would overflow the size of a reasonable cache. To solve this problem, GPUs use multithreading. When the GPU processor issues an access to the device memory, that GPU thread goes to sleep until the memory returns the value. In the meantime, the GPU processor switches over very quickly, in hardware, to another GPU thread, and continues executing that thread. In this way, the GPU exploits program parallelism to keep busy while the slow device memory is responding.

While the device memory has long latency, the interconnect between the memory and the GPU processors supports very high bandwidth. In contrast to a CPU, the memory can keep up with the demands of data-intensive programs; instead of stalling on cache misses, the GPU stays busy, as long as the program exposes enough parallelism to occupy the processors.

Programming

Current approaches to programming GPUs include NVIDIA's CUDA and the open standard language OpenCL. Over the past several years, there have been many success stories using CUDA to port programs to NVIDIA GPUs. The goal of OpenCL is to provide a portable mechanism to program different GPUs and other parallel systems. Is there need or room for another programming strategy?

The cost of programming using CUDA or OpenCL is the initial programming effort to convert your program into the host part and the accelerator part. Each routine to run on the accelerator must be extracted to a separate kernel function, and the host code must manage device memory allocation, data movement, and kernel invocation. The kernel itself may have to be carefully optimized for the GPU or accelerator, including unrolling loops and orchestrating device memory fetches and stores. While CUDA and OpenCL are much, much easier programming environments than what was available before 2007, and both allow very detailed low-level optimization for the GPU, they are a long way from making it easy to program and experiment.

To address this, the OpenACC consortium has come up with the OpenACC API, implemented as a set of directives and API runtime routines. Using these, you can more easily start and experiment with porting programs to GPUs, letting the compiler do much of the bookkeeping. The resulting program is more portable, and in fact can run unmodified on the CPU itself. In this first tutorial installment, we will show some initial programs to get you started, and explore some of the features of the model. As we will see in the next installment, this model doesn't make parallel programming easy, but it does reduce the cost of entry; we will also explore using the directives to tune performance.

Setting Up

You need the right hardware and software to start using the PGI Accelerator compilers. First, you need a 64-bit x86 system with a Linux distribution supported both by PGI and NVIDIA; these include recent RHEL, SLES, OpenSUSE, Fedora, and Ubuntu distributions. See the PGI release support page and the Download CUDA page on the NVIDIA web site for currently supported distributions. Your system needs a CUDA-enabled NVIDIA graphics or Tesla card; see the CUDA-Enabled Products page on the NVIDIA web site for a list of appropriate cards. You need to install the PGI compilers, which come with the necessary CUDA toolkit components. From NVIDIA, you'll want a recent CUDA driver, available from the Download CUDA page on the NVIDIA site.

Let's assume you've installed the PGI compilers under the default location /opt/pgi. The PGI installation has two additional directory levels. The first corresponds to the target (linux86 for 32-bit, linux86-64 for 64-bit). The second level has a directory for the PGI version (12.3 or 12.6, for example) and another directory containing components common to that release family (2012). If the compilers are installed, you're ready to test your accelerator connection. Try running the PGI-supplied tool pgaccelinfo. If you have everything set up properly, you should see output similar to this:

CUDA Driver Version:           4010
NVRM version: NVIDIA UNIX x86_64 Kernel Module  285.05.33  Thu Jan 19 
14:07:02 PST 2012

Device Number:                 0
Device Name:                   Quadro 6000
Device Revision Number:        2.0
Global Memory Size:            6441992192
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             1494 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1134669 microseconds
Current free memory:           6367731712
Upload time (4MB):             2605 microseconds (1308 ms pinned)
Download time:                 2929 microseconds (1382 ms pinned)
Upload bandwidth:              1610 MB/sec (3206 MB/sec pinned)
Download bandwidth:            1431 MB/sec (3034 MB/sec pinned)

The output first gives the CUDA driver version information. It tells you that there is a single device, number zero; it's an NVIDIA Quadro 6000, it has compute capability 2.0, 6GB of memory and 14 multiprocessors. You might have more than one GPU installed; perhaps you have a small GPU on the motherboard and a larger GPU or Tesla card in a PCI slot. pgaccelinfo will give you information about each one it can find. On the other hand, if you see the message:

No accelerators found.
Try pgaccelinfo -v for more information

then you probably don't have the right hardware or drivers installed.

Now you're ready to start your first program.

First Program

We're going to show several simple example programs; we encourage you to try each one yourself. These examples are all available for download from the PGI web site.

We'll start with a very simple program; it will send a vector of floats to the GPU, double it, and bring the results back. In C, the whole program is:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

int main( int argc, char* argv[] )
{
    int n;      /* size of the vector */
    float *a;  /* the vector */
    float *restrict r;  /* the results */
    float *e;  /* expected results */
    int i;
    if( argc > 1 )
        n = atoi( argv[1] );
    else
        n = 100000;
    if( n <= 0 ) n = 100000;

    a = (float*)malloc(n*sizeof(float));
    r = (float*)malloc(n*sizeof(float));
    e = (float*)malloc(n*sizeof(float));
    /* initialize */
    for( i = 0; i < n; ++i ) a[i] = (float)(i+1);

#pragma acc kernels loop
    for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
    /* compute on the host to compare */
    for( i = 0; i < n; ++i ) e[i] = a[i]*2.0f;
    /* check the results */
    for( i = 0; i < n; ++i )
        assert( r[i] == e[i] );
    printf( "%d iterations completed\n", n );
    return 0;
}

Note the restrict keyword in the declaration of the pointer r, which is assigned in the loop; we'll see why shortly. Note also the explicit float constant 2.0f instead of 2.0. By default, C floating point constants are double precision. The expression a[i]*2.0 is computed in double precision, as (float)((double)a[i] * 2.0). To avoid this, use explicit float constants, or use the command line flag -Mfcon, which treats float constants as type float by default.
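To make the difference concrete, here are the two spellings side by side; this is just a sketch reusing the names from the listing above, not a change to the example:

    r[i] = a[i]*2.0;    /* 2.0 is double: a[i] is widened to double,
                           multiplied, then narrowed back to float */
    r[i] = a[i]*2.0f;   /* 2.0f is float: the multiply stays in
                           single precision */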

We prefixed the loop we want sent to the GPU by a kernels loop directive. This tells the compiler to find the parallelism in the loop, move the data over to the GPU, launch the operation on the GPU, then bring the results back. For this program, it's as simple as that.

Build this with the command:

pgcc -o acc_c1.exe acc_c1.c -acc -Minfo

Note the ‑acc and ‑Minfo flags. The ‑acc flag enables the OpenACC directives in the compiler; by default, the PGI compilers target accelerator regions to the NVIDIA GPU. We'll show other options in later examples. The ‑Minfo flag enables informational messages from the compiler; we'll enable this on all our builds and explain what the messages mean. You'll want to understand these messages when you start to tune for performance.

If everything is installed and licensed correctly you should see the following informational messages from pgcc:

main:
     24, Generating copyout(r[:n])
         Generating copyin(a[:n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     25, Loop is parallelizable
         Accelerator kernel generated
         25, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */

Let's explain a few of these messages. The first:

Generating copyin(a[:n])

tells you that the compiler determined that the array a is used only as input to the loop, so those n elements of array a need to be copied over from the CPU memory to the GPU device memory; this is a copyin to the device memory. Because they aren't modified, they don't need to be brought back. The second message

Generating copyout(r[:n])

tells you that the array r is assigned, but never read inside the loop; the values from the CPU memory don't need to be sent to the GPU, but the modified values need to be copied back. This is a copyout from the device memory. Below that is the message:

Loop is parallelizable

This tells you that the compiler analyzed the references in the loop and determined that all iterations could be executed in parallel. We added the restrict keyword to the declaration of the pointer r to allow this; otherwise, the compiler couldn't safely determine that a and r pointed to different memory. The next message is the key one:

Accelerator kernel generated

This tells you that the compiler successfully converted the body of that loop to a kernel for the GPU. The kernel is the GPU function itself, created by the compiler, which will be called by the program and executed in parallel on the GPU. We'll discuss the next message, and others that you'll see, in more detail in the next installment.

So now you're ready to run the program. Assuming you're on the machine with the GPU, just type the name of the executable, acc_c1.exe. If you get a message

libcuda.so not found, exiting

then the CUDA driver library was probably not installed in its default location, /usr/lib; you may have to set the environment variable LD_LIBRARY_PATH to point to it. What you should see is just the final output:

100000 iterations completed

How do you know that anything executed on the GPU? You can set the environment variable ACC_NOTIFY to 1:

csh:   setenv ACC_NOTIFY 1
bash:  export ACC_NOTIFY=1

then run the program; it will print a line each time a GPU kernel is launched. In this case, you'll see something like:

launch kernel  file=acc_c1.c function=main line=25 device=0 grid=391 block=256

which tells you the file, function, and line number of the kernel, and the CUDA grid and thread block dimensions. You probably don't want to leave this set for all your programs, but it's instructive and useful during program development and testing.

Second Program

Our first program was pretty trivial, just enough to get a test run. Let's take on a slightly more interesting program, one that has more computational intensity on each iteration. In C, the program is:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>
#include <math.h>
#include <openacc.h>
#include <accelmath.h>

int main( int argc, char* argv[] )
{
    int n;      /* size of the vector */
    float *a;  /* the vector */
    float *restrict r;  /* the results */
    float *e;  /* expected results */
    float s, c;
    struct timeval t1, t2, t3;
    long cgpu, chost;
    int i;
    if( argc > 1 )
        n = atoi( argv[1] );
    else
        n = 100000;
    if( n <= 0 ) n = 100000;

    a = (float*)malloc(n*sizeof(float));
    r = (float*)malloc(n*sizeof(float));
    e = (float*)malloc(n*sizeof(float));
    for( i = 0; i < n; ++i ) a[i] = (float)(i+1) * 2.0f;
    /*acc_init( acc_device_nvidia );*/

    gettimeofday( &t1, NULL );
    #pragma acc kernels loop
    for( i = 0; i < n; ++i ){
        s = sinf(a[i]);
        c = cosf(a[i]);
        r[i] = s*s + c*c;
    }
    gettimeofday( &t2, NULL );
    cgpu = (t2.tv_sec - t1.tv_sec)*1000000 + (t2.tv_usec - t1.tv_usec);
    for( i = 0; i < n; ++i ){
        s = sinf(a[i]);
        c = cosf(a[i]);
        e[i] = s*s + c*c;
    }
    gettimeofday( &t3, NULL );
    chost = (t3.tv_sec - t2.tv_sec)*1000000 + (t3.tv_usec - t2.tv_usec);
    /* check the results */
    for( i = 0; i < n; ++i )
        assert( fabsf(r[i] - e[i]) < 0.000001f );
    printf( "%13d iterations completed\n", n );
    printf( "%13ld microseconds on GPU\n", cgpu );
    printf( "%13ld microseconds on host\n", chost );
    return 0;
}

It reads the first command line argument as the number of elements to compute, allocates arrays, runs a kernel that computes single-precision sine and cosine (note the sinf and cosf function names here), and compares the results to the same loop run on the host. Some details to note:

  • There is a call to acc_init() commented out; you'll uncomment that shortly.
  • There are calls to gettimeofday() to measure wall clock time on the GPU and host loops.
  • The program doesn't compare for equality; it compares against a tolerance. We'll discuss that as well.

Now let's build and run the program; you'll compare the speed of your GPU to the speed of the host. Build as you did before, and you should see messages much like you did before. You can view just the accelerator messages by replacing ‑Minfo with ‑Minfo=accel on the compile line.
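For example, assuming the source file is named acc_c2.c (the name that appears in the timing output below), the build line would look something like:

pgcc -o acc_c2.exe acc_c2.c -acc -Minfo=accel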

The first time you run this program, you'll see output something like:

       100000 iterations completed
      1363658 microseconds on GPU
         1886 microseconds on host

So what's this? Over a second on the GPU? Less than two milliseconds on the host? What's the deal?

Let's explore this a little. If I enclose the program from the first call to the timer to the last print statement in another loop that iterates three times, I'll see something more like the following:

       100000 iterations completed
      1360159 microseconds on GPU
         2086 microseconds on host
       100000 iterations completed
          527 microseconds on GPU
         1972 microseconds on host
       100000 iterations completed
          530 microseconds on GPU
         1972 microseconds on host

The time on the GPU is very long for the first iteration, then much faster after that. The reason is the overhead of connecting to the GPU. On Linux, if you run on a GPU without a display connected, it can take 1 to 1.5 seconds to make that initial connection, the first time the first kernel executes. To see where the time goes, you can set the environment variable PGI_ACC_TIME to 1 before running your program; this directs the runtime to collect the time spent in GPU initialization, data movement, and kernel execution.
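Set it the same way as ACC_NOTIFY above:

csh:   setenv PGI_ACC_TIME 1
bash:  export PGI_ACC_TIME=1

With the variable set, we'll get additional profile information: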

       100000 iterations completed
      1359584 microseconds on GPU
         2079 microseconds on host

Accelerator Kernel Timing data
acc_c2.c
  main
    32: region entered 1 time
        time(us): total=1357176 init=1356266 region=910
                  kernels=54 data=530
        w/o init: total=910 max=910 min=910 avg=910
        34: kernel launched 1 times
            grid: [391]  block: [256]
            time(us): total=54 max=54 min=54 avg=54

The timing data tells us that the accelerator region at line 32 was entered once and took a total of 1.357 seconds. Of that, 1.356 seconds was spent in initialization. The actual execution time was about 0.9 milliseconds. Of that time, 530 microseconds was spent moving data back and forth (copying the a and r arrays to and from the GPU), and only 54 microseconds was spent executing the kernel.

So, let's take the initialization out of the timing code altogether. Uncomment the call to acc_init() in your program, rebuild and then run the program. You should see output more like:

       100000 iterations completed
          913 microseconds on GPU
         1882 microseconds on host

The GPU time still includes the overhead of moving data between the GPU and host. Your times may differ, even substantially, depending on the GPU you have installed (particularly the Number of Multiprocessors reported by pgaccelinfo) and the host processor. These runs were made on a 2.6GHz Intel Nehalem.

To see some more interesting performance numbers, try increasing the number of loop iterations. The program defaults to 100000; increase this to 1000000. On my machine, I see

      1000000 iterations completed
         4504 microseconds on GPU
        40042 microseconds on host

Note the GPU time increases by about a factor of 5, less than you would expect; the host time increases by quite a bit more. That's because the host is sensitive to cache locality. The GPU has a very high bandwidth device memory; it uses the extra parallelism that comes from the 1,000,000 parallel iterations to tolerate the long memory latency, so there's no performance cliff.

So, let's go back and look at the reason for the tolerance test, instead of an equality test, for correctness. The fact is that the GPU doesn't always compute to exactly the same precision as the host. In particular, some transcendentals and trigonometric functions may be different in the low-order bit. You, the programmer, have to be aware of the potential for these differences, and if they are not acceptable, you may need to wait until the GPUs implement full host equivalence. Before the adoption of the IEEE floating point arithmetic standard, every computer used a different floating point format and delivered different precision, so this is not a new problem, just a new manifestation.

Your next assignment is to convert this second program to double precision; remember to change the sinf and cosf calls. You might compare the results to find the maximum difference. We're computing sin^2 + cos^2, which, if I remember my high school geometry, should equal 1.0 for all angles. So, you can compare the GPU and host computed values against the actual correct answer as well.
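As a hint for that exercise, here is a minimal sketch of the changed pieces, assuming the rest of the second program above stays the same; the d-prefixed names are illustrative, not part of the distributed example. It also checks the computed values against 1.0 and reports the largest deviation:

    /* add these declarations alongside the existing ones in main */
    double *da, *restrict dr;
    double maxdiff = 0.0;

    da = (double*)malloc(n*sizeof(double));
    dr = (double*)malloc(n*sizeof(double));
    for( i = 0; i < n; ++i ) da[i] = (double)(i+1) * 2.0;

    #pragma acc kernels loop
    for( i = 0; i < n; ++i ){
        double ds = sin(da[i]);   /* sin/cos instead of sinf/cosf */
        double dc = cos(da[i]);
        dr[i] = ds*ds + dc*dc;
    }
    /* sin^2 + cos^2 should be 1.0 for any angle */
    for( i = 0; i < n; ++i )
        if( fabs(dr[i] - 1.0) > maxdiff ) maxdiff = fabs(dr[i] - 1.0);
    printf( "largest deviation from 1.0: %g\n", maxdiff );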

Third Program

Here, we'll explore writing a slightly more complex program, and try some other options, such as building it to run on either the GPU, or on the host if you don't have a GPU installed. We'll look at a simple Jacobi relaxation on a two-dimensional rectangular mesh. In C, the relaxation routine we'll use is:

void
smooth( float* restrict a, float* restrict b,
        float w0, float w1, float w2, int n, int m, int niters )
{
   int i, j, iter;
   float* tmp;
   for( iter = 1; iter < niters; ++iter ){
      #pragma acc kernels loop copyin(b[0:n*m]) copy(a[0:n*m]) independent
      for( i = 1; i < n-1; ++i )
         for( j = 1; j < m-1; ++j )
             a[i*m+j] = w0 * b[i*m+j] +
                 w1*(b[(i-1)*m+j] + b[(i+1)*m+j] + b[i*m+j-1] + 
                                    b[i*m+j+1]) +
                 w2*(b[(i-1)*m+j-1] + b[(i-1)*m+j+1] + b[(i+1)*m+j-1] + 
                                      b[(i+1)*m+j+1]);
      tmp = a;  a = b;  b = tmp;
   }
}

Again, note the use of the restrict keyword on the pointers. We can build this routine as before, and we'll get a similar set of messages. This particular implementation executes a fixed number of iterations. We added some clauses to the kernels directive. We added two data clauses; the first is a copyin clause to tell the compiler to copy the array b, starting at b[0] and continuing for n*m elements, into the GPU. The second is a copy clause to tell the compiler to copy the array a, starting at a[0] and continuing for n*m elements, into the GPU, and back out from the GPU at the end of the loop. We also added the independent clause to tell the compiler that the iterations of the i loop are data-independent as well, and can be run in parallel.

We can build this program as usual, and we'll see the information messages:

smooth:
     31, Generating copyin(b[:m*n])
         Generating copy(a[:m*n])
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     32, Loop is parallelizable
     33, Loop is parallelizable
         Accelerator kernel generated
         32, #pragma acc loop gang, vector(16) /* blockIdx.y threadIdx.y */
         33, #pragma acc loop gang, vector(16) /* blockIdx.x threadIdx.x */

The compiler produces messages about the data copied into and out of the GPU. The next two lines tell us that the compiler generated two GPU binaries, one suitable for Tesla devices (NVIDIA compute capability 1.0-1.3), and a second for Fermi devices (NVIDIA compute capability 2.0). After the line specifying that an accelerator kernel was generated, the compiler gives us the execution schedule, the mapping of the loop iterations to the device parallelism. The compiler prints this using the directives that you might use if you were going to specify this schedule manually. In this case, the compiler tiles the loop into 16x16 tiles (vector(16) for both loops), and executes the tiles in parallel across the NVIDIA multiprocessors (gang for both loops). A useful exercise would be to take out the independent clause, recompile the program, and look at the schedule generated by the compiler.
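The schedule in those messages can also be written into the source explicitly. Here is a sketch of what that might look like for the loop nest in smooth, using the clause spellings from the compiler messages above; it is optional, since the compiler already chooses this schedule on its own here:

   for( iter = 1; iter < niters; ++iter ){
      /* spell out the 16x16 tiling the compiler chose automatically */
      #pragma acc kernels loop copyin(b[0:n*m]) copy(a[0:n*m]) independent gang vector(16)
      for( i = 1; i < n-1; ++i ){
         #pragma acc loop gang vector(16)
         for( j = 1; j < m-1; ++j )
             a[i*m+j] = w0 * b[i*m+j] +
                 w1*(b[(i-1)*m+j] + b[(i+1)*m+j] + b[i*m+j-1] + b[i*m+j+1]) +
                 w2*(b[(i-1)*m+j-1] + b[(i-1)*m+j+1] + b[(i+1)*m+j-1] + b[(i+1)*m+j+1]);
      }
      tmp = a;  a = b;  b = tmp;
   }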

We've provided the source for this routine along with a driver routine to call it. Try running the program with a 1000 x 1000 matrix and 50 iterations, using a call to acc_init to isolate any device initialization. On my machine, setting the environment variable PGI_ACC_TIME, I see the performance output:

Accelerator Kernel Timing data
acc_c3.c
  smooth
    31: region entered 50 times
        time(us): total=172415 init=18 region=172397
                  kernels=16849 data=143582
        w/o init: total=172397 max=3942 min=3425 avg=3447
        33: kernel launched 50 times
            grid: [63x63]  block: [16x16]
            time(us): total=16849 max=355 min=331 avg=336
acc_init.c
  acc_init
    30: region entered 1 time
        time(us): init=1372702

As before, we see that the vast majority of the GPU time (over 80%) is spent moving data to and from the GPU memory. What if, instead of moving the data back and forth for each iteration, we could move the data once and leave it there? Let's modify the program so that in the routine that calls smooth, we surround the call with a data construct:

#pragma acc data copy(b[0:n*m],a[0:n*m])
{
smooth( a, b, w0, w1, w2, n, m, iters );
}

This tells the compiler to copy both arrays to the GPU before the call, and bring the results back to the host memory after the call. Inside the function, replace the copyin and copy data clauses by a present clause:

#pragma acc kernels loop present(b[0:n*m],a[0:n*m]) independent

This tells the compiler that the data is already present on the GPU, so rather than copying the data it should just use the copy that is already present. Now the performance profile looks like:

Accelerator Kernel Timing data
acc_c3a.c
  smooth
    31: region entered 50 times
        time(us): total=17072 init=5 region=17067
                  kernels=16563 data=0
        w/o init: total=17067 max=448 min=331 avg=341
        33: kernel launched 50 times
            grid: [63x63]  block: [16x16]
            time(us): total=16563 max=354 min=323 avg=331
acc_c3a.c
  main
    126: region entered 1 time
        time(us): total=6324
                  data=4227

We see no data movement in the smooth function, just the 16.5 milliseconds for kernel execution. The data movement only happens in the main routine, and only once, taking roughly 4 milliseconds, instead of 140 as before.

But suppose we want a program that will run on the GPU when it's available, or on the host when it's not. The PGI compilers provide that functionality using the PGI Unified Binary™ technology. To enable it, we build with the target accelerator option ‑ta=nvidia,host. This generates two versions of the smooth function, one that runs on the host and one that runs on the GPU. At run time, the program will determine whether there is a GPU attached and run that version if there is, or run the host version if there is not.
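Assuming the same source file as in the profile output above (acc_c3a.c), the build line would look something like:

pgcc -o acc_c3a.exe acc_c3a.c -acc -ta=nvidia,host -Minfo

You should see compiler messages like: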

smooth:
     27, PGI Unified Binary version for -tp=nehalem-64 -ta=host
     33, 2 loop-carried redundant expressions removed with 2 operations 
         and 4 arrays
         ...
smooth:
     27, PGI Unified Binary version for -tp=nehalem-64 -ta=nvidia
     30, Loop not vectorized/parallelized: contains call
     31, Generating copyin(b[:m*n])
         Generating copy(a[:m*n])
         ...

where the printed ‑tp value depends on the host on which you are running. Now you should be able to run this on the machine with the GPU and see the GPU performance, and then move the same binary to a machine with no GPU, where the host (‑ta=host) copy will run.

By default, the runtime system will use the GPU version if the GPU is available, and will use the host version if it is not. You can manually select the host version in two ways. One way is to set the environment variable ACC_DEVICE_TYPE before running the program:

csh:   setenv ACC_DEVICE_TYPE host
bash:  export ACC_DEVICE_TYPE=host

You can go back to the default by unsetting this variable. Alternatively, the program can select the device by calling an API routine:

#include "openacc.h"
...
acc_set_device( acc_device_host );

Summary

This installment introduced the OpenACC features in the PGI Accelerator C compiler for NVIDIA GPUs and presented three simple programs. We looked at some issues you may run into with float vs. double arithmetic and with unrestricted pointers. We enabled the OpenACC directives with the ‑acc flag, used the simple accelerator profile facility via the PGI_ACC_TIME environment variable, used ‑ta=nvidia,host to generate a unified host+GPU binary, and introduced the data construct and the present clause. We hope you have a chance to try our simple examples and can start putting some simple programs on the GPU. We will follow up this introductory article with more details about OpenACC, tuning OpenACC programs, and using OpenACC in larger programs.