Technical News from The Portland Group

OpenACC Kernels and Parallel Constructs

My previous article introduced the PGI implementation of OpenACC in the PGI Accelerator™ compilers, comparing it to the earlier PGI Accelerator programming model, and pointing out a few important differences. One key difference is that the OpenACC API has two compute constructs, the kernels construct and the parallel construct. I promised a follow-up article describing the differences between the two and use cases for each. To simplify the discussion, I will focus this article only on OpenACC gang and vector parallelism. OpenACC also supports worker parallelism, but that will be the subject of yet another article. I will also ignore any data movement by assuming that all required data is present on the device.

The Kernels Construct

The kernels construct is derived from the PGI Accelerator model region construct. The loop nests in a kernels construct are converted by the compiler into (hopefully) parallel kernels that run efficiently on your GPU. There are three steps to this process. The first is to identify the loops that can be executed in parallel. The second is to map that abstract loop parallelism onto concrete hardware parallelism. For an NVIDIA CUDA GPU, this means mapping a parallel loop onto grid-level parallelism (blockIdx) or thread-level parallelism (threadIdx). In OpenACC terms, gang parallelism maps to grid-level parallelism, and vector parallelism maps to thread-level parallelism. The compiler may map a single loop across multiple levels of parallelism using strip-mining. Finally, in step three the compiler has to generate and optimize the actual code to implement the selected parallelism mapping.

Within a kernels construct, the compiler uses classical automatic parallelization technology to identify parallel loops. This analysis can be augmented by user directives and clauses, such as the loop independent directive in OpenACC. The PGI compiler presents the results of its analysis in the compiler informational messages, enabled with the -Minfo flag. You'll see messages like Loop is parallelizable for a loop that the compiler has determined could be executed in parallel. When the compiler can't determine that a loop can be run in parallel, the PGI compiler will generate an informational message, such as Loop carried dependence of '*(a)' prevents parallelization. This particular message arises from a store through a pointer in C, where the pointer doesn't have the restrict attribute, and so the store may conflict with a load or store through another pointer in the same loop.
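
For example, a loop like the following can trigger that message (a hypothetical sketch, not PGI's actual diagnostic output; the function names are placeholders); adding the restrict qualifier tells the compiler the pointers cannot alias, removing the obstacle to parallelization:

```c
#include <stddef.h>

/* Without restrict, the compiler must assume a might alias b, so the
   store to a[i] could conflict with the load of b[i] in some other
   iteration -- the loop cannot be proven parallel. */
void scale(float *a, float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = 2.0f * b[i];
}

/* With restrict, the pointers are asserted not to alias, and the
   compiler is free to execute the iterations in parallel. */
void scale_restrict(float *restrict a, float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = 2.0f * b[i];
}
```

Both versions compute the same result; the difference is only in what the compiler can prove about them.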

The compiler then uses a second step to map the loop parallelism to the target device. The PGI compiler uses a model of the target device to choose which loops to map to vector parallelism and which to map to gang parallelism. For instance, a loop with many stride-1 array references would more naturally map to vector parallelism for current machines; for NVIDIA GPUs, this mapping tends to produce coalesced loads when executed. Again, the user can augment or override the compiler analysis through directives or clauses. A directive can be as simple as loop gang, telling the compiler to map this loop across gangs (the NVIDIA grid). Or, it can include more specific information such as loop vector(64), telling the compiler to map this loop across the vector parallelism (NVIDIA threads) with a vector length (thread block size) of 64. The danger with including detailed mappings in a program is that the mapping may work well with the current generation of accelerators, but may be wrong when moving to another system or future generation from the same vendor.
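
Such directives might be applied to a simple stencil as in the following C sketch (the function and array names are placeholders; on a compiler without OpenACC support the pragmas are ignored and the loops run sequentially):

```c
/* Map the outer loop across gangs (the NVIDIA grid) and the inner
   loop across vector parallelism (NVIDIA threads) with a vector
   length (thread block size) of 64. */
void smooth(int n, int m, float a[n][m], float b[n][m])
{
    #pragma acc kernels
    {
        #pragma acc loop gang
        for (int j = 1; j < n - 1; ++j) {
            #pragma acc loop vector(64)
            for (int i = 1; i < m - 1; ++i)
                a[j][i] = 0.25f * (b[j][i-1] + b[j][i+1]
                                 + b[j-1][i] + b[j+1][i]);
        }
    }
}
```

The vector(64) value here is exactly the kind of detailed mapping that may need revisiting when the target device changes.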

The third step is code generation and optimization. This is pretty standard compiler technology, though there are some analyses and optimizations that an OpenACC compiler can perform that a compiler for a lower-level language such as CUDA or OpenCL cannot. For an NVIDIA GPU target, the OpenACC compiler sees both the code that a single thread will execute and the relationship of that code to all the threads in the same thread block and across the grid. A CUDA compiler doesn't know the launch configuration, which makes it very difficult to perform most analyses or optimizations that depend on behavior across different threads.

Each loop nest in the kernels construct is compiled and launched separately. In CUDA terms, each loop nest becomes a separate CUDA kernel. In particular, this means that the code for the first loop nest will complete before the second loop nest begins execution.

The Parallel Construct

The parallel construct is derived from the OpenMP parallel construct. An OpenMP parallel construct creates a number of parallel threads that immediately begin executing the body of the parallel construct redundantly. When a thread reaches a work-sharing loop, that thread will execute some subset of the loop iterations, depending on the scheduling policy as specified by the program or by the runtime. At the end of the loop, the threads synchronize at a barrier unless the user has explicitly specified that no such barrier is required.

In the same way, the OpenACC parallel construct creates a number of parallel gangs that immediately begin executing the body of the construct redundantly. When a gang reaches a work-sharing loop, that gang will execute a subset of the loop iterations, depending on the scheduling policy. One very important difference between the OpenACC parallel construct and OpenMP is that there is no barrier at the end of a work-sharing loop in a parallel construct. This is effectively the same behavior as with OpenMP with the nowait clause on all work-sharing loops.

In a work-sharing model, like the OpenMP or OpenACC parallel constructs, no compiler parallelization analysis is required. The OpenMP threads or OpenACC gangs execute code redundantly until they reach a work-sharing construct, then split the work of that construct across the threads or gangs. This means that you, the programmer, need to include a work-sharing loop in the parallel construct, or each gang will execute all the code within the construct redundantly; perfect parallel execution, perhaps, but not very efficient.

Gang-level parallelism is exploited with work-sharing, but vector parallelism is not. Vector parallelism can be exploited in any of four different ways. The programmer can add a loop vector directive just before a loop, telling the compiler to generate vector code for that loop. The compiler can use classical vectorization analysis to automatically identify loops that can be compiled in vector mode, and generate vector code for those. Failing that, if there is only a single parallel loop, the work-sharing loop, the compiler can split the loop iterations of that loop across the gangs, then generate vector code for the iterations executed by each gang. There is a fourth option, which is explained below.

The code generation and optimization for a parallel construct is essentially the same as for the kernels construct. A key difference is that unlike a kernels construct, the entire parallel construct becomes a single target parallel operation; in CUDA terms, the parallel construct becomes a single CUDA kernel.

Single Loops

Let's compare the kernels and parallel constructs through a few simple examples. For single loops, it's pretty hard to distinguish the behavior of the two. The simple vector add below is written using the kernels loop construct, which tells the compiler to map the body of the following loop into an accelerator kernel, automatically parallelizing the loop and mapping the parallelism to the accelerator.

    #pragma acc kernels loop
        for( i = 0; i < n; ++i )
            a[i] = b[i] + c[i];

This is exactly the same as the following code, which tells the compiler to map the loops (the single loop, in this case) in the kernels construct to the accelerator:

    #pragma acc kernels
    {
        for( i = 0; i < n; ++i )
            a[i] = b[i] + c[i];
    }

If a, b and c are C pointers, then the compiler may not be able to disambiguate the pointer targets, and may not be able to generate parallel code. That can be resolved by adding the restrict attribute to the a pointer declaration, or by compiling with -Msafeptr (though we recommend using this flag very carefully), or by adding the independent clause to the directive:

    #pragma acc kernels loop independent
        for( i = 0; i < n; ++i )
            a[i] = b[i] + c[i];

In any case, once the parallelism is identified, the compiler will determine how to map this to the accelerator. For an NVIDIA GPU, the compiler may strip-mine the loop into chunks of 256 iterations, mapping the 256 iterations of a chunk in vector mode across the threads of a CUDA thread block, and map the n/256 chunks in gang mode across the thread blocks of the CUDA grid. The compiler would choose to map consecutive iterations (i and i+1), which refer to contiguous array elements (a[i] and a[i+1]), to adjacent CUDA threads in the same thread block, to optimize for coalesced memory accesses.
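
The strip-mining transformation can be sketched on the host as a pair of loops, where the outer loop corresponds to gangs (CUDA thread blocks) and the inner loop to vector parallelism (CUDA threads); the chunk size of 256 is just the example value from above, and the function name is a placeholder:

```c
/* Original loop:  for (i = 0; i < n; ++i) a[i] = b[i] + c[i];
   Strip-mined form: the outer loop runs one iteration per chunk
   (mapped to gangs / thread blocks), the inner loop runs the up-to-256
   iterations of a chunk (mapped to vector lanes / threads). */
void vecadd_stripmined(int n, float *restrict a,
                       const float *restrict b, const float *restrict c)
{
    for (int is = 0; is < n; is += 256) {            /* gang loop   */
        int ie = (is + 256 < n) ? is + 256 : n;      /* chunk bound */
        for (int i = is; i < ie; ++i)                /* vector loop */
            a[i] = b[i] + c[i];
    }
}
```

On the GPU, of course, both loops are replaced by the hardware iteration over blocks and threads; the sketch only shows how the iteration space is partitioned.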

The same vector add written with the parallel loop construct tells the compiler two things: to map the body of the following loop as an accelerator kernel, and also to divide the loop iterations of that loop in work-sharing mode across the gangs.

    #pragma acc parallel loop
        for( i = 0; i < n; ++i )
            a[i] = b[i] + c[i];

When targeting NVIDIA GPUs, if the compiler only maps the loop iterations across CUDA thread blocks, the program would only use one CUDA thread in each thread block. That would be an unforgivable waste of resources. Instead, the compiler will attempt to determine if it can use vector instructions to execute the body of the loop within each gang, using automatic vectorization analysis. You may ask whether this analysis is even necessary, because you, the programmer, have already identified this loop as a parallel loop. However, you haven't, really. You've identified this loop as a work-sharing loop, that the iterations of the loop should be split across the gangs. Such work-sharing is certainly legal when the loop iterations are data independent, but you may have specified work-sharing on a loop where the iterations are not truly data independent. This may have been a programming error, or may have been really clever parallel programming, or may be some kind of chaotic relaxation, but in any case vectorizing the loop iterations would be inappropriate.

Another choice the compiler can make at this time is to implement the gangs with CUDA threads instead of CUDA thread blocks. This is a single loop, and the launch configuration for the generated kernel depends only on the loop limits and body of this loop. The directive doesn't specify how many gangs to start, so in this simple case the compiler can choose to start up to n gangs, generating the code so each gang runs on a single CUDA thread. This example shows that the mapping between gangs and thread blocks, and between vectors and threads, is not strict. Why is this legal, when I just said that vectorizing the loop body may not be legal? It's a subtle difference that will become a little clearer when we look at multiple loops in the same parallel construct.

The example just above is actually identical in behavior and meaning to the following, which is a parallel construct that contains a single loop preceded by a loop directive:

    #pragma acc parallel
    {
        #pragma acc loop
        for( i = 0; i < n; ++i )
            a[i] = b[i] + c[i];
    }

However, if you forget to put in the loop directive, you get a completely different program:

    #pragma acc parallel
    {
        for( i = 0; i < n; ++i )
            a[i] = b[i] + c[i];
    }

Now you have redundant execution of the loop by each gang. This is another waste of resources, and probably not what you wanted in the first place.

Nested Loops

Our second example is two loops, tightly nested; for this example, I'll use Fortran. Let's start again with the kernels construct:

    !$acc kernels loop
    do j = 1, m
        do i = 1, n
            a(j,i) = b(j,i) * alpha + c(i,j) * beta
        enddo
    enddo

According to Fortran storage rules, the j loop is the stride-1 index for the references to a and b, but the i loop is the stride-1 index for the c reference. The parallelism analysis is much simpler for Fortran, because false aliases between array references are much less frequent, and the compiler can determine that both loops are parallelizable. In this program, the compiler also finds that the outer j loop is used as the stride-1 index for more arrays, so it may choose to map the j loop to vector parallelism. The compiler may then choose to map the i loop across gangs.

However, the compiler has a little more freedom here. It may choose to use a two-dimensional grid, splitting the j loop iterations across both vector and gang modes, and the i loop iterations across a second gang dimension. It may even choose to use a two-dimensional thread block, compiling both the j and i loops in vector mode, then using a one- or two-dimensional grid. As mentioned earlier, the compiler is guided by a target machine model, and it tries to optimize performance by looking at parallelism and memory access patterns.

If I compile this loop with the 12.8 PGI compilers (pgfortran -acc -Minfo=acc) I get the following information messages, where lines 11 and 12 are the two do loops:

     11, Loop is parallelizable
     12, Loop is parallelizable
         Accelerator kernel generated
         11, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
         12, !$acc loop gang ! blockidx%y

This tells me that the compiler found both loops to be fully parallelizable, and that the outer j loop was mapped to vector mode (aligned with CUDA Fortran threadidx%x) with a vector length (thread block size) of 128. Both loops were mapped to gang parallelism, again with the j loop aligned to CUDA Fortran blockidx%x, to avoid partition camping by mapping the stride-1 loop to the x dimension.

I can write this loop with the parallel construct, as follows:

    !$acc parallel loop
    do j = 1, m
        do i = 1, n
            a(j,i) = b(j,i) * alpha + c(i,j) * beta
        enddo
    enddo

Note that I have written this to do work-sharing on the j loop, which is the stride-1 loop for most of the array references. To get vector parallelism, the compiler has the same four options as before: guidance by a user loop vector directive, automatic vectorization on an inner loop, vectorization of the work-shared loop, or mapping the gang iterations to CUDA threads. In this case, the compiler can easily vectorize the inner loop, though it's not necessarily the best choice for vector execution. The compiler could also vectorize the j loop iterations within a gang, though then the i loop would run sequentially within each gang.

Alternatively, I could write this loop as:

    !$acc parallel
    do j = 1, m
        !$acc loop
        do i = 1, n
            a(j,i) = b(j,i) * alpha + c(i,j) * beta
        enddo
    enddo

Now the work-sharing loop is the i loop, and the compiler can vectorize either the j loop or the inner i loop iterations on a single gang.

Or I could interchange the loops, putting the work-sharing i loop outermost, as:

    !$acc parallel loop
    do i = 1, n
        do j = 1, m
            a(j,i) = b(j,i) * alpha + c(i,j) * beta
        enddo
    enddo

This is the more likely original program, anyway, because we have been trained by decades of programming for vector operations, SIMD operations, and cache locality to put the stride-1 indices on the innermost loops. In this case, the compiler will work-share the outer i loop iterations across gangs, and would most likely vectorize the inner j loop iterations.

Note that there is no way to specify multidimensional vector or gang parallelism using the parallel construct. You can use the collapse clause, which approaches that behavior, but that's a subject for yet another article.
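
For reference, the collapse clause fuses the iteration spaces of tightly nested loops into a single one, which the compiler can then work-share as one loop; a minimal C sketch (the function name and bounds are placeholders) looks like:

```c
/* collapse(2) fuses the i and j iteration spaces into a single n*m
   iteration space that is work-shared across the gangs and vector
   lanes as one loop.  On a compiler without OpenACC support, the
   pragma is ignored and the loops run sequentially. */
void scale2d(int n, int m, float a[n][m], float s)
{
    #pragma acc parallel loop collapse(2)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[i][j] *= s;
}
```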

Non-tightly Nested Loops

For this example, I'll use a sparse matrix-vector loop, in this case using the compressed sparse-row (CSR) representation. Using the kernels construct, this looks like:

    #pragma acc kernels loop
    for( i = 0; i < nrows; ++i ){
        double val = 0.0;
        int nstart = rowindex[i];
        int nend = rowindex[i+1];
        #pragma acc loop vector reduction(+:val)
        for( n = nstart; n < nend; ++n )
            val += m[n] * v[colndx[n]];
        r[i] = val;
    }

Note the loop vector reduction for the inner loop. In this loop, the compiler would generate gang parallelism for the outer loop and vector parallelism for the inner loop. Generally, the inner loop has relatively few iterations, so we want the outer loop to be parallel across the big dimension in the hardware. The first few statements in the outer loop are executed in scalar mode for each gang.

Without the loop directive for the inner loop, the compiler may resort to autovectorization:

    #pragma acc kernels loop
    for( i = 0; i < nrows; ++i ){
        double val = 0.0;
        int nstart = rowindex[i];
        int nend = rowindex[i+1];
        for( n = nstart; n < nend; ++n )
            val += m[n] * v[colndx[n]];
        r[i] = val;
    }

In this case, the compiler may choose to run the outer loop across both gangs and vectors, or may choose to automatically vectorize the inner loop.

The same behavior occurs with the parallel construct:

    #pragma acc parallel loop
    for( i = 0; i < nrows; ++i ){
        double val = 0.0;
        int nstart = rowindex[i];
        int nend = rowindex[i+1];
        #pragma acc loop vector reduction(+:val)
        for( n = nstart; n < nend; ++n )
            val += m[n] * v[colndx[n]];
        r[i] = val;
    }

Here, no compiler parallelization analysis is required; the compiler will work-share the iterations of the outer i loop across gangs, and vectorize the inner n loop with the sum reduction. You can leave out the loop vector reduction directive, depending on autovectorization to find the parallelism, as before, but you should check the compiler informational messages.

Adjacent Loops

The final example has two adjacent, non-nested loops in the same construct. For a kernels construct, each of the adjacent loops will be a separate parallel operation or kernel.

    !$acc kernels
    do i = 1, n
        a(i) = mob(i)*charge(i)
    enddo
    do i = 2, n-1
        c(i) = 0.5*(a(i-1) + a(i+1))
    enddo
    !$acc end kernels

The two loops are analyzed, scheduled, and compiled independently. There's not much going on in this trivial example, but the values assigned to a(i) in the first loop are in fact used in the second loop. In a kernels construct, the first loop will complete before the second loop starts, so these dependences are preserved. Given that we are not looking at data movement, this is effectively the same as:

    !$acc kernels loop
    do i = 1, n
        a(i) = mob(i)*charge(i)
    enddo
    !$acc kernels loop
    do i = 2, n-1
        c(i) = 0.5*(a(i-1) + a(i+1))
    enddo

Writing this in a parallel construct is quite different, however:

    !$acc parallel
    !$acc loop
    do i = 1, n
        a(i) = mob(i)*charge(i)
    enddo
    !$acc loop
    do i = 2, n-1
        c(i) = 0.5*(a(i-1) + a(i+1))
    enddo
    !$acc end parallel

This tells the compiler to fire up some number of gangs, to work-share the iterations of the first loop across those gangs, and then to work-share the iterations of the second loop across the same gangs. There is no guarantee that the same loop iteration, that is, the same value of i, will be executed by the same gang in both loops. In fact, it's likely that some values of i will be executed by different gangs in the first and second loops. Moreover, there's no synchronization between the first and second loop, so there's no guarantee that the assignment to a(i) from the first loop will be complete before its value is fetched by some other gang for the assignment in the second loop. This is the same problem we would have with an OpenMP parallel construct containing two loops, with a nowait clause on the first loop. In both cases, the compiler is doing what the programmer said to do, and that's probably not what the programmer wanted.
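
The OpenMP analogue of this hazard can be sketched in C as follows (a hypothetical example; with the nowait clause, a thread may read a[i-1] or a[i+1] in the second loop before another thread has written them in the first):

```c
/* With nowait on the first loop, threads proceed into the second loop
   without waiting, so c[i] may read a[i-1] or a[i+1] before the thread
   that owns those iterations has written them -- the same hazard as
   the OpenACC parallel construct above.  Removing nowait restores the
   implicit barrier and the required ordering.  On a compiler without
   OpenMP support, the pragmas are ignored and the loops run in order. */
void update(int n, float *restrict a, float *restrict c,
            const float *restrict mob, const float *restrict charge)
{
    #pragma omp parallel
    {
        #pragma omp for nowait          /* hazard: no barrier here */
        for (int i = 0; i < n; ++i)
            a[i] = mob[i] * charge[i];
        #pragma omp for
        for (int i = 1; i < n - 1; ++i)
            c[i] = 0.5f * (a[i-1] + a[i+1]);
    }
}
```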

The same problem occurs when the dependence between the first and second loop is because of a reduction. We might scale a vector by the sum of its values as:

    sum = 0.0
    !$acc kernels
    !$acc loop reduction(+:sum)
    do i = 1, n
        sum = sum + v(i)
    enddo
    do i = 1,n
        v(i) = v(i) / sum
    enddo
    !$acc end kernels

With a kernels construct, the summation in the first loop completes before the second loop starts. The result can then be used in the second loop.

If we write this using a parallel construct, the dependence between the first and second loop is again broken:

    sum = 0.0
    !$acc parallel
    !$acc loop reduction(+:sum)
    do i = 1, n
        sum = sum + v(i)
    enddo
    !$acc loop
    do i = 1,n
        v(i) = v(i) / sum
    enddo
    !$acc end parallel

The acc loop directives are needed here to implement work-sharing of the iterations across the parallel gangs. However, a reduction across gangs isn't completed until the end of the parallel construct. The example above is exactly equivalent to:

    sum = 0.0
    !$acc parallel reduction(+:sum)
    !$acc loop
    do i = 1, n
        sum = sum + v(i)
    enddo
    !$acc loop
    do i = 1,n
        v(i) = v(i) / sum
    enddo
    !$acc end parallel

In fact, most tutorials now recommend against putting the reduction clause on an acc loop directive in a parallel construct when the loop is being work-shared across gangs, because of the potential confusion about when the result will be available. Instead, put the reduction clause on the parallel construct itself.

For the reasons shown here, we recommend that when you first start using the OpenACC parallel construct, you use parallel loop to restrict the parallelism to a single loop, until you have more experience and can avoid the pitfalls shown in this section.

Summary

The OpenACC kernels and parallel constructs both aim to solve the same problem: identifying loop parallelism and mapping it to the machine parallelism. The kernels construct is more implicit, giving the compiler more freedom to find and map parallelism according to the requirements of the target accelerator. The parallel construct is more explicit, and requires more analysis by the programmer to determine when it is legal and appropriate.

Both constructs allow for automatic vectorization within the loops. The actual behavior may depend on the compiler you are using. Naturally, the behavior we show here is for the PGI compiler suite. As the PGI Accelerator compilers mature, we continue to train them to make better decisions for the specific targets.