Technical News from The Portland Group

OpenMP on Accelerators—A Position Paper

In 2008, we introduced our directive-based PGI Accelerator C99 & Fortran programming model for GPUs, with an implementation targeting NVIDIA GPUs. We have been asked, many times, why we designed our own directives instead of using the widely used, well-understood, standard OpenMP parallel programming directives. We have also been asked whether OpenMP is going to extend its directives to support accelerators. This article attempts to address these questions.

Background

Let's review the history and origins of OpenMP. Many vendors were delivering shared memory multiprocessors from the mid-1980s through the mid-1990s. Each vendor created a mechanism to run shared memory multithreaded programs, some using libraries and some using directives. The library approach was eventually standardized into POSIX Threads, or Pthreads. Cray, IBM, Sequent, Encore, Convex, Alliant, and others each had their own set of directives for parallel computing, most of which were different ways to spell "parallel loop." An early attempt to standardize on a definition and spelling (the Parallel Computing Forum) formed in 1986 and eventually produced a design document, but disbanded in 1992 after failing to get agreement among the vendors.

Shortly after that, several vendors started a new effort, forming the OpenMP committee. It had a limited charter and support from both vendors and users, and it has been very successful. Every major computer vendor and compiler system supports OpenMP. The target architecture is a shared-memory, homogeneous multiprocessor, into which multicore processors also fit quite well. It was never intended to address or solve the general parallel programming problem: it doesn't address instruction-level parallelism, message-passing parallelism, asymmetric parallelism, automatic parallelism, vector parallelism, and so on.

One could argue, then, that OpenMP was never intended to solve the heterogeneous parallel programming problem, so why should it even try? In particular, since PGI already has an accelerator programming model, why should PGI participate in an effort to standardize on something that may at best allow other vendors to "catch up," and at worst make the PGI model incompatible or obsolete?

We believe that having a single standard is better than two or more vendor-specific standards. It's better for users, who have a chance of more easily porting programs among different vendors and systems, and it's better for the vendors, because more users will adopt a portable, standard model than a vendor-specific one, even if the vendor-specific one is superior in some ways.

What Accelerators?

What accelerators should an OpenMP accelerator model target? This is a key question. For instance, in the HPC community, the accelerators being discussed currently are GPUs, which exhibit a high degree of internal parallelism. To take advantage of these, we need to design a language that allows programmers to exploit that internal parallelism effectively; parallelism between the host and the GPU is also important, but significantly less so.

In the embedded community, accelerators are more typically signal processing devices, such as DSPs or hardware MPEG blocks. DSPs have internal parallelism, but more in the form of ILP than multiprocessor parallelism; a compiler that does software pipelining is mostly sufficient for that, and it doesn't require new parallelism directives. Hardware blocks are even more interesting: we could design a language to tell the compiler that there is a hardware block implementing some functionality, with certain inputs and outputs, so that when the program is compiled in the proper manner, the compiler could replace a call to that functionality by enabling the appropriate hardware block.

Can we design a single language that covers the two or three cases above? Or are they distinct enough that each should be approached independently?

Another dimension has to do with the memory subsystem. GPUs, for instance, mostly use a high-bandwidth memory separate from the host memory; data for GPU processing must be allocated in and moved to the GPU memory. This can create consistency problems if the data lives simultaneously in two places (host memory and GPU memory).
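
To make the mechanics concrete, the sketch below shows the low-level CUDA runtime calls (from host-side C) that someone, whether the programmer or a compiler, must issue under the covers; stage_on_gpu is simply a hypothetical helper used for illustration.

    #include <cuda_runtime.h>

    /* Hypothetical helper, for illustration only: stage an array in GPU memory.
       Afterward there are two live copies of the data, one in host memory and
       one in GPU memory, and the program must keep them consistent. */
    void stage_on_gpu(const float *host_a, float **dev_a, size_t n)
    {
        cudaMalloc((void **)dev_a, n * sizeof(float));       /* allocate in GPU memory */
        cudaMemcpy(*dev_a, host_a, n * sizeof(float),
                   cudaMemcpyHostToDevice);                   /* move the data to the GPU */
        /* ... kernels read and write *dev_a here ... */
        /* until a cudaMemcpy back to the host, the host copy is stale */
    }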

In the same vein, what if we have two or more accelerators, each with its own separate memory? Then we have a data distribution problem, along the lines of High Performance Fortran (HPF). The HPF data distribution concepts (block, cyclic, etc.) have since been reused in other languages, and OpenMP may find itself addressing the same problems.

Choosing what sorts of accelerators OpenMP should address (and what sorts it should not) is a key decision point in this process.

Can't We Just Port OpenMP?

OpenMP is prescriptive. That is, the OpenMP program tells the implementation (the compiler) what to do: start N parallel threads; spread the work of loop iterations over the threads in some fixed fashion; synchronize here; update the shared memory. OpenMP has a well-defined (formally weak) shared memory model.
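
A minimal sketch of that prescriptive style: the directive below does not merely say the loop is parallel, it tells the compiler to create a team of threads and how to divide the iterations among them.

    void scale(int n, float a, float *x)
    {
        /* Prescriptive: start a team of threads and statically assign
           contiguous chunks of the iteration space to each thread. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; ++i)
            x[i] = a * x[i];
    }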

Accelerators such as GPUs have at least two levels of parallelism: concurrency between the host and the accelerator, and parallelism on the accelerator. OpenMP has no vocabulary for this.

Current accelerators have two types of internal parallelism: SIMD and MIMD. In the NVIDIA GPUs, SIMD parallelism corresponds to the threads in a single warp or single thread block, and MIMD parallelism corresponds to the multiple thread blocks executed in parallel on different NVIDIA multiprocessors, or in multithread mode on the same multiprocessor. In something like a Larrabee or Cell, the SIMD parallelism corresponds to the extended SSE registers or 128-bit vector registers, while MIMD parallelism corresponds to the threads spread across the cores or SPEs. OpenMP has no vocabulary for this.
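
For reference, a minimal CUDA sketch of a saxpy loop shows the two levels directly: the threads of a block execute in warps on one multiprocessor (the SIMD level), while the blocks of the grid are spread across multiprocessors (the MIMD level). The kernel and launch below are illustrative only.

    __global__ void saxpy_kernel(int n, float a, const float *x, float *y)
    {
        /* threadIdx selects a lane within a block (executed in warps, SIMD-style);
           blockIdx selects the block, and blocks run MIMD-style across the
           multiprocessors. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    void saxpy(int n, float a, const float *d_x, float *d_y)
    {
        /* launch many blocks of 256 threads each */
        saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
    }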

Many (perhaps most) current accelerators have a memory separate from the host. OpenMP has no vocabulary for this very important point, either. Our experience is that the single most important factor in optimizing accelerator programs is tuning the data movement between the host and the accelerator or GPU. Exposing that data movement and allowing users to tune it is absolutely critical.

Nevertheless, there are several current efforts to implement OpenMP on GPUs or Cell or other accelerators. We explored this option, and came up with three possible methods:

Support Full OpenMP: This sounds attractive to users, but it's not. Most of OpenMP does not port nicely. There's no shared memory, so the implementation has to automatically manage the memory movement in all cases, which quite often will be very slow. There are also no critical sections and no general synchronization. The effort to support all of OpenMP would be huge, and most of that effort would go toward features that will perform badly. Users will have a bad experience, and will then tend to avoid accelerators, or at least that implementation, altogether. This is the HPF lesson: making it easy to write a really slow parallel program does not a good language make.

Break OpenMP: We know of several approaches that have gone down this route. They change some definitions, such as defining the master thread as running on the host while the worker threads run on the accelerator, making the model non-homogeneous. In one case, the compiler will ignore certain work-sharing directives and instead run a different loop in parallel across the threads, because it expects that to produce better performance. In such cases, OpenMP syntax is being used (or abused), but it's not really an OpenMP program. This is the direction for those unwilling or too timid to create their own language.

Limited OpenMP Support: In this option, you only support those features that perform well on the accelerator. This simplifies the implementation quite a bit. Many OpenMP programs don't port at all, but those that do might achieve some good performance.

But none of the above approaches even try to take advantage of the extra power of the accelerator, such as the SIMD/MIMD parallelism or the separate memory. So these approaches miss the potential performance benefits of tuning for those features, or of avoiding their weaknesses.

A Fourth Option

PGI took a fourth option: we defined a new model focused on the parallelism in accelerators, exposing the separate memory and the multiple styles of parallelism. For the memory, we wanted a model that allows for an accelerator with a separate memory, in which the programmer can tune where the relevant accelerator data gets allocated and when it is transferred and updated in either direction. But we also wanted to avoid having to allocate and move data on a system where the accelerator actually shares memory with the host. This turned out to be surprisingly difficult to design.

Moreover, we took a more descriptive approach than does OpenMP. OpenMP, for instance, never really says that a loop is parallel; instead, it says to spread the iterations over the parallel threads. The PGI Accelerator compiler detects, or allows the user to declare, that a loop is parallel; detecting and exploiting the parallelism is a step distinct from scheduling the iterations onto the parallel devices, threads, or processors.
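
Putting the pieces together, the sketch below is written in the spirit of the PGI Accelerator directives; the clause spellings are simplified and shown for illustration. The data clauses state which arrays are copied to accelerator memory and which are copied back, and the loop directive asserts only that the iterations are independent, leaving the mapping onto the MIMD and SIMD parallelism of the device to the compiler (compare this with the CUDA form of the same loop shown earlier).

    #define N 4096

    void saxpy(float a, float x[N], float y[N])
    {
        /* Data clauses (simplified, illustrative spellings): x is copied to
           accelerator memory on entry; y is copied in and copied back out. */
        #pragma acc region copyin(x) copy(y)
        {
            /* Descriptive: assert that the iterations are independent;
               the compiler chooses how to schedule them on the device. */
            #pragma acc for independent
            for (int i = 0; i < N; ++i)
                y[i] = a * x[i] + y[i];
        }
    }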

Conclusion

Should there be an effort to standardize a programming model for accelerators, GPUs, and heterogeneous systems? Yes, we believe so.

Doesn't OpenCL already solve this problem? No, OpenCL is an important part of the puzzle, but it's very low level, requiring a great deal of target-specific tuning. We should also have a higher-level solution, to play the role relative to OpenCL that OpenMP plays relative to Pthreads.

Does the OpenMP committee already have the background and experience necessary to produce a standard in a short time? No, unfortunately, we believe there is very little relevant experience in the general community in accelerator programming, and the OpenMP committee similarly has little such experience, at this point.

Will a simple port of the existing OpenMP directives to accelerators work? No, this would not produce the kind of performance that we desire and expect.

So, why should OpenMP take this effort on? As mentioned above, having a single standard is better than several vendor-specific standards. OpenMP has an existing committee structure to produce such a standard, and a standard that can be explained using some of the same terms as the existing OpenMP language would be best. We've worked with OpenMP for five years now, and believe the committee process has done a good job of producing a well-designed, solid language that stands up to criticism. Moreover, whatever standard gets created will need to work within the existing OpenMP framework, for instance to handle multiple accelerators executing one per host thread in an OpenMP multithreaded program.

In fact, there is now an active OpenMP subcommittee looking at extending the OpenMP directive language to support accelerators. The programming model being proposed borrows heavily from and is very similar to the PGI Accelerator model, and programs written in the PGI Accelerator model will be easily migrated to the proposed OpenMP directives. We expect it will take some time to conclude the standardization process, but the final product will be well designed and comprehensive.