OpenACC is designed as a parallel programming model that can deliver high performance on a wide range of systems, including accelerated systems with GPUs, multicore CPUs, and manycore processors. Until recently, PGI has focused its OpenACC development efforts on the NVIDIA Tesla and AMD Radeon GPU targets. Performance on these two different GPUs is comparable using OpenACC, but until now the performance portability of OpenACC on conventional multicore microprocessors has not been demonstrated. That changes with the latest release of the PGI Accelerator compilers.

The PGI 15.10 compilers allow you to generate parallel code from your OpenACC programs for Intel and AMD multicore processors. This feature is enabled with a new option to the -ta (target accelerator) command line flag, -ta=multicore. Adding -ta=multicore to the command line tells the compiler to generate parallel multicore code for OpenACC compute regions, instead of the default of generating parallel GPU kernels. The parallel multicore code will execute in much the same fashion as if you had used OpenMP omp parallel directives instead of OpenACC compute regions.
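
As a simple illustration (the loop and variable names here are hypothetical, not taken from any particular application), a compute region like the first loop below, built with -ta=multicore, behaves much like the OpenMP loop that follows it:

    // OpenACC version: with -ta=multicore, the gangs run as CPU threads
    #pragma acc parallel loop
    for(i=0; i<n; ++i) y[i] = a*x[i] + y[i];

    // Roughly equivalent OpenMP version on a multicore CPU
    #pragma omp parallel for
    for(i=0; i<n; ++i) y[i] = a*x[i] + y[i];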

This feature raises a number of questions:

  • How do I use -ta=multicore?
  • Why should I use -ta=multicore?
  • Does data still get copied for the OpenACC data clauses?
  • What is the performance, relative to OpenMP?
  • Can this be used with -ta=tesla or -ta=radeon to spread work across both a GPU and the CPU cores?
  • Can this be used with -ta=tesla or -ta=radeon to create a single unified binary?
  • How does -ta=multicore compare to OpenMP?
  • Does -ta=multicore make OpenMP less important?
  • Can -ta=multicore be used on the upcoming Knights Landing Intel Xeon Phi processor?
  • Do I need a PGI Accelerator license to use -ta=multicore?
  • How does this relate to the -ta=host option?
  • Will this work on Windows and OS X?
  • Are there limitations in the 15.10 release?
  • What can I expect in the future?
  • When will OpenACC and OpenMP merge?

The rest of this article will address these questions.

How Do I Use -ta=multicore?

It's as simple as it looks. Add the -ta=multicore flag to your pgfortran, pgc++ or pgcc command line, for both the compile and link steps. You can also specify -acc, which enables the OpenACC directives, but the -ta flag implies -acc, so it's not strictly necessary. Adding -Minfo or -Minfo=accel will enable PGI compiler feedback messages, giving details about the parallel code generated, such as:

    ninvr:
         59, Loop is parallelizable
             Generating Multicore code
             59, #pragma acc loop gang
    pinvr:
         90, Loop is parallelizable
             Generating Multicore code
             90, #pragma acc loop gang

You can control how many threads the program will use to run the parallel compute regions with the environment variable ACC_NUM_CORES. The default is to count how many cores are available on the system. For Linux targets, the runtime will only count physical cores (not hyper-threaded logical cores), and launch that many threads. The OpenACC gang-parallel loops will be run in parallel across the threads. If you have an OpenACC parallel construct with a num_gangs(200) clause, the runtime will take the minimum of the num_gangs argument and the number of cores on the system, and launch that many threads. That avoids the problem of launching hundreds or thousands of gangs, which makes sense on a GPU but would overload a multicore CPU.
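
As an illustration (the source file, loop, and array names here are hypothetical), the num_gangs clause below does not force 200 threads on the multicore target; the runtime clamps the thread count to the number of cores, or to the thread count you request with ACC_NUM_CORES:

    /* Compile with something like: pgcc -ta=multicore -Minfo=accel example.c
       Optionally control the thread count at run time: export ACC_NUM_CORES=8 */
    #pragma acc parallel loop num_gangs(200)
    for(i=0; i<n; ++i)
        a[i] = b[i] + c[i];
    /* On a 16-core system this launches min(200,16) = 16 threads, not 200. */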

Why Use -ta=multicore?

This gets to the basic reason why OpenACC was created four years ago. Our goal is to have a single programming model that will allow you to write a single program that runs with high performance in parallel across a wide range of target systems. Until now we have been developing and delivering OpenACC targeting NVIDIA Tesla and AMD Radeon GPUs, but the performance-portability story depends on being able to demonstrate the same program running with high performance in parallel on non-GPU targets, and in particular on a multicore host CPU. So, the first reason to use OpenACC with -ta=multicore is if you have an application that you want to use on systems with GPUs, and on other systems without GPUs but with multicore CPUs. This allows you to develop your program once, without having to include compile-time conditionals (ifdefs) or special modules for each target, and without the increased development and maintenance cost they bring.

Even if you are only interested in GPU-accelerated targets, you can do parallel OpenACC code development and testing on your multicore laptop or workstation without a GPU. This can separate algorithm development from GPU performance tuning. Debugging is often easier on the host than with a heterogeneous binary containing both host and device code.

Does Data Get Copied for -ta=multicore?

Short answer: no.

The longer answer is that in the OpenACC execution model, the multicore CPU is treated like an accelerator device that shares memory with the initial host thread. With a shared-memory device, most of the OpenACC data clauses (copy, copyin, copyout, create) are ignored, and the accelerator device (the parallel multicore) uses the same data as the initial host thread. Similarly, update directives and most OpenACC data API routines will not generate data allocation or movement. Other data clauses are still honored, such as private and reduction, which may require some dynamic memory allocation and data movement, but no more than the corresponding OpenMP data clauses.

When using OpenACC with a GPU, data gets copied from the system memory to device memory (and back). The user is responsible for keeping the two copies of data coherent, as needed. When using OpenACC on a multicore CPU, there is only one copy of the data, so there is no coherence problem. However, the GPU-OpenACC program can produce different results than a multicore-OpenACC program, if the program depends on the parallel compute regions updating a different copy of the data than the sequential initial host thread regions. Consider the following code fragment:

    #pragma acc data create(a[0:n]) present(x[0:n], b[0:n])
    {
        // following loop executed on device
        #pragma acc parallel loop
        for(i=0;i<n;++i) a[i] = b[i];
        // following loop executed on host
        for(i=0;i<n;++i) a[i] = c[i];
        // following loop executed on device
        #pragma acc parallel loop
        for(i=0;i<n;++i) x[i] = a[i];
        ...
    }

On a GPU, the above code fragment allocates a copy of the array a on the device. It then fills in the device copy and the host copy with different values. The last loop will get the values from the device copy of a, so it's equivalent to x[i]=b[i]. When compiled for a multicore, the first two loops are both executed on the CPU, the first in parallel across all the multicore threads and the second sequentially with a single thread. Both loops update the same copy of a, and the last loop will be equivalent to x[i]=c[i].

What is the Performance of OpenACC Relative to OpenMP?

The PGI OpenACC multicore runtime uses essentially the same code generation and thread management that our compilers have used for OpenMP and autoparallelization for the past 20 years. Our experiments show that the OpenACC multicore performance is equivalent to the same loop(s) running in parallel with OpenMP.

The more pertinent question is whether an OpenACC version of a program will reach the same (or better or worse) performance than an OpenMP version of a program. A typical large OpenMP application will have a single outer parallel construct in an outer routine, then a number of explicit or implicit barrier operations within the program. A typical large OpenACC application will have a larger number of parallel constructs, partly because OpenACC has no barrier across all parallel gangs. For some programs, the OpenMP multicore program can have better performance because there is less thread creation and coordination. For other programs, the OpenACC multicore program can have better performance, because there is less redundant execution and even less synchronization.
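
To illustrate the structural difference, here is a rough sketch (f and g are hypothetical functions, not from any particular application):

    // Typical OpenMP structure: one outer parallel region containing several
    // worksharing loops, each with an implicit barrier at its end.
    #pragma omp parallel
    {
        #pragma omp for
        for(i=0; i<n; ++i) a[i] = f(i);      // implicit barrier here
        #pragma omp for
        for(i=0; i<n; ++i) b[i] = g(a[i]);   // implicit barrier here
    }

    // Typical OpenACC structure: separate compute constructs, since there is
    // no barrier across all gangs within a single construct.
    #pragma acc parallel loop
    for(i=0; i<n; ++i) a[i] = f(i);
    #pragma acc parallel loop
    for(i=0; i<n; ++i) b[i] = g(a[i]);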

Can this Spread Work Across Both a GPU and Multicore?

Short answer: no.

The longer answer is that this feature treats the multicore like an accelerator device. Spreading the work across a GPU and the multicore is like spreading the work across multiple GPUs, that is, across multiple devices. There are many issues with parallel execution on a single device, but spreading work across multiple devices, especially heterogeneous devices, is a particularly difficult problem. With current and foreseeable devices, the problems are as much about data as about compute: data distribution and replication across device memories, data coherence, work distribution, load balancing, and more. The OpenACC technical committee is looking at how to make multiple device support more natural, but that will take some significant work and creativity.

Can this be used to Create a Unified Binary for GPU and Multicore?

The current release does not support a unified -ta=tesla,multicore or -ta=radeon,multicore binary. We plan to support this feature in a future release, allowing you to build a single binary that will run in parallel on a GPU when one is present, and run in parallel on the multicore otherwise. This could be used on a cluster where some of the nodes are GPU-accelerated and other nodes are not, for instance. Alternatively, you can build a single binary to deliver to users or customers where some have GPU-accelerated systems and others do not.

Can I use both OpenACC with -ta=multicore and OpenMP?

The PGI compilers support OpenACC compute constructs inside OpenMP parallel regions when targeting the OpenACC code for a GPU, whether the parallel OpenMP threads are sharing a GPU or each using a different GPU. However, the parallel code generated for -ta=multicore is very similar to the parallel code generated for OpenMP parallel loops. If you are using OpenMP to distribute work across the CPU cores, there is little need for OpenACC to also spread work across the same cores. In fact, generating too many threads will oversubscribe the cores and could even slow the program down. This release does not support OpenACC compute constructs targeted at a multicore inside OpenMP parallel regions.
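
For the supported GPU case, a minimal sketch of OpenMP threads each driving a different GPU might look like the following (ngpus, n, and the array a are assumed to be set up elsewhere, and the work in the loop is purely illustrative):

    #include <openacc.h>
    #include <omp.h>

    // Each OpenMP thread selects its own GPU (assuming 0-based device
    // numbering), then launches its own OpenACC compute region on that
    // device for its slice of the array.
    #pragma omp parallel num_threads(ngpus)
    {
        int tid   = omp_get_thread_num();
        int chunk = n / ngpus;
        int lo    = tid * chunk;
        acc_set_device_num(tid, acc_device_nvidia);
        #pragma acc parallel loop copy(a[lo:chunk])
        for(int i = lo; i < lo + chunk; ++i)
            a[i] = 2.0 * a[i];
    }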

Does this make OpenMP less important?

Certainly not. OpenMP has been widely and successfully used for multiprocessor and multicore systems and nodes for shared memory parallelism, most typically with relatively small processor and core counts. It includes loop and functional parallelism, and has a wide range of synchronization constructs. It is supported on essentially all current multicore systems in the HPC space. The directives are explicitly designed such that a mechanical process can translate the directives and constructs into parallel code, with explicitly defined behavior.

OpenACC is designed for highly parallel computations. It focuses on loop parallelism and has few scalability-limiting synchronization constructs. The directives are designed such that the code can be efficiently compiled for a wide range of parallel systems with very different parallelism profiles. The two languages have different goals, and we expect they will be used together.

What about asynchronous execution on the multicore?

When targeting a GPU, the compute regions and data movement can be done asynchronously from host computation, allowing yet another level of parallelism. When targeting the multicore itself, there is no data movement, and the compute regions are executed on the same multicore as the host computation, so asynchronous execution doesn't make much sense.

Will you support the Knights Landing Intel Xeon Phi?

The PGI 15.10 release does not support Xeon Phi. We are planning to optimize the PGI compilers for manycore x86 processors when Knights Landing systems become available, including tuning our OpenACC and OpenMP implementations for those processors. One of the more interesting features of the Knights Landing is the exposed memory hierarchy, the near (high bandwidth) and far (system) memories. We will be studying how to use the OpenACC data constructs to manage data movement between those two memories.

Do I need a PGI Accelerator (GPU-enabled) license to use -ta=multicore?

This feature will work with any valid PGI license that includes access to the PGI 15.10 version of compilers.

How does this relate to -ta=host?

The PGI OpenACC compilers have always supported the -ta=host option, along with -ta=tesla,host (or -ta=radeon,host). The former option would generate single-threaded code for the OpenACC regions, essentially ignoring the directives. The latter options would generate code which would use a GPU when one is available, and would execute sequential code on the host when there is no GPU. These are still supported in this release in the same way as before. In the future, we may keep this feature, or roll it into the multicore target, depending on experience with OpenACC nested inside OpenMP parallel regions.

Will this work on Windows and OS X?

Yes, this feature will work on any supported PGI target, including targets for which no GPU target is supported.

Limitations in this Release?

There are a few limitations in the PGI 15.10 release, which we expect to remove in future releases. In the PGI 15.10 compilers the collapse clause is ignored, which means only the outer loop is parallelized. The worker level of parallelism is ignored; we are still investigating how best to generate parallel code that includes gang, worker and vector parallelism. The vector level of parallelism will eventually be used to generate SIMD code, but we are not doing so in this release. Also, no loop code optimization or tuning is done; for example, when compiling for a GPU, the compiler will reorder loops to optimize array strides for the parallelism profile, but none of these optimizations have been implemented for the multicore target in this release. Application performance should improve as these limitations are addressed.
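
For example (an illustrative fragment), with a doubly nested loop the 15.10 release spreads only the outer loop across the multicore threads, even when a collapse clause is present:

    #pragma acc parallel loop collapse(2)
    for(i=0; i<n; ++i)        // in 15.10, only this outer loop is parallelized
        for(j=0; j<m; ++j)    // the collapse(2) clause is ignored for -ta=multicore
            a[i][j] = b[i][j];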

The Future

As mentioned above, we plan to do further work on optimizing and tuning OpenACC multicore code generation and to study the interaction with OpenMP. As core counts increase, you might use OpenMP for coarse-grain parallelism, either across sockets or across partitions of the cores, and then use nested OpenACC compute constructs for fine-grain parallelism within a socket or a partition. And we will be looking at new targets, such as the Intel Knights Landing Xeon Phi and IBM OpenPOWER.

When will OpenACC and OpenMP merge?

The upcoming release of the OpenMP specification adopts many features from OpenACC, such as unstructured data lifetimes (enter data, exit data), asynchronous device computation (though OpenMP does this through task parallelism instead of async queues), and API routines for device data allocation and management. The implementation of multicore code for OpenACC seems to bring the two closer together. So, are there any fundamental differences between OpenMP and OpenACC that prevent an immediate merger?

There are two types of constructs in OpenACC and OpenMP for heterogeneous computation: data and compute. The data constructs manage data allocation in device memory, movement between system and device memory, and the correspondence of data between the two. While the spellings are different, and not all features in either language appear in the other, the two data models are essentially coherent. This means a single implementation could support both the OpenACC and the OpenMP data models, and OpenACC data directives could be ported to OpenMP, and vice versa, with little effort.
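
For example, here is a rough sketch of corresponding data constructs in the two languages (the particular clauses and arrays are illustrative):

    // OpenACC: copy b to device memory on entry, copy a back to the host on exit
    #pragma acc data copyin(b[0:n]) copyout(a[0:n])
    {
        #pragma acc parallel loop
        for(i=0; i<n; ++i) a[i] = 2.0*b[i];
    }

    // Corresponding OpenMP 4.x target data construct
    #pragma omp target data map(to: b[0:n]) map(from: a[0:n])
    {
        #pragma omp target teams distribute parallel for
        for(i=0; i<n; ++i) a[i] = 2.0*b[i];
    }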

The compute constructs differ more substantially. Both OpenACC and OpenMP support three levels of parallelism: gang, worker and vector in OpenACC, and team, thread and simd in OpenMP. OpenACC was designed with GPUs, multicore, manycore and other targets in mind, meaning the mapping of language parallelism to hardware parallelism is not strict. The PGI OpenACC implementation will map gang parallelism across threads on a multicore, and across thread blocks on an NVIDIA GPU. OpenMP distinguishes between host multicore parallelism (only thread and simd) and target device parallelism (team, thread and simd). OpenMP was originally designed with only thread parallelism, and there is little experience with team (or simd) parallelism. If you write your OpenMP program with only thread parallelism, which has many implicit barriers in addition to any synchronization you include, your program will not be able to take advantage of all the parallelism on a GPU. If you write your OpenMP program with team parallelism, it's not clear how an implementation will support that on a multicore system, particularly since team parallelism is only supported within a target region.
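
As a rough sketch of how the levels line up in source code (the clauses chosen here are illustrative; implementations are free to map them to hardware differently):

    // OpenACC: gang and vector parallelism; the compiler and runtime decide how
    // these map to the hardware (thread blocks and threads on a GPU, threads
    // and SIMD lanes on a multicore CPU).
    #pragma acc parallel loop gang vector
    for(i=0; i<n; ++i) a[i] = b[i] + c[i];

    // OpenMP: team, thread and simd parallelism; team parallelism is only
    // available inside a target region.
    #pragma omp target teams distribute parallel for simd
    for(i=0; i<n; ++i) a[i] = b[i] + c[i];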

We feel that parallelism within a socket and a node will increase rapidly over the coming years, so we need a fundamentally scalable, modern programming model. Different vendors will have different solution points, meaning support for different types of parallelism and tradeoffs between them. We think the OpenACC compute model is a step in the right direction for now and for the future. If and when OpenMP adopts such a compute model, the time may be right for the languages to merge.

Summary

The PGI 15.10 release allows you to generate multicore code from your OpenACC parallel programs. This can be used to write truly performance-portable parallel programs, without spending a lot of time recoding to target both GPU-accelerated and multicore systems. Use the -ta=multicore option to enable this feature, and we always recommend enabling -Minfo=accel for the compiler messages. We look forward to your feedback.
