Question from the May 2013 IEEE Webinar on Programming Heterogeneous x64+GPU Systems with OpenACC

Q If the compiler can determine which loops are parallelizable and vectorizable, how come it can't automatically offload them?

A This is a really good question. As I mentioned in the presentation, the real issues with accelerator programming are about the data, not so much the computing. Getting the data to the right place, organizing it in the right way, and accessing it with the right pattern takes some thought. There are some very nice "stunt" cases where automatic analysis can deliver good performance. When PGI started with the PGI Accelerator programming model, the precursor to OpenACC, we had intended to go down that route. However, the performance cliffs when the compiler makes a bad decision are very steep. For instance, is there enough work in this region of code to make it worthwhile to move the data to the accelerator? Will this data be used again on the accelerator? Can I manage two copies of the data, one on the host and one on the accelerator? This is a PhD-level research project, and I encourage work along those lines, but it's not ready for commercial implementation at this time.

Q Can I use OpenACC to parallelize programs for multi-core CPUs?

A The PGI compiler does not yet use the OpenACC directives to parallelize for multi-core CPUs. We intend to add this to our compilers, but have not announced a schedule for when it might be available.

Q Is OpenACC interoperable with OpenMP? With MPI?

A We have used PGI OpenACC with MPI and with OpenMP. We are still improving the interaction with OpenMP, and we are working on making it also work with explicit pthreads. The issues, as you might imagine, are defining how multiple threads (and processes) interact with the device, and making the generated code and runtime library thread-safe. These definitions were not in place in OpenACC 1.0. In OpenACC 2.0, the definitions are that multiple threads share the same accelerator context and data, but multiple processes do not. Our runtime library is mostly thread-safe, but as mentioned does not yet work with explicit pthreads.

Q Is OpenACC interoperable with CUDA? OpenCL?

A The PGI OpenACC is interoperable with CUDA, both CUDA Fortran and CUDA C. For CUDA C, that means that if you use OpenACC directives to allocate and move data to an NVIDIA GPU, you can get the device address and pass it to CUDA C kernels. Also, if you use the CUDA C API to allocate data on an NVIDIA GPU, you can use that pointer in OpenACC compute regions with the deviceptr() clause. For CUDA Fortran, which is a PGI product, you can use device arrays directly in OpenACC regions. If you use OpenACC data constructs to move data to an NVIDIA GPU, you can pass those arrays directly to CUDA Fortran kernels and the compiler will pass the device address directly. You can also call CUDA Fortran device subroutines and functions from within OpenACC loops. OpenACC 2.0 defines some additional interoperability between OpenACC programs and CUDA, and similarly for OpenCL. PGI doesn't use OpenCL as a target for NVIDIA GPUs, so that question doesn't apply for PGI. I believe CAPS Entreprise does use OpenCL for one or more accelerator targets; you would have to ask them about interoperability with their OpenACC.
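
To make the two directions concrete, here is a minimal sketch in C; the arrays and the launch_saxpy_cuda routine are hypothetical, not from the webinar:

    // Direction 1: memory allocated with the CUDA C API, used inside an
    // OpenACC compute region via the deviceptr() clause.
    float *d_x;
    cudaMalloc( (void**)&d_x, n*sizeof(float) );
    #pragma acc parallel loop deviceptr(d_x)
    for( int i = 0; i < n; ++i )
        d_x[i] *= 2.0f;

    // Direction 2: OpenACC-managed data handed to a CUDA C routine; the
    // host_data construct exposes the device addresses inside its region.
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        #pragma acc host_data use_device(x, y)
        launch_saxpy_cuda( x, y, n );    // receives the device addresses of x and y
    }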

Q Does this replace CUDA Fortran?

A No, they are essentially different models. CUDA Fortran, like CUDA C, is essentially a lower-level programming model, giving the programmer more control, and hence more responsibility, for exploiting features of the GPU to get performance. As mentioned above, OpenACC interoperates well with CUDA Fortran. All that being said, we believe that even when CUDA (Fortran or C) is needed, the bulk of the program can be written with directives, leaving only the most important kernels to be manually optimized with CUDA. Here, CUDA serves essentially the same purpose as assembly language does for host computing.

Q What is the performance penalty for using OpenACC vs CUDA/OpenCL?

A There certainly is some performance penalty, though for algorithms where the natural implementation in loops is essentially the same as what you would implement in CUDA or OpenCL, we find that it is minimal. We are continually working to improve the quality of the generated code. The biggest performance differences come when you, the clever CUDA or OpenCL programmer, take advantage of information that you know but that the compiler can't. For example, you might know that the loop trip count is always a multiple of 32 or 64, so there is no need to check whether to do a small remainder loop. Another example is taking advantage of some hardware feature, like texture fetches or constant memory, that is very hard for a compiler to utilize. I'm sorry that I can't quantify the difference more specifically, though.

Q How do I determine if my application is appropriate for GPU acceleration?

A The most appropriate programs are loops with high trip counts, or nested loops, with regular array accesses, especially where the stride between array fetches for one of the loops is 1, meaning adjacent elements are referenced. Programs with "while" loops, linked lists, or deep recursion are unlikely to work well on a GPU. You can start by doing a performance profile and seeing whether there are any juicy spikes in the nested loops.
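
For instance, a loop nest shaped roughly like the following sketch (illustrative names, not from the webinar) is a good candidate: high trip counts, regular accesses, and stride-1 references in the inner loop.

    for( int j = 0; j < m; ++j )
        for( int i = 0; i < n; ++i )            // i is the stride-1 index:
            b[j*n + i] = 2.0f * a[j*n + i];     // adjacent iterations touch adjacent elements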

Q Can I run my OpenACC program on a machine that doesn't have an accelerator on it?

A PGI supports an option (enabled by default) that generates both accelerator code and host code for the OpenACC loops. The explicit option is "-ta=nvidia,host", which I mentioned very briefly in the webinar. The generated code will test whether there is an appropriate NVIDIA GPU on the machine where the program is running. If there is one, the program will use it. If there is not, the program will execute the host versions. The program can also choose dynamically whether to execute OpenACC regions on the GPU or on the host, or even select which GPU to use, using runtime API calls. At this time, the PGI compiler does not use the OpenACC directives to parallelize the code for the multi-core host.
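
For example, a program can make that choice itself with the OpenACC runtime API; here is a minimal sketch (the pick_device function name is just for illustration):

    #include <openacc.h>

    void pick_device( void )
    {
        if( acc_get_num_devices( acc_device_nvidia ) > 0 )
            acc_set_device_type( acc_device_nvidia );   /* run OpenACC regions on the GPU */
        else
            acc_set_device_type( acc_device_host );     /* fall back to the host versions */
    }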

Q Which accelerators can be targeted by PGI Accelerator compilers?

A Today, the PGI Accelerator compilers target CUDA-capable NVIDIA GPUs. We have announced that we will soon have a closed beta test program, followed by an open beta program, for PGI compilers targeting AMD GPUs.

Q Which compilers support OpenACC? For which accelerators?

A The PGI C, C++ and Fortran compilers all support OpenACC. Other vendors supporting OpenACC include Cray and CAPS Entreprise. There are several open source implementations in the works as well, at least one of which has a partial implementation.

Q Which programming languages does PGI support with OpenACC?

A Fortran, C, C++.

Q Does OpenACC support two or more accelerators on a system?

A Yes. I don't know whether the other OpenACC vendors support this, but PGI does. Using the OpenACC API call acc_set_device_num( n, acc_device_nvidia ), you can dynamically change which device to use. Use this with caution, however; the data is not automatically moved from one device to the next. As I mentioned in the webinar, accelerator programming is more about the data than the compute, and adding another accelerator makes this even more apparent.
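
As a rough sketch of what that looks like (the even split and the array x are illustrative; device numbering is assumed to start at 0, and a real code would add async clauses to keep all the devices busy at once):

    int ndev  = acc_get_num_devices( acc_device_nvidia );
    int chunk = n / ndev;
    for( int d = 0; d < ndev; ++d ){
        acc_set_device_num( d, acc_device_nvidia );    /* later regions use device d       */
        int lo = d*chunk;
        int hi = (d == ndev-1) ? n : lo + chunk;
        #pragma acc parallel loop copy(x[lo:hi-lo])    /* each device moves only its slice */
        for( int i = lo; i < hi; ++i )
            x[i] += 1.0f;
    }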

Q Do I have to rebuild my application for each different accelerator?

A No. When you build with pgcc -acc, the default target accelerator option is ‑ta=nvidia,host. To be more specific, it's really ‑ta=nvidia:cc1x,cc2x,cc3x,host, meaning it builds a version for NVIDIA Tesla, Fermi and Kepler GPUs, as well as one for the host CPU. When we add support for AMD, you will have the option to select NVIDIA GPUs, AMD GPUs, or both built into the same binary. That will allow you to build one binary that will run on whichever device is available on the system where the program is run, or to run on the host if there is no accelerator available. A related question is whether you can tune your program in a target-device-specific way. OpenACC 2.0 has device-specific tuning options, but that's beyond the scope of this discussion.

Q How can I try it?

A Register at the PGI website, download a copy, and get a trial license.

Q Can you explain the difference between NVIDIA and ATI GPUs again?

A NVIDIA has had three generations of CUDA-capable GPUs: Tesla (compute capability 1.x), Fermi (compute capability 2.0) and Kepler (compute capability 3.0 and 3.5). NVIDIA GPUs are organized as an array of streaming multiprocessors, called SMs or, for Kepler, SMXs. Each SM has one or more arrays of what NVIDIA calls CUDA cores; in the diagram in the webinar, this array is the SIMD/SIMT unit, where each element of that unit is roughly a CUDA core. On a Kepler, for instance, there are 12 SIMD units in each SMX, and up to 15 SMX units on the chip. You could choose to call this 15 very capable cores or 180 SIMD units. NVIDIA likes to total up the CUDA cores, which in this case gives 2880 CUDA cores. The architectural model, however, can avoid those details entirely, focusing on the two levels of parallelism: MIMD or multi-core parallelism across the PEs, and SIMD or vector parallelism within a PE.

AMD has also had several generations of GPUs. The latest generation is what they call GCN (Graphics Core Next) or Southern Islands GPUs, such as their latest Radeon 7970 graphics cards. Each of these GPUs has 32 compute units, and each compute unit has 4 vector units. The vector units each have 16 GPU cores, so a high-end Radeon has 2048 GPU cores. The architectural model again avoids the details, and focuses on the two levels of parallelism.

Q We have code with OpenMP pragmas. Is it viable to replace them with OpenACC?

A Abstractly, yes, but there are two caveats. First, programs that work well on multi-core CPUs with OpenMP may not work well on a GPU. GPUs require much more parallelism (hundreds or thousands of parallel iterations), rather than the 4-way or 8-way parallelism that makes efficient use of a multi-core. GPUs are also much more sensitive to the data layout in memory. Second, you have to take into account the cost of moving the data from host memory to device memory and back. If you simply add an OpenACC parallel directive wherever you had an OpenMP parallel directive, I can almost guarantee you will get very poor performance, because, in the best case, data will be moved back and forth for each loop, dominating your performance. The fun part of OpenACC is playing with the parallel loops and mapping program parallelism to the device parallelism, but the fact is that the first and most important performance bottleneck is managing the memory traffic between host and device memories.
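
The usual first step is to wrap a group of loops in a data region so the arrays move once, rather than around every loop. A minimal sketch, with illustrative arrays:

    #pragma acc data copyin(a[0:n]) copy(b[0:n])   // move the data once for the whole region
    {
        #pragma acc parallel loop
        for( int i = 0; i < n; ++i )
            b[i] += a[i];

        #pragma acc parallel loop
        for( int i = 0; i < n; ++i )
            b[i] *= 2.0f;
    }                                              // b is copied back here, once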

Q Can you please elaborate on the -fast flag? What exactly are being done?

A The ‑fast flag is used by the PGI compilers (and others) to enable what we think are the most common optimizations appropriate for the host CPU. I typed pgcc ‑help ‑fast to get the definition on the host, and got:

-fast   Common optimizations; includes -O2 ‑Munroll=c:1 ‑Mnoframe ‑Mlre
        ==  ‑Mvect=sse ‑Mcache_align ‑Mflushz ‑Mpre

Q Was fast-math used in your n-body GPU test? If so, n-body is sensitive to precision. Is one really comparing apples to apples?

A Very good question. GPUs implement a low-precision math library, particularly for transcendental functions like sin, cos and sqrt, and sometimes even for floating point divide. The low precision is good enough for graphics operations but not for most compute applications. The default for the PGI compilers is to use full-precision implementations. You can enable the low-precision versions, what NVIDIA calls the fastmath option, with a suboption of the target-accelerator (‑ta) flag.

Q Can the GPU evaluate intrinsic functions e.g. SIN, COS, SQRT with the same precision as the CPU?

A Yes, see above.

Q Can you please show more CUDA Fortran examples?

A This was an OpenACC webinar; we have some CUDA Fortran material available in our PGInsider newsletter and elsewhere on our website. <***>

Q Why doesn't the webinar program work with Linux?

A We're sorry about that. The webinar was run through IEEE, and the webinar controls were from a third party. We don't really have control over that.

Q Can you explain the difference between OpenMP and OpenACC?

A OpenMP is the dominant method for programming shared-memory multi-core and multi-processor systems in technical computing. Its target is homogeneous cores with uniform shared memory, and it handles that quite well. We tried to model OpenACC closely on OpenMP, though there are important differences between the targets. OpenACC targets a host plus accelerator system, where the accelerator may have its own device memory. Therefore, OpenACC must manage data movement as well as parallelism. Moreover, OpenACC targets typically have multiple levels of parallelism, and the program must manage this as well.

Q What is the difference between openACC and OpenMP and MPI?

A MPI is the technical computing standard for building programs across a cluster or supercomputer network of nodes. Programs either use MPI directly, or in many cases, use an application platform that interacts with MPI, relieving the programmer from the details of the MPI calls. The MPI model is many copies of the same program executing in parallel across a cluster or supercomputer, interacting and cooperating on a single problem. The programming language knows nothing about the parallelism, however; the MPI parallelism is all hidden in library calls. OpenMP is the technical computing standard for building shared-memory multi-core or multi-processor programs. Many programs use MPI+OpenMP, with MPI between nodes of the cluster and OpenMP on each shared-memory multi-core node. OpenMP is implemented with directives and some API library routines, and is widely available from essentially every vendor in the technical computing space. OpenACC targets a host plus accelerator system, where the accelerator may have its own device memory, as mentioned above.

Q Will OpenACC 2.0 handle atomic operations for doubles?

A Yes, as long as the target system supports double precision atomics natively, or at least 64-bit compare-and-swap.

Q Does [0:n] vs [n:0] reflect column data fetch vs row fetch?

A The data construct I showed in the webinar for C was #pragma acc data copy( force[0:n] )... In C and C++, this means start at force[0] and continue for 'n' elements. The notation "[n:0]" makes no sense; it would be a length-zero vector starting at element 'n'. In Fortran, OpenACC just uses Fortran array notation.
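
For reference, a couple of valid C/C++ forms (a sketch; the bounds lo and len are illustrative); the subarray is always [start:length]:

    #pragma acc data copy(force[0:n])      /* elements force[0] through force[n-1] */
    {
        /* ... compute regions using force ... */
    }
    #pragma acc data copy(force[lo:len])   /* 'len' elements starting at force[lo] */
    {
        /* ... */
    }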

Q Can we get a copy of the slides?

A Yes. <***>

Q Can you show the slide of OpenACC 1.0 features again?

A The slides are available; see the question above.

Q Will you send the slides and the source codes to participants?

A The slides are available, see above. We have permission from the author, so the source code is available as well.

Q How much should a beginner know about CUDA before learning a high-level language like OpenACC?

A A beginner should know something about parallel programming and a little about computer architecture. It helps to have a little knowledge of GPU programming, but in fact most of our users have not used CUDA before starting with OpenACC.

Q When will the OpenACC 2.0 be available for Fortran?

A We will be adding OpenACC 2.0 features to the releases as they become available. We hope to be substantially done by the end of the year.

Q Intel suggests using OpenMP for the Phi. What are your thoughts on this vs. OpenACC?

A OpenMP is perfect for the Intel Xeon Phi coprocessor (IXPC) if you are programming natively. The IXPC architecture looks very much like our architectural model, and you need both multi-core and vector parallelism to get reasonable performance. It also supports 4-way multithreading, and it's important to take advantage of that to tolerate cache misses, since, like a GPU, the IXPC has a simplified control unit. OpenMP 4.0 will be adding target directives for offloading regions of code to an attached processor. We believe this is also an appropriate model for the IXPC, where the target processor is really an OpenMP engine. However, we believe the OpenMP target directives cannot be implemented efficiently on all the accelerators of interest. We believe OpenACC can be implemented efficiently and can deliver performance portability across all currently available systems and those in the future that we know about.
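
For comparison, an OpenMP 4.0 offload region looks roughly like the following sketch (the loop is illustrative); the map clauses play the role of OpenACC's data clauses:

    #pragma omp target map(tofrom: x[0:n])      // offload to the attached device
    #pragma omp teams distribute parallel for   // spread iterations across the device
    for( int i = 0; i < n; ++i )
        x[i] *= 2.0f;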

Q Can OpenACC be used with Intel Xeon Phi coprocessor? If so, is there an advantage over Intel compiler?

A PGI had a small proof-of-concept demonstration of OpenACC on the IXPC at Supercomputing 2012 last fall in Salt Lake City. We are now focusing on finishing the AMD GPU targeting and have deferred the IXPC target until next year at the earliest.

Q Support for Intel MIC?

A See above.

Q Does the runtime involve user/kernel switching?

A No. The OpenACC runtime is based on the CUDA runtime for NVIDIA GPUs.

Q Are drivers required?

A The NVIDIA CUDA drivers are required, and must be downloaded from the NVIDIA website.

Q What does an "Atomic operation" mean in slide "OpenACC 2.0"?

A An atomic operation is something like:

    #pragma acc parallel loop
    for( int i = 0; i < n; ++i ){
        int h = a[i] % x;        // compute histogram box
        #pragma acc atomic update
        ++histo[h];              // accumulate atomically
    }

The histo[h] update is, strictly speaking, a dependence across iterations of the 'i' loop. However, we don't really care about the order of the updates, only that they all get applied in some order. One way to do this would be to have a critical region around the update, but that requires global synchronization across the whole machine, which is a pretty big hammer for a much simpler problem. Most processors, CPUs and GPUs, support atomic operations like this, which allow parallel updates to proceed correctly even when multiple threads are updating the same value at the same time. It does require special instructions, so the compiler needs to know to generate an atomic operation.

Q What is the difference between 'gang' and 'worker'? OpenACC doesn't seem to define those terms.

A Good question from someone who obviously has read the Specification. The execution model is of a number of PEs, where each PE has vector instructions and uses multithreading to execute multiple vector threads. Roughly, gangs are spread across the PEs (NVIDIA SMs), workers are spread across the multithreading operations (NVIDIA warps), and vectors are spread across the vector lanes (NVIDIA CUDA cores within a warp). That's a rough model, not a direct model, and I know the compiler will sometimes remap multiple gangs to a single PE and so on. We have not seen a good case for using all three, gang, worker and vector. Our examples all use gang and vector, or gang and worker. In the latter case, the workers are the threads that can share state on a single PE, and the gangs are the gangs of workers that are spread across PEs.
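
A typical gang-plus-vector mapping looks like this sketch (the loops are illustrative): the outer loop is spread across gangs, the inner loop across vector lanes.

    #pragma acc parallel loop gang       // j iterations spread across the PEs (SMs)
    for( int j = 0; j < m; ++j ){
        #pragma acc loop vector          // i iterations spread across the vector lanes
        for( int i = 0; i < n; ++i )
            c[j*n + i] = a[j*n + i] + b[j*n + i];
    }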

Q If OpenACC works on C/C++, it should work on PGI Fortran. Fortran converts math equations into C, saves lots of programmer time and eliminates problems in math programming.

A OpenACC does work with PGI Fortran, and I hope I made that clear. Fortran does not translate math equations into C.

Q How much control over GPU memory management does a user have when using OpenACC?

A How much control do you want? With the dynamic data lifetimes in OpenACC 2.0, you can determine exactly when in the program to allocate data on the device, when to copy data to the device, when to copy results back, and when to deallocate. The OpenACC model is that device memory is a copy of the data on the host, meaning it's the same layout and referenced with the same name.
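
In OpenACC 2.0 terms, that control uses the unstructured data directives; a minimal sketch with an illustrative array:

    #pragma acc enter data copyin(x[0:n])    // allocate on the device and copy in, right here
    /* ... any number of compute regions and procedure calls that use x ... */
    #pragma acc exit data copyout(x[0:n])    // copy the results back and deallocate, right here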

Q Is the C++ support comparable to C++AMP?

A Microsoft C++AMP is a templated class library that supports execution on accelerators, such as GPUs. It manages the memory, overloads all the operations, and with a clever compiler, runtime and class library, generates code for your operations on the accelerator. However, it essentially requires a complete review of your program to use the C++AMP datatypes to get the data management. OpenACC is a different model, focusing on the parallel loops, where the accelerators get their performance advantage. It uses the same dataspace as the original program. I suspect one could implement C++AMP using OpenACC loops in the class library.

Q When will The Portland Group support an OpenACC debugger?

A This has been a source of continuing embarrassment for PGI for some time. What's been missing is a way to generate the DWARF information for the generated code; DWARF is the debugging information the compiler produces for the debugger. I'm pleased to say that we are hard at work on converting our back end to use the new LLVM-based CUDA back-end library, and as soon as that is complete we will start on the DWARF generation.

Q Do you recommend a specific IDE for tracing and debugging OpenMP and OpenACC code (e.g., Eclipse)? Any others?

A We have used Eclipse internally, as well as PGI Visual Fortran (based on the Microsoft Visual C++ IDE). We don't have any specific support for Eclipse or other IDEs, however. Also, see the earlier answer about debugging.

Q Running the OpenACC-generated app would likely interfere with a GUI running on the same hardware, wouldn't it? How does one execute the program?

A The GPU is not virtualized, and in particular its memory is physical, not virtual. So, yes, if there are two applications running on the GPU, they can interfere with each other. This can affect the performance of either or both, and can cause one of them to fail if the other uses all or most of the memory. If your GPU is also running your graphics display, that appears as another application, and the OpenACC program can affect the display (and vice versa). Some operating systems will impose a time limit for a compute kernel on a GPU that is running a display, to avoid the display freezing for too long (5 seconds or so).

Q What are the innovations with the Intel Phi Processor? To what extent does today's outline apply there?

A The Intel Xeon Phi coprocessor (IXPC) is a highly parallel multi-core processor. It has about 60 cores that implement much of the x86 instruction set, and each core also has a very capable vector unit. If you look at the abstract architecture slide in the webinar, the IXPC fits it very directly: each PE corresponds to an IXPC core, and the SIMD unit corresponds to the IXPC vector unit. Like a GPU, the IXPC core has a relatively slow clock, a shallower pipeline, narrow superscalar instruction issue, smaller caches, and in-order instruction execution, allowing a simpler control unit relative to an Intel Xeon core. Also like a GPU, the IXPC has wider SIMD instructions, more cores, and more multithreading than a Xeon core. I believe the architectural model fits directly, and OpenACC programs would be a good fit.

Q Is this compiler suite only good for parallelized environments or does the compiler suite stand on its own for more traditional type programming needs?

A The PGI compilers compete very well against the best available X86 compilers for C, C++ and Fortran on Intel and AMD processors.

Q Which GPUs do you support?

A Today we support CUDA-capable NVIDIA GPUs. We are planning to also support AMD GPUs by the end of the year. Follow our website and announcements for more detailed information.

Q How portable is the runtime to embedded OS's?

A I'm not sure how to answer this. We haven't ever looked at porting to an embedded or mobile OS.

Q What would you point out as the main advantage over Matlab GPU support?

A I'm not familiar with Matlab GPU support. Certainly OpenACC doesn't support Matlab code, and we don't pretend to. I hesitate to make any claims of specific superiority without a better understanding of what Matlab GPU support allows. Not a very satisfactory answer, I'm sorry.

Q You mentioned you're a fan of CUDA. What are your thoughts on OpenCL compared to OpenACC? Obviously a different abstraction...

A I like CUDA (and OpenCL, though not as much) for what it's designed for, meaning a low-level programming model that you use when you need that low-level control. OpenCL is more like CUDA than like OpenACC. OpenCL allows you to write device kernels, which correspond pretty directly to CUDA kernels. It also lets you write the host program to control device memory allocation (OpenCL buffers), data movement, and launching kernels on the GPU. The big difference between CUDA and OpenCL is that CUDA requires a compiler for the host code, whereas OpenCL is implemented as a library for the host, and the library includes a compiler for the device. The CUDA compiler allows CUDA to have a higher-level programming model. For instance, a CUDA kernel launch looks like:

  foo<<< blocks, threads >>>(arg, arg2, arg3, n);

whereas the corresponding OpenCL kernel launch is:

  clSetKernelArg( func, 0, sizeof(arg), &arg );
  clSetKernelArg( func, 1, sizeof(arg2), &arg2 );
  clSetKernelArg( func, 2, sizeof(arg3), &arg3 );
  clSetKernelArg( func, 3, sizeof(n), &n );
  size_t b[1], t[1];
  b[0] = blocks*threads;
  t[0] = threads;
  clEnqueueNDRangeKernel( cmdqueue, func, 1, NULL, b, t, 0, NULL, NULL );

Even ignoring the NULL and zero arguments, that's a lot of code to write and maintain for what is really just a procedure call.

Q What does OpenACC do if your host code uses "double" variables but your GPU does not support it?

A If you have C double and are targeting something like an NVIDIA Tesla with compute capability 1.1, such as the one in my older Apple MacBook, the compiler will fail to generate GPU code at all. If you tell it to generate code for compute capability 1.3, which does have double precision, and then try to run that on the older device, you will get a runtime error that the device does not support this program. The compiler makes no attempt to simulate double precision on devices that don't support it.

Q Any thoughts on a low power flag?

A This is another PhD-level research project. I have yet to see any convincing demonstration of a compiler that generates different code to lower the energy usage of a program. There are some research projects out there, but I don't think they are ready for commercial use.

Q Excellent presentation, would love lots more.

A We have more information available on our website. Later this year, we will have another IEEE webinar with more technical details of OpenACC programming.

Q What options are available for mixing host and accelerator computations? Can a single OpenACC kernel be spread between them?

A We don't have any automatic mechanism to take, say, a single loop and put 1/3 of the iterations on the host and 2/3 on the GPU. The problem, as hinted at, is who manages the data distribution between the two memories. We have had some stunt cases where we take a single loop, compile it with "-ta=nvidia,host" to generate both GPU and host code, move some of the data to the GPU, then using asynchronous computation, launch the GPU code for 80% of the work, and then set the device to the host and execute the remaining 20% on the host while the GPU is at work. We are working with one of our premier customers to help port a large application where we use asynchronous computations on the GPU while the host processor is working on the setup for the next stage of the computation, though that is really not quite what you are asking.
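
To give a flavor of the by-hand version, here is a minimal sketch; the 80/20 split, the arrays, and the single async queue are illustrative and gloss over the data bookkeeping described above:

    int cut = (int)(0.8 * n);                          // roughly 80% of the work for the GPU
    #pragma acc data copyin(a[0:cut]) copyout(b[0:cut])
    {
        #pragma acc parallel loop async(1)             // GPU works on [0, cut) ...
        for( int i = 0; i < cut; ++i )
            b[i] = 2.0f * a[i];

        for( int i = cut; i < n; ++i )                 // ... while the host does [cut, n)
            b[i] = 2.0f * a[i];

        #pragma acc wait(1)                            // join before the data region ends
    }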

Q Can you elaborate on what performance portability means? Do you mean sustaining equivalent performance across different generations/vendors of GPUs?

A Yes, except not exactly equivalent performance, but rather the best performance on each device or generation of device. Cray gave a great presentation at GTC 2013 this year about porting a code to both CUDA and OpenACC. They developed the code using NVIDIA Fermi GPUs. When the real Titan machine was installed at Oak Ridge National Lab with Kepler GPUs, the OpenACC program ran much faster with no source code changes. The CUDA code needed to be retuned, because that program contains all the performance logic, such as the thread block size and shape, the size of the grid, and so forth. With OpenACC, the compiler and runtime can make those decisions based on the target.
