Questions from the Dec 2013 IEEE Webinar on Running OpenACC Programs on NVIDIA and AMD GPUs


Q Can OpenACC be used with x64 applications?

A Support for targeting x64 with OpenACC is planned for later in 2014.

Q Can you kindly provide the intermediate code after the compiler transformations are applied? I want to see the transformed source code.

A As far as I know, PGI applies various compiler transformations to the source code before translating it into parallel form.

The compiler does various transformations (and even more are planned for implementation in the coming year), but they are not done as source-to-source transformations, so generating transformed source code is not possible.

Q I am primarily interested in CPU/GPU latency issues as they impact the operation of large numbers of small, potentially branch-independent kernels.

A There are several issues with CPU/GPU latency, and I'm not sure I fully understand your point. Certainly the latency of transferring data from CPU memory to GPU memory is the number one bottleneck to performance on accelerated systems, and OpenACC provides several ways to optimize this. There is also overhead just for launching a kernel from the CPU, which is hard to avoid. If you have many small kernels, you might want to run them all asynchronously. On a device like Kepler, the GPU can run many independent kernels at the same time, if they are all on distinct async queues, until all the device resources are utilized.
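
For example, here is a minimal sketch (not from the webinar; the array names and kernel body are hypothetical) of launching many small, independent kernels on distinct async queues so the device can overlap them:

    #define NKERNELS 16
    #define N 1024

    void run_small_kernels(float *a[NKERNELS])
    {
        for (int k = 0; k < NKERNELS; ++k) {
            float *ak = a[k];   /* hoist to a simple pointer for the data clause */
            /* Each launch goes on its own async queue (k), so independent
               kernels can execute concurrently until device resources fill up. */
            #pragma acc parallel loop async(k) copy(ak[0:N])
            for (int i = 0; i < N; ++i)
                ak[i] = ak[i] * 2.0f + 1.0f;
        }
        #pragma acc wait   /* block until every queue has finished */
    }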

Q Are any GPU architectures suitable for this type of operation, and does OpenACC provide tools for managing/profiling latency?

A Again, I'm not sure I understand the full question. OpenACC does provide data constructs to explicitly manage the latency of moving data from CPU to GPU and back, and it provides support for asynchronous operations, including data movement as well as computation on the device. Asynchronous operations don't really reduce latency; they overlap it with other work. OpenACC does not have a profiler, but the OpenACC vendors do. PGI is leading an effort within the OpenACC committee to define a standard profiler interface that will also allow third-party and open-source profiling and tracing tools to collect information about device execution and overheads.
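
As an illustration of those data constructs, here is a hypothetical sketch of a data region that keeps arrays resident on the device across many kernel launches, so the transfer latency is paid once rather than around every loop:

    void relax(float *restrict x, float *restrict y, int n, int steps)
    {
        /* x and y are moved only at the boundaries of the data region. */
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            for (int t = 0; t < steps; ++t) {
                #pragma acc parallel loop
                for (int i = 0; i < n; ++i)
                    y[i] += 0.5f * x[i];
            }
        }   /* y is copied back to the host here, once */
    }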

Q The newer AMD APUs purport to use a unified address space - can we expect this change to dramatically reduce CPU/GPU latency?

A This is going to be a long and potentially boring answer. AMD has announced plans to make APUs available with a unified physical as well as virtual memory. The AMD APU includes both a multicore CPU and an integrated GPU (iGPU). This is intended to replace the motherboard-mounted GPUs that we used to see in laptops and low-end desktops. Intel offers something similar in its latest desktop processors, but the AMD APU is moving towards deeper integration.

The current AMD APUs use the same physical system memory for the CPU and iGPU, but the OS reserves part of the physical memory for use by the GPU. In that case, data still has to be copied from the CPU memory space to the GPU memory space, but those copies happen at CPU memory speeds, much faster than PCI Express speeds. The newer APUs, when they become available, will allow the iGPU to access CPU virtual memory, and will even be cache-coherent with the CPU, according to the presentations I've seen. This will mean that for an API like OpenACC, data need not be copied at all between CPU and GPU, so the data latency will essentially go to zero.

The downside is that the host memory is optimized for latency, not bandwidth. Current discrete GPUs (dGPUs) have a high-bandwidth interface to a wide memory, essentially loading a whole cache line at a time. CPU memories simply don't have that kind of bandwidth. On the other hand, the iGPU doesn't have the same high performance as a dGPU, if only because so much of the chip real estate is devoted to the CPU, caches, and so on, so perhaps it won't need quite as much bandwidth as a dGPU.

There are at least two other avenues that system architects can explore. One is exemplified by the original Convey Hybrid Core system. It had an Intel CPU and an accelerator (a customizable vector engine implemented in FPGAs, though that doesn't matter here). The system came with two physical memories: a classical host memory using normal DDR chips, connected to the CPU memory controller, and a high-bandwidth memory connected to the accelerator. The two memories were mapped into a single virtual address space, so the CPU could access data in the high-bandwidth memory and the accelerator could access data in the CPU memory, but in each case there was a performance penalty for doing so. If the data was in the right place, performance was optimized. In that case, it's less an issue of copying data than of placing data in the right sub-memory. For Convey, that was done by the user allocating with a different malloc routine. One could imagine a system like an APU with two memory interfaces: a classical deep cache hierarchy connected to a CPU memory, and a high-bandwidth interface to an accelerator-optimized memory.

The second avenue is to use some new high-bandwidth memory, something like the Micron Hybrid Memory Cube technology, or memory stacked on the processor chip itself using TSVs (through-silicon vias). These are really new technologies, and so are likely prohibitively expensive to use as system memory today, but they are technologies we should be watching over the next few years.

Q What happens when n in copyin(in[0:n]) is bigger than the device memory?

A It fails. The compiler and runtime do not even try to do paging, swapping, or pipelining of data and communication.
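
One workaround, sketched below with hypothetical names, is to block the array yourself so that only one chunk is resident on the device at a time (this assumes the per-element work is independent):

    void process_in_chunks(const float *in, float *out, size_t n, size_t chunk)
    {
        for (size_t start = 0; start < n; start += chunk) {
            int len = (int)((n - start < chunk) ? (n - start) : chunk);
            const float *in_c  = in  + start;
            float       *out_c = out + start;
            /* Only this chunk of in and out is on the device at a time. */
            #pragma acc parallel loop copyin(in_c[0:len]) copyout(out_c[0:len])
            for (int i = 0; i < len; ++i)
                out_c[i] = in_c[i] * in_c[i];
        }
    }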

Q Does this require a return to stream/array-based optimization (feeding the pipeline, as on Crays) versus cache-based optimization (completing work on any single data element)? Many years ago, we swapped inner & outer loops, depending on architecture.

A This is a fascinating and very current topic of discussion. We realize that the model of the target machine influences the programs we write. In vector days (Cray), we wrote inner loops without conditionals, stride-1 operations, etc. For multiprocessors, we wanted those parallel loops to be outermost. For caches, we again want low strides in the inner loops. At PGI, we are looking at what technology we need in the compiler to allow a program written in a CPU-optimized manner to run efficiently on a machine like a GPU, which has a different execution model, and similarly what technology we need to allow a program written in a GPU-optimized manner to run efficiently on a multicore CPU with big cache memories. We believe we understand most of the problems and will be working on this over the coming year or two.
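
As a concrete (hypothetical) illustration of the loop-ordering issue, the stencil below keeps the stride-1 j loop innermost, which is what a cached CPU wants, and uses OpenACC loop directives to tell the compiler how to map the same nest onto gangs and vector lanes on a GPU:

    void smooth(float *restrict a, const float *restrict b, int n, int m)
    {
        #pragma acc parallel loop gang copyin(b[0:n*m]) copy(a[0:n*m])
        for (int i = 1; i < n - 1; ++i) {
            #pragma acc loop vector   /* stride-1 loop runs in the vector lanes */
            for (int j = 1; j < m - 1; ++j)
                a[i*m + j] = 0.25f * (b[(i-1)*m + j] + b[(i+1)*m + j]
                                    + b[i*m + j - 1] + b[i*m + j + 1]);
        }
    }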

Q Would run-time selection of HW type be possible via shared lib selection or other methods?

A We actually compile the multiple versions into a single binary. We know of other platforms that use shared objects, dynamically selecting the right shared object to load depending on the available hardware. This works, and has some advantages, but for people who have to deliver software it is a packaging problem that we'd like to avoid.

Q Is there any way I can speed up MATLAB with NVIDIA?

A I'm probably the wrong person to ask. I believe a quick search for 'cuda matlab' would address your question.

Q Does PGI OpenACC support Intel MIC?

A Today, PGI does not support the Intel Xeon Phi Coprocessor (IXPC). We demonstrated OpenACC on a Xeon Phi at Supercomputing 2012 in Salt Lake City, but that was not a product-level compiler. We've spent the last year working on other features and the AMD target. Our announced plans are to add IXPC support in 2014.

Q Is this code available for download?

A Yes and no. The SWIM example is copyright SPEC so we can't distribute that to you unless you're a member. The other two examples are available from this website.

Q If you get a wrong result in one of these parallel loops, how do you debug it?

A This is a tough question. There are several sources of errors, including normal coding errors, data errors, and coherence errors. Today, we have no way to debug a running program, since the compilers we have don't generate DWARF for the debugger. We are working very hard on this, and plan to deliver it later in 2014. Data errors are things like moving the wrong array, or moving the wrong part of an array, to the device. Finding such errors can be difficult, and we are looking into ways to help find these. Coherence errors arise when you have updated data on the GPU (or CPU) that needs to be copied to the CPU (or GPU). Again, we are looking into ways to help find such errors, and here we think we have at least one idea. In a debugging mode, we could have the runtime always update the device from the host before each kernel, and update the host from the device data after each kernel, so the two memories would always be coherent. If the program works in the coherent (but slow) mode, and not in normal mode, then there is probably a coherence error.
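
Until such a debugging mode exists, you can approximate it by hand with update directives around a suspect kernel; a hypothetical sketch:

    void step(float *restrict u, int n)
    {
        #pragma acc data copy(u[0:n])
        {
            #pragma acc update device(u[0:n])   /* force host -> device before the kernel */
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                u[i] = 2.0f * u[i];
            #pragma acc update host(u[0:n])     /* force device -> host after the kernel */
        }
    }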

Q Python (+NumPy) is higher level than Fortran.

A I'm sure you are right. I haven't looked at NumPy. I can't say I use Fortran all that much, but for its original intended target audience, it's pretty good. It's tough for me to hear a bunch of C programmers denigrate the original high level language as old and tired, when the current Fortran language is so rich.

Q Is there some way to tell how much video performance is being degraded by timesharing the card? (I'm thinking of implementing some kind of dynamic load balancing)

A There may be, but I don't know what it is. Some operating systems (Windows, I believe) will forcibly kill a compute kernel on the GPU being used for graphics if it takes too long (5 seconds, I think).

Q What is the relationship between OpenACC and OpenCL?

A The same as the relationship between C and assembler. OpenACC is a higher level language than OpenCL, in the sense that it virtualizes and hides more of the details. The PGI OpenACC compiler for AMD Radeon uses OpenCL as the target interface.

Q Are multiple targets supported at the same time? Can I mix NVIDIA & AMD on the same system with the same OpenACC application (binary)?

A With the PGI 14.1 compiler, you will be able to generate a single binary with both NVIDIA and AMD code (and host code) in the same object file and executable binary. At runtime, the default behavior is to search for an accelerator that matches the generated code, so such a program would run on a system with an NVIDIA GPU, an AMD GPU, or no GPU at all. You will even be able to dynamically select which device to use, and even switch during execution, though you will have to be careful to have the right data on the right device; the runtime doesn't do the paging between devices. However, as I said during the webinar, at PGI we have not been successful in building a workstation with both NVIDIA and AMD GPU cards.
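
For reference, device selection at run time can be done through the OpenACC runtime API; the sketch below is hypothetical, and device-type constants such as acc_device_nvidia are implementation-defined names provided in openacc.h:

    #include <openacc.h>

    void pick_device(int prefer_gpu)
    {
        if (prefer_gpu && acc_get_num_devices(acc_device_nvidia) > 0)
            acc_set_device_type(acc_device_nvidia);   /* run kernels on the NVIDIA GPU */
        else
            acc_set_device_type(acc_device_host);     /* fall back to the host */
    }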

Q Do we need special ops to do reductions (sum, max, ...) or are those just inferred from standard-style loops?

A OpenACC has reduction() clauses that match the OpenMP reduction clauses. The PGI compilers can also recognize simple reductions (sum, max, min) that appear in loops.
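
A minimal example of the reduction clause (hypothetical code):

    float dot(const float *restrict x, const float *restrict y, int n)
    {
        float sum = 0.0f;
        #pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n])
        for (int i = 0; i < n; ++i)
            sum += x[i] * y[i];
        return sum;
    }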

Q When writing software that will be released to the general public, is it better to optimize for multiple GPUs and package them together with a way to sense which accelerator is present, or is it better to produce two separate binaries?

A There are several ways to handle this. One is to produce two binaries, perhaps having a load routine decide which binary to run depending on the GPU. Another is to use shared objects, putting the accelerated parts into a shared object and then loading the right one, again depending on the GPU. A third is to use a model like OpenCL, which does the compile at runtime, so it will always compile down to the current GPU. A fourth is to do what PGI does for OpenACC: compile at compile time for multiple targets and package these into a unified binary. There are advantages and disadvantages to each approach; you'll have to experiment to decide what will serve your particular community.

Q What's an example of an app that does not match GPUs well?

A I would say Monte Carlo simulations, or Mandelbrot calculations. Lots of branching, no internal vectors, no regularity between threads.

Q If I was only developing for NVIDIA platforms, are there still benefits to using OpenACC over direct CUDA design?

A This is a bit like comparing C to assembly; OpenMP vs. pthreads might be a better comparison. OpenACC is a higher level model and provides forward portability. CUDA is lower level and NVIDIA-specific, but it provides more access to and more control over the device. However, more control means more responsibility as well. If you want the absolute best performance and are willing to commit the resources, a lower level model like CUDA is your answer.

Q Does OpenACC support AMD APU architectures?

A OpenACC itself is target independent. The PGI compilers will support the integrated GPU on the AMD APU.

Q Can pgcc compile to a static lib or dll?

A I believe so.

Q Is pgcc available for Windows?

A Yes, though support for AMD Radeon will initially only be available in 64-bit mode.

Q What is the current state of the idea of OpenACC being merged into OpenMP? Is OpenACC here to stay, or should we expect to have to change all our directive syntax to conform to some future version of OpenMP?

A We are hoping that OpenACC and the OpenMP support for accelerators will eventually converge. Right now, OpenACC is leading technically. In the short term, many OpenACC members are also OpenMP members, and are providing our experiences to the OpenMP committee. Whatever happens, you should expect that the OpenACC vendors will continue to support OpenACC, even if the organization eventually merges with OpenMP. After all, we still support Fortran 66 one-trip loops.

Q Will CUDA Fortran ever be free-use as nvcc is today?

A I don't know the answer to this, sorry.

Q Will PGCC be integrated into the CUDA suite in the near future, or will it remain a separate product? Will there ever be a free license for this OpenACC compiler?

A There are several open-source efforts to implement OpenACC today, including two efforts in gcc and four others that I know of: OpenUH at University of Houston, OpenARC at Oak Ridge National Lab, AccULL at Universidad de La Laguna, and one more project in Korea whose name I've forgotten.

Q Does pgcc support C++ 11?

A pgcc is our C compiler; we have two C++ compilers: pgcpp and pgc++. The latter is only on Linux, and is the same as pgcpp except that it uses the GNU interfaces and STL. I believe we are currently working on C++11 features; we have many implemented, but not all. That being said, C++ support in OpenACC is sorely lacking. In particular, C++ programs typically use classes, arrays are members of classes, and so on, and OpenACC support for members is simply not there yet. The OpenACC committee is working (quite hard) on this exact issue.
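
The member problem can be illustrated even in C: copying a struct only copies the host pointer inside it. A common workaround today, shown in this hypothetical sketch, is to hoist the members into locals before the compute region:

    typedef struct { int n; float *data; } Vec;

    void scale(Vec *v, float s)
    {
        int    n = v->n;
        float *d = v->data;   /* hoist the member pointer into a local */
        #pragma acc parallel loop copy(d[0:n])
        for (int i = 0; i < n; ++i)
            d[i] *= s;
    }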

Q I have Fortran 77 legacy code with COMMON statements and such; how easy would it be to apply OpenACC? Thanks.

A You can use our Fortran 90 compiler to compile Fortran 77 programs as well. The biggest issue will be dealing with COMMON blocks. What you will eventually want, when we have procedure calls and separate compilation added, is to use the COMMON blocks (or C extern variables) in the same way on the device as on the host. That will be added, but I can't promise it will work in January.
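
For what it's worth, here is a hedged sketch of what that would look like in C with a global variable (the analogue of a COMMON block), using the OpenACC declare directive; as the answer says, compiler support for this usage was still being added at the time:

    #define N 1000
    float work[N];                    /* global, like a COMMON block member */
    #pragma acc declare create(work)  /* a device copy of the global is created */

    void init_work(void)
    {
        #pragma acc parallel loop present(work)
        for (int i = 0; i < N; ++i)
            work[i] = 0.0f;
        #pragma acc update host(work) /* copy the device data back to the host */
    }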
