High Performance Computing on GPUs with OpenACC Using PGI Compilers
- What are the three most popular hardware platforms for PGI OpenACC users?
- What are the real advantages of using OpenACC over actual CUDA code?
- Should we create a flowchart of the overall hardware configuration to guide us in peppering our code with OpenACC calls?
- So there's no way to "fence" between parallelized function calls?
- Will OpenACC extend its capabilities beyond NVIDIA GPUs and CUDA soon to other accelerators such as AMD GPUs, FPGAs, and DSPs?
- Will OpenACC be available for free to researchers, students, or others outside of commercial use, aside from a trial or restricted license?
- Would you say that NVIDIA is better than AMD?
- Is a new version of the OpenACC standard on the way?
- Is creating an array of pointers to arrays residing on device supported in parallel regions?
- What happened with single precision?
- How does the PGI compiler take advantage of software-managed cache or CUDA shared memory? Does it come only with the cache directive?
- Will the ability to know the gang index inside the code (like OpenMP) ever be added to OpenACC?
Q What are the three most popular hardware platforms for PGI OpenACC users?
A NVIDIA GPUs are the most popular accelerators for HPC users today. PGI also supports the AMD Radeon discrete GPUs and AMD APUs, and we've found the Radeon to be quite fast, though the AMD software strategy hasn't been as successful. PGI doesn't support the Intel Knights Corner Xeon Phi; we are waiting for the Knights Landing.
Q With the newer details of features and added lines of code comes better control over computation and performance, but with all the new lines of code that look similar to CUDA code what are the real advantages of using OpenACC over actual CUDA code?
A Both CUDA and OpenACC have to solve the same problems: data management and parallel computation. CUDA does data management using API calls (cudaMalloc, cudaMemcpy) whereas OpenACC does data management through directives. The C++ example I showed today used member functions to allocate and move data, and that's the best solution we've found for today. When we finish the deep copy work for OpenACC, you'll be able to remove those and put in a simpler data directive for the class, and the right data will get allocated and moved to and from the device.
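To make the contrast concrete, here is a minimal sketch of directive-based data management in C. The function name `saxpy` and its arguments are hypothetical and not taken from the webinar; the `data`, `copyin`, `copy`, and `parallel loop` directives are standard OpenACC. A compiler without OpenACC support ignores the pragmas and runs the loop on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y = a*x + y over n elements. */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);

    /* The data directive takes the place of explicit cudaMalloc/cudaMemcpy
       calls: x is copied to the device on entry, y is copied to the device
       on entry and back to the host on exit. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* The loop iterations are independent, so they can run in parallel. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            y[i] = a * x[i] + y[i];
        }
    }
}
```

In a CUDA version of the same routine, the allocation, the two host-to-device copies, the kernel launch, and the device-to-host copy would each be written out as separate calls.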
Q Should we create a flowchart of the overall hardware configuration to guide us in peppering our code with OpenACC calls?
A As discussed in the webinar, there are several different solutions being developed by different vendors. The common themes are that you will likely have to think about another level of the memory hierarchy, that is, high bandwidth memory, and you will have to expose more parallelism, and in particular more vector-like parallelism (think stride-1 inner loops) to get the best performance from the compute engines of the future. Today, the data directives are needed because the host and accelerator have physically and logically separate memories. In the future, the memories will likely present a single virtual address space, but data placement will still be important for best performance, so the data directives, or some similar mechanism, will be used to manage that.
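As an illustration of the stride-1 inner loop advice, the sketch below scales each row of a row-major matrix. The function `scale_rows` and its layout are assumptions made for this example. The outer loop is spread across gangs, while the inner loop walks consecutive memory locations and is mapped to vector parallelism.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: multiply row i of an n-by-m row-major matrix by s[i]. */
void scale_rows(int n, int m, float *a, const float *s)
{
    assert(n >= 0 && m >= 0 && a != NULL && s != NULL);

    /* Outer loop over rows: coarse-grained (gang) parallelism. */
    #pragma acc parallel loop gang copy(a[0:n*m]) copyin(s[0:n])
    for (int i = 0; i < n; ++i) {
        /* Inner loop over columns: adjacent j values touch adjacent
           addresses (stride 1), which suits vector execution. */
        #pragma acc loop vector
        for (int j = 0; j < m; ++j) {
            a[i * m + j] *= s[i];
        }
    }
}
```

Swapping the two loops would make the inner loop stride through memory in steps of `m`, which is the pattern the answer above suggests avoiding.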
Q So there's no way to "fence" between parallelized function calls?
A This was a question about two calls to parallel routines, each of which had a parallel loop. OpenMP requires a barrier synchronization after each parallel loop, so any dependences from the first loop to the second would be satisfied. OpenACC does not, in part because current devices (GPUs in particular) do not support a global barrier, for various technical reasons. So, no, there's no way to guarantee that all iterations of the first parallel loop will complete before starting the second parallel loop, as it is written, without splitting the parallel region up.
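A minimal sketch of the splitting approach, assuming a hypothetical routine `scale_then_sum` in which the second loop reads elements written by other iterations of the first loop. Each loop gets its own parallel region; without an `async` clause the host waits for the first region to finish before launching the second, and the enclosing data region keeps the arrays on the device in between.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: double a[], then b[i] = a[i] + a[(i+1) % n]. */
void scale_then_sum(int n, float *a, float *b)
{
    assert(n > 0 && a != NULL && b != NULL);

    /* Keep both arrays resident on the device across the two regions. */
    #pragma acc data copy(a[0:n]) copyout(b[0:n])
    {
        /* First parallel region. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            a[i] *= 2.0f;
        }

        /* Second parallel region: it starts only after the first has
           completed, so every a[] value it reads has been updated. */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            b[i] = a[i] + a[(i + 1) % n];
        }
    }
}
```

Placing both loops inside a single parallel region would not give the same guarantee, for the reasons described in the answer.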
Q Will OpenACC extend its capabilities beyond NVIDIA GPUs and CUDA soon to other accelerators such as AMD GPUs, FPGAs, and DSPs?
A The OpenACC spec is target agnostic. PGI already supports AMD APUs and GPUs. PGI demonstrated a compiler for the Intel MIC at SC12, and CAPS had support for the MIC. I think support for a DSP, or Adapteva Epiphany, or Kalray MPPA, or any of a number of other compute units could be done, if there were enough demand. PGI internally has support for a generic OpenCL target, though it's not a product-quality implementation. FPGA support would be a stretch. There is some OpenCL support for FPGA targets, but the performance you get depends very much on the type and shape of the OpenCL code you write. A small change in the program can produce a very big change in performance (up or down). FPGAs have been very successful in specific markets where they can apply a high degree of parallelism for a very carefully crafted algorithm, often with application-specific precision. In those cases, any effort to carefully optimize the code is well worth the cost, and it's hard to argue that making it easy to program will be generally useful.
Q Will OpenACC be available for free to researchers, students, or others outside of commercial use, aside from a trial or restricted license?
A OpenACC is coming to GCC with the 5.x release due out next year.
You can register for a free 30-day trial of the PGI Accelerator compilers with OpenACC. If you already have a PGI account but don't have a permanent license, log in to your PGI account to retrieve new trial license keys to use with the latest PGI release.
Q Would you say that NVIDIA is better than AMD?
A Speaking in terms of the performance and quality of the GPUs, they are both great products. If you look at the native peak performance and delivered graphics and compute performance, they often leapfrog each other from one product to the next. The NVIDIA software strategy and CUDA platform have been more successful than OpenCL at AMD. The PGI compilers support both targets.
Q Is a new version of the OpenACC standard on the way?
A The OpenACC committee is working hard on a version 2.5, with about 12 new or modified features (mostly minor), targeting adoption for this Spring. We are also working on a version 3.0, which will add support for deep copy, which is very important and needed, specifically for C++, but also for many Fortran applications.
Q Is creating an array of pointers to arrays residing on device supported in parallel regions?
A Yes. In fact, if you have a C 'float**x' array, you can say '#pragma acc data copy(x[0:n][0:m])', which gets implemented by allocating a vector of pointers on the device, then an array of data, then filling in the vector of pointers.
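A short sketch of how that directive might be used, assuming a hypothetical routine `add_one` that receives `n` row pointers, each pointing to `m` floats. Only the `copy(x[0:n][0:m])` clause comes from the answer above; the rest is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: add 1.0 to every element of an n-by-m array
   stored as a vector of row pointers. */
void add_one(int n, int m, float **x)
{
    assert(n >= 0 && m >= 0 && x != NULL);

    /* Two-dimensional array section: the pointer vector and the rows
       it points to are made available on the device. */
    #pragma acc data copy(x[0:n][0:m])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i) {
            #pragma acc loop
            for (int j = 0; j < m; ++j) {
                x[i][j] += 1.0f;
            }
        }
    }
}
```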
Q What happened with single precision?
A The single-precision (SP) performance dropped between version 14.1 and the 15.1 candidate. We haven't investigated this yet. One big difference is that the 14.1 compiler used the CUDA 5.0 toolkit as its default back-end code generator, whereas the 15.1 compiler uses the CUDA 6.0 toolkit. We have found some performance drops with later CUDA versions, and we are working with the CUDA team to fix those.
Q How does the PGI compiler take advantage of software-managed cache or CUDA shared memory? Does it come only with the cache directive?
A The PGI compiler used to try very hard to optimize for the CUDA shared memory, which was very important on the original Tesla (compute capability 1.3) devices. With Fermi and Kepler, this is less important, so we have disabled the automatic caching by default. We have started back on this feature recently, and may reenable automatic caching. However, today we mostly rely on the user cache directives.
Q Will the ability to know the gang index inside the code (like OpenMP) ever be added to OpenACC?
A We have resisted adding this to the specification. In OpenMP, the threads are long-lived, and the same threads execute across multiple parallel regions. In OpenACC, the gangs and threads are short-lived. However, we have also found it useful to be able to get the gang, worker and vector index inside a loop, if only for debugging. We will likely expose this in the PGI compiler in the 15.1 or 15.4 release.
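For reference, a user cache directive might look like the sketch below, which uses a hypothetical three-point averaging routine `smooth3`. The `cache` directive sits at the top of the loop body and names the array section each iteration reads; whether that section actually ends up in CUDA shared memory depends on the compiler and the target, so treat it as a hint.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: b[i] = average of a[i-1], a[i], a[i+1]
   for the interior points; b[0] and b[n-1] are left unchanged. */
void smooth3(int n, const float *a, float *b)
{
    assert(n >= 0 && a != NULL && b != NULL);

    /* copy (not copyout) for b so the untouched end points survive. */
    #pragma acc parallel loop copyin(a[0:n]) copy(b[0:n])
    for (int i = 1; i < n - 1; ++i) {
        /* Ask for the three elements used by this iteration to be held
           in the fastest available memory (start index i-1, length 3). */
        #pragma acc cache(a[i-1:3])
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0f;
    }
}
```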