Questions from the May 2014 IEEE Webinar on C++ on GPUs Using OpenACC and the PGI Accelerator Compilers

General questions about OpenACC, GPUs, and the PGI compilers:


Q How do I profile the performance?

A You can set the PGI_ACC_TIME environment variable to a nonzero value, or to the name of a file, and device times will be collected for data movement and kernel launches. You can also use the PGI profiler, by running your program under control of 'pgcollect -cuda a.out' and then running pgprof to view the results. You can also use the NVIDIA nvvp visual profiler. We use all three.
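
For example, a typical session (assuming your executable is named a.out) might look like this:

    export PGI_ACC_TIME=1     # collect device times for data movement and kernel launches
    ./a.out                   # timing summary is printed when the program exits

    pgcollect -cuda a.out     # gather a CUDA performance profile
    pgprof                    # view the collected results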

Q Can you utilize multiple GPUs in one program?
Q Does OpenACC take advantage of multiple GPUs on a system? How well does it manage those assets?

A The compiler and runtime do support using more than one GPU, but it's not easy. In particular, the compiler can't automatically split the data and the work across multiple GPUs. There are PhD dissertations to be written on this topic. I did a presentation on this at the NVIDIA GTC-2014 conference. Here are links to the presentation (MP4 video) and the slides I used.

Session S4474 — Scaling OpenACC Across Multiple GPUs MP4, PDF
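
As a rough sketch of the kind of manual splitting involved (not from the webinar; the names and partitioning are illustrative), one common pattern pairs OpenACC with OpenMP and binds each host thread to its own GPU with acc_set_device_num:

    #include <openacc.h>
    #include <omp.h>

    void saxpy_multi(int n, float a, float *x, float *y)
    {
        int ngpus = acc_get_num_devices(acc_device_nvidia);
        if (ngpus < 1) ngpus = 1;

        #pragma omp parallel num_threads(ngpus)
        {
            int dev   = omp_get_thread_num();
            int chunk = (n + ngpus - 1) / ngpus;      /* split the work by hand */
            int lo    = dev * chunk;
            int hi    = lo + chunk < n ? lo + chunk : n;

            if (lo < hi) {
                acc_set_device_num(dev, acc_device_nvidia);   /* bind this thread to one GPU */

                #pragma acc parallel loop copyin(x[lo:hi-lo]) copy(y[lo:hi-lo])
                for (int i = lo; i < hi; ++i)
                    y[i] = a * x[i] + y[i];
            }
        }
    }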

Q Do you plan to support Intel Xeon CPU and Intel Xeon Phi Accelerator?
Q Can we use these directives for the Intel Xeon Phi?
Q When compiling for both the device and the cpu, is the cpu version parallelized as well (multicore)?
Q For -ta=host, does it generate code for multi-core?

A We do have plans to generate parallel code for multicore CPUs, like the Intel Xeon CPU, from OpenACC directives, but that work is not in the current release. Support for the Intel Xeon Phi is on our roadmap, but is not yet scheduled; we will likely wait for the Knights Landing chip, which has the full 64-bit x86 instruction set.

Q Support for AMD hardware?

A Yes, all the features shown here are supported on AMD Radeon GPUs and APUs. The one feature I mentioned that is not yet supported on the AMD hardware is separate compilation: having a device routine in one file called from another file.

Q How do I download the slides to a file?
Q When will this presentation show up at openacc.org?

A The slides are available on the PGI website now. We'll try to get them up at openacc.org as well.

Q What would you recommend as a typical HW configuration?

A For experimentation, many people use a Linux workstation with any CUDA-capable NVIDIA GPU. I do some of my work on my MacBook laptop, which has an NVIDIA GeForce GPU. For production work, many users run on a Linux cluster where each node has one or more NVIDIA Tesla cards.

Q Would I be able to try out your C examples on PGI Developer C Workstation for Mac OS X (PGI Free for Mac)? Are the compiler flags the same as in the slides?

A The Free PGI for Mac does not include the accelerator features. The command line flags are the same for Mac as for Linux.

Q Would there be Ruby bindings (after vector or abstract data type deep copy handling is done)?

A We have no plans for Ruby bindings. To be honest, I'm not sure what that would mean.

Q Do you have an example of a device routine using $acc routine? (assuming the inliner doesn't handle it)

A I don't have one immediately handy. I'll be working on a PGI Insider article about this very topic in the near future, with several examples.
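
In the meantime, here is a minimal sketch of the idea (illustrative only):

    #pragma acc routine seq
    float scale(float v, float a)      /* compiled for the device as well as the host */
    {
        return a * v;
    }

    void apply(int n, float a, float *x)
    {
        #pragma acc parallel loop copy(x[0:n])
        for (int i = 0; i < n; ++i)
            x[i] = scale(x[i], a);     /* device-side call, no inlining needed */
    }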

Q Will PGI support OpenMP 4.0 directives for accelerators for GPUs?

A Eventually, PGI will likely support the OpenMP directives for accelerators. We think OpenACC has advantages for the devices available today.

Q Are Unified Memory features of CUDA used?

A We are experimenting with Unified Memory internally. The problem with Unified Memory is that the data must be allocated using the CUDA managed memory allocator, so it affects more than just the compute regions, and it doesn't work at all for static data.
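
To illustrate the constraint (a sketch only; this is not something the current release does for you): the arrays would have to come from the managed allocator rather than from malloc or static declarations.

    #include <cuda_runtime.h>

    void managed_example(int n)
    {
        float *x = 0;
        /* Unified Memory requires the CUDA managed allocator ...          */
        cudaMallocManaged((void **)&x, n * sizeof(float), cudaMemAttachGlobal);
        /* ... while malloc'd or static arrays are ordinary host memory
           and would still need explicit copies to and from the device     */
        cudaFree(x);
    }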

Q Can OpenACC make use of mapped memory (host memory mapped to GPU)?

A Yes and no. OpenACC and the PGI compilers will not try to automatically map host memory into the GPU address space. If you map host memory yourself, you can pass that pointer to the OpenACC region using the deviceptr() data clause. However, when the GPU accesses data in host memory, that access goes over the PCIe bus. The PCIe bus is a great I/O bus, but a pretty bad memory bus. PCIe 3 bandwidth is in the range of 1GB/second per lane, while the bandwidth of the GPU device memory is over 200GB/second, so that's two orders of magnitude difference.
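
A sketch of the "yes" half, assuming you do the mapping yourself with the CUDA runtime (the names here are made up):

    #include <cuda_runtime.h>

    void mapped_example(int n)
    {
        float *h_x, *d_x;
        /* allocate host memory that the GPU can address directly */
        cudaHostAlloc((void **)&h_x, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_x, h_x, 0);

        /* deviceptr() says d_x is already device-visible; every access
           in the loop below still travels over the PCIe bus             */
        #pragma acc parallel loop deviceptr(d_x)
        for (int i = 0; i < n; ++i)
            d_x[i] = (float)i;

        cudaFreeHost(h_x);
    }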

Q I think NVIDIA is trying to implement MPI between devices via fast GPU interconnects. Is PGI going to support this?

A This is part of GPUDirect, which allows MPI to move data directly between GPU memories. We are looking at how to expose and exploit this with OpenACC device data.

Q Licensing and pricing for OpenACC compiler, for personal and corporate users?

A Go to the www.pgroup.com website, click on products in the banner at the top, then under HPC Products choose PGI Workstation, PGI Server, or PGI CDK Cluster Development Kit. At that page, click on Pricing in the box at the right.

Q Can OpenACC deal with recursive functions?

A To be honest, we don't have tests of this. CUDA allows recursive functions, though not to unlimited depth, so OpenACC should allow them as well.

Q Does the PGI debugger support OpenACC?

A Yes. Compiling with -g will generate DWARF debug information for both the host and the GPU, and also disables most optimizations.

Q Anything you can say about NVIDIA's purchase of PGI? e.g. integration of OpenACC into NVIDIA compilers with Visual Studio?

A NVIDIA acquired PGI last summer. We are preserving the PGI brand and products. As you note, there are many synergies between PGI and NVIDIA software, and we are exploring how best to integrate them. Nothing to announce at this time.

Q Performance comparison with native GPU (such as CUDA) vs OpenACC?

A Good question. Our experience has been that OpenACC code will perform about the same as straightforward CUDA code. CUDA programmers tend to do more manual optimization, in particular things that a compiler can't do, such as changing the data layout or padding data structures to a multiple of the thread block size.

Q What are the differences with the Thrust library?

A If you find a library, like Thrust, that solves your problems, you should definitely go in that direction. Using libraries allows you to use someone else's work. OpenACC is designed for those programmers that have to (or want to) write their own loops.

Questions about the examples used in the webinar:

Q What is the limitation of maximum n?
Q Does the maximum value of n depend on the GPU memory size?

A In this example, the maximum size is limited only by the memory on the device. In most programs that we've seen, the only limit to the amount of work you can do on the device is the memory size.

Q Where is the 'a'-vector in the example?

A The a vector is declared in the main routine, which I didn't show in the webinar. The example programs will be available for download.

Q Should it not be "y and x" in the present clause in the slides, because "a & b" are passed as "x & y"?
Q The present clause refers to arrays a and b which are defined outside the axpy function. Is it correct?

A Yes, you are right. The examples that I ran were right, but the slide was wrong.
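
For reference, the corrected version would look something like this (the actual webinar example may differ in its details):

    void axpy(int n, float alpha, float *x, float *y)
    {
        /* x and y are the parameter names visible inside axpy, so they,
           not the caller's a and b, belong in the present clause        */
        #pragma acc parallel loop present(x[0:n], y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = alpha * x[i] + y[i];
    }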

Q What is the "this" variable?

A The this variable is the implicit pointer to the current object, defined for the member functions of a class or struct.
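
For example (a trimmed-down illustration, not the webinar class):

    class Vector {
        int    n;
        float *data;
    public:
        int   size() const  { return this->n; }      /* 'this' points at the object size() was called on */
        float get(int i) const { return data[i]; }   /* member references are shorthand for this->data[i] */
    };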

Q Where to download your sample codes?

A The examples are available to download from the PGI website.

Q Does C++ provide performance advantages over C?

A Not to get into language wars, but the basic answer is no. The C++ language itself has no performance advantage over C. C++ is more expressive, and has features (templates in particular) that encourage more code reuse, but basically C++ compiles down to C code, so there's nothing in C++ that you couldn't have written in C, though the C code may be longer.

Q Question about the "data update" clause. May I use it in the middle of a code block, or do I have to use it at the top of a new code block?

A The 'update' directive is executed from the host. It may not appear inside a compute region, but it may appear anywhere a statement could appear in the host code.
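
For example (a sketch; the array and sizes are made up):

    void step(int n, float *x)
    {
        #pragma acc data copy(x[0:n])
        {
            x[0] = 0.0f;                        /* host-side change inside the data region      */
            #pragma acc update device(x[0:n])   /* refresh the device copy before computing     */

            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                x[i] += 1.0f;

            #pragma acc update host(x[0:n])     /* copy results back in the middle of the block */
        }
    }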

Q Is it possible to call CUDA device routines from within ACC parallel regions?

A Yes, but it takes some work. It's easier with CUDA Fortran and OpenACC in Fortran, but it works in C/C++ as well. You have to build the device routine with nvcc, then declare the routine prototype in the OpenACC source file, followed by a "#pragma acc routine(name) seq" line telling the OpenACC compiler that a routine of that name has been compiled for the device.
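
A minimal sketch (the file names, routine names, and exact build options are illustrative and depend on your installation):

    /* devfun.cu -- built with nvcc, e.g.: nvcc -rdc=true -c devfun.cu */
    extern "C" __device__ float dev_scale(float v, float a)
    {
        return a * v;
    }

    /* main.c -- built with the PGI compiler, e.g.: pgcc -acc -c main.c */
    float dev_scale(float v, float a);          /* prototype for the nvcc-built routine */
    #pragma acc routine(dev_scale) seq

    void apply(int n, float a, float *x)
    {
        #pragma acc parallel loop copy(x[0:n])
        for (int i = 0; i < n; ++i)
            x[i] = dev_scale(x[i], a);
    }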

Q Is it possible to use OpenACC and CUDA together?

A Yes. OpenACC has a 'deviceptr' data clause, which tells the compiler that the user allocated this array using (for NVIDIA) CUDA calls, so it should just use that pointer on the device. There is also an 'acc_deviceptr' API routine that will return the GPU address for an array on the device, allowing you to access that data from CUDA kernels.
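
For example (a sketch assuming the array was allocated with cudaMalloc):

    #include <cuda_runtime.h>

    void fill_on_device(int n)
    {
        float *d_x;
        cudaMalloc((void **)&d_x, n * sizeof(float));   /* allocated with CUDA, not by OpenACC */

        /* deviceptr() tells the compiler d_x is already a device address,
           so no copies or present-table lookups are generated            */
        #pragma acc parallel loop deviceptr(d_x)
        for (int i = 0; i < n; ++i)
            d_x[i] = (float)i;

        cudaFree(d_x);
    }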

Q Does the PGI compiler support CUDA 6?
Q Does 14.4 require the CUDA 6.0 toolkit? I heard a rumor that it would work with 5.5.

A Yes, PGI 14.4 supports CUDA 5.5 and 6.0.
