Technical News from The Portland Group
PGI Accelerator Programming Model Support on NVIDIA Fermi GPUs
Since NVIDIA's announcement late last fall, we've all been anxiously awaiting the new Tesla C2050 / C2070 cards based on NVIDIA's Fermi architecture. High-end Fermi cards support more cores, larger device memory, a two-level device memory cache, and larger shared memory. Moreover, and quite exciting, the peak compute bandwidth for double precision floating point operations is up to half the single precision bandwidth, the same ratio we see in everyday CPUs. They also have other important features, such as error-correcting (ECC) device memory, important for any production application.
Using a Fermi architecture card requires a little more than just finding one and installing it. Fermi implements CUDA compute capability 2.0 (the Tesla C1060 was compute capability 1.3, and there is a selection of other graphics cards with compute capability 1.2, 1.1, or 1.0). In addition, you need an appropriate CUDA-enabled NVIDIA device driver installed with your operating system, and a corresponding CUDA toolkit.
To use a C2050 or C2070 card, you have to install an updated device driver and use an updated toolkit. If you have PGI 10.2 or later installed, the PGI pgaccelinfo tool will tell you what driver version is on your system. The first line of output is the device driver version number:
% pgaccelinfo
CUDA Driver Version:           3000
Device Number:                 0
...
If the printed Driver Version is 3000 (meaning 3.0) or higher, then you have a Fermi-enabled driver installed; a 2.3 driver will display as version 2030, and doesn't support Fermi architecture cards.
You can also download the 3.0 NVIDIA toolkit from the NVIDIA website to generate CUDA programs for Fermi cards. The 3.0 toolkit can also generate code for devices with compute capability 1.3 or lower, such as a Tesla C1060 or GeForce GTX 280 card. It's important to note that the 3.0 toolkit uses a different compiled representation for the CUDA kernels. The 2.3 and older toolkits used a textual cubin file; the 3.0 toolkit uses standard ELF representation. This creates a dependence between the toolkit and device driver versions: programs compiled with the 2.3 toolkit will run on systems with 2.3 or 3.0 device drivers, but programs compiled with the 3.0 toolkit require a 3.0 driver, regardless of the NVIDIA card you are using.
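The toolkit/driver dependence described above reduces to a simple rule: a program runs only if the installed driver is at least as new as the toolkit used to build it. As a rough sketch (the function name and version encoding here are illustrative, not part of any PGI or NVIDIA tool):

```python
def program_runs(toolkit, driver):
    """A program built with a given CUDA toolkit needs a driver of at
    least that version: 2.3-built cubins load under 2.3 or 3.0 drivers,
    but 3.0-built ELF images require a 3.0 driver."""
    return driver >= toolkit

print(program_runs(2.3, 2.3))  # True
print(program_runs(2.3, 3.0))  # True
print(program_runs(3.0, 2.3))  # False: ELF images need a 3.0 driver
```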
PGI CUDA Fortran and the PGI Accelerator directive-based GPU compilers have now added support for NVIDIA cards based on the Fermi architecture. The PGI compilers come with both the 2.3 and 3.0 toolkits, so can support existing systems with version 2.3 device drivers, and updated systems with 3.0 drivers and Fermi cards. Our goal is to serve users who are building and running programs on their own systems, as well as users who want to build programs that will run on other systems that may not match their own configuration. This article will explain the techniques used in the compilers to provide the most flexibility, portability and performance across a variety of target systems.
Setting up your PGI installation
By default, the PGI compilers will use the CUDA 2.3 toolkit to build CUDA Fortran and PGI Accelerator applications. This is fine if you don't yet have a C2050 / C2070 card, or are still running a CUDA version 2.3 driver. However, if you have the latest drivers installed, you can change the default by creating or editing your siterc file in the $PGI/<target>/<version>/bin directory, and adding the line

set DEFCUDAVERSION=3.0;
Alternatively, you can add that line to your personal .mypgirc file in your $HOME directory (on Windows, drop the leading period). Keep in mind that programs built with the CUDA 3.0 toolkit will not run on systems using a 2.3 driver, so that may limit the portability of your program until everyone has updated to 3.0 or higher drivers.
You can also set the default target compute capability for your CUDA Fortran and PGI Accelerator applications. In older PGI 10.x versions, the default was to generate Tesla C1060 code (compute capability 1.3); you could change this with a command-line option like ‑Mcuda=cc11 or ‑ta=nvidia,cc11, where cc11 means generate code for compute capability 1.1. With PGI 10.4, we started using PGI Unified Binary technology to pack multiple compute capabilities into a single object: one version for compute capability 1.0 and one optimized for the Tesla C1060 at compute capability 1.3. The pgaccelinfo tool will also give you the compute capability for your card; the output below shows that a GeForce 8600 GTS card has compute capability 1.1:
% pgaccelinfo
CUDA Driver Version:           2030
Device Number:                 0
Device Name:                   GeForce 8600 GTS
Device Revision Number:        1.1
Global Memory Size:            268107776
Now you can set the default compute capability for your installation by adding another line to the siterc file or your .mypgirc file; for example, to make compute capability 2.0 the default:

set COMPUTECAP=2.0;
You can still override this on the command line with the cc10, cc11, cc12, cc13, or the new cc20 sub-options to ‑Mcuda or ‑ta=nvidia flags.
Moreover, you can set multiple default capabilities. With the line:
set COMPUTECAP=1.1 1.3;
the compiler will generate two versions of every GPU kernel, one for compute capability 1.1 and another for 1.3.
Using Compute Capability
Many users are only targeting one card, on the system they use regularly. Those users may be best served by setting the default compute capability appropriately for that card, as described above.
Others are developing a program on one system, then do production runs on a larger system which may have a different type of card installed. In particular, you may do your development on a Tesla C1060 system, then do your production runs on a system with a newer Fermi architecture C2050 or C2070 card. In that case, you'll want to understand how compute capabilities are managed by the compiler and runtime system.
Compute capability describes the revision and capability of the hardware. Higher-capability devices have more functionality. The lowest compute capability 1.0 devices are quite powerful, but don't support all the features of more capable devices. Compute capability 1.1 devices added some atomic operations. Compute capability 1.2 allowed more resident warps in each streaming multiprocessor, and added more atomic operations and warp vote functions. Compute capability 1.3 added native double precision floating point operations. Compute capability 2.0 enlarged the shared memory, and increased the resident warp count and maximum thread block size, along with many other upgrades.
The PGI compilers will generate CUDA binaries with one or more compute capabilities, depending on the command line options and the installed defaults. The stock installation will use CUDA toolkit 2.3 to generate compute capability 1.0 and 1.3 CUDA binaries. If you change the CUDA toolkit version by setting DEFCUDAVERSION as shown above, or by using the ‑Mcuda=cuda3.0 or ‑ta=nvidia,cuda3.0 options, the compilers will generate compute capability 1.0, 1.3 and 2.0 CUDA binaries. The compiler also saves one version of the NVIDIA portable assembly file, the PTX code, as we shall see shortly.
At execution time, the runtime library will select the version of code to run based on the actual compute capability of the card being used. If you have compute capability 1.0 and 1.3 binaries, the runtime system will use the 1.0 binary for devices of compute capability 1.0, 1.1 or 1.2; it will use the 1.3 binary on a Tesla C1060 and other compute capability 1.3 devices (e.g. a GTX 280). If you have compute capability 1.3 and 2.0 binaries, which might happen if you are using double precision floating point, the runtime system will fail to find an appropriate binary for compute capability 1.0, 1.1 or 1.2 devices.
As mentioned, if you have a compute capability 1.0 binary and a 1.2 device, the runtime system will select the 1.0 binary. This works because the compute capability 1.x devices are binary upward compatible; a 1.x binary can run on a 1.y device if x ≤ y. However, the compute capability 2.0 devices are not binary compatible with 1.x; compute capability 1.3 binaries will not run on a Fermi architecture card.
To preserve forward compatibility as seamlessly as possible, the compilers will save the NVIDIA PTX file for your CUDA binary as well. If there is an appropriate binary available, the runtime system will always choose that binary. If you have only 1.x binaries, but are running on a 2.0 card, the runtime system will select the NVIDIA PTX file. This incurs some runtime overhead, as the device driver will have to dynamically invoke the device assembler to assemble the PTX into a binary.
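The selection the runtime performs can be sketched roughly as follows. This is a simplified model for illustration only; the function and encodings are hypothetical, and the actual PGI runtime logic may differ in its details.

```python
def pick_version(device_cc, driver, versions):
    """Choose which embedded code version to run.

    device_cc: device compute capability, e.g. (1, 1)
    driver:    driver version as printed by pgaccelinfo, e.g. 2030
    versions:  list of (cc, kind) pairs, kind in {"cubin", "elf", "ptx"}
    """
    def binary_ok(cc, kind):
        if kind == "elf" and driver < 3000:
            return False        # ELF images require a 3.0 driver
        # 1.x binaries are upward compatible within the 1.x family,
        # but no 1.x binary runs on a 2.0 (Fermi) device
        return cc[0] == device_cc[0] and cc <= device_cc

    binaries = [cc for cc, kind in versions
                if kind in ("cubin", "elf") and binary_ok(cc, kind)]
    if binaries:
        return max(binaries), "binary"   # prefer the best matching binary
    # fall back to PTX, JIT-assembled by the device driver at load time
    ptx = [cc for cc, kind in versions if kind == "ptx" and cc <= device_cc]
    if ptx:
        return max(ptx), "ptx"
    return None                          # no usable version: runtime error
```

Under this model, a program holding only 1.x binaries plus 1.x PTX still runs on a 2.0 device via the PTX fallback, while a program holding only 2.0 code finds nothing usable on a 1.1 device, which is exactly the mismatch reported in the error messages shown below.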
Possible Runtime Problems
With all the combinations of compute capabilities, CUDA driver versions and toolkit versions, there are many ways to run into problems. When using the PGI Accelerator model, the runtime library will give a fatal error message when it can't find a version of the kernel that will run on the target hardware.
For example, we might see a message like:
Compute capability mismatch
file: /home/mwolfe/test3/accmatmul/mm2.f90
routine: mm1
line: 6
device: 0 compute capability 1.1
driver: 2030
Available compute capability: 1.0(elf-unsupported) 1.3(elf-unsupported)
    2.0(elf-unsupported) 2.0(ptx)
The first lines identify the source file, routine, and line number. The next line tells me on which device the program is running; in this case, it's running on device zero, which happens to be the GeForce 8600 GTS mentioned above, with compute capability 1.1. Next it tells me the CUDA device driver version, 2030, meaning this machine has a version 2.3 driver installed. The available versions of the program are three binaries (1.0, 1.3, 2.0) and a PTX (2.0). However, the binaries are all in ELF format, which the 2.3 driver doesn't support; this happens if the program is built using the CUDA 3.0 toolkit. The one available PTX version is for compute capability 2.0, so that's not usable either. The solution is to rebuild the program using the CUDA 2.3 toolkit, or to upgrade the driver on the system.
Another message you might see is:
Compute capability mismatch
file: /home/mwolfe/test3/accmatmul/mm2.f90
routine: mm1
line: 6
device: 0 compute capability 1.1
driver: 2030
Available compute capability: 1.3(cubin) 1.3(ptx)
Here, I rebuilt the same program using the CUDA 2.3 toolkit, but only generated compute capability 1.3. Since the device is only compute capability 1.1, the program fails. The solution is to rebuild the program specifying compute capability 1.0 or 1.1 on the command line, or to reset the default, as shown above.
In CUDA Fortran, you can have the same problems, but they won't show up in the same way. Instead, you have to check for errors. You can check for errors in your host program after CUDA procedure calls, kernel invocations, or memory allocation and transfers:
use cudafor
real,device :: a(:)
...
istat = cudaGetDeviceCount(n)
if( istat .ne. 0 )then
    print *,cudaGetErrorString(istat)
    stop
endif
If you run a program built with the version 3.0 toolkit on a system with a version 2.3 driver, you'll get an error code of 35:
CUDA version is insufficient for CUDART version
If you try to run a kernel compiled for compute capability 1.3 on a device with compute capability 1.1, you'll get an error code 8:
invalid device function
You might not see these errors until after a cudaThreadSynchronize() call, or some other operation that synchronizes with the device. In particular, the kernel launches are asynchronous, so errors from the kernel launch won't show up until the next synchronizing operation.
Summary of How to Target Multiple Types of NVIDIA Devices
If you are running your own workstation or server with NVIDIA cards, we encourage you to upgrade to a version 3.0 device driver, downloadable from the NVIDIA web site. Then you can add or modify the siterc file in the PGI installation directory to make the 3.0 toolkit the default. If you have a cluster or other production system where upgrading drivers can't be scheduled on short notice, you can still use the CUDA 2.3 toolkit, but you won't be able to target Fermi architecture devices like the C2050 and C2070.
The default for PGI CUDA Fortran and Accelerator compilers is to generate CUDA binaries for compute capabilities 1.0 and 1.3, when using the CUDA 2.3 toolkit, and for compute capabilities 1.0, 1.3 and 2.0, when using the 3.0 toolkit. This gives the widest portability across the available NVIDIA GPU devices. Remember that if your program uses double precision floating point, the compiler can't generate a compute capability 1.0 kernel, and the program will require a compute capability 1.3 or higher card.
You can override the default toolkit version and target compute capabilities on the command line with ‑Mcuda and ‑ta=nvidia suboptions:
- cuda2.3 will use the CUDA 2.3 toolkit; you might need this option if you have or are targeting a system with a CUDA 2.3 driver.
- cuda3.0 will use the CUDA 3.0 toolkit; you might want this option if you want to target Fermi devices; only one of the cuda2.3 or cuda3.0 options may be used.
- cc10, cc11, cc12, cc13, and cc20 will generate code for compute capability 1.0, 1.1, 1.2, 1.3, and 2.0, respectively. These options can be combined to generate multiple versions, similar to the default behavior. Note that double precision and other features not available in lower compute capability devices will disable code generation for those compute capabilities.
The PGI Accelerator compilers have a new feature: they can display an estimate of occupancy on the GPU, along with the registers and shared memory used by each kernel. This appears in the ‑Minfo or ‑Minfo=accel messages:
...
    12, Loop is parallelizable
        Accelerator kernel generated
         7, !$acc do parallel, vector(16)
        11, !$acc do seq
            Cached references to size [16x16] block of 'b'
            Cached references to size [16x16] block of 'c'
        12, !$acc do parallel, vector(16)
            CC 1.0 : 17 registers; 2072 shared, 84 constant, 0 local memory bytes; 33% occupancy
            CC 1.3 : 17 registers; 2072 shared, 84 constant, 0 local memory bytes; 75% occupancy
            CC 2.0 : 22 registers; 2056 shared, 104 constant, 0 local memory bytes; 83% occupancy
In this example, the compiler generated three CUDA binaries, for compute capabilities 1.0, 1.3 and 2.0. The informational messages tell how many registers were used in each binary, how much shared memory, and give an estimate of occupancy. Occupancy is the ratio of the number of simultaneously active warps this kernel will have to the maximum number of active warps for that compute capability (a warp is a group of 32 threads that the hardware executes in SIMT mode). You want many simultaneously active warps to keep the device busy when some warps are waiting for memory to respond, for instance. Occupancy is limited by:
- the number of registers used by each thread (low register usage allows more simultaneous warps);
- the shared memory used by each thread block (low shared memory usage allows more thread blocks, and hence more warps);
- the number of threads in a thread block (resources for the whole thread block must be available, so two large thread blocks may not fit).
High occupancy does not guarantee high performance, but low occupancy almost certainly predicts low performance. Note that occupancy depends on the compute capability, because higher compute capabilities have more registers and other resources.
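As a rough illustration of how these limits interact, the occupancy figures in the ‑Minfo listing above can be reproduced with a simplified model. This is a sketch only: it ignores register and shared-memory allocation granularity, and the per-SM limits in the table are NVIDIA's published figures for each compute capability.

```python
# Per-SM resource limits for each compute capability (NVIDIA published
# figures): total registers, shared memory bytes, max resident warps,
# and max resident threads.
SM_LIMITS = {
    "1.0": dict(regs=8192,  smem=16384, warps=24, threads=768),
    "1.3": dict(regs=16384, smem=16384, warps=32, threads=1024),
    "2.0": dict(regs=32768, smem=49152, warps=48, threads=1536),
}

def occupancy(cc, threads_per_block, regs_per_thread, smem_per_block):
    """Estimate occupancy: resident warps / max warps, where the number
    of resident thread blocks is bounded by registers, shared memory,
    and the thread limit. Allocation granularity is ignored."""
    lim = SM_LIMITS[cc]
    blocks = min(
        lim["regs"] // (regs_per_thread * threads_per_block),
        lim["smem"] // smem_per_block,
        lim["threads"] // threads_per_block,
    )
    warps_per_block = (threads_per_block + 31) // 32
    return blocks * warps_per_block / lim["warps"]

# The 16x16 kernel from the listing above: 256 threads per block.
print(round(occupancy("1.0", 256, 17, 2072) * 100))  # 33
print(round(occupancy("1.3", 256, 17, 2072) * 100))  # 75
print(round(occupancy("2.0", 256, 22, 2056) * 100))  # 83
```

On compute capability 1.0, the 17 registers per thread allow only one 256-thread block per multiprocessor (8 warps of the maximum 24), which is why the same kernel that reaches 75% and 83% occupancy on newer devices is stuck at 33% there.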
When I recently visited our local Fry's Electronics store, the staff noted that the shelf life of a new Fermi-class NVIDIA graphics card (e.g. a GTX 470 or GTX 480) was about an hour. I saw one being placed on the shelf, only to have someone else pick it up before I could get my hands on it. But, if you're lucky enough to have found a Fermi architecture device, you'll be pleased to know that the PGI compilers are ready to support your new card. As availability improves, we expect this to be the standard for accelerated computing for the foreseeable future.