Michael Wolfe

One of the most important issues when programming a system with a GPU or any attached accelerator is managing data movement between host memory and device memory. This is a special case of managing a memory hierarchy, a problem that has been with us since the dawn of computing. The very first computer I programmed was an IBM 360/75 at the University of Illinois which had a whole megabyte of magnetic core memory, composed of 400KB of fast core and 600KB of slow core. Systems in those days didn't support virtual memory, so programmers were responsible for staging data from disk (or tape) storage to memory and back, and for telling the operating system the maximum amount of memory the program would use (or on this system, the maximum amount of fast and of slow core that the program would use).

Introductory Example

Since that time, much of the memory hierarchy management problem has been relegated to the hardware and operating system. Common CPUs now include fast on-chip cache memories with hardware to detect cache misses, load the missed data into the cache, evict stale data, and detect conflicts between caches attached to different cores or processors. Programmers now mostly ignore cache memory, but will sometimes tune a program to enhance locality, thus improving the cache hit rate and the total performance. Operating systems use address translation hardware to support paged virtual memory, detecting when a program needs more memory than is physically available, evicting stale pages to disk and reloading pages when needed. Some decades ago, programmers would sometimes optimize for virtual memory locality in the same way that they optimize today for caches. Now that current systems have many gigabytes of physical memory, the need for virtual memory locality is much reduced, except for applications with very, very large datasets.

However, today's attached accelerators, such as GPUs, bring us back to a programmer-managed memory hierarchy. Using CUDA or OpenCL, the programmer is responsible for allocating memory either in the host memory or in the device memory, and for copying data between the two. This burden falls on the programmer for two reasons. First, the memory requirements and technologies for host memory and device memory are quite different. The host already has a very large memory, quite often 16GB, 32GB, 64GB or larger. This memory doesn't need to be particularly high performance, since the CPU has a large, deep cache hierarchy. With such a large memory, cost is key. A highly parallel device like a GPU requires a much higher bandwidth memory system. Current GPUs use GDDR, which in the right configuration supports memory bandwidths of over 300GB/s, several times faster than CPU memory bandwidth. CPU performance depends on most memory accesses hitting in the cache memory, so a lower main memory bandwidth is sufficient because it only needs to satisfy the cache misses. Highly parallel devices like GPUs are designed to work efficiently with very large datasets where cache memories are too small to be effective, so they require higher memory bandwidths. This memory is more expensive and therefore smaller. The smaller size often requires data to be shuffled from the host memory to device memory and back again, much like data was moved between CPU memory and disk some four decades ago.

The second reason this problem is left to the programmer is that automating the memory hierarchy management requires more hardware support than is currently available on these devices. We might think of the device memory as a cache on the device, backed by the system memory. What would it take to manage this just like today's hardware caches, or like today's virtual memory? Unfortunately, this requires several hardware features that exist in today's CPUs but are not available in current GPUs, such as page translation, miss detection, kernel suspend and resume, and more. These features will be added in the coming years, but today's GPUs just don't have them.

However, NVIDIA GPUs and CUDA drivers support a feature called CUDA Unified Memory that approximates what we really want. CUDA Unified Memory allows a program to dynamically allocate data that can be accessed from the CPU or from the GPU, with the same pointer and address, and the CUDA driver will ensure that the data is in the right memory at the right time. The PGI CUDA Fortran compiler added support for a Managed Memory attribute in 2014. In 2015, we implemented a beta feature that allowed you to use CUDA Unified Memory seamlessly in OpenACC programs. Effective with the PGI 17.7 release, it is no longer a beta feature. This article describes how to use the feature, what behavior to expect, and some problems and pitfalls you might encounter along the way. Internally, we have found this feature to be a great help when initially porting a program to NVIDIA GPUs with OpenACC, by allowing us to focus first on porting the compute regions and then on tuning the data movement afterwards. It's not perfect, and it's not robust enough for production use in most cases, but it's helpful for initial porting. The rest of this article assumes you have some familiarity with OpenACC and NVIDIA GPUs.

Using CUDA Unified Memory

The command line option to enable CUDA Unified Memory with the PGI OpenACC compiler is ‑ta=tesla:managed. This should be specified when compiling and linking your program. This option changes the way dynamic memory is allocated. In particular, for C programs, it changes the stdlib.h header file so that calls to malloc, realloc, calloc and free are changed (using #define) to new routines that allocate using CUDA Unified Memory. Your C source program must include stdlib.h for this to work. For C++ programs, it changes the cstdlib header file in the same way. Also, for C++ programs, it changes new and delete to allocate using CUDA Unified Memory. For Fortran programs, allocatable arrays are changed to include the CUDA Fortran managed attribute, so that allocating these arrays will use CUDA Unified Memory.
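
To make this concrete, here is a rough sketch of what an ordinary malloc/free pair effectively becomes when a file is built with ‑ta=tesla:managed. This is only a conceptual illustration; the actual redirection is performed by the PGI headers and runtime (which ultimately allocate through cudaMallocManaged(), as noted later in this article), not by code you write yourself.

    #include <stdlib.h>
    #include <cuda_runtime.h>
    
    /* Conceptual stand-in for malloc(sizeof(float)*n) under -ta=tesla:managed:
       the allocation comes from CUDA Unified Memory, so the same pointer is
       valid on both the CPU and the GPU. */
    float*
    alloc_managed( size_t n ){
        float *a = NULL;
        cudaMallocManaged( (void**)&a, sizeof(float)*n, cudaMemAttachGlobal );
        return a;
    }
    
    /* Conceptual stand-in for free(a) on a managed allocation. */
    void
    free_managed( float *a ){
        cudaFree( a );
    }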

The ‑ta=tesla:managed option must also be used at link time. This causes the OpenACC runtime to test at run time whether data was allocated in CUDA Unified Memory. If so, the OpenACC runtime does nothing, letting the CUDA driver manage data movement between the system and device memories. Otherwise, the OpenACC runtime proceeds normally. This allows mixing driver-managed CUDA Unified Memory and OpenACC runtime-managed data in the same program, which can be very handy.
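
For example, in the hypothetical routine below (built with ‑ta=tesla:managed), the malloc'd array a lives in CUDA Unified Memory and is left to the CUDA driver, while the static array coef is not managed, so the OpenACC runtime still copies it to the device; both kinds of data can appear in the same compute region.

    #define N 1000
    static float coef[N];   /* static data: handled by the OpenACC runtime */
    
    void
    scale( float* a, int n ){
        /* 'a' was malloc'd, so it is in CUDA Unified Memory and the OpenACC
           runtime leaves it to the CUDA driver; 'coef' is copied by the
           OpenACC runtime via the copyin clause */
        #pragma acc parallel loop copyin(coef)
        for( int i = 0; i < n; ++i ) a[i] *= coef[i%N];
    }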

Expected Behavior

Let's take a very simple test program and see the difference in behavior.

    #include <stdio.h>
    #include <stdlib.h>
    
    int
    main( int argc, char* argv[] ){
        float *a, *b;
        int n, i;
        n = atoi( argv[1] );
        /* allocate */
        a = (float*)malloc( sizeof(float)*n );
        b = (float*)malloc( sizeof(float)*n );
    
        /* initialize */
        for( i = 0; i < n; ++i ){ a[i] = i; b[i] = 2*i; }
    
        /* run a parallel loop */
        #pragma acc parallel loop
        for( i = 0; i < n; ++i ) a[i] *= b[i];
    
        /* print partial results */
        printf( "%f %f %f\n", a[0], a[1], a[2] );
        printf( "%f %f %f\n", a[n-3], a[n-2], a[n-1] );
        return 0;
    }

First, compile this with OpenACC, using pgcc ‑acc. Run the binary after setting the environment variable PGI_ACC_NOTIFY to 3; this will print out a runtime diagnostic after each data movement and each GPU kernel launch. If you run this program with an argument of 1000, you will see output like:

upload CUDA data  file=bb.c function=main line=17 device=0 variable=a  bytes=4000
upload CUDA data  file=bb.c function=main line=17 device=0 variable=b  bytes=4000
launch CUDA kernel  file=bb.c function=main line=17 device=0 num_gangs=4 
  num_workers=1 vector_length=256 grid=4 block=256
download CUDA data  file=bb.c function=main line=19 device=0 variable=a  bytes=4000
0.000000 3.000000 6.000000
2991.000000 2994.000000 2997.000000

This shows that the compiler implicitly added a copy clause for the array a and a copyin clause for the array b. If you recompile with this new feature enabled, using pgcc ‑acc ‑ta=tesla:managed, and then run the program, you will see the kernel launch line and the program output, but no data movement:

launch CUDA kernel  file=/home/mwolfe/test2/15.02.man2/bb.c function=main line=17 
  device=0 num_gangs=4 num_workers=1 vector_length=256 grid=4 block=256
0.000000 3.000000 6.000000
2991.000000 2994.000000 2997.000000

You will see the same behavior even if you add explicit data clauses for the arrays a and b. Because the arrays are allocated in CUDA Unified Memory, the OpenACC runtime lets the CUDA driver manage data movement, and since that movement is not handled by the OpenACC runtime, it is not reported.
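
For instance, writing the compute region of the test program above with those clauses made explicit produces exactly the same result; the runtime recognizes that a and b are in Unified Memory and performs no copies for them:

    /* the parallel loop from the example above, with explicit data clauses;
       with -ta=tesla:managed the OpenACC runtime sees that a and b are in
       CUDA Unified Memory and does no data movement for them */
    #pragma acc parallel loop copy(a[0:n]) copyin(b[0:n])
    for( i = 0; i < n; ++i ) a[i] *= b[i];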

The Details

It looks simple, just adding a command line option, and in many cases it really is that simple. However, it's not perfect, and it's not ready for production use in most cases. There are some details of the implementation that you should be aware of and some problems that you may run into. Some of these problems will be fixed in future PGI releases as we improve this feature, but some will have to wait for future NVIDIA GPU architectures, as described earlier.

  • Use of managed memory applies only to dynamically allocated data. Static data (C static and extern variables; Fortran module, common block and save variables) and function local data are still handled by the OpenACC runtime. Dynamically allocated Fortran local variables and Fortran allocatable arrays are implicitly managed, but Fortran array pointers are not.
  • Given an allocatable aggregate with a member that points to local, global or static data, compiling with ‑ta=tesla:managed and attempting to access memory through that pointer from a compute kernel will cause a failure at runtime (see the sketch following this list).
  • C++ virtual functions are not supported.
  • The ‑ta=tesla:managed compiler option must be used to compile the files in which variables are allocated, even if there is no OpenACC code in the file.
  • The CUDA Unified Memory manager can only manage data that is allocated using the cudaMallocManaged() routine.
  • When the program allocates managed memory, it allocates host pinned memory as well as device memory thus making allocate and free operations somewhat more expensive and data transfers somewhat faster. A memory pool allocator is used to mitigate the overhead of the allocate and free operations. The pool allocator is enabled by default for ‑ta=tesla:managed or ‑ta=tesla:pinned. In the PGI 17.7 release, the presence of ‑Mcuda disables the pool allocator; we are working on lifting that restriction in an upcoming release.
  • Data movement of managed data is controlled by the NVIDIA CUDA GPU driver; whenever data is accessed on the CPU or the GPU, it could trigger a data transfer if the last time it was accessed was not on the same device.
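
The second item in the list above deserves a concrete illustration. In the hypothetical C fragment below, the descriptor d is malloc'd and therefore lives in CUDA Unified Memory, but its data member points at a static array, which does not; dereferencing that member inside the compute region is what fails at runtime.

    #include <stdlib.h>
    
    static float table[1000];                  /* static data: not managed */
    
    typedef struct { float* data; int n; } vec;
    
    void
    fails_at_runtime( void ){
        vec* d = (vec*)malloc( sizeof(vec) );  /* managed allocation */
        d->data = table;                       /* points at unmanaged data */
        d->n = 1000;
        #pragma acc parallel loop
        for( int i = 0; i < d->n; ++i )
            d->data[i] = 0.0f;                 /* access fails on the device */
    }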

This feature has the following additional limitations when used with NVIDIA Kepler GPUs:

  • Data motion on Kepler GPUs is achieved through fast pinned asynchronous data transfers; from the program’s perspective, however, the transfers are synchronous.
  • Kepler GPUs disallow the simultaneous access by host and device of shared data; if this situation occurs, the program will halt with a segmentation fault.
  • The total amount of managed memory is limited to the amount of available device memory on Kepler GPUs.

This feature is not supported on NVIDIA Fermi GPUs.

Summary

One of the big challenges in using accelerator devices today is data management. It would really simplify life for a programmer if the system hardware and system software could automatically manage the device memory as if it were a cache, or similar to virtual memory. Such support will come in future GPUs, but that raises the question of whether the performance of automatic data management will be satisfactory, or whether user-managed data movement will always be required. Using CUDA Unified Memory today gives a first hint at an answer, and it's quite promising.

PGI 17.7 provides an enhanced version of Unified Memory support for 64-bit Linux x86-64 and Linux/OpenPOWER. With the ‑ta=tesla:managed option, dynamic memory is allocated in CUDA Unified Memory and managed by the CUDA driver. Data in Unified Memory is automatically moved to device memory at kernel launch, and back to the host when needed. There are some limitations to this method, since it only applies to dynamically allocated data. Note also that runtime failures on Kepler GPUs are possible because there is no hardware support for detecting missing data on the device. We encourage you to try it for initial porting and experimentation.

As an example, we took the 15 OpenACC Spec Accel V1.2 benchmarks and compiled and ran them with and without the ‑ta=tesla:managed option. Three of the benchmarks (cg, csp, bt) only use static data, so the managed suboption did not apply. For the other twelve, the performance was very promising. The numbers below are SPEC estimates (not run in the SPEC harness) using the ref dataset and measuring execution times in seconds on an IBM Power8 CPU with an NVIDIA Tesla P100 (Pascal) GPU connected via NVLINK.

Benchmark    Unmanaged (sec)   Managed (sec)   Overhead
ostencil          19.0             19.3           1.6%
olbm              40.7             42.1           3.4%
omriq            108.0            112.0           3.7%
md                18.8             19.1           1.3%
palm             116.0            121.0           4.6%
ep                62.5             63.6           1.8%
clvrleaf          54.7             56.4           3.1%
seismic           44.2             45.1           2.0%
sp                33.6             33.9           0.9%
miniGhost         52.8             53.1           0.6%
ilbdc             41.2             40.3          -2.7%
swim              39.4             50.1          27.2%

One of these benchmarks (ilbdc) even shows a small speedup using managed memory, but that is largely a side effect of using pinned host memory for data transfers. One benchmark (swim) had a pretty serious slowdown, but still nowhere near a factor of two. In ten of these twelve benchmarks, all of the data movement was managed using CUDA Unified Memory. These results would indicate that locality works; when most of the compute operations are being done on the GPU, the data can stay resident on the GPU and data movement will not be the limiting factor in performance. For the cases where we see performance degradations, future hardware support for data movement will help, because not all of the managed data will have to be moved before each kernel launch. However, we expect that there will always be cases where a user will want to control data movement, for instance to prefetch data to the right memory and overlap that with other computation.
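
As a generic OpenACC sketch of that kind of user control (not code from any of the benchmarks above), a program can start moving data to the device early, overlap the transfer with other host work, and synchronize only when the data is actually needed:

    void
    prefetch_example( float* a, float* b, int n ){
        /* begin transferring b to the device, asynchronously */
        #pragma acc enter data copyin(b[0:n]) async(1)
    
        /* ... other host work can proceed while b is in flight ... */
    
        #pragma acc data copy(a[0:n])
        {
            /* wait for the prefetch of b, then run the kernel */
            #pragma acc wait(1)
            #pragma acc parallel loop present(b[0:n])
            for( int i = 0; i < n; ++i ) a[i] += b[i];
        }
    
        /* release the device copy of b */
        #pragma acc exit data delete(b[0:n])
    }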
