Technical News from The Portland Group

First Look: PGI CUDA C/C++ for x86

In an earlier article, PGI introduced plans to develop an implementation of NVIDIA's CUDA C language for multi-core x86 processors (CUDA-x86). Here we describe the initial release, including how to use it, its limitations and differences from NVIDIA's CUDA C compiler, and our plans for the rest of the year. This release targets the CUDA 3.2 Runtime API.

Using PGI CUDA-x86

PGI has integrated the CUDA extensions into the PGC++® compiler. You can compile CUDA programs for the x86 processor by giving source files the .cu suffix and using the normal PGC++ compiler options:

pgcpp -c -O file.cu

Alternatively, you can enable the CUDA extensions in any C or C++ source file with the -Mcudax86 flag:

pgcpp -c -O -Mcudax86 file.cc

If you link in a separate step, you will need to add the -Mcudax86 flag so that all the appropriate CUDA libraries are linked in.
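For example, a separate compile-and-link sequence might look like the following (the file and program names are illustrative):

pgcpp -c -O file.cu
pgcpp -Mcudax86 -o myprog file.o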

The compiler builds an x86 object or executable file as it normally would. Functions with the __global__ or __device__ attribute are compiled in a special mode that enables the CUDA built-in variables (threadIdx, blockIdx, and so on). Many of the restrictions on kernel and device functions, such as limits on I/O, are removed on the x86.

Most of the CUDA Runtime API routines have been implemented to work on the x86 host. The Memory Management routines (cudaMalloc, cudaFree, cudaMemcpy, and so on) will allocate and copy data from host to host. Because the kernels execute on the same host as the main program, you can in fact simply pass host pointers to the kernels, without allocating and copying data to the device memory. We expect that CUDA programmers will develop coding strategies that allow conditional compilation of data movement calls, so data movement occurs only when targeting NVIDIA CUDA devices. Such a coding strategy will help maximize performance of multi-core x86 as a CUDA compute device by minimizing or eliminating spurious data movement.

Some of the routines have no corresponding behavior on the x86 host. For instance, cudaThreadSetCacheConfig is used to set the cache configuration for an NVIDIA Fermi-class GPU; there is no such control for an x86 host, so this call will have no effect with CUDA-x86. All operations in CUDA-x86 are synchronous, including kernel launches. The stream management API routines are supported, but managing multiple streams will produce no performance benefit.

The pgcpp compiler defines the preprocessor variable __PGI_CUDA_X86 when compiling with the CUDA-x86 extensions enabled, so you can write your program with GPU-specific and x86-specific behavior. The PGDBG® debugger has been enhanced to support CUDA-x86 programs as well; see the companion article in this PGInsider issue about debugging CUDA-x86 with PGDBG. CUDA-x86 is supported on all 32-bit and 64-bit x86 platforms that PGI compilers currently support, including Linux, Mac OS X and Windows.
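Using __PGI_CUDA_X86, the conditional data-movement strategy described above might be coded as follows. This is only a sketch; the variable names (h_a, d_a, nbytes) and the kernel name mykernel are hypothetical:

#ifdef __PGI_CUDA_X86
    /* kernels run on the host, so simply pass the host pointer */
    float* d_a = h_a;
#else
    /* on an NVIDIA device, allocate device memory and copy the data over */
    float* d_a;
    cudaMalloc( (void**)&d_a, nbytes );
    cudaMemcpy( d_a, h_a, nbytes, cudaMemcpyHostToDevice );
#endif
    mykernel<<< grid, block >>>( d_a, n );  /* same launch either way */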

CUDA-x86 Language Limitations

The PGC++ compiler is delivered with a modified NVIDIA CUDA SDK, a set of code samples that have been validated with CUDA-x86. The modifications are related to the limitations of the current implementation, which we discuss here.

Driver API: The current implementation supports only the CUDA Runtime API; this includes support for the popular chevron (<<<...>>>) launch syntax and the cuda... API routines. Support for the lower-level Driver API is being explored.

Textures and Surfaces: These are mostly used on the GPU to take advantage of the 2-dimensional and 3-dimensional texture memory cache. Support for textures and surfaces is still being implemented, and will be included in a future release. They will have no particular performance advantage on an x86 processor, however.

CUDA Arrays: These are specially allocated data constructs used to support textures and surfaces; they will be supported at the same time as textures and surfaces.

OpenGL and DirectX Interoperability: The benefits of OpenGL and DirectX interoperability are clear on a GPU, where the CUDA code and the graphics share the same memory and hardware. The current CUDA-x86 implementation does not include these API routines, and future support is under discussion.

Thrust: The Thrust library has not been ported to CUDA-x86.

Warp-synchronous Programming: When two threads in the same thread block need to communicate through shared memory, you generally need to include a __syncthreads() call between the assignment to shared memory by one thread and the use of that memory by the other thread.

A simple example would be a loop to sum the values of an array. In this kernel, each thread computes a local sum in the first loop, iterating through the array in blocks of size N, which is the number of threads in the thread block. Each thread then stores its local sum into the shared memory array S. The second loop does a binary reduction from N partial sums down to one total sum.

#define N 128
__global__ void sumit( float* a, int m ){
    int i, n;
    float localsum = 0.0f;
    __shared__ float S[N];
    for( i = threadIdx.x; i < m; i += N ) localsum += a[i];
    i = threadIdx.x;
    S[i] = localsum;
    __syncthreads();  /* all threads done with localsum */
    n = N;
    while( (n >>= 1) > 0 ){
        /* add S[0:n-1] += S[n:2*n-1] */
        if( i < n ) S[i] += S[i+n];
        __syncthreads();        /* all partial sums done */
    }
    /* here, S[0] has the final sum */
}

If all of the threads that still need to synchronize belong to a single warp, they execute together in lockstep, so the __syncthreads() call can be dropped for those final reduction steps. For a GPU, the warp size is 32, so some programmers would optimize the previous kernel as follows:

#define N 128
__global__ void sumit( float* a, int m ){
    int i, n;
    float localsum = 0.0f;
    __shared__ float S[N];
    for( i = threadIdx.x; i < m; i += N ) localsum += a[i];
    i = threadIdx.x;
    S[i] = localsum;
    __syncthreads();  /* all threads done with localsum */
    n = N;
    while( (n >>= 1) > 32 ){
        /* add S[0:n-1] += S[n:2*n-1] */
        if( i < n ) S[i] += S[i+n];
        __syncthreads();        /* all partial sums done */
    }
    while( n > 0 ){
        /* add S[0:n-1] += S[n:2*n-1] */
        if( i < n ) S[i] += S[i+n];
        /* the last 32 partial sums are all handled by one warp,
         * so we don't need to synchronize */
        n >>= 1;
    }
    /* here, S[0] has the final sum */
}

CUDA-x86 doesn't implement threads in warps, so warp-synchronous programming will not work. In CUDA-x86, the value of the built-in warpSize variable, and the warp size returned by cudaGetDeviceProperties, is 1.

If you use warp-synchronous programming, test against the value of warpSize instead of the literal constant 32.
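For example, the optimized reduction above can be written in terms of warpSize; with warpSize equal to 1 on CUDA-x86, only the final single-thread step executes without a __syncthreads(). This is one way to structure it (a sketch following the kernels above, not the only formulation):

#define N 128
__global__ void sumit( float* a, int m ){
    int i, n;
    float localsum = 0.0f;
    __shared__ float S[N];
    for( i = threadIdx.x; i < m; i += N ) localsum += a[i];
    i = threadIdx.x;
    S[i] = localsum;
    __syncthreads();  /* all threads done with localsum */
    n = N;
    while( (n >>= 1) > warpSize ){
        if( i < n ) S[i] += S[i+n];
        __syncthreads();        /* all partial sums done */
    }
    while( n > 0 ){
        /* the remaining partial sums fit within a single warp */
        if( i < n ) S[i] += S[i+n];
        n >>= 1;
    }
    /* here, S[0] has the final sum */
}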

CUDA-x86 Performance

This release of the compiler uses what we call emulation mode. As with the PGI CUDA Fortran emulation mode, the compiler creates a task for each CUDA thread and one host worker thread for each core on your system. The worker threads time-share to execute all the tasks (CUDA threads) in a single thread block: a worker thread executes one task until a synchronization point (__syncthreads) is reached, then switches to a different task, and so on until all tasks are complete. The thread blocks are executed one at a time; multi-core parallelism is used to execute multiple threads within a block.
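To make that scheduling model concrete, here is a deliberately simplified, single-worker sketch (plain C with POSIX ucontext) of how CUDA threads can be run as user-level tasks that yield at a synchronization point. This is a conceptual illustration only, not PGI's implementation, which uses one worker thread per core; all names in it are hypothetical.

/* Simplified, single-worker illustration: each CUDA thread of one block
 * becomes a user-level task; the worker runs each task until it reaches
 * a sync point, then switches to the next task. */
#include <ucontext.h>
#include <stdio.h>
#include <stdlib.h>

#define NTASKS  4                 /* "CUDA threads" in the block */
#define STACKSZ (64*1024)

static ucontext_t worker_ctx, task_ctx[NTASKS];
static int current;               /* index of the task now running */
static int shared_val[NTASKS];    /* stands in for __shared__ memory */

static void emu_syncthreads(void) /* yield back to the worker */
{
    swapcontext(&task_ctx[current], &worker_ctx);
}

static void kernel_task(void)     /* the "kernel body" for one CUDA thread */
{
    int tid = current;            /* plays the role of threadIdx.x */
    shared_val[tid] = tid * 10;
    emu_syncthreads();            /* every task stops here before reading */
    printf("task %d sees neighbor value %d\n",
           tid, shared_val[(tid + 1) % NTASKS]);
}

int main(void)
{
    int t, pass;
    for (t = 0; t < NTASKS; ++t) {
        getcontext(&task_ctx[t]);
        task_ctx[t].uc_stack.ss_sp   = malloc(STACKSZ);
        task_ctx[t].uc_stack.ss_size = STACKSZ;
        task_ctx[t].uc_link          = &worker_ctx;  /* return here when done */
        makecontext(&task_ctx[t], kernel_task, 0);
    }
    /* each pass resumes every task up to its next sync point (or completion);
     * running the passes in order is what makes the yield act like a barrier */
    for (pass = 0; pass < 2; ++pass)
        for (t = 0; t < NTASKS; ++t) {
            current = t;
            swapcontext(&worker_ctx, &task_ctx[t]);
        }
    return 0;
}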

This is a functionality-oriented release, not a performance-oriented release. While there is ample parallelism within a thread block to keep a multi-core busy, the way you write a program for a GPU is generally exactly wrong for a program executing this way on a multi-core. To get coalesced memory loads and stores on the GPU, you organize your data so that adjacent threads in the thread block (consecutive values of threadIdx.x) access adjacent locations in device memory. With this CUDA-x86 release, that means different cores access adjacent locations in memory, so they repeatedly touch the same cache lines and suffer numerous cache conflicts.
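To illustrate the mismatch (the kernel names and the chunking scheme below are ours, not part of the release), compare a GPU-style coalesced access pattern with a layout that keeps each thread, and therefore each core, in its own contiguous region of memory:

/* GPU-friendly: consecutive threads touch consecutive elements, which
 * coalesces on a GPU but makes different x86 cores share cache lines */
__global__ void scale_coalesced( float* a, float s, int m ){
    int stride = gridDim.x * blockDim.x;
    for( int i = blockIdx.x*blockDim.x + threadIdx.x; i < m; i += stride )
        a[i] *= s;
}

/* Cache-friendly on a multi-core: each thread walks its own contiguous
 * chunk, so a core stays within its own cache lines (uncoalesced on a GPU) */
__global__ void scale_chunked( float* a, float s, int m ){
    int nthreads = gridDim.x * blockDim.x;
    int tid = blockIdx.x*blockDim.x + threadIdx.x;
    int chunk = (m + nthreads - 1) / nthreads;
    int lo = tid * chunk;
    int hi = lo + chunk < m ? lo + chunk : m;
    for( int i = lo; i < hi; ++i )
        a[i] *= s;
}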

In emulation mode, a CUDA program will run significantly slower than the same program on an NVIDIA GPU. Later this year, we will release an optimized implementation that changes the parallelism model and uses vectorization to execute multiple threads in parallel on a single core. We expect performance to improve significantly, though it will likely still be slower than a GPU by some factor; this is not surprising, given the massive parallelism available on a high-end GPU. Another performance metric is to compare a CUDA program to a native OpenMP implementation of the same algorithm. There will always be extra overhead for the CUDA program, dealing with memory management and rectangular thread blocks and grids, but our goal is to approach the performance of native OpenMP parallel code using CUDA.
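For reference, a native OpenMP version of the array sum computed by the sumit kernel earlier in this article might look like the following (our sketch, shown only to make the comparison concrete):

#include <omp.h>

float sum_omp( const float* a, int m ){
    float total = 0.0f;
    /* each thread accumulates a private partial sum; OpenMP combines them */
    #pragma omp parallel for reduction(+:total)
    for( int i = 0; i < m; ++i )
        total += a[i];
    return total;
}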

Summary

PGI has released the first native CUDA C/C++ implementation for multi-core x86 processors. This first release implements the CUDA Runtime API, using multi-core parallelism across the threads within a thread block. Some missing features, such as textures, will be added in the coming months, and a highly optimizing code generator is also under development.