Technical News from The Portland Group

Parallel Random Number Generation Using OpenMP, OpenCL and PGI Accelerator Directives

In the article Tuning a Monte Carlo Algorithm on GPUs, Mat Colgrove explored an implementation of Monte Carlo Integration on NVIDIA GPUs using PGI Accelerator directives and CUDA Fortran. The article showed how the required random number generator could be accelerated in the CUDA Fortran version by calling the CUDA C Mersenne Twister random number generator included in the NVIDIA CUDA SDK. The result was a speed-up of the random number generation by a factor of 23 over the serial version running on a single host core.

To further explore this topic, we created OpenMP, OpenCL and PGI Accelerator directive-based versions of the Mersenne Twister algorithm, all derived from the source code available in the NVIDIA SDKs. The OpenMP version allows for a direct performance comparison of random number generation using all available host cores versus the CUDA C version on a GPU, which is of course necessary to get a true understanding of the potential speed-up using the GPU. The OpenCL version allows us to compare performance of OpenCL versus CUDA C. The directive-based version allows us to compare the performance of the PGI Accelerator programming model versus hand-coded CUDA C and OpenCL.

OpenMP Implementation of Mersenne Twister

Implementing the Mersenne Twister algorithm in OpenMP was very straightforward. The original serial version looks as follows:

void initSamples_MT_CPU(float *samples)
{
    unsigned int seed;
    unsigned int nPerRng = N/MT_RNG_COUNT;      // # of recurrence steps

    /* create and populate MT parameters array with initial pre-computed states*/
    initMTRefCPU("data/MersenneTwister.raw");

    /* initialize the random number generator for the seed */
    srandom((unsigned int) (time(NULL)));
    double num = (random() / (float) RAND_MAX) * 4194304;
    seed = (unsigned int) trunc(num);

    /* populate samples */
    RandomRef(samples, nPerRng, seed);
}

void RandomRef(float *h_Rand, unsigned int NPerRng, unsigned int seed)
{       
    int iRng, iOut; 
        
    for(iRng = 0; iRng < MT_RNG_COUNT; iRng++){
        MT[iRng].state = state;
        sgenrand_mt(seed, &MT[iRng]);
        
        for(iOut = 0; iOut < NPerRng; iOut++)
           h_Rand[iRng * NPerRng + iOut] = 
           ((float)genrand_mt(&MT[iRng]) + 1.0f) / 4294967296.0f;
    }   
}       

According to the construction of the Mersenne Twister algorithm, the outer for loop in RandomRef() is inherently parallel. However, the pointer assignment of state to MT[iRng].state in each iteration of the outer loop is a barrier to correct SMP parallelization. If the loop is parallelized with an OpenMP pragma in its current form, the OpenMP threads will overwrite a common work area without the necessary synchronization while performing calculations involving MT[iRng].state in the functions sgenrand_mt() and genrand_mt().
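To make the hazard concrete, here is what the naive parallelization would look like. This sketch is ours, and it is deliberately incorrect: every thread points MT[iRng].state at the same shared work area, so the calls to sgenrand_mt() and genrand_mt() race against each other.

/* INCORRECT: naive parallelization; all threads share one state array */
#pragma omp parallel for private(iRng, iOut)
    for(iRng = 0; iRng < MT_RNG_COUNT; iRng++){
        MT[iRng].state = state;       /* shared work area: data race */
        sgenrand_mt(seed, &MT[iRng]);

        for(iOut = 0; iOut < NPerRng; iOut++)
           h_Rand[iRng * NPerRng + iOut] = 
           ((float)genrand_mt(&MT[iRng]) + 1.0f) / 4294967296.0f;
    }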

Rather than insert synchronization, which would be very expensive and inhibit scaling, our solution was to create a separate state work area for each OpenMP thread. We implemented this by changing the declaration of state from

static uint32_t         state[MT_NN];

to

static uint32_t         state[32][MT_NN];

where the first dimension provides separate work areas for up to 32 OpenMP threads, and the second dimension defines the work area for each thread. We then re-wrote the initSamples_MT_CPU() and RandomRef() functions to enable parallel execution on a multi-core CPU using OpenMP:

void initSamples_MT_CPU(float *samples, const int num_openmp_threads)
/* initialize samples on CPU with random values by using Mersenne Twister OpenMP
 * implementation with a number of threads defined by "num_openmp_threads".
 * If num_openmp_threads is 0, the serial version RandomRef() is used. */
{
    unsigned int seed;
    unsigned int nPerRng = N/MT_RNG_COUNT;      // # of recurrence steps

    /* create and populate MT parameters array with initial pre-computed states*/
    initMTRefCPU("data/MersenneTwister.raw");

    /* initialize the random number generator for the seed */
    srandom((unsigned int) (time(NULL)));
    double num = (random() / (float) RAND_MAX) * 4194304;
    seed = (unsigned int) trunc(num);

    /* populate samples */
    if(num_openmp_threads != 0) {
        omp_set_num_threads(num_openmp_threads);
        RandomRef_OMP(samples, nPerRng, seed);
    } else {
        RandomRef(samples, nPerRng, seed);
    }
}

void RandomRef_OMP(float *h_Rand, unsigned int NPerRng, unsigned int seed)
{
   int iRng, iOut;
   int my_thread_id;

#pragma omp parallel private(my_thread_id)
{
   my_thread_id = omp_get_thread_num();

#pragma omp for private(iRng,iOut)
   for(iRng = 0; iRng < MT_RNG_COUNT; iRng++) {
      MT[iRng].state = &state[my_thread_id][0];
      sgenrand_mt(seed, &MT[iRng]);

      for(iOut = 0; iOut < NPerRng; iOut++) {
         h_Rand[iRng * NPerRng + iOut] = 
         ((float)genrand_mt(&MT[iRng]) + 1.0f) / 4294967296.0f;
      }
   }
}
}
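
For reference, a caller might drive the OpenMP version as follows. This is a minimal sketch of ours, not the actual test harness, which also times each variant:

/* hypothetical driver: fill 'samples' with N random floats using 4 threads;
   passing 0 selects the serial RandomRef() path instead */
float *samples = (float *) malloc(N * sizeof(float));
initSamples_MT_CPU(samples, 4);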

While this version of the Mersenne Twister parallelized as expected, scaling on our 2.67 GHz Nehalem was relatively poor:

% make mersenneTwister
...
mersenneTwisterCPU.c:
initMTRefCPU:
     51, Loop not vectorized/parallelized: contains call
RandomRef:
     71, Loop not vectorized/parallelized: contains call
RandomRef_OMP:
     83, Parallel region activated
     87, Parallel loop activated with static block schedule
     91, Loop not vectorized/parallelized: contains call
     98, Barrier
     99, Parallel region terminated
...
% mersenneTwister
Computing over 134217728 samples

 --- CPU Mersenne Twister ---
exec time  = 1.306561 s
throughput = 410.903825 [MB/s]

 --- CPU Mersenne Twister with OpenMP (1 threads) ---
exec time  = 1.328847 s
throughput = 404.012585 [MB/s]

 --- CPU Mersenne Twister with OpenMP (2 threads) ---
exec time  = 1.467748 s
throughput = 365.778670 [MB/s]

 --- CPU Mersenne Twister with OpenMP (3 threads) ---
exec time  = 1.072229 s
throughput = 500.705458 [MB/s]

 --- CPU Mersenne Twister with OpenMP (4 threads) ---
exec time  = 0.909803 s
throughput = 590.095781 [MB/s]

After some experimentation and profiling using PGPROF, we determined that the lack of scaling was likely caused by false sharing of cache lines during accesses to state: each thread's MT_NN-element work area is small enough that rows belonging to different threads occupy the same cache line, so a write by one thread invalidates that line in the other threads' caches even though no data is actually shared. Changing the declaration of state to:

static uint32_t         state[32][MT_NN*16];

to ensure the work area used by each thread spans a whole number of cache lines (16 four-byte elements fill exactly one 64-byte line, so each thread's row is a multiple of the line size) was all it took to improve scaling dramatically, to 3.72x using 4 cores of a 2.67 GHz Nehalem versus the true serial version running on 1 core:

...
% mersenneTwister
Computing over 134217728 samples

 --- CPU Mersenne Twister ---
exec time  = 1.304117 s
throughput = 411.673885 [MB/s]

 --- CPU Mersenne Twister with OpenMP (1 threads) ---
exec time  = 1.327572 s
throughput = 404.400599 [MB/s]

 --- CPU Mersenne Twister with OpenMP (2 threads) ---
exec time  = 0.715547 s
throughput = 750.294407 [MB/s]

 --- CPU Mersenne Twister with OpenMP (3 threads) ---
exec time  = 0.464550 s
throughput = 1155.679501 [MB/s]

 --- CPU Mersenne Twister with OpenMP (4 threads) ---
exec time  = 0.349923 s
throughput = 1534.254427 [MB/s]

So, as expected, we are able to see near-linear scaling of the Mersenne Twister algorithm on a multi-core x64 CPU.
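
As an aside, the padding factor need not be hard-coded. The sketch below is ours, not from the original sources; it assumes a 64-byte cache line, and the MT_NN value shown is a placeholder for the definition in the SDK headers. It rounds each thread's row up to the next whole cache line, a tighter alternative to the MT_NN*16 over-padding used above.

#include <stdint.h>

#define MT_NN        19   /* placeholder; use the SDK's definition */
#define MAX_THREADS  32
#define CACHE_LINE   64   /* bytes; the line size on Nehalem-class CPUs */

/* words per cache line, and the per-thread row length rounded up to a
   whole number of lines so no two threads' work areas share a line */
#define LINE_WORDS   (CACHE_LINE / sizeof(uint32_t))
#define PADDED_NN    (((MT_NN + LINE_WORDS - 1) / LINE_WORDS) * LINE_WORDS)

static uint32_t state[MAX_THREADS][PADDED_NN];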

Initial PGI Accelerator Directive-based Implementation of Mersenne Twister

To create a GPU-enabled version of the Mersenne Twister algorithm using PGI Accelerator directives, we also started with the original serial algorithm listed above. Currently the PGI Accelerator C99 compiler is unable to offload accelerator kernels that include function calls, in our case the calls to sgenrand_mt() and genrand_mt(). So, the first step in creating the directive-based GPU version was to manually inline these functions into RandomRef(), creating RandomRef_ACC():

void RandomRef_ACC(float *h_Rand, unsigned int NPerRng, unsigned int seed)
{
   int iRng, iOut;
#pragma acc region
{
   for(iRng = 0; iRng < MT_RNG_COUNT; iRng++) {
      MT[iRng].state = state;
#ifdef NOINLINE
      sgenrand_mt(seed, &MT[iRng]);
#else
      mt_struct mts = MT[iRng];
      mts.state[0] = seed & mts.wmask;
      for(int i = 1; i < mts.nn; i++){
         mts.state[i] = ( UINT32_C(1812433253) * 
                          (mts.state[i - 1] ^ 
                          (mts.state[i - 1] >> 30)) + i ) & mts.wmask;
      }
      mts.i = mts.nn;
#endif

      for(iOut = 0; iOut < NPerRng; iOut++) {
#ifdef NOINLINE
         h_Rand[iRng * NPerRng + iOut] = 
         ((float)genrand_mt(&MT[iRng]) + 1.0f) / 4294967296.0f;
#else
         uint32_t *st, uuu, lll, aa, x;
         int k,n,m,lim;
         if(mts.i >= mts.nn ){
            n = mts.nn; m = mts.mm;
            aa = mts.aaa;
            st = mts.state;
            uuu = mts.umask; lll = mts.lmask;
            lim = n - m;
            for(k = 0; k < lim; k++){
               x = (st[k]&uuu)|(st[k+1]&lll);
               st[k] = st[k + m] ^ (x >> 1) ^ (x&1U ? aa : 0U);
            }
            lim = n - 1;
            for(; k < lim; k++){
               x = (st[k] & uuu)|(st[k + 1] & lll);
               st[k] = st[k + m - n] ^ (x >> 1) ^ (x & 1U ? aa : 0U);
            }
            x = (st[n - 1] & uuu)|(st[0] & lll);
            st[n - 1] = st[m - 1] ^ (x >> 1) ^ (x&1U ? aa : 0U);
            mts.i=0;
         }
         x = mts.state[mts.i];
         mts.i += 1;
         x ^= x >> mts.shift0;
         x ^= (x << mts.shiftB) & mts.maskB;
         x ^= (x << mts.shiftC) & mts.maskC;
         x ^= x >> mts.shift1;
         h_Rand[iRng * NPerRng + iOut] = (x + 1.0f) / 4294967296.0f;
#endif
      }
   }
}
}

This code works if run in serial mode, as expected. However, it has the same problem as the OpenMP version regarding shared use of state, and will not parallelize correctly. Nonetheless, knowing this is easily solved, we compiled the loop with the PGI Accelerator 10.4 C99 compiler just to see if it could successfully generate a GPU kernel. In fact, the compiler was still unable to generate a GPU kernel, but it did generate a number of useful CCFF compiler feedback messages. Among them:

    123, Accelerator region ignored

This tells us the compiler declined to generate any GPU kernels for the code in the accelerator region, and instead chose to generate host code. Looking at other messages generated by the loop:

    125, Accelerator restriction: struct/union/derived type variables not 
                                  yet supported: mts.state
         Accelerator restriction: size of the GPU copy of 'h_Rand' is unknown
         Accelerator restriction: size of the GPU copy of an array depends on 
                                  values computed in this loop
         Accelerator restriction: one or more arrays have unknown size

The first message indicates that the compiler can't handle an accelerator region that operates on structs. In fact, while basic struct support is enabled in PGI 10.4, there is still a limitation on structs with array components. The second message indicates that the compiler can't determine the size of the array h_Rand, and so does not know how much memory to allocate for a copy of it on the GPU. The third message indicates that an array (not identified, unfortunately) used in the loop might require varying amounts of GPU memory depending on values computed in the loop.

Working around the struct issue requires a significant rewrite: removing the state component from the mt_struct type (and hence from the MT array) and creating standalone arrays. The unknown array size issues can be easily overcome by adding a clause to the accelerator region pragma that specifies the size of the arrays.
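
For example, a clause of the following form, using the [lower:upper] inclusive range notation that appears in the final code below, tells the compiler exactly how large a GPU copy of h_Rand to allocate:

#pragma acc region copyout(h_Rand[0:N-1])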

Other CCFF messages emitted for the loop included:

    135, Accelerator restriction: an unsupported operation was found
    136, Accelerator restriction: unsupported operation: URSHIFT

While it was not immediately obvious, correspondence with PGI engineers confirmed that shifts on variables of type unsigned int were not supported in PGI 10.4. Similarly, casting an unsigned int to float was unsupported.

One of the more interesting feedback messages emitted for the loop was:

    161, Accelerator restriction: loop contains induction variable 
         live after the loop

This message was triggered by the for loop that was split to run from 0 to n-m-1, and then from n-m to n-1. An easy workaround was to eliminate the lim variable and explicitly initialize k at the start of the second loop:

            for (k = 0; k < n - m; k++) {
               x = (st[k] & uuu) | (st[k + 1] & lll);
               st[k] = st[k + m] ^ (x >> 1) ^ (x & 1U ? aa : 0U);
            }
            for (k = n - m; k < n - 1; k++) {
               x = (st[k] & uuu) | (st[k + 1] & lll);
               st[k] = st[k + m - n] ^ (x >> 1) ^ (x & 1U ? aa : 0U);
            }

In all, we had to make four modifications to the original code to get it to successfully offload:

  1. modify the mt_struct type to remove the state component, and create a standalone state array MTpgi_state,
  2. add clauses to the accelerator region pragma to specify sizes of input/output arrays and make two of the arrays local to GPU memory,
  3. modify the split loop to explicitly initialize the loop index to start the second loop, and
  4. change the type of several variables from unsigned int to int.

Multiplying the results by 2.0 compensated for the loss of range introduced by the change from unsigned int types to int.

In addition, we had to add a pragma to specify the kernel launch schedule. The performance of the default schedule was very poor because the compiler was unable to resolve certain potential dependencies present in the outer loop, and chose to create kernels out of some of the inner loops. Insertion of a pragma indicating that the outer loop is parallelizable and vectorizable provided the compiler with the information it needed to create a more optimal schedule using the outer loop as the kernel. After these modifications, the code looked as follows:

void RandomRef_ACC(float *h_Rand, int NPerRng, unsigned int seed)
{
    int iRng, iOut;

#pragma acc region \
        copyin(MTpgi[0:MT_RNG_COUNT-1]) \
        copyout(h_Rand[0:N-1]) \
        local(MTpgi_state[0:MT_NN-1][0:MT_RNG_COUNT-1])
{
#pragma acc for parallel(128) vector(32)
    for(iRng = 0; iRng < MT_RNG_COUNT; iRng++) {

        MTpgi_state[0][iRng] = seed & MTpgi[iRng].wmask;
        for(int i = 1; i < MTpgi[iRng].nn; i++) {
            MTpgi_state[i][iRng] = ( UINT32_C(1812433253) *
                                     (MTpgi_state[i-1][iRng] ^
                                      (MTpgi_state[i-1][iRng] >> 30)) + i )
                                   & MTpgi[iRng].wmask;
        }
        MTpgi[iRng].i = MTpgi[iRng].nn;

        for(iOut = 0; iOut < NPerRng; iOut++) {
#ifdef NOUINT
            int uuu, lll, aa, x;
#else
            uint32_t uuu, lll, aa, x;
#endif
            int k, n, m;

            if (MTpgi[iRng].i >= MTpgi[iRng].nn) {
                n   = MTpgi[iRng].nn;
                m   = MTpgi[iRng].mm;
                aa  = MTpgi[iRng].aaa;
                uuu = MTpgi[iRng].umask;
                lll = MTpgi[iRng].lmask;

                for (k = 0; k < n - m; k++) {
                    x = (MTpgi_state[k][iRng] & uuu) |
                        (MTpgi_state[k + 1][iRng] & lll);
                    MTpgi_state[k][iRng] = MTpgi_state[k + m][iRng] ^
                                           (x >> 1) ^ (x & 1U ? aa : 0U);
                }
                for (k = n - m; k < n - 1; k++) {
                    x = (MTpgi_state[k][iRng] & uuu) |
                        (MTpgi_state[k + 1][iRng] & lll);
                    MTpgi_state[k][iRng] = MTpgi_state[k + m - n][iRng] ^
                                           (x >> 1) ^ (x & 1U ? aa : 0U);
                }
                x = (MTpgi_state[n - 1][iRng] & uuu) |
                    (MTpgi_state[0][iRng] & lll);
                MTpgi_state[n - 1][iRng] = MTpgi_state[m - 1][iRng] ^
                                           (x >> 1) ^ (x & 1U ? aa : 0U);
                MTpgi[iRng].i = 0;
            }
            x = MTpgi_state[MTpgi[iRng].i][iRng];
            MTpgi[iRng].i += 1;
            x ^= x >> MTpgi[iRng].shift0;
            x ^= (x << MTpgi[iRng].shiftB) & MTpgi[iRng].maskB;
            x ^= (x << MTpgi[iRng].shiftC) & MTpgi[iRng].maskC;
            x ^= x >> MTpgi[iRng].shift1;
#ifdef NOUINT
            h_Rand[iRng*NPerRng+iOut] = ((x+1.0f)/4294967296.0f)*2.0f;
#else
            h_Rand[iRng*NPerRng+iOut] = ((x+1.0f)/4294967296.0f);
#endif
        }
    }
}
}

Initial performance was slightly slower than one core of a host Nehalem CPU:

% mersenneTwister
Computing over 134217728 samples

 --- CPU Mersenne Twister ---
exec time = 1.353710 s
throughput = 396.592263 [MB/s]
max = 1.0000000000
min = 0.0000000009

 ...

 --- Mersenne Twister on GPU with PGI Accelerator directives ---
 --- Main loop based on original serial CPU Mersenne Twister ---
exec time = 1.404601 s
throughput = 382.223074 [MB/s]
max = 1.0000000000
min = 0.0000000028

Using an early version of the PGI 10.6 compiler that included an update enabling support for unsigned int variables in accelerator regions, we saw about the same performance on the GPU. Using the -ta=nvidia:time profiling option, we found that about 1.24 seconds were spent generating the random numbers and about 0.10 seconds were spent moving data and launching the kernel:

% mersenneTwister
Computing over 134217728 samples

 --- CPU Mersenne Twister ---
exec time = 1.351139 s
throughput = 397.346914 [MB/s]
max = 1.0000000000
min = 0.0000000009

...

 --- Mersenne Twister on GPU with PGI Accelerator directives ---
 --- Main loop based on original serial CPU Mersenne Twister ---
exec time = 1.343602 s
throughput = 399.575851 [MB/s]
max = 1.0000000000
min = 0.0000000102

Accelerator Kernel Timing data
/home/sw/tmp/miles/MersenneTwister/mersenneTwisterCPU.c
  RandomRef_ACC
    122: region entered 1 time
        time(us): total=1343101 init=1 region=1343100
                  kernels=1239953 data=97604
        w/o init: total=1343100 max=1343100 min=1343100 avg=1343100
        125: kernel launched 1 times
            grid: [128]  block: [32]
            time(us): total=1239953 max=1239953 min=1239953 avg=1239953
acc_init.c
  acc_init
    29: region entered 1 time
        time(us): init=4422310

The performance of the GPU kernel seems disappointing in comparison to the host Nehalem, about the speed of just one core. We decided to move on to create a directly comparable OpenCL version of Mersenne Twister based on the version in the NVIDIA SDK, expecting that perhaps we would learn something that might help us further tune the directive-based version.

OpenCL and CUDA Implementations of Mersenne Twister

An OpenCL version of the Mersenne Twister algorithm is included in the NVIDIA OpenCL SDK. An OpenCL program consists of host C code with OpenCL API calls, plus one or more device kernels that must be compiled with an OpenCL language compiler. The host code can be compiled with any C compiler, so we chose to use PGCC. The device kernels are read from a file in source code form at runtime, and dynamically compiled using the NVIDIA OpenCL language compiler. The host portion of an OpenCL program performs these basic steps using API calls:

  • Query the platform to determine what type of devices are available
  • Create a context in which device kernels can be executed
  • Create a command queue within the context
  • Read the OpenCL kernels source code from a file
  • Compile the OpenCL kernels source code
  • Load the OpenCL kernel onto the device
  • Copy any input arguments from host memory to device memory
  • Launch the OpenCL kernel into the command queue
  • Wait for the kernel to complete execution on the device
  • Copy output arguments from device memory back to host memory
  • Free up device memory and perform cleanup operations

Most of these operations take more than one API call, so the host code can get very complicated. The NVIDIA SDK example is fairly robust, allowing for example the use of multiple devices. We simplified the code to create a version that is easier to work with, and to get an apples-to-apples performance comparison with the other models. Even in our simplified version, 27 API calls are required to implement the steps listed above; the core sequence is sketched below. However, the steps required for the platform query, dynamic compilation and cleanup would typically be performed only once in an application. As a result, we chose to time those components for informational purposes but not to include that overhead time in the performance comparison.
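
The following heavily abbreviated sketch illustrates that sequence. This is our own reconstruction rather than the code we benchmarked: readKernelSource() is a hypothetical helper, h_MT, h_Rand and nPerRng are assumed from the surrounding program, and all error checking and timing instrumentation are omitted.

cl_platform_id platform;
cl_device_id   device;
cl_int         err;

/* query the platform, then create a context and a command queue */
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
cl_context context     = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

/* read the kernel source from a file and compile it dynamically */
const char *src    = readKernelSource("MersenneTwister.cl");
cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel   = clCreateKernel(program, "mersenneTwister_kernel_cl", &err);

/* allocate device memory and copy in the MT parameter sets */
cl_mem d_results = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                  N * sizeof(float), NULL, &err);
cl_mem d_params  = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                  MT_RNG_COUNT * sizeof(mt_struct_stripped),
                                  NULL, &err);
clEnqueueWriteBuffer(queue, d_params, CL_TRUE, 0,
                     MT_RNG_COUNT * sizeof(mt_struct_stripped),
                     h_MT, 0, NULL, NULL);

/* set arguments and launch: 128 work groups of 32 work items each */
size_t global = 128 * 32, local = 32;
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_results);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_params);
clSetKernelArg(kernel, 2, sizeof(int),    &nPerRng);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
clFinish(queue);

/* copy results back to host memory, then clean up */
clEnqueueReadBuffer(queue, d_results, CL_TRUE, 0, N * sizeof(float),
                    h_Rand, 0, NULL, NULL);
clReleaseMemObject(d_results);
clReleaseMemObject(d_params);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue);
clReleaseContext(context);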

The host code is not particularly interesting, as it simply initializes and drives the OpenCL device kernel. The OpenCL kernel itself looks as follows:

__kernel void mersenneTwister_kernel_cl( __global float* results,
                  __global mt_struct_stripped* parameters, int nPerRng)
{
    int globalID = get_global_id(0);

    int iState, iState1, iStateM, iOut;
    unsigned int mti, mti1, mtiM, x;
    unsigned int mt[MT_NN], matrix_a, mask_b, mask_c;

    //Load bit-vector Mersenne Twister parameters
    matrix_a = parameters[globalID].matrix_a;
    mask_b   = parameters[globalID].mask_b;
    mask_c   = parameters[globalID].mask_c;

    //Initialize current state
    mt[0] = parameters[globalID].seed;
    for (iState = 1; iState < MT_NN; iState++)
        mt[iState] = ( 1812433253U *
                       (mt[iState-1] ^
                       (mt[iState-1] >> 30)) + iState ) & MT_WMASK;

    iState = 0;
    mti1 = mt[0];

    iOut = 0;
    for (iOut = 0; iOut < nPerRng; iOut++) {
        iState1 = iState + 1;
        iStateM = iState + MT_MM;
        if(iState1 >= MT_NN) iState1 -= MT_NN;
        if(iStateM >= MT_NN) iStateM -= MT_NN;
        mti  = mti1;
        mti1 = mt[iState1];
        mtiM = mt[iStateM];

        x = (mti & MT_UMASK) | (mti1 & MT_LMASK);
        x = mtiM ^ (x >> 1) ^ ((x & 1) ? matrix_a : 0);

        mt[iState] = x;
        iState = iState1;

        x ^= (x >> MT_SHIFT0);
        x ^= (x << MT_SHIFTB) & mask_b;
        x ^= (x << MT_SHIFTC) & mask_c;
        x ^= (x >> MT_SHIFT1);

        results[globalID+iOut*MT_RNG_COUNT] = ((float)x+1.0f)/4294967296.0f;
    }
}

The loop indexed by iRng in the previous implementations has been abstracted away; each instance of this kernel is executed by a separate thread. We initialized the launch parameters to use a configuration with 128 work groups (a work group is equivalent to a CUDA threadblock) and 32 work items (a work item is equivalent to a CUDA thread) per work group, the same launch configuration used in the PGI Accelerator directive-based version. At 128 work groups of 32 work items each, there are 4096 work items in total, exactly one per generator, since MT_RNG_COUNT is 4096. Performance was as follows:

...
 --- Mersenne Twister on GPU with OpenCL ---
OpenCL platform query time  = 0.098961 s
OpenCL dynamic compile time  = 0.146656 s
opencl malloc time  = 0.000003 s
file i/o time  = 0.001135 s
copyin time  = 0.000213 s
kernel exec time  = 0.055795 s
copyout time  = 0.097260 s
OpenCL cleanup time  = 0.066340 s
total exec time  = 0.154413 s (not including OpenCL overhead)
OpenCL overhead time  = 0.311957 s
throughput = 3476.850472 [MB/s]
max = 1.0000000000
min = 0.0000000002
...

This is a significant improvement over the initial directive-based version. Why the difference? One obvious difference between the two kernels is that the OpenCL kernel uses much simpler conditionals. The OpenCL kernel has also removed all use of structs in the main compute loop and most uses of arrays, resulting in a relatively simple straight-line sequence of operations, primarily on unsigned int variables. A detailed performance profile of the two versions should be undertaken to get a better understanding of the performance difference, but we decided to forge ahead for the time being and look at the CUDA C version.

The CUDA C kernel implementation is nearly identical. The host code is much simpler due to a few simple yet powerful CUDA C language extensions that simplify the kernel launch, and the absence of any need to query the platform—a CUDA C program assumes availability of a CUDA-enabled NVIDIA GPU, and the CUDA HW/SW architecture ensures the code will execute correctly on any CUDA device without the need to tailor the source code accordingly. The CUDA C kernel looks as follows:

__global__ void mersenneTwister_kernel_cuda( float *d_Random, int NPerRng
){
    const int      tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int THREAD_N = blockDim.x * gridDim.x;

    int iState, iState1, iStateM, iOut;
    unsigned int mti, mti1, mtiM, x;
    unsigned int mt[MT_NN];

    for(int iRng = tid; iRng < MT_RNG_COUNT; iRng += THREAD_N){
        //Load bit-vector Mersenne Twister parameters
        mt_struct_stripped config = ds_MT[iRng];

        //Initialize current state
        mt[0] = config.seed;
        for(iState = 1; iState < MT_NN; iState++)
            mt[iState] = ( 1812433253U *
                           (mt[iState-1] ^
                           (mt[iState-1] >> 30)) + iState ) & MT_WMASK;

        iState = 0;
        mti1 = mt[0];
        for(iOut = 0; iOut < NPerRng; iOut++){
            iState1 = iState + 1;
            iStateM = iState + MT_MM;
            if(iState1 >= MT_NN) iState1 -= MT_NN;
            if(iStateM >= MT_NN) iStateM -= MT_NN;
            mti  = mti1;
            mti1 = mt[iState1];
            mtiM = mt[iStateM];

            x    = (mti & MT_UMASK) | (mti1 & MT_LMASK);
            x    =  mtiM ^ (x >> 1) ^ ((x & 1) ? config.matrix_a : 0);
            mt[iState] = x;
            iState = iState1;

            x ^= (x >> MT_SHIFT0);
            x ^= (x << MT_SHIFTB) & config.mask_b;
            x ^= (x << MT_SHIFTC) & config.mask_c;
            x ^= (x >> MT_SHIFT1);

            d_Random[iRng+iOut*MT_RNG_COUNT] = ((float)x+1.0f)/4294967296.0f;
        }
    }
}
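
For completeness, the host side reduces to something like the following sketch. This is ours, not the SDK code; it assumes h_MT holds the MT_RNG_COUNT pre-computed parameter sets, ds_MT is the __device__ array read by the kernel, and error checking is omitted. The <<<128, 32>>> launch syntax is the CUDA C extension referred to above.

float *d_Random;
cudaMalloc((void **)&d_Random, N * sizeof(float));

/* copy the per-generator parameters into the device-resident ds_MT array */
cudaMemcpyToSymbol(ds_MT, h_MT, MT_RNG_COUNT * sizeof(mt_struct_stripped));

/* the same 128x32 launch configuration used throughout this article */
mersenneTwister_kernel_cuda<<<128, 32>>>(d_Random, nPerRng);

cudaMemcpy(h_Rand, d_Random, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_Random);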

The CUDA C kernel is about 13% faster than the OpenCL kernel, while the data movement time is nearly identical. Disregarding the OpenCL startup and shutdown overhead, the overall performance of CUDA C and OpenCL is very close:

Computing over 134217728 samples

 --- Mersenne Twister on GPU with CUDA ---
cudamalloc time  = 0.002696 s
file i/o time  = 0.002816 s
copyin time  = 0.000097 s
kernel exec time  = 0.049395 s
copyout time  = 0.097568 s
total exec time  = 0.152586 s
throughput = 3518.480804 [MB/s]

Tuning the PGI Accelerator Directive-based Mersenne Twister

Clearly there is a large performance gap between the initial PGI Accelerator version, and the CUDA C and OpenCL implementations. Is this an inherent inefficiency in the directives, or just a matter of how the code is structured? Rather than iterating on the initial directive-based version, we decided to create a second directive-based version with a main loop that mirrors the structure of the CUDA C and OpenCL kernel source. This was straightforward, and resulted in a code that looks as follows:

void RandomRef_ACC_2(float *h_Rand, unsigned int NPerRng, unsigned int seed)
{
    int iRng, iOut;
    unsigned int mt[MT_NN];

#pragma acc region copyin(MTpgi[0:MT_RNG_COUNT-1]) \
                   copyout(h_Rand[0:N-1]) \
                   local(mt[0:MT_NN-1])
{
#pragma acc for parallel(128) vector(32) private(mt[0:MT_NN-1])
    for(iRng = 0; iRng < MT_RNG_COUNT; iRng++) {
        int iState, iState1, iStateM;
        unsigned int mti, mti1, mtiM, x;
        unsigned int matrix_a, mask_b, mask_c;

        matrix_a = MTpgi[iRng].aaa;
        mask_b   = MTpgi[iRng].maskB;
        mask_c   = MTpgi[iRng].maskC;

        mt[0] = seed & MTpgi[iRng].wmask;

        for (iState = 1; iState < MT_NN; iState++) {
            mt[iState] = ( 1812433253U *
                           (mt[iState - 1] ^
                           (mt[iState - 1] >> 30) ) + iState) & MT_WMASK;
        }

        iState = 0;
        mti1 = mt[0];
        mti  = mti1;

        for (iOut = 0; iOut < NPerRng; iOut++) {
            iState1 = iState + 1;
            iStateM = iState + MT_MM;
            if(iState1 >= MT_NN) iState1 -= MT_NN;
            if(iStateM >= MT_NN) iStateM -= MT_NN;
            mti1  = mt[iState1];
            mtiM  = mt[iStateM];

            x = (mti & MT_UMASK) | (mti1 & MT_LMASK);
            x = mtiM ^ (x >> 1) ^ ((x & 1) ? matrix_a : 0);

            mt[iState] = x;
            iState = iState1;

            x ^= (x >> MT_SHIFT0);
            x ^= (x << MT_SHIFTB) & mask_b;
            x ^= (x << MT_SHIFTC) & mask_c;
            x ^= (x >> MT_SHIFT1);

            h_Rand[iRng+iOut*MT_RNG_COUNT] = (((float)x+1.0f)/4294967296.0f);

            mti = mti1;
        }
    }
}
}

The initial runs of this version showed a huge performance improvement over the original directive-based version:

...
 --- Mersenne Twister on GPU with PGI Accelerator directives ---
 --- Main loop based on CUDA/OpenCL Mersenne Twister kernels ---
exec time = 0.366780 s
throughput = 1463.740967 [MB/s]
max = 1.0000000000
min = 0.0000000070

...

Accelerator Kernel Timing data
/home/sw/tmp/miles/MersenneTwister/mersenneTwisterCPU.c
  RandomRef_ACC_2
    188: region entered 1 time
        time(us): total=366287 init=1 region=366286
                  kernels=245992 data=98019
        w/o init: total=366286 max=366286 min=366286 avg=366286
        193: kernel launched 1 times
            grid: [128]  block: [32]
            time(us): total=245992 max=245992 min=245992 avg=245992
acc_init.c
  acc_init
    29: region entered 1 time
        time(us): init=4423375
...

However, the performance is still less than half that delivered by OpenCL and CUDA C. Analyzing the CCFF feedback messages emitted during compilation gave a significant clue as to the reason:

...
RandomRef_ACC_2:
    188, Generating copyin(MTpgi[:4095])
         Generating copyout(h_Rand[:134217727])
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
    193, Loop carried dependence of 'h_Rand' prevents parallelization
         Loop carried backward dependence of 'h_Rand' prevents vectorization
         Accelerator kernel generated
        193, #pragma acc for parallel(128), vector(32)
             Non-stride-1 accesses for array 'mt'
             CC 1.0 : 14 registers; 20 shared, 64 constant, 
                       0 local memory bytes; 33% occupancy
             CC 1.3 : 14 registers; 20 shared, 64 constant, 
                       0 local memory bytes; 25% occupancy
    204, Loop carried dependence of 'mt' prevents parallelization
         Loop carried backward dependence of 'mt' prevents vectorization
         Inner sequential loop scheduled on accelerator
    214, Complex loop carried dependence of 'mt' prevents parallelization
         Loop carried dependence due to exposed use of 'mt[0:17]' 
              prevents parallelization
         Loop carried scalar dependence for 'iState' at line 215
         Loop carried scalar dependence for 'iState' at line 216
         Loop carried scalar dependence for 'iState' at line 225
         Loop carried scalar dependence for 'mti' at line 222
         Inner sequential loop scheduled on accelerator
...

There are non-stride-1 accesses (what NVIDIA calls non-coalesced accesses) to mt in the inner loop. Note that we have declared mt private, meaning the compiler is free to create a separate instance of the variable for each iteration of the loop. In fact, this is necessary in order to successfully parallelize the loop. In this case, the compiler made a poor decision and essentially promoted mt[MT_NN] to mt[MT_RNG_COUNT][MT_NN]. As a result, since C arrays are stored in row-major order, the accesses by the threads within a thread-block to their corresponding elements in each instance are non-stride-1. However, there is nothing that prevents us from performing the promotion of mt manually in the source code to ensure it is done in a more optimal way, i.e. promoting to mt[MT_NN][MT_RNG_COUNT]. With the dimensions transposed, adjacent threads (adjacent values of iRng) access adjacent elements mt[k][iRng], so the accesses coalesce. Once this is done, the code looks as follows:

void RandomRef_ACC_2(float *h_Rand, unsigned int NPerRng, unsigned int seed)
{
    int iRng, iOut;
    unsigned int mt[MT_NN][MT_RNG_COUNT];

#pragma acc region copyin(MTpgi[0:MT_RNG_COUNT-1]) \
                   copyout(h_Rand[0:N-1]) \
                   local(mt[0:MT_NN-1][0:MT_RNG_COUNT-1])
{
#pragma acc for parallel(128) vector(32)
    for(iRng = 0; iRng < MT_RNG_COUNT; iRng++) {
        int iState, iState1, iStateM;
        unsigned int mti, mti1, mtiM, x;
        unsigned int matrix_a, mask_b, mask_c;

        matrix_a = MTpgi[iRng].aaa;
        mask_b   = MTpgi[iRng].maskB;
        mask_c   = MTpgi[iRng].maskC;

        mt[0][iRng] = seed & MTpgi[iRng].wmask;

        for (iState = 1; iState < MT_NN; iState++) {
            mt[iState][iRng] = ( 1812433253U *
                                 (mt[iState - 1][iRng] ^
                                 (mt[iState - 1][iRng] >> 30) ) + iState)
                                 & MT_WMASK;
        }

        iState = 0;
        mti1 = mt[0][iRng];
        mti  = mti1;

        for (iOut = 0; iOut < NPerRng; iOut++) {
            iState1 = iState + 1;
            iStateM = iState + MT_MM;
            if(iState1 >= MT_NN) iState1 -= MT_NN;
            if(iStateM >= MT_NN) iStateM -= MT_NN;
            mti1  = mt[iState1][iRng];
            mtiM  = mt[iStateM][iRng];

            x = (mti & MT_UMASK) | (mti1 & MT_LMASK);
            x = mtiM ^ (x >> 1) ^ ((x & 1) ? matrix_a : 0);

            mt[iState][iRng] = x;
            iState = iState1;

            x ^= (x >> MT_SHIFT0);
            x ^= (x << MT_SHIFTB) & mask_b;
            x ^= (x << MT_SHIFTC) & mask_c;
            x ^= (x >> MT_SHIFT1);

            h_Rand[iRng+iOut*MT_RNG_COUNT] = (((float)x+1.0f)/4294967296.0f);

            mti = mti1;
        }
    }
}
}

Re-compiling, we no longer see the CCFF message indicating non-stride-1 accesses and the performance is on par with the OpenCL and CUDA C versions:

...
 --- Mersenne Twister on GPU with PGI Accelerator directives ---
 --- Main loop based on CUDA/OpenCL Mersenne Twister kernels ---
exec time = 0.149562 s
throughput = 3589.621107 [MB/s]
max = 1.0000000000
min = 0.0000000147

...

Accelerator Kernel Timing data
/home/sw/tmp/miles/MersenneTwister/mersenneTwisterCPU.c
  RandomRef_ACC_2
    188: region entered 1 time
        time(us): total=149106 init=1 region=149105
                  kernels=46294 data=97286
        w/o init: total=149105 max=149105 min=149105 avg=149105
        193: kernel launched 1 times
            grid: [128]  block: [32]
            time(us): total=46294 max=46294 min=46294 avg=46294
acc_init.c
  acc_init
    29: region entered 1 time
        time(us): init=4400599
...

Conclusions and Future Work

Using the Mersenne Twister random number generator example code from the NVIDIA CUDA SDK, we created a multi-core x64 host version using OpenMP and were able to see near-linear scaling on 4 cores of an Intel Nehalem CPU. Using this same source code, incrementally modified according to compiler feedback, we created a GPU-enabled version using PGI Accelerator directives. Comparing the performance of the directive-based version to NVIDIA's OpenCL and CUDA versions, we were able to re-code the directive-based version to deliver performance on par with hand-coded CUDA C and OpenCL. While it's possible to use PGI Accelerator directives to create GPU-enabled loops from existing host source code, it seems clear that many of the same coding guidelines that apply to CUDA C also apply to PGI Accelerator directive-based kernels.

For very large sequences, in our case 128M samples, the GPU-enabled versions of the Mersenne Twister outperform all 4 cores of a 2.67 GHz Nehalem by a factor of about 2.3 (roughly 0.35 seconds versus 0.15 seconds), including the cost of data transfer between the host and GPU. When the data transfer cost is excluded, the speed-up of the GPU versions ranges from a factor of 6.25 to a factor of 7.6 over the quad-core host CPU. We conclude that offloading random number generation using one of these routines as a standalone function is not likely to be of great benefit. However, performing random number generation on the GPU using these methods as one component of the port of an entire application will certainly yield significant benefits.

In the end, the high-level PGI Accelerator directive-based programming model reached the same performance as the lower-level OpenCL and CUDA programming models. We had to make some changes to the code to reach maximum performance, but these were minimal with respect to the effort spent in writing (and debugging) the OpenCL implementation. The productivity benefits of the directive-based programming model are clear in this case.

As often occurs, this project yielded as many questions as answers. We would like to do performance comparisons on a Fermi-architecture C2050 to see how it compares to the Nehalem and the C1060. There is an open question about why the performance of the second directive-based version is so much higher than that of the first; this should be explored, and may yield useful information on how to incrementally create optimal kernels using directives. We'd like to plug these versions of the Mersenne Twister into the Monte Carlo example code referenced earlier, to get an apples-to-apples performance comparison of all 4 cores of a Nehalem to the GPU using both directives and CUDA C. We also need to run performance tests across a spectrum of sizes, to see at what point the GPU performance overtakes the CPU performance. Finally, we relied on rather simple tests of the random numbers generated. The sequences generated by each of the versions need to be more thoroughly vetted using a random number generator testing package, for example the NIST tests.