Technical News from The Portland Group

Tuning a Monte Carlo Algorithm on GPUs

Since the introduction of PGI CUDA Fortran late last year, we've seen a dramatic rise in the number of customers using this new extension to the Fortran language. As the moderator of the PGI User Forum, I have been very busy answering questions about the language, and noting those questions that seem to be asked often or may be of interest to the wider community. For this installment of the PGInsider, I have implemented the Monte Carlo Integration algorithm to highlight some of the tips, tricks, and traps of programming for the GPU.

Monte Carlo Integration

For my sample code I chose a simple Monte Carlo Integration algorithm to compute the approximate value of pi. The code first creates a list of random points within a square. Each point is evaluated using the function:

f(x,y) = (x^2 + y^2 < 1) ? 1 : 0

The points are then summed. The approximate value of pi can then be calculated by multiplying four times the volume of the square with the mean value for f(x,y).

The code itself is very simple to understand and the algorithm is highly parallel because each f(x,y) calculation can be performed independently. Besides this, the algorithm uses a sum reduction and requires a random number generator (RNG). Both of which present interesting problems.

For the following examples source code is available for download from the PGI website.

Here is a basic Fortran implementation of a Monte Carlo Integration algorithm.

! Perform the function f(x,y) = (x^2 + y^2 < 1) ? 1 : 0
    do i=1,N
      tempVal = X(i)*X(i) + Y(i)*Y(i)
      if (tempVal < 1) then
 	 temp(i) = 1
	 temp(i) = 0

! Sum the results
    sumA = 0
    sumSq = 0
    do i=1,N
      sumA = sumA + temp(i)
      sumSq = sumSq + (temp(i)*temp(i))

! calulate the mean
    meanA = sumA / real(N);
    meanSq = sumSq / real(N);

! approximate pi
    results%estimate = meanA * volume * 4
    results%variance = (meanSq - meanA*meanA) / (N - 1)

Baseline Performance

First, let's start with the host version of the code to get our baseline performance. We compiled the code with auto-parallelization enabled and ran with four threads. The system we're using is an Intel Core i7 (single socket four core Nehalem) with an attached NVIDIA S1070 (four Tesla C1060 cards). The compiler version used is PGI 10.2. (For a listing of the flags used in these builds, please see the Flag Reference section at the end of this article.) To build, run the following command:

pgi_mc_example% setenv OMP_NUM_THREADS 4
pgi_mc_example% make run_CPU
pgfortran -fast -c -Iinc ./src/mcUtils.F90 -o ./obj/mcUtils.o
pgfortran -fast -c -Iinc -Mpreprocess  -DITER=10  -Mconcur=innermost -Minfo=par 
 ./src/mcCPU.F90 -o ./obj/mcCPU.o
     28, Parallel code generated with block distribution for inner loop if trip 
         count is greater than or equal to 100
     44, Loop not parallelized: may not be beneficial
pgfortran -fast -c -Iinc -Mpreprocess  -DITER=10 -DMCTYPE=0 ./src/monte_drv.F90 
 -o ./obj/monte_drv_cpu.o
pgfortran -fast   -Mconcur=innermost -Minfo=par ./obj/monte_drv_cpu.o 
 ./obj/mcUtils.o ./obj/mcCPU.o -o mcCPU.out
time  mcCPU.out
 ----- CPU -----
 Result =     3.141863
 Standard deviation =    5.0109735E-05
 Difference from real pi value =    2.6988983E-04
 Time in Seconds
    Total :   13.80947
      RNG :   12.27999
  Compute :    1.43082
Data Xfer :    0.00000
50.04user 1.13system 0:13.83elapsed 369%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+786745minor)pagefaults 0swaps

In this sample, we used 67108860 random points and are able to get a rough estimate of pi. We collect the total execution time, the total time to generate the random numbers, and the time to compute the f(x,y) function and sum the results. For the results run on the GPU, we will also collect the time it takes to transfer the data back and forth between the host and GPU. Due to the short run time of this algorithm, we perform 10 iterations of the RNG, f(x,y), and sum reduction. All times are in seconds.

As a first step, we created a version using the PGI Accelerator model. The PGI Accelerator model uses a directive-based approach where the programmer inserts a !$acc region around the sections of code they wish to offload to the GPU. The compiler then generates highly optimized low-level GPU code as well as taking care of the data management and "bookkeeping" code.

On Linux, the device drivers are powered down between each use. Because of this, each time your application runs, the devices need to be re-initialized. This initialization cost is typically about 1 second per attached device, or on my system, four seconds. To help with this, I use the pgcudainit utility. pgcudainit holds a system's devices open so they will not need to be restarted for each run.

pgi_mc_example% make run_ACC
pgfortran -ta=nvidia -fast -c -Iinc -Mpreprocess  -DITER=10 -Minfo=accel 
 ./src/mcACC.F90 -o ./obj/mcACC.o
     25, Generating local(temp(:))
         Generating copyin(y(:))
         Generating copyin(x(:))
     32, Loop is parallelizable
         Accelerator kernel generated
         32, !$acc do parallel, vector(256)
             Using register for 'temp'
     41, Loop is parallelizable
         Accelerator kernel generated
         41, !$acc do parallel, vector(256)
             Using register for 'temp'
         42, Sum reduction generated for suma
         43, Sum reduction generated for sumsq
pgfortran -ta=nvidia -fast -c -Iinc -Mpreprocess  -DITER=10 -Minfo=accel 
 -DMCTYPE=1 ./src/monte_drv.F90 -o ./obj/monte_drv_acc.o
pgfortran -fast  -ta=nvidia ./obj/monte_drv_acc.o ./obj/mcUtils.o 
 ./obj/mcACC.o  -o mcACC.out
time  mcACC.out
 ----- ACC -----
 Result =     3.141863
 Standard deviation =    5.0109724E-05
 Difference from real pi value =    2.7036667E-04
 Time in Seconds
    Total :   13.52597
      RNG :   12.09849
  Compute :    0.24569
Data Xfer :    1.10420
12.96user 0.62system 0:13.61elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+131860minor)pagefaults 0swaps

While not every code is this easy, it took me approximately 10 minutes to insert the directives and realize a 6x speed-up of the compute time over the host's parallelized version. However, with the random number generation and data transfer overhead, the overall time does not improve over the host. For other algorithms where the compute time has a greater share of the overall time, the PGI Accelerator model would be a very good choice. However, until the PGI Accelerator model supports a means for creating random numbers on the GPU, we'll need to port this algorithm using CUDA Fortran. This will be more difficult and time consuming to write than using PGI Accelerator directives, but hopefully we can see significant speed-up. (For the record, it took me approximately 10 hours to get to the final version shown below.)

Initial CUDA Fortran Port

Let's start with a simple CUDA Fortran program. Below, we have created a device module which will contain our GPU code. It includes routines with global and device attributes. A global attribute can only be used with subroutines and can be called from host code. A device attribute can be used with either functions or subroutines, but can only be called from other device or global routines within the same device module. Likewise, access to a device module's global data is restricted to routines declared in the same module or from host code which uses the module. The main reason for this restriction is that NVIDIA currently does not have a linker that can be used on device code. Hence, there is no way to associate global symbols found outside the device module.

! Sample Device Module
module mcCUF_1

  ! Declare our global device variables.
  ! The 'device' attribute tells the compiler that this variable
  ! should be allocated on the GPU when the module is loaded. 
  ! These values can be access by either the host or device code
  real, device :: dSum, dSumSq

  ! The 'constant' attribute will place the variable in the 
  ! devices constant memory buffer.  Constant memory can only be 
  ! updated from the host.
  integer, constant :: N_PER


! A 'global' subroutine will be executed on the device and can 
! be called from the host via the chevron syntax.
attributes(global) subroutine montecarlo_cuf1_kernel(dX, dY, dTemp,N)

  use cudafor

  implicit none

  ! The dX, dY device arrays contain the random numbers
  real, dimension(N), device :: dX, dY

  ! dTemp is the output array 
  real, dimension(N), device :: dTemp

  ! scalar variable passed in for the host must be passed by value
  integer, value :: N
  integer :: i
  real :: tempVal

  ! Each thread will perform the f(x,y) calculation on a single 
  ! point.  The routine will be launched in a 'grid' containing
  ! a number of blocks, each containing a number of threads. To 
  ! determine which point this thread should calculate, we take
  ! the block number (blockIdx) multiplied by the number of 
  ! threads in a block (blockDim), added to the thread number 
  ! within the block (threadIdx).  Since our arrays are a single
  ! dimension, only the "%x" value is needed.
  i  = (blockIdx%x-1)*blockDim%x + threadIdx%x

  ! perform the f(x,y) calculation and store the results
  tempVal = dX(i)*dX(i) + dY(i)*dY(i)
  if (tempVal < 1.0) then
     dTemp(i) = 1.0
     dTemp(i) = 0.0

end subroutine montecarlo_cuf1_kernel

subroutine montecarlo_cuf1(X, Y, volume, results, N)

    use mcUtils
    use cudafor

    implicit none

    ! the host variables containing the random points
    ! and host copy of the temp results array
    real, dimension(N) :: X, Y
    real, dimension(N) :: temp

    ! The corresponding device arrays
    real, dimension(N), device :: dX, dY
    real, dimension(N), device :: dTemp

    ! contains the test results and timing info
    type(mcResults) :: results

    integer :: volume, N, i, istat
    real :: sumA, sumSq
    real :: meanA, meanSq

    ! Dim3 variables to define the grid and block size
    type(dim3) :: dimGrid, dimBlock

    ! timer variables
    real :: sum_start, sum_end, func_time
    type(cudaEvent) :: func_start, func_end

    ! Initialize our timing routines 
    istat = cudaEventCreate(func_start)
    istat = cudaEventCreate(func_end)
    istat = cudaEventRecord(func_start, 0)

    ! Copy the random points from the host to device
    dX = X
    dY = Y

    ! set our Grid and Block sizes
    dimBlock = dim3(256,1,1)
    dimGrid = dim3(N/dimBlock%x,1,1)

    ! call our device kernel using the grid/block dimensions
    ! and passing in device pointers to our random point and temp array.
    call  montecarlo_cuf1_kernel<<<dimGrid,dimBlock>>>(dX, dY, dTemp, N)

    ! copy the result temp array back to the host
    temp = dTemp

    ! get the function time
    istat = cudaEventRecord(func_end, 0)
    istat = cudaThreadSynchronize()
    istat = cudaEventElapsedTime(func_time, func_start, func_end)
    results%time(3) = results%time(3) + (func_time/1000)

    ! start timing the sum 
    call cpu_time(sum_start)
    sumA = 0
    sumSq = 0

    ! perform the summation on the host
    do i=1,N
       sumA = sumA + temp(i)
       sumSq = sumSq + (temp(i)*temp(i))
    call cpu_time(sum_end)
    results%time(4) = results%time(4) + (sum_end-sum_start)

    ! calculate the mean values 
    meanA = sumA / real(N);
    meanSq = sumSq / real(N);

    ! calculate PI
    results%estimate = meanA * volume * 4
    results%variance = (meanSq - meanA*meanA) / (N - 1)

end subroutine montecarlo_cuf1

end module mcCUF_1

However, when we run this code we immediately encounter the error "ConfigureCall FAILED:9". On a Tesla card, the maximum thread block X dimension is 512 and the maximum grid X dimension is 65535 (see the output from the pgaccelinfo utility to see your card's maximums). Because we're using 67108860 points divided by 256 as the grid size, we're exceeding the maximum. Even if we used thread blocks of size 512, it's still too many points. Later we'll fix this problem by having each thread compute more than one element, but for now let's reduce the number of points to 16777215—set using the define flag -DUSE_SMALL—just to see if the program executes.

pgi_mc_example% make DFLAG=-DUSE_SMALL run_CUF1
pgfortran -fast -c -Iinc ./src/mcUtils.F90 -o ./obj/mcUtils.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess -DUSE_SMALL -DITER=10 
 ./src/mcCUF_1.F90 -o ./obj/mcCUF_1.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess -DUSE_SMALL -DITER=10 -DMCTYPE=11 
 ./src/monte_drv.F90 -o ./obj/monte_drv_cuf1.o
pgfortran -fast  -Mcuda ./obj/monte_drv_cuf1.o ./obj/mcUtils.o ./obj/mcCUF_1.o 
  -o mcCUF_1.out
time  mcCUF_1.out
 ----- CUF1 -----
 Result =     3.142020
 Standard deviation =    1.0021195E-04
 Difference from real pi value =    4.2748451E-04
 Time in Seconds
    Total :    3.98795
      RNG :    3.12032
  Compute :    0.12943
Data Xfer :    0.58712
3.59user 0.46system 0:04.08elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+197625minor)pagefaults 0swaps

The compute time in this initial CUDA Fortran version on the smaller data set is about 0.13 seconds. When using the smaller number of points the parallel host version has a compute time of 0.36 seconds and the PGI Accelerator compute version is at 0.11 seconds. While we haven't quite matched the PGI Accelerator version's time, it's a good first step. Next, we'll work on reducing the data transfer cost by moving the summation to the GPU.

Sum Reductions

One of the biggest performance bottle necks in GPU programming is the cost to transfer data back and forth between the host and device. We most likely can improve our performance if we don't need to copy back our temporary results array. So instead, lets try moving the sum reduction code on to the GPU by adding a new device kernel to perform the sum in serial.

attributes(global) subroutine montecarlo_cuf2_sum(dTemp, N)

  implicit none

  real, dimension(N), device :: dTemp
  integer, value :: N, i

  dSum = 0
  dSumSq = 0
  do i=1,N
     dSum = dSum + dTemp(i)
     dSumSq = dSumSq + (dTemp(i)*dTemp(i))

end subroutine montecarlo_cuf2_sum

Next, we need to remove host code which copies the device results array back from the GPU (temp=dTemp) as well as the host's summation loop. Then, we need to add the call to our new summation kernel:

    call  montecarlo_cuf2_sum<<<1,1>>>(dTemp, N)

Running the updated CUDA Fortran code we see the following output:

pgi_mc_example% make DFLAG=-DUSE_SMALL run_CUF2
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess -DUSE_SMALL -DITER=10 
 ./src/mcCUF_2.F90 -o ./obj/mcCUF_2.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess -DUSE_SMALL -DITER=10 
 -DMCTYPE=12 ./src/monte_drv.F90 -o ./obj/monte_drv_cuf2.o
pgfortran -fast  -Mcuda ./obj/monte_drv_cuf2.o ./obj/mcUtils.o 
 ./obj/mcCUF_2.o  -o mcCUF_2.out
time  mcCUF_2.out
 ----- CUF2 -----
 Result =     3.142021
 Standard deviation =    1.0021187E-04
 Difference from real pi value =    4.2796135E-04
 Time in Seconds
    Total :   76.96400
      RNG :    3.00475
  Compute :   73.57511
Data Xfer :    0.25853
65.76user 11.24system 1:17.03elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+33778minor)pagefaults 0swaps

While we cut our data transfer time in half, the compute time using a single GPU thread processor running the summation in serial was about 600 times slower. We'll need to rewrite this code to be parallel.

It turns out that optimally performing parallel GPU reductions are quite complex to implement. However, we should be able to improve our performance by performing a partial reduction in parallel. For details on implementing parallel reductions, please refer to NVIDIA's slide deck on reductions.

Let's add a new device kernel to perform a partial sum:

attributes(global) subroutine montecarlo_cuf3_sum0(dTemp, dSumA, dSumSqA, N)

  implicit none
  ! temporary results device array
  real, dimension(N), device :: dTemp

  ! arrays to hold the partial sums
  real, dimension(sumN), device :: dSumA, dSumSqA
  integer, value :: N
  integer :: i, ix, iy, nthrd, start, end, share, rem
  real :: tempVal

  ! Calculate the thread number as before, however each
  ! thread will now perform the sum on more then one element
  ! of the temporary array
  ix  = (blockIdx%x-1)*blockDim%x + threadIdx%x

  ! The total number of threads 
  nthrd = blockDim%x * gridDim%x

  ! Thread's share of the elements
  share = N/nthrd

  ! Thread's starting and ending index
  start = (share*(ix-1)) + 1
  end = start + share-1

  ! Add the extra elements to the end
  if (ix .eq. sumN) then
     rem = N - (share*nthrd)
     end = end + rem

  ! perform the partial summation.
  dSumA(ix) = 0
  dSumSqA(ix) = 0
  do i=start,end
     dSumA(ix) = dSumA(ix) + dTemp(i)
     dSumSqA(ix) = dSumSqA(ix) + (dTemp(i)*dTemp(i))

end subroutine montecarlo_cuf3_sum0

Next, declare the device summation arrays in the host code and then call the new partial sum kernel:

... ! add the summation arrays in the host
    real, dimension(sumN), device :: dSumA, dSumSqA

    call  montecarlo_cuf3_sum0<<<sumN,1>>>(dTemp, dSumA, dSumSqA, N)
    call  montecarlo_cuf3_sum<<<1,1>>>(dSumA, dSumSqA)

And finally, rerun with the new parallel partial sum kernel:

pgi_mc_example% make DFLAG=-DUSE_SMALL run_CUF3
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess -DUSE_SMALL -DITER=10 
 ./src/mcCUF_3.F90 -o ./obj/mcCUF_3.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess -DUSE_SMALL -DITER=10 
 -DMCTYPE=13 ./src/monte_drv.F90 -o ./obj/monte_drv_cuf3.o
pgfortran -fast  -Mcuda ./obj/monte_drv_cuf3.o ./obj/mcUtils.o 
 ./obj/mcCUF_3.o  -o mcCUF_3.out
time  mcCUF_3.out
 ----- CUF3 -----
 Result =     3.142020
 Standard deviation =    1.0021206E-04
 Difference from real pi value =    4.2724609E-04
 Time in Seconds
    Total :    6.41454
      RNG :    3.06278
  Compute :    2.97773
Data Xfer :    0.25702
5.80user 0.66system 0:06.49elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+33780minor)pagefaults 0swaps

With the parallel partial summation, we were able to reduce the time from 74 seconds down to 3.

Improving Warp Performance

When programming for a CPU using multiple threads, the programmer generally wants each thread working on a contiguous block of memory. This helps performance since the data will be more likely held in the CPU's local memory cache rather then being fetched from main memory with each read or write.

On the NVIDIA GPU, threads are organized in groups called 'warps' (a Tesla warp has 32 threads). Each thread in a warp executes the same instruction at the same time. This includes memory fetches. So if thread n fetches element i of our temp array, the other threads in the warp will need to wait for this fetch to complete. However, if all the threads are working on the same contiguous block of memory (i.e. thread n works on element i, thread n+1 on element i+1, etc.), the entire block of memory can be fetched at the same time.

For a good overview of the NVIDIA GPU Architecture, please see Michael Wolfe's article Understanding the CUDA Data Parallel Threading Model in this issue of the PGInsider.

Let's change our partial summation kernel to have each warp work on a contiguous block of memory. Also, we should use local variables to calculate the partial sum so that we don't need to update global memory with each iteration of the loop.

attributes(global) subroutine montecarlo_cuf4_sum0(dTemp, dSumA, dSumSqA, N)

  implicit none

  real, dimension(N), device :: dTemp
  real, dimension(THREAD_N), device :: dSumA, dSumSqA
  integer, value :: N
  integer :: i, ix, iy, nthrd, start, end, share, rem
  real :: tempVal, sum, sumSq
  ix  = (blockIdx%x-1)*blockDim%x + threadIdx%x
  nthrd = blockDim%x * gridDim%x

  sum = 0
  sumsq = 0
  do i=ix,N,nthrd
     sum = sum + dTemp(i)
     sumSq = sumSq + (dTemp(i)*dTemp(i))
  dSumA(ix) = sum
  dSumSqA(ix) = sumSq

end subroutine montecarlo_cuf4_sum0

We can also take this opportunity to address the problem we encountered earlier with the limited number of grid blocks. Similar to the partial sum, we can use a fixed number of grids and blocks. Then, each thread will work on a variable number of elements in our random point array. This way the number of elements is only limited by the amount of memory we have on the GPU and the precision of a REAL data type.

Let's change our f(x,y) kernel to calculate multiple elements per thread.

attributes(global) subroutine montecarlo_cuf5_kernel(dX, dY, dTemp, N)

  implicit none

  real, dimension(N), device :: dX, dY
  real, dimension(N), device :: dTemp
  integer, value :: N
  integer :: i, ix, iy, nthrd, start, end, share, rem
  real :: tempVal

  ix  = (blockIdx%x-1)*blockDim%x + threadIdx%x
  nthrd = blockDim%x * gridDim%x 

  do i=ix,N,nthrd
     tempVal = dX(i)*dX(i) + dY(i)*dY(i)
     if (tempVal < 1.0) then
        dTemp(i) = 1.0
        dTemp(i) = 0.0

end subroutine montecarlo_cuf5_kernel

Rerun with the small data set.

pgi_mc_example% make DFLAG=-DUSE_SMALL run_CUF4
pgfortran -fast -c -Iinc ./src/mcUtils.F90
 -o ./obj/mcUtils.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess
 -DUSE_SMALL -DITER=10 ./src/mcCUF_4.F90 -o ./obj/mcCUF_4.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess 
 -DUSE_SMALL -DITER=10 -DMCTYPE=14 ./src/monte_drv.F90 
 -o ./obj/monte_drv_cuf4.o
pgfortran -fast  -Mcuda ./obj/monte_drv_cuf4.o 
 ./obj/mcUtils.o ./obj/mcCUF_4.o  -o mcCUF_4.out
time  mcCUF_4.out
 ----- CUF4 -----
 Result =     3.142038
 Standard deviation =    1.0021203E-04
 Difference from real pi value =    4.4536591E-04
 Time in Seconds
    Total :    3.44605
      RNG :    2.99784
  Compute :    0.07027
Data Xfer :    0.25840
3.27user 0.23system 0:03.52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+33781minor)pagefaults 0swaps

Then run again using the larger number of points.

pgi_mc_example% make clean
pgi_mc_example% make run_CUF4
pgfortran -fast -c -Iinc ./src/mcUtils.F90 
 -o ./obj/mcUtils.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess  
 -DITER=10 ./src/mcCUF_4.F90 -o ./obj/mcCUF_4.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess  
 -DITER=10 -DMCTYPE=14 ./src/monte_drv.F90 
 -o ./obj/monte_drv_cuf4.o
pgfortran -fast  -Mcuda ./obj/monte_drv_cuf4.o 
 ./obj/mcUtils.o ./obj/mcCUF_4.o  -o mcCUF_4.out
time  mcCUF_4.out
 ----- CUF4 -----
 Result =     3.141860
 Standard deviation =    5.0109786E-05
 Difference from real pi value =    2.6726723E-04
 Time in Seconds
    Total :   13.37639
      RNG :   11.91478
  Compute :    0.24730
Data Xfer :    1.02121
12.99user 0.42system 0:13.47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+132084minor)pagefaults 0swaps

While we've had a lot of programming effort so far, our performance is only slightly better than the PGI Accelerator directives on the GPU or on the original host version. The next steps then are to begin moving the random number generator to the GPU and removing the remaining data transfers.

Calling a CUDA C Random Number Generator

For the final section, let's work on generating the random numbers on the device. Not only will this eliminate almost all data transfers, we should be able to run the RNG in parallel.

While there isn't a built-in RNG for the GPU, the NVIDIA CUDA SDK includes an implementation of the Mersenne Twister algorithm. The code is written in CUDA C but we can call a CUDA C kernel from a CUDA Fortran program by adding an interface block to our host code and using Fortran 2003 ISO_C_BINDING. Also, we'll need to slightly modify the CUDA C SDK version of the code to remove the CUDA SDK logging functions and add extern "C" to inhibit nvcc's name mangling:


    ! C function to load the Mersenne Twister input data set
    subroutine loadMTGPU(dat_path) bind(c,name='loadMTGPU')
       use iso_c_binding
       character(len=80) :: dat_path
    end subroutine loadMTGPU

    ! C function to initialize the random seed on the GPU
    subroutine seedMTGPU(seed) bind(c,name='seedMTGPU')
       use iso_c_binding
        integer(c_int), value :: seed   ! Pass by value
    end subroutine seedMTGPU

    ! CUDA C Mersenne Twister algorithm to generate an array of random
    ! numbers in parallel.  Note that nvcc has added '__entry' to the
    ! end of the function name.
    subroutine RandomGPU(d_Rand, n_per_rng) bind(c,name='randomgpu__entry')
        use iso_c_binding
        real(c_float), dimension(:), device :: d_Rand
        integer, value :: n_per_rng  ! pass by value
    end subroutine RandomGPU

end interface

The Mersenne Twister kernel is set to use 4096 threads in a 32 by 128 grid. Each random number array must be a multiple of 4096. Hence, we'll need to pad our random arrays before calling the RNG.

To find the optimal grid size requires some experimentation with various sizes. I tried a few other grid sizes, but settled for a 32 by 128 grid for the f(x,y) function kernel. I'm still not sure it's optimal.

    ! pad the random point arrays to be a multiple of THREAD_N (4096)
    if (MOD(N,THREAD_N) .ne. 0) then

    if (MOD(Ns,2) .ne. 0) then
        N_PER_THD = N_PER_THD - MOD(Ns,2) + 2
        N_PER_THD = N_PER_THD
    allocate(dX(Ns), dY(Ns))

    istat = cudaEventRecord(rng_start, 0)

    ! initalize and then call the RNG for dX
    write(dpath,'(a)') 'mtdata/MersenneTwister.dat'//char(0)
    call loadMTGPU(dpath)
    call seedmtgpu(777)
    call randomgpu<<<32,128>>>(dX,N_PER_THD)   
    ! Re-seed and call the RNG for dY
    call seedmtgpu(123)
    call randomgpu<<<32,128>>>(dY,N_PER_THD)    

Note that to compile this example, you will need to have nvcc in your PATH. Rerunning now with all the optimizations in place and calling the GPU resident RNG kernel produces these results:

pgi_mc_example% make run_CUF5
pgfortran -fast -c -Iinc ./src/mcUtils.F90 
 -o ./obj/mcUtils.o
pgfortran -Mcuda -fast -c -Iinc -Mpreprocess  
 -DITER=10 ./src/mcCUF_5.F90 -o ./obj/mcCUF_5.o
pgfortran -DUSE_GPU_RNG -Mcuda -fast -c -Iinc 
 -Mpreprocess  -DITER=10 -DMCTYPE=15 ./src/monte_drv.F90 
 -o ./obj/monte_drv_cuf5.o
nvcc -O3 -c -Iinc ./src/ 
 -o ./obj/MersenneTwister_kernel.o
pgfortran -fast  -Mcuda ./obj/monte_drv_cuf5.o ./obj/mcUtils.o 
 ./obj/mcCUF_5.o ./obj/MersenneTwister_kernel.o -o mcCUF_5.out
time  mcCUF_5.out
 ----- CUF5 -----
 Result =     3.141596
 Standard deviation =    5.0115388E-05
 Difference from real pi value =    3.3378601E-06
 Time in Seconds
    Total :    0.95668
      RNG :    0.52054
  Compute :    0.24725
Data Xfer :    0.00038
0.62user 0.32system 0:01.02elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1058minor)pagefaults 0swaps

Success! We were able to reduce our random number time and get the data transfer time near zero.


In this article, we stepped through the process of porting a simple Monte Carlo Integration algorithm to the GPU. We started by using the PGI Accelerator directive-based approach because it was quick to implement and our host code did not need to be rewritten. While we were able to achieve a much higher performance by offloading the compute intensive portions of the code to the GPU, the overall time was consumed by the overhead of creating random numbers and transferring data between the host and GPU. To achieve the maximum performance, we needed to port the algorithm to CUDA Fortran.

While developing our CUDA Fortran code, we highlighted several of the more frequently asked questions about the language. We showed the basic structure of a CUDA Fortran program and noted some restrictions. Next, we discussed how reductions can be performed in parallel and how to improve performance by fetching contiguous data for a warp instead of a single thread. Finally, we showed how CUDA C and CUDA Fortran can be inter-mixed and how to generate random numbers on the GPU using the Mersenne Twister algorithm.

With effort, by using CUDA Fortran with a CUDA C random number generator, we were able to achieve a 14x speed-up over a four threaded, auto-parallelized version run on our host.

Comparison of time between four Monte Carlo Integration implementations

  Host Fortran
with auto-parallelization
CUDA Fortran
host RNG
CUDA Fortran
Total Time 13.80947 13.52597 13.37639 0.95668
RNG 12.27999 12.09849 11.91478 0.52054
Compute 1.43082 0.24569 0.24730 0.24725
Data Transfer 0.00000 1.10420 1.02121 0.00038

Have a question or comment about CUDA Fortran, the PGI Accelerator model, or this article? Please let us know by posting on the PGI User Forum or contacting PGI Customer Support (

Flag Reference

Flag Name Description
-fast Aggregate flag which includes most common optimizations including auto-vectorization.
-Mconcur=innermost Auto-parallelize innermost loops.
-Mpreprocess Invoke the preprocessor on Fortran files. Required on Windows since file names are case insensitive.
-Mcuda Enable CUDA Fortran. Default for files with the ".cuf" extension. Required when linking CUDA Fortran objects.
-ta=nvidia Target PGI Accelerator Model code regions to the NVIDIA GPU.
-Minfo=accel Display informational messages about accelerator regions.
-Minfo=par Display informational messages about parallelization.