Tuning a Monte Carlo Algorithm on GPUs
Since the introduction of PGI CUDA Fortran late last year, we've seen a dramatic rise in the number of customers using this new extension to the Fortran language. As the moderator of the PGI User Forum, I have been very busy answering questions about the language, and noting those questions that seem to be asked often or may be of interest to the wider community. For this installment of the PGInsider, I have implemented the Monte Carlo Integration algorithm to highlight some of the tips, tricks, and traps of programming for the GPU.
Monte Carlo Integration
For my sample code I chose a simple Monte Carlo Integration algorithm to compute the approximate value of pi. The code first creates a list of random points within a square. Each point is evaluated using the function:
f(x,y) = (x^2 + y^2 < 1) ? 1 : 0
The points are then summed. The approximate value of pi can then be calculated by multiplying four times the volume of the square with the mean value for f(x,y).
The code itself is very simple to understand and the algorithm is highly parallel because each f(x,y) calculation can be performed independently. Besides this, the algorithm uses a sum reduction and requires a random number generator (RNG). Both of which present interesting problems.
For the following examples source code is available for download from the PGI website.
Here is a basic Fortran implementation of a Monte Carlo Integration algorithm.
! Perform the function f(x,y) = (x^2 + y^2 < 1) ? 1 : 0 do i=1,N tempVal = X(i)*X(i) + Y(i)*Y(i) if (tempVal < 1) then temp(i) = 1 else temp(i) = 0 endif enddo ! Sum the results sumA = 0 sumSq = 0 do i=1,N sumA = sumA + temp(i) sumSq = sumSq + (temp(i)*temp(i)) enddo ! calulate the mean meanA = sumA / real(N); meanSq = sumSq / real(N); ! approximate pi results%estimate = meanA * volume * 4 results%variance = (meanSq  meanA*meanA) / (N  1)
Baseline Performance
First, let's start with the host version of the code to get our baseline performance. We compiled the code with autoparallelization enabled and ran with four threads. The system we're using is an Intel Core i7 (single socket four core Nehalem) with an attached NVIDIA S1070 (four Tesla C1060 cards). The compiler version used is PGI 10.2. (For a listing of the flags used in these builds, please see the Flag Reference section at the end of this article.) To build, run the following command:
pgi_mc_example% setenv OMP_NUM_THREADS 4 pgi_mc_example% make run_CPU pgfortran fast c Iinc ./src/mcUtils.F90 o ./obj/mcUtils.o pgfortran fast c Iinc Mpreprocess DITER=10 Mconcur=innermost Minfo=par ./src/mcCPU.F90 o ./obj/mcCPU.o montecarlo_cpu: 28, Parallel code generated with block distribution for inner loop if trip count is greater than or equal to 100 44, Loop not parallelized: may not be beneficial pgfortran fast c Iinc Mpreprocess DITER=10 DMCTYPE=0 ./src/monte_drv.F90 o ./obj/monte_drv_cpu.o pgfortran fast Mconcur=innermost Minfo=par ./obj/monte_drv_cpu.o ./obj/mcUtils.o ./obj/mcCPU.o o mcCPU.out time mcCPU.out  CPU  Result = 3.141863 Standard deviation = 5.0109735E05 Difference from real pi value = 2.6988983E04 Time in Seconds Total : 13.80947 RNG : 12.27999 Compute : 1.43082 Data Xfer : 0.00000 50.04user 1.13system 0:13.83elapsed 369%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+786745minor)pagefaults 0swaps
In this sample, we used 67108860 random points and are able to get a rough estimate of pi. We collect the total execution time, the total time to generate the random numbers, and the time to compute the f(x,y) function and sum the results. For the results run on the GPU, we will also collect the time it takes to transfer the data back and forth between the host and GPU. Due to the short run time of this algorithm, we perform 10 iterations of the RNG, f(x,y), and sum reduction. All times are in seconds.
As a first step, we created a version using the PGI Accelerator model. The PGI Accelerator model uses a directivebased approach where the programmer inserts a !$acc region around the sections of code they wish to offload to the GPU. The compiler then generates highly optimized lowlevel GPU code as well as taking care of the data management and "bookkeeping" code.
On Linux, the device drivers are powered down between each use. Because of this, each time your application runs, the devices need to be reinitialized. This initialization cost is typically about 1 second per attached device, or on my system, four seconds. To help with this, I use the pgcudainit utility. pgcudainit holds a system's devices open so they will not need to be restarted for each run.
pgi_mc_example% make run_ACC pgfortran ta=nvidia fast c Iinc Mpreprocess DITER=10 Minfo=accel ./src/mcACC.F90 o ./obj/mcACC.o montecarlo_acc: 25, Generating local(temp(:)) Generating copyin(y(:)) Generating copyin(x(:)) 32, Loop is parallelizable Accelerator kernel generated 32, !$acc do parallel, vector(256) Using register for 'temp' 41, Loop is parallelizable Accelerator kernel generated 41, !$acc do parallel, vector(256) Using register for 'temp' 42, Sum reduction generated for suma 43, Sum reduction generated for sumsq pgfortran ta=nvidia fast c Iinc Mpreprocess DITER=10 Minfo=accel DMCTYPE=1 ./src/monte_drv.F90 o ./obj/monte_drv_acc.o pgfortran fast ta=nvidia ./obj/monte_drv_acc.o ./obj/mcUtils.o ./obj/mcACC.o o mcACC.out time mcACC.out  ACC  Result = 3.141863 Standard deviation = 5.0109724E05 Difference from real pi value = 2.7036667E04 Time in Seconds Total : 13.52597 RNG : 12.09849 Compute : 0.24569 Data Xfer : 1.10420 12.96user 0.62system 0:13.61elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+131860minor)pagefaults 0swaps
While not every code is this easy, it took me approximately 10 minutes to insert the directives and realize a 6x speedup of the compute time over the host's parallelized version. However, with the random number generation and data transfer overhead, the overall time does not improve over the host. For other algorithms where the compute time has a greater share of the overall time, the PGI Accelerator model would be a very good choice. However, until the PGI Accelerator model supports a means for creating random numbers on the GPU, we'll need to port this algorithm using CUDA Fortran. This will be more difficult and time consuming to write than using PGI Accelerator directives, but hopefully we can see significant speedup. (For the record, it took me approximately 10 hours to get to the final version shown below.)
Initial CUDA Fortran Port
Let's start with a simple CUDA Fortran program. Below, we have created a device module which will contain our GPU code. It includes routines with global and device attributes. A global attribute can only be used with subroutines and can be called from host code. A device attribute can be used with either functions or subroutines, but can only be called from other device or global routines within the same device module. Likewise, access to a device module's global data is restricted to routines declared in the same module or from host code which uses the module. The main reason for this restriction is that NVIDIA currently does not have a linker that can be used on device code. Hence, there is no way to associate global symbols found outside the device module.
! Sample Device Module module mcCUF_1 ! Declare our global device variables. ! The 'device' attribute tells the compiler that this variable ! should be allocated on the GPU when the module is loaded. ! These values can be access by either the host or device code real, device :: dSum, dSumSq ! The 'constant' attribute will place the variable in the ! devices constant memory buffer. Constant memory can only be ! updated from the host. integer, constant :: N_PER contains ! A 'global' subroutine will be executed on the device and can ! be called from the host via the chevron syntax. attributes(global) subroutine montecarlo_cuf1_kernel(dX, dY, dTemp,N) use cudafor implicit none ! The dX, dY device arrays contain the random numbers real, dimension(N), device :: dX, dY ! dTemp is the output array real, dimension(N), device :: dTemp ! scalar variable passed in for the host must be passed by value integer, value :: N integer :: i real :: tempVal ! Each thread will perform the f(x,y) calculation on a single ! point. The routine will be launched in a 'grid' containing ! a number of blocks, each containing a number of threads. To ! determine which point this thread should calculate, we take ! the block number (blockIdx) multiplied by the number of ! threads in a block (blockDim), added to the thread number ! within the block (threadIdx). Since our arrays are a single ! dimension, only the "%x" value is needed. i = (blockIdx%x1)*blockDim%x + threadIdx%x ! perform the f(x,y) calculation and store the results tempVal = dX(i)*dX(i) + dY(i)*dY(i) if (tempVal < 1.0) then dTemp(i) = 1.0 else dTemp(i) = 0.0 endif end subroutine montecarlo_cuf1_kernel subroutine montecarlo_cuf1(X, Y, volume, results, N) use mcUtils use cudafor implicit none ! the host variables containing the random points ! and host copy of the temp results array real, dimension(N) :: X, Y real, dimension(N) :: temp ! The corresponding device arrays real, dimension(N), device :: dX, dY real, dimension(N), device :: dTemp ! contains the test results and timing info type(mcResults) :: results integer :: volume, N, i, istat real :: sumA, sumSq real :: meanA, meanSq ! Dim3 variables to define the grid and block size type(dim3) :: dimGrid, dimBlock ! timer variables real :: sum_start, sum_end, func_time type(cudaEvent) :: func_start, func_end ! Initialize our timing routines istat = cudaEventCreate(func_start) istat = cudaEventCreate(func_end) istat = cudaEventRecord(func_start, 0) ! Copy the random points from the host to device dX = X dY = Y ! set our Grid and Block sizes dimBlock = dim3(256,1,1) dimGrid = dim3(N/dimBlock%x,1,1) ! call our device kernel using the grid/block dimensions ! and passing in device pointers to our random point and temp array. call montecarlo_cuf1_kernel<<<dimGrid,dimBlock>>>(dX, dY, dTemp, N) ! copy the result temp array back to the host temp = dTemp ! get the function time istat = cudaEventRecord(func_end, 0) istat = cudaThreadSynchronize() istat = cudaEventElapsedTime(func_time, func_start, func_end) results%time(3) = results%time(3) + (func_time/1000) ! start timing the sum call cpu_time(sum_start) sumA = 0 sumSq = 0 ! perform the summation on the host do i=1,N sumA = sumA + temp(i) sumSq = sumSq + (temp(i)*temp(i)) enddo call cpu_time(sum_end) results%time(4) = results%time(4) + (sum_endsum_start) ! calculate the mean values meanA = sumA / real(N); meanSq = sumSq / real(N); ! calculate PI results%estimate = meanA * volume * 4 results%variance = (meanSq  meanA*meanA) / (N  1) end subroutine montecarlo_cuf1 end module mcCUF_1
However, when we run this code we immediately encounter the error "ConfigureCall FAILED:9". On a Tesla card, the maximum thread block X dimension is 512 and the maximum grid X dimension is 65535 (see the output from the pgaccelinfo utility to see your card's maximums). Because we're using 67108860 points divided by 256 as the grid size, we're exceeding the maximum. Even if we used thread blocks of size 512, it's still too many points. Later we'll fix this problem by having each thread compute more than one element, but for now let's reduce the number of points to 16777215—set using the define flag DUSE_SMALL—just to see if the program executes.
pgi_mc_example% make DFLAG=DUSE_SMALL run_CUF1 pgfortran fast c Iinc ./src/mcUtils.F90 o ./obj/mcUtils.o pgfortran Mcuda fast c Iinc Mpreprocess DUSE_SMALL DITER=10 ./src/mcCUF_1.F90 o ./obj/mcCUF_1.o pgfortran Mcuda fast c Iinc Mpreprocess DUSE_SMALL DITER=10 DMCTYPE=11 ./src/monte_drv.F90 o ./obj/monte_drv_cuf1.o pgfortran fast Mcuda ./obj/monte_drv_cuf1.o ./obj/mcUtils.o ./obj/mcCUF_1.o o mcCUF_1.out time mcCUF_1.out  CUF1  Result = 3.142020 Standard deviation = 1.0021195E04 Difference from real pi value = 4.2748451E04 Time in Seconds Total : 3.98795 RNG : 3.12032 Compute : 0.12943 Data Xfer : 0.58712 3.59user 0.46system 0:04.08elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+197625minor)pagefaults 0swaps
The compute time in this initial CUDA Fortran version on the smaller data set is about 0.13 seconds. When using the smaller number of points the parallel host version has a compute time of 0.36 seconds and the PGI Accelerator compute version is at 0.11 seconds. While we haven't quite matched the PGI Accelerator version's time, it's a good first step. Next, we'll work on reducing the data transfer cost by moving the summation to the GPU.
Sum Reductions
One of the biggest performance bottle necks in GPU programming is the cost to transfer data back and forth between the host and device. We most likely can improve our performance if we don't need to copy back our temporary results array. So instead, lets try moving the sum reduction code on to the GPU by adding a new device kernel to perform the sum in serial.
attributes(global) subroutine montecarlo_cuf2_sum(dTemp, N) implicit none real, dimension(N), device :: dTemp integer, value :: N, i dSum = 0 dSumSq = 0 do i=1,N dSum = dSum + dTemp(i) dSumSq = dSumSq + (dTemp(i)*dTemp(i)) enddo end subroutine montecarlo_cuf2_sum
Next, we need to remove host code which copies the device results array back from the GPU (temp=dTemp) as well as the host's summation loop. Then, we need to add the call to our new summation kernel:
call montecarlo_cuf2_sum<<<1,1>>>(dTemp, N)
Running the updated CUDA Fortran code we see the following output:
pgi_mc_example% make DFLAG=DUSE_SMALL run_CUF2 pgfortran Mcuda fast c Iinc Mpreprocess DUSE_SMALL DITER=10 ./src/mcCUF_2.F90 o ./obj/mcCUF_2.o pgfortran Mcuda fast c Iinc Mpreprocess DUSE_SMALL DITER=10 DMCTYPE=12 ./src/monte_drv.F90 o ./obj/monte_drv_cuf2.o pgfortran fast Mcuda ./obj/monte_drv_cuf2.o ./obj/mcUtils.o ./obj/mcCUF_2.o o mcCUF_2.out time mcCUF_2.out  CUF2  Result = 3.142021 Standard deviation = 1.0021187E04 Difference from real pi value = 4.2796135E04 Time in Seconds Total : 76.96400 RNG : 3.00475 Compute : 73.57511 Data Xfer : 0.25853 65.76user 11.24system 1:17.03elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+33778minor)pagefaults 0swaps
While we cut our data transfer time in half, the compute time using a single GPU thread processor running the summation in serial was about 600 times slower. We'll need to rewrite this code to be parallel.
It turns out that optimally performing parallel GPU reductions are quite complex to implement. However, we should be able to improve our performance by performing a partial reduction in parallel. For details on implementing parallel reductions, please refer to NVIDIA's slide deck on reductions.
Let's add a new device kernel to perform a partial sum:
attributes(global) subroutine montecarlo_cuf3_sum0(dTemp, dSumA, dSumSqA, N) implicit none ! temporary results device array real, dimension(N), device :: dTemp ! arrays to hold the partial sums real, dimension(sumN), device :: dSumA, dSumSqA integer, value :: N integer :: i, ix, iy, nthrd, start, end, share, rem real :: tempVal ! Calculate the thread number as before, however each ! thread will now perform the sum on more then one element ! of the temporary array ix = (blockIdx%x1)*blockDim%x + threadIdx%x ! The total number of threads nthrd = blockDim%x * gridDim%x ! Thread's share of the elements share = N/nthrd ! Thread's starting and ending index start = (share*(ix1)) + 1 end = start + share1 ! Add the extra elements to the end if (ix .eq. sumN) then rem = N  (share*nthrd) end = end + rem endif ! perform the partial summation. dSumA(ix) = 0 dSumSqA(ix) = 0 do i=start,end dSumA(ix) = dSumA(ix) + dTemp(i) dSumSqA(ix) = dSumSqA(ix) + (dTemp(i)*dTemp(i)) enddo end subroutine montecarlo_cuf3_sum0
Next, declare the device summation arrays in the host code and then call the new partial sum kernel:
... ! add the summation arrays in the host real, dimension(sumN), device :: dSumA, dSumSqA ... call montecarlo_cuf3_sum0<<<sumN,1>>>(dTemp, dSumA, dSumSqA, N) call montecarlo_cuf3_sum<<<1,1>>>(dSumA, dSumSqA) ...
And finally, rerun with the new parallel partial sum kernel:
pgi_mc_example% make DFLAG=DUSE_SMALL run_CUF3 pgfortran Mcuda fast c Iinc Mpreprocess DUSE_SMALL DITER=10 ./src/mcCUF_3.F90 o ./obj/mcCUF_3.o pgfortran Mcuda fast c Iinc Mpreprocess DUSE_SMALL DITER=10 DMCTYPE=13 ./src/monte_drv.F90 o ./obj/monte_drv_cuf3.o pgfortran fast Mcuda ./obj/monte_drv_cuf3.o ./obj/mcUtils.o ./obj/mcCUF_3.o o mcCUF_3.out time mcCUF_3.out  CUF3  Result = 3.142020 Standard deviation = 1.0021206E04 Difference from real pi value = 4.2724609E04 Time in Seconds Total : 6.41454 RNG : 3.06278 Compute : 2.97773 Data Xfer : 0.25702 5.80user 0.66system 0:06.49elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+33780minor)pagefaults 0swaps
With the parallel partial summation, we were able to reduce the time from 74 seconds down to 3.
Improving Warp Performance
When programming for a CPU using multiple threads, the programmer generally wants each thread working on a contiguous block of memory. This helps performance since the data will be more likely held in the CPU's local memory cache rather then being fetched from main memory with each read or write.
On the NVIDIA GPU, threads are organized in groups called 'warps' (a Tesla warp has 32 threads). Each thread in a warp executes the same instruction at the same time. This includes memory fetches. So if thread n fetches element i of our temp array, the other threads in the warp will need to wait for this fetch to complete. However, if all the threads are working on the same contiguous block of memory (i.e. thread n works on element i, thread n+1 on element i+1, etc.), the entire block of memory can be fetched at the same time.
For a good overview of the NVIDIA GPU Architecture, please see Michael Wolfe's article Understanding the CUDA Data Parallel Threading Model in this issue of the PGInsider.
Let's change our partial summation kernel to have each warp work on a contiguous block of memory. Also, we should use local variables to calculate the partial sum so that we don't need to update global memory with each iteration of the loop.
attributes(global) subroutine montecarlo_cuf4_sum0(dTemp, dSumA, dSumSqA, N) implicit none real, dimension(N), device :: dTemp real, dimension(THREAD_N), device :: dSumA, dSumSqA integer, value :: N integer :: i, ix, iy, nthrd, start, end, share, rem real :: tempVal, sum, sumSq ix = (blockIdx%x1)*blockDim%x + threadIdx%x nthrd = blockDim%x * gridDim%x sum = 0 sumsq = 0 do i=ix,N,nthrd sum = sum + dTemp(i) sumSq = sumSq + (dTemp(i)*dTemp(i)) enddo dSumA(ix) = sum dSumSqA(ix) = sumSq end subroutine montecarlo_cuf4_sum0
We can also take this opportunity to address the problem we encountered earlier with the limited number of grid blocks. Similar to the partial sum, we can use a fixed number of grids and blocks. Then, each thread will work on a variable number of elements in our random point array. This way the number of elements is only limited by the amount of memory we have on the GPU and the precision of a REAL data type.
Let's change our f(x,y) kernel to calculate multiple elements per thread.
attributes(global) subroutine montecarlo_cuf5_kernel(dX, dY, dTemp, N) implicit none real, dimension(N), device :: dX, dY real, dimension(N), device :: dTemp integer, value :: N integer :: i, ix, iy, nthrd, start, end, share, rem real :: tempVal ix = (blockIdx%x1)*blockDim%x + threadIdx%x nthrd = blockDim%x * gridDim%x do i=ix,N,nthrd tempVal = dX(i)*dX(i) + dY(i)*dY(i) if (tempVal < 1.0) then dTemp(i) = 1.0 else dTemp(i) = 0.0 endif enddo end subroutine montecarlo_cuf5_kernel
Rerun with the small data set.
pgi_mc_example% make DFLAG=DUSE_SMALL run_CUF4 pgfortran fast c Iinc ./src/mcUtils.F90 o ./obj/mcUtils.o pgfortran Mcuda fast c Iinc Mpreprocess DUSE_SMALL DITER=10 ./src/mcCUF_4.F90 o ./obj/mcCUF_4.o pgfortran Mcuda fast c Iinc Mpreprocess DUSE_SMALL DITER=10 DMCTYPE=14 ./src/monte_drv.F90 o ./obj/monte_drv_cuf4.o pgfortran fast Mcuda ./obj/monte_drv_cuf4.o ./obj/mcUtils.o ./obj/mcCUF_4.o o mcCUF_4.out time mcCUF_4.out  CUF4  Result = 3.142038 Standard deviation = 1.0021203E04 Difference from real pi value = 4.4536591E04 Time in Seconds Total : 3.44605 RNG : 2.99784 Compute : 0.07027 Data Xfer : 0.25840 3.27user 0.23system 0:03.52elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+33781minor)pagefaults 0swaps
Then run again using the larger number of points.
pgi_mc_example% make clean pgi_mc_example% make run_CUF4 pgfortran fast c Iinc ./src/mcUtils.F90 o ./obj/mcUtils.o pgfortran Mcuda fast c Iinc Mpreprocess DITER=10 ./src/mcCUF_4.F90 o ./obj/mcCUF_4.o pgfortran Mcuda fast c Iinc Mpreprocess DITER=10 DMCTYPE=14 ./src/monte_drv.F90 o ./obj/monte_drv_cuf4.o pgfortran fast Mcuda ./obj/monte_drv_cuf4.o ./obj/mcUtils.o ./obj/mcCUF_4.o o mcCUF_4.out time mcCUF_4.out  CUF4  Result = 3.141860 Standard deviation = 5.0109786E05 Difference from real pi value = 2.6726723E04 Time in Seconds Total : 13.37639 RNG : 11.91478 Compute : 0.24730 Data Xfer : 1.02121 12.99user 0.42system 0:13.47elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+132084minor)pagefaults 0swaps
While we've had a lot of programming effort so far, our performance is only slightly better than the PGI Accelerator directives on the GPU or on the original host version. The next steps then are to begin moving the random number generator to the GPU and removing the remaining data transfers.
Calling a CUDA C Random Number Generator
For the final section, let's work on generating the random numbers on the device. Not only will this eliminate almost all data transfers, we should be able to run the RNG in parallel.
While there isn't a builtin RNG for the GPU, the NVIDIA CUDA SDK includes an implementation of the Mersenne Twister algorithm. The code is written in CUDA C but we can call a CUDA C kernel from a CUDA Fortran program by adding an interface block to our host code and using Fortran 2003 ISO_C_BINDING. Also, we'll need to slightly modify the CUDA C SDK version of the code to remove the CUDA SDK logging functions and add extern "C" to inhibit nvcc's name mangling:
interface ! C function to load the Mersenne Twister input data set subroutine loadMTGPU(dat_path) bind(c,name='loadMTGPU') use iso_c_binding character(len=80) :: dat_path end subroutine loadMTGPU ! C function to initialize the random seed on the GPU subroutine seedMTGPU(seed) bind(c,name='seedMTGPU') use iso_c_binding integer(c_int), value :: seed ! Pass by value end subroutine seedMTGPU ! CUDA C Mersenne Twister algorithm to generate an array of random ! numbers in parallel. Note that nvcc has added '__entry' to the ! end of the function name. subroutine RandomGPU(d_Rand, n_per_rng) bind(c,name='randomgpu__entry') use iso_c_binding real(c_float), dimension(:), device :: d_Rand integer, value :: n_per_rng ! pass by value end subroutine RandomGPU end interface
The Mersenne Twister kernel is set to use 4096 threads in a 32 by 128 grid. Each random number array must be a multiple of 4096. Hence, we'll need to pad our random arrays before calling the RNG.
To find the optimal grid size requires some experimentation with various sizes. I tried a few other grid sizes, but settled for a 32 by 128 grid for the f(x,y) function kernel. I'm still not sure it's optimal.
! pad the random point arrays to be a multiple of THREAD_N (4096) if (MOD(N,THREAD_N) .ne. 0) then N_PER_THD = N/THREAD_N+1 else N_PER_THD = N/THREAD_N endif if (MOD(Ns,2) .ne. 0) then N_PER_THD = N_PER_THD  MOD(Ns,2) + 2 else N_PER_THD = N_PER_THD endif Ns = N_PER_THD * THREAD_N allocate(dX(Ns), dY(Ns)) istat = cudaEventRecord(rng_start, 0) ! initalize and then call the RNG for dX write(dpath,'(a)') 'mtdata/MersenneTwister.dat'//char(0) call loadMTGPU(dpath) call seedmtgpu(777) call randomgpu<<<32,128>>>(dX,N_PER_THD) ! Reseed and call the RNG for dY call seedmtgpu(123) call randomgpu<<<32,128>>>(dY,N_PER_THD)
Note that to compile this example, you will need to have nvcc in your PATH. Rerunning now with all the optimizations in place and calling the GPU resident RNG kernel produces these results:
pgi_mc_example% make run_CUF5 pgfortran fast c Iinc ./src/mcUtils.F90 o ./obj/mcUtils.o pgfortran Mcuda fast c Iinc Mpreprocess DITER=10 ./src/mcCUF_5.F90 o ./obj/mcCUF_5.o pgfortran DUSE_GPU_RNG Mcuda fast c Iinc Mpreprocess DITER=10 DMCTYPE=15 ./src/monte_drv.F90 o ./obj/monte_drv_cuf5.o nvcc O3 c Iinc ./src/MersenneTwister_kernel.cu o ./obj/MersenneTwister_kernel.o pgfortran fast Mcuda ./obj/monte_drv_cuf5.o ./obj/mcUtils.o ./obj/mcCUF_5.o ./obj/MersenneTwister_kernel.o o mcCUF_5.out time mcCUF_5.out  CUF5  Result = 3.141596 Standard deviation = 5.0115388E05 Difference from real pi value = 3.3378601E06 Time in Seconds Total : 0.95668 RNG : 0.52054 Compute : 0.24725 Data Xfer : 0.00038 0.62user 0.32system 0:01.02elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+1058minor)pagefaults 0swaps
Success! We were able to reduce our random number time and get the data transfer time near zero.
Conclusion
In this article, we stepped through the process of porting a simple Monte Carlo Integration algorithm to the GPU. We started by using the PGI Accelerator directivebased approach because it was quick to implement and our host code did not need to be rewritten. While we were able to achieve a much higher performance by offloading the compute intensive portions of the code to the GPU, the overall time was consumed by the overhead of creating random numbers and transferring data between the host and GPU. To achieve the maximum performance, we needed to port the algorithm to CUDA Fortran.
While developing our CUDA Fortran code, we highlighted several of the more frequently asked questions about the language. We showed the basic structure of a CUDA Fortran program and noted some restrictions. Next, we discussed how reductions can be performed in parallel and how to improve performance by fetching contiguous data for a warp instead of a single thread. Finally, we showed how CUDA C and CUDA Fortran can be intermixed and how to generate random numbers on the GPU using the Mersenne Twister algorithm.
With effort, by using CUDA Fortran with a CUDA C random number generator, we were able to achieve a 14x speedup over a four threaded, autoparallelized version run on our host.
Comparison of time between four Monte Carlo Integration implementations
Host Fortran with autoparallelization 
PGI Accelerator Model 
CUDA Fortran with host RNG 
CUDA Fortran with CUDA C RNG 


Total Time  13.80947  13.52597  13.37639  0.95668 
RNG  12.27999  12.09849  11.91478  0.52054 
Compute  1.43082  0.24569  0.24730  0.24725 
Data Transfer  0.00000  1.10420  1.02121  0.00038 
Have a question or comment about CUDA Fortran, the PGI Accelerator model, or this article? Please let us know by posting on the PGI User Forum or contacting PGI Customer Support through our online technical support request form.
Flag Reference
Flag Name  Description 

fast  Aggregate flag which includes most common optimizations including autovectorization. 
Mconcur=innermost  Autoparallelize innermost loops. 
Mpreprocess  Invoke the preprocessor on Fortran files. Required on Windows since file names are case insensitive. 
Mcuda  Enable CUDA Fortran. Default for files with the ".cuf" extension. Required when linking CUDA Fortran objects. 
ta=nvidia  Target PGI Accelerator Model code regions to the NVIDIA GPU. 
Minfo=accel  Display informational messages about accelerator regions. 
Minfo=par  Display informational messages about parallelization. 