Technical News from The Portland Group
The PGI Accelerator Programming Model on NVIDIA GPUs
Part 3: Porting WRF
This is the third part of a series introducing the PGI Accelerator Programming Model. The first installment introduced the model with three simple programs in C and in Fortran. The second article discussed performance tuning, including a four step process: choose an appropriate parallel algorithm, optimize data movement between host and GPU, optimize the kernel (particularly the device memory accesses), and tune the kernel schedule.
Here we'll present our experiences porting a part of a large application to the PGI Accelerator Model. WRF is the Weather Research and Forecasting model, a "next-generation mesoscale numerical weather prediction system." It is a large community code, collaboratively developed with many contributors, principally National Center for Atmospheric Research (NCAR), National Oceanic and Atmospheric Administration (NOAA), Federal Aviation Administration (FAA), Air Force Weather Agency (AFWA), Naval Research Laboratory (NRL), and Oklahoma University. We've taken the latest version (3.1.1) of the WRF code and ported the WSM5 ice microphysics model to the PGI Accelerator Model. We hope you'll find this description of the process we went through to be helpful and educational. The WRF program is written in Fortran, though the porting and tuning process is language independent. This description is quite technical, and is aimed at those who have some experience with GPUs and the PGI Accelerator Programming Model. You can also skip ahead to the results table to see a summary of the performance results.
We'll close with a short summary of our plans to improve the performance even more; we invite you to come to our booth at SC09 in Portland, November 16-19, where we'll be demonstrating this on the show floor.
Large numerical weather models have always stressed current supercomputers. As the computers get faster, models have improved by additional physics, shrinking the grid sizes, shortening the time step, and lengthening the total time interval. Unlike some supercomputer applications, forecasting has a real deadline; a 24-hour forecast is only valuable if the forecast takes less than 24 hours to produce, usually much, much less. So, speeding up the forecast allows developers to add more precise computations, giving a better forecast.
WRF is a large application, about 400,000 lines of Fortran developed by a myriad of programmers over a period of years. WRF has thousands of users, who use it for weather prediction, climate modeling, air quality and wildfire research, and atmospheric research. One version of WRF was used for the SPEC CPU2006 benchmark suite, with three datasets for the SPEC benchmarking process. We had initially worked with that version, but then we moved to the latest release, version 3.1.1. The WRF source code is in the public domain, so we downloaded the source and built this using our normal compiler suite. The program is designed to run on parallel computers, using MPI for cluster parallelism and OpenMP for shared memory parallelism; hybrid MPI+OpenMP is used for today's multicore clusters. It has been ported to and used on almost all the types of supercomputers represented in the Top 500 list. WRF in particular, and weather prediction in general, is particularly well suited for porting to accelerators such as GPUs, because it typically uses single (32-bit) precision. Today's general purpose CPUs generally run single precision about twice as fast as double precision, but for GPUs available today, single precision can be as much as 8 times faster than double precision.
To test WRF, you need an input dataset. We used a CONUS (continental US) benchmark dataset, specifically a 48-hour, 12km resolution domain for October 24, 2001, using Eulerian Mass dynamics. The benchmark period is 3 hours from hours 25 through 27, with a computational cost of about 22 billion floating point operations per 72-second time step.
The Ice Microphysics Model
WRF has several single moment ice microphysics models (WSM for WRF Single Moment), a simple scheme (WSM3), a mixed phase scheme (WSM5), and a graupel (snow pellets) phase scheme. The WSM5 model represents the condensation, precipitation, and thermodynamic effects of latent heat release. The code is organized into a 3-dimensional Cartesian grid (X/Y/Z), where each column (Z) is processed independently of the others. This gives us two dimensions of parallelism to work with, as we'll see shortly.
We chose to port the WSM5 model for two reasons. First, the dataset we were testing spent a lot of time in the WSM5 model, so we would be able to measure our performance improvement. Second, one of the WRF developers, John Michalakes, had ported the same model to the NVIDIA GPU using CUDA manually, so we can compare to both the host code and a manual GPU port.
Compared to John's port, we were at one distinct disadvantage; not being weather experts, we could not change the computation in the program in any substantive way. In fact, we tried to leave as much of the source code unchanged as we could.
The WSM5 Module
As mentioned above, the code uses a 3-dimensional grid; the code uses loop index variables i, j, and k to access the X, Y, and Z dimensions, consistently. The i and j loops, corresponding to the X and Y dimensions, are fully parallel.
The WSM5 module comprises mainly two large subroutines; the code in subroutine WSM5 looks something like this:
do j = jts, jte do k = kts, kte do i = its, ite t(i,k) = th(i,k,j)*pii(i,k,j) .... enddo enddo call wsm52D( lots of arguments ) do k = kts, kte do i = its, ite th(i,k,j) = t(i,k)/pii(i,k,j) enddo enddo enddo
The bulk of the module is in subroutine WSM52D, the 2-dimensional subset. The call to WSM52D passes two-dimensional subarrays, the j-th plane of ten arrays, along with a couple dozen scalar arguments, array bounds and loop limits.
The subroutine WSM52D looks more or less like this:
initialization code do ll = 1, llimit do i = its, ite code(i) enddo do i = its, ite morecode(i) enddo do k = kts, kte do i = its, ite evenmorecode(i,k) enddo enddo do k = kts, kte do i = its, ite yetmorecode(i,k) enddo enddo ..... enddo
The main loop is about 700 lines long and contains about 20 k/i loop nests. All the arrays are accessed with two dimensions, such as work(i,k). In Fortran, the leftmost index is the stride-1 (contiguous) dimension, so putting the i loops innermost means all the arrays are accessed with stride-1 in the innermost loop. Because the X dimension is fully parallel, all the i loops can be vectorized, such as for x86 SSE instructions. The do ll loop is a simple timing loop, and has a very small trip count. The program was obviously carefully coded for good performance on today's commodity processors.
We could have started by putting the PGI Accelerator Region directives around the do ll loop, but the performance would have been much less than satisfactory, since the ll loop itself is not parallel. This directive would generate many kernels, one for each parallel i loop. This creates a great deal of overhead, launching many kernels, and the parallelism in the j dimension is lost.
We need a lot of parallelism to keep the GPU busy and take advantage of its high throughput. So what we want is one big outer i loop to run in parallel on the GPU; we also want a big outer j loop in the same routine, so we can run both the i and j loops in parallel.
So we followed this sequence of steps:
Formulated the algorithm for appropriate parallelism:
- Moved the i loops to be outermost;
- Fused all the i loops, so there's only one outer i loop;
- Inlined the j loop to WSM52D;
Added accelerator directives:
- Inserted the region directive;
- Inserted private clauses, after reading the compiler feedback.
Tuned host-GPU data movement:
- Inserted copyin, copyout, copy, and local clauses.
Tuned the kernel schedule:
- Ran the program with timing enabled;
- Tuned the kernel schedule clauses.
We'll review each of these steps in more detail here.
Making the i loop outermost
In most cases, the i loop was tightly nested inside a k loop:
do k = kts, kte do i = its, ite ... enddo enddo
So swapping the i to the outermost position was easy. There were a few array assignments:
numdt = 1
which we expanded to loop form:
do i = its, ite numdt(i) = 1 enddo
and two BLAS-type subroutine calls that we manually inlined:
do k = kts, kte call vsrec( tvec1(its), den(its,k), ite-its+1 ) do i = its, ite tvec1(i) = tvec1(i) * den0 enddo call vssqrt( denfac(its,k), tvec1(its), ite-its+1 ) enddo
Routine vsrec computes the reciprocal of each vector element, and vssqrt computes the square root, so this is equivalent to:
do k = kts, kte do i = its, ite denfac(i,k) = sqrt( den0 / den(i,k) ) enddo enddo
There was one place where the program actually was made simpler. There is a recurrence in one of the k loops, and the program computes how many times to iterate through the recurrence, for each Z column. There's another loop, indexed by n, that iterates through the recurrence. However, the programmers want the i loop to be innermost, for the reasons already mentioned, so the program was somewhat convoluted; it computed the maximum number of steps for all columns, then used a conditional in the inner loop to see if this column should be processed for this step:
do k = kts, kte do i = its, ite mstep(i) = ...number-of-iterations... enddo enddo mstepmax = 1 do i = its, ite if( mstepmax .le. mstep(i) ) mstepmax = mstep(i) enddo do n = 1, mstepmax do k = kts, kte do i = its, ite if( n .le. mstep(i) )then ...recurrence(i,k)... endif enddo enddo enddo
After outermosting the i loop, this simplified to:
do i = its, ite do k = kts, kte mstep(i) = ...number-of-iterations... enddo enddo do i = its, ite do n = 1, mstep(i) do k = kts, kte ...recurrence(i,k)... enddo enddo enddo
The computation of mstepmax went away, and the conditional comparing n is no longer needed, as it's handled by the loop limit.
The final step was to fuse all the i loops together, so we had one big outermost i loop surrounding the body of the subroutine.
Inlining the j loop
As discussed above, the j loop surrounded the call to WSM52D. We manually inlined that loop into WSM52D, along with the surrounding loops. The call to WSM52D originally passed the j-th plane or column of ten arrays; we changed that to pass the whole array instead, since the j indexing would happen inside the subroutine. This also required us to modify the code in the routine to add the j index for those three-dimensional arrays. This was the biggest change we had to make. Now the routine WSM52D had two outer loops, j and i, surrounding the body.
Insert the region directive
The next step was to insert the !$acc region directive and see if the compiler could generate GPU code. The compiler found three problems.
First, the code used the Fortran nint intrinsic; with the Beta compiler we were using, nint was not supported on the GPU. We replaced nint(m) with int(m+0.5), which is correct for positive values of m.
Second, there were some optional arguments used in the loop; access to these arguments was protected by calls to the Fortran present intrinsic. At this time, present is not allowed in accelerator code; we created two versions of the loop, one using the optional arguments and one not, so the present call would execute on the host. This suggests that we need to enhance the compiler to handle optional arguments properly in accelerator regions.
Inserting private clauses and scalarizing
Finally, and more importantly, the compiler generated informational CCFF messages (we used the -Minfo flag) telling us that parallelization of the j loop would require privatizing a number of arrays, like:
Parallelization would require privatization of array 'work2(its:ite,i2+kts)' Parallelization would require privatization of array 'numdt(its:ite)'
An array or scalar is private to a loop when each loop iteration has its own copy of the variable. Currently, the compiler will automatically and correctly privatize scalars using standard compiler analysis. Privatizing arrays requires adding another dimension to the array for the loop; the compiler does not yet attempt to automatically do this.
In WSM52D, there are a number of subroutine local arrays indexed as work(i,k) or numdt(i); since they have no j index, they can't be assigned in parallel by different iterations of the j loop. This emphasizes that it's important to read -Minfo messages produced by the accelerator compiler. In response to the privatization messages, we inserted a !$acc do directive for the j loop, with private clauses for the arrays noted by the compiler, such as private(work). For the arrays that were subscripted only by the i index, we could simplify the program even more by scalarizing those arrays. So, numdt(i) becomes numdt. This has the advantage that numdt will perhaps be assigned to a register, and never have to go to device memory at all.
Tuning the host-GPU data movement
Then, we compiled again, and again looked at the (much reduced) number of messages from the compiler. In particular, we noticed a number of data copy messages, such as:
The array delz is declared as:
real, dimension(ims:ime,kms:kme,jms:jme) :: delz
The compiler analyzed the program to find the smallest subarray of delz that was accessed, and only sends over that subarray. Since the compiler doesn't know how big the whole array is, it doesn't want to send over too much data or use too much GPU memory. On the other hand, we know that delz isn't much bigger than this region. Sending the whole array over as a single contiguous chunk is much faster than sending a subarray that has holes, so we added some copy, copyin and copyout clauses to the region directive:
!$acc region copyin(delz(:,:,:))....
We also noticed one array was assigned on the GPU and never used on the host, so we added a local clause for that array, removing the data movement altogether.
Tuning the kernel schedule
Then we looked at the kernel schedule generated. On the version of the program we had, the compiler tried to generate a number of GPU kernels. Though some of the inner k loops had recurrences, many could be executed in parallel. The compiler blindly tries to generate as much parallelism as it can, so it tried to generate three-dimensional kernels for those loops with three dimensions of parallelism. We wanted to restrict the code to just one large kernel with two dimensions of parallelism, so we added a directive to the i loop:
!$acc do kernel do i = its, ite
This tells the compiler to treat the body of the i loop as a single kernel, not to split the loop into multiple kernels. This implicitly ignores parallelism for loops contained inside the i loop, but reduces the overhead of launching many kernels, trading less potential parallelism for less overhead.
Then we recompiled; this time, the i loop body became the kernel, and the compiler generated the following schedule:
Accelerator kernel generated 278, !$acc do parallel, vector(16) 282, !$acc do parallel, vector(16)
In CUDA terms, this schedule means the loop nest will execute on the GPU as a two dimensional grid of parallel thread blocks, where each thread block is a 16x16 block of iterations. The vector(16) means 16 iterations of the loop at that line are part of the thread block, and the parallel means the thread blocks execute in parallel across that loop dimension in the grid.
So we ran the program. Our test machine has a 2.67GHz Intel Nehalem processor with two attached NVIDIA Tesla C1060 cards. The runs described here used one host core and one Tesla.
For the base case, we compiled WRF using the Intel 11.1 compilers, with the flag set:
-O3 -xT -ip -fp-model fast=2 -no-prec-div -no-prec-sqrt
The entire benchmark took 1535 seconds (25:35), of which 236 seconds (3:56) was spent in subroutine WSM5.
When we built module WSM5 with the PGI Accelerator directives enabled, the time in subroutine WSM5 dropped to 36 seconds. We ran with our performance data collection option, -ta=nvidia,time, which generated the profile output:
Accelerator Kernel Timing data module_mp_wsm5.f90 wsm52d 266: region entered 150 times time(us): total=36151893 init=108873 region=36043020 kernels=25401138 data=10641882 w/o init: total=36043020 max=252052 min=239648 avg=240286 282: kernel launched 150 times grid: [19x27] block: [16x16] time(us): total=25401138 max=169722 min=168012 avg=169340
This tells us that the program spent 10 seconds transferring data and 25 seconds executing the kernels.
We did some experimentation with the accelerator kernel schedule clauses as well. We found the best performance came from putting the vector schedule clause on the i loop and a parallel schedule clause on the j loop. We believe this is because it lets the vector i loop access the stride-1 array dimensions efficiently, and there's enough j parallelism to keep the GPU fully populated.
With this new kernel schedule, the execution time for module WSM5 dropped to 30 seconds:
Accelerator Kernel Timing data module_mp_wsm5.f90 wsm52d 265: region entered 150 times time(us): total=29794978 init=97534 region=29697444 kernels=18856280 data=10841164 w/o init: total=29697444 max=203134 min=197272 avg=197982 282: kernel launched 150 times grid:  block:  time(us): total=18856280 max=126527 min=124930 avg=125708
Relative to the non-tuned kernel schedule, the data transfer time is the same, and the kernel execution time is 34% faster. The total speedup over the host is 236/30=7.8; the speedup of just the computation part is 236/19=12.4.
Multiple cores, multiple GPUs
Comparing the GPU time to that for a single host core may not be quite fair; the GPU code uses all 240 compute cores, so the host code should try to use all four x86 cores. As mentioned, weather is a very parallel computation (as you would expect; nature computes weather in parallel continuously). The performance of WSM5 scales pretty well, where the times below are in seconds on our test Nehalem system:
The speedup of our GPU code over the four-core result is only 70/30 = 2.33.
Then we tried using multiple GPUs; the NVIDIA interface library allows for controlling multiple GPUs from a single application, but only one GPU per host thread. So we ran with multiple host threads, using one or two GPUs:
|host cores||GPUs||time||GPU data||GPU compute|
The compute time scales very well by adding another GPU. This allows us to compute the speedup of our two-GPU code over the four-core Nehalem for WSM5 as 70/17.1 = 4.1, including all the data transfer overhead, or 70/10.47 = 6.6 if we compare the best compute times only.
Comparison to hand-written CUDA
At the beginning of this article, we had pointed out that John Michalakes had manually rewritten this physics kernel in CUDA C. This allows us to compare our performance to that of John's CUDA code. On the same machine, running the same benchmark code, John's code takes 26.35 seconds in WSM5. This time comprises three parts:
- 7.17 seconds executing the actual kernel on the GPU
- 9.04 seconds transferring data from the host to the GPU and back
- 10.14 seconds copying data from the Fortran arrays to the C arrays and back.
The last part is spurious, and would be unnecessary had CUDA Fortran been available when he did his work, so we'll ignore it. Then, the total time for John's hand-written kernel is 16.21 seconds. The data transfer time is about the same as what we see in our directive-based Accelerator code, but his computational kernel is more than twice as fast as ours. Why is that?
When John converted the WSM5 module to CUDA, he applied a couple low-level tricks. One was to use the faster, lower precision implementations of the transcendental functions exp, log, and sqrt. Weather prediction apparently does not suffer much with the lower-precision; after all, the entire calculation is done in single precision. This is analogous to using the -Mfprelaxed flag for the PGI compilers, or the -no-prec-div -no-prec-sqrt flags for the Intel compilers.
Another trick was to use the faster 24-bit integer multiplication for index calculation. As long as the arrays in use are less than 16GB, 24-bit multiplication is sufficient, and takes only a single clock on the NVIDIA Tesla. On the newly announced NVIDIA Fermi, this should be unnecessary, since it implements a full 32-bit integer ALU.
On a larger scale, one important bottleneck for the GPU is simply the number of device memory loads and stores. There are many arrays in this kernel that are used only on the GPU. In the original program, these can't turn into scalars because the values were carried from one inner k loop to another. For instance, in a program like:
do j = jts, jte do i = its, ite do k = kts, kte cpm(i,k) = c(i,k,j) * c(i,k,j) enddo do k = kts, kte b(i,k,j) = c(i,k,j) - cpm(i,k) a(i,k,j) = b(i,k,j) + c(i,k,j) enddo enddo enddo
We can privatize the cpm array in the j loop, but that still leaves all the accesses to cpm as stores and loads from the device memory. If we instead fuse the two k loops, array can be scalarized:
do j = jts, jte do i = its, ite do k = kts, kte cpm = c(i,k,j) * c(i,k,j) b(i,k,j) = c(i,k,j) - cpm a(i,k,j) = b(i,k,j) + c(i,k,j) enddo enddo enddo
Now the total number of device memory accesses is reduced, which strongly affects total performance. As an example, we ran these two loops on a Tesla, and the performance improved 75% with loop fusion and scalarization. In WSM52D, there are a number of arrays to which this process can be applied, which John did manually.
So, in a few day's time, we were able to port this large kernel from the host to the GPU using the PGI Accelerator Model. We got a significant performance improvement over the host, and were able to easily enable multiple cores and multiple GPUs using OpenMP and MPI, something that is hard to do with the hand-written CUDA kernel. However, we think we can still improve the performance quite a bit, using some of the tricks that the manual CUDA C code applied.
First, we are experimenting with compiler options to enable the equivalent of the --use_fast_math option from NVCC, NVIDIA's CUDA C compiler, and to enable use of 24-bit multiplication for array indexing. Second, we are modifying the WSM5 code to scalarize a number of the temporary arrays; our initial experiments have dropped the single-GPU compute times from 18.9 seconds to under 12 seconds. This leads us to believe we can get another big positive boost on performance. Stop by the PGI Booth at SC09 in Portland in November, and we'll show our latest results.
The performance summary is shown in the following table.
|WSM5 Performance Table (seconds)|
|host cores||GPUs||time||GPU data||GPU compute||Notes|
|1||-||236||Nehalem, one core|
|2||-||125||Nehalem, two cores|
|4||-||70||Nehalem, four cores|
|1||1||36.15||10.64||25.40||initial GPU kernel|
|1||1||29.79||10.84||18.85||tuned kernel schedule|
|2||2||16.67||6.29||10.5||tuned kernel schedule|
|1||1||19.72||10.75||8.85||work in progress kernel|
|2||2||12.00||6.87||5.29||work in progress kernel|
This shows how well WSM5 scales on the multicore Nehalem, and the progression of our GPU kernel performance.
During the porting and tuning process of WSM5, only a small amount of our time, and very little of this article, was dedicated to using PGI Accelerator directives. Most of the work and most of this article focus on organizing and optimizing the loops appropriately to expose the appropriate level and style of parallelism. This, in fact, is largely the point of the PGI Accelerator model; it's a high level programming model to allow you to quickly and easily get a complex code offloaded to an NVIDIA GPU using standard C or Fortran. An important part of the process is performance measurement and tuning, which should be familiar to many of you who have migrated applications to new generations of processors. There are some new concepts specific to today's GPUs, but they are no more difficult to understand than vector instructions or multiprocessor parallelism.
It's been said that parallel programming is hard, and that's just the way it is. We can't make it simple, but using the right language and compiler, we can make it straightforward to expose and exploit the available parallelism in the program on the parallel hardware we are targeting.
There's still the question of what all this program restructuring does to host execution. In particular, the original program was carefully written to try to expose the vector operations along the i index. If we rewrite the code to expose large-scale parallelism for an accelerator, how will that affect the host execution? Will we have to support multiple versions of a program, one optimized for a host and another for a GPU or other accelerator?
One of the goals of the PGI Accelerator programming model is portability, across accelerators and also in the CPU and multicore domain. Future versions of our compiler will interpret the accelerator directives for single-core and multicore optimization, parallelizing across cores and vectorizing within a core. Watch this space for future developments.