September 2010
Porting the SPEC Benchmark BWAVES to GPUs with CUDA Fortran
Overview
The SPEC® CPU2006 benchmarks are designed to compare compute-intensive workloads on different computer systems. They are organized into two groups: a 12 benchmark integer suite, CINT2006 and a 17 benchmark floating point suite, CFP2006. BWAVES a member of the CFP2006 suite is one of the benchmarks used to assess CPUs for compute-intensive workloads.
The CUDA architecture facilitates the porting of compute-intensive applications to the GPU. CUDA C was first introduced by NVIDIA in 2007, and CUDA Fortran was introduced in 2009 in a joint effort between The Portland Group (PGI) and NVIDIA. Both CUDA C and CUDA Fortran are essentially standard-compliant C and Fortran with a handful of extensions that allow portions of the computation to be offloaded to the GPU.
CUDA is a hybrid programming model. Both the host (or CPU) and the device (or GPU) can execute parts of the code stream independently and in some case simultaneously. CUDA specific code modifications are usually relatively small and typically focus on two areas: host/device data transfers and kernels or the code that will run on the device. We need to note here that SPEC run rules forbid testers from modifying the SPEC source code. In this light, the performance data presented here on a CUDA Fortran implementation of BWAVES does not qualify as a rules-compliant run, but can nevertheless give useful comparison data.
We chose to port BWAVES using CUDA because it is not a good fit for the GPU. Neither the grid size (65x65x64) nor the layout of the interleaved dependent variable arrays are structured for optimizing bandwidth utilization. While not a good fit for the GPU, BWAVES is an excellent fit for multi-core CPU systems as evidenced by the performance data available on the SPEC website. It is because BWAVES has been so well optimized to run on CPUs that we chose it for this comparison.
Following are three sections covering 1) detailed breakdown of the BWAVES algorithm including a CPU performance profile. 2) discussion of the CUDA Fortran port considerations, including strategies for porting certain kernels to the device, and 3) results validation and performance comparisons of CPU and CUDA runs.
The BWAVES Benchmark
BWAVES simulates the evolution of a blast wave using a discretization of the compressible Navier-Stokes equations on a uniform, structured, three-dimensional mesh. This mesh is periodic in each direction and contains 65 points in the x and y directions and 64 points in the z direction. There are five dependent variables at each point in the mesh corresponding to density, energy and velocity in the x, y and z directions. The code runs for 100 time steps. The time-step loop is composed of four sections:
- determination of the time step via CFL condition,
- calculation of the left-hand side (LHS) terms including Jacobian evaluations,
- calculation of the right-hand side (RHS) terms, and
- the BiCGSTAB routine
The first component, determining the time step, is a simple maximum reduction of the velocity field and is executed in very little time relative to the other components. It is not included in this analysis. The LHS calculation assembles the components of the A matrix, the RHS calculation similarly calculates the components of the b vector, and the BiCGSTAB routine is an iterative method of solving the system Ax = b for x using the bi-conjugate gradient method. The BiCGSTAB algorithm as implemented in BWAVES is listed in Figure 1. The iterative loop in the BiCGSTAB routine consists of two matrix-vector multiplications, five reductions, several daxpy operations and parameter updates.
(1) | r = Ax | Matrix-vector multiplication |
(2) | r = b − r | |
(3) | rˆ = r, p = v = t = 0, α = ω_{0} = ρ_{0} = 1 | |
(4) | while (∑ r · r − ε > 0): | Reduction |
(5) | ρ = ∑ r · rˆ | Reduction |
(6) | β = ρ/ρ_{0} ∗ α/ω_{0} | |
(7) | p = r + β(p − ω_{0}ν) | |
(8) | ν = Ap | Matrix-vector multiplication |
(9) | α = ρ / ∑ rˆ · ν | Reduction |
(10) | r = r − αv | |
(11) | t = Ar | Matrix-vector multiplication |
(12) | ω = ∑ t · r / ∑ t · t | Reductions |
(13) | x = x + αp + ωr | |
(14) | r = r − ωt | |
(15) | ρ_{0} = ρ, ω_{0} = ω |
Figure 1: The BiCGSTAB algorithm in BWAVES, where A is a matrix from the LHS calculation, lowercase roman letters are vectors, and greek letters are parameters. b is the vector from the RHS calculation, and x is the solution vector. The iterative loop starting in line (4) consists of two matrix-vector multiplications, five reductions, and several daxpy operations and parameter updates.
CPU Performance Profile
To understand how to implement this algorithm in CUDA Fortran and to determine where to concentrate most of the porting effort we began by profiling the original code.
Table 1 below contains performance data for BWAVES on two systems: a single quad-core, 2.83 GHz Intel Xeon E5440 (Harpertown) system and a dual quad-core, 2.67 GHz Intel Xeon X5550 (Nehalem) system. Both systems were running Red Hat Linux. The benchmark was compiled on these systems using the PGI 10.5 Fortran compiler with flags -fastsse -Mfprelaxed for single-core runs and -fastsse -Mfprelaxed ‑Mconcur=levels:5,allcores for the four- and eight-core runs. The table shows overall wall-clock times for the 100 time-step run, the time spent calculating the left-hand side (LHS) and right-hand side (RHS) terms and the time spent in the BiCGSTAB routine. In all cases, the time spent in the BiCGSTAB routine dominated at around 90% of overall CPU time.
CPU | Number of CPU Cores | CPU Overall Time (s) | LHS Time (s) | RHS Time (s) | BiCGSTAB Time (s) |
Harpertown (E5440, 2.83 GHz) | 4 1 | 580 685 | 39 55 | 7 10 | 533 618 |
Nehalem (X5550, 2.67 GHz) | 8 1 | 81 309 | 6 33 | 1.5 7 | 73 267 |
Table 1: Overall and component wall-clock times for BWAVES on two CPU platforms. The components are the left-hand side (LHS) calculation including the Jacobian, the right-hand side (RHS) calculation, and the BiCGSTAB routine which includes the matrix-vector multiplication.
Further examination of the performance using PGPROF shows that the matrix-vector multiplication routine mat.times.vec() that is called from the BiCGSTAB routine in steps 1, 8, and 11 of Figure 1 takes approximately 75% of total wall-clock time. The CPU matrix-vector multiply code is listed in Figure 2.
do k=1,nz km1=mod(k+nz-2,nz)+1 kp1=mod(k,nz)+1 do j=1,ny jm1=mod(j+ny-2,ny)+1 jp1=mod(j,ny)+1 do i=1,nx im1=mod(i+nx-2,nx)+1 ip1=mod(i,nx)+1 do l=1,nb y(l,i,j,k)=0.0d0 do m=1,nb y(l,i,j,k)=y(l,i,j,k)+ 1 a(l,m,i,j,k)*x(m,i,j,k)+ 2 axp(l,m,i,j,k)*x(m,ip1,j,k)+ 3 ayp(l,m,i,j,k)*x(m,i,jp1,k)+ 4 azp(l,m,i,j,k)*x(m,i,j,kp1)+ 5 axm(l,m,i,j,k)*x(m,im1,j,k)+ 6 aym(l,m,i,j,k)*x(m,i,jm1,k)+ 7 azm(l,m,i,j,k)*x(m,i,j,km1) enddo enddo enddo enddo enddo
Figure 2: Matrix-vector code used in the CPU BiCGSTAB algorithm of BWAVES. At each point in physical space, the sum of the product of seven 5x5 matrices with five element vectors is calculated. The seven matrices represent terms in the seven-point stencil for each point which includes the centered value and its nearest neighbor in each direction. As a result each component of the a, axp, ... matrices are accessed just once in this loop. The elements of the x vector are accessed multiple times via shifting the indices to get nearest neighbor information.
Several aspects of this loop are worth noting. First, when calculating the result at any point in the mesh, information from the point's nearest neighbors is used by both the matrix and vector terms. However, the neighboring information is represented differently for the two terms. In the case of the matrix quantities, seven different matrices are used; one centered and six neighbors. For the vector, only a single array is used with shifted indices. The result is that each component of each matrix in the entire nested loop is accessed only once, whereas each element of the vector is used 35 times (five times for each component of the seven-point stencil). This access pattern will an important consideration in the CUDA implementation.
CUDA Fortran Implementation
This section starts with a review of the CUDA architecture and programming model, and highlights some of the performance issues to be aware of when porting sections of the CPU code to CUDA. Next is a discussion on determining which sections of the code to run on the device. Included here is a discussion on using asynchronous data transfers to efficiently communicate between the host and device. Finally, the section wraps up with a discussion of porting the BiCGSTAB algorithm and in particular the matrix-vector multiplication routine to CUDA.
CUDA Architecture and Programming Model
The CUDA architecture supports hybrid computing. This means both the host (CPU) and device (GPU) can be used to compute portions of the code simultaneously. A key advantage of this hybrid model is the ability to modify code incrementally. One performance caveat of this model is that data transfers between host and device through the PCI bus tend to be the primary performance bottleneck.
Current CUDA-enabled GPUs contain from tens to hundreds of processor cores and are capable of running tens of thousands of threads concurrently. The processor cores on a GPU are grouped into multiprocessors. The programming model mimics this grouping. A subroutine, or kernel, running on the device is launched with a grid of threads grouped into thread blocks. The threads blocks of the programming model map to multiprocessors in hardware. The number of thread blocks that can reside on a single multiprocessor at any instant in time is limited by the number of shared multiprocessor registers and the amount of shared memory. As a result far more threads can be resident on a multiprocessor than there are thread processors. Because context switches on the GPU are very inexpensive, such oversubscription of processors by threads is the primary means for hiding memory latencies. Threads waiting on data loads and stores can be efficiently swapped out for other threads that are ready to perform computations.
Even with the prospect of hiding latencies described above, one of the primary performance challenges with CUDA, is efficient data movement. With BWAVES the challenge is even more acute because there is relatively little computation performed per element.
Efficient Host to Device Data Transfers
There are two aspects to efficient data movement: data transfer between the host memory and device memory, and onboard data transfer between the device memory and actual GPU chip. We begin with a discussion of how to efficiently perform transfers between the host and device.
The bandwidth of the PCI bus is over an order of magnitude slower than the bandwidth between the device memory and GPU chip. Therefore special emphasis needs to be given to limiting and hiding PCI traffic. The first and most obvious approach is to limit PCI traffic by running as much code on the device as possible. In our case, porting all four components of the time-step loop to CUDA would eliminate all PCI traffic except the initialization and final data retrieval. However, not all code needs to be ported to CUDA to avoid the cost of PCI transfers.
Asynchronous transfers can be used to overlap computation on the device (and host) with PCI memory transfers. To see how we can use the asynchronous mechanism, we'll first need to look at the dependencies between sections of the code. The calculation of the LHS and RHS components are independent of one another. As such they are candidates for one component being calculated on the host while the other is calculated on the device. To decide which component to run where, we need to consider the potential speedup and also the amount of PCI traffic required for each component. Both of these factors suggest that the LHS component is the better candidate for implementing on the device. First, the LHS calculation takes approximately five times as long to run as the RHS. Second, the result of the LHS contains 35 times as much data as the result of the RHS calculation. (Seven 5x5-word matrices per mesh element versus one 5-word vector per mesh element.) Calculating the LHS component on the device presents the opportunity to both greatly improve the LHS performance and hide the host to device PCI transfer of the single vector array generated by the RHS calculation.
Porting the BiCGSTAB Routine
Having looked at how to distributed the code between the host and device, we now turn our attention to the implementation of the kernels to run on the device.
The algorithm for BiCGSTAB depicted in Figure 1 contains several different operations including matrix-vector multiplications in lines 1, 8, and 11, reductions in lines 4, 5, 9, and 12, daxpy-like vector operations in lines 2, 7, 10, and 13-14, as well as scalar parameter updates in lines 6 and 15. We begin with the most time-consuming operation of BiCGSTAB, the matrix-vector multiplication.
Matrix-Vector Multiplication
From our earlier performance analysis, we discovered that the matrix-vector multiplication listed in Figure 2 takes up by far the majority of the CPU time. Therefore, we'll pay special attention to its CUDA implementation.
As described previously, at each point in the physical mesh the routine calculates the sum of a multiplication of seven 5x5 matrices with a five element vector. Each element of the seven a* matrices is used only once and each element of the vector x is used a total of 35 times. The use of on-chip shared memory to store elements of the vector x can reduce the amount of traffic to device memory, but shared memory cannot help with reducing the number of accesses to the seven a* matrices. Thus we need to ensure we are efficiently retrieving the elements of the seven matrices from device memory. Before discussing how this is done we need to talk about the notion of data coalescing.
In terms of data movement on the device, the primary performance issue deals with how memory accesses can be coalesced into a single transaction if certain access requirements are met. The details of how coalescing works can be found in the CUDA Programming and Best Practices Guides, but in a nutshell the optimum memory throughput occurs when contiguous threads access contiguous words in memory. More precisely, pending certain alignment properties, 16 contiguous threads (on the C1060) or 32 contiguous threads (on the C2050) can access an equivalent number of contiguous elements from device memory in as few as a single transaction.
A typical strategy for coalescing mesh-based problems is to have each thread block work on one or more two-dimensional tiles in physical space. To ensure coalesced loads, the leading dimension of the tile is usually a multiple of 16 or 32. Data that are reused, such as components of the x vector, are placed in shared memory so they can be accessed by different threads within the block. This makes stencil calculations easy and efficient because access to noncontiguous data in shared memory does not suffer the performance penalty that noncontiguous accesses to device memory suffer.
While this type of tiling is a plausible approach in general, it is not an appropriate strategy for this matrix-vector multiplication for several reasons. First, is the size of the mesh. At 65x65x64 it is not conducive to tiles with dimensions that are multiples of 16 or 32. The last tile in each horizontal direction would greatly underutilize the threads in the block. Second, there is a limit to the amount of shared memory on a multiprocessor. Multiple two-dimensional shared-memory tiles with 16 or 32 points in the leading direction, where each point contains five 64-bit elements of the vector, would greatly limit the occupancy, or the number of threads that are resident on a multiprocessor at one time. (Recall maximizing occupancy helps to hide memory latency.)
Because the data at a grid point are interleaved, it is not necessary to have 16 or 32 points in the leading dimension of the tile to achieve coalescing. As an example, we can launch a kernel with thread blocks containing 25 threads which map to one point in the physical mesh. The 25 contiguous elements of one of the seven matrices at that point can be accessed in as few as a single transaction. However, there are several problems with this approach. The first is that a group of 32 threads, referred to as a warp, is executed in SIMD fashion on the multiprocessor. This implies that a block of 25 threads is essentially a block of 32 threads were seven of the threads do not perform any work. The second problem is that there are a maximum of eight thread blocks per multiprocessor at any one time, resulting in a maximum of 200 threads per multiprocessor. This once again greatly limits occupancy. The third issue is that in such an arrangement each element of the x vector can only be used within that block five times out of a possible 35 times that they are used overall. So each element of the x vector would still have to be reloaded from device memory seven times.
One way to improve on the above approach is to map a block of 125 threads to five consecutive mesh points. This improves thread block effectiveness from 25 of 32 to 125 of 128. Moreover the five-point tile fits evenly into the 65-point mesh points in the first direction. This eliminates any prospect of dormant threads in the last blocks of the first dimension. This approach also addresses the other two issues identified above. The maximum number of threads per multiprocessor is now 1000 based on the eight block limit. For a block of threads calculating the five components of y at points (i0:i0+4,j,k), the elements of x stored in shared memory are the five components at points (i0:i0+4,j-1:j+1,k-1:k+1) and additionally (i0-1,j,k) and (i0+5,j,k). The components of x at (i0:i0+4,j,k) in shared memory are accessed either 10 or 15 rather than five times within the block, reducing the number of times it is loaded from device memory from seven to either five or six times.
Taking the analysis one step further, we can improve the reuse of the x vector elements by looping in the third direction within the kernel so that each block of threads processes a ribbon of 5 by 64 elements. This is demonstrated in Figure 3. This kernel would launch with (65/5)*65 blocks instead of the (65/5)*65*64 blocks as in the previous case. Shared memory is used in the same manner as shown earlier, however the components at (i0:i0+4,j,k) and (i0:i0+4,j,k+1) are reused when k is incremented rather than reloaded from device memory. This results in the interior points in the ribbon being loaded from device memory just three times overall, once for its vertical ribbon and for the j±1 ribbons.
Figure 3: The tile used in the CUDA Fortran implementation of the matrix-vector multiplication kernel for two consecutive iterations of the loop over the z direction. Each block of threads evaluates components of the output vector at the five interior elements of the tile for each iteration in the z direction. The 10 orange (lighter) elements, five of which are interior in both the k and k+1 steps, represent components of the input vector that get reused from one iteration of the k loop to the next. Upon completion of the loop over the z direction, each block of threads will have calculated results for an xz ribbon of five by 64 elements.
Up to this point we have discussed coalescing and efficient shared memory use for the matrix multiplication input values. Now we'll look briefly at the output vector y. Let's start by comparing the CUDA Fortran code shown on Figure 4 with the equivalent section of CPU code in Figure 2. The 125 threads of a block access contiguous elements of the matrix arrays with indices l=1:5, m=1:5, and i=i0:i0+4. The variable il=i-i0+1 is a local index for i used to access the shared memory arrays. The results of the multiplications by the 125 threads are stored in the 125-word shared memory array sum_s which acts as a partial sum. Following this, syncthreads() is called to ensure all threads in the block have completed the partial sum. Then 25 of the threads in a block perform a final reduction into y. Similarly, the variables p and pl serve an analogous role in mapping the first 25 threads to elements of the vectors as i and il do in mapping all 125 threads of a block to elements of the matrices.
sum_s(l,m,il) = & a(l,m,i,j,k) * x_000_s(m,il )+ & axp(l,m,i,j,k) * x_000_s(m,il+1)+ & ayp(l,m,i,j,k) * x_0p0_s(m,il )+ & azp(l,m,i,j,k) * x_00p_s(m,il )+ & axm(l,m,i,j,k) * x_000_s(m,il-1)+ & aym(l,m,i,j,k) * x_0m0_s(m,il )+ & azm(l,m,i,j,k) * x_00m_s(m,il ) call syncthreads() if (t <= 25) then y(l,p,j,k) = sum_s(l,1,pl)+ sum_s(l,2,pl)+ & sum_s(l,3,pl)+ sum_s(l,4,pl)+ & sum_s(l,5,pl) endif
Figure 4: Matrix-vector multiplication in CUDA Fortran. This section of code is inside a loop over the k index, where each block of threads calculates results in a ribbon of mesh points, five points in i and all 64 points in k. For each iteration in k, the x_*_s shared-memory arrays are updated (not shown). Results are first stored in a partial sum shared-memory array sum_s. After all threads in the block have completed the partial summation, 25 of the 125 threads perform the final summation and store the result in y. The indices il=1:5 and pl=1:5 are unit-offset variants of i and p.
Reductions
In addition to matrix-vector multiplication, the reduction operation is prevalent in the BiCGSTAB algorithm. The reductions appearing in lines 4, 5, 9, and 12 of Figure 1 are implemented in two steps. The first step, done on the device, is a reduction within each thread block. The second step is to return the resulting partial sums to the host for the final reduction. The reason for doing the final reduction on the host is twofold. First, performing the final reduction on the device would require either use of atomic functions to accumulate between blocks, or launching a separate kernel with just one block. Neither operation is particularly efficient. Second, the result needs to be brought back to the host in any event because it is used by the next kernel. For example, the final reduction of step 5 is done on the host, where its result, ρ, is used to calculate β on the host in step 6. β is then placed into constant memory before launching the next kernel which performs step 7. Similarly α in step 9 and ω in step 12 are put into constant memory before launching the kernels which calculate steps 10 and 13-14. The use of partial sums or multiple accumulators for reductions is both more efficient and more accurate than using a single accumulator.
Results
Performance data for the CUDA Fortran implementation of BWAVES is shown in Table 2 below. As with the CPU runs, the overall times in seconds are provided along with a breakdown of time spent in the LHS, RHS, and BiCGSTAB components of the runs (in parentheses). For each platform, two sets of numbers are presented:
- concurrent runs where the LHS calculation on the device overlaps the data transfers and computation of the RHS on the host, and
- serial runs where no overlap occurs.
The serialization was achieved through enabling CUDA profiling which effectively changes the nonblocking data transfers into blocking transfers. Only the time taken to launch the LHS kernel is recorded for the LHS time in the concurrent runs, and is therefore not reported. The actual time required for computing the LHS on the GPU is given by the first number in parentheses for the serial runs. In some cases it appears that the BiCGSTAB routine takes longer for the concurrent run than it does for the serial run on the same platform. However, this is not the case. In these cases the RHS calculation on the host and the subsequent host to device data transfer complete before the LHS. The additional time reported simply reflects the time waiting for the LHS calculation to finish.
CPU | CPU Cores | C1060 | C2050 (ECC on) | C2050 (ECC off) | |||
Concurrent | Serial | Concurrent | Serial | Concurrent | Serial | ||
Harpertown (E5440, 2.83 GHz) | 4 | 51 | 59 | 40 | 45 | 34 | 39 |
(-/7/43) | (11/7/41) | (-/7/32) | (5/7/33) | (-/7/27) | (4/7/28) | ||
1 | 52 | 63 | 43 | 49 | 38 | 43 | |
(-/11/40) | (11/10/41) | (-/10/32) | (5/10/33) | (-/10/27) | (4/11/28) | ||
Nehalem (X5550, 2.67 GHz) | 8 | 50 | 53 | 36 | 39 | 31 | 33 |
(-/1.5/48) | (11/1.5/41) | (-/1.5/35) | (5/1.5/33) | (-/1.5/29) | (4/1.5/27) | ||
1 | 50 | 59 | 40 | 46 | 35 | 39 | |
(-/7/43) | (11/7/41) | (-/7/32) | (5/8/33) | (-/8/27) | (4/8/27) |
Table 2: Results for CUDA Fortran runs on Harpertown and Nehalem systems with C1060 (Tesla) and C2050 (Fermi) GPUs. The numbers are times in seconds for the runs, and the breakdown of time spent in the left-hand side (on device), right-hand side (on host), and bi-conjugate gradient (on device) portions of the code are given in parentheses, respectively. For each platform both concurrent and serial runs are given. In the concurrent runs the computation of the LHS on the GPU overlap those of the RHS on the CPU. In these cases no time is reported for the LHS as only the launch time is recoded. The actual time taken be the LHS computation is given in the breakdown of the serial runs.
Verification of Results
BWAVES generates three different output files that can be used for result verification:
- the L_{2} norm of dq(l,i,j,k), the change in the dependent variables after the last time step,
- the residual convergence after each time step, and
- the cumulative sum of iterations for convergence over all times steps.
The major difference between platforms appears in the last of these. From this data (shown in Table 3) we see that parallel runs, whether on multiple cores of the CPU or on the GPU, require fewer iterations overall. This is due to the improved accuracy of parallel reductions over their serial counterparts. These reductions are used to find search directions for the BiCGSTAB algorithm; the more accurate these search directions the fewer iterations needed.
CPU | CPU Cores | CPU Only | C1060 (Tesla) | C2050 (Fermi) | |
ECC On | ECC Off | ||||
Harpertown (E5440, 2.83 GHz) | 4 | 2192 | 2172 | 2169 | 2169 |
1 | 2195 | 2172 | 2169 | 2169 | |
Nehalem (X5550, 2.67 GHz) | 8 | 2184 | 2172 | 2169 | 2169 |
1 | 2195 | 2172 | 2169 | 2169 |
Table 3: The number of total iterations of the BiCGSTAB loop for the 100 time steps, as output in bwaves2.out. The parallel reductions are more accurate than serial reductions, resulting in obtaining better search directions and fewer iterations overall. This is seen both in comparing serial and parallel CPU-only runs as well as GPU and CPU runs. Both concurrent and serial GPU runs produce the same result, thus only a single value for each is reported here.
The L_{2} norm of the change vector after the last time step shows excellent agreement across platforms, as indicated in Table 4.
CPU | CPU Cores | CPU Only | C1060 (Tesla) | C2050 (Fermi) |
Harpertown (E5440, 2.83 GHz) | 4 | 18.47247154991293 | 18.47247154854392 | 18.47247154990315 |
1 | 18.47247154832275 | 18.47247154854392 | 18.47247154990315 | |
Nehalem (X5550, 2.67 GHz) | 8 | 18.47247154934678 | 18.47247154854392 | 18.47247154990315 |
1 | 18.47247154832275 | 18.47247154854392 | 18.47247154990315 |
Table 4: The L_{2} norm of the dependent variables change vector after the last time step as reported in bwaves3.out. Both concurrent and serial GPU runs produce identical results, as do C2050 (Fermi) runs with ECC both on and off, so these columns are collapsed.
Conclusions
Table 5 summarizes our results for BWAVES CPU and GPU performance. Because both host and device execute sections of the code, overall performance is dependent on both CPU and GPU architectures. However, to the first order, the performance can be categorized by the GPU architecture. Runs on the C1060 were between 50 and 52 seconds. Here, the LHS calculation on the GPU is the bottleneck in the LHS/RHS overlap as can be seen in the serialized runs in Table 2. For the C2050 this is no longer the case and the different CPU platforms play a larger role in the overall performance. Of significance to note is that regardless of the GPU architecture, all CUDA runs outperform any CPU-only run, and a single Harpertown core with a C1060 outperforms an eight-core Nehalem system.
Finally, no changes were made to the code between any of the runs. This is not particularly notable when comparing CPU and GPU runs as the execution naturally takes different code paths. However, it is noteworthy in the context of different GPU architectures. Specifically, the execution configurations used when launching CUDA kernels were identical for the C1060 Tesla and C2050 Fermi runs. The same optimization principles apply to both GPU architectures, such as limiting PCI traffic and hiding it with computation whenever possible, efficient use of device bandwidth through coalescing global memory transactions and reducing multiple reads of the same data through shared memory.
CPU | CPU Cores | CPU Only Time (s) | C1060 (Telsa) Time (s) | C2050 (Fermi) Time (s) | |
ECC on | ECC off | ||||
Harpertown (E5440, 2.83 GHz) | 4 | 580 | 51 | 40 | 34 |
(39/7/533) | (-/7/43) | (-/7/32) | (-/7/27) | ||
1 | 685 | 52 | 43 | 38 | |
(55/10/618) | (-/11/40) | (-/10/32) | (-/10/27) | ||
Nehalem (X5550, 2.67 GHz) | 8 | 81 | 50 | 36 | 31 |
(6/1.5/73) | (-/1.5/48) | (-/1.5/35) | (-/1.5/29) | ||
1 | 309 | 50 | 40 | 35 | |
(33/7/267) | (-/7/43) | (-/7/32) | (-/8/27) |
Table 5: Comparison of CPU and GPU results for BWAVES on both Harptertown and Nehalem systems with and without Tesla and Fermi cards. Two Fermi runs are performed, with and without ECC. In each cell of the table the top number is the overall time, and the bottom numbers in parentheses reflects the cumulative execution time for the LHS, RHS, and BiCGSTAB components of the benchmark. For concurrent runs the LHS time recorded reflects only the time to launch the kernel due to the nonblocking of CUDA kernel launches, and is not reported.
Additional information on this port is available in the PGI Accelerator Files section of the PGI website.
SPEC® is a registered trademark of the Standard Performance Evaluation Corporation