November 2010
Porting Computational Chemistry Algorithms to GPUs
Using PGI Accelerator Directives
Introduction
Heterogeneous computers based on accelerators such as Graphics Processing Units (GPUs) are emerging as a popular platform for High Performance Computing. This article describes the application of this technology to computational chemistry algorithms using the PGI Accelerator Fortran compiler. We discuss the techniques in general terms using pseudo-code, where code in black is executed on the CPU while code in red is executed on the GPU.
The algorithms we are working with are used within models capable of high precision but only at significant computational cost. Less costly models may be used, but the accuracy of the calculations suffers as a result. Here, we look at algorithms that allow calculations to be performed at an extremely high level of accuracy but are prohibitively expensive for use in very large scale calculations. The goal is for GPU acceleration of this code to make such calculations feasible. Profiling studies have shown that this aspect of the code may take up more than 50% of the overall compute time for a calculation.
The highly parallel section of the algorithm of interest calculates huge numbers of variables that are independent of one another; here we refer to those variables as work-units. Each work-unit is the sum of a set of components, and every component of a given work-unit is calculated using the same algorithm. That algorithm, however, differs according to the class of the work-unit. There are a number of classes and subclasses of work-units: the subclasses of a given class differ from one another only in the number of components from which they are built. This is illustrated in Figure 1 below.
Figure 1: Problem summary
Each class of work-unit requires an optimized algorithm. Within these algorithms, a different number of loop iterations is required for each subclass in order to calculate all of its components. These components are then summed to give the final result for the work-unit.
Original Approach
As mentioned above, a large number of work-units is produced in calculations for even the most basic of systems. The original CPU code generates work-units one at a time and then calls the subroutine implementing the algorithm suited to the class of the current work-unit. A loop is then iterated once per component. As noted above, the number of iterations depends upon the subclass (number of components) of the work-unit, while the algorithm used within the loop depends upon its class.
Loop over i
    Loop over j
        Generate work-unit
        Classify work-unit
        Call work-unit analysis routine
            Loop over components of work-unit
                Summate components of work-unit
- Data transfer: n/a
- Kernel performance: n/a
- Subclasses: unimportant
Scheme 1: Original pseudo-code
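To make this structure concrete, a minimal Fortran sketch of the flow is shown below. All names and the placeholder arithmetic are illustrative only, not code from the actual package.

```fortran
! Minimal sketch of Scheme 1: work-units are generated, classified and
! analysed one at a time on the CPU.  All names and the placeholder
! arithmetic are illustrative only, not code from the actual package.
subroutine original_driver(n, total)
   implicit none
   integer, intent(in)  :: n
   real(8), intent(out) :: total
   integer :: i, j, iclass, ncomp, c
   real(8) :: unit_value, component

   total = 0.0d0
   do i = 1, n
      do j = 1, n
         iclass = mod(i + j, 3) + 1       ! placeholder classification
         ncomp  = 10*iclass               ! subclass = number of components
         unit_value = 0.0d0
         do c = 1, ncomp                  ! same algorithm for every component
            component = dble(i*j + c)     ! placeholder component calculation
            unit_value = unit_value + component
         end do
         total = total + unit_value       ! summate the components of the work-unit
      end do
   end do
end subroutine original_driver
```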
This approach is unsuitable for GPU accelerated code, as only one work-unit is generated at a time, meaning evaluation of the work-units would require regular changes to the kernel running on the accelerator. Such code would not be able to take maximum advantage of the hardware parallelism.
Inefficient Approach
Initially, the problem of switching between kernels was resolved by generating the work-units, sorting them according to class and then submitting each class as a batch to the relevant GPU kernel.
Loop over i
    Loop over j
        Generate work-unit
        Classify work-unit
        Store work-unit according to class
Loop over classes
    Call work-unit analysis routine
        Loop over work-units of given class
            Loop over components of work-unit
                Calculate components of work-unit
            Summate components of work-unit
- Data transfer: ~2 GB
- Kernel performance: Unbalanced
- Subclasses: Ignored (resulting in unbalanced thread lengths)
Scheme 2: Inefficient approach
This initial approach was inefficient to say the least, but it proved that the algorithms could be executed on the accelerator. While this code was both functional and, more importantly, numerically accurate, it was much (~100x!) slower than the original code running on the host. Profiling studies showed that a number of factors were responsible for the poor performance:
- No load balancing with respect to the number of loop iterations within the kernel
- The large amount of data transferred due to the unoptimized storage of the data required for kernel execution
The first issue resulted in poor performance of the compute kernel on the GPU itself. On the second issue, even a relatively small system required the transfer of approximately 2 GB of data to and from the GPU. Optimizing the manner in which the variables were stored allowed us to reduce the data transfer from ~2 GB to ~42 MB. This was still higher than necessary, and later versions of the code reduced the transfer to ~130 kB. This was deemed acceptable and allowed us to shift our focus from the reduction of data transfer to the optimization of kernel performance.
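The article does not reproduce the directives used, but with the PGI Accelerator model this kind of reduction typically comes from packing only the values a kernel actually needs into compact arrays and restricting the copy clauses to the sections that are used. A hypothetical sketch (the routine, names and placeholder arithmetic are not from the actual package):

```fortran
! Hypothetical sketch: pack the per-class work-unit data into a compact array
! and state explicitly what must move to and from the GPU, rather than letting
! whole host-side work arrays be transferred.
subroutine analyse_class_batch(nunits, maxcomp, ncomp, packed, results)
   implicit none
   integer, intent(in)  :: nunits, maxcomp
   integer, intent(in)  :: ncomp(nunits)            ! components per work-unit (varies with subclass)
   real(8), intent(in)  :: packed(maxcomp, nunits)  ! compact, per-class storage
   real(8), intent(out) :: results(nunits)
   integer :: u, c

!$acc region copyin(ncomp, packed(1:maxcomp,1:nunits)) copyout(results(1:nunits))
   do u = 1, nunits
      results(u) = 0.0d0
      do c = 1, ncomp(u)                            ! trip count still varies: threads are unbalanced
         results(u) = results(u) + packed(c, u)**2  ! placeholder component work
      end do
   end do
!$acc end region
end subroutine analyse_class_batch
```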
Load Balancing on the GPU
As described above, the number of loop iterations within the kernel depends upon the subclass of the work-unit. Therefore, if subclasses are not taken into account, the threads resulting from consecutive work-units are of unpredictable lengths, which resulted in extremely poor load balancing across threads. To improve this, the code was modified so that a single thread of the kernel corresponds to the execution of a single iteration of the loop, rather than a number of iterations.
Loop over i
    Loop over j
        Generate work-unit
        Classify work-unit
        Store work-unit according to class
Loop over classes
    Expand work-units into components
    Call work-unit analysis routine
        Loop over work-unit components
            Calculate work-unit components
        Summate components of work-unit
- Data transfer: ~2 GB
- Kernel performance: Balanced
- Subclasses: Ignored (resulting in high data transfer)
Scheme 3: Load balanced, data expensive approach
The performance of this version of the code was observed to be poor, but as shown in Figure 2 this was due to the amount of data (~2 GB) transferred to the GPU. The performance of the kernel itself was actually quite impressive.
Figure 2: Timing data for the data transfer constrained algorithm.
The return to data transfer limited performance was due to the expansion of the work-units into their components on the host machine and the subsequent transfer of large amounts of data to the accelerator.
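For contrast with the approach described in the next section, a hypothetical sketch of this stage is shown below (names and arithmetic are illustrative): the host pre-expands every work-unit of a class into one large per-component array, so the kernel threads are uniform, but the array that must be copied to the GPU grows with the total number of components.

```fortran
! Hypothetical sketch of Scheme 3: the host expands the work-units into their
! components before the kernel is launched, so the copyin grows with the total
! number of components -- the source of the ~2 GB transfers.
subroutine analyse_expanded(ntotal, comp_data, comp_val)
   implicit none
   integer, intent(in)  :: ntotal                 ! total components over all work-units
   real(8), intent(in)  :: comp_data(3, ntotal)   ! one pre-expanded record per component
   real(8), intent(out) :: comp_val(ntotal)
   integer :: t

!$acc region copyin(comp_data) copyout(comp_val)
   do t = 1, ntotal                               ! one uniform thread per component
      comp_val(t) = comp_data(1,t)*comp_data(2,t) + comp_data(3,t)   ! placeholder work
   end do
!$acc end region
end subroutine analyse_expanded
```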
Optimized Load Balanced Approach
Fortunately, the increased data transfer could be eliminated by performing the expansion on the accelerator itself. However, in order to maintain threads of consistent length, this requires that the work-units be sorted according to subclass:
Loop over i
    Loop over j
        Generate work-unit
        Classify work-unit
        Store work-unit according to subclass
Loop over classes
    Call work-unit analysis routine
        Loop over subclasses
            Expand work-units into components
            Loop over work-unit components
                Calculate work-unit components
                Sum work-unit components
            Summate components of work-unit
- Data transfer: ~130 kB
- Kernel performance: Balanced
- Subclasses: Accounted for
Scheme 4: Load balanced, data transfer-optimized approach
As can be seen above, these changes made a significant difference to the amount of data transferred: ~2 GB was reduced to ~130 kB.
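A hypothetical sketch of a per-subclass kernel in this scheme is shown below (all names and the expansion arithmetic are illustrative): because every work-unit in the batch belongs to the same subclass, every (work-unit, component) pair performs the same amount of work, and the expansion into components now happens on the GPU from a compact descriptor instead of in pre-expanded host arrays.

```fortran
! Hypothetical sketch of Scheme 4: all work-units in this batch share one
! subclass, so ncomp is constant, the threads are uniform in length, and only
! compact per-work-unit descriptors are copied to the GPU.
subroutine analyse_subclass(nunits, ncomp, desc, results)
   implicit none
   integer, intent(in)  :: nunits, ncomp
   real(8), intent(in)  :: desc(3, nunits)        ! compact descriptor per work-unit
   real(8), intent(out) :: results(nunits)
   real(8) :: component
   integer :: u, c

!$acc region copyin(desc) copyout(results)
   do u = 1, nunits
      results(u) = 0.0d0
      do c = 1, ncomp
         ! expand the c-th component on the GPU from the compact descriptor
         component = desc(1,u) + dble(c)*desc(2,u)       ! placeholder expansion
         results(u) = results(u) + component*desc(3,u)   ! placeholder component work
      end do
   end do
!$acc end region
end subroutine analyse_subclass
```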
This version of the code ran with a considerable performance increase relative to the original code. However, most of the modifications had been made to the engine used to generate the work-units; the algorithm used to perform the actual calculation had remained largely unchanged. In fact, the major differences between the kernel executing on the GPU and the original subroutine related to the manner in which the required data was accessed and the removal of the loops over the components.
This meant that the kernel contained code originally designed around the strengths of a CPU. For example, one variable was calculated N_work-unit × N_component times despite having only N_job-size²/2 unique values (where N_work-unit × N_component >> N_job-size). A number of similar issues, relating to other variables, were also present. To address this, the unique values were calculated once and stored on the GPU (using the data region directive), allowing them to be accessed when needed. These relatively minor alterations resulted in a fourfold increase in performance.
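As a rough illustration of that change (the directives, names and placeholder expression below are hypothetical, not the package's code), the unique values can be computed once into a table that lives on the GPU for the duration of a data region and is simply indexed from within the compute regions:

```fortran
! Hypothetical sketch: compute the unique values once, keep them resident on
! the GPU for the lifetime of the data region, and look them up from the
! kernel instead of recomputing them for every (work-unit, component) pair.
subroutine with_resident_table(njob, nunits, ncomp, idx, results)
   implicit none
   integer, intent(in)  :: njob, nunits, ncomp
   integer, intent(in)  :: idx(ncomp, nunits)          ! which unique value each component needs
   real(8), intent(out) :: results(nunits)
   real(8) :: table((njob*(njob+1))/2)
   integer :: ntab, k, u, c

   ntab = (njob*(njob+1))/2

!$acc data region copyin(idx) copyout(results) local(table)

!$acc region
   do k = 1, ntab                        ! computed once, not N_work-unit x N_component times
      table(k) = 1.0d0/sqrt(dble(k))     ! placeholder for the expensive expression
   end do
!$acc end region

!$acc region
   do u = 1, nunits
      results(u) = 0.0d0
      do c = 1, ncomp
         results(u) = results(u) + table(idx(c, u))   ! lookup instead of recomputation
      end do
   end do
!$acc end region

!$acc end data region
end subroutine with_resident_table
```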
As can be seen in Figure 3 below, the resulting code performed admirably in comparison to the original CPU code, running 10x faster on a single NVIDIA GTX260 than on an Intel i7 2.80 GHz processor.
Figure 3: GPU vs CPU performance
Having established the performance of the accelerated code relative to the original CPU code, we tested it on other GPU-based accelerators. Figure 4 shows the performance of the different GPU accelerators used for testing. Note that the time-axis of this graph is only one tenth the size of the time-axis of the CPU results graph. The most recent GPU tested is a Tesla C2050, on which the code ran ~60x faster than on the Intel i7 2.8 GHz processor.
Figure 4: Performance of different GPUs
Using Multiple GPUs
The most important issue for dealing with multiple GPUs was obvious: ensuring that the load balance across GPUs was even. This could potentially be achieved by careful distribution of the different subclasses of work-units across the GPUs.
Note that we are only dealing with a single class at this point and that the subclasses differ by the number of components from which they are built. Each component is calculated using identical code. However, the number of work-units of each subclass can differ considerably according to the size of calculation being performed. Therefore, the method used to balance the computational load across multiple GPUs needs to be flexible.
For this code, the development of dynamic load balancing code for multi-GPU systems requires the consideration of a number of factors:
1. The number of work-units in each subclass
2. The number of components associated with the subclass
3. The compute time required for a component
4. The data transfer time required for each work-unit
Factors 2-4 are interconnected: for each subclass, a single unit of data (of the same size, regardless of class or subclass) is transferred to the GPU for each work-unit. Then, a thread is generated for each component. As the number of components varies with the subclass (from 1 to ~1500), the ratio of data transfer to compute time varies wildly between different subclasses.
These relationships mean that a simple distribution of the workload according to the number of work-units within each subclass is insufficient. Instead, a method was devised whereby a total GPU time is estimated for each subclass from its estimated kernel and data transfer times. The subclasses are then allocated to the GPUs in such a way as to balance the workload.
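The scheduling code itself is not shown here, but the idea can be sketched as a simple greedy assignment: estimate a time for each subclass from its work-unit count, its component count and a per-work-unit transfer cost, then repeatedly hand the most expensive remaining subclass to the least-loaded GPU. All names and the cost constants below are hypothetical.

```fortran
! Hypothetical greedy scheduler: estimate a GPU time per subclass and assign
! each subclass, largest first, to the currently least-loaded GPU.
subroutine balance_subclasses(nsub, ngpu, nunits, ncomp, assigned)
   implicit none
   integer, intent(in)  :: nsub, ngpu
   integer, intent(in)  :: nunits(nsub)       ! work-units in each subclass
   integer, intent(in)  :: ncomp(nsub)        ! components per work-unit in each subclass
   integer, intent(out) :: assigned(nsub)     ! GPU chosen for each subclass
   real(8), parameter   :: t_comp = 1.0d-7    ! placeholder cost per component
   real(8), parameter   :: t_xfer = 5.0d-7    ! placeholder transfer cost per work-unit
   real(8) :: cost(nsub), load(ngpu)
   logical :: done(nsub)
   integer :: s, sbig, gmin

   cost = dble(nunits)*(dble(ncomp)*t_comp + t_xfer)   ! estimated time per subclass
   load = 0.0d0
   done = .false.
   do s = 1, nsub
      sbig = maxloc(cost, dim=1, mask=.not. done)      ! most expensive remaining subclass
      gmin = minloc(load, dim=1)                       ! least-loaded GPU so far
      assigned(sbig) = gmin
      load(gmin) = load(gmin) + cost(sbig)
      done(sbig) = .true.
   end do
end subroutine balance_subclasses
```

A more elaborate scheme could refine the two cost constants from measured timings, but the greedy assignment is enough to capture the balancing idea.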
Figure 5a: Performance of multiple GPUs
Figure 5b: Performance of multiple GPUs
As can be seen from Figures 5a and 5b, the performance of this code was impressive. Near-linear scaling with respect to the number of GPUs was observed for most tested calculation sizes. The resulting code, when running across 4 Tesla C2050 GPUs, is approximately 180x faster than the original code running on the Intel i7 2.8 GHz processor.
Conclusion
We have successfully applied the PGI Accelerator programming model to accelerate a section of an established scientific software package, achieving impressive speed-ups. The acceleration of an existing software package in this manner retains the full functionality of the software, a major advantage.
The process of developing the accelerated software has not been a simple case of adding accelerator directives to the existing code; a large amount of restructuring has proven necessary. This is a result of the generality and flexibility of the original code for the problem addressed, code that was designed around the strengths of a CPU-based system. Having said this, other aspects of the code are much more amenable to acceleration through the addition of simple directives; we have simply chosen to attack the most complex problem first.
Acknowledgements
This work is based upon research supported by the South African Research Chairs Initiative (SARChI) of the Department of Science and Technology and the National Research Foundation to Prof. Kevin J. Naidoo. We thank the NVIDIA Professor Partnership Program for a generous donation of equipment to Kevin Naidoo. Karl Wilkinson thanks SARChI for postdoctoral fellowship support. In addition, the work discussed here would not have been possible without the support of PGI, in particular Mathew Colgrove, whom I would like to take this opportunity to thank once again!