October 2011

## Accelerating Scientific Computing Code in FORTRAN: The Quantum Chemistry Project

### FORTRAN in Scientific Computing

Historically, scientists and engineers have been major users of high performance computing for their numerical computations. The FORmula TRANslation (FORTRAN) language, which specifically addresses and optimizes the requirements of large-scale numerical computation, is therefore deeply rooted in code development in many laboratories. Code development in academic laboratories, undertaken by faculty, postdocs and students responding to immediate and pressing scientific questions, is the major source of scientific computing software. Notwithstanding this, there has been a noticeable change over the recent past, with some popularly used codes now written in C and C++. Nonetheless, the bulk of large-scale (hundreds of thousands of lines of code per software package) computational tools in science and engineering that are routinely used by academics remain FORTRAN based.

There are clear advantages to using C/C++ for a software developer. In the context of scientific computing these advantages stem primarily from templates for generic programming and object orientation for abstraction and code reuse. The question, then, is why scientists and engineers continue to develop FORTRAN-based code. Briefly, the FORTRAN90 revision of the language, its ease of use and its inbuilt machine optimization make it well suited to laboratory-based software design and numerical computation. In the first instance, complex numerical routines read much like the formulas being transcribed. In the second instance, automatic optimization comes from FORTRAN90's stronger aliasing rules for memory pointers: FORTRAN function arguments may not alias each other, and the compiler assumes they do not. This enables an average programmer to produce fast-performing software, whereas comparably performing C code requires a skilled programmer, often at the cost of readability.

### Special Purpose Parallel Computing on GPUs

The CUDA (Compute Unified Device Architecture) GPU parallel computing architecture, developed by NVIDIA, and the OpenCL (Open Computing Language) toolset for writing kernels that run on GPUs, developed by the Khronos Group, are both based on C. Consequently, when the porting of numerical code to Graphics Processing Units (GPUs) became possible, software written in C and C++ was the first to take advantage of the revolution in accelerated computing.

A major handicap that prevented the acceleration of FORTRAN code on GPUs was the lack of convenient tools (compilers, kernel instructions, etc.) that make communication with GPUs easy. A significant innovation for FORTRAN developers has been the PGI Accelerator programming model, introduced in 2009. The PGI model is implemented as a set of directives that are accepted by the PGI FORTRAN and C compilers. Using the PGI Accelerator compiler, the conversion of FORTRAN code to a form capable of running on a GPU can be as simple as inserting !$acc region and !$acc end region directives around the code intended for execution on the GPU.

A common feature of many popular scientific computing codes is the presence of time-consuming components in their algorithms that are repeated on different data sets. The single instruction, multiple data (SIMD) parallel nature of these codes was recognized early on, and many scientific codes have therefore been implemented on massively parallel computers, some achieving linear scaling up to 128 CPU nodes. While SIMD parallelism is ideally suited to implementation on General Purpose Graphics Processing Units (GPGPUs), existing parallelized codes cannot be directly ported from clusters to GPUs by simply inserting !$acc directives at appropriate points in the code. To better understand this, we briefly summarize the architectural features of a GPU in the context of NVIDIA's Fermi hardware. Principally, GPUs have hundreds of processors and different levels of memory that are accessed by individual and groups of processors at different speeds. Fermi has 64 kB of on-chip shared memory and L1 cache, a 768 kB L2 cache for main memory, and support for complete overlapping of asynchronous bidirectional host-GPU memory transfers and GPU kernel execution.

There are many subtle optimization strategies; however, the two most important goals in an acceleration-porting project are 1) achieving load balancing on the GPU and 2) minimizing data transfer between the GPU and CPU. Beyond the acceleration itself, a major concern for scientific computing applications is the impact of data precision on GPU performance. We address these issues here in an overview of our project to accelerate the performance of the popular quantum chemistry FORTRAN academic software GAMESS-UK using GPUs.

### Grand Challenges in Computational Chemistry, Materials, Biology and Biophysics

Molecular and mesoscale simulation methods are employed to calculate thermodynamic properties (e.g., phase equilibria in chemical engineering), mechanical properties (e.g., stress-strain relationships in materials science), transport properties (e.g., diffusion and thermal conductivity in biology), and conformational and configurational-search properties in chemistry and biophysics. The software used in these applications is often embedded in a molecular dynamics (MD) or a Monte Carlo (MC) framework. The most computationally intensive part of MD codes is the calculation of non-bonded interactions. MD codes were some of the first large-scale scientific computing software to be accelerated by porting the non-bonded force routine to GPGPUs.

Electron-level computations are essential for calculating the electronic structure of molecules and materials, chemical reactivity, spectroscopic properties and catalytic mechanisms. These computations are generally undertaken by large, sophisticated quantum mechanical (*ab initio*) software written in FORTRAN. A key bottleneck in reducing the computational time of *ab initio* software of this type is the time-consuming calculation of complicated electron repulsion integrals (ERIs). This has been the major limitation in the acceleration of popularly used quantum chemistry codes.

### Basic Summary of *ab initio* Methods

As indicated above, the most important principle behind *ab initio* methods is the description of the electronic structure within a molecule. The electronic structure of an atom is the arrangement of electrons within shells, also known as atomic orbitals. A detailed discussion of this subject is well beyond the scope of this article, but the basic principles must be mentioned in order to illustrate the origins of the computational expense of *ab initio* calculations.

There are a number of different types of atomic orbitals. As shown below, the shapes of the 1s-, 2p- and 3d-type orbitals differ significantly; this is a result of the difference in their angular quantum numbers (l and m_{l}). Below we illustrate this for orbitals in the n=3 energy level.

| | 3s shell | 3p_{z} shell | 3d_{z^{2}} shell | 3d_{xz} shell |
|---|---|---|---|---|
| l | 0 | 1 | 2 | 2 |
| m_{l} | 0 | 0 | 0 | ±1 |

The functions that represent molecular orbital (MO) shapes have to be represented numerically, and this is achieved by writing each MO as an expansion in a finite set of basis functions. Gaussian basis functions of the form

g(r) = N x^{a} y^{b} z^{c} e^{-αr^{2}}

(where N is a normalisation constant, α is the exponent, and the powers a, b and c set the angular momentum) are commonly used in *ab initio* computations. In addition to the relative simplicity of the resulting algorithms, Gaussian functions can be combined, or contracted, in such a way as to give increasingly accurate representations of the shapes of the shells. Collectively, the Gaussian functions used within *ab initio* calculations are known as basis sets, and the Gaussian functions themselves are referred to as contracted and primitive basis functions. One of the simpler basis sets, 6-31G, uses three contracted functions to describe the three shells of carbon (the 1s, 2s and 2p shells respectively). The first of these contracted functions is built up of six primitive functions, whilst the second and third shells are each represented by a combination of a contracted function built up of three primitives and a single-primitive function.
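The contraction idea can be sketched numerically: a contracted s-type function is a fixed linear combination of primitive Gaussians. The exponents and contraction coefficients below are illustrative placeholders, not values from any published basis set.

```python
import math

# Illustrative (alpha, coefficient) pairs for three primitives.
# These are placeholder values, not taken from a real basis set.
primitives = [(3.0, 0.15), (0.6, 0.55), (0.15, 0.45)]

def contracted_s(r):
    """Value of a contracted s-type Gaussian at distance r from its centre."""
    total = 0.0
    for alpha, c in primitives:
        # Normalisation constant for a primitive s-type Gaussian.
        norm = (2.0 * alpha / math.pi) ** 0.75
        total += c * norm * math.exp(-alpha * r * r)
    return total

# The tight primitive dominates near the centre, the diffuse one at long range;
# the contracted sum approximates the overall shape of the shell.
value_at_origin = contracted_s(0.0)
```

The shape of the contraction is fixed once the coefficients are chosen; the iterative calculation described next only varies the MO coefficients multiplying these basis functions.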

Calculations using *ab initio* methods are iterative in nature. The aim is to calculate a series of coefficients for the basis functions that result in the lowest possible electronic energy.

A very simplified overview of a typical *ab initio* calculation, as illustrated in Figure 2 below, would be:

- Calculate coefficients (using a guess if it is the first iteration)
- For all shell quartets:
  - Generate the shell quartet
  - Evaluate the ERI for the shell quartet
  - Calculate the ERI contribution to the Fock matrix
  - Update the Fock matrix
- Diagonalise the Fock matrix
- Check for convergence (return to the first step if not converged)
- Calculate the required molecular properties
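The control flow of this loop can be sketched in miniature. In the toy Python sketch below a single scalar stands in for the Fock matrix, and a fixed-point update stands in for the ERI evaluation, Fock build and diagonalisation; only the iterate-until-self-consistent structure carries over from the real calculation.

```python
def toy_scf(initial_guess=1.0, tol=1e-10, max_iter=100):
    """Toy stand-in for an SCF loop: iterate a cheap update until the
    result stops changing (self-consistency). The update here is a
    simple fixed-point step, not a real Fock build."""
    x = initial_guess                    # initial guess "coefficients"
    for iteration in range(max_iter):
        new_x = 0.5 * (x + 2.0 / x)      # stand-in for build + diagonalise
        if abs(new_x - x) < tol:         # convergence check
            return new_x, iteration
        x = new_x                        # feed the result back in
    raise RuntimeError("not converged")

energy, iterations = toy_scf()
```

The expensive step hidden inside the real version of this loop is, of course, the ERI evaluation discussed next.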

The most expensive part of this process is the evaluation of the large number of two-electron repulsion integrals (ERIs). The problem is compounded by the fact that the ERIs are evaluated during each iteration. An ERI may be loosely described as the overlap of four shells (a shell quartet) and may be represented using the notation (ab|cd). This representation is often referred to as bra-ket notation, which allows separation into the bra, (ab|, and the ket, |cd). We will expand on the importance of this for *ab initio* calculations in future articles.

The high cost of the ERI evaluation process results from the sheer number of ERIs, which scales with N^{4}, where N is the number of contracted basis functions. For example, in a simple water molecule (H_{2}O), using a 6-31G basis set, there are seven contracted basis functions and up to 2401 contracted ERIs. A number of these ERIs are either permutationally equivalent (for example, (ab|cd) = (cd|ab)) or are small enough to ignore. As such, the total number of contracted ERIs that are evaluated is only 406. A larger, more representative, calculation on a cluster of 50 water molecules contains 350 contracted and 900 primitive basis functions and thus potentially requires the evaluation of 1.5x10^{10} contracted ERIs! Fortunately, this reduces to just 3,000,000 contracted ERIs when the permutation and size filters are taken into account.
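These counts can be checked with a few lines of Python. With N contracted basis functions there are N(N+1)/2 distinct bra (or ket) pairs, and the (ab|cd) = (cd|ab) symmetry leaves M(M+1)/2 distinct quartets, where M is the pair count; for N = 7 this alone reproduces the 406 quoted above.

```python
def total_eris(n):
    """Naive count: every ordered quartet of n basis functions."""
    return n ** 4

def unique_eris(n):
    """Count after removing permutationally equivalent quartets:
    (ab| = (ba|, |cd) = |dc) and (ab|cd) = (cd|ab)."""
    pairs = n * (n + 1) // 2
    return pairs * (pairs + 1) // 2

# Water with 6-31G: 7 contracted basis functions.
print(total_eris(7))     # 2401
print(unique_eris(7))    # 406

# Cluster of 50 water molecules: 350 contracted basis functions.
print(total_eris(350))   # 15006250000, i.e. the 1.5x10^10 quoted above
```

For the 50-water cluster the permutation count is still of order 10^9; the further reduction to roughly 3,000,000 evaluated ERIs comes from the magnitude-based screening.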

A further complication arises from the aforementioned requirement of contracted basis functions to provide an accurate description of the shapes of the shells. The use of contracted Gaussian functions means that contracted ERIs built up from contracted shells must be considered. These contracted ERIs, (ab|cd), are calculated as a sum of primitive ERIs [ij|kl], weighted by the contraction coefficients d of the four shells:

(ab|cd) = Σ_{i} Σ_{j} Σ_{k} Σ_{l} d_{i} d_{j} d_{k} d_{l} [ij|kl]

When this expansion is considered, approximately 10,150 and 75,000,000 primitive ERIs need to be evaluated for the single water molecule and the 50-water-molecule cluster examples mentioned above.
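The contraction step itself is a quadruple sum over the primitives of the four shells. A minimal sketch, with random placeholders standing in for the primitive integrals and illustrative contraction degrees (none of these numbers come from a real basis set):

```python
import itertools
import random

random.seed(1)

# Illustrative degrees of contraction for the four shells of one quartet.
K = [3, 3, 2, 2]
# Illustrative contraction coefficients d for each shell.
d = [[random.random() for _ in range(k)] for k in K]
# Primitive ERIs [ij|kl]: random placeholders indexed by primitive quartet.
prim = {q: random.random() for q in itertools.product(*(range(k) for k in K))}

# (ab|cd) = sum over i,j,k,l of d_i * d_j * d_k * d_l * [ij|kl]
contracted = sum(
    d[0][i] * d[1][j] * d[2][k] * d[3][l] * prim[(i, j, k, l)]
    for i, j, k, l in prim
)

# Number of primitive ERIs folded into this single contracted ERI.
n_primitives = len(prim)   # 3 * 3 * 2 * 2 = 36
```

Even this small quartet collapses 36 primitive evaluations into one contracted value, which is how the primitive ERI counts above dwarf the contracted ones.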

As each contracted ERI is independent of all the others, the ERI evaluation problem lends itself to parallelism and is thus well suited to GPU acceleration. While a general algorithm may be used for the evaluation of all types of ERIs, optimized forms of the algorithm are used for shell quartets with different combinations of angular momentum (effectively the shape, as outlined in scheme A, above).

### Brief Summary of the Accelerated Code

The processes that will be discussed in this article are generalized in Figure 3 below. This figure represents the processes that are repeated during each iteration of an *ab initio* calculation.

To summarise: a list of contracted shells is used to create a series of contracted shell quartets. These contracted shell quartets are transferred to the GPU, where they are decontracted and evaluated to produce a huge number of primitive ERIs. These primitive ERIs are then summed to produce the contracted ERIs. After a further calculation step, the contracted ERIs are combined into the Fock matrix, which is subsequently used to check for convergence.

### ERI Evaluation on GPUs: Problems and Their Solutions

**Problem 1: Getting the shell quartets in an order that maintains thread consistency.**
The original GAMESS-UK shell quartet generation code was completely unsuited to GPU execution, as it generates shell quartets sequentially in an order dependent upon the shell numbers, which in turn depend on the order of the atoms given as input. Once a shell quartet has been generated, its ERI is evaluated using the relevant subroutine.

Code representing this process is:

```
do a = 1, nshels
  do b = 1, a
    test (ab| bra for significance
    do c = 1, b
      do d = 1, c
        test |cd) ket for significance
        test (ab|cd) shell quartet for significance
        expand shell quartet into primitive ERIs
        evaluate ERI for primitive ERI [ij|kl]
      end do
    end do
  end do
end do
```

A schematic representation of this approach is:

Not only does this algorithm evaluate a single shell quartet at a time, but there is also no consistency between sequential shell quartets regarding which algorithm is required to perform the evaluation. However, it does contain crucial logic that prevents the generation of permutationally equivalent shell quartets, as well as tests that prevent the generation of a shell quartet that will not contribute significantly to the final answer before it would be evaluated. In addition, direct use of this approach would result in an extremely high ratio of data transfer to thread execution time, and in only a single execution thread for each call to the GPU kernel, placing an extremely wasteful load on the GPU.

To effectively load the GPUs it is necessary to generate equivalent shell quartets together. Equivalent ERIs share two characteristics:

- The angular momentum of the component shells for each quartet must be the same, as each combination requires a unique subroutine.
- The level of contraction must be the same, as this controls the number of loop iterations within the subroutine.

**Solution:**

Generate and then sort shell quartets: this approach is prohibitively costly due to the sheer number of shell quartets.

Sort shells before generating shell pairs and the subsequent shell quartets: this allows the generation of all equivalent shell quartets in a single batch and does not require the storage of any additional data.

```
sort shells and record number of shells of each class

!shell pair generation
do aclass = 1, nclasses
  do bclass = 1, aclass
    do a = 1, nshelsofclass
      ashell = a
      do b = 1, bmax
        bshell = b
        test (ab| bra for significance
        if significant, save (ab|
      end do
    end do
  end do
end do

!shell quartet generation
do abclass = 1, npairclasses
  do cdclass = 1, abclass
    do ab = abclassmin, abclassmax
      do cd = cdclassmin, cdclassmax
        test (ab|cd) shell quartet for significance
        if significant, save (ab|cd) shell quartet
      end do
      evaluate significant shell quartets of this class
    end do
  end do
end do
```

This code provides a list of shell quartets identical to that of the CPU code, with a single caveat: in some cases the ashell and bshell are swapped with the cshell and dshell. However, the resulting complications are addressed later in the computation anyway. Schematically this process can be represented as:
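The sort-then-generate idea can be sketched in Python. The shell labels, classes and contraction levels below are hypothetical; the point is that grouping shells by (angular momentum, contraction level) lets every quartet in a batch share a single evaluation kernel.

```python
from collections import defaultdict
from itertools import combinations_with_replacement

# Hypothetical shells: id -> (angular momentum, contraction level).
shells = {0: ("s", 6), 1: ("p", 3), 2: ("s", 1), 3: ("s", 6), 4: ("p", 3)}

# 1. Sort shells into classes.
classes = defaultdict(list)
for sid, key in shells.items():
    classes[key].append(sid)

# 2. Generate bra/ket shell pairs, grouped by their pair-class.
class_keys = sorted(classes)
pairs = defaultdict(list)
for i, ka in enumerate(class_keys):
    for kb in class_keys[: i + 1]:
        for a in classes[ka]:
            for b in classes[kb]:
                if ka == kb and b > a:
                    continue            # skip (ba| duplicates of (ab|
                pairs[(ka, kb)].append((a, b))

# 3. Combine pair-classes into quartet batches: every quartet in one batch
#    has the same four shell classes, so one kernel serves the whole batch.
batches = {}
for pc_ab, pc_cd in combinations_with_replacement(sorted(pairs), 2):
    batches[(pc_ab, pc_cd)] = [(ab, cd) for ab in pairs[pc_ab] for cd in pairs[pc_cd]]
```

Significance screening, omitted here for brevity, would filter pairs and quartets exactly as in the pseudocode above.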

**Problem 2: Minimization of Data Transfer**

__Recognition of data re-use and pre-calculation of variables__

An obvious concern for any code executed on a GPU is the need to reduce data transfer across the PCI Express bus.

In the original CPU code, two variables for each primitive basis function (the exponent and coefficient of the Gaussian function) were stored in arrays and used where needed. However, due to the nature of the basis sets, the number of unique values within these arrays is actually very low. For the cluster of 50 water molecules described above, there are 900 primitive Gaussian functions, resulting in the storage of 1,800 variables, yet only 20 unique values.

As can be seen from the shell quartet generation code above, the code depends on the generation of the bra and ket shell pairs (these are equivalent). There is significant re-use of these pairs, meaning that pre-calculation of pair-wise variables is advantageous. This results in the storage of just 600 variables. Whilst this is a larger number than mentioned above, the array of pair-wise values is generated and stored on the GPU, and the only data transferred across the PCI Express bus at this point are the 20 unique values.
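A sketch of the pre-calculation idea: pair-wise quantities recur in every ERI that touches a given pair of primitives, so they are computed once per unique pair of exponent values. The six values stored per pair in the real code are not specified in the article; the two computed below, the combined exponent p = α + β and the reduced exponent μ = αβ/p from the Gaussian product relation, are illustrative stand-ins, and the exponent values themselves are placeholders.

```python
# Hypothetical unique primitive exponent values (the article reports only
# 20 unique values across all 900 primitives of the 50-water cluster).
unique_exponents = [3.0, 0.6, 0.15, 0.05]

# Pre-compute pair-wise quantities once per unique (alpha, beta) pair.
pair_table = {}
for i, alpha in enumerate(unique_exponents):
    for beta in unique_exponents[: i + 1]:
        p = alpha + beta          # combined Gaussian exponent
        mu = alpha * beta / p     # reduced exponent
        pair_table[(alpha, beta)] = (p, mu)

def lookup(alpha, beta):
    """Fetch pre-computed pair values regardless of argument order."""
    if (alpha, beta) in pair_table:
        return pair_table[(alpha, beta)]
    return pair_table[(beta, alpha)]
```

On the GPU the analogous table lives in device memory, so only the handful of unique exponent values ever crosses the PCI Express bus.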

```
!$acc mirror pair-wise
!$acc update device(inputarray1, inputarray2)
!$acc do
do i = 1, nuniquevalues
  do j = 1, i
    calculate and store 6 pair-wise values
  end do
end do
!$acc update host(pair-wise)
```

The !$acc update host directive here sends the calculated values back to the CPU so that a CPU implementation of the code can be tested. The !$acc update device directive is required because the values within these input arrays are also used outside of the pair-wise calculation.

An identical approach may be taken for the calculation of the distances between the atoms of the system.

__2b: Simple summation on the GPU: contraction__

As described earlier, the contracted ERIs are calculated as the summation of a series of primitive ERIs. As the evaluation of an individual primitive ERI corresponds directly to a single thread, the summation is performed in a separate kernel from the ERI evaluation itself.

```
!$acc data region
!$acc local(primitiveERIs)
!$acc copyout(contractedERIs)
!$acc do
do ijkl = 1, nquartets
  do iprim = 1, contraction
    evaluate primitive ERI
  end do
end do
!$acc do
do ijkl = 1, nquartets
  do iprim = 1, contraction
    contracted ERI = contracted ERI + primitive ERI
  end do
end do
!$acc end data region
```
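The separation of evaluation and contraction into two passes can be mimicked in plain Python: a first pass fills a flat scratch array with one value per primitive ERI (one GPU thread each in the accelerated code), and a second pass reduces each quartet's primitives into a single contracted value. The sizes and values below are placeholders.

```python
import random

random.seed(0)
nquartets, contraction = 5, 4   # illustrative sizes

# Pass 1 ("evaluation kernel"): one scratch entry per primitive ERI.
primitive_eris = [random.random() for _ in range(nquartets * contraction)]

# Pass 2 ("contraction kernel"): reduce each quartet's primitives.
contracted_eris = [0.0] * nquartets
for ijkl in range(nquartets):
    for iprim in range(contraction):
        contracted_eris[ijkl] += primitive_eris[ijkl * contraction + iprim]
```

Keeping the scratch array resident on the GPU (the local clause in the directives above) means the primitive ERIs never cross the PCI Express bus.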

**Problem 3: Accuracy**

Quantum Mechanical computational chemistry methods provide a high level of detail about chemical systems. To provide this level of detail, high levels of accuracy are needed, meaning that double precision operations are generally essential.

However, GPUs perform single precision operations significantly faster than double precision operations. As such, careful choices have to be made regarding the manner in which numerical accuracy is dealt with in order to get the ideal balance between performance and precision.

A number of schemes to deal with this issue have been proposed in the computational chemistry literature. These include performing all of the summations in double precision, calculating the larger ERIs in double precision, and switching to fully double precision versions of the code after a certain number of iterations.
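The motivation for double-precision summation is easy to demonstrate: repeatedly adding a small term in single precision loses accuracy once the running sum grows large. The sketch below simulates single-precision accumulation in Python by rounding through IEEE float32 after every addition.

```python
import struct

def f32(x):
    """Round a Python float (IEEE double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

n, term = 1_000_000, 0.1

acc64 = 0.0          # double-precision accumulator (Python floats are doubles)
acc32 = f32(0.0)     # simulated single-precision accumulator
for _ in range(n):
    acc64 += term
    acc32 = f32(acc32 + f32(term))

# Exact answer is 100000; the single-precision sum drifts by whole units,
# the double-precision sum by far less.
error64 = abs(acc64 - 100000.0)
error32 = abs(acc32 - 100000.0)
```

ERI contraction sums millions of small terms in exactly this fashion, which is why the summation step is the natural place to retain double precision.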

Currently, the code implements both a fully GPU-accelerated single precision version and a version that initially uses single precision and then reverts to the double precision CPU version of the code for all ERIs once a specific level of convergence has been reached. The energies of two water molecules calculated using these approaches are shown below.

| | Electronic Energy | Total Energy |
|---|---|---|
| Original CPU code | -190.2020361428 | -151.9769225894 |
| GPU optimized code | -190.2020366394 | -151.9769230861 |
| Hybrid CPU and GPU code | -190.2020361422 | -151.9769225888 |

### Comparison of GPU and CPU Performance

As shown above, the accuracy of GPU-accelerated code can match that of standard CPU code if the correct considerations are made. Due to the relative numbers of single and double precision cores, the best performance for GPU-accelerated code is achieved when performing operations in single precision.

The graphs in Figure 6 show the speed of the ERI evaluation for systems in which there are shells of just a single shape, or principal quantum number (n=1, l=0, m=0, leading to the spherical shells shown above). However, these timings do contain multiple levels of contraction. As is clear from these graphs, the accelerated code runs significantly faster than the original code on just a single GPU. When multiple GPUs are used, the performance exhibits almost linear scaling.

The performance of the accelerated code for variously sized water clusters, systems that include shells of higher angular momenta (p-shells; n=2, l=1, m=-1, 0, 1), is shown in comparison with the CPU version of GAMESS-UK. The timings shown here are for the ERI evaluation employed in a recent version of the accelerated GAMESS-UK code. Unfortunately we do not present our latest results (which are significantly faster) here, as they are under review elsewhere.

### Acknowledgements

This work is based upon research supported by the South African Research Chairs Initiative (SARChI) of the Department of Science and Technology and the National Research Foundation to Kevin J. Naidoo (KJN). Karl Wilkinson and Kyle Fernandes thank SARChI for postdoctoral and doctoral fellowship support, respectively. We thank the NVIDIA Professor Partnership Program for a generous donation of equipment to KJN. We thank PGI for support, in particular Mathew Colgrove for valuable discussions.

* Principal Investigator and Corresponding Author