Technical News from The Portland Group

Speed-Up for Physicists

Introduction

In June 2007 the Center of Information and Media Technology (Director: Prof. Dr. Stephan Olbrich) at the Heinrich Heine University (Duesseldorf, Germany) and the leading European HPC vendor Bull (Cologne, Germany) started a cooperation. Its aim is technology transfer between science and industry, including joint projects. One of these projects focuses on evaluating how leading-edge accelerator technologies (e.g. NVIDIA TESLA) can be applied to existing scientific simulation codes. Our university was the first in Germany to install a TESLA S870 computing system, and the developers and technicians of Bull integrated this accelerator into our existing HPC cluster without any problems.

Most of our scientists are passionate FORTRAN programmers; it is simply their familiar programming language. Using CUDA as the API for NVIDIA GPUs to speed up their code therefore presents quite an acceptance hurdle. Naturally, we were very interested when we first heard about the new PGI compiler, which is able to generate CUDA code directly from standard FORTRAN.

While searching within simulation science for the right scientific application to exemplify the advantages of TESLA accelerators, I came across "ISPI". ISPI stands for "Iterative Summation of real-time Path Integrals"; it is a solid state physics code developed at our university in the institute of Prof. Egger. This single application consumed more than 400,000 CPU hours (i.e. about 45 CPU years) in less than 6 months. So I contacted the developer of this code, who kindly handed the source code over to me for "CUDAfication".

Where has the time gone?

Every optimization effort has to start with exhaustive benchmarking and profiling. In the case of ISPI I found that more than 90% of the walltime was spent in a single computation step. Without bothering you with all the details of the theoretical physics, let me just assure you that it all comes down to the computation of


Equation 1

where


Equation 2

and element-wise:


L_{ij} = \sum_{k=1}^{n} \sum_{l=1}^{n} B_{ik} \, C_{kl} \, D_{lj}

Equation 3

As a "natural" approach one could simply implement the last expression in FORTRAN, which would look like Fig. 1. Indeed, this is exactly how the computation was implemented in the original source code of ISPI.

      do i = 1, n
        do j = 1, n
          do k = 1, n
            do l = 1, n
              L(i,j) = L(i,j) + B(i,k) * C(k,l) * D(l,j)
            end do !. of l
          end do !. of k
        end do !. of j
      end do !. of i

Figure 1

How To Do A Matrix-Matrix-Matrix Multiplication

Of course, from a mathematical point of view equation (3) is perfectly correct. However, this quadruple loop scales with n⁴, and a great many multiplications and accesses to the two-dimensional arrays have to be performed: every pass through the innermost loop body costs 2 multiplications and 5 array accesses, i.e. 2n⁴ multiplications and 5n⁴ array accesses in total. For a small 24x24 matrix, more than 660,000 multiplications and more than 1.6 million array accesses are necessary (Table 1).

n     multiplications     array accesses
 8            8,192             20,480
12           41,472            103,680
16          131,072            327,680
20          320,000            800,000
24          663,552          1,658,880

Table 1

Even though FORTRAN stands for "FORmula TRANslating system", one should never do just that: translate an arbitrary formula to a programming language one-to-one. This may be acceptable for proof-of-concept code, but as shown in this example, the complexity of the computation needs to be checked carefully, as it may plainly ruin the scaling of the application. Beyond that, the large number of floating-point multiplications performed in this example reduces the accuracy of the result.

It is best practice to divide complex problems into already known (and already understood and already optimized) subproblems. So, doing the obvious, I split up the one threefold matrix multiplication into two standard matrix-matrix multiplications:


L = B \, C \, D = B \, (C \, D) = B \, T, \qquad T = C \, D

Equation 4

Though this IS trivial, it has important effects on efficiency. Firstly, the complexity of the computation is reduced to n³, because now only two triple loops are needed (Fig. 2). Moreover, both the number of array accesses and the number of multiplications decrease dramatically: each of the two triple loops costs 1 multiplication and 4 array accesses per innermost iteration, i.e. 2n³ multiplications and 8n³ array accesses in total (Tab. 2). This not only causes an enormous speed-up but also increases the accuracy of the calculation, because far fewer floating-point multiplications are performed.

!$acc region
      do i = 1, n
        do j = 1, n
          do k = 1, n
            T(i,j) = T(i,j) + C(i,k) * D(k,j)
          end do !. of k
        end do !. of j
      end do !. of i

      do i = 1, n
        do j = 1, n
          do k = 1, n
            L(i,j) = L(i,j) + B(i,k) * T(k,j)
          end do !. of k
        end do !. of j
      end do !. of i
!$acc end region

Figure 2

n     multiplications     array accesses
 8            1,024              4,096
12            3,456             13,824
16            8,192             32,768
20           16,000             64,000
24           27,648            110,592

Table 2

Matrix-matrix multiplication is very well understood, and highly optimized implementations are applied in many contexts. Based on equation (4), it is of course very easy to utilize any adequate library at hand. The "Basic Linear Algebra Subprograms" (BLAS) is a collection of linear algebra operations that has been the subject of very enthusiastic optimization. The general matrix-matrix multiplication routine of BLAS (GEMM) is one of the most frequently used operations, not only in benchmarking scenarios but also in quantum theory or molecular modeling, and it has therefore been optimized continuously since its first implementation in 1979.
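
To illustrate what equation (4) looks like in terms of BLAS calls, here is a minimal sketch (not the original ISPI source). It assumes double-precision real n-by-n arrays and the standard DGEMM interface; the routine name triple_product is purely hypothetical.

      ! Sketch: L = B*C*D evaluated as T = C*D followed by L = B*T,
      ! using the standard BLAS routine DGEMM.
      subroutine triple_product(n, B, C, D, T, L)
        implicit none
        integer, intent(in) :: n
        double precision, intent(in)  :: B(n,n), C(n,n), D(n,n)
        double precision, intent(out) :: T(n,n), L(n,n)

        ! T = C * D
        call dgemm('N', 'N', n, n, n, 1.0d0, C, n, D, n, 0.0d0, T, n)
        ! L = B * T
        call dgemm('N', 'N', n, n, n, 1.0d0, B, n, T, n, 0.0d0, L, n)
      end subroutine triple_product

Linked against an optimized BLAS implementation such as Intel's MKL, these two calls replace both triple loops of Fig. 2.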

But what if there is no library, optimized over some 30 years, that fits your particular computational problem? In this case, a compiler that is capable of optimizing your existing code as much as possible is needed. The newest generation of the PGI compiler even goes a step further and also utilizes current accelerators like NVIDIA's TESLA. The question is how this accelerator approach performs compared to a highly optimized library (e.g. Intel's Math Kernel Library).

To find this out, I just enclosed the two triple loops with the !$acc region and !$acc end region lines; that's it. The PGI compiler automatically determines which arrays need to be copied in, copied out or are only used locally within the CUDA kernel. It also takes care of allocating and deallocating the corresponding arrays in the TESLA memory.
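
If one wants to state the data movement explicitly instead of relying on the automatic analysis, the PGI Accelerator programming model also offers data clauses on the region directive. The following is only a sketch of what this could look like for the loops of Fig. 2; the choice of clauses assumes that T and L are initialized on the host before the region.

      ! Sketch: explicit data clauses instead of the automatic analysis.
      ! copyin - arrays that are only transferred from the host to the GPU
      ! copy   - arrays that are transferred in and copied back afterwards
!$acc region copyin(B, C, D) copy(T, L)
      ! ... the two triple loops of Fig. 2 ...
!$acc end region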

Benchmarking

The benchmarking was performed using a Bull Novascale R422-E1 twin dual-socket server with Intel Xeon 5462 (Harpertown) dual-core CPUs at 2.8 GHz and a single NVIDIA TESLA S870. I measured the performance of the different triple matrix multiplication methods for different sizes of the matrices.

I compared the original implementation of the ISPI code (i.e. the quadruple loop) and the two triple loops each compiled with the PGI 8.0-4 compiler for x86_64 CPUs. Furthermore, I compiled the two triple loops with the same compiler using the TESLA as target.
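
As an illustration of how such a measurement can be set up, here is a minimal timing sketch, not the original benchmark harness. It wraps the hypothetical triple_product routine sketched above with the Fortran intrinsic system_clock; the same harness can equally wrap the loop versions of Fig. 1 and Fig. 2. Matrix size and data types are assumptions.

      ! Minimal timing sketch (not the original benchmark harness).
      ! In a real measurement the computation would be repeated many
      ! times and one-time initialization overhead would be excluded.
      program bench_triple
        implicit none
        integer, parameter :: n = 24
        double precision :: B(n,n), C(n,n), D(n,n), T(n,n), L(n,n)
        integer :: count0, count1, count_rate

        call random_number(B)
        call random_number(C)
        call random_number(D)

        call system_clock(count0, count_rate)
        call triple_product(n, B, C, D, T, L)   ! routine sketched above
        call system_clock(count1)

        print *, 'n =', n
        print *, 'wall time [s]:', dble(count1 - count0) / dble(count_rate)
      end program bench_triple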

Figure 3 shows the dramatic difference in the performance between the quadruple loop and the two triple loops. Additionally, the PGI-generated TESLA version of the triple loops and the computation using the Intel MKL are compared in Fig. 4.


Figure 3: Comparison of the performance of the quadruple loop and the two triple loops


Figure 4: Comparison of all benchmark runs (different scale than Figure 3)

The most noteworthy outcome is the small difference between the simple-to-use solution utilizing the PGI compiler (with TESLA as the target) and the highly optimized MKL. Of course, implementing the triple matrix multiplication via MKL was also very easy. But this example impressively demonstrates the great potential of the PGI compiler for generating very fast accelerator-aware applications from your familiar FORTRAN code.

Conclusions

Optimizing existing applications always starts with exhaustive profiling and benchmarking. In most cases, the major part of the compute time is consumed by a limited number of source code lines, which keeps the optimization effort manageable.

Using pre-existing libraries is always the first choice, and I would like to strongly encourage all scientific programmers to perform a systematic search for appropriate libraries. Don't try to just translate your math into source code. Rather, split up your problem and reformulate it until it can be expressed in terms of already known computational steps.

But in the case of new algorithms or mathematical methods, you need to write your own numerical code. With the new PGI compiler, these algorithms can now be compiled for current accelerator technologies in a very straightforward way. As shown in this example of a triple matrix multiplication, the result performs almost as well as the matrix multiplication of the BLAS library, which has been optimized for three decades now.

About The Author

Dr. Stephan Raub was born in Düsseldorf (Germany) in 1975. He started studying Chemistry at the Heinrich-Heine University of Düsseldorf in 1995 and completed his diploma thesis in quantum chemistry in 2001. In 2007 he graduated in computational chemistry with a dissertation on predicting the binding energy of protein-ligand complexes, a crucial step in computer-aided drug design. Since 2007 Dr. Raub has been working at the Institute for IT-Management (chair: Prof. Dr.-Ing. Olbrich) in a cooperation project of the Heinrich-Heine University and the HPC vendor Bull Germany. His focus lies on the development and implementation of methods and algorithms for the efficient utilization of heterogeneous high-performance compute clusters. Furthermore, he leads the development of the accounting and monitoring solution.