August 2012
Introduction to Texture Memory Support in CUDA Fortran
PGI 12.8 includes an initial implementation of textures, a CUDA feature that has been widely requested since we first released CUDA Fortran back in 2009. Early on, we expected the need for this functionality to fade away. However, as CUDA Fortran has evolved, the push for texture support is probably stronger than ever.
Textures are capable of wondrous things in the hands of graphics programmers. For our purposes, we just want to think of texture memory as a fast read-only cache between the NVIDIA streaming multiprocessors (SMs) and device memory, a cache the Fortran programmer can easily use to stash modest-sized data items that are frequently read but never written within a device kernel. In the spirit of CUDA Fortran, we've abstracted away many of the details of textures that application programmers, scientists and engineers would find confusing. Even the term "texture" probably doesn't have any intuitive meaning to most HPC programmers. In any event, we've added the texture attribute qualifier to the CUDA Fortran language, just as we added the device, shared and constant attributes to Fortran back in 2009.
Because a CUDA texture is an alternate way to refer to device memory locations, in CUDA Fortran we require the texture attribute to be combined with the pointer attribute. Just as with Fortran pointers in host memory, any array to which a texture points must have the target attribute (and, of course, the device attribute). This allows the compiler to catch programming errors at compile time, eliminating a host of runtime errors that can be difficult to debug.
Background
In CUDA C++, textures are required to be global. The CUDA Fortran implementation is analogous. The simplest, and currently the only supported, way to use texture memory is to add a declaration like the following to a module that is used by both the host and the device code:
real, texture, pointer :: t(:)
In your host code, add the target attribute to the device data that you wish to place in texture memory by changing a declaration like:
real, device :: a(n)
To:
real, target, device :: a(n)
The target attribute is standard Fortran syntax to denote an array or other data structure that may be "pointed to" by another entity. Finally, tie the global texture declaration to the device array using Fortran pointer assignment:
t => a
This causes the compiler to perform all of the underlying CUDA texture binding operations, so that accesses to array t, and thus to the underlying array a, go through the texture cache.
The CUDA Fortran device code contained in the module that declares array t can now access the array without any other wrapper or decoration, for example:
! Vector add, s is in device memory, t is in texture memory
i = threadIdx%x + (blockIdx%x-1)*blockDim%x
s(i) = s(i) + t(i)
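Putting the pieces together, a complete (if contrived) vector-add program looks like the sketch below; the module, program, and array names here are illustrative rather than taken from a shipped example:

module vecadd_mod
  real, texture, pointer :: t(:)      ! module-scope texture pointer, visible to host and device code
contains
  attributes(global) subroutine vecadd( s, n )
    integer, value :: n
    real, device :: s(*)
    i = threadIdx%x + (blockIdx%x-1)*blockDim%x
    if (i <= n) s(i) = s(i) + t(i)    ! the read of t(i) goes through the texture cache
  end subroutine
end module vecadd_mod

program testvecadd
  use cudafor
  use vecadd_mod
  integer, parameter :: n = 1048576
  real, device, target :: a(n)        ! target: the texture pointer will point at it
  real, device :: s(n)
  real :: results(n)
  a = 1.0
  s = 2.0
  t => a                              ! pointer assignment performs the texture bind
  call vecadd<<<n/256, 256>>>(s, n)
  results = s
  print *, 'expecting 3.0, max error =', maxval(abs(results - 3.0))
end program

Note that the pointer assignment happens on the host, before any kernel that reads t is launched.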
Example 1
One feature of the texture cache is that loads access data at the granularity of individual data items, whereas global device memory accesses through the L1 cache always load a full 128-byte cache line. At large strides, most of each 128-byte line fetched through L1 goes unused, so the texture path wastes far less bandwidth.
A simple demonstration is a pair of kernels that perform strided vector accesses so we can measure effective memory bandwidth. Two versions are given, one that uses textures and one that uses ordinary global memory accesses:
module memtests
  integer, texture, pointer :: t(:)   ! module-scope texture pointer
contains
  ! strided read from global device memory
  attributes(global) subroutine stridem( a, b, s )
    integer, device :: a(*), b(*)
    integer, value :: s
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    j = i * s
    b(i) = a(j) + 1
  end subroutine
  ! strided read through the texture cache
  attributes(global) subroutine stridet( b, s )
    integer, device :: b(*)
    integer, value :: s
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    j = i * s
    b(i) = t(j) + 1
  end subroutine
end module memtests
The two kernels are called to transfer 2**20 elements, or 4 MBytes, with each call. The kernels are run with varying thread block sizes and with stride values ranging from 1 to 32. When s == 1, the accesses are completely coalesced in transfers from device memory to the SMs.
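For reference, a host-side driver along the following lines reproduces this kind of measurement; the event-based timing and the oversized a array (allocated so the largest strided index stays in bounds) are my choices here, not necessarily the exact benchmark source:

program stridebench
  use cudafor
  use memtests
  integer, parameter :: n = 2**20, smax = 32
  integer, device, target, allocatable :: da(:)
  integer, device, allocatable :: db(:)
  type(cudaEvent) :: startEvent, stopEvent
  real :: time
  integer :: s, istat
  allocate(da(smax*n), db(n))         ! da must cover the largest strided index, smax*n
  da = 1
  t => da                             ! bind the module texture pointer to da
  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)
  do s = 1, smax
    istat = cudaEventRecord(startEvent, 0)
    call stridet<<<n/128, 128>>>(db, s)   ! time stridem the same way for comparison
    istat = cudaEventRecord(stopEvent, 0)
    istat = cudaEventSynchronize(stopEvent)
    istat = cudaEventElapsedTime(time, startEvent, stopEvent)
    ! one 4-byte read plus one 4-byte write per element; time is in milliseconds
    print *, 'stride', s, ':', 8.0*n/(time*1.0e6), 'GBytes/sec'
  end do
end program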
Here is the observed performance, in GBytes/sec (bigger is better), on device 0 of a GTX 690:
| s | stridem (blockDim=32) | stridet (blockDim=32) | stridem (blockDim=128) | stridet (blockDim=128) | stridem (blockDim=512) | stridet (blockDim=512) |
|---|---|---|---|---|---|---|
| 1 | 53.45 | 50.94 | 141.93 | 135.43 | 132.24 | 128.74 |
| 2 | 49.77 | 49.93 | 100.37 | 99.61 | 99.72 | 99.02 |
| 3 | 46.96 | 48.07 | 75.25 | 74.91 | 75.32 | 74.94 |
| 4 | 44.25 | 45.90 | 60.19 | 60.00 | 60.17 | 60.02 |
| 5 | 39.06 | 39.89 | 50.09 | 49.96 | 50.11 | 50.05 |
| 6 | 37.14 | 39.33 | 42.95 | 42.93 | 42.96 | 42.97 |
| 7 | 32.88 | 35.94 | 37.56 | 37.50 | 37.59 | 37.61 |
| 8 | 32.48 | 32.98 | 33.38 | 33.42 | 33.38 | 33.45 |
| 9 | 29.90 | 32.94 | 30.01 | 33.38 | 30.02 | 33.44 |
| 10 | 27.43 | 32.86 | 27.28 | 33.07 | 27.30 | 33.19 |
| 11 | 25.17 | 32.77 | 24.98 | 32.91 | 25.02 | 33.00 |
| 12 | 23.23 | 33.19 | 23.07 | 33.01 | 23.08 | 33.21 |
| 13 | 21.57 | 33.13 | 21.40 | 32.26 | 21.43 | 32.51 |
| 14 | 20.15 | 32.98 | 19.97 | 31.76 | 19.99 | 32.13 |
| 15 | 18.87 | 32.80 | 18.72 | 30.80 | 18.74 | 31.44 |
| 16 | 17.78 | 32.66 | 17.60 | 31.83 | 17.64 | 31.78 |
| 17 | 16.78 | 33.17 | 16.62 | 29.33 | 16.64 | 30.01 |
| 18 | 15.90 | 33.31 | 15.72 | 28.66 | 15.76 | 29.19 |
| 19 | 15.10 | 33.01 | 14.94 | 26.50 | 14.96 | 27.82 |
| 20 | 14.38 | 33.37 | 14.23 | 26.84 | 14.25 | 27.58 |
| 21 | 13.73 | 33.42 | 13.56 | 24.23 | 13.58 | 24.45 |
| 22 | 13.13 | 33.40 | 12.97 | 22.79 | 12.99 | 23.09 |
| 23 | 12.58 | 33.19 | 12.42 | 22.18 | 12.45 | 21.88 |
| 24 | 12.08 | 33.71 | 11.94 | 24.30 | 11.94 | 22.68 |
| 25 | 11.61 | 33.33 | 11.47 | 21.63 | 11.48 | 20.03 |
| 26 | 11.19 | 33.42 | 11.03 | 21.63 | 11.05 | 19.90 |
| 27 | 10.78 | 33.41 | 10.65 | 19.96 | 10.65 | 18.52 |
| 28 | 10.41 | 33.38 | 10.27 | 19.97 | 10.28 | 18.51 |
| 29 | 10.07 | 33.36 | 9.93 | 19.19 | 9.93 | 17.67 |
| 30 | 9.73 | 33.54 | 9.61 | 18.42 | 9.61 | 17.10 |
| 31 | 9.44 | 33.31 | 9.30 | 18.08 | 9.30 | 16.56 |
| 32 | 9.15 | 33.42 | 9.02 | 20.19 | 9.02 | 17.94 |
We can see that for strides less than 4, normal accesses through global device memory are preferred. For strides from 4 through 8, it is basically a wash. For strides greater than 8, accesses through the texture cache begin to outperform the normal method, and the gap widens rapidly as the stride grows. Source code for this example and the following example is available from the PGI website.
Example 2
Back in 2009 I visited Sandia National Labs to introduce CUDA Fortran. During that visit I learned that Sandia's initial investigations with CUDA C indicated they could use texture memory to speed up their solvers by 5–10%. That was an early data point, and it has been reinforced by other discussions I've had since.
I'm no linear algebra expert, but I've written a simple code to try to duplicate the results reported by Sandia, and subsequently by NVIDIA as well. Here, I've tried to generalize a sparse matrix-vector multiplication operation, Ax=b. The sparse matrix is simple, and probably doesn't represent any real-world code, but it has 27 nonzero elements per row, set in diagonal groups of three. I used a variation of the jagged-diagonal storage (JDS) format, which allows for coalesced accesses of the A matrix within CUDA but still makes for some interesting cases in the upper-left and lower-right corners of the matrix. The total number of rows N in the sparse matrix, the spacing between diagonal sections, etc., are set with compile-time constants. The decomposition of the problem amongst CUDA threads is also simple: each thread is responsible for one row of the matrix and one result of b. I've added enough zero-fill to the A matrix to simplify addressing.
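The core kernel follows the same thread-per-row pattern throughout. Here is a simplified sketch of the variant that reads a precomputed index matrix from device memory while fetching x through a texture; the names and exact storage declarations are illustrative, not my actual code:

module spmv_mod
  real, texture, pointer :: xt(:)     ! texture view of the x vector
contains
  attributes(global) subroutine spmv( a, jx, b, n )
    integer, value :: n
    real, device :: a(n,27)           ! zero-filled storage: 27 diagonals, coalesced in the row index
    integer, device :: jx(n,27)       ! the precomputed index matrix into x
    real, device :: b(n)
    real :: bi
    i = threadIdx%x + (blockIdx%x-1)*blockDim%x
    if (i <= n) then
      bi = 0.0
      do k = 1, 27
        bi = bi + a(i,k) * xt(jx(i,k))   ! reads of x go through the texture cache
      end do
      b(i) = bi
    end if
  end subroutine
end module spmv_mod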
I compiled and executed the code using various methods of reading or computing the indices into the x vector. In each case I compared the performance with the x vector accessed directly from global device memory against the performance with it accessed through the texture cache. The results are shown in the table below. Table entries are single-precision GFlops on device 0 of a GTX 690, with N = 1,000,000:
| | Index matrix in device memory | Index matrix in texture memory | Computed indices (table in texture) | Computed indices (table in shared) |
|---|---|---|---|---|
| x vector in device memory | 34.9 | 46.5 | 56.7 | 62.2 |
| x vector in texture memory | 34.2 | 48.2 | 58.6 | 67.1 |
In the last two columns, given the simple nature of my sparse matrix, I was able to compute the sets of indices into x for any row of the multiply operation using an 8-bit, 256-entry state machine and just a little supporting arithmetic. I can hold that state machine in either texture memory or shared memory. So, for instance, in the lower right-hand corner result, the A matrix is in global device memory, the x vector is in texture memory, and the indices are computed using a lookup table in shared memory. This gives the best result, over 67 GFlops on one device of a GTX 690, which I compute to be 84% of peak based on the published internal bandwidth specification (160 GBytes/sec) for the device on the card: each row performs 54 flops (27 multiply-adds) while streaming 108 bytes of the A matrix from device memory, so 67.1 GFlops corresponds to roughly 134 GBytes/sec of A-matrix traffic.
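For the shared memory variant, the 256-entry table is staged into shared memory once per thread block before use. A minimal sketch of that staging pattern follows, with a stand-in for the actual state machine arithmetic, which I haven't shown:

module lutdemo
contains
  attributes(global) subroutine uses_lut( dlut, b, n )
    integer, value :: n
    integer, device :: dlut(256), b(*)
    integer, shared :: lut(0:255)
    ! cooperatively copy the table into shared memory; works for any block size
    do k = threadIdx%x, 256, blockDim%x
      lut(k-1) = dlut(k)
    end do
    call syncthreads()
    i = threadIdx%x + (blockIdx%x-1)*blockDim%x
    if (i <= n) b(i) = lut(iand(i-1,255))   ! stand-in for the real index computation
  end subroutine
end module lutdemo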
The purpose of this article is not to show artificial performance numbers, but rather to demonstrate the new CUDA Fortran texture feature. This code shows that textures can point to either real or integer data, and can point to relatively large arrays (1 million elements or more in the case of x) or to small arrays (only 256 entries in the case of the lookup table). Whether or not the texture cache speeds up your code will depend on a number of factors. However, the ability to free up memory bandwidth by using textures, which until now were unavailable in CUDA Fortran, makes them worth exploring if you are trying to eke out every last drop of performance from your code.
Future Work
We are headed in three possible directions with this work:
- We plan to implement more of the texture functionality, starting with support for double precision and complex data types, and possibly float2, float4, int2, and int4 types. Then we will consider the different addressing modes, floating point indices, and possibly linear filtering.
- We are looking at ways to generate fetches that go through the texture memory automatically if we can detect, through compiler analysis or other means, that the data is read-only and a good candidate.
- We can eventually pull some of this functionality into the PGI accelerator model, so that read-only data within accelerator compute regions can be accessed through the texture cache.
I hope you have an opportunity to give this new feature a try in your CUDA Fortran codes, and to send feedback on performance and which of the above features you would find most useful.