Changes for Fortran Allocatable Arrays
This short post summarizes changes in the behavior of Fortran Allocatable arrays and array assignments in the PGI 18.7 release. You may be particularly interested in this if you use OpenACC and Fortran Allocatable array assignments in your device code.
Fortran has an advantage over C or C++ for array processing, because it has high level array features in the language, including allocatable arrays and array assignments. After allocating an array, you can easily initialize the whole array with a single assignment statement, or copy one array to another, or easily compute an array-valued expression.
! Sample 1 real, dimension(:,:), allocatable :: a, b, c ... allocate(a(n,m), b(n,m), c(n,m)) ... a(1:n,1:m) = 0 ... c(1:n,1:m) = a(1:n,1:m) + 2.0 ... b(1:n,1:m) = c(1:n,1:m)
As all Fortran programmers know, you don't have to give the lower bounds and upper bounds of the array expression dimensions if you are using or assigning the whole extent of that dimension:
! Sample 2 ... a(:,:) = 0 ... c(:,:) = a(:,:) + 2.0 ... b(:,:) = c(:,:)
Moreover, if you are referencing the whole array, you don't even need to include the parentheses:
! Sample 3 ... a = 0 ... c = a + 2.0 ... b = c
This is a very convenient way to write array-oriented programs, especially programs dominated by matrices and vectors. However, it's this last case, assignments to the whole allocatable array, where the behavior has changed. In 2003, Fortran changed the behavior of assignments to allocatable arrays where the left-hand side is the array name, with no subscripts. Before 2003, the three array assignments in Samples 1, 2, and 3 had exactly the same behavior. Looking at Sample 3:
- The initialization of a assigned zero to each element of a. It was a runtime error if a was not already allocated.
- The assignment to c required that the arrays a and c both be allocated and have the same shape.
- Similarly, the assignment to b required that both b and c be allocated with the same shape.
The Fortran 2003 change was to treat an assignment to an allocatable array with no subscripts as implicitly allocation or reallocation of the array to match the shape of the right-hand side. It's no longer an error if the left-hand side array is not allocated, or if it has a different shape from the right-hand side.
- In the initialization of a, if a is already allocated, then each element of a is assigned that value. If a is not allocated, it is then implicitly allocated with a size of 1, and that element is assigned.
- In the assignment to c, if c is not allocated, it gets allocated to the shape of the array expression, in this case the shape of a. If c is already allocated, but its shape is different from the array expression, it is deallocated and reallocated to the right shape.
- In the array copy from c to b, if the right-hand side array c is not allocated, then b gets deallocated, unless it was already not allocated. Otherwise, the left-hand side array b gets allocated or reallocated to the shape of c, unless it already has the right shape.
This means when you are writing array assignments to allocatable arrays, you no longer have to worry quite so much about whether the arrays are allocated to the right size. If they aren't, they will get reallocated appropriately.
With just a little thought, you can imagine how this affects the generated code for an array assignment. Before 2003, the compiler would turn the array assignment into a loop. That middle assignment:
c = a + 2.0
would turn into a loop like the following:
do i = 0, size(c,1)-1 c(i+lbound(c,1)) = a(i+lbound(a,1)) + 2.0 enddo
With the Fortran 2003 changes, the generated code looks more like:
if (.not.allocated(c)) then allocate(c(size(a,1))) elseif (size(c,1) /= size(a,1)) then deallocate(c) allocate(c(size(a,1))) endif do i = 0, size(c,1)-1 c(i+lbound(c,1)) = a(i+lbound(a,1)) + 2.0 enddo
The array copy:
b = c
now turns into something even more interesting:
if (.not.allocated(c)) then if (allocated(b)) deallocate(b) else if (.not.allocated(b)) then allocate(b(size(c,1))) elseif (size(b,1) /= size(c,1)) then deallocate(b) allocate(b(size(c,1))) endif do i = 0, size(b,1)-1 b(i+lbound(b,1)) = c(i+lbound(c,1)) enddo endif
What was a relatively simple array assignment has now turned into a lot more code. This can affect performance. One of the key compiler optimizations for Fortran array assignments is to fuse those generated loops together. So two adjacent array assignments:
a = b * 3.14 d = sin(a)
would, before 2003, create two loops:
do i = 0, size(a)-1 a(i) = b(i) * 3.14 enddo do i = 0, size(d)-1 d(i) = sin(a(i)) enddo
If the compiler can prove that the loops have the same trip count and that there are no data hazards from fusing the loops, the compiler will optimize this to:
do i = 0, size(a)-1 a(i) = b(i) * 3.14 d(i) = sin(a(i)) enddo
This is much more than removing the extra loop overhead. It's more about utilizing cache and register locality. In the latter fused loop, the element a(i) need not be fetched from memory at all, since that value is still in the register that just got stored in the previous assignment. However, with Fortran 2003, it's much more difficult for a compiler to do such fusion for these allocatable array assignments.
PGI Fortran had implemented this behavior, but it was not the default. It was available under the -Mallocatable=03 option. PGI 18.7 has adopted the standard Fortran 2003 behavior as the default; you can revert to the previous behavior with the -Mallocatable=95 flag.
You can also avoid the complex generated code, with the runtime allocation status and shape checking, by using array sections on the left-hand side array. With Fortran 2003 allocatable array behavior, the two assignments:
a = b * 3.14 d = sin(a)
are very different from assignments with section subscripts:
a(:) = b * 3.14 d(:) = sin(a)
Even though you probably want these to do the same thing, if the left-hand side arrays are allocatable arrays, the version with section subscripts will be more efficient regardless of which compiler you use to build your program, unless you depend on the Fortran 2003 reallocation behavior.
Effects on OpenACC Code
The more interesting change has to do with Fortran array assignments in OpenACC compute constructs or device routines. Array assignments are compiled the same way in device code as in host code. However, the array reallocation or deallocation required by the new Fortran 2003 behavior can't be done in OpenACC device code because of the need to keep track of the relationship between host and device data. To explain this in a little more detail, let's look at another short example.
!$acc kernels copyout(a) copy(b) copyin(c) a(:) = b(:) b(:) = sin(a(:)) + cos(c(:)) !$acc end kernels
When this gets compiled for device execution, the compiler will generate code for the array assignments more or less just like it would for host execution. But let's look at the effects of the data clauses. The copy(b) clause tells the compiler to generate code to do the following:
- Check whether the array b already has a copy on the device.
- If it does, the generated code will use that copy in the device code.
- If it does not, the generated code will allocate device memory corresponding to the array b and copy the data for b from host to device memory. At the end of the compute construct, if the generated code had just allocated b, it will copy the data out from device memory back to host memory and deallocate the device memory.
The check in the first step above is done by the host using a present table, which keeps track of which variables or arrays have already been allocated in device memory and which host variables or arrays they correspond to. When a data directive or compute construct or OpenACC runtime API call allocates device memory, the host updates the present table with the device address and size, and the corresponding host address and size. The behavior of the copyout and copyin clauses are similar, but they allow the user to optimize the data movement between host and device memory.
Now let's look at a similar compute construct with the Fortran 2003 behavior:
!$acc kernels copyout(a) copy(b) copyin(c) a = b b = sin(a) + cos(c) !$acc end kernels
Now the array assignments may allocate, reallocate, or deallocate the left-hand side arrays depending on the allocation status and size of the right-hand side expressions. If the device code allocates or reallocates device memory, the present table is no longer valid. When exiting the compute construct, the host would not know where the data to be copied out is stored, nor the size of the reallocated array.
One solution to this problem would be to disallow unsubscripted allocatable arrays on the left-hand side of array assignments in OpenACC device code. This is unacceptable, particularly since most of the time the arrays are already allocated to the proper size. Instead, in PGI 18.7, we've modified the behavior of these cases to check that the left-hand side arrays are properly allocated and have the right size, and then to perform the assignment. If the allocatable array is not allocated when it should be, or has the wrong size, the code will fail at runtime with an error message. This is mostly reverting to Fortran 95 behavior in device code.
If this becomes a significant issue for device code, we will look in more detail at a solution, although it is likely to come with a significant performance cost. In any case, we recommend that you use (:,:) subscripts in your array assignments, at least on the left-hand side, when you know or expect the arrays to have the right size. This will preserve your performance expectations from Fortran. Remember that you can revert to the old behavior with the -Mallocatable=95 flag.