I cannot get ACC ROUTINE to work

OpenACC and CUDA Fortran
chasmotron
Posts: 2
Joined: Feb 27 2020

I cannot get ACC ROUTINE to work

Post by chasmotron » Wed May 20, 2020 7:51 am

I've taken the laplace2d example problem from the NVIDIA repository and modified it in two ways:

1. All of the data is in a module with an !$acc declare create, so the accelerator data should have global scope over the entire problem.
2. The main routine does the data transfer to the accelerator, but the work is done in subroutine calls.

The compiler tells me the subroutine is accelerated:

bash-4.2$ /opt/pgi/20.4/linux86-64/20.4/bin/pgf90 -acc -ta=tesla:cc60 -Minfo=accel laplace2d5.f90
work:
     14, Generating Tesla code
         22, !$acc loop gang ! blockidx%x
         24, !$acc loop worker, vector ! threadidx%y threadidx%x
         31, !$acc loop gang ! blockidx%x
         33, !$acc loop worker, vector ! threadidx%y threadidx%x
     24, Loop is parallelizable
     33, Loop is parallelizable
laplace:
     72, Generating update device(a(:,:))
     78, Generating update self(a(:,:))
bash-4.2$

The results are incorrect, as though the data transfer in the main routine had never happened. I assume I'm missing something in my routine declaration, but I have no idea what. The code follows:

module data_mod

integer, parameter :: fp_kind=kind(1.0)
integer, parameter :: n=4096, m=4096
integer :: i, j, iter
real(fp_kind), dimension (:,:), allocatable :: A, Anew
real(fp_kind), dimension (:), allocatable :: y0
real(fp_kind) :: pi=3.14159265

!$acc declare create(n, m, i, j, A, Anew, y0)

end module data_mod

subroutine work

use data_mod, only: fp_kind, n, m, i, j, A, Anew

!$acc routine gang device_type(acc_device_nvidia)
!$acc declare present(fp_kind, n, m, i, j, A, Anew)

!$acc loop
do j=1,m-2
!$acc loop
do i=1,n-2
Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
A(i ,j-1) + A(i ,j+1) )
end do
end do

!$acc loop
do j=1,m-2
!$acc loop
do i=1,n-2
A(i,j) = Anew(i,j)
end do
end do

return

end subroutine work

program laplace
use data_mod, only: fp_kind, n, m, i, j, iter, A, Anew, &
& y0, pi

!$acc routine (work) gang

implicit none
real(fp_kind) :: start_time, stop_time

allocate ( A(0:n-1,0:m-1), Anew(0:n-1,0:m-1) )
allocate ( y0(0:m-1) )

A = 0.0_fp_kind

! Set B.C.
y0 = sin(pi* (/ (j,j=0,m-1) /) /(m-1))

A(0,:) = 0.0_fp_kind
A(n-1,:) = 0.0_fp_kind
Anew(0,1:m-1) = 0.0_fp_kind
Anew(n-1,1:m-1) = 0.0_fp_kind
A(:,0) = y0
A(:,m-1) = y0*exp(-pi)
Anew(1:n-1,0) = y0(i)
Anew(1:n-1,m-1) = y0(i)*exp(-pi)

write(*,'(a,i5,a,i5,a)') 'Jacobi relaxation Calculation:', n, ' x', m, ' mesh'

call cpu_time(start_time)

!$acc update device(A)
do iter = 1,1000

call work

end do
!$acc update self(A)

call cpu_time(stop_time)
write(*,'(a,f10.3,a)') ' completed in ', stop_time-start_time, ' seconds'

WRITE(*,'(10F6.3)') ((A(I,J),J=0,9),I=0,n-1,n/16)

deallocate (A,Anew,y0)
end program laplace

mkcolg
Posts: 8382
Joined: Jun 30 2004

Re: I cannot get ACC ROUTINE to work

Post by mkcolg » Wed May 20, 2020 5:20 pm

Hi chasmotron,

The problem here is that you're not calling the device routine; you're calling the host routine, so nothing is running on the device. A device routine must be called from within a compute region or from another device routine.
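For example (a minimal sketch, not the full fix), a bare call inside the iteration loop only runs the host copy of work; wrapping the call site in a compute construct is what launches the device copy:

Code: Select all

! Host call: runs the host version of work; the device data is untouched.
call work

! Device call: work runs on the GPU because the call site is inside
! a compute construct.
!$acc parallel num_gangs(m-2)
call work
!$acc end parallel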

You had some other issues as well:

1) Parameters shouldn't be put in a data clause since they have no storage; the parameter basically just gets replaced by its literal value. It won't cause an error, just a warning.
2) Putting variables in data regions (declare, in this case) makes the variable shared by all threads. That's fine for the arrays, since each thread only accesses a single element, but it's problematic for the i and j scalars, since sharing them would be a race condition. It's best to make these local variables so the compiler will make private copies for each thread.
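As a sketch of that second point (declarations only; the loop nests are in the full listing below), the contrast looks like this:

Code: Select all

! Race condition: module-level loop indices in a declare directive are
! shared by every device thread.
!    integer :: i, j
!    !$acc declare create(i, j)

! Correct: declare the indices locally inside the device routine and the
! compiler privatizes them for each thread.
subroutine work
   use data_mod, only: fp_kind, n, m, A, Anew
   integer :: i, j     ! local, so each thread gets its own copy
!$acc routine gang
   ! ... loop nests as in the full listing below ...
end subroutine work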

With these changes, the GPU time is 0.655 seconds on my V100, versus 19.66 seconds on a single Skylake core.

Code: Select all

% cat laplace.f90
module data_mod

integer, parameter :: fp_kind=kind(1.0)
integer, parameter :: n=4096, m=4096
real(fp_kind), dimension (:,:), allocatable :: A, Anew
real(fp_kind), dimension (:), allocatable :: y0
real(fp_kind) :: pi=3.14159265

!$acc declare create(A, Anew, y0)

end module data_mod

subroutine work

use data_mod, only: fp_kind, n, m, A, Anew
integer :: i, j

!$acc routine gang
!$acc loop
do j=1,m-2
!$acc loop
do i=1,n-2
Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
A(i ,j-1) + A(i ,j+1) )
end do
end do

!$acc loop
do j=1,m-2
!$acc loop
do i=1,n-2
A(i,j) = Anew(i,j)
end do
end do

return

end subroutine work

program laplace
use data_mod, only: fp_kind, n, m, A, Anew, &
& y0, pi

!$acc routine (work) gang
implicit none
real(fp_kind) :: start_time, stop_time
integer iter, i, j

allocate ( A(0:n-1,0:m-1), Anew(0:n-1,0:m-1) )
allocate ( y0(0:m-1) )

A = 0.0_fp_kind

! Set B.C.
y0 = sin(pi* (/ (j,j=0,m-1) /) /(m-1))

A(0,:) = 0.0_fp_kind
A(n-1,:) = 0.0_fp_kind
Anew(0,1:m-1) = 0.0_fp_kind
Anew(n-1,1:m-1) = 0.0_fp_kind
A(:,0) = y0
A(:,m-1) = y0*exp(-pi)
Anew(1:n-1,0) = y0(i)
Anew(1:n-1,m-1) = y0(i)*exp(-pi)

write(*,'(a,i5,a,i5,a)') 'Jacobi relaxation Calculation:', n, ' x', m, ' mesh'

call cpu_time(start_time)

!$acc update device(A)
do iter = 1,1000
!$acc parallel num_gangs(m-2)
call work
!$acc end parallel
end do
!$acc update self(A)

call cpu_time(stop_time)
write(*,'(a,f10.3,a)') ' completed in ', stop_time-start_time, ' seconds'

WRITE(*,'(10F6.3)') ((A(I,J),J=0,9),I=0,n-1,n/16)

deallocate (A,Anew,y0)
end program laplace
% pgfortran laplace.f90 -V20.4 -fast -o cpu.out
% pgfortran laplace.f90 -V20.4 -fast -ta=tesla -Minfo=accel -o acc.out
work:
     19, Loop is parallelizable
     20, Loop is parallelizable
         Generating Tesla code
         19, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
         20, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     26, Loop is parallelizable
     27, Loop is parallelizable
         Generating Tesla code
         26, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
         27, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
laplace:
     66, Generating update device(a(:,:))
     70, Generating update self(a(:,:))
% ./cpu.out
Jacobi relaxation Calculation: 4096 x 4096 mesh
 completed in     19.666 seconds
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.195 0.188 0.181 0.174 0.167 0.161 0.154 0.147 0.141 0.134
 0.383 0.369 0.355 0.342 0.328 0.315 0.302 0.289 0.276 0.263
 0.556 0.536 0.516 0.496 0.477 0.457 0.438 0.419 0.400 0.382
 0.707 0.682 0.657 0.632 0.607 0.582 0.558 0.533 0.510 0.486
 0.832 0.802 0.772 0.743 0.714 0.684 0.656 0.627 0.599 0.572
 0.924 0.891 0.858 0.825 0.793 0.761 0.729 0.697 0.666 0.635
 0.981 0.946 0.911 0.876 0.842 0.807 0.773 0.740 0.707 0.674
 1.000 0.964 0.929 0.893 0.858 0.823 0.788 0.754 0.721 0.687
 0.981 0.946 0.911 0.876 0.841 0.807 0.773 0.740 0.707 0.674
 0.924 0.891 0.858 0.825 0.793 0.760 0.728 0.697 0.666 0.635
 0.831 0.802 0.772 0.742 0.713 0.684 0.655 0.627 0.599 0.571
 0.707 0.681 0.656 0.631 0.606 0.582 0.557 0.533 0.509 0.486
 0.555 0.535 0.515 0.496 0.476 0.457 0.438 0.419 0.400 0.382
 0.382 0.368 0.355 0.341 0.328 0.314 0.301 0.288 0.275 0.263
 0.194 0.187 0.181 0.174 0.167 0.160 0.153 0.147 0.140 0.134
% acc.out
Jacobi relaxation Calculation: 4096 x 4096 mesh
 completed in      0.606 seconds
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.195 0.188 0.181 0.174 0.167 0.161 0.154 0.147 0.141 0.134
 0.383 0.369 0.355 0.342 0.328 0.315 0.302 0.289 0.276 0.263
 0.556 0.536 0.516 0.496 0.477 0.457 0.438 0.419 0.400 0.382
 0.707 0.682 0.657 0.632 0.607 0.582 0.558 0.533 0.510 0.486
 0.832 0.802 0.772 0.743 0.714 0.684 0.656 0.627 0.599 0.572
 0.924 0.891 0.858 0.825 0.793 0.761 0.729 0.697 0.666 0.635
 0.981 0.946 0.911 0.876 0.842 0.807 0.773 0.740 0.707 0.674
 1.000 0.964 0.929 0.893 0.858 0.823 0.788 0.754 0.721 0.687
 0.981 0.946 0.911 0.876 0.841 0.807 0.773 0.740 0.707 0.674
 0.924 0.891 0.858 0.825 0.793 0.760 0.728 0.697 0.666 0.635
 0.831 0.802 0.772 0.742 0.713 0.684 0.655 0.627 0.599 0.571
 0.707 0.681 0.656 0.631 0.606 0.582 0.557 0.533 0.509 0.486
 0.555 0.535 0.515 0.496 0.476 0.457 0.438 0.419 0.400 0.382
 0.382 0.368 0.355 0.341 0.328 0.314 0.301 0.288 0.275 0.263
 0.194 0.187 0.181 0.174 0.167 0.160 0.153 0.147 0.140 0.134
Note that you'll be better off putting the compute regions inside work, rather than using a gang routine, since the compiler then has more information about how to schedule the kernel. You'll see that I had to add a "num_gangs" clause in the example above, since the runtime doesn't know how big the i loop is outside of the routine.

Here's the same code, but with a kernels region inside the work routine instead.

Code: Select all

% cat laplace.2.f90
module data_mod

integer, parameter :: fp_kind=kind(1.0)
integer, parameter :: n=4096, m=4096
real(fp_kind), dimension (:,:), allocatable :: A, Anew
real(fp_kind), dimension (:), allocatable :: y0
real(fp_kind) :: pi=3.14159265

!$acc declare create(A, Anew, y0)

end module data_mod

subroutine work

use data_mod, only: fp_kind, n, m, A, Anew
integer :: i, j

!$acc kernels
do j=1,m-2
do i=1,n-2
Anew(i,j) = 0.25_fp_kind * ( A(i+1,j ) + A(i-1,j ) + &
A(i ,j-1) + A(i ,j+1) )
end do
end do

do j=1,m-2
do i=1,n-2
A(i,j) = Anew(i,j)
end do
end do
!$acc end kernels
return

end subroutine work

program laplace
use data_mod, only: fp_kind, n, m, A, Anew, &
& y0, pi

!$acc routine (work) gang
implicit none
real(fp_kind) :: start_time, stop_time
integer iter, i, j

allocate ( A(0:n-1,0:m-1), Anew(0:n-1,0:m-1) )
allocate ( y0(0:m-1) )

A = 0.0_fp_kind

! Set B.C.
y0 = sin(pi* (/ (j,j=0,m-1) /) /(m-1))

A(0,:) = 0.0_fp_kind
A(n-1,:) = 0.0_fp_kind
Anew(0,1:m-1) = 0.0_fp_kind
Anew(n-1,1:m-1) = 0.0_fp_kind
A(:,0) = y0
A(:,m-1) = y0*exp(-pi)
Anew(1:n-1,0) = y0(i)
Anew(1:n-1,m-1) = y0(i)*exp(-pi)

write(*,'(a,i5,a,i5,a)') 'Jacobi relaxation Calculation:', n, ' x', m, ' mesh'

call cpu_time(start_time)

!$acc update device(A)
do iter = 1,1000
call work
end do
!$acc update self(A)

call cpu_time(stop_time)
write(*,'(a,f10.3,a)') ' completed in ', stop_time-start_time, ' seconds'

WRITE(*,'(10F6.3)') ((A(I,J),J=0,9),I=0,n-1,n/16)

deallocate (A,Anew,y0)
end program laplace

% pgfortran laplace.2.f90 -V20.4 -fast -ta=tesla -Minfo=accel -o acc2.out
work:
     19, Loop is parallelizable
     20, Loop is parallelizable
         Generating Tesla code
         19, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
         20, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     26, Loop is parallelizable
     27, Loop is parallelizable
         Generating Tesla code
         26, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
         27, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
laplace:
     66, Generating update device(a(:,:))
     70, Generating update self(a(:,:))
% ./acc2.out
Jacobi relaxation Calculation: 4096 x 4096 mesh
 completed in      0.455 seconds
 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 0.195 0.188 0.181 0.174 0.167 0.161 0.154 0.147 0.141 0.134
 0.383 0.369 0.355 0.342 0.328 0.315 0.302 0.289 0.276 0.263
 0.556 0.536 0.516 0.496 0.477 0.457 0.438 0.419 0.400 0.382
 0.707 0.682 0.657 0.632 0.607 0.582 0.558 0.533 0.510 0.486
 0.832 0.802 0.772 0.743 0.714 0.684 0.656 0.627 0.599 0.572
 0.924 0.891 0.858 0.825 0.793 0.761 0.729 0.697 0.666 0.635
 0.981 0.946 0.911 0.876 0.842 0.807 0.773 0.740 0.707 0.674
 1.000 0.964 0.929 0.893 0.858 0.823 0.788 0.754 0.721 0.687
 0.981 0.946 0.911 0.876 0.841 0.807 0.773 0.740 0.707 0.674
 0.924 0.891 0.858 0.825 0.793 0.760 0.728 0.697 0.666 0.635
 0.831 0.802 0.772 0.742 0.713 0.684 0.655 0.627 0.599 0.571
 0.707 0.681 0.656 0.631 0.606 0.582 0.557 0.533 0.509 0.486
 0.555 0.535 0.515 0.496 0.476 0.457 0.438 0.419 0.400 0.382
 0.382 0.368 0.355 0.341 0.328 0.314 0.301 0.288 0.275 0.263
 0.194 0.187 0.181 0.174 0.167 0.160 0.153 0.147 0.140 0.134

Hope this helps,
Mat

chasmotron
Posts: 2
Joined: Feb 27 2020

Re: I cannot get ACC ROUTINE to work

Post by chasmotron » Thu May 21, 2020 6:48 am

Mat,

Thank you so much for taking the time to help me out. It is very much appreciated!
