## How to efficiently parallelize these loops, Fortran

OpenACC and CUDA Fortran
afiguer
Posts: 10
Joined: Mar 23 2020

### How to efficiently parallelize these loops, Fortran

Hi,

I have the following code written in Fortran (I use the PGI 18.3 compiler) and the output from the compiler:

Code: Select all

c\$acc  parallel present(unkno,vicop,deltp)
c
c     -----loop over the points, computing the allowable timestep
c
c\$acc  loop  reduction(min:dtmin)  collapse(2)
do 1200 ipoiz=izmin,izmax
do 1210 ipoiy=iymin,iymax
ipo0z=nxny*(ipoiz-1)
ipo0y=  nx*(ipoiy-1)+ipo0z
c
ipii0=ixmin+ipo0y
ipii1=ixmax+ipo0y
c
c\$acc  loop  reduction(min:dtmin)
do 1220 ipoin=ipii0,ipii1
&                      +cvelo*sqrt(unkno(2,ipoin)*unkno(2,ipoin)
&                                 +unkno(3,ipoin)*unkno(3,ipoin)
&                                 +unkno(4,ipoin)*unkno(4,ipoin)))
dtvis=couvi/max(epsil,vicop(1,ipoin))
dtcon=couco/max(epsil,vicop(2,ipoin))
deltp(ipoin)=dtpoi
dtmin=min(dtmin,dtpoi)
1220 continue
1210 continue
1200 continue
c\$acc  end parallel
c
dtmi1=1.001d+00*dtmin
c
c
c\$acc  kernels present(deltp)
c\$acc  loop reduction(max:ipmin) collapse(2)
do 1400 ipoiz=izmin,izmax
do 1410 ipoiy=iymin,iymax
ipo0z=nxny*(ipoiz-1)
ipo0y=  nx*(ipoiy-1)+ipo0z
c
ipii0=ixmin+ipo0y
ipii1=ixmax+ipo0y
c
c!\$gpu  parallel do,reduction(max:ipmin)
c
c\$acc  loop reduction(max:ipmin)
do 1420 ipoin=ipii0,ipii1
if(deltp(ipoin).lt.dtmi1) then
ipmin=max(ipmin,ipoin)
endif
1420 continue
1410 continue
1400 continue
c
c\$acc  end kernels

Code: Select all

27100, Generating implicit copy(dtmin)
Generating present(unkno(:,:),vicop(:,:))
Accelerator kernel generated
Generating Tesla code
27105, !\$acc loop gang collapse(2) ! blockidx%x
27106,   ! blockidx%x collapsed
Generating reduction(min:dtmin)
27114, !\$acc loop vector(128) ! threadidx%x
Generating reduction(min:dtmin)
27100, Generating present(deltp(:))
27114, Loop is parallelizable
27132, Generating implicit copy(ipmin)
Generating present(deltp(:))
27134, Loop is parallelizable
27135, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
27134, !\$acc loop gang collapse(2) ! blockidx%x
27135,   ! blockidx%x collapsed
Generating reduction(max:ipmin)
27145, !\$acc loop vector(128) ! threadidx%x
Generating reduction(max:ipmin)
27145, Loop is parallelizable
This code works just fine, however I would like to create a big kernel with both nested loops included. This loops are called thousands of times sequentially and if both loops are in the same kernel it would be possible to reduce the communication overhead created between cpu and gpu (I saw it using Pgprof).

My first approach is:

Code: Select all

c\$acc  kernels present(unkno,vicop,deltp)
c
c     -----loop over the points, computing the allowable timestep
c
c\$acc  loop  reduction(min:dtmin)  collapse(2)
do 1200 ipoiz=izmin,izmax
do 1210 ipoiy=iymin,iymax
ipo0z=nxny*(ipoiz-1)
ipo0y=  nx*(ipoiy-1)+ipo0z
c
ipii0=ixmin+ipo0y
ipii1=ixmax+ipo0y
c
c\$acc  loop  reduction(min:dtmin)
do 1220 ipoin=ipii0,ipii1
&                      +cvelo*sqrt(unkno(2,ipoin)*unkno(2,ipoin)
&                                 +unkno(3,ipoin)*unkno(3,ipoin)
&                                 +unkno(4,ipoin)*unkno(4,ipoin)))
dtvis=couvi/max(epsil,vicop(1,ipoin))
dtcon=couco/max(epsil,vicop(2,ipoin))
deltp(ipoin)=dtpoi
dtmin=min(dtmin,dtpoi)
1220 continue
1210 continue
1200 continue
c
dtmi1=1.001d+00*dtmin
c
c
c\$acc  loop reduction(max:ipmin) collapse(2)
do 1400 ipoiz=izmin,izmax
do 1410 ipoiy=iymin,iymax
ipo0z=nxny*(ipoiz-1)
ipo0y=  nx*(ipoiy-1)+ipo0z
c
ipii0=ixmin+ipo0y
ipii1=ixmax+ipo0y
c
c!\$gpu  parallel do,reduction(max:ipmin)
c
c\$acc  loop reduction(max:ipmin)
do 1420 ipoin=ipii0,ipii1
if(deltp(ipoin).lt.dtmi1) then
ipmin=max(ipmin,ipoin)
endif
1420 continue
1410 continue
1400 continue
c
c\$acc  end kernels

Code: Select all

0, Accelerator kernel generated
Generating Tesla code
27100, Generating implicit copy(ipmin)
Generating present(unkno(:,:),vicop(:,:),deltp(:))
Generating implicit copy(dtmin)
27105, Parallelization requires privatization of deltp as well as last value
27106, Parallelization requires privatization of deltp as well as last value
Accelerator kernel generated
Generating Tesla code
27105, !\$acc loop seq collapse(2)
27106,   collapsed
27114, !\$acc loop seq
27114, Parallelization requires privatization of deltp as well as last value
27126, Accelerator serial kernel generated
27132, Loop is parallelizable
27133, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
27132, !\$acc loop gang collapse(2) ! blockidx%x
27133,   ! blockidx%x collapsed
Generating reduction(max:ipmin)
27143, !\$acc loop vector(128) ! threadidx%x
Generating reduction(max:ipmin)
27143, Loop is parallelizable
As it is seen in the compiler report the first loop is of the type "loop seq" and the performance obtained is really poor. When trying to solve this problem I added deltp as a private variable in the first loop:

Code: Select all

c\$acc  kernels present(unkno,vicop,deltp)
c
c     -----loop over the points, computing the allowable timestep
c
c\$acc  loop  private(deltp) reduction(min:dtmin)  collapse(2)
do 1200 ipoiz=izmin,izmax
do 1210 ipoiy=iymin,iymax
ipo0z=nxny*(ipoiz-1)
ipo0y=  nx*(ipoiy-1)+ipo0z
c
ipii0=ixmin+ipo0y
ipii1=ixmax+ipo0y
c
c\$acc  loop private(deltp) reduction(min:dtmin)
do 1220 ipoin=ipii0,ipii1
&                      +cvelo*sqrt(unkno(2,ipoin)*unkno(2,ipoin)
&                                 +unkno(3,ipoin)*unkno(3,ipoin)
&                                 +unkno(4,ipoin)*unkno(4,ipoin)))
dtvis=couvi/max(epsil,vicop(1,ipoin))
dtcon=couco/max(epsil,vicop(2,ipoin))
deltp(ipoin)=dtpoi
dtmin=min(dtmin,dtpoi)
1220 continue
1210 continue
1200 continue
c
... ... ...
c
c\$acc  end kernels
But I get the following error:

Code: Select all

PGF90-S-0155-A data clause for a variable appears within another region with a data clause for the same variable deltp (fd_genpur.f: 27104)
PGF90-S-0155-A data clause for a variable appears within another region with a data clause for the same variable deltp (fd_genpur.f: 27113)
0 inform,   0 warnings,   2 severes, 0 fatal for fd_predidtacompd1_acc
I tried adding the private directive in several ways but I always get a similar type of error. Is there any way to create a big kernel with this two loops inside?

Thank you,
Alejandro

mkcolg
Posts: 8382
Joined: Jun 30 2004

### Re: How to efficiently parallelize these loops, Fortran

Hi Alejandro,

The deltp array can't be both private and shared (i.e. you have it in both a "present" clause which makes it shared and a "private" clause) so hence the error.

The differences between the first and second cases, is that you're using "parallel" in the first and "kernels" in the second. With "parallel", you're telling the compiler to parallelize the loop. With "kernels" you're telling the compiler to find the parallelism and exploit it. So for kernels, the compiler needs to prove a loop has no dependencies before it can parallelize it.

With this loop:

Code: Select all

c\$acc  loop reduction(min:dtmin)
do 1220 ipoin=ipii0,ipii1
....
deltp(ipoin)=dtpoi
The compiler must assume the values of "ipoin" overlap which would cause a race condition if parallelized so suggests that the array be privatized. However, you know more than the compiler here in that values of "ipoin" are unique (or at least I assume that this is the case), so you need to give the compiler a bit more information by adding an "independent" clause to the "loop" directive.

Code: Select all

c\$acc  loop reduction(min:dtmin) independent
do 1220 ipoin=ipii0,ipii1
....
deltp(ipoin)=dtpoi
"independent" is an assertion to the compiler to go ahead and parallelize the loop and not perform any dependency analysis.

Hope this helps,
Mat

afiguer
Posts: 10
Joined: Mar 23 2020

### Re: How to efficiently parallelize these loops, Fortran

Thank you Mat, I did the following changes in the code:

Code: Select all

c\$acc  kernels present(unkno,vicop,deltp)
c
c     -----loop over the points, computing the allowable timestep
c
c\$acc  loop  independent reduction(min:dtmin)  collapse(2)
do 1200 ipoiz=izmin,izmax
do 1210 ipoiy=iymin,iymax
ipo0z=nxny*(ipoiz-1)
ipo0y=  nx*(ipoiy-1)+ipo0z
c
ipii0=ixmin+ipo0y
ipii1=ixmax+ipo0y
c
c\$acc  loop independent reduction(min:dtmin)
do 1220 ipoin=ipii0,ipii1
&                      +cvelo*sqrt(unkno(2,ipoin)*unkno(2,ipoin)
&                                 +unkno(3,ipoin)*unkno(3,ipoin)
&                                 +unkno(4,ipoin)*unkno(4,ipoin)))
dtvis=couvi/max(epsil,vicop(1,ipoin))
dtcon=couco/max(epsil,vicop(2,ipoin))