How to efficiently parallelize these loops, Fortran

OpenACC and CUDA Fortran
Post Reply
afiguer
Posts: 10
Joined: Mar 23 2020

How to efficiently parallelize these loops, Fortran

Post by afiguer » Tue May 19, 2020 7:14 pm

Hi,

I have the following code written in Fortran (I use the PGI 18.3 compiler) and the output from the compiler:

Code: Select all

c$acc  parallel present(unkno,vicop,deltp) 
c
c     -----loop over the points, computing the allowable timestep
c
c$acc  loop  reduction(min:dtmin)  collapse(2)  
       do 1200 ipoiz=izmin,izmax
       do 1210 ipoiy=iymin,iymax
       ipo0z=nxny*(ipoiz-1)
       ipo0y=  nx*(ipoiy-1)+ipo0z
c
       ipii0=ixmin+ipo0y
       ipii1=ixmax+ipo0y
c
c$acc  loop  reduction(min:dtmin)
       do 1220 ipoin=ipii0,ipii1
       dtadv=coudx/max(epsil,csoun
     &                      +cvelo*sqrt(unkno(2,ipoin)*unkno(2,ipoin)
     &                                 +unkno(3,ipoin)*unkno(3,ipoin)
     &                                 +unkno(4,ipoin)*unkno(4,ipoin)))
       dtvis=couvi/max(epsil,vicop(1,ipoin))
       dtcon=couco/max(epsil,vicop(2,ipoin))
       dtpoi=min(dtadv,dtvis,dtcon)
       deltp(ipoin)=dtpoi
       dtmin=min(dtmin,dtpoi)
 1220 continue
 1210 continue
 1200 continue
 c$acc  end parallel
c
       dtmi1=1.001d+00*dtmin
c 
c
c$acc  kernels present(deltp)
c$acc  loop reduction(max:ipmin) collapse(2)  
       do 1400 ipoiz=izmin,izmax
       do 1410 ipoiy=iymin,iymax
       ipo0z=nxny*(ipoiz-1)
       ipo0y=  nx*(ipoiy-1)+ipo0z
c
       ipii0=ixmin+ipo0y
       ipii1=ixmax+ipo0y
c
c!$gpu  parallel do,reduction(max:ipmin)
c
c$acc  loop reduction(max:ipmin)
       do 1420 ipoin=ipii0,ipii1
       if(deltp(ipoin).lt.dtmi1) then
          ipmin=max(ipmin,ipoin)
       endif
 1420 continue
 1410 continue
 1400 continue
c
c$acc  end kernels

Code: Select all

  27100, Generating implicit copy(dtmin)
         Generating present(unkno(:,:),vicop(:,:))
         Accelerator kernel generated
         Generating Tesla code
      27105, !$acc loop gang collapse(2) ! blockidx%x
      27106,   ! blockidx%x collapsed
             Generating reduction(min:dtmin)
      27114, !$acc loop vector(128) ! threadidx%x
             Generating reduction(min:dtmin)
  27100, Generating present(deltp(:))
  27114, Loop is parallelizable
  27132, Generating implicit copy(ipmin)
         Generating present(deltp(:))
  27134, Loop is parallelizable
  27135, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
      27134, !$acc loop gang collapse(2) ! blockidx%x
      27135,   ! blockidx%x collapsed
             Generating reduction(max:ipmin)
      27145, !$acc loop vector(128) ! threadidx%x
             Generating reduction(max:ipmin)
  27145, Loop is parallelizable
This code works just fine, however I would like to create a big kernel with both nested loops included. This loops are called thousands of times sequentially and if both loops are in the same kernel it would be possible to reduce the communication overhead created between cpu and gpu (I saw it using Pgprof).

My first approach is:

Code: Select all

c$acc  kernels present(unkno,vicop,deltp) 
c
c     -----loop over the points, computing the allowable timestep
c
c$acc  loop  reduction(min:dtmin)  collapse(2) 
       do 1200 ipoiz=izmin,izmax
       do 1210 ipoiy=iymin,iymax
       ipo0z=nxny*(ipoiz-1)
       ipo0y=  nx*(ipoiy-1)+ipo0z
c
       ipii0=ixmin+ipo0y
       ipii1=ixmax+ipo0y
c
c$acc  loop  reduction(min:dtmin)
       do 1220 ipoin=ipii0,ipii1
       dtadv=coudx/max(epsil,csoun
     &                      +cvelo*sqrt(unkno(2,ipoin)*unkno(2,ipoin)
     &                                 +unkno(3,ipoin)*unkno(3,ipoin)
     &                                 +unkno(4,ipoin)*unkno(4,ipoin)))
       dtvis=couvi/max(epsil,vicop(1,ipoin))
       dtcon=couco/max(epsil,vicop(2,ipoin))
       dtpoi=min(dtadv,dtvis,dtcon)
       deltp(ipoin)=dtpoi
       dtmin=min(dtmin,dtpoi)
 1220 continue
 1210 continue
 1200 continue
c
       dtmi1=1.001d+00*dtmin
c 
c
c$acc  loop reduction(max:ipmin) collapse(2)  
       do 1400 ipoiz=izmin,izmax
       do 1410 ipoiy=iymin,iymax
       ipo0z=nxny*(ipoiz-1)
       ipo0y=  nx*(ipoiy-1)+ipo0z
c
       ipii0=ixmin+ipo0y
       ipii1=ixmax+ipo0y
c
c!$gpu  parallel do,reduction(max:ipmin)
c
c$acc  loop reduction(max:ipmin)
       do 1420 ipoin=ipii0,ipii1
       if(deltp(ipoin).lt.dtmi1) then
          ipmin=max(ipmin,ipoin)
       endif
 1420 continue
 1410 continue
 1400 continue
c
c$acc  end kernels

Code: Select all

     0, Accelerator kernel generated
         Generating Tesla code
  27100, Generating implicit copy(ipmin)
         Generating present(unkno(:,:),vicop(:,:),deltp(:))
         Generating implicit copy(dtmin)
  27105, Parallelization requires privatization of deltp as well as last value
  27106, Parallelization requires privatization of deltp as well as last value
         Accelerator kernel generated
         Generating Tesla code
      27105, !$acc loop seq collapse(2)
      27106,   collapsed
      27114, !$acc loop seq
  27114, Parallelization requires privatization of deltp as well as last value
  27126, Accelerator serial kernel generated
  27132, Loop is parallelizable
  27133, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
      27132, !$acc loop gang collapse(2) ! blockidx%x
      27133,   ! blockidx%x collapsed
             Generating reduction(max:ipmin)
      27143, !$acc loop vector(128) ! threadidx%x
             Generating reduction(max:ipmin)
  27143, Loop is parallelizable
As it is seen in the compiler report the first loop is of the type "loop seq" and the performance obtained is really poor. When trying to solve this problem I added deltp as a private variable in the first loop:

Code: Select all

c$acc  kernels present(unkno,vicop,deltp) 
c
c     -----loop over the points, computing the allowable timestep
c
c$acc  loop  private(deltp) reduction(min:dtmin)  collapse(2) 
       do 1200 ipoiz=izmin,izmax
       do 1210 ipoiy=iymin,iymax
       ipo0z=nxny*(ipoiz-1)
       ipo0y=  nx*(ipoiy-1)+ipo0z
c
       ipii0=ixmin+ipo0y
       ipii1=ixmax+ipo0y
c
c$acc  loop private(deltp) reduction(min:dtmin)
       do 1220 ipoin=ipii0,ipii1
       dtadv=coudx/max(epsil,csoun
     &                      +cvelo*sqrt(unkno(2,ipoin)*unkno(2,ipoin)
     &                                 +unkno(3,ipoin)*unkno(3,ipoin)
     &                                 +unkno(4,ipoin)*unkno(4,ipoin)))
       dtvis=couvi/max(epsil,vicop(1,ipoin))
       dtcon=couco/max(epsil,vicop(2,ipoin))
       dtpoi=min(dtadv,dtvis,dtcon)
       deltp(ipoin)=dtpoi
       dtmin=min(dtmin,dtpoi)
 1220 continue
 1210 continue
 1200 continue
c
... ... ... 
c
c$acc  end kernels
But I get the following error:

Code: Select all

PGF90-S-0155-A data clause for a variable appears within another region with a data clause for the same variable deltp (fd_genpur.f: 27104)
PGF90-S-0155-A data clause for a variable appears within another region with a data clause for the same variable deltp (fd_genpur.f: 27113)
  0 inform,   0 warnings,   2 severes, 0 fatal for fd_predidtacompd1_acc
I tried adding the private directive in several ways but I always get a similar type of error. Is there any way to create a big kernel with this two loops inside?

Thank you,
Alejandro

mkcolg
Posts: 8382
Joined: Jun 30 2004

Re: How to efficiently parallelize these loops, Fortran

Post by mkcolg » Wed May 20, 2020 1:17 pm

Hi Alejandro,

The deltp array can't be both private and shared (i.e. you have it in both a "present" clause which makes it shared and a "private" clause) so hence the error.

The differences between the first and second cases, is that you're using "parallel" in the first and "kernels" in the second. With "parallel", you're telling the compiler to parallelize the loop. With "kernels" you're telling the compiler to find the parallelism and exploit it. So for kernels, the compiler needs to prove a loop has no dependencies before it can parallelize it.

With this loop:

Code: Select all

c$acc  loop reduction(min:dtmin)
       do 1220 ipoin=ipii0,ipii1
....
       deltp(ipoin)=dtpoi
The compiler must assume the values of "ipoin" overlap which would cause a race condition if parallelized so suggests that the array be privatized. However, you know more than the compiler here in that values of "ipoin" are unique (or at least I assume that this is the case), so you need to give the compiler a bit more information by adding an "independent" clause to the "loop" directive.

Code: Select all

c$acc  loop reduction(min:dtmin) independent
       do 1220 ipoin=ipii0,ipii1
....
       deltp(ipoin)=dtpoi
"independent" is an assertion to the compiler to go ahead and parallelize the loop and not perform any dependency analysis.

Hope this helps,
Mat

afiguer
Posts: 10
Joined: Mar 23 2020

Re: How to efficiently parallelize these loops, Fortran

Post by afiguer » Thu May 21, 2020 9:12 pm

Thank you Mat, I did the following changes in the code:

Code: Select all

c$acc  kernels present(unkno,vicop,deltp) 
c
c     -----loop over the points, computing the allowable timestep
c
c$acc  loop  independent reduction(min:dtmin)  collapse(2) 
       do 1200 ipoiz=izmin,izmax
       do 1210 ipoiy=iymin,iymax
       ipo0z=nxny*(ipoiz-1)
       ipo0y=  nx*(ipoiy-1)+ipo0z
c
       ipii0=ixmin+ipo0y
       ipii1=ixmax+ipo0y
c
c$acc  loop independent reduction(min:dtmin)
       do 1220 ipoin=ipii0,ipii1
       dtadv=coudx/max(epsil,csoun
     &                      +cvelo*sqrt(unkno(2,ipoin)*unkno(2,ipoin)
     &                                 +unkno(3,ipoin)*unkno(3,ipoin)
     &                                 +unkno(4,ipoin)*unkno(4,ipoin)))
       dtvis=couvi/max(epsil,vicop(1,ipoin))
       dtcon=couco/max(epsil,vicop(2,ipoin))
       dtpoi=min(dtadv,dtvis,dtcon)
       deltp(ipoin)=dtpoi
       dtmin=min(dtmin,dtpoi)
 1220 continue
 1210 continue
 1200 continue
c
... ... ... 
c
c$acc  end kernels
And it works perfectly. I was not sure about using an independent and reduction clause in the same loop directive, but it works just fine.

Thank you,
Alejandro

Post Reply