PGI User Forum
Need help to accelerate

 
ID#cat



Joined: 19 Nov 2012
Posts: 2

Posted: Tue Nov 20, 2012 7:43 am    Post subject: Need help to accelerate

Can anybody help me accelerate this region?

Code:

   168  !$acc data copy(CG) copyout(U,STRESS) copyin(PI,B,N1,N2,NG3,NG2)
   169  !$acc parallel
   170  !$acc loop reduction(+:s11,s21,s31,s12,s22,s32,s13,s23,s33,U)
   171        do J3 = J3min,J3max
   172          if (J3.gt.NG3/2) then
   173            I3 = J3 - NG3
   174          else
   175            I3 = J3
   176          endif
   177  !$acc loop
   178          do J2 = J2min,J2max
   179            if (J2.gt.NG2/2) then
   180              I2 = J2 - NG2
   181            else
   182              I2 = J2
   183            endif
   184  !$acc loop private(g) reduction(CG)
   185            do J1 = 0,N1-1
   186              if (J1.gt.N1/2) then
   187                I1 = J1 - N1
   188              else
   189                I1 = J1
   190              endif
   191              G(1)= B(1,1) * I1 + B(1,2) * I2 + B(1,3) * I3
   192              G(2)= B(2,1) * I1 + B(2,2) * I2 + B(2,3) * I3
   193              G(3)= B(3,1) * I1 + B(3,2) * I2 + B(3,3) * I3
   194              G2 = G(1)**2 + G(2)**2 + G(3)**2
   195              J2L = J2 - J2min
   196              J3L = J3 - J3min
   197              J = 1 + J1 + N1 * J2L + N1 * N2 * J3L
   198              if (G2.LT.G2MAX .AND. G2.GT.TINY) then
   199                VG = 8.0_dp * PI / G2
   200                DU = VG * ( CG(1,J)**2 + CG(2,J)**2 )
   201                U = U + DU
   202                C = 2.0_dp * DU / G2
   203               
   204                 s11 = s11 + C * G(1) * G(1)
   205                 s21 = s21 + C * G(1) * G(2)
   206                 s31 = s31 + C * G(1) * G(3)
   207                 
   208                 s12 = s12 + C * G(2) * G(1)
   209                 s22 = s22 + C * G(2) * G(2)
   210                 s32 = s32 + C * G(2) * G(3)
   211                 
   212                 s13 = s13 + C * G(3) * G(1)
   213                 s23 = s23 + C * G(3) * G(2)
   214                 s33 = s33 + C * G(3) * G(3)
   215
   216  !              DO IX = 1,3
   217  !                DO JX = 1,3
   218  !                  STRESS(JX,IX) = STRESS(JX,IX) + C * G(IX) * G(JX)
   219  !                ENDDO
   220  !              ENDDO
   221
   222                CG(1,J) = VG * CG(1,J)
   223                CG(2,J) = VG * CG(2,J)
   224              else
   225                CG(1,J) = 0.0_dp
   226                CG(2,J) = 0.0_dp
   227              endif
   228            enddo
   229  !$end loop
   230          enddo
   231  !$end loop
   232        enddo
   233  !$end loop
   234  !$acc end parallel
   235  !$acc end data


When I compile with
Code:
 pgfortran -V12.8 -c -g -acc -ta=nvidia:4.2 -Minfo
I get:
Code:

168, Generating copyout(stress(:,:))
         Generating copyout(u)
         Generating copyin(ng2)
         Generating copyin(ng3)
         Generating copyin(n2)
         Generating copyin(n1)
         Generating copyin(b(:,:))
         Generating copyin(pi)
         Generating copy(cg(:,:))
    169, Accelerator kernel generated
        169, CC 1.3 : 64 registers; 48 shared, 252 constant, 8 local memory bytes
             CC 2.0 : 63 registers; 0 shared, 312 constant, 0 local memory bytes
        171, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
    169, Generating copy(cg(:,:))
         Generating copyin(b(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
    178, Loop carried reuse of 'cg' prevents parallelization
    185, Loop carried reuse of 'cg' prevents parallelization
         Complex loop carried dependence of 'cg' prevents parallelization

There is a reduction around CG. How can I work around this problem?
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Tue Nov 20, 2012 12:23 pm

Hi ID#cat,

First, reduction variables must be scalars, so CG can't be used in a reduction clause. However, CG isn't actually being reduced. Rather, the compiler is complaining that it can't tell whether all indices into CG are independent, because the index J is computed.

The major reason the inner loops aren't parallelizing is that you have scalar code between the loops. Try pushing the J3 and J2 if statements inside the J1 loop. In some cases the compiler may perform this transformation automatically when you use "kernels", but it won't when you use "parallel".

You may also try using "kernels" instead of "parallel"; just add the "independent" clause to your loop directives to work around CG's computed-index issue.

Hope this helps,
Mat
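
A minimal sketch of the restructuring described above, under stated assumptions: the module name, the routine name `scale_cg`, its argument list, and the `dp` kind are hypothetical and not taken from the poster's program. The private array G is replaced by three scalars, and only six of the nine stress sums are accumulated because the other three are identical by symmetry. With the index-wrapping tests moved into the innermost loop the nest is tightly nested, so this sketch collapses it; that is one possible choice, not something the thread prescribes. Note that "independent" is a promise to the compiler: it is only safe if every (J1,J2,J3) maps to a distinct J, which assumes J2max - J2min + 1 <= N2.

```fortran
module cg_scale_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)  ! assumed; the post does not show dp
contains
  ! Hypothetical routine: the name and argument list are assumptions.
  subroutine scale_cg(N1, N2, NG2, NG3, J2min, J2max, J3min, J3max, &
                      B, G2MAX, TINY, PI, CG, U, STRESS)
    integer,  intent(in)    :: N1, N2, NG2, NG3
    integer,  intent(in)    :: J2min, J2max, J3min, J3max
    real(dp), intent(in)    :: B(3,3), G2MAX, TINY, PI
    real(dp), intent(inout) :: CG(:,:)
    real(dp), intent(out)   :: U, STRESS(3,3)
    integer  :: J1, J2, J3, I1, I2, I3, J
    real(dp) :: GX, GY, GZ, G2, VG, DU, C
    real(dp) :: s11, s21, s31, s22, s32, s33

    U = 0.0_dp
    s11 = 0.0_dp; s21 = 0.0_dp; s31 = 0.0_dp
    s22 = 0.0_dp; s32 = 0.0_dp; s33 = 0.0_dp

    !$acc kernels copy(CG) copyin(B)
    !$acc loop independent collapse(3) reduction(+:U,s11,s21,s31,s22,s32,s33)
    do J3 = J3min, J3max
      do J2 = J2min, J2max
        do J1 = 0, N1-1
          ! All index wrapping happens here, so no scalar code sits between loops.
          I3 = J3; if (J3 > NG3/2) I3 = J3 - NG3
          I2 = J2; if (J2 > NG2/2) I2 = J2 - NG2
          I1 = J1; if (J1 > N1/2)  I1 = J1 - N1
          GX = B(1,1)*I1 + B(1,2)*I2 + B(1,3)*I3
          GY = B(2,1)*I1 + B(2,2)*I2 + B(2,3)*I3
          GZ = B(3,1)*I1 + B(3,2)*I2 + B(3,3)*I3
          G2 = GX**2 + GY**2 + GZ**2
          ! Computed index: distinct for each (J1,J2,J3) under the assumption above.
          J = 1 + J1 + N1*(J2 - J2min) + N1*N2*(J3 - J3min)
          if (G2 < G2MAX .and. G2 > TINY) then
            VG = 8.0_dp * PI / G2
            DU = VG * (CG(1,J)**2 + CG(2,J)**2)
            U  = U + DU
            C  = 2.0_dp * DU / G2
            s11 = s11 + C*GX*GX
            s21 = s21 + C*GX*GY
            s31 = s31 + C*GX*GZ
            s22 = s22 + C*GY*GY
            s32 = s32 + C*GY*GZ
            s33 = s33 + C*GZ*GZ
            CG(1,J) = VG * CG(1,J)
            CG(2,J) = VG * CG(2,J)
          else
            CG(1,J) = 0.0_dp
            CG(2,J) = 0.0_dp
          end if
        end do
      end do
    end do
    !$acc end kernels

    ! Fill the symmetric 3x3 tensor (column-major order) from the six sums.
    STRESS = reshape([s11, s21, s31,  s21, s22, s32,  s31, s32, s33], [3, 3])
  end subroutine scale_cg
end module cg_scale_sketch
```

Without -acc the directives are plain comments, so the same routine can be run serially to check results against the accelerated build. Whether a particular compiler release accepts collapse together with reduction inside "kernels" is worth confirming from its -Minfo output.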
ID#cat



Joined: 19 Nov 2012
Posts: 2

Posted: Thu Nov 22, 2012 1:08 am

Thank you. I have restructured the loops as follows:
Code:

   169   !$acc data copy(CG) copyout(U,s11,s21,s31,s12,s22,s32,s13,s23,s33)
   170   !$accx copyin(NG3,NG2,N1,N2,PI,B,G2MAX,TINY)
   171   !$acc kernels
   172   !$acc loop independent
   173         do J3 = J3min,J3max
   174   !$acc loop independent
   175           do J2 = J2min,J2max         
   176   !$acc loop reduction(+:s11,s21,s31,s12,s22,s32,s13,s23,s33,U)
   177   !$accx private(G)
   178             do J1 = 0,N1-1
   179               if (J2.gt.NG2/2) then
   180                 I2 = J2 - NG2
   181               else
   182                 I2 = J2
   183               endif
   184              
   185               if (J3.gt.NG3/2) then
   186                 I3 = J3 - NG3
   187               else
   188                 I3 = J3
   189               endif
   190              
   191               if (J1.gt.N1/2) then
   192                 I1 = J1 - N1
   193               else
   194                 I1 = J1
   195               endif


It compiles with:
Code:

169, Generating copyout(s33)
         Generating copyout(s23)
         Generating copyout(s13)
         Generating copyout(s32)
         Generating copyout(s22)
         Generating copyout(s12)
         Generating copyout(s31)
         Generating copyout(s21)
         Generating copyout(s11)
         Generating copyout(u)
         Generating copyin(tiny)
         Generating copyin(g2max)
         Generating copyin(b(:,:))
         Generating copyin(pi)
         Generating copyin(n2)
         Generating copyin(n1)
         Generating copyin(ng2)
         Generating copyin(ng3)
         Generating copy(cg(:,:))
    171, Generating present_or_copy(cg(:,:))
         Generating present_or_copyin(b(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
    173, Loop is parallelizable
    175, Loop is parallelizable
         Accelerator kernel generated
        173, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
        175, !$acc loop gang, vector(64) ! blockidx%x threadidx%x
             CC 1.3 : 46 registers; 264 shared, 40 constant, 0 local memory bytes
             CC 2.0 : 62 registers; 0 shared, 296 constant, 0 local memory bytes
    178, Loop carried reuse of 'cg' prevents parallelization
         Complex loop carried dependence of 'cg' prevents parallelization
         Inner sequential loop scheduled on accelerator


but I get this run-time error:

Code:
0: ALLOCATE: 18446744071562067970 bytes requested; not enough memory
make: *** [completed_work] Error 127
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Mon Nov 26, 2012 10:24 am

Quote:
0: ALLOCATE: 18446744071562067970 bytes requested; not enough memory
make: *** [completed_work] Error 127
Most likely a bogus value is being used when allocating device data. Double-check that the arrays are all allocated on the host before being passed to the device.

If that's not it, I'd need to see the full source to determine whether it's an error in your program or a compiler error. Can you please post a reproducing example or send one to PGI Customer Service (trs@pgroup.com)?

Thanks,
Mat
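
One way to read the number in that error message: 18446744071562067970 equals 2**64 - 2147483646, which is how -2147483646 prints when it is treated as an unsigned 64-bit byte count. That would be consistent with a negative or uninitialized size or extent reaching the allocator, though nothing in the thread confirms the actual cause. The sketch below only reproduces the arithmetic; the module and function names are hypothetical and are not part of any PGI interface.

```fortran
module alloc_size_sketch
  use iso_fortran_env, only: int64
  implicit none
contains
  ! Hypothetical helper: renders a signed 64-bit value as the decimal text of
  ! the same bit pattern read as unsigned, i.e. x + 2**64 when x < 0.
  function as_unsigned_decimal(x) result(text)
    integer(int64), intent(in) :: x
    character(len=20) :: text
    integer(int64), parameter :: base = 10000000000_int64  ! 10**10
    ! 2**64 = 18446744073709551616, split into two 10-digit decimal halves
    integer(int64), parameter :: hi64 = 1844674407_int64
    integer(int64), parameter :: lo64 = 3709551616_int64
    integer(int64) :: hi, lo

    if (x >= 0) then
      write(text, '(I0)') x
    else
      hi = hi64 + x / base       ! integer division truncates toward zero
      lo = lo64 + mod(x, base)   ! mod takes the sign of x, so lo can only shrink
      if (lo < 0) then           ! borrow from the high half
        lo = lo + base
        hi = hi - 1
      end if
      write(text, '(I0,I10.10)') hi, lo
    end if
  end function as_unsigned_decimal
end module alloc_size_sketch
```

Calling `as_unsigned_decimal(-2147483646_int64)` returns the text 18446744071562067970, the byte count shown in the error. Printing the array extents and any size variables on the host just before the data region is a cheap way to check whether one of them is negative.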
All times are GMT - 7 Hours