PGI User Forum


PGI attempts to parallelize sequential loop

 
Alexey A. Romanenko



Joined: 17 Feb 2012
Posts: 31

Posted: Mon Aug 27, 2012 5:35 am    Post subject: PGI attempts to parallelize sequential loop

Hi all!

1. My code has four nested loops. To avoid a reduction, I marked the two innermost loops as sequential. According to the compiler output, however, PGI still tries to parallelize the inner loops.

code:
Code:
!$acc data copyout(scf) copyin(Dlocal,Clocal,endpht,lstpht,ilocal)
!$acc kernels
!  Loop over sub-points
!$acc loop independent
        do ispin = 1,nspin  ! <-- 321
!$acc loop independent private(ijl)
           do isp = 1,nsp    ! <-- 325
!$acc loop seq
              do ic = 1,nc     ! <-- 328
                 imp = endpht(ip-1) + ic
                 i = lstpht(imp)
                 il = ilocal(i)
!$acc loop seq
                 do jc = 1,ic   ! <-- 335
                    jl =ilocal(lstpht(endpht(ip-1) + jc)) !ilc(jc)

                    if (il.gt.jl) then
                       ijl = il*(il+1)/2 + jl + 1
                    else
                       ijl = jl*(jl+1)/2 + il + 1
                    endif
                    if (ic .eq. jc) then
                       Dij = Dlocal(ijl,ispin)
                    else
                       Dij = 2*Dlocal(ijl,ispin)
                    endif

                    scf(isp,ip,ispin) = scf(isp,ip,ispin) + &
                        Dij*Clocal(isp,ic) * Clocal(isp,jc)    !Cij(isp)
                 enddo
              enddo
           enddo
        enddo
!$acc end kernels
!$acc end data


output:
Code:
pgfortran -c -acc -ta=nvidia:4.0 -g -Minfo   `FoX/FoX-config --fcflags`   scf.f90
rhoofd:
     94, maxval reduction inlined
    134, Possible copy in and copy out of dscfl in call to matdot
    202, Invariant if transformation
    304, sum reduction inlined
    319, Generating copyout(scf(:,:,:))
         Generating copyin(ilocal(:))
         Generating copyin(lstpht(:))
         Generating copyin(endpht(:))
         Generating copyin(clocal(:,:))
         Generating copyin(dlocal(:,:))
    320, Generating copyin(endpht(:))
         Generating copyin(dlocal(:,:))
         Generating copyin(lstpht(:))
         Generating copyin(ilocal(:))
         Generating copyout(scf(:,:,:))
         Generating copyin(clocal(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
    323, Loop is parallelizable
    325, Loop is parallelizable
    328, Loop carried dependence of 'scf' prevents parallelization
         Loop carried backward dependence of 'scf' prevents vectorization
         Accelerator kernel generated
        323, !$acc loop gang ! blockidx%y
        325, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
        328, CC 1.3 : 27 registers; 224 shared, 4 constant, 0 local memory bytes
             CC 2.0 : 27 registers; 0 shared, 240 constant, 0 local memory bytes
    335, Complex loop carried dependence of 'scf' prevents parallelization
         Loop carried dependence of 'scf' prevents parallelization
         Loop carried backward dependence of 'scf' prevents vectorization


2. Once again, the messages reported for line 320 are confusing: they repeat the data clauses already listed for line 319.

3. BTW, this piece of code produces different results when compiled with and without '-acc'. Any idea why?
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Mon Aug 27, 2012 9:06 am    Post subject: PGI attempts to parallelize sequential loop

Hi Alexey,

Quote:
1. My code has four nested loops. To avoid a reduction, I marked the two innermost loops as sequential. According to the compiler output, however, PGI still tries to parallelize the inner loops.
The compiler is just printing out the analysis information, but isn't actually parallelizing the inner two loops.

Quote:
2. Once again, the messages reported for line 320 are confusing: they repeat the data clauses already listed for line 319.
Yep. These are actually "present" checks, which allow for things like pointer swapping within data regions. The issue is being tracked as TPR#18858.

Quote:
3. BTW, this piece of code produces different results when compiled with and without '-acc'. Any idea why?
I'd need a reproducing example to tell, but I'd start by simplifying things: remove the data region and the loop clauses, then add them back one by one, starting with the outer loop and finishing with the data region.
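As a rough illustration of that first simplification step, here is a minimal sketch of the loop nest with the data region and all loop clauses removed, so that only a bare kernels region remains and data movement is left to the compiler. The array sizes, the input values, the program name, and the checksum print are all invented for illustration, since the real inputs are not shown in the post. The index and Dij logic is condensed with max/min but is meant to follow the posted code.

```fortran
! Hypothetical stand-in for the posted loop nest: the sizes and input
! values below are invented so that the program is self-contained.
program simplify_acc
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nspin = 2, nsp = 8, nc = 4, ip = 1
  integer, parameter :: ntri = nc*(nc+1)/2 + nc + 1   ! largest ijl reached
  integer  :: endpht(0:ip), lstpht(nc), ilocal(nc)
  real(dp) :: Dlocal(ntri,nspin), Clocal(nsp,nc), scf(nsp,ip,nspin)
  real(dp) :: Dij
  integer  :: ispin, isp, ic, jc, il, jl, ijl, k

  ! Invented inputs (assumption: identity index maps, smooth test data)
  endpht = (/ 0, nc /)
  do k = 1, nc
     lstpht(k) = k
     ilocal(k) = k
  enddo
  do ispin = 1, nspin
     do k = 1, ntri
        Dlocal(k,ispin) = 0.01_dp*k + ispin
     enddo
  enddo
  do ic = 1, nc
     do isp = 1, nsp
        Clocal(isp,ic) = 0.1_dp*isp + ic
     enddo
  enddo
  scf = 0.0_dp

  ! Step 1: bare kernels region, no data region, no loop clauses.
  ! Later steps would add "loop independent" to the outer loop, then the
  ! next loop, then "loop seq", and finally the data region, one at a time.
!$acc kernels
  do ispin = 1, nspin
     do isp = 1, nsp
        do ic = 1, nc
           il = ilocal(lstpht(endpht(ip-1) + ic))
           do jc = 1, ic
              jl  = ilocal(lstpht(endpht(ip-1) + jc))
              ! Same packed index as the if/else on il and jl in the post
              ijl = max(il,jl)*(max(il,jl)+1)/2 + min(il,jl) + 1
              Dij = Dlocal(ijl,ispin)
              if (ic /= jc) Dij = 2*Dij   ! off-diagonal terms count twice
              scf(isp,ip,ispin) = scf(isp,ip,ispin) + &
                   Dij*Clocal(isp,ic)*Clocal(isp,jc)
           enddo
        enddo
     enddo
  enddo
!$acc end kernels

  ! Compare this value between the build with -acc and the one without
  print *, 'checksum =', sum(scf)
end program simplify_acc
```

Building the same file twice, once as `pgfortran -Minfo simplify_acc.f90` and once as `pgfortran -acc -Minfo simplify_acc.f90` (the file name is hypothetical), and comparing the printed checksums after each directive is added back should show which directive first makes the two builds disagree.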

Hope this helps,
Mat
Alexey A. Romanenko



Joined: 17 Feb 2012
Posts: 31

Posted: Mon Aug 27, 2012 10:52 pm    Post subject: PGI attempts to parallelize sequential loop

Quote:
Quote:
1. My code has four nested loops. To avoid a reduction, I marked the two innermost loops as sequential. According to the compiler output, however, PGI still tries to parallelize the inner loops.

The compiler is just printing out the analysis information, but isn't actually parallelizing the inner two loops.


In that case, I'd suggest treating these as confusing messages too. I explicitly marked those loops as seq, so I don't expect to see any parallelization info about them.
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Tue Aug 28, 2012 7:20 am    Post subject: PGI attempts to parallelize sequential loop

I've complained about this as well. The problem is that the analysis is done before the directives are applied. Still, I'll pass this along, since customer complaints tend to get higher priority than when I complain ;).

- Mat
All times are GMT - 7 Hours