PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

help understanding compiler information
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
brush



Joined: 26 Jun 2012
Posts: 44

PostPosted: Mon Jul 02, 2012 7:17 pm    Post subject: help understanding compiler information Reply with quote

Hi, I am trying to optimize a few routines in a fortran code, but I am having trouble getting parrellization in a couple of them.

In one routine, grid, I have put compute directives around the main computational nested do loop. Here is the compiler feedback:

Code:
grid:
    550, Loop interchange produces reordered loop nest: 551,550
         Loop unrolled 33 times (completely unrolled)
    551, Loop not vectorized: may not be beneficial
         Unrolled inner loop 8 times
         Residual loop unrolled 1 times (completely unrolled)
    561, Generating copyin(rwx(:lr))
         Generating copyin(x3(:))
         Generating copyin(rwy(:lr))
         Generating copyin(y3(:))
         Generating copy(den(:,:))
         Generating copyin(w3(:))
         Generating copyin(mu(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
    562, Loop carried dependence due to exposed use of 'den(:,:)' prevents parallelization
         Accelerator kernel generated
        562, !$acc do seq
             Non-stride-1 accesses for array 'y3'
             Non-stride-1 accesses for array 'x3'
             Non-stride-1 accesses for array 'mu'
             Non-stride-1 accesses for array 'w3'
             CC 1.0 : 34 registers; 112 shared, 48 constant, 0 local memory bytes; 12% occupancy
             CC 2.0 : 26 registers; 0 shared, 148 constant, 0 local memory bytes; 16% occupancy
    569, Complex loop carried dependence of 'den' prevents parallelization
         Loop carried dependence due to exposed use of 'den(:,:)' prevents parallelization
    604, Loop unrolled 33 times (completely unrolled)
    609, Loop unrolled 33 times (completely unrolled)
    614, Loop interchange produces reordered loop nest: 615,614
         Loop unrolled 33 times (completely unrolled)
    615, Loop not vectorized: may not be beneficial


I am particularly interested in what exposed use of 'den(:,:)' and complex loop carried dep of den(:,:) mean as I expect these are the more important pieces. Is it possible I can parallelize my code and any advice on how I can do so?


The second routine yields the following compiler info:
Code:
cpush:
    385, Generating copy(w2(:))
         Generating copy(y2(:))
         Generating copy(x2(:))
         Generating copy(nos(n))
         Generating copy(ke(n))
         Generating copy(efl(n))
         Generating copy(pfl(n))
         Generating copy(w1(:))
         Generating copy(w3(:))
         Generating copy(y1(:))
         Generating copy(x1(:))
         Generating copy(u2(:))
         Generating copyout(u3(:))
         Generating copy(u1(:))
         Generating copyin(rwx(:lr))
         Generating copy(x3(:))
         Generating copyin(rwy(:lr))
         Generating copy(y3(:))
         Generating copyin(ex(:,:))
         Generating copyin(ey(:,:))
         Generating copyin(mu(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
    386, Complex loop carried dependence of 'pfl' prevents parallelization
         Loop carried dependence due to exposed use of 'pfl(n)' prevents parallelization
         Complex loop carried dependence of 'efl' prevents parallelization
         Loop carried dependence due to exposed use of 'efl(n)' prevents parallelization
         Complex loop carried dependence of 'ke' prevents parallelization
         Loop carried dependence due to exposed use of 'ke(n)' prevents parallelization
         Complex loop carried dependence of 'nos' prevents parallelization
         Loop carried dependence due to exposed use of 'nos(n)' prevents parallelization
         Accelerator kernel generated
        386, !$acc do seq
             Non-stride-1 accesses for array 'w2'
             Non-stride-1 accesses for array 'y2'
             Non-stride-1 accesses for array 'x2'
             Non-stride-1 accesses for array 'w1'
             Non-stride-1 accesses for array 'w3'
             Non-stride-1 accesses for array 'y3'
             Non-stride-1 accesses for array 'y1'
             Non-stride-1 accesses for array 'x3'
             Non-stride-1 accesses for array 'x1'
             Non-stride-1 accesses for array 'u2'
             Non-stride-1 accesses for array 'u3'
             Non-stride-1 accesses for array 'u1'
             Non-stride-1 accesses for array 'mu'
             CC 1.0 : 43 registers; 256 shared, 44 constant, 0 local memory bytes; 8% occupancy
             CC 2.0 : 37 registers; 0 shared, 292 constant, 0 local memory bytes; 16% occupancy
    393, Loop is parallelizable



I could also post the subroutines if that would be helpful.

Thanks for any advice/documentation I should read,
Ben
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

PostPosted: Tue Jul 03, 2012 10:08 am    Post subject: Reply with quote

Hi Ben,
Quote:
I am particularly interested in what exposed use of 'den(:,:)' and complex loop carried dep of den(:,:) mean as I expect these are the more important pieces. Is it possible I can parallelize my code and any advice on how I can do so?
By default, arrays are shared by all threads. Hence, to avoid race conditions, all threads can only write to unique elements in the array. The loop carried dependency message means the multiple threads could be accessing duplicate elements of "den".

I'd need to see the code to know exactly what the dependency is. However, if you are using a computed index or index loop-up table, the compiler can't always tell that every index is unique. If you know that all the calculated index are unique, then you can use the loop "independent" clause to assert that loop is indeed independent.

Another possibility is that "den" is a temporary array used to hold intermediary values within the body of the loop. In this case, you would need to privatize the array (via the "private" clause) to have the compiler create a private copy of the array for every thread.

If this is C, then you may need to add the C99 "restrict" keyword to your pointers, or use the flag "-Msafeptr" to assert to the compiler that the memory your pointers point to don't overlap with other pointers.

Finally, "den" could really have a loop dependency rendering the loop non-parallelizable. In this case you would need to rewrite your code to be parallel.

Quote:
I could also post the subroutines if that would be helpful.
If the above doesn't help, please do post an example.

- Mat
Back to top
View user's profile
brush



Joined: 26 Jun 2012
Posts: 44

PostPosted: Mon Jul 09, 2012 3:12 pm    Post subject: Reply with quote

Thanks Matt, that was very helpful.

I'm not exactly sure what the situation is, but I have tried both independent and private directives, but I've gotten bad output or slower/same times. I am thinking there's truly a dependancy. If you wouldn't mind taking a look at grid, which corresponds to the first compiler info block in my original post (I added a compute directive around the main loop):

Code:

         subroutine grid

c    source quantities are are calculated: n_i
c    right now only ion quantitities are calculated...
c
         include  'slab.h'
c
         integer m,i,j,l
         real wx0,wx1,wy0,wy1
         real wght
         real xt,yt
         real rhog
c
c     here we set the rho and den equal to zero.
c

c

c
           do 50 i=0,im
              do 60 j=0,jm
                    den(i,j)=0.
 60           continue
 50        continue
c
c     note: that rho and density will be set the same at the end
c           density is a more accurate description of the
c                   quantity than rho.
c
            dv=(dx*dy)
c$acc region
             do 100 m=1,mm
c
                wght=w3(m)/dv/float(lr)
           rhog=sqrt(mu(m))/mims
c
c    continue with calculating grid quantities...
c
   do l=1,lr
                xt=x3(m)+rwx(l)*rhog
                yt=y3(m)+rwy(l)*rhog
c
                if(xt.gt.lx)  xt=2.*lx-xt
                if(xt.lt.0.)  xt=-xt
                if(xt.eq.lx)  xt=0.9999*lx
                if(yt.ge.ly)  yt=yt-ly
                if(yt.lt.0.)  yt=yt+ly

   if (ngp.eq.1) then
                i=int(xt/dx+0.5)
                j=int(yt/dy+0.5)
                den(i,j)=den(i,j)+wght
c     
   else
                i=int(xt/dx)
                j=int(yt/dy)
c     
                wx0=float(i+1)-xt/dx
                wx1=1.-wx0
                wy0=float(j+1)-yt/dy
                wy1=1.-wy0
c
                den(i,j)      =den(i,j)      +wght*wx0*wy0
                den(i+1,j)    =den(i+1,j)    +wght*wx1*wy0
                den(i,j+1)    =den(i,j+1)    +wght*wx0*wy1
                den(i+1,j+1)  =den(i+1,j+1)  +wght*wx1*wy1
   endif
c
   enddo ! end loop over 4pt avg
c
  100         continue
c$acc end region
c
              do 300 j=0,jm
                    den(0,j)=( den(0,j)+den(im,j) )
                    den(im,j)=den(0,j)
  300          continue
c
               do 320 i=0,im
                     den(i,0)=(den(i,0)+den(i,jm))
                     den(i,jm)=den(i,0)
  320               continue
c
                  do 410 i=0,im
                     do 420 j=0,jm
                           den(i,j)=q*den(i,j)/n0
  420                 continue
  410              continue
c
        return
        end
c
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

PostPosted: Mon Jul 09, 2012 4:12 pm    Post subject: Reply with quote

Hi Ben,

The code is calculating the i and j index so you would need to use the "independent" clause to get the code to accelerate. However as you note, you'll most likely get wrong answers if not all i and j's are unique. Worse, you also need the i,i+1 and j,j+1 pairs to be unique due to the forward references, i.e. "den(i+1,j)", "den(i,j+1)", "den(i+1,j+1)" and they need to be unique for every iteration of the m loop.

One thing you could do with this loop is to manually privatize "den" by adding a third dimension with the first dimension holding the "m" index, i.e. "denTmp(m,i,j)" and then perform a reduction from the 3D to 2D array after the m loop. However, only the m loop would be run in parallel so it's loop bounds, mm, would need to be sufficiently large for the parallelization to be profitable.

Note that if you do decide to create a denTmp array, be sure to put it in a "local" clause so it doesn't get copied to/from the device. You'll then need to put the initialization and reduction on the device, but that should be straight forward.

Below is the basic outline of what I'm thinking. Caveat is I didn't test this so could have some typos, but hopefully it will illustrate what I have in mind.

- Mat
Code:

         subroutine grid

c    source quantities are are calculated: n_i
c    right now only ion quantitities are calculated...
c
         include  'slab.h'
c
         integer m,i,j,l
         real wx0,wx1,wy0,wy1
         real wght
         real xt,yt
         real rhog
         real denTmp(0:mm,0:im,0:jm) 
c
c     here we set the rho and den equal to zero.
c

c

c
 !$acc data region copyout(den) local(denTmp)

!$acc region
       do 44 m=0,mm
           do 55 i=0,im
              do 66 j=0,jm
                    denTmp(m,i,j)=0.
 66          continue
 55        continue
 44        continue
!$acc end region


c
c     note: that rho and density will be set the same at the end
c           density is a more accurate description of the
c                   quantity than rho.
c
            dv=(dx*dy)
c$acc region
             do 100 m=1,mm
c
                wght=w3(m)/dv/float(lr)
           rhog=sqrt(mu(m))/mims
c
c    continue with calculating grid quantities...
c
   do l=1,lr
                xt=x3(m)+rwx(l)*rhog
                yt=y3(m)+rwy(l)*rhog
c
                if(xt.gt.lx)  xt=2.*lx-xt
                if(xt.lt.0.)  xt=-xt
                if(xt.eq.lx)  xt=0.9999*lx
                if(yt.ge.ly)  yt=yt-ly
                if(yt.lt.0.)  yt=yt+ly

   if (ngp.eq.1) then
                i=int(xt/dx+0.5)
                j=int(yt/dy+0.5)
                denTmp(m,i,j)=denTmp(m,i,j)+wght
c     
   else
                i=int(xt/dx)
                j=int(yt/dy)
c     
                wx0=float(i+1)-xt/dx
                wx1=1.-wx0
                wy0=float(j+1)-yt/dy
                wy1=1.-wy0
c
                denTmp(m,i,j)      =den(m,i,j)      +wght*wx0*wy0
                denTmp(m,i+1,j)    =den(m,i+1,j)    +wght*wx1*wy0
                denTmp(m,i,j+1)    =den(m,i,j+1)    +wght*wx0*wy1
                denTmp(m,i+1,j+1)  =den(m,i+1,j+1)  +wght*wx1*wy1
   endif
c
   enddo ! end loop over 4pt avg
c
  100         continue
c$acc end region
c
   
! perform the reduction in parallel
!$acc region
          do 166 j=0,jm
             do 155 i=0,im
                  den(i,j) = 0
                  do 144 m=0,mm
                    den(i,j) = den(i,j) + denTmp(m,i,j)
 144          continue
 155        continue
 166        continue
!$acc end region


!$acc end data region
! copy back den here

! you might be able to get the next loops to accelerate as well
              do 300 j=0,jm
                    den(0,j)=( den(0,j)+den(im,j) )
                    den(im,j)=den(0,j)
  300          continue
c
               do 320 i=0,im
                     den(i,0)=(den(i,0)+den(i,jm))
                     den(i,jm)=den(i,0)
  320               continue
c
                  do 410 i=0,im
                     do 420 j=0,jm
                           den(i,j)=q*den(i,j)/n0
  420                 continue
  410              continue

c
        return
        end
c
Back to top
View user's profile
brush



Joined: 26 Jun 2012
Posts: 44

PostPosted: Mon Oct 08, 2012 6:45 pm    Post subject: Reply with quote

Hi again, I know its been awhile but I got another question and its related.

Above we manually privatized den as you described (your solution works btw). What's the difference between what we did and using the do private directive? How come the do private directive doesn't work in this case?

Thanks,
Ben
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group