PGI User Forum

loop reordering with compiler directive
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Fri Sep 30, 2011 4:11 pm

Hi Xavier,

Quote:
Note that for the CPU I compiled only with -O3 optimization; I am assuming here that not all compilers will do the loop transformation from GPU to CPU optimal order.

Use -fast instead. -fast will perform a loop interchange and vectorize the code. On my system both A and B take 8 seconds with -fast, versus 10 and 13 seconds with -O3.

As for the GPU versions, in both cases the k loop can't be parallelized, so it is made sequential in the kernel. The difference is that in the A version the compiler uses a 256x2 shared cache, while in the B version it does no caching. If I use the flag "-ta=nvidia,nocache" on A and adjust the schedule to use vector(256), I get the same time as B. My assumption is that the overhead of caching outweighs the gains in this example.
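Why the k loop stays sequential while the i loop can be spread across threads can be seen in a small host-only sketch. It is an illustration rather than code from this thread: the program name, the reduced sizes n and nlev, and the arrays a_ki and a_ik are assumptions, while the update formula is the one used in the vertical computation of the test program. Level k of a column needs level k-1 of the same column, so the k steps must run in order, but no column i reads from another column, so the two loop nestings should agree up to rounding.

```fortran
program loop_order_demo
  ! Illustrative sketch (not from the original posts): n, nlev and the
  ! array names a_ki / a_ik are assumptions chosen for this example.
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 8, nlev = 5
  real(dp) :: a_ki(n,nlev), a_ik(n,nlev), b(n,nlev)
  integer :: i, k

  b = 0.1_dp
  a_ki = 0.0_dp
  a_ki(:,1) = 0.1_dp      ! first layer
  a_ik = a_ki             ! identical starting state for the second ordering

  ! Ordering 1: k outer (sequential), i inner (independent iterations)
  do k = 2, nlev
     do i = 1, n
        a_ki(i,k) = 0.95_dp*a_ki(i,k-1) + exp(-2*a_ki(i,k)**2)*b(i,k)
     end do
  end do

  ! Ordering 2: i outer (independent columns), k inner (sequential)
  do i = 1, n
     do k = 2, nlev
        a_ik(i,k) = 0.95_dp*a_ik(i,k-1) + exp(-2*a_ik(i,k)**2)*b(i,k)
     end do
  end do

  ! Compare with a tolerance: a vectorized exp may differ in the last bits.
  print *, 'max difference between orderings =', maxval(abs(a_ki - a_ik))
  if (maxval(abs(a_ki - a_ik)) < 1.0e-12_dp) then
     print *, 'both loop orderings agree'
  else
     print *, 'loop orderings differ'
  end if
end program loop_order_demo
```

In both orderings the k steps of a given column still execute in increasing order; only the traversal across columns changes, which is why interchanging the loops is legal here.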

I'll ask Michael to take a look once he's back in the office next week.

Best Regards,
Mat


Last edited by mkcolg on Mon Oct 03, 2011 11:52 am; edited 1 time in total
Michael Wolfe



Joined: 19 Jan 2010
Posts: 42

Posted: Mon Oct 03, 2011 11:41 am

As Mat points out, the compiler is trying to use the cache (shared memory) of the GPU, and in this case it's a bad decision. We'll work on that decision process. I created a version that runs as fast as b.f90 on the GPU and still retains the same host performance.

Code:

program main
  implicit none
  integer*4 :: N,nlev,i,k,itime,nt
  real*8, allocatable :: a(:,:), b(:,:)
  integer*4 :: dt1(8), dt2(8), t1, t2
  real*8 :: rt

  N=1E4
  nlev=60
  nt=1000

 allocate(a(N,nlev))
 allocate(b(N,nlev))


b=0.1

!$acc data region local(a,b)
!$acc update device(b)

call date_and_time( values=dt1 )
!time loop
do itime=1,nt
   !initialization
   !$acc region do seq
   do k=1,nlev
      !$acc do parallel vector(256)
      do i=1,N
         a(i,k)=0.0D0
      end do
   end do
   !$acc end region


   ! first layer
   !$acc region do kernel parallel vector(256)
   do i=1,N
      a(i,1)=0.1D0
   end do
   !$acc end region

   ! vertical computation
   !$acc region do seq
   do k=2,nlev
      !$acc do parallel vector(256)
      do i=1,N
         a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
   end do
   !$acc end region


end do
!$acc update host(a)

call date_and_time( values=dt2 )

!$acc end data region

 ! milliseconds since midnight: ms + 1000*(s + 60*(min + 60*h))
 t1 = dt1(8) + 1000*(dt1(7)+60*(dt1(6)+60*dt1(5)))
 t2 = dt2(8) + 1000*(dt2(7)+60*(dt2(6)+60*dt2(5)))
 rt = (t2 - t1)/1000.
write(*,"(A,I0,A,I0)") 'N=', N, ' , nt=',nt
write(*,"(A,F10.4)")  'time per step (us) =', rt/nt * 1E6
print*, 'sum(a)=',sum(a)
end program main

Note the 'parallel vector(256)' clauses on the 'i' loops, which tell the compiler to focus only on that loop and to run it in parallel in vector blocks of 256.
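The timing in the listing rests on turning the date_and_time values into milliseconds, where the hour field has to be weighted by 3600 seconds and the minute field by 60. A minimal sketch of that conversion follows; the function name ms_since_midnight and the sample time stamps are assumptions made for this illustration, and in a real run the arrays would be filled by call date_and_time(values=...).

```fortran
program elapsed_demo
  ! Illustrative sketch: the function name ms_since_midnight and the sample
  ! time stamps below are assumptions made for this example.
  implicit none
  integer :: dt1(8), dt2(8), t1, t2

  ! Hypothetical date_and_time values: year, month, day, UTC offset (min),
  ! hour, minute, second, millisecond.
  dt1 = [2011, 10, 3, -420, 11, 41, 0, 0]     ! 11:41:00.000
  dt2 = [2011, 10, 3, -420, 11, 41, 8, 250]   ! 11:41:08.250

  t1 = ms_since_midnight(dt1)
  t2 = ms_since_midnight(dt2)
  print '(A,I0,A)', 'elapsed = ', t2 - t1, ' ms'   ! 8250 ms for these stamps

contains

  pure function ms_since_midnight(v) result(ms)
    ! v(5)=hour, v(6)=minute, v(7)=second, v(8)=millisecond
    integer, intent(in) :: v(8)
    integer :: ms
    ms = v(8) + 1000*(v(7) + 60*(v(6) + 60*v(5)))
  end function ms_since_midnight

end program elapsed_demo
```

This approach gives a wrong difference if the run crosses midnight; the standard system_clock intrinsic is an alternative that avoids the wall-clock fields altogether.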
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Mon Oct 03, 2011 11:47 pm

Hi,

Thanks, this is really what I was looking for!

Yesterday I also tried adding an explicit do seq (for k) and parallel, vector(256) (for i), and likewise found the same performance for case A as for case B, provided that I compile with -ta=nvidia,nocache.

Xavier


Last edited by xlapillonne on Fri Oct 14, 2011 2:04 am; edited 1 time in total
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Fri Oct 14, 2011 2:03 am

Hi,

I have been further exploring the nocache option. Whether it is beneficial turns out to be kernel dependent. Is there any thought of adding a nocache( list ) clause (like the cache clause) to the loop mapping directive? I think it could be very useful.

Thanks,

Xavier
All times are GMT - 7 Hours