PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

loop reodering with compiler directive
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
xlapillonne



Joined: 16 Feb 2011
Posts: 69

PostPosted: Thu Jul 28, 2011 1:20 am    Post subject: loop reodering with compiler directive Reply with quote

Hi,

I am currently working on porting a large code with acc directives which consist of many part with typically the following structure:

Code:

Version A:
!$acc region
!init 
do k=1,nlev
  do i=1,N
    a(i,k)=0.0D0
  end do
end do
 ! first layer
do i=1,N
  a(i,1)=0.1D0
end do
! vertical computation
do k=2,nlev
  do i=1,N
    a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
  end do
end do
!$acc end region


of course to get a better performance one needs here to take out the i loop:
Code:

Version B
!$acc region
   do i=1,N
      !init
      do k=1,nlev   
         a(i,k)=0.0D0
      end do     
      ! first layer
      a(i,1)=0.1D0     
      ! vertical computation
      do k=2,nlev
         a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
      end do
     
   end do
!$acc end region

For most of the code, only doing this loop reordering is enough to get good performance (for the most intensive part we do have to introduce some additional optimization like replacing local matrices with scalars ...).

For the vast majority of the code where only this loop reordering is enough we would like to keep the same source code for the CPU and GPU version.

As said in one of the PGI insider, we could in principle have the code in version B, the compiler should be able to transform it in version A when compiling for a CPU. However we have seen that this is not always the case with all compiler and we would like to keep the source code in CPU optimize form (i.e. version A), as most users are targeting such architecture.

For the moment we are considering to have our own prepocess scripts to do this loop reordering. I was just wondering if you were considering to have such fonctionnality within the acc directives ? (Something like $acc swap (i,k) ...)

Thanks,

Xavier
Back to top
View user's profile
Michael Wolfe



Joined: 19 Jan 2010
Posts: 42

PostPosted: Fri Jul 29, 2011 11:15 am    Post subject: Reply with quote

Xavier: I've created a dummy version of a loop like yours, but I'm not sure what performance differences you're seeing for the two versions. Can you tell me representative sizes for 'n' and 'nlev'? I will do some experiments to see how to get the performance you want without modifying the loop structure.
-mw
Back to top
View user's profile
Michael Wolfe



Joined: 19 Jan 2010
Posts: 42

PostPosted: Fri Jul 29, 2011 11:31 am    Post subject: Reply with quote

To continue my last note, what I'm interested in is whether the performance impact comes from treating the whole region as a single kernel, or from the kernel schedules used when they are three separate kernels.

-mw
Back to top
View user's profile
xlapillonne



Joined: 16 Feb 2011
Posts: 69

PostPosted: Tue Aug 02, 2011 12:09 am    Post subject: Reply with quote

Hi Micheal,

Indeed, I forgot to give the size of the problem: N=10000 and nlev=60.

I also think that treating the whole region as a single kernel is probably what makes version B much faster. (I am seeing a difference of about a factor 2.7)

In particular there is a k dependence for the last loops (named vertical computation in the example). In version A I think that the kernel generated for the i loop:
Code:

  do i=1,N
    a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
  end do

is launched nlev-1 times, which is probably not very efficient.

-Xavier
Back to top
View user's profile
xlapillonne



Joined: 16 Feb 2011
Posts: 69

PostPosted: Fri Sep 30, 2011 12:16 am    Post subject: Reply with quote

Hi

I am trying to revive this post. I am actually still interested to know what is the best suggested approach given a code in form A, to have it running efficiently on both CPU and GPU.

To distinguish whether the difference came from having one region, here are two modified example:

Code:

Version A2

program main
  implicit none
  integer*4 :: N,nlev,i,k,itime,nt
  real*8, allocatable :: a(:,:), b(:,:)
  integer*4 :: dt1(8), dt2(8), t1, t2
  real*8 :: rt

  N=1E4
  nlev=60
  nt=1000

 allocate(a(N,nlev))
 allocate(b(N,nlev))


b=0.1

!$acc data region local(a,b)
!$acc update device(b)

call date_and_time( values=dt1 )
!time loop
do itime=1,nt
   !initialization
   !$acc region
   do k=1,nlev
      do i=1,N
         a(i,k)=0.0D0
      end do
   end do
   !$acc end region


   ! first layer
   !$acc region
   do i=1,N
      a(i,1)=0.1D0
   end do
   !$acc end region

   ! vertical computation
   !$acc region
   do k=2,nlev
      do i=1,N
         a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
   end do
   !$acc end region


end do
!$acc update host(a)

call date_and_time( values=dt2 )

!$acc end data region

 t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*(dt1(5)))
 t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*(dt2(5)))
 rt = (t2 - t1)/1000.
write(*,"(A,I,A,I)") 'N=', N, ' , nt=',nt
write(*,"(A,F10.4)") 'time per step (us) =', rt/nt * 1E6
print*, 'sum(a)=',sum(a)
end program main



and

Code:

Version B2

program main
  implicit none
  integer*4 :: N,nlev,i,k,itime,nt
  real*8, allocatable :: a(:,:), b(:,:)
  integer*4 :: dt1(8), dt2(8), t1, t2
  real*8 :: rt

  N=1E4
  nlev=60
  nt=1000

 allocate(a(N,nlev))
 allocate(b(N,nlev))


b=0.1

!$acc data region local(a,b)
!$acc update device(b)

call date_and_time( values=dt1 )
!time loop
do itime=1,nt
   !initialization
   !$acc region do kernel
   do i=1,N
      do k=1,nlev
         a(i,k)=0.0D0
      end do
   end do
   !$acc end region


   ! first layer
   !$acc region do kernel
   do i=1,N
      a(i,1)=0.1D0
   end do
   !$acc end region

   ! vertical computation
   !$acc region do kernel
   do i=1,N
      do k=2,nlev
         a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
   end do
   !$acc end region


end do
!$acc update host(a)

call date_and_time( values=dt2 )

!$acc end data region

 t1 = dt1(8) + 1000*(dt1(7)+60*dt1(6)+60*(dt1(5)))
 t2 = dt2(8) + 1000*(dt2(7)+60*dt2(6)+60*(dt2(5)))
 rt = (t2 - t1)/1000.
write(*,"(A,I,A,I)") 'N=', N, ' , nt=',nt
write(*,"(A,F10.4)")  'time per step (us) =', rt/nt * 1E6
print*, 'sum(a)=',sum(a)
end program main


Version B2 is about twice as fast as A2 on the GPU but on the CPU A2 is faster. Note that for the CPU I compiled only with -O3 optimization, I am assuming here that not all compiler will do the loop tranformation from GPU to CPU optimal order.

Thanks,

Xavier
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group