PGI User Forum


Do loop inside kernel

 
PGI User Forum Forum Index -> Programming and Compiling
rotteweiler

Posted: Fri Jul 16, 2010 1:01 pm    Post subject: Do loop inside kernel

Hi everyone,

I have a question about a loop I placed inside a kernel:

Code:

!Bits of code:

do i = 1, runs

PHIN(x + 1, y + 1 , z ) = AN(x + 1, y + 1, z ) * PHI(x + 1, y + 2, z)&
                                  +AS(x + 1, y + 1, z ) * PHI(x + 1, y    , z)&
                                  +AE(x + 1, y + 1, z ) * PHI(x + 2, y + 1, z)&
                                  +AW(x + 1, y + 1, z ) * PHI(x,     y + 1, z)&
                                  +AP(x + 1, y + 1, z ) * PHI(x + 1, y + 1, z)

end do

! This is how I copied the device data back to the cpu after the execution of the kernel:

 PHIN = dev_PHIN



I noticed that the device-to-host copy took longer as runs increased. For example, with runs = 2 it took about twice as long to return the data, as if PHIN were being copied to the CPU twice. I do not understand why that is the case, because I only asked the kernel to perform the calculation twice, not to return the result twice. If someone could explain why, I would really appreciate it.

Thankfully,

Chris


Last edited by rotteweiler on Tue Jul 20, 2010 7:37 am; edited 1 time in total
mkcolg (The Portland Group Inc.)

Posted: Fri Jul 16, 2010 1:18 pm

Hi Chris,

Is this a CUDA Fortran program? If so, then you have total control over when the data is copied back, so my guess is that something else is going on.

If this is the PGI Accelerator model where the "acc region" is inside the 'run' loop, then yes, the compiler will copy PHIN back on each iteration of the loop. In that case, you would want to add a data region around the run loop so that the compiler only copies PHIN once, at the end of the data region.
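For the PGI Accelerator case, the data-region idea can be sketched roughly like this (illustrative only; the array names are taken from the post, and the compute region is elided):

Code:

! Sketch: hoist the copies out of the run loop with a data region,
! so PHIN moves between host and device once instead of once per iteration.
!$acc data region copy(PHIN) copyin(AN,AS,AE,AW,AP,PHI)
do i = 1, runs
!$acc region
   ! ... compute PHIN from PHI and the coefficient arrays ...
!$acc end region
end do
!$acc end data region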

Can you post a reproducing example?

Thanks,
Mat
rotteweiler

Posted: Fri Jul 16, 2010 1:48 pm

Hi Mat,

Yes, this is a CUDA Fortran program; the calculation is a kernel running on the GPU. In its most simplified version:

Code:

   attributes(global) subroutine assign_kernel(AN,AS,AE,AW,AP,PHI,PHIN,N,M,L, runs)
      implicit none

      integer, value :: N, M, L, runs
      integer :: i, tx, bx, by
      real, device, dimension(N,M,L) :: AN, AS, AE, AW, AP, PHI, PHIN

      tx = threadidx%x
      bx = blockidx%x 
      by = blockidx%y 
     

       if ( tx  <  N - 1 .and. bx <  M - 1 .and. by <=  L) then
      do i = 1, runs
      PHIN(tx + 1, bx + 1 , by ) = AN(tx + 1, bx + 1, by ) * PHI(tx + 1, bx + 2, by)&
                                  +AS(tx + 1, bx + 1, by ) * PHI(tx + 1, bx    , by)&
                                  +AE(tx + 1, bx + 1, by ) * PHI(tx + 2, bx + 1, by)&
                                  +AW(tx + 1, bx + 1, by ) * PHI(tx,     bx + 1, by)&
                                  +AP(tx + 1, bx + 1, by ) * PHI(tx + 1, bx + 1, by)
       end do
       end if
   
   end subroutine assign_kernel

! Main Program

    allocate(dev_AN(N,M,L), dev_AS(N,M,L), dev_AE(N,M,L), dev_AW(N,M,L), dev_AP(N,M,L), dev_PHI(N,M,L), dev_PHIN(N,M,L))
 
 
   dev_AN = AN
   dev_AS = AS
   dev_AE = AE
   dev_AW = AW
   dev_AP = AP
   dev_PHI = PHI
   dev_PHIN = PHIN
 
    call system_clock( count = t2)

    call assign_kernel<<<dimGrid, dimBlock>>> (dev_AN,dev_AS,dev_AE,dev_AW,dev_AP,dev_PHI,dev_PHIN,N,M,L,runs)
   
    call system_clock( count = t3)
   
    PHIN = dev_PHIN 
 
    call system_clock( count = t4)



I called the kernel only once and copied the value only once. As for measuring the execution times, I simply called the system clock before and after the kernel as well as before and after PHIN = dev_PHIN. Thanks again for your help!

-Chris
mkcolg

Posted: Fri Jul 16, 2010 3:13 pm

Hi Chris,

Quote:
As for measuring the execution times, I simply called the system clock before and after the kernel as well as before and after PHIN = dev_PHIN.
This might be the problem. Kernel launches are asynchronous: the host code continues executing after you launch a kernel, and only blocks once it encounters the copy from the device. Hence, all of your GPU time (kernel execution plus data transfer) is being captured by the second interval, around PHIN = dev_PHIN.
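A minimal way to keep the system_clock approach honest, assuming the variable names from the post above, is to synchronize before each clock read (a sketch, untested here):

Code:

! Sketch: force the kernel to finish before reading the clock,
! so each interval measures only its own work.
call system_clock(count = t2)
call assign_kernel<<<dimGrid, dimBlock>>>(dev_AN, dev_AS, dev_AE, dev_AW, &
                                          dev_AP, dev_PHI, dev_PHIN, N, M, L, runs)
istat = cudaThreadSynchronize()   ! block until the kernel completes
call system_clock(count = t3)     ! t3 - t2: kernel time only
PHIN = dev_PHIN                   ! synchronous device-to-host copy
call system_clock(count = t4)     ! t4 - t3: transfer time only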

For CUDA performance measurement, it's best to use CUDA event timers together with calls to cudaThreadSynchronize. Here's an example from my article on programming a Monte Carlo simulation (http://www.pgroup.com/lit/articles/insider/v2n1a4.htm):
Code:

    ! timer variables
    real :: sum_start, sum_end, func_time, datat_time
    type(cudaEvent) :: func_start, func_end, datat_start, datat_end

    ! Initialize our data xfer timing routines
    istat = cudaEventCreate(datat_start)
    istat = cudaEventCreate(datat_end)
    istat = cudaEventRecord(datat_start, 0)

    ! Copy the random points from the host to device
    dX = X
    dY = Y

    ! get the data transfer time
    istat = cudaEventRecord(datat_end, 0)
    istat = cudaThreadSynchronize()
    istat = cudaEventElapsedTime(datat_time, datat_start, datat_end)
    results%time(5) = results%time(5) + (datat_time/1000)

    ! set our Grid and Block sizes
    dimBlock = dim3(256,1,1)
    dimGrid = dim3(N/dimBlock%x,1,1)

    ! Initialize our timing routines
    istat = cudaEventCreate(func_start)
    istat = cudaEventCreate(func_end)
    istat = cudaEventRecord(func_start, 0)

    ! call our device kernel using the grid/block dimensions
    ! and passing in device pointers to our random point and temp array.
    call  montecarlo_cuf1_kernel<<<dimGrid,dimBlock>>>(dX, dY, dTemp, N)

    ! get the function time
    istat = cudaEventRecord(func_end, 0)
    istat = cudaThreadSynchronize()
    istat = cudaEventElapsedTime(func_time, func_start, func_end)
    results%time(3) = results%time(3) + (func_time/1000)

    istat = cudaEventRecord(datat_start, 0)

    ! copy the result temp array back to the host
    temp = dTemp

    istat = cudaEventRecord(datat_end, 0)
    istat = cudaThreadSynchronize()
    istat = cudaEventElapsedTime(datat_time, datat_start, datat_end)
    results%time(5) = results%time(5) + (datat_time/1000)

... continues


Hope this helps,
Mat
rotteweiler

Posted: Mon Jul 19, 2010 7:56 am

Thank you Mat! I will start using the CUDA event timer.

-Chris