PGI User Forum


Performance CUDA fortran
goblinsqueen



Joined: 04 Feb 2010
Posts: 14

Posted: Thu Apr 01, 2010 3:03 am    Post subject: Performance CUDA fortran

Please consider the following simple code, which is a deliberately simplified ("stupid") version of the code I'm porting to CUDA Fortran.
Code:

module mod_test
 use cudafor
 implicit none
contains
 attributes(device) real*8 function calc2(i,j,k,l)
   implicit none
   integer, value :: i,j,k,l
   calc2 = (log(real(i))+log(real(j)))/exp(real(k))/exp(real(l))
 end function
 attributes(device) subroutine calc(counter,idx,Vettore)
   implicit none
   integer, value :: counter, idx
   real*8, dimension(counter),device :: Vettore
   real*8, dimension(32), shared :: vettore_shared
   real*8 :: var
   integer :: i,j,k,l, tid
   tid = threadidx%x
   do i=1,20
    do j=1,20
     do k=1,12
      do l=1,200   
         !var = calc2(i,j,k,l)  ! fast solution
         vettore_shared(tid) = calc2(i,j,k,l)  ! slow
         !Vettore(idx) = calc2(i,j,k,l)          ! slow
      end do
     enddo
    enddo
   enddo
 end subroutine
 attributes(global) subroutine kernel_test(counter,Vettore)
   implicit none
   integer, value :: counter
   real*8, dimension(counter),device :: Vettore
   integer :: idx
   idx = (blockidx%x-1)*blockdim%x + threadidx%x
   call calc(counter,idx,Vettore)
 end subroutine
end module
program test
 use cudafor
 use mod_test
 implicit none
 integer :: nblocks, nthreads, counter
 integer :: c1, c2
 real*8, dimension(:), allocatable :: Vettore_host
 real*8, dimension(:), allocatable, device :: Vettore_dev
 counter = 9216
 nthreads = 32
 nblocks = counter/nthreads
 call system_clock(count=c1)
 allocate(Vettore_host(counter),Vettore_dev(counter))
 Vettore_dev = 0.d0
 Vettore_host = 0.d0
 call kernel_test<<<nblocks,nthreads>>>(counter,Vettore_dev)
 Vettore_host = Vettore_dev
 deallocate(Vettore_host,Vettore_dev)
 call system_clock(count=c2)
 write(*,*) 'time ', c2-c1 
end program


I have a routine which computes an element of the array Vettore_dev. If I update the device array Vettore_dev directly in the device subroutine calc, the code takes ~4 seconds to execute. What surprises me is that if I use a shared array instead, i.e. vettore_shared, it still takes ~4 seconds to execute.
Why? I expected a much faster program when using a shared memory array.
What's wrong?

Thank you in advance for any comments!
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Thu Apr 01, 2010 4:43 pm

Hi goblinsqueen,

The compiler is smart enough to notice that the result in 'var' is never used, and hence it optimizes away the work. This is why that version is so much faster. Instead, at the bottom of the last do loop, add "Vettore(idx) = var". Note that the "var" version will still be faster than the other two since "var" can be held in a register.
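
A minimal sketch of one reading of that suggestion: keep the running value in the scalar "var" and write it to global memory once, after the loop nest. The names mod_register_sketch, kernel_var and test_var are hypothetical. Accumulating into "var" is also an assumption added here, not part of the original code; without it only the final iteration's value is live and the compiler could still drop the earlier iterations.

```fortran
module mod_register_sketch
 use cudafor
 implicit none
contains
 ! Same arithmetic as calc2 in the original post.
 attributes(device) real*8 function calc2(i,j,k,l)
   implicit none
   integer, value :: i,j,k,l
   calc2 = (log(real(i))+log(real(j)))/exp(real(k))/exp(real(l))
 end function
 ! Hypothetical kernel: work on a thread-local scalar, store to global once.
 attributes(global) subroutine kernel_var(counter,Vettore)
   implicit none
   integer, value :: counter
   real*8, dimension(counter), device :: Vettore
   real*8 :: var
   integer :: i,j,k,l,idx
   idx = (blockidx%x-1)*blockdim%x + threadidx%x
   var = 0.d0
   do i=1,20
    do j=1,20
     do k=1,12
      do l=1,200
         ! Assumption: accumulate so that every iteration feeds the result.
         var = var + calc2(i,j,k,l)
      end do
     end do
    end do
   end do
   ! Single global store; this use of var keeps the loops from being removed.
   if (idx <= counter) Vettore(idx) = var
 end subroutine
end module
program test_var
 use cudafor
 use mod_register_sketch
 implicit none
 integer, parameter :: counter = 9216, nthreads = 32
 real*8, dimension(:), allocatable :: Vettore_host
 real*8, dimension(:), allocatable, device :: Vettore_dev
 allocate(Vettore_host(counter),Vettore_dev(counter))
 Vettore_dev = 0.d0
 call kernel_var<<<counter/nthreads,nthreads>>>(counter,Vettore_dev)
 Vettore_host = Vettore_dev   ! the copy back also waits for the kernel
 write(*,*) 'Vettore(1) = ', Vettore_host(1)
 deallocate(Vettore_host,Vettore_dev)
end program
```

Whether "var" actually ends up in a register is the compiler's decision; the sketch only arranges the code so that it can.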

Now as to the differences between shared and global memory, I'm still in the process of learning this myself, so unfortunately I don't have any great insights. My experimentation shows little speed-up when using shared memory unless you're able to reuse the memory many times. Why? I'm not sure yet.
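
To illustrate what "reusing the memory many times" could look like, here is a hedged sketch rather than a measured result. Each thread loads one element from global memory into a shared tile, and then every thread in the block reads the whole tile, so each shared value is read many times per global load. The names mod_reuse, pairwise_in_block, test_reuse and TILE are hypothetical, and the kernel assumes it is launched with exactly TILE threads per block.

```fortran
module mod_reuse
 use cudafor
 implicit none
 integer, parameter :: TILE = 32   ! assumed block size, as nthreads in the post
contains
 ! Hypothetical kernel: out(idx) = sum over the block of |a(idx) - a(j)|.
 attributes(global) subroutine pairwise_in_block(n,a,out)
   implicit none
   integer, value :: n
   real*8, dimension(n), device :: a, out
   real*8, dimension(TILE), shared :: tile_s
   real*8 :: acc
   integer :: tid, base, idx, nvalid, j
   tid = threadidx%x
   base = (blockidx%x-1)*blockdim%x
   idx = base + tid
   nvalid = min(TILE, n-base)      ! elements this block really owns
   ! One global read per thread.
   if (tid <= nvalid) tile_s(tid) = a(idx)
   call syncthreads()              ! tile must be complete before reuse
   if (tid <= nvalid) then
     acc = 0.d0
     ! Each shared element is read by every thread in the block.
     do j=1,nvalid
       acc = acc + abs(tile_s(tid) - tile_s(j))
     end do
     out(idx) = acc
   end if
 end subroutine
end module
program test_reuse
 use cudafor
 use mod_reuse
 implicit none
 integer, parameter :: n = 9216
 real*8, dimension(:), allocatable :: a_host, out_host
 real*8, dimension(:), allocatable, device :: a_dev, out_dev
 integer :: i
 allocate(a_host(n),out_host(n),a_dev(n),out_dev(n))
 do i=1,n
   a_host(i) = dble(i)
 end do
 a_dev = a_host
 call pairwise_in_block<<<(n+TILE-1)/TILE,TILE>>>(n,a_dev,out_dev)
 out_host = out_dev
 write(*,*) 'out(1) = ', out_host(1)
 deallocate(a_host,out_host,a_dev,out_dev)
end program
```

By contrast, in the original calc routine each thread writes only its own slot of vettore_shared and never reads another thread's value, so there is no reuse of the kind sketched here.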

Any other users have any insights?

- Mat
sinsin



Joined: 11 Dec 2008
Posts: 11

Posted: Thu Apr 01, 2010 8:55 pm

Dear Mat,

How do you know that "var" can be held in a register?

Many Thanks!

Sin sin
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Fri Apr 02, 2010 8:44 am

Hi Sin sin,

I don't know for sure, just that it can. Though, given that it's a scalar with a very high degree of re-use and that I see a speed-up, it most likely is being placed in a register. Can I prove it? No. Most likely? Yes.

- Mat
sinsin



Joined: 11 Dec 2008
Posts: 11

Posted: Fri Apr 02, 2010 5:54 pm

Hi Mat,

Thanks a lot! I just wonder whether it is possible to control which variables are held in registers.

Sin sin
All times are GMT - 7 Hours