PGI User Forum

volatile in CUDA Fortran
DAVID-SPH



Joined: 23 May 2011
Posts: 28

Posted: Fri Aug 24, 2012 6:17 am    Post subject: volatile in CUDA Fortran

Is volatile working in CUDA Fortran?
I'm relying on implicit intra-warp synchronization for an algorithm, and I'm getting wrong results.
Code:
do i = 1, LOG_WARP_SIZE, 1
   offset = 2**(i-1)
   sum = sum + reduction_shared(s - offset)
   reduction_shared(s) = sum
end do


gives a different result (for threads in the same warp) than

Code:
do i = 1, LOG_WARP_SIZE, 1
   offset = 2**(i-1)
   sum = sum + reduction_shared(s - offset)
   reduction_shared(s) = sum
   call syncthreads()
end do

DAVID-SPH



Joined: 23 May 2011
Posts: 28

Posted: Fri Aug 24, 2012 8:18 am

May I ask how CUDA Fortran organizes the warps?
In CUDA C, warp 0 gets threads 0 to 31, warp 1 gets threads 32 to 63, and so on.
Could it be that CUDA Fortran assigns threads 1, 33, 65, etc. to warp 1, in column-major order or something like that?

I'm a bit puzzled by the lack of implicit intra-warp synchronization.

Best regards,
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

Posted: Fri Aug 24, 2012 9:57 am

Quote:
Is volatile working in CUDA Fortran?
As of 12.4, yes. However, the "volatile" keyword is simply passed through to the generated low-level CUDA C, so it's possible that the problem is with the back-end CUDA C compiler. You can see the generated CUDA code via the flag "-Mcuda=keepgpu".
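
For readers unsure where the attribute goes, here is a minimal sketch of a warp-level scan whose shared buffer is declared volatile. The module name `volatile_scan`, the kernel name, the buffer name `buf`, and all constant values are assumptions made for illustration; only `LOG_WARP_SIZE` and the loop shape come from the question. It assumes the compiler accepts `volatile` alongside `shared` (12.4 or later, per the answer above), a single block of one warp, and hardware on which the threads of a warp execute in lockstep. It is not a confirmed fix for the reported problem.

```fortran
module volatile_scan
   use cudafor
   implicit none
   ! Assumed values; the thread does not state them.
   integer, parameter :: WARP_SIZE     = 32
   integer, parameter :: LOG_WARP_SIZE = 5
   integer, parameter :: PAD           = WARP_SIZE / 2   ! zeroed slots below the data
contains
   ! Inclusive scan over one warp; intended launch: <<<1, WARP_SIZE>>>.
   attributes(global) subroutine warp_scan_volatile(input, result)
      integer, dimension(:) :: input, result
      ! volatile asks the compiler to re-read shared memory on every
      ! access instead of keeping earlier loads in registers.
      integer, shared, volatile :: buf(PAD + WARP_SIZE)
      integer :: s, i, offset, total

      s = threadIdx%x + PAD        ! this thread's data slot, above the padding
      buf(threadIdx%x) = 0         ! zero slots 1..WARP_SIZE, which covers the padding
      call syncthreads()
      total = input(threadIdx%x)
      buf(s) = total
      call syncthreads()

      ! No barrier inside the loop: this relies on warp lockstep (assumption).
      do i = 1, LOG_WARP_SIZE
         offset = 2**(i-1)
         total = total + buf(s - offset)
         buf(s) = total
      end do

      result(threadIdx%x) = total
   end subroutine warp_scan_volatile
end module volatile_scan
```

Without volatile, a compiler is free to hoist the loads of `buf(s - offset)` above the stores to `buf(s)`, since they never alias within one thread. Comparing the "-Mcuda=keepgpu" output with and without the attribute is one way to check whether that is happening.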

Quote:
May I ask how does CUDA Fortran organize the warps?
The "threadidx" and "blockidx" values are 1-based since that's Fortran, but there is no change in the way warps are organized. The first warp gets the first 32 threads, no matter whether the base index is 1 or 0. (Under the hood, the indexing gets adjusted when translated to CUDA C.)
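
The index arithmetic this implies can be sketched as follows. The module, kernel, and array names are hypothetical, and `WARP_SIZE = 32` is an assumption. The kernel only computes the mapping described above from the 1-based `threadIdx%x`; it does not query the hardware for the actual warp assignment.

```fortran
module warp_layout
   use cudafor
   implicit none
   integer, parameter :: WARP_SIZE = 32   ! assumed warp width
contains
   ! Each thread records the 0-based warp number and 1-based lane
   ! derived from its 1-based threadIdx%x.
   attributes(global) subroutine record_layout(warp_of, lane_of)
      integer, dimension(:) :: warp_of, lane_of
      integer :: tid

      tid = threadIdx%x
      warp_of(tid) = (tid - 1) / WARP_SIZE
      lane_of(tid) = mod(tid - 1, WARP_SIZE) + 1
   end subroutine record_layout
end module warp_layout

program show_layout
   use cudafor
   use warp_layout
   implicit none
   integer, parameter :: N = 2 * WARP_SIZE
   integer :: warp_h(N), lane_h(N)
   integer, device :: warp_d(N), lane_d(N)
   integer :: k
   integer, parameter :: probes(4) = (/ 1, WARP_SIZE, WARP_SIZE + 1, N /)

   ! One block of two warps.
   call record_layout<<<1, N>>>(warp_d, lane_d)
   warp_h = warp_d   ! device-to-host copies wait for the kernel to finish
   lane_h = lane_d

   ! Show the threads on either side of the warp boundary.
   do k = 1, size(probes)
      print *, 'thread', probes(k), ' warp', warp_h(probes(k)), ' lane', lane_h(probes(k))
   end do
end program show_layout
```

With this arithmetic, threads 1 through 32 map to warp 0 and threads 33 through 64 map to warp 1, which is the contiguous layout described above rather than a strided one.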

Hope this helps,
Mat
DAVID-SPH



Joined: 23 May 2011
Posts: 28

Posted: Sat Aug 25, 2012 7:22 am

Well, we've been running some tests and I can confirm that implicit intra-warp synchronization is not working properly.

We are trying to figure out how it is working. It is not necessarily a bad thing: it seems to provide better coalesced access without the need for a transposed read operation. Nevertheless, it should be documented.

The following kernels produce different results when they should not.


Code:

attributes(global) subroutine test1warp(input, result)
   integer, dimension(:)   :: input, result
   integer, shared         :: shared_values(SHAREDSTRIDE * NUMWARPS1)
   integer                 :: tid, warp, lane, index, sum

   tid = threadIdx%x
   warp = (tid-1)/WARP_SIZE                           ! 0-based warp number
   lane = mod((tid-1), WARP_SIZE) + 1                 ! 1-based lane within the warp
   index = warp * SHAREDSTRIDE + lane + WARP_SIZE/2   ! data slot, above the zero padding
   shared_values(index - 16) = 0
   call syncthreads() ! yep, this may be important
   sum = input(tid)
   shared_values(index) = sum
   ! syncthreads to make sure everything is ok
   call syncthreads()
   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   ! manually unrolled loop, no barrier between steps
   sum = sum + shared_values(index - 1)
   shared_values(index) = sum
   sum = sum + shared_values(index - 2)
   shared_values(index) = sum
   sum = sum + shared_values(index - 4)
   shared_values(index) = sum
   sum = sum + shared_values(index - 8)
   shared_values(index) = sum
   sum = sum + shared_values(index - 16)
   shared_values(index) = sum
   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   ! and then we write out
   call syncthreads()
   result(tid) = shared_values(index)

end subroutine test1warp


And the explicitly synchronized version:

Code:

attributes(global) subroutine test1warpSync(input, result)
   integer, dimension(:)   :: input, result
   integer, shared         :: shared_values(SHAREDSTRIDE * NUMWARPS1)
   integer                 :: tid, warp, lane, index, sum

   tid = threadIdx%x
   warp = (tid-1)/WARP_SIZE                           ! 0-based warp number
   lane = mod((tid-1), WARP_SIZE) + 1                 ! 1-based lane within the warp
   index = warp * SHAREDSTRIDE + lane + WARP_SIZE/2   ! data slot, above the zero padding
   shared_values(index - 16) = 0
   call syncthreads() ! yep, this may be important
   sum = input(tid)
   shared_values(index) = sum
   ! syncthreads to make sure everything is ok
   call syncthreads()
   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   ! manually unrolled loop, with a barrier after each step
   sum = sum + shared_values(index - 1)
   shared_values(index) = sum
   call syncthreads()
   sum = sum + shared_values(index - 2)
   shared_values(index) = sum
   call syncthreads()
   sum = sum + shared_values(index - 4)
   shared_values(index) = sum
   call syncthreads()
   sum = sum + shared_values(index - 8)
   shared_values(index) = sum
   call syncthreads()
   sum = sum + shared_values(index - 16)
   shared_values(index) = sum
   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
   ! and then we write out
   call syncthreads()
   result(tid) = shared_values(index)

end subroutine test1warpSync

You can launch the kernels with any input (an integer array). If you try with 32 threads (1 warp) and 1 block, you can easily see the difference.
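
A host-side check along those lines might look like the sketch below. Everything in it is an assumption made for illustration: the module and program names, the kernel `scan_two_barriers`, and the values chosen for `WARP_SIZE`, `SHAREDSTRIDE`, and `NUMWARPS1`, none of which are given in the post. The kernel is a third variant, not one of the two posted: it reads the neighbor, waits at a barrier, then writes and waits again, so it does not depend on lockstep execution at all and can serve as a reference when comparing the posted kernels against a host-computed prefix sum.

```fortran
module scan_check
   use cudafor
   implicit none
   ! Assumed values; the post does not give them.
   integer, parameter :: WARP_SIZE    = 32
   integer, parameter :: SHAREDSTRIDE = WARP_SIZE + WARP_SIZE/2   ! 16 padding + 32 data
   integer, parameter :: NUMWARPS1    = 1
contains
   ! Inclusive scan per warp with a barrier between every read and write.
   attributes(global) subroutine scan_two_barriers(input, result)
      integer, dimension(:) :: input, result
      integer, shared :: shared_values(SHAREDSTRIDE * NUMWARPS1)
      integer :: tid, warp, lane, idx, total, neighbor, i

      tid  = threadIdx%x
      warp = (tid-1)/WARP_SIZE
      lane = mod(tid-1, WARP_SIZE) + 1
      idx  = warp*SHAREDSTRIDE + lane + WARP_SIZE/2
      shared_values(idx - WARP_SIZE/2) = 0   ! zeroes the padding below the data
      call syncthreads()
      total = input(tid)
      shared_values(idx) = total
      call syncthreads()
      do i = 0, 4                            ! offsets 1, 2, 4, 8, 16
         neighbor = shared_values(idx - 2**i)
         call syncthreads()                  ! all reads done before any write
         total = total + neighbor
         shared_values(idx) = total
         call syncthreads()                  ! all writes done before the next read
      end do
      result(tid) = total
   end subroutine scan_two_barriers
end module scan_check

program run_scan_check
   use cudafor
   use scan_check
   implicit none
   integer :: h_in(WARP_SIZE), h_out(WARP_SIZE), expected(WARP_SIZE)
   integer, device :: d_in(WARP_SIZE), d_out(WARP_SIZE)
   integer :: i, istat

   ! Input 1, 2, ..., 32 and its inclusive prefix sum computed on the host.
   do i = 1, WARP_SIZE
      h_in(i) = i
   end do
   expected(1) = h_in(1)
   do i = 2, WARP_SIZE
      expected(i) = expected(i-1) + h_in(i)
   end do

   d_in = h_in
   call scan_two_barriers<<<1, WARP_SIZE>>>(d_in, d_out)   ! 1 block, 1 warp
   istat = cudaDeviceSynchronize()
   if (istat /= cudaSuccess) print *, 'kernel failed: ', trim(cudaGetErrorString(istat))
   h_out = d_out

   if (all(h_out == expected)) then
      print *, 'scan matches the host reference'
   else
      print *, 'MISMATCH: first differing thread is', minloc(merge(0, 1, h_out /= expected), 1)
   end if
end program run_scan_check
```

To test the posted kernels the same way, place them in the module and swap the kernel name in the launch line; the host reference and the comparison stay the same.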
DAVID-SPH



Joined: 23 May 2011
Posts: 28

Posted: Sat Aug 25, 2012 7:23 am

Vicente will try to send the code for the tests; he'll be running some more tests over the weekend.

Best regards,
All times are GMT - 7 Hours
Page 1 of 2