PGI User Forum


Use of shared memory in device function
DAVID-SPH



Joined: 23 May 2011
Posts: 28

Posted: Wed Aug 15, 2012 12:29 pm

Well, this is the function giving all the trouble. I guess it is still the same issue. I'm programming Sean Baxter's scan and radix sort subroutines in CUDA Fortran. Is the problem still the shared memory declaration?



Code:

type integer2
  integer :: x
  integer :: y
end type

 attributes(device) type(integer2) function Multiscan(tid, x, reduction_shared, totals_shared)
   integer :: tid
   integer :: x
   integer :: warp, lane, i, sum, offset
   integer :: total, totalsSum
   type(integer2) :: result

   integer,volatile, dimension(:)   :: reduction_shared!(SCANSIZE)!(*)!(:)!(3256)!(ScanSize)
   integer,volatile, dimension(:)      :: totals_shared!((*)! (48)!(NUM_WARPS + NUM_WARPS/2)

   integer, volatile :: s, s2 !we have a problem here in the translation of
   
   warp = tid / WARP_SIZE ! check this one for fortran charac.
   lane = IAND((WARP_SIZE - 1), tid) + 1 !in fortran so we are starting in 1; in c: (WARP_SIZE - 1) & tid
   s = SCANSTRIDE * warp + lane + WARP_SIZE / 2 !index/pointer
   reduction_shared(s - 16) = 0 !The first 32 position will be filled with zeros
   reduction_shared(s) = x      !And now only the first 16 will...

   !! Run inclusive scan on each warp's data.
    sum = x
   !The CUDA Fortran compiler is supposed to unroll the loop for us...
   do i = 1, LOG_WARP_SIZE
      offset = ISHFT(1, i-1)!1 << (i - 1)
      sum = sum + reduction_shared(s-offset)
      reduction_shared(s) = sum !store the running sum so later iterations see it
   end do

   !! Synchronize to make all totals available to the reduction code
   call syncthreads()

   if(tid < NUM_WARPS)then
      !! Grab the block total for the tid'th block. This is the last element
      !! in the block's scanned sequence. This operation avoids bank
      !! conflicts.
      total = reduction_shared(ScanStride* tid + WARP_SIZE/2 + WARP_SIZE ) !- 1) !this -1 may be eliminated
      totals_shared(tid) = 0
      s2 = NUM_WARPS / 2 + tid
      totalsSum = total
      totals_shared(s2) = total

      !! Compiler should unroll this one
      do i = 1, LOG_NUM_WARPS
         offset = ISHFT(1, i-1)!1 << (i - 1)
         totalsSum = totalsSum + totals_shared(s2-offset)
         totals_shared(s2) = totalsSum
      end do

      !! Subtract total from totalsSum for an exclusive scan.
      totals_shared(tid) = totalsSum - total
   end if

   !! Synchronize to make the block scan available to all warps
   call syncthreads()
   sum = sum + totals_shared(warp)
   total = totals_shared(NUM_WARPS + NUM_WARPS / 2) !)- 1) !el - 1
   result%x = sum
   result%y = total
   !!!!!!!!!!!!!!!!!!!!! and return...
   Multiscan = result

 end function Multiscan
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Wed Aug 15, 2012 3:13 pm

Hi David,

I can't really tell much from this. Can you send a reproducing example to PGI Customer Service (trs@pgroup.com) and ask them to send it to me?

Also, which error are you getting with this code? The function 0 ICE or the Shared dummy as an argument?

Thanks,
Mat
DAVID-SPH



Joined: 23 May 2011
Posts: 28

Posted: Thu Aug 16, 2012 9:30 pm

It is the 0 ICE problem.
The shared memory dummy was solved with your tip.

I'll try to send the full code later today.

Thanks
DAVID-SPH



Joined: 23 May 2011
Posts: 28

Posted: Sun Aug 19, 2012 8:46 am

OK, the problem seems to be the use of the ISHFT bit intrinsic. Any reason for that?
According to the CUDA Fortran reference it is a perfectly valid call...
integer ishft(integer, integer)...

It is relatively easy to substitute, since I only use it to calculate powers of 2... but bit intrinsics are fast, so I would like to use them.
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Mon Aug 20, 2012 10:11 am

Yep, it's the ISHFT. CUDA Fortran does support ISHFT, but currently only if the "shift" argument is a constant. With a constant shift, ISHFT is inlined; when the shift is a variable, a call is emitted instead.
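
Until variable shift counts are supported, the shift can be avoided, since the scan loops only need the powers of two 1, 2, 4, and so on. The block below is a minimal sketch in plain Fortran that can be checked on the host. The module and function names are hypothetical and not from the posted code, and LOG_WARP_SIZE = 5 assumes a warp size of 32. Whether the integer power form compiles to inline device code with this compiler release is not confirmed here. The doubling form uses only a multiplication, so it is the more conservative choice.

```fortran
! Sketch only: hypothetical names, plain host Fortran.
module shift_workaround
   implicit none
   integer, parameter :: LOG_WARP_SIZE = 5   ! assumption: warp size of 32
contains

   ! Pattern from the posted code: ISHFT with a variable shift count.
   pure integer function offset_ishft(i) result(offset)
      integer, intent(in) :: i
      offset = ishft(1, i - 1)
   end function offset_ishft

   ! Alternative A: integer power, equal to ISHFT(1, i-1) for small i >= 1.
   pure integer function offset_pow(i) result(offset)
      integer, intent(in) :: i
      offset = 2**(i - 1)
   end function offset_pow

   ! Alternative B: carry the offset through the loop and double it,
   ! which is how it would replace the ISHFT line inside the scan loops.
   pure function offsets_by_doubling() result(offsets)
      integer :: offsets(LOG_WARP_SIZE)
      integer :: i, offset
      offset = 1
      do i = 1, LOG_WARP_SIZE
         offsets(i) = offset      ! value ISHFT(1, i-1) would have produced
         offset = offset * 2      ! next power of two
      end do
   end function offsets_by_doubling

end module shift_workaround
```

In the device function, alternative B would amount to setting offset = 1 before each do loop and doubling it at the end of the loop body in place of the ISHFT call.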

I asked engineering and they do have these on their TODO list but it was pushed to a lower priority (you're the first to ask for these). I added a report (TPR#18883) to help track this and the other missing elemental functions.

Thanks,
Mat