PGI User Forum

Finding minval and maxval using CUF kernels
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

Posted: Wed Mar 05, 2014 8:52 pm

Are all values in the CellIndex unique? If so, then you could just put an OpenACC "!$ACC kernels loop independent" directive above the DO I loop (from host code, not within device code).

Though, given you're counting the number of nodes in a cell, I'm guessing J is not unique. In that case, you probably could use atomic operations. They're a bit slow so you'll want to test performance versus other methods, but it may be ok.

OpenACC atomics are still being implemented, but you can use CUDA Fortran atomics in OpenACC in 14.1. Though, since you're already using CUDA Fortran, it may be easier to just remove the DO loop, create "nTotal" threads, assign a unique "i" to each thread, and then put an atomic around the update of "NodesInCell".

The updates to CellList probably don't need an atomic operation, given that the first update looks to assign values to unique CellList elements and the second would update the entry to the same value no matter the i.

One potential problem is the use of "NodesInCell" in the CellList index. At that point we have to assume another thread has updated it and its value has changed. Instead, you'll need to use a local variable here. Though, I've forgotten offhand how to atomically increment NodesInCell and return the value to a scalar. I'll look it up when I get into the office tomorrow.

- Mat
SPHriction-3D



Joined: 12 Jan 2014
Posts: 48

Posted: Thu Mar 06, 2014 7:11 am

Hi Mat,

Although this is probably not optimal, this works pretty well. I get ~8x speedup. I will have to try to compare it to a CUDA kernel with atomics.

Code:
Subroutine BuildCellListAcc(nTotal,CellIndex_d,CellList_d,maxIndex,NodesInCell_d)

   Implicit None

   Integer:: i, j, d
   Integer:: nTotal, maxIndex   
   Integer, Device, Intent(IN):: CellIndex_d(nTotal,2)
   Integer, Device, Intent(OUT):: CellList_d(maxIndex,27), NodesInCell_d(maxIndex)

   !$acc parallel
   CellList_d      = 0
   NodesInCell_d   = 0
   
   !$acc loop
   Do i = 1, nTotal
      j = CellIndex_d(i,2)            
      NodesInCell_d(j) = NodesInCell_d(j) + 1
      CellList_d(j,NodesInCell_d(j)+1) = CellIndex_d(i,1)      
      CellList_d(j,1) = j   
   End Do      
   !$acc end parallel   
         
   End Subroutine BuildCellListAcc


You are right, j is not unique; each cell holds between 9 and 27 nodes.

Kirk
SPHriction-3D



Joined: 12 Jan 2014
Posts: 48

Posted: Thu Mar 06, 2014 7:42 am

Mat,

Hmmmm, on second thought, the acc does not work after all. I was hoping that acc magically did all the hard stuff for me... Guess not though, hahaha.

I guess the easiest way will be to create a CUDA kernel and use an atomic to guard NodesInCell.

Let me know if you find the atomic operation to do this. Maybe I need to use atomicinc or atomicadd!?

I imagine that the best way to do this is to write a CUDA kernel, use shared memory for NodesInCell, and operate on blocks of data?!

Kirk
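
A likely reason the plain OpenACC loop goes wrong is that the read-modify-write of NodesInCell_d(j) is not atomic, so two iterations that hit the same cell can both read the same old count. The sketch below replays one such interleaving serially in standard Fortran. The node IDs 11 and 22 and the variables seenA and seenB are made up for illustration, and a real GPU may interleave the operations differently.

```fortran
program lost_update_demo
   implicit none
   integer :: NodesInCell(1), CellList(1,27)
   integer :: seenA, seenB   ! hypothetical per-thread copies of the counter

   NodesInCell = 0
   CellList    = 0

   ! Both "threads" read the counter before either one writes it back.
   seenA = NodesInCell(1)
   seenB = NodesInCell(1)

   ! Each writes back its own increment; the second write hides the first.
   NodesInCell(1) = seenA + 1
   NodesInCell(1) = seenB + 1

   ! Both then index CellList with the same counter value.
   CellList(1, NodesInCell(1)+1) = 11   ! node from thread A (made-up ID)
   CellList(1, NodesInCell(1)+1) = 22   ! node from thread B overwrites it

   print *, 'count  =', NodesInCell(1)
   print *, 'slot 2 =', CellList(1,2)
   print *, 'slot 3 =', CellList(1,3)
end program lost_update_demo
```

In this replay the count ends at 1 instead of 2 and node 11 is overwritten by node 22, which is the kind of result an atomic fetch-and-increment is meant to prevent.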
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

Posted: Thu Mar 06, 2014 12:38 pm

Hi Kirk,

Give this a try:

Code:
Subroutine BuildCellListAcc(nTotal,CellIndex_d,CellList_d,maxIndex,NodesInCell_d)
    Use cudadevice
    Implicit None
 
    Integer:: i, j, d
    Integer:: nTotal, maxIndex, idx   
    Integer, Device, Intent(IN):: CellIndex_d(nTotal,2)
    Integer, Device, Intent(OUT):: CellList_d(maxIndex,27), NodesInCell_d(maxIndex)
    CellList_d      = 0
    NodesInCell_d   = 0
     
    !$acc parallel loop
    Do i = 1, nTotal
       j = CellIndex_d(i,2)   
       idx = atomicinc(NodesInCell_d(j), 27)
       idx = idx+2  ! increment since atomicinc returns the old value
       CellList_d(j,idx) = CellIndex_d(i,1)       
       CellList_d(j,1) = j   
    End Do       
    !$acc end parallel loop
           
    End Subroutine BuildCellListAcc


Here, OpenACC is really just doing the bookkeeping, so you could also turn this into a CUDA Fortran kernel. atomicinc returns the "old" value of NodesInCell, which is why I add 2 rather than 1: one to get from the old count to the new count, and one more because column 1 of CellList holds the cell index.
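
To make the index arithmetic concrete, here is a serial stand-in in standard Fortran. It assumes the usual CUDA atomicInc semantics (return the old value, then store 0 if the old value is greater than or equal to the limit, otherwise old + 1). The function fetch_inc, the module name, and the five-node test data are hypothetical and only emulate the behaviour on the host; they are not the device intrinsic.

```fortran
module atomic_emulation
   implicit none
contains
   ! Serial emulation of atomicinc: returns the old value and stores
   ! 0 if old >= limit, otherwise old + 1.
   integer function fetch_inc(counter, limit) result(old)
      integer, intent(inout) :: counter
      integer, intent(in)    :: limit
      old = counter
      if (old >= limit) then
         counter = 0
      else
         counter = old + 1
      end if
   end function fetch_inc
end module atomic_emulation

program slot_demo
   use atomic_emulation
   implicit none
   integer, parameter :: nTotal = 5, maxIndex = 2
   integer :: CellIndex(nTotal,2), CellList(maxIndex,27), NodesInCell(maxIndex)
   integer :: i, j, idx

   CellIndex(:,1) = [101, 102, 103, 104, 105]   ! node IDs (made up)
   CellIndex(:,2) = [1, 2, 1, 1, 2]             ! cell of each node
   CellList    = 0
   NodesInCell = 0

   do i = 1, nTotal
      j   = CellIndex(i,2)
      ! old count + 1 gives the new count; + 1 more skips column 1
      idx = fetch_inc(NodesInCell(j), 27) + 2
      CellList(j,idx) = CellIndex(i,1)
      CellList(j,1)   = j
   end do

   do j = 1, maxIndex
      print *, 'cell', j, 'count', NodesInCell(j), &
               'nodes', CellList(j, 2:NodesInCell(j)+1)
   end do
end program slot_demo
```

Run serially, cell 1 ends with a count of 3 and nodes 101, 103, 104 in columns 2 to 4, and cell 2 with a count of 2 and nodes 102, 105. On the GPU the order of nodes within a cell is not deterministic. Note also that with 27 columns and column 1 reserved for the cell index there is room for 26 nodes per cell, so the sketch assumes the counts stay below that.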
SPHriction-3D



Joined: 12 Jan 2014
Posts: 48

Posted: Thu Mar 06, 2014 1:41 pm

Mat,

What version of PGF do I need for atomicinc? I am using 13.10. It doesn't seem to be incrementing NodesInCell as hoped.

Code:

Attributes(Global) Subroutine BuildCellListAtomic(nTotal,CellIndex,CellList,maxIndex,NodesInCell)

   Implicit None

   Integer:: i, j, idx
   Integer, Value:: nTotal, maxIndex   
   Integer, Device, Intent(IN):: CellIndex(nTotal,2)
   Integer, Device, Intent(OUT):: CellList(maxIndex,27), NodesInCell(maxIndex)

   CellList      = 0
   NodesInCell   = 0

   i = (blockIdx%x-1)*blockDim%x + threadIdx%x

   If (i >= 1 .and. i <= nTotal) Then
      j = CellIndex(i,2)   
      idx = atomicinc(NodesInCell(j),27)
      idx = idx + 2         
      CellList(j,idx) = CellIndex(i,1)      
      CellList(j,1) = j   
   End If 
         
End Subroutine BuildCellListAtomic


My kernel launch is something like this

Code:

Integer, Parameter:: BlockX = 128, BlockY = 1
tBlock = Dim3(BlockX,BlockY,1)
grid = Dim3(ceiling(Real(nTotal)/tBlock%x),1,1)   

Call BuildCellListAtomic<<<grid,tBlock>>>(nTotal,CellIndex_d,CellList_d,maxIndex,NodesInCell_d)


Kirk
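
One thing that may be worth checking in the kernel above: the whole-array assignments CellList = 0 and NodesInCell = 0 sit inside the kernel, so every thread executes them, and a thread that starts late can reset counts that other threads have already incremented. The following is a minimal CUDA Fortran sketch, not a confirmed fix for 13.10, that assumes the arrays are zeroed from the host before the launch. The names cell_kernels and BuildCellListSketch and the five-node test data are hypothetical. It also computes the grid size with integer arithmetic, which avoids the conversion to Real.

```fortran
module cell_kernels
   use cudafor
   implicit none
contains
   attributes(global) subroutine BuildCellListSketch(nTotal, CellIndex, CellList, maxIndex, NodesInCell)
      integer, value :: nTotal, maxIndex
      integer, device, intent(in)    :: CellIndex(nTotal,2)
      integer, device, intent(inout) :: CellList(maxIndex,27), NodesInCell(maxIndex)
      integer :: i, j, idx

      ! One thread per node; no array initialisation inside the kernel.
      i = (blockIdx%x-1)*blockDim%x + threadIdx%x
      if (i <= nTotal) then
         j   = CellIndex(i,2)
         idx = atomicinc(NodesInCell(j), 27) + 2   ! old count + 2, as in the thread
         CellList(j,idx) = CellIndex(i,1)
         CellList(j,1)   = j
      end if
   end subroutine BuildCellListSketch
end module cell_kernels

program launch_demo
   use cudafor
   use cell_kernels
   implicit none
   integer, parameter :: nTotal = 5, maxIndex = 2, BlockX = 128
   integer :: CellIndex(nTotal,2), NodesInCell(maxIndex), istat
   integer, device :: CellIndex_d(nTotal,2), CellList_d(maxIndex,27), NodesInCell_d(maxIndex)
   type(dim3) :: grid, tBlock

   CellIndex(:,1) = [101, 102, 103, 104, 105]   ! node IDs (made up)
   CellIndex(:,2) = [1, 2, 1, 1, 2]             ! cell of each node
   CellIndex_d = CellIndex

   ! Zero the device arrays once, from the host, before the launch.
   CellList_d    = 0
   NodesInCell_d = 0

   tBlock = dim3(BlockX, 1, 1)
   grid   = dim3((nTotal + BlockX - 1)/BlockX, 1, 1)   ! integer round-up

   call BuildCellListSketch<<<grid,tBlock>>>(nTotal, CellIndex_d, CellList_d, maxIndex, NodesInCell_d)
   istat = cudaDeviceSynchronize()

   NodesInCell = NodesInCell_d
   print *, 'nodes per cell:', NodesInCell
end program launch_demo
```

This needs a CUDA Fortran compiler and a GPU to run (for example as a .cuf file). With the test data above, the counts copied back should be 3 for cell 1 and 2 for cell 2 if the atomic behaves as described earlier in the thread.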
All times are GMT - 7 Hours