PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

Finding minval and maxval using CUF kernels
Goto page 1, 2, 3  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling
View previous topic :: View next topic  
Author Message
SPHriction-3D



Joined: 12 Jan 2014
Posts: 48

PostPosted: Tue Mar 04, 2014 11:03 am    Post subject: Finding minval and maxval using CUF kernels Reply with quote

Hi,

I am having trouble with this bit of code. The idea is to go through a vector and find the min and max values in the vector to determine the bounds of a simulation domain.

It works well on the CPU. When I try to do it with a CUF kernel, I get an incorrect answer. I am looking for an easy way to do this without getting into really complicated CUDA kernels...

I have set up a little test:

Code:
Program Bounds

Use cudafor

Implicit None

! Host Variables
Integer:: d, i, im
Integer, Parameter:: nTotal = 20000     ! Total number of nodes in the domain
Integer:: nCell(3)                      ! Number of cells in the domain
Real:: Bound(3,2)                       ! Bounds for the domain
Real:: x(nTotal,3)                      ! Position vector for nodes

! Device Variables
Integer, Device:: nCell_d(3)            ! Number of cells in the domain
Real, Device:: Bound_d(3,2)             ! Bounds for the domain
Real, Device:: x_d(nTotal,3)            ! Position vector for nodes

! Initialize the position vector for zeroes except for some seeded values
x = 0
x(100,1) = 1.2
x(200,2) = 1.6
x(300,3) = 1.4
x(400,1) = -1.2
x(500,2) = -1.6
x(600,3) = -1.4
nCell = 0

! Perform the algorithm on the CPU
Write(*,*) ""
Write(*,*) "CPU Implementation"
Bound(:,1) = -1.0E+10
Bound(:,2) = 1.0E+10   

Do d = 1, 3
   Bound(d,1) = maxval(x(:,d))
   Bound(d,2) = minval(x(:,d))       
   nCell(d) = ceiling(abs(Bound(d,1) - Bound(d,2))/(0.072))
End Do

Write(*,*) "xBoundMax = ",Bound(1,1)
Write(*,*) "xBoundMin = ",Bound(1,2)
Write(*,*) "yBoundMax = ",Bound(2,1)
Write(*,*) "yBoundMin = ",Bound(2,2)
Write(*,*) "zBoundMax = ",Bound(3,1)
Write(*,*) "zBoundMin = ",Bound(3,2)
Write(*,*) "nCellx    = ",nCell(1)
Write(*,*) "nCelly    = ",nCell(2)
Write(*,*) "nCellz    = ",nCell(3)

! Perform the algorithm on the GPU
Write(*,*) ""
Write(*,*) "GPU Implementation"
Bound(:,1) = -1.0E+10
Bound(:,2) = 1.0E+10   
Bound_d = Bound
nCell_d = nCell
x_d = x

!$CUF Kernel Do <<<*,128>>>
Do i = 1, nTotal
   Do d = 1, 3
      Bound_d(d,1) = max(Bound_d(d,1),x_d(i,d))
      Bound_d(d,2) = min(Bound_d(d,2),x_d(i,d))   
      nCell_d(d) = ceiling(abs(Bound_d(d,1) - Bound_d(d,2))/(0.072))
   End Do
End Do

Bound = Bound_d
nCell = nCell_d

Write(*,*) "xBoundMax = ",Bound(1,1)
Write(*,*) "xBoundMin = ",Bound(1,2)
Write(*,*) "yBoundMax = ",Bound(2,1)
Write(*,*) "yBoundMin = ",Bound(2,2)
Write(*,*) "zBoundMax = ",Bound(3,1)
Write(*,*) "zBoundMin = ",Bound(3,2)
Write(*,*) "nCellx    = ",nCell(1)
Write(*,*) "nCelly    = ",nCell(2)
Write(*,*) "nCellz    = ",nCell(3)

End Program Bounds


The above code is of course useless (normally the position vector would not be mainly zero), but the general idea is there.

Thank you,

Kirk
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6138
Location: The Portland Group Inc.

PostPosted: Tue Mar 04, 2014 12:53 pm    Post subject: Reply with quote

Hi Kirk,

If you use scalars, the compiler will generate parallel max and min reductions.

Code:
% cat test_3_4_14.cuf
Program Bounds

 Use cudafor

 Implicit None

 ! Host Variables
 Integer:: d, i, im
 Integer, Parameter:: nTotal = 20000     ! Total number of nodes in the domain
 Integer:: nCell(3)                      ! Number of cells in the domain
 Real:: Bound(3,2)                       ! Bounds for the domain
 Real:: x(nTotal,3)                      ! Position vector for nodes
 Real:: max1,max2,max3
 Real:: min1,min2,min3

 ! Device Variables
 Real, Device:: x_d(nTotal,3)            ! Position vector for nodes

 ! Initialize the position vector for zeroes except for some seeded values
 x = 0
 x(100,1) = 1.2
 x(200,2) = 1.6
 x(300,3) = 1.4
 x(400,1) = -1.2
 x(500,2) = -1.6
 x(600,3) = -1.4
 nCell = 0

 ! Perform the algorithm on the CPU
 Write(*,*) ""
 Write(*,*) "CPU Implementation"
 Bound(:,1) = -1.0E+10
 Bound(:,2) = 1.0E+10

 Do d = 1, 3
    Bound(d,1) = maxval(x(:,d))
    Bound(d,2) = minval(x(:,d))
    nCell(d) = ceiling(abs(Bound(d,1) - Bound(d,2))/(0.072))
 End Do

 Write(*,*) "xBoundMax = ",Bound(1,1)
 Write(*,*) "xBoundMin = ",Bound(1,2)
 Write(*,*) "yBoundMax = ",Bound(2,1)
 Write(*,*) "yBoundMin = ",Bound(2,2)
 Write(*,*) "zBoundMax = ",Bound(3,1)
 Write(*,*) "zBoundMin = ",Bound(3,2)
 Write(*,*) "nCellx    = ",nCell(1)
 Write(*,*) "nCelly    = ",nCell(2)
 Write(*,*) "nCellz    = ",nCell(3)

 ! Perform the algorithm on the GPU
 Write(*,*) ""
 Write(*,*) "GPU Implementation"
 Bound(:,1) = -1.0E+10
 Bound(:,2) = 1.0E+10
 x_d = x
 max1 = 0.0
 max2 = 0.0
 max3 = 0.0
 min1 = 999999.0
 min2 = 999999.0
 min3 = 999999.0

 !$CUF Kernel Do <<<*,128>>>
 Do i = 1, nTotal
    max1 = max(max1,x_d(i,1))
    min1 = min(min1,x_d(i,1))
    max2 = max(max2,x_d(i,2))
    min2 = min(min2,x_d(i,2))
    max3 = max(max3,x_d(i,3))
    min3 = min(min3,x_d(i,3))
 End Do

 Bound(1,1)=max1
 Bound(1,2)=min1
 Bound(2,1)=max2
 Bound(2,2)=min2
 Bound(3,1)=max3
 Bound(3,2)=min3
 Do d = 1, 3
    nCell(d) = ceiling(abs(Bound(d,1) - Bound(d,2))/(0.072))
 enddo

 Write(*,*) "xBoundMax = ",Bound(1,1)
 Write(*,*) "xBoundMin = ",Bound(1,2)
 Write(*,*) "yBoundMax = ",Bound(2,1)
 Write(*,*) "yBoundMin = ",Bound(2,2)
 Write(*,*) "zBoundMax = ",Bound(3,1)
 Write(*,*) "zBoundMin = ",Bound(3,2)
 Write(*,*) "nCellx    = ",nCell(1)
 Write(*,*) "nCelly    = ",nCell(2)
 Write(*,*) "nCellz    = ",nCell(3)

 End Program Bounds

% pgf90 test_3_4_14.cuf -Minfo; a.out
bounds:
     36, maxval reduction inlined
     37, minval reduction inlined
     65, CUDA kernel generated
         65, !$cuf kernel do <<< (*), (128) >>>
             Cached references to size [(x)x3] block of 'x_d'
         66, Max reduction generated for max1
         67, Min reduction generated for min1
         68, Max reduction generated for max2
         69, Min reduction generated for min2
         70, Max reduction generated for max3
         71, Min reduction generated for min3

 CPU Implementation
 xBoundMax =     1.200000
 xBoundMin =    -1.200000
 yBoundMax =     1.600000
 yBoundMin =    -1.600000
 zBoundMax =     1.400000
 zBoundMin =    -1.400000
 nCellx    =            34
 nCelly    =            45
 nCellz    =            39

 GPU Implementation
 xBoundMax =     1.200000
 xBoundMin =    -1.200000
 yBoundMax =     1.600000
 yBoundMin =    -1.600000
 zBoundMax =     1.400000
 zBoundMin =    -1.400000
 nCellx    =            34
 nCelly    =            45
 nCellz    =            39


- Mat
Back to top
View user's profile
SPHriction-3D



Joined: 12 Jan 2014
Posts: 48

PostPosted: Tue Mar 04, 2014 2:42 pm    Post subject: Reply with quote

Thank you Mat,

Is there an easy way to do what you did but keep the max, min and nCell as device variables. Otherwise I will need to copy back and forth every time I calculate the bounds (this is done every time step).

Kirk
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6138
Location: The Portland Group Inc.

PostPosted: Wed Mar 05, 2014 10:59 am    Post subject: Reply with quote

Hi Kirk,

How about something like the following. I added the device attribute to all the max/min variables, and created device versions of Bounds and NCells. Then use OpenACC "parallel" construct to create a scalar kernel to update the bounds and ncell on the device. You could also write a scalar CUDA Fortran kernel as well.

Code:
% cat test_3_5_14.cuf
Program Bounds

 Use cudafor

 Implicit None

 ! Host Variables
 Integer:: d, i, im
 Integer, Parameter:: nTotal = 20000     ! Total number of nodes in the domain
 Integer:: nCell(3)                      ! Number of cells in the domain
 Real:: Bound(3,2)                       ! Bounds for the domain
 Integer,device:: nCell_d(3)                      ! Number of cells in the domain
 Real,device:: Bound_d(3,2)                       ! Bounds for the domain
 Real:: x(nTotal,3)                      ! Position vector for nodes
 Real, device :: max1,max2,max3
 Real, device :: min1,min2,min3

 ! Device Variables
 Real, Device:: x_d(nTotal,3)            ! Position vector for nodes

 ! Initialize the position vector for zeroes except for some seeded values
 x = 0
 x(100,1) = 1.2
 x(200,2) = 1.6
 x(300,3) = 1.4
 x(400,1) = -1.2
 x(500,2) = -1.6
 x(600,3) = -1.4
 nCell = 0

 ! Perform the algorithm on the CPU
 Write(*,*) ""
 Write(*,*) "CPU Implementation"
 Bound(:,1) = -1.0E+10
 Bound(:,2) = 1.0E+10

 Do d = 1, 3
    Bound(d,1) = maxval(x(:,d))
    Bound(d,2) = minval(x(:,d))
    nCell(d) = ceiling(abs(Bound(d,1) - Bound(d,2))/(0.072))
 End Do

 Write(*,*) "xBoundMax = ",Bound(1,1)
 Write(*,*) "xBoundMin = ",Bound(1,2)
 Write(*,*) "yBoundMax = ",Bound(2,1)
 Write(*,*) "yBoundMin = ",Bound(2,2)
 Write(*,*) "zBoundMax = ",Bound(3,1)
 Write(*,*) "zBoundMin = ",Bound(3,2)
 Write(*,*) "nCellx    = ",nCell(1)
 Write(*,*) "nCelly    = ",nCell(2)
 Write(*,*) "nCellz    = ",nCell(3)

 ! Perform the algorithm on the GPU
 Write(*,*) ""
 Write(*,*) "GPU Implementation"
 Bound(:,1) = -1.0E+10
 Bound(:,2) = 1.0E+10
 x_d = x
 max1 = 0.0
 max2 = 0.0
 max3 = 0.0
 min1 = 999999.0
 min2 = 999999.0
 min3 = 999999.0

 !$CUF Kernel Do <<<*,128>>>
 Do i = 1, nTotal
    max1 = max(max1,x_d(i,1))
    min1 = min(min1,x_d(i,1))
    max2 = max(max2,x_d(i,2))
    min2 = min(min2,x_d(i,2))
    max3 = max(max3,x_d(i,3))
    min3 = min(min3,x_d(i,3))
 End Do

 !$acc parallel
 Bound_d(1,1)=max1
 Bound_d(1,2)=min1
 Bound_d(2,1)=max2
 Bound_d(2,2)=min2
 Bound_d(3,1)=max3
 Bound_d(3,2)=min3
 !$acc loop seq
 Do d = 1, 3
    nCell_d(d) = ceiling(abs(Bound_d(d,1) - Bound_d(d,2))/(0.072))
 enddo
 !$acc end parallel

 Bound=Bound_d
 nCell=nCell_d
 Write(*,*) "xBoundMax = ",Bound(1,1)
 Write(*,*) "xBoundMin = ",Bound(1,2)
 Write(*,*) "yBoundMax = ",Bound(2,1)
 Write(*,*) "yBoundMin = ",Bound(2,2)
 Write(*,*) "zBoundMax = ",Bound(3,1)
 Write(*,*) "zBoundMin = ",Bound(3,2)
 Write(*,*) "nCellx    = ",nCell(1)
 Write(*,*) "nCelly    = ",nCell(2)
 Write(*,*) "nCellz    = ",nCell(3)

 End Program Bounds

% pgf90 -V14.2 -acc -Mcuda -Minfo test_3_5_14.cuf ; a.out
bounds:
     22, Memory zero idiom, array assignment replaced by call to pgf90_mzero4
     38, maxval reduction inlined
     39, minval reduction inlined
     67, CUDA kernel generated
         67, !$cuf kernel do <<< (*), (128) >>>
         68, Max reduction generated for max1
         69, Min reduction generated for min1
         70, Max reduction generated for max2
         71, Min reduction generated for min2
         72, Max reduction generated for max3
         73, Min reduction generated for min3
     76, Accelerator kernel generated
     84, Loop is parallelizable

 CPU Implementation
 xBoundMax =     1.200000
 xBoundMin =    -1.200000
 yBoundMax =     1.600000
 yBoundMin =    -1.600000
 zBoundMax =     1.400000
 zBoundMin =    -1.400000
 nCellx    =            34
 nCelly    =            45
 nCellz    =            39

 GPU Implementation
 xBoundMax =     1.200000
 xBoundMin =    -1.200000
 yBoundMax =     1.600000
 yBoundMin =    -1.600000
 zBoundMax =     1.400000
 zBoundMin =    -1.400000
 nCellx    =            34
 nCelly    =            45
 nCellz    =            39
Back to top
View user's profile
SPHriction-3D



Joined: 12 Jan 2014
Posts: 48

PostPosted: Wed Mar 05, 2014 5:14 pm    Post subject: Reply with quote

Thanks Mat, I have never used acc, so I didn't think of that.

Maybe I could use acc for this bit of code too? CellIndex is a 2 column array that I want to pack into CellList (a 2D array giving the nodes in each cell). NodesInCell is a vector that stores the number of nodes in each cell.

I cant think of a straightforward way to parallelize this since NodesInCell would be accessed by multiple threads at once.

Code:
Attributes(Global) Subroutine BuildCellList(nTotal,CellIndex,CellList,maxIndex,NodesInCell)

   Implicit None

   Integer:: i, j, d
   Integer, Value:: nTotal, maxIndex   
   Integer, Device, Intent(IN):: CellIndex(nTotal,2)
   Integer, Device, Intent(OUT):: CellList(maxIndex,27), NodesInCell(maxIndex)

   CellList      = 0
   NodesInCell   = 0

   Do i = 1, nTotal
      j = CellIndex(i,2)            
      NodesInCell(j) = NodesInCell(j) + 1
      CellList(j,NodesInCell(j)+1) = CellIndex(i,1)      
      CellList(j,1) = j   
   End Do         
         
End Subroutine BuildCellList


As you can see, I have taken a quick stab at it (first attempt being a single thread kernel... not great...). I know that there are really good ways to parallelize this, but maybe there is an easy way for a beginner like me.

Thanks,

Kirk
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling All times are GMT - 7 Hours
Goto page 1, 2, 3  Next
Page 1 of 3

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group