PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

Inconsistent results on GPU

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling
View previous topic :: View next topic  
Author Message
SPHriction-3D



Joined: 12 Jan 2014
Posts: 48

PostPosted: Wed Mar 19, 2014 5:58 am    Post subject: Inconsistent results on GPU Reply with quote

Hi,

I have been having trouble with a portion of code that I am trying to port to CUDA. The module is part of a smoothed particle hydrodynamics code for solid mechanics.

The subroutine is used to calculate an "artificial stress" that is used to stabilize particles that are in tension. The algorithm works very well on the CPU, but I just cant seem to get it to work properly on the GPU. On the CPU, the algorithm looks like this:

Code:
SUBROUTINE art_stress_simple(itimestep,ntotal,hsml,mass,niac,rho,pair_i,pair_j,w,dwdx,F_int,countiac,sig) 
 
      IMPLICIT NONE
      INCLUDE 'param90.inc'
     
      INTEGER:: ntotal, niac, pair_i(max_interaction),pair_j(max_interaction)
      INTEGER:: i, j, k, d, n, m, countiac(maxn), itimestep
      DOUBLE PRECISION:: hsml(maxn), mass(maxn), x(dim,maxn),vx(dim,maxn),              &
                         rho(maxn), w(max_interaction), dvxdt(dim,maxn),                &
                         dwdx(dim,max_interaction), w_deltap     
      DOUBLE PRECISION:: dx, epsilon_stress, r, dr, Ra(dim,dim,maxn), Rb(dim,dim,maxn), &
                         sig(dim,dim,maxn), rr, f_abi, F_abj
      DOUBLE PRECISION:: F_int(dim,maxn), F1_int, F2_int, F3_int, deltap, n_stress
   
      F_int    = 0.0   
      deltap   = 0.005
     
      epsilon_stress = 0.3
      n_stress = 6.0
           
      CALL simple_kernel(deltap,hsml(1),w_deltap)
           
!     Calculate SPH sum for artificial stress
     
      DO k=1,niac
        i = pair_i(k)
        j = pair_j(k)   
       
        f_abi = w(k)/w_deltap
        f_abi = f_abi**n_stress
               
        f_abj = w(k)/w_deltap
        f_abj = f_abj**n_stress
                       
!       For the ith particle in the pair       
       
        IF (sig(1,1,i) .gt. 0.0) THEN
            Ra(1,1,i) = -epsilon_stress*sig(1,1,i)
        ELSE
            Ra(1,1,i) = 0.0
        END IF       
        IF (sig(2,2,i) .gt. 0.0) THEN
            Ra(2,2,i) = -epsilon_stress*sig(2,2,i)
        ELSE
            Ra(2,2,i) = 0.0
        END IF       
        IF (sig(3,3,i) .gt. 0.0) THEN
            Ra(3,3,i) = -epsilon_stress*sig(3,3,i)
        ELSE
            Ra(3,3,i) = 0.0
        END IF
       
!       For the jth particle in the pair       
       
        IF (sig(1,1,j) .gt. 0.0) THEN
            Rb(1,1,j) = -epsilon_stress*sig(1,1,j)
        ELSE
            Rb(1,1,j) = 0.0
        END IF   
        IF (sig(2,2,j) .gt. 0.0) THEN
            Rb(2,2,j) = -epsilon_stress*sig(2,2,j)
        ELSE
            Rb(2,2,j) = 0.0
        END IF               
        IF (sig(3,3,j) .gt. 0.0) THEN
            Rb(3,3,j) = -epsilon_stress*sig(3,3,j)
        ELSE
            Rb(3,3,j) = 0.0
        END IF
       
        F1_int = (Ra(1,1,i)*f_abi/rho(i)**2.0 + Rb(1,1,j)*f_abj/rho(j)**2.0)*dwdx(1,k)
                 
        F2_int = (Ra(2,2,i)*f_abi/rho(i)**2.0 + Rb(2,2,j)*f_abj/rho(j)**2.0)*dwdx(2,k)   
                 
        F3_int = (Ra(3,3,i)*f_abi/rho(i)**2.0 + Rb(3,3,j)*f_abj/rho(j)**2.0)*dwdx(3,k) 
       
        F_int(1,i) = F_int(1,i) + F1_int*mass(j)
        F_int(1,j) = F_int(1,j) - F1_int*mass(i)
       
        F_int(2,i) = F_int(2,i) + F2_int*mass(j)
        F_int(2,j) = F_int(2,j) - F2_int*mass(i)
       
        F_int(3,i) = F_int(3,i) + F3_int*mass(j)
        F_int(3,j) = F_int(3,j) - F3_int*mass(i)
      END DO
       
END SUBROUTINE art_stress_simple

SUBROUTINE SimpleKernel(r,hsml,w)

   IMPLICIT NONE

   INCLUDE 'param90.inc'
     
   Real(fp_kind):: r, hsml, w, q, factor
     
   q = r/hsml
   w = 0.e0
      
   factor = 3.e0/(2.e0*pi*hsml*hsml*hsml) 
   If (q.ge.0.and.q.le.1.e0) then         
      w = factor * (2./3. - q*q + q**3 / 2.)           
   Else If (q.gt.1.e0.and.q.le.2) then         
      w = factor * 1.e0/6.e0 * (2.-q)**3                       
   Else
      w = 0.0                     
   End If        
      
END SUBROUTINE SimpleKernel


The first thing that I figured needs to be different on the GPU is that Ra and Rb should be evaluated in one subroutine and then they should be used in an other routine. The GPU version looks like:

Code:
Module Module_Artificial_Stress

Use cudafor
Use Module_Precision
Use Module_Neighbors

Contains

   Attributes(Global) Subroutine artificial_stress(sig,Ra1,Ra2,Ra3,nTotal)

   Implicit None
   
   Integer, Value:: nTotal
   Integer, Device:: i
   Real(fp_kind), Parameter:: AS_factor = 0.3
   Real(fp_kind), Device:: sig(nTotal,Dim,Dim)
   Real(fp_kind), Device:: Ra1(nTotal), Ra2(nTotal), Ra3(nTotal)

   i = (blockIdx%x-1)*blockDim%x + threadIdx%x

   If (i >= 1 .and. i <= nTotal) Then
      If (sig(i,1,1) > 0.0) Then
         Ra1(i) = -AS_factor*sig(i,1,1)
      End If
      If (sig(i,2,2) > 0.0) Then
         Ra2(i) = -As_factor*sig(i,2,2)
      End If
      If (sig(i,3,3) > 0.0) Then
         Ra3(i) = -AS_factor*sig(i,3,3)
      End If
   End If

   End Subroutine artificial_stress

   Attributes(Global) Subroutine artificial_stress_force(Ra1,Ra2,Ra3,w,dwdx,hsml,rho,mass,Neib,F_int,nTotal)

   Implicit None

   Include "param90.inc"

   Integer, Value:: nTotal
   Integer:: i, j, k, d
   Integer, Device, Intent(IN):: Neib(nTotal,nNeib_max)
   Real(fp_kind), Parameter:: n_stress = 6.0
   Real(fp_kind), Parameter:: delta_p = 0.005
   Real(fp_kind), Device:: fab, w_delta_p, R1, R2, R3
   Real(fp_kind), Device:: w(nTotal,nNeib_max), dwdx(nTotal,nNeib_max,Dim), hsml(nTotal), rho(nTotal), mass(nTotal)
   Real(fp_kind), Device:: Ra1(nTotal), Ra2(nTotal), Ra3(nTotal)
   Real(fp_kind), Device:: F_int(nTotal,Dim)

   i = (blockIdx%x-1)*blockDim%x + threadIdx%x

   If (i >= 1 .and. i <= nTotal) Then   
      
      Call SimpleKernel(delta_p,hsml(i),w_delta_p)
      Do k = 3, Neib(i,2) + 2
         j = Neib(i,k)
         fab = (w(i,k)/w_delta_p)**n_stress

         R1 = Ra1(i)*fab/rho(i)**2.0 + Ra1(j)*fab/rho(j)**2.0
         R2 = Ra2(i)*fab/rho(i)**2.0 + Ra2(j)*fab/rho(j)**2.0
         R3 = Ra3(i)*fab/rho(i)**2.0 + Ra3(j)*fab/rho(j)**2.0

         F_int(i,1) = F_int(i,1) + R1*mass(j)*dwdx(i,k,1)
         F_int(i,2) = F_int(i,2) + R2*mass(j)*dwdx(i,k,2)
         F_int(i,3) = F_int(i,3) + R3*mass(j)*dwdx(i,k,3)
      End Do
   End If
         
   End Subroutine artificial_stress_force

   Attributes(Device) SUBROUTINE SimpleKernel(r,hsml,w)

   IMPLICIT NONE

   INCLUDE 'param90.inc'
     
   Real(fp_kind):: r, hsml, w, q, factor
     
   q = r/hsml
   w = 0.e0
      
   factor = 3.e0/(2.e0*pi*hsml*hsml*hsml) 
   If (q.ge.0.and.q.le.1.e0) then         
      w = factor * (2./3. - q*q + q**3 / 2.)           
   Else If (q.gt.1.e0.and.q.le.2) then         
      w = factor * 1.e0/6.e0 * (2.-q)**3                       
   Else
      w = 0.0                     
   End If        
      
   END SUBROUTINE SimpleKernel

End Module Module_Artificial_Stress


The problem is that on the GPU the results just do not line up with the expected results (the CPU algorithm gives the correct results). The results on the GPU are not consistent, the artificial stress is different each time the program is run (makes me think I am doing something wrong with reading and writing the same value across multiple threads).

Maybe I have done something fundamentally wrong in my GPU implementation? Does anything jump out as being wrong to anyone?

This is driving me a LITTLE crazy, I have been over the code a bunch of times, have re-written the artificial stress module from scratch a couple of times... very frustrated!

The rest of the code (if I do not calculate the artificial stress) works as it should on the GPU and the results fit perfectly with the CPU results.

Thank you for any help,

Kirk
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

PostPosted: Wed Mar 19, 2014 1:28 pm    Post subject: Reply with quote

Hi Kirk,

I guess I don't quite understand the algorithm changes you made.

What happened to Rb in your GPU code? It seems to have disappeared and Ra is used in it's place. Same with the F_int(j,..) values.

Also, do you really want to compute all values of "i"? In the CPU version you're computing k=1,niac and then selective "i" values.


What I'd recommend is compile in emulation mode (-Mcuda=emu) and then run the program in the PGI debugger. Or you have Allinea DDT, you can debug on the device (with PGI 14.1 or later).

- Mat
Back to top
View user's profile
SPHriction-3D



Joined: 12 Jan 2014
Posts: 48

PostPosted: Thu Mar 20, 2014 8:19 am    Post subject: Reply with quote

Hi Mat,

Thank you for taking a look, looking back, I should have given more information about what I was doing differently on the GPU... I should have realized that I would need to go into some of the more intricate details of smoothed particle hydrodynamics to explain what I am trying to do.

I will have to try out the debugger in one of the latest compilers (I guess I should install 4.3 since this is the most up to date?)

As an aside, I will be at GTC next week, do you have some suggestions for hands on training sessions for us CUDA Fortran guys (I saw that you are given an intro class)?

Taker easy,

Kirk
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

PostPosted: Thu Mar 20, 2014 1:20 pm    Post subject: Reply with quote

Quote:
As an aside, I will be at GTC next week, do you have some suggestions for hands on training sessions for us CUDA Fortran guys (I saw that you are given an intro class)?
I am teaching session S4800 from 4:00-5:30 on Tuesday. Actually, I'm more presenting the material. Greg Ruetsch who wrote the book "CUDA Fortran for Scientists and Engineers" put the material together. You're welcome to come, though you might be beyond the intro level!

If you're interested in OpenACC, I'm assisting Michael Wolfe with his tutorial on Wednesday 5-6:30.

I'll be at the "ask the experts" hangout, #C, Tuesday 2-3 and Wednesday 12-1, working in the NVIDIA HPC booth Tuesday night 6-8, and the PGI Booth 7-8 on Wednesday night. I'm sure how much code review we could do, but we could try.

Oh, and I'm presenting the new SPEC ACCEL benchmark (www.spec.org/accel) at 10:30 on Wednesday (Session S4437) if your interested.

- Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group