PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

Kernel failure: Invalid argument

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Debugging and Profiling
View previous topic :: View next topic  
Author Message
Mr.Savage



Joined: 14 Feb 2013
Posts: 6

PostPosted: Fri Jul 26, 2013 12:21 am    Post subject: Kernel failure: Invalid argument Reply with quote

Hi guys,

I got a code up and running today on CUDA Fortran, but I have one problem that I'm not able to figure out and need some suggestions for how to debug. Here is the problem section of the code:

Code:
DO j = 1,n_steps
   
CALL build_SC<<<grid_sz,block_sz>>>(Super_set_1_d, Super_set_2_d,       Super_set_3_d, Super_set_4_d, Phi2_d, PHI_d, Phi1_d, termsdim4_d, SC_d, NCRYS, SUP_IND)


CALL mmul<<<dim3(NCRYS/16,1,1),dim3(16,9,1)>>>( SC_d, FoMat_d(:,1:9), StressMat_d(:,1:9), NCRYS,  2*SUP_IND, 9 )

          .
          .
          .
     StressMat = StressMat_d
          .
          .
          .
          w_12,w_13,w_23 updated
          .
          .
          .
    CALL QMAT<<<blocks_texture,threads_texture>>>(Phi2_d, PHI_d, Phi1_d)

    CALL RMAT<<<blocks_texture,threads_texture>>>(w_12_d, w_13_d, w_23_d)         
    iostat = cudaDeviceSynchronize()

    CALL G_flag<<<blocks_texture,threads_texture>>>(Q_p_sam_d, Phi2_d, PHI_d, Phi1_d, j, n_steps)

  ENDDO


The main gist of the kernel Module is:

Code:
MODULE CUDA_Kernels
  USE Variables
  IMPLICIT NONE
  REAL(RP), DIMENSION(3,3,NCRYS), DEVICE :: QMATT, RMATT
CONTAINS
!=================================================================================
!=================================================================================
 
 attributes(global) SUBROUTINE build_SC(SS1, SS2, SS3, SS4, P1, P2, P3, P4, SC, n, m)

  END SUBROUTINE build_SC
!=================================================================================
!=================================================================================
  attributes(global) SUBROUTINE mmul( A, B, C, N, M, L )
 
  END SUBROUTINE mmul
!=================================================================================
!=================================================================================
    attributes(global) SUBROUTINE QMAT(Phi2, PHI, Phi1)
       IMPLICIT NONE
      
      REAL(RP), DIMENSION(NCRYS), DEVICE ::PHI, Phi1,Phi2
      INTEGER :: i
      INTEGER :: tx, bx
      REAL(RP), SHARED :: to_rad
      !Total shared memory used per block is 6*256*8 = 12288 out of 49152
      REAL(RP), DIMENSION(256), SHARED :: SPhi1, CPhi1, SPHI, CPHI, SPhi2, CPhi2
      tx = threadidx%x !will be 1-256
      bx = blockdim%x  !will be 256
      
      i = (blockidx%x-1) * bx + tx

      to_rad = 1.e-03_rp * (PI / 180._rp)
      
      IF (i<=NCRYS) THEN
   
      ENDIF
      RETURN
         
    END SUBROUTINE QMAT
 
  !=================================================================================
  !=================================================================================
    attributes(global) SUBROUTINE RMAT(w21_rec, w31_rec, w32_rec)
    IMPLICIT NONE
   
   REAL(RP), DIMENSION(NCRYS), DEVICE :: w21_rec, w31_rec, w32_rec
    INTEGER :: i
    INTEGER :: tx, bx
   !Total shared memory used is 7*256*8  = 14336 out of 49152
    REAL(RP), DIMENSION(256), SHARED :: ang, axis_1, axis_2, axis_3, w21_recs, w31_recs, w32_recs
   
    tx = threadidx%x !will be 1-256
    bx = blockdim%x  !will be 256
   
    i = (blockidx%x-1) * bx + tx
 
 
   IF (i<=NCRYS) THEN
      
    ENDIF
    RETURN   

  END SUBROUTINE RMAT
 
   
  !=================================================================================
  !=================================================================================
    attributes(global) SUBROUTINE G_flag(Q_p_sam,Phi2, PHI, Phi1, flag, n_steps)
    IMPLICIT NONE
   
   INTEGER, VALUE, INTENT(IN)  :: flag, n_steps
   REAL(RP), DIMENSION(9), DEVICE :: Q_p_sam
   REAL(RP), DIMENSION(NCRYS,9), DEVICE :: G
   REAL(RP), DIMENSION(NCRYS),   DEVICE :: PHI, Phi1,Phi2
    INTEGER :: i
    INTEGER :: tx, bx
   !Total shared memory used is 9*256*8 + 9*8 + 256*9*8 +256*3*8 + 9*8= 45128 out of 49152
    REAL(RP), SHARED :: sQ_p_sam(9), angles(256,3), GG(9), twoPI
    REAL(RP), DIMENSION(256), SHARED :: QMATs_11, QMATs_12, QMATs_13, QMATs_21, &
               QMATs_22, QMATs_23, QMATs_31, QMATs_32, QMATs_33
    REAL(RP), DIMENSION(256), SHARED :: Rs_11, Rs_12, Rs_13, Rs_21, Rs_22, Rs_23, &
               Rs_31, Rs_32, Rs_33
   REAL(RP) :: to_rad, to_deg
   tx = threadidx%x !will be 1-256
    bx = blockdim%x  !will be 256
 
    i = (blockidx%x-1) * bx + tx
   
   sQ_p_sam = Q_p_sam
   to_rad = 1.e-03_rp * (PI / 180._rp)

    to_deg = 1._rp / to_rad
   twoPI = 2._rp*PI

   
   IF (i<=NCRYS) THEN
   QMATs_11(tx) = QMATT(1,1,i)
      QMATs_12(tx) = QMATT(1,2,i)
      QMATs_13(tx) = QMATT(1,3,i)
      QMATs_21(tx) = QMATT(2,1,i)
      QMATs_22(tx) = QMATT(2,2,i)
      QMATs_23(tx) = QMATT(2,3,i)
      QMATs_31(tx) = QMATT(3,1,i)
      QMATs_32(tx) = QMATT(3,2,i)
      QMATs_33(tx) = QMATT(3,3,i)
      
      Rs_11(tx) = RMATT(1,1,i)
      Rs_12(tx) = RMATT(1,2,i)
      Rs_13(tx) = RMATT(1,3,i)
      Rs_21(tx) = RMATT(2,1,i)
      Rs_22(tx) = RMATT(2,2,i)
      Rs_23(tx) = RMATT(2,3,i)
      Rs_31(tx) = RMATT(3,1,i)
      Rs_32(tx) = RMATT(3,2,i)
      Rs_33(tx) = RMATT(3,3,i)
       
      G(i,1) = Rs_11(tx)*QMATs_11(tx) + Rs_21(tx)*QMATs_21(tx) + Rs_31(tx)*QMATs_31(tx)
      G(i,2) = Rs_12(tx)*QMATs_11(tx) + Rs_22(tx)*QMATs_21(tx) + Rs_32(tx)*QMATs_31(tx)
      G(i,3) = Rs_13(tx)*QMATs_11(tx) + Rs_23(tx)*QMATs_21(tx) + Rs_33(tx)*QMATs_31(tx)
      G(i,4) = Rs_11(tx)*QMATs_12(tx) + Rs_21(tx)*QMATs_22(tx) + Rs_31(tx)*QMATs_32(tx)
      G(i,5) = Rs_12(tx)*QMATs_12(tx) + Rs_22(tx)*QMATs_22(tx) + Rs_32(tx)*QMATs_32(tx)
      G(i,6) = Rs_13(tx)*QMATs_12(tx) + Rs_23(tx)*QMATs_22(tx) + Rs_33(tx)*QMATs_32(tx)
      G(i,7) = Rs_11(tx)*QMATs_13(tx) + Rs_21(tx)*QMATs_23(tx) + Rs_31(tx)*QMATs_33(tx)
      G(i,8) = Rs_12(tx)*QMATs_13(tx) + Rs_22(tx)*QMATs_23(tx) + Rs_32(tx)*QMATs_33(tx)
      G(i,9) = Rs_13(tx)*QMATs_13(tx) + Rs_23(tx)*QMATs_23(tx) + Rs_33(tx)*QMATs_33(tx)      
      
      IF (flag == n_steps) THEN
      
            GG(:) = G(i,:)
            G(i,1) = sQ_p_sam(1)*GG(1) + sQ_p_sam(4)*GG(2) + sQ_p_sam(7)*GG(3)
            G(i,2) = sQ_p_sam(2)*GG(1) + sQ_p_sam(5)*GG(2) + sQ_p_sam(8)*GG(3)
            G(i,3) = sQ_p_sam(3)*GG(1) + sQ_p_sam(6)*GG(2) + sQ_p_sam(9)*GG(3)
            G(i,4) = sQ_p_sam(1)*GG(4) + sQ_p_sam(4)*GG(5) + sQ_p_sam(7)*GG(6)
            G(i,5) = sQ_p_sam(2)*GG(4) + sQ_p_sam(5)*GG(5) + sQ_p_sam(8)*GG(6)
            G(i,6) = sQ_p_sam(3)*GG(4) + sQ_p_sam(6)*GG(5) + sQ_p_sam(9)*GG(6)
            G(i,7) = sQ_p_sam(1)*GG(7) + sQ_p_sam(4)*GG(8) + sQ_p_sam(7)*GG(9)
            G(i,8) = sQ_p_sam(2)*GG(7) + sQ_p_sam(5)*GG(8) + sQ_p_sam(8)*GG(9)
            G(i,9) = sQ_p_sam(3)*GG(7) + sQ_p_sam(6)*GG(8) + sQ_p_sam(9)*GG(9)

      ENDIF
      
      IF (abs(G(i,9)) == 1) THEN
         angles(tx,1) = ACOS(G(i,1))
         angles(tx,2) = ACOS(G(i,9))
         angles(tx,3) = 0._rp

         IF (G(i,2) < 0) THEN
            angles(tx,1) = twoPI - angles(tx,1)
         END IF
      ELSE
         angles(tx,1) = ATAN2(G(i,7),-1*G(i,8))
         angles(tx,2) = ACOS(G(i,9))
         angles(tx,3) = ATAN2(G(i,3),G(i,6))
      ENDIF

      IF (angles(tx,1) < 0) THEN
         angles(tx,1) = angles(tx,1)+ twoPI
      ENDIF
      
      IF (angles(tx,2) < 0) THEN
         angles(tx,2) = angles(tx,2)+ twoPI
      ENDIF
      
      IF (angles(tx,3) < 0) THEN
         angles(tx,3) = angles(tx,3)+ twoPI
      ENDIF
      
      angles(tx,:) = NINT( to_deg * angles(tx,:) )
         
      Phi1(i) = angles(tx,1)
      PHI(i)  = angles(tx,2)
      Phi2(i) = angles(tx,3)
      
    ENDIF   
    RETURN
  END SUBROUTINE G_flag

END MODULE CUDA_Kernels



Where the arrays with *_d are allocated on the device from the host and HosttoDevice transfers are done from pinned allocated arrays on host. The other kernel parameters are passed as integer values.

QMATT and RMATT are updated in kernels QMAT and RMAT, then used in G_flag to update the Phi values. After the Phi values are updated the process loops. Everything in the loop except StressMat stays on the device for the duration of the loop.

The kernels are all programmed to use arrays that are powers of 2. Using array size of 1024 and 2048 return the correct results, with no errors. However, when I increase the array size to 4096 I get,

copyout Memcpy (host=0x200000000, dev=0x1b20420000, size=294912) FAILED: 4(unspecified launch failure),

where the size 294912 matches the StressMat = StressMat_d transfer (sizeof(double)*9*4096). When I increase the array size to 8192 i get an invalid argument error directly after the call to G_flag. I tried running G_flag with no arguments and I still received the same error (there is no error before G_flag). With array size 8192 the program completes, only the kernel G_flag does not run. Increasing the array size by powers of 2 from 8192 has same result.


I'm on a Kepler K20.
I checked memory usage right before G_flag launch and I'm nowhere near the card's capacity.

These are the compiler tags from my makefile:

FLAGS = -V13.4 -g #fast -Mconcur=innermost -mp
FLAGS_CUDA =-Mcuda=cuda5.0,cc35 -tp:x64 #-Mcuda=cuda5.0,cc35,rdc -tp:x64
F90=pgf90[/code]

Thanks!
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

PostPosted: Fri Jul 26, 2013 10:37 am    Post subject: Reply with quote

Hi Mr. Savage,

I think this might be your problem:

Code:
    attributes(global) SUBROUTINE G_flag(Q_p_sam,Phi2, PHI, Phi1, flag, n_steps)
    IMPLICIT NONE
   
   INTEGER, VALUE, INTENT(IN)  :: flag, n_steps
   REAL(RP), DIMENSION(9), DEVICE :: Q_p_sam
   REAL(RP), DIMENSION(NCRYS,9), DEVICE :: G
   REAL(RP), DIMENSION(NCRYS),   DEVICE :: PHI, Phi1,Phi2

Notice that "G" isn't getting passed in? This makes it a local variable and every thread will create it's own G array. So depending upon how many threads are being created, you could be running out of memory or you need to add the flag "-mcmodel=medium".

Though given the rest of the code, I think you meant to pass G in? or is it a module variable and shouldn't be declared locally?

On a side note, you may want to add error checking after your kernels (at least during development). Without this check, there's no way of telling if your kernel succeed of not giving rise to odd failures later.

Code:
CALL QMAT<<<blocks_texture,threads_texture>>>(Phi2_d, PHI_d, Phi1_d)
ir = cudaGetLastError() if( ir )
print *, cudaGetErrorString( ir )
CALL RMAT<<<blocks_texture,threads_texture>>>(w_12_d, w_13_d, w_23_d)
ir = cudaGetLastError() if( ir )
print *, cudaGetErrorString( ir )
... etc ...


Hope this helps,
Mat
Back to top
View user's profile
Mr.Savage



Joined: 14 Feb 2013
Posts: 6

PostPosted: Fri Jul 26, 2013 5:14 pm    Post subject: Reply with quote

Thanks for the response Mat. You were right about G. I moved the declaration to the top of modules and the code works like a charm now, no errors. I could have also just declared G in local memory as an array of size 9, correct?

Code:
MODULE CUDA_Kernels
  USE Variables
  IMPLICIT NONE
  REAL(RP), DIMENSION(3,3,NCRYS), DEVICE :: QMATT, RMATT
  REAL(RP), DIMENSION(NCRYS,9), DEVICE :: G


I'd rather use shared memory for temp values, but its my understand that shared memory is designated per SM and I'm already using 43080 out of 49152 bytes of shared memory. This means there isn't enough shared memory available to create a G of 256*8*9. Is there good way around having to use global memory in this case? Originally kernels QMAT, RMAT, and G_flag were all the same FORTRAN subroutine. I split them up to optimize the use of shared memory. It would be nice to know if there are any other tricks to dealing with this kind of problem.

On a side note, I've been wondering how constant memory is most often used in kernels?


Code:
On a side note, you may want to add error checking after your kernels (at least during development).  Without this check, there's no way of telling if your kernel succeed of not giving rise to odd failures later.


I do have error checking in the code, but I took out about 90% of the DO loop when I posted the code.

Thank you again,
Dan
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

PostPosted: Mon Jul 29, 2013 10:23 am    Post subject: Reply with quote

Quote:
I could have also just declared G in local memory as an array of size 9, correct?
Yep, so long as NCRYS is a parameter.

Quote:
I'd rather use shared memory for temp values, but its my understand that shared memory is designated per SM and I'm already using 43080 out of 49152 bytes of shared memory.
The only reason to use shared memory is when you have data re-use across multiple threads in a block. For a kernel temp variable used by a single thread, these are better kept in registers which the back-end compiler will manage. Only if you use too many registers will the memory "spill" to global memory. The number of registers is fixed for a block, so reducing the number of threads per block will increase the number of registers per thread. Often it takes some experimentation to find the optimal balance.

In this case, is G's values used elsewhere? If not, then why not just make it a local array having 9 elements? Also, it looks like you're only using elements 1, 6, 7, 8, and 9. Why not reduce the size of the array or make them scalars? Granted, I don't know the whole code so ignore this advice if there are reasons for the extra elements.

Quote:
Originally kernels QMAT, RMAT, and G_flag were all the same FORTRAN subroutine. I split them up to optimize the use of shared memory. It would be nice to know if there are any other tricks to dealing with this kind of problem.
One thing to keep in mind is that, since Fermi (CC 2.0), NVIDIA added hardware caching making software managed cache (i.e. shared) not as critical. It can still help in cases where there is a high degree of data sharing among threads in a block, but I don't see this in your code.

Quote:
On a side note, I've been wondering how constant memory is most often used in kernels?
Think parameters or some other constants that all the threads need to look up.

- Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Debugging and Profiling All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group