PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

basic CUDA help

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
brush



Joined: 26 Jun 2012
Posts: 44

PostPosted: Mon Dec 03, 2012 9:06 pm    Post subject: basic CUDA help Reply with quote

Hi, I'm playing around with CUDA for the first time and I'm trying to replace the main loop in a subroutine with a cuda kernel. I'm not getting correct numbers so I'm wondering if someone could just check for glaring CUDA mistakes. I also know its probably written poorly but for now I'm just trying to get it working. Also, when I compile with -fast I sometimes get:

0: copyout Memcpy (host=0xe3e148, dev=0x200100000, size=4356) FAILED: 4(unspecified launch failure)

Code below. It looks a bit long but a lot of it is calculations you can probably ignore. Thanks for any help!

The host code:
Code:
         subroutine grid

c    source quantities are are calculated: n_i
c    right now only ion quantitities are calculated...
c
         use slabcuda
         use cudafor
         include  'slab.h'
c
         integer i,j

c arrays being used in loop declared on device
         real, device :: den_d(0:im,0:jm),w3_d(mm),mu_d(mm)
         real, device :: x3_d(mm),rwx_d(4),y3_d(mm),rwy_d(4)
c gridDim defines geometry (& #) of blocks in the grid
c blockDim defines geometry (& #) of threads in the block
c grid block is 1D since we have threads divided over just mm
         type(dim3) :: bGrid, tBlock
         tBlock = dim3(256,1,1) ! 256 threads per block
         bGrid = dim3(ceiling(real(mm)/tBlock%x),1,1)

c     here we set the rho and den equal to zero.
           do 50 i=0,im
              do 60 j=0,jm
                    den(i,j)=0.
 60           continue
 50        continue

c
            dv=(dx*dy)

c transfer arrays from host to device
           den_d = den
           w3_d = w3
           mu_d = mu
           x3_d = x3
           rwx_d = rwx
           y3_d = y3
           rwy_d = rwy
c launch kernel
           call gridloop<<<bGrid, tBlock>>>(den_d,w3_d,mu_d,
     &                                       x3_d,rwx_d,y3_d,rwy_d,
     &                                       mm,dv,lr,
     &                                       mims,
     &                                       lx,ly,dx,dy)

c transfer den from device back to host
           den = den_d
              do 300 j=0,jm
                    den(0,j)=( den(0,j)+den(im,j) )
                    den(im,j)=den(0,j)
  300          continue
c
               do 320 i=0,im
                     den(i,0)=(den(i,0)+den(i,jm))
                     den(i,jm)=den(i,0)
  320               continue
c
                  do 410 i=0,im
                     do 420 j=0,jm
                           den(i,j)=q*den(i,j)/n0
  420                 continue
  410              continue
c
        return
        end


The device code (module):
Code:

         module slabcuda
         contains
           attributes(global) subroutine gridloop(den,w3,mu,x3,rwx,y3,
     &                                                         rwy,
     &                                             mm,dv,lr,
     &                                             mims,
     &                                             lx,ly,dx,dy)

             implicit none
             real :: den(:,:),w3(:),mu(:),x3(:),rwx(:),y3(:),rwy(:)
             integer :: m,l,i,j,istat
             real, value :: wx0,wx1,wy0,wy1,wght,xt,yt,rhog,lx,ly,dx,dy
             real, value :: dv,mims
             integer, value :: mm,lr

             m = blockDim%x * (blockIdx%x - 1) + threadIdx%x
             if (m<=mm) then

                wght=w3(m)/dv/float(lr)
                rhog=sqrt(mu(m))/mims

               do l=1,lr
                 xt=x3(m)+rwx(l)*rhog
                 yt=y3(m)+rwy(l)*rhog
c
                 if(xt.lt.0.)  xt=-xt
                 if(xt.gt.lx)  xt=2.*lx-xt
                 if(xt.eq.lx)  xt=0.99999*lx
                 if(yt.ge.ly) yt=yt-ly
                 if(yt.le.0.)  yt=yt+ly
                 if(yt.eq.ly)  yt=0.99999*ly


                  i=int(xt/dx)
                  j=int(yt/dy)
c
                  wx0=float(i+1)-xt/dx
                  wx1=1.-wx0
                  wy0=float(j+1)-yt/dy
                  wy1=1.-wy0
c
                  istat = atomicadd(den(i,j), wght*wx0*wy0)
                  istat = atomicadd(den(i+1,j), wght*wx1*wy0)
                  istat = atomicadd(den(i,j+1), wght*wx0*wy1)
                  istat = atomicadd(den(i+1,j+1), wght*wx1*wy1)

               enddo
             endif
           end subroutine gridloop
         end module slabcuda
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6206
Location: The Portland Group Inc.

PostPosted: Tue Dec 04, 2012 10:21 am    Post subject: Reply with quote

Hi Brush,

The error you're seeing is a generic message meaning that your kernel fail for some reason. Though, I don't see anything obvious as to why. Two things to try:

1) Add error checking after your kernel call:
Code:

           call gridloop<<<bGrid, tBlock>>>(den_d,w3_d,mu_d,
     &                                       x3_d,rwx_d,y3_d,rwy_d,
     &                                       mm,dv,lr,
     &                                       mims,
     &                                       lx,ly,dx,dy)
           err = cudaGetLastError()
           if (err .ne. 0) then
                  write(*,*) 'Error:', cudaGetErrorString(err)
           endif


2) Compile with "-g -Mcuda=emu" and then use PGDBG to debug the code in emulation mode. Note everything works the same way in emulation, but it might find the error.

If neither helps, please post or send to PGI Customer Service (trs@pgroup.com) a reproducible example that I can use to investigate.

- Mat
Back to top
View user's profile
brush



Joined: 26 Jun 2012
Posts: 44

PostPosted: Tue Dec 04, 2012 1:55 pm    Post subject: Reply with quote

Another question: in the host code, after the kernel call, the host keeps running so it seems the following den calculations would be inaccurate since the kernel is still updating the den array.

So it seems like adding whatever = cudaThreadSynchronize() after the kernel call would fix this / change the output, but this doesn't happen, why?
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6206
Location: The Portland Group Inc.

PostPosted: Tue Dec 04, 2012 2:12 pm    Post subject: Reply with quote

Hi brush,

You are correct in that kernels are launched asynchronously to the host code. However, the host will block on the "den=den_d" statement and wait for the copy to complete before proceeding. Hence, adding cudaThreadSynchronize would make no difference.

- Mat
Back to top
View user's profile
brush



Joined: 26 Jun 2012
Posts: 44

PostPosted: Sat Dec 08, 2012 7:23 pm    Post subject: Reply with quote

Thanks Mat.

I made a stand-alone program, pasted below, to test atomicadd. Basically, each of 25 threads simultaneously adds 1 to each element of a 5x5 array, initialized to 0. But my program isn't working as expected.

I pasted the output below: The second list in the output is the final array, which should be filled with all 25's, but for some reason isn't. The first list is the device array, which is printed from a single random thread. I have a dummy loop to take up some time so the device array is filled to what should be all 25s before the list is printed. I thought both should be identical, but they're not. They're also both not filled with all 25s, like I thought they should be. Any idea what is going on?

Device code (module):
Code:

      module test_mod
      contains
        attributes(global) subroutine test_sub(den,a,mm)
          real :: den(:,:)
          integer :: i,j,m,istat
          integer,value :: mm
          real,value :: a
          m = blockDim%x * (blockIdx%x - 1) + threadIdx%x
          if (m<=mm) then
            do i=0,4
               do j=0,4
                 istat = atomicadd(den(i,j),a)
                   if(m==1) then
                     do istat=1,999999 !allow other threads to finish
                     enddo
                     print *,i,j,den(i,j)
                   endif
               enddo
            enddo
          endif
      end subroutine test_sub
      end module test_mod


Host code (program):
Code:

      PROGRAM test_prog

      use test_mod
      use cudafor

      integer mm,i,j
      real a
      real :: den(0:4,0:4)
      real, device :: den_d(0:4,0:4)
      type(dim3) :: bGrid, tBlock
      mm=25
      tBlock = dim3(256,1,1)
      bGrid = dim3(ceiling(real(mm)/tBlock%x),1,1)

      a=1.

      do i=0,4
         do j=0,4
            den(i,j)=0.
         enddo
      enddo

      den_d = den
      call test_sub<<<bGrid, tBlock>>>(den_d,a,mm)
      den = den_d

      do i=0,4
         do j=0,4
            print *, i,j,den(i,j)
         enddo
      enddo
      write(*,*) 'Max error: ', maxval(abs(den-25.0))

      END


Output:
i j den(i,j) <--- printed by a random thread on the device
Code:

 0 0 332.699280
 0 1 331.999420
 0 2 25.000000
 0 3 25.000000
 0 4 25.000000
 1 0 328.685303
 1 1 25.000000
 1 2 25.000000
 1 3 25.000000
 1 4 25.000000
 2 0 332.804474
 2 1 25.000000
 2 2 25.000000
 2 3 25.000000
 2 4 25.000000
 3 0 323.017395
 3 1 25.000000
 3 2 25.000000
 3 3 25.000000
 3 4 25.000000
 4 0 327.530579
 4 1 25.000000
 4 2 25.000000
 4 3 25.000000
 4 4 25.000000


i j den(i,j) <--- printed by the host, after the device array has been copied back
Code:

            0            0    25.00000   
            0            1    25.00000   
            0            2    25.00000   
            0            3    25.00000   
            0            4    0.000000   
            1            0    25.00000   
            1            1    25.00000   
            1            2    25.00000   
            1            3    25.00000   
            1            4    0.000000   
            2            0    25.00000   
            2            1    25.00000   
            2            2    25.00000   
            2            3    25.00000   
            2            4    0.000000   
            3            0    25.00000   
            3            1    25.00000   
            3            2    25.00000   
            3            3    25.00000   
            3            4    0.000000   
            4            0    25.00000   
            4            1    25.00000   
            4            2    25.00000   
            4            3    0.000000   
            4            4    0.000000   
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group