PGI User Forum
need help with simple cuda test

 
vibrantcascade



Joined: 04 Aug 2011
Posts: 28

Posted: Thu Aug 04, 2011 1:35 pm    Post subject: need help with simple cuda test

The following code is a test to see how easy it will be to adapt some serial scientific computing code to CUDA, and I'm having trouble getting it going. Could someone tell me if I'm doing this completely wrong, or if I can't use device functions this way? (The math is irrelevant; I'm just trying to get something going that eats up time so I can see the improvement from using CUDA.)

I'm also thinking I need to rework:
a = a + testFunc2(a, i)
as I'm not quite sure how that would execute and add up on the machine. In my non-cuda version of this I simply call testFunc1 and have a do loop from 1 to 10000 in that.


And on the line:
b = testFunc1<<<100>>>(a)
there is actually a one hundred followed by a comma and then another one hundred inside the chevrons. For some reason the forum won't let me enter a line like that.
Code:

      program test
      use cudafor
      double precision a, b,testFunc1
      real t0, t1, tdiff   

      call cpu_time(t0)
      a = 1.0000000d0
      b = testFunc1<<<100>>>(a)
      call cpu_time(t1)

      tdiff = t1 - t0

      write(6,*) a, tdiff
   
      call exit   
      end

      module cudaFuncs
      contains
        double precision attributes(device) function testFunc1(a)
        double precision a, testFunc2
        integer i
          i = (blockIdx%x-1)*blockDim%x + threadIdx%x
          a = a + testFunc2(a, i)
          call syncthreads()
          testFunc1 = a
          return
        end

        double precision attributes(device) function testFunc2(a, i)
        double precision a, testFunc3
        integer i, j
          a = a * i
          a = a / 3.d0
          do j = 1, 10000
            a = testFunc3(a)
          enddo
          testFunc2 = a
          return
        end

        double precision attributes(device) function testFunc3(a)
        double precision a
        integer i
 
          do i = 1, 10000
            a = a * a
            a = sqrt(a)
            a = a * 3.d0
            a = a / 5.d0
          enddo
          testFunc3 = a
          return
        end
      end module cudaFuncs
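
On the "a = a + testFunc2(a, i)" question raised in the post: testFunc2 assigns to its dummy argument a, and Fortran passes arguments by reference by default, so the call changes the caller's a while the same statement is also reading it. As far as I understand the Fortran standard, a function reference must not modify a variable that appears elsewhere in the same statement, so the result is not something to rely on. A minimal host-only sketch of a side-effect-free alternative follows. It is standard Fortran with no CUDA, and scaleStep is a hypothetical stand-in for testFunc2, not code from the post.

```fortran
! Host-only sketch (standard Fortran, no CUDA).
! scaleStep is a hypothetical stand-in for testFunc2.
module sideEffectFree
  implicit none
contains
  ! intent(in) plus a separate result variable: the caller's
  ! argument is never modified by the function.
  pure function scaleStep(a, i) result(r)
    double precision, intent(in) :: a
    integer, intent(in)          :: i
    double precision             :: r
    r = a * i
    r = r / 3.d0
  end function scaleStep
end module sideEffectFree

program demo
  use sideEffectFree
  implicit none
  double precision :: a
  a = 1.d0
  ! a is only read by the function, so the sum is well defined
  a = a + scaleStep(a, 3)
  write(*,*) a   ! 1 + (1*3)/3 = 2
end program demo
```

The same pattern (intent(in) arguments and a local result) should carry over to device functions, though that part is an assumption and is not tested here.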
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Thu Aug 04, 2011 4:24 pm

Hi vibrantcascade,

Quote:
b = testFunc1<<<100>>>(a)
There is actually a one hundred followed by a comma and then another one hundred inside the chevrons. For some reason the forum won't let me enter a line like that.
Correct. Only global kernel functions are callable from the host. Device functions can only be called from other device routines located in the same module.

Besides this, there are a number of other problems. Foremost, this isn't a parallel program. You could run this serially on a GPU, but your performance would be very poor. A single GPU core is pretty wimpy when compared to a CPU core. Speed-up is obtained by using lots of GPU cores. So if you need to calculate millions of a's, then you'd see improvement, but not for just one.

Below is a modified version of your code which fixes many of the issues. Instead of going through each one, I'd like you to first read some of our PGInsider articles to get a better understanding of how CUDA Fortran works. Also, try looking over the CUDA Fortran SDK for code examples. If you still have questions, please feel free to ask.

All of the PGInsider articles can be found at: http://www.pgroup.com/resources/articles.htm#pginsider
Also, the CUDA Fortran Programming Guide is a good thing to have on hand: http://www.pgroup.com/doc/pgicudafortug.pdf

Hope this helps,
Mat

Code:
% cat testFuncs.cuf
      module cudaFuncs
      contains
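        ! Kernel: callable from the host; each thread updates one element of a
        attributes(global) subroutine testFunc1(a,nsize)
        double precision, dimension(:) :: a
        integer, value                 :: nsize
        integer i
          ! 1-based global thread index
          i = (blockIdx%x-1)*blockDim%x + threadIdx%x
          ! guard: threads in the last block may fall past the end of a
          if (i .le. nsize) then
            a(i) = a(i) + testFunc2(a(i), i)
          end if
        end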

        double precision attributes(device) function testFunc2(a, i)
        double precision a
        integer i, j
          a = a * i
          a = a / 3.d0
          do j = 1, 10000
            a = testFunc3(a)
          enddo
          testFunc2 = a
          return
        end

        double precision attributes(device) function testFunc3(a)
        double precision a
        integer i
          do i = 1, 10000
            a = a * a
            a = sqrt(a)
            a = a * 3.d0
            a = a / 5.d0
          enddo
          testFunc3 = a
          return
        end
      end module cudaFuncs

      program test
      use cudafor
      use cudaFuncs
      integer, parameter :: NSIZE=1024
      integer, parameter :: BSIZE=256
      double precision b
      double precision, dimension(NSIZE)         :: a
      double precision, dimension(NSIZE), device :: aDev
      type(dim3) :: grid, block
      real t0, t1, tdiff   

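      block = dim3(BSIZE,1,1)
      ! round up so that the blocks cover all NSIZE elements
      grid = dim3((NSIZE+BSIZE-1)/BSIZE,1,1)
      call cpu_time(t0)
      aDev = 1.0000000d0
      ! launch configuration is <<<blocks per grid, threads per block>>>
      call testFunc1<<<grid,block>>>(aDev,NSIZE)
      ! copy the result back to the host (this also waits for the kernel)
      a=aDev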
      call cpu_time(t1)

      tdiff = t1 - t0

      write(6,*) a(1), a(NSIZE), tdiff
   
      call exit   
      end
vibrantcascade




Posted: Fri Aug 05, 2011 7:56 am

Thanks for the help! I'm new to Fortran and was just trying to figure out how to use CUDA functions, as I couldn't find a good simple example like this in the programming guide. (I'm mainly a C++/C# programmer.) I have the non-CUDA version working fine and figured I might just be able to declare them all device functions and create a thousand threads in the call. This makes plenty of sense now, though.

(A nice simple example like this, with a few function calls, would be good to add to chapter 5 of the CUDA programming guide. The matrix multiplication just doesn't cover enough.)

Thanks Mat!
vibrantcascade




Posted: Fri Aug 05, 2011 8:45 am

OK, one quick question on this. I notice you only specified the <<<grid>>> and not the block size in the call. Will it then automatically create as many threads per block as your card can handle, or will it implicitly recognize the block size of 256 even though it wasn't passed in the call?

I believe the Fermi Tesla I'm using supports block sizes of 1024, so technically it could handle all 1024 values in 1 block, even though this code is making 4 blocks from what I can tell.


Update: never mind, it appears the forum deletes anything after the comma in a triple <<<>>> block like that, just like what I ran into.
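
The block-count arithmetic in the question (4 blocks of 256 versus 1 block of 1024) can be checked without a GPU. The sketch below is host-only standard Fortran that emulates the kernel's index formula and its "i .le. nsize" guard; countCoverage is a hypothetical helper introduced here for illustration, and the 1000-element case is an added example, not something from the thread.

```fortran
! Host-only sketch (standard Fortran, no CUDA): emulates the
! kernel's index arithmetic. countCoverage is a hypothetical helper.
program gridDemo
  implicit none
  integer, parameter :: NSIZE = 1024
  call countCoverage(NSIZE, 256)    ! the configuration in the posted code
  call countCoverage(NSIZE, 1024)   ! a single 1024-thread block
  call countCoverage(1000, 256)     ! size that is not a multiple of bsize
contains
  subroutine countCoverage(n, bsize)
    integer, intent(in) :: n, bsize
    integer :: nblocks, blk, thr, i, idle
    integer :: hits(n)
    nblocks = (n + bsize - 1) / bsize   ! integer ceiling of n/bsize
    hits = 0
    idle = 0
    do blk = 1, nblocks          ! plays the role of blockIdx%x
      do thr = 1, bsize          ! plays the role of threadIdx%x
        i = (blk-1)*bsize + thr  ! same formula as in the kernel
        if (i .le. n) then
          hits(i) = hits(i) + 1
        else
          idle = idle + 1        ! thread masked off by the guard
        end if
      end do
    end do
    write(*,*) 'n =', n, ' bsize =', bsize, ' blocks =', nblocks, &
               ' idle threads =', idle, ' each element once =', all(hits == 1)
  end subroutine countCoverage
end program gridDemo
```

For the three calls this gives 4 blocks, 1 block, and 4 blocks with 24 idle threads respectively, with every element visited exactly once each time. The last case is why the kernel needs the bounds check.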
mkcolg




Posted: Fri Aug 05, 2011 9:57 am

Quote:
OK, one quick question on this. I notice you only specified the <<<grid>>> and not the block size in the call.
Artifact of the forum. It should be "grid,block" but for some reason when code gets posted the "block" gets removed. I'll try to dive into phpBB's scripts to see if I can correct it.

Quote:
I believe the Fermi Tesla I'm using supports block sizes of 1024, so technically it could handle all 1024 values in 1 block, even though this code is making 4 blocks from what I can tell.
Correct, a Fermi can use up to 1024 threads per block. Note that we have a utility, 'pgaccelinfo', which you can use to query your device's properties.

NSIZE and BSIZE's values are arbitrary. Feel free to make them bigger.

- Mat
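
The threads-per-block limit can also be read from inside a program rather than with pgaccelinfo. The following is a minimal sketch, assuming the cudafor module's cudaGetDeviceProperties interface and that device 0 is the card of interest; the program name queryDevice is made up for illustration, and the code needs a CUDA Fortran compiler and a GPU to run.

```fortran
! Sketch only: queries device 0 through the CUDA runtime API
! exposed by the cudafor module.
program queryDevice
  use cudafor
  implicit none
  type(cudaDeviceProp) :: prop
  integer :: istat

  istat = cudaGetDeviceProperties(prop, 0)   ! 0 = first device (assumption)
  if (istat /= cudaSuccess) then
    write(*,*) 'query failed: ', cudaGetErrorString(istat)
    stop
  end if

  write(*,*) 'device:                ', trim(prop%name)
  write(*,*) 'max threads per block: ', prop%maxThreadsPerBlock
  write(*,*) 'max grid dimensions:   ', prop%maxGridSize
end program queryDevice
```

A value of 1024 for the threads-per-block field would match what is said above for Fermi; BSIZE should not exceed whatever the device reports.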
All times are GMT - 7 Hours