PGI User Forum


How to fill a very large array randomly using CUDA
 
PGI User Forum Forum Index -> Accelerator Programming
alfvenwave



Joined: 08 Apr 2010
Posts: 79

Posted: Mon Apr 26, 2010 9:04 am    Post subject: How to fill a very large array randomly using CUDA

Hi.

I have a general CUDA problem - if anyone can help me find a solution, I will be more than happy to reference you in any papers that result (you'll be helping the UK Fusion energy programme).

Here is what I am stuck with. I have a very large multi-dimensional array (a particle distribution function: 2 spatial coordinates and 3 velocity-space coordinates). The array is about 200MB in size, and hence fits into device memory on a C1060 without any problem.

I then wish to follow a large number of particles (~10e6) in space and time, and augment that array with the time spent in each cell, summed over particles. The result is a particle distribution function.

On the face of it, CUDA seems like a good way of doing this - the problem is effectively SIMD, and the particles do not interact with each other. If I were doing this on a typical supercomputer, I would use MPI. Each node then has a good amount of memory: I can hold a temporary array of the same dimensionality as my distribution function on each node, then simply sum them all up at the end.

With CUDA, though, I have a problem. If you look at the "toy problem" code below, written to illustrate the issue, you can easily see why it fails, giving an answer of 10 instead of the desired 160. I've stripped out some OpenMP that was there to split the problem over multiple cards, so sorry if it's not as compact an example as it could be.

Each CUDA thread is reading the contents of Fd(1) at the same time, then writing the value Fd(1)+1.0 back to Fd(1) - obviously a two-step operation - at each of the 10 iterations of the loop. What we really want is for each thread to complete this operation one after another, which for the example shown here would augment Fd(1) by 16.0 at each of the ten loop iterations.

For my real code (not the toy example below), the probability of there being two particles in a given grid cell at any one iteration is lower than in the example here, but it nonetheless results in what one might call a "thread-conflict numerical error".

The obvious way to solve this, would be to give Fd an extra dimension, equal in size to the total number of threads, allowing one to fill the array indexed with thread number as well as spatial and velocity space indices, then integrate over that dimension at the end (analogous to what one usually does with MPI). Unfortunately, for a sensible number of threads, multiplied by my 200MB distribution function, a C1060 does not have enough device memory.

My question then is this - is there some established way of locking down the operation Fd(1) = Fd(1) + 1.0, so that individual threads have to carry this operation out one after another, hence removing the "read-write" conflict between threads? Or is there some other way of filling a very large array in a random fashion so that individual threads don't conflict? Or, is this type of problem currently intractable using CUDA?

Any advice will be greatly appreciated and acknowledged,

kind regards,

Rob.


Code:
module process_mod

use cudafor

implicit none

contains

attributes(global) subroutine process_kernel( Fd )

implicit none

real         :: Fd(1000)
integer      :: i

do i=1,10

!  Is there some way of locking down this next operation so that it
!  must be completed by thread n before thread n+1 tries to do the same?

   Fd(1)=Fd(1)+1.0
enddo

end

subroutine process( Fh )

   use cudafor

   implicit none

   real         :: Fh(1000)
   real, device :: Fd(1000)

   Fh = 0.0
   Fd = 0.0

!  Call Kernel for 4x4 = 16 threads:

   call process_kernel<<<4,4>>>( Fd )

   Fh = Fd

end

end module

program memexample

   use cudafor
   use process_mod

   implicit none

   real         :: Fh(1000)

   integer      :: iflag

   iflag = cudaSetDevice(0)

   call process(Fh)

   print*,  'Answer:', maxval(Fh)

end

mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Mon Apr 26, 2010 12:37 pm

Hi Rob,

Quote:
My question then is this - is there some established way of locking down the operation Fd(1) = Fd(1) + 1.0, so that individual threads have to carry this operation out one after another, hence removing the "read-write" conflict between threads? Or is there some other way of filling a very large array in a random fashion so that individual threads don't conflict? Or, is this type of problem currently intractable using CUDA?

I don't believe there is a good way of doing this. What you want is a mutex lock on a variable, and mutexes require atomic operations. While CUDA does have atomic operations, the variables must be integers (floating-point atomics were added in CUDA 3.0 with compute capability 2.0, i.e. Fermi). Atomics also serialize the conflicting updates and are very slow. Hence, for large problems this is not a good solution.
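For integer data, an atomic update in CUDA Fortran would look roughly like the sketch below. This is a hypothetical kernel, not code from the thread; `atomicadd` is the intrinsic, and `icount` stands in for an integer version of Fd:

```fortran
! Sketch only: atomic increment of an integer counter in device memory.
! atomicadd performs the read-modify-write as one indivisible operation,
! so concurrent threads cannot lose updates; it returns the old value.
attributes(global) subroutine count_kernel( icount )
   integer :: icount(*)
   integer :: old, i
   do i = 1, 10
      old = atomicadd( icount(1), 1 )
   enddo
end subroutine count_kernel
```

Launched as `call count_kernel<<<4,4>>>(icount_d)`, this would leave icount(1) at 160 - but every conflicting thread serializes on the same memory location, which is why it scales poorly.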

What you could do is a reduction. Have each thread store its own value for the "Fd(1)+1.0" in a temporary array (you only need an array the size of the number of threads) and then do a final reduction in serial. To work around the large memory size, you will need to 'strip mine', i.e. only send over portions of your problem at a time.

For example:
Code:

module process_mod

use cudafor

implicit none
real,device         :: Fd(1000)
real,device         :: FdL(16,100)

contains

attributes(global) subroutine process_kernel(N)

implicit none

integer, value :: N
integer        :: i,j,idx
idx  = (blockIdx%x-1)*blockDim%x + threadIdx%x

do i=1,N
   do j=1,10
      FdL(idx,i)=FdL(idx,i)+1
   enddo
enddo

end

attributes(global) subroutine process_kernel_sum(start,N)

implicit none

integer, value :: start,N
integer        :: i,j

do i=1,N
   do j=1,16
      Fd(start+i-1)=Fd(start+i-1)+FdL(j,i)
   enddo
enddo

end

subroutine process( Fh )

   use cudafor

   implicit none

   real         :: Fh(1000)
   integer      :: i

   Fh = 0.0
   Fd = 0.0
   FdL = 0.0

!  Call Kernel for 4x4 = 16 threads:

   do i=1,1000,100
      FdL = 0.0
      call process_kernel<<<4,4>>>(100)
      call process_kernel_sum<<<1,1>>>(i,100)
   enddo
   Fh = Fd

end

end module

program memexample

   use cudafor
   use process_mod

   implicit none

   real         :: Fh(1000)

   integer      :: iflag

   iflag = cudaSetDevice(0)

   call process(Fh)

   print*,  'Answer:', maxval(Fh)
   print*, Fh(1), Fh(101), Fh(1000)

end

Hope this helps,
Mat
alfvenwave




Posted: Mon Apr 26, 2010 1:40 pm

Hi Mat.

Thanks for this. I can't quite see how this will work for my problem, though (I was warned that CUDA can get messy).
Imagine, if you will, that this line in my code:

Code:
Fd(1)=Fd(1)+1.0


is now replaced by the following:

Code:
call random_number(RAN)
indx = int(RAN*1000.0) + 1
Fd(indx) = Fd(indx) + 1.0


I can't see how to cast this into code that looks like yours - it's the random nature of the way a Monte Carlo code works that's giving me grief. Is the way around this to store an array of values of indx and then sum up using your strip-mine example?
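One way to cast the random-index update into Mat's reduction pattern is to have each thread record *where* it would have incremented, in storage only that thread owns, and do the histogramming afterwards. A hypothetical sketch (the names are made up, and the random numbers are assumed pre-generated on the host, since `random_number` is not available in device code):

```fortran
! Sketch: each thread writes its sampled cell indices into its own
! column of idxList, so no two threads ever touch the same element.
attributes(global) subroutine sample_kernel( idxList, ran, nsamp )
   integer        :: idxList(16,*)   ! one column per thread
   real           :: ran(16,*)       ! pre-generated random numbers
   integer, value :: nsamp
   integer        :: tid, i
   tid = (blockIdx%x-1)*blockDim%x + threadIdx%x
   do i = 1, nsamp
      idxList(tid,i) = int( ran(tid,i)*1000.0 ) + 1
   enddo
end subroutine sample_kernel
```

Afterwards the host (or a single-thread kernel, as in Mat's example) would sweep idxList and do `Fd(indx) = Fd(indx) + 1.0` serially, which is race-free because only one agent is writing.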

Rob.
alfvenwave




Posted: Mon Apr 26, 2010 2:03 pm

One other quick question - doesn't the fact that you are only using a single thread to do the summations make this algorithm slow? Do you need to call syncthreads inside process_kernel to make sure it has completed before process_kernel_sum tries to do its work?

I did toy with the idea of, at every time step in my code, passing back a long list of particle cell locations to the host (~10e6 particles, 5 values each describing position in space and velocity space) and letting the host increment my distribution function while the GPU gets on with the next time step. I started along these lines, only to discover a very large performance hit from placing the time-step loop outside rather than inside my GPU tracking kernel (presumably at each time step the GPU has to re-read all the particle states from global memory, whereas if the time loop is inside the kernel, particle states only have to be read at the start). This probably doesn't make a lot of sense without seeing the code.
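For what it's worth, that overlap idea can be expressed with streams: copy the previous step's cell list back with cudaMemcpyAsync while the tracking kernel runs in a different stream. A rough sketch of the host-side structure only - the kernel, the accumulation routine, and the array names are all placeholders, not code from this thread:

```fortran
! Sketch: overlap the device->host copy of step n-1 with the compute
! of step n. track_kernel, host_accumulate and the arrays are
! placeholders for the real code.
integer :: istat, istream1, istream2, istep
istat = cudaStreamCreate(istream1)   ! compute stream
istat = cudaStreamCreate(istream2)   ! copy stream
do istep = 1, nsteps
   call track_kernel<<<grid,block,0,istream1>>>( cells_d )
   ! copy the previous step's cell list while the kernel runs
   istat = cudaMemcpyAsync( cells_h, cells_prev_d, ncells, istream2 )
   istat = cudaThreadSynchronize()   ! wait for both streams
   call host_accumulate( Fh, cells_h )
enddo
```

Whether this wins depends on whether the copy and the host-side histogram are cheaper than the kernel, but it keeps all the conflicting writes on the host where they are trivially serial.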

Would you be willing to have a look at the code itself? It's about 10 times longer than the example I've used here, but well documented and hopefully easy to read. Let me know - I fully appreciate that it's not really your job to help customers with their code, so no worries if you haven't got time. Thanks for all the help so far.

Rob.