mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Mon Apr 26, 2010 4:23 pm    Post subject:
Hi Rob,
Quote:
> I can't see how to cast this into code that looks like yours - it's the random nature of the way a Monte Carlo code works that's giving me grief. Is the way around this to store an array of values of indx then sum up using your strip mine example?

I wrote an article for the PGInsider (our newsletter) which walks through a simple Monte Carlo code that might help you (See: http://www.pgroup.com/lit/articles/insider/v2n1a4.htm). You can't call a random number generator from a device kernel, so you will need to pre-compute these values; the article gives a method for doing this.
The article also shows how to perform a simple sum reduction. Your histogram will follow a similar form, though instead of a single element in an array, each thread will need its own "bin".
Quote:
> One other quick question - doesn't the fact that you are only using 10 threads to do the summations make this algorithm slow?

With any reduction, the code ultimately needs a serial portion, and that portion will be slow. If done correctly, though, the serial portion will be very small and have little overall performance impact.
Quote:
> Do you need to call syncthreads inside process_kernel to make sure it's completed before process_kernel_sum tries to do its work?

No. Synchronization is implicit for kernels launched on the same stream.
Quote:
> Would you be willing to have a look at the code itself? - it's about 10 times longer than the example I've used here, but well documented and hopefully easy to read. Let me know - I fully appreciate that it's not really your job to help customers with their code so no worries if you haven't got time - thanks for all the help so far....
First, why don't you see if my article helps. If you're still having problems, we can take a look at your code for a few minutes and try to send you in the right direction.
Note you might take a look at the PGI Accelerator Model as well. As my article shows, it does a great job with the kernel and reduction code. So if your run time is dominated by the compute portion rather than by copying data over to the GPU (as in my little example), it might be the best way to go. Also, we are working on supporting CUDA device data within Accelerator regions; once available, that will eliminate the overhead of copying the random numbers to the GPU.
- Mat