PGI User Forum

how to carry out the sum operation in cuda fortran?

 
Bullish



Joined: 21 Mar 2010
Posts: 5

Posted: Mon May 10, 2010 2:37 am    Post subject: how to carry out the sum operation in cuda fortran?

For a large array, it's fairly easy to implement the sum operation in CUDA C via pointers, and I wonder how to perform this operation efficiently on the GPU in CUDA Fortran.
Thanks!
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

Posted: Mon May 10, 2010 4:29 pm

Hi Bullish,

Can you please explain a bit more or post an example of what you mean by "sum operation"?

Sum reductions are quite difficult to perform efficiently in parallel, but no more so in Fortran than in C. I wrote a basic one for an article (see: http://www.pgroup.com/lit/articles/insider/v2n1a4.htm), but it is by no means optimal. NVIDIA has a good slide deck on reductions (see: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf) that helps explain the details.

- Mat
Bullish



Joined: 21 Mar 2010
Posts: 5

Posted: Tue May 11, 2010 12:31 am

Hi Mat,
First, thank you for your reply. The sum operation I mentioned is exactly the Fortran intrinsic function sum(). I tried to rewrite sum() in CUDA Fortran, and the GPU code is much slower than the CPU code. As far as I know, CUDA Fortran doesn't support direct memory address operations, so the GPU's capability is hard to exploit fully even with the partial sum trick. Have you encountered this problem?
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

Posted: Tue May 11, 2010 9:22 am

Hi Bullish,

Using the sum intrinsic from within a device kernel would be very slow, since every thread would perform the entire sum itself and would need to read the device's global memory to do so. I would advise against using the reduction intrinsics in a device kernel unless you are reducing a small local or shared array.

To perform reductions efficiently, you should follow the partial reduction examples described earlier. Note that sum reductions on a GPU are not expected to be faster than on the CPU. Rather, they are worth doing on the GPU only when the data is already on the device and copying it back to the host would cost more than performing the reduction there.
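A minimal sketch of the partial-sum approach, for illustration only and not the code from the article or the NVIDIA slides. The names reduce_mod, partial_sum, BLOCK_SIZE and nblocks are hypothetical, and the kernel assumes it is launched with exactly BLOCK_SIZE threads per block, where BLOCK_SIZE is a power of two. Each block reduces its share of the array in shared memory and writes one partial sum; the host then adds up the handful of partial sums.

```fortran
module reduce_mod
  use cudafor
  implicit none
  integer, parameter :: BLOCK_SIZE = 256   ! threads per block (assumed, power of two)
contains
  ! Each block writes the sum of its share of a(1:n) to partial(blockIdx%x).
  attributes(global) subroutine partial_sum(a, partial, n)
    integer, value    :: n
    real, intent(in)  :: a(n)
    real, intent(out) :: partial(*)
    real, shared      :: s(BLOCK_SIZE)
    integer :: tid, i, stride

    tid = threadIdx%x

    ! Grid-stride loop: every thread first accumulates a private sum.
    s(tid) = 0.0
    i = (blockIdx%x - 1) * blockDim%x + tid
    do while (i <= n)
      s(tid) = s(tid) + a(i)
      i = i + blockDim%x * gridDim%x
    end do
    call syncthreads()

    ! Tree reduction in shared memory: halve the active threads each step.
    stride = blockDim%x / 2
    do while (stride >= 1)
      if (tid <= stride) s(tid) = s(tid) + s(tid + stride)
      call syncthreads()
      stride = stride / 2
    end do

    if (tid == 1) partial(blockIdx%x) = s(1)
  end subroutine partial_sum
end module reduce_mod

program test_sum
  use cudafor
  use reduce_mod
  implicit none
  integer, parameter :: n = 1000000, nblocks = 64
  real, allocatable         :: a(:)
  real, device, allocatable :: a_d(:), partial_d(:)
  real :: partial(nblocks)

  allocate(a(n), a_d(n), partial_d(nblocks))
  a = 1.0
  a_d = a                                  ! host-to-device copy
  call partial_sum<<<nblocks, BLOCK_SIZE>>>(a_d, partial_d, n)
  partial = partial_d                      ! only nblocks values come back
  print *, 'GPU sum =', sum(partial), ' CPU sum =', sum(a)
  deallocate(a, a_d, partial_d)
end program test_sum
```

Only nblocks values are copied back to the host, which is the point of doing the reduction on the device when the full array already lives there.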

Note that as of the 10.5 release, the PGI Accelerator model is able to use CUDA Fortran device data. This allows you to use the PGI Accelerator's highly optimized reductions within CUDA Fortran. For example, add the following to your host code and then compile with "-ta=nvidia".
Code:
!$acc region
  sumVal = sum(devArr)
!$acc end region
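For context, here is a sketch of how that region might sit in a complete host program. The program name, the array size and the hostArr variable are assumptions made for illustration; devArr and sumVal are the names used in the snippet above. It presumably also needs CUDA Fortran enabled at compile time (for example a .cuf source file or the -Mcuda flag) in addition to "-ta=nvidia".

```fortran
program acc_sum
  use cudafor
  implicit none
  integer, parameter :: n = 1000000        ! assumed size, for illustration
  real, allocatable         :: hostArr(:)  ! hypothetical host copy
  real, device, allocatable :: devArr(:)   ! CUDA Fortran device array
  real :: sumVal

  allocate(hostArr(n), devArr(n))
  hostArr = 1.0
  devArr = hostArr                         ! host-to-device copy

  ! The accelerator region reduces the device-resident array in place.
!$acc region
  sumVal = sum(devArr)
!$acc end region

  print *, 'sum =', sumVal
  deallocate(hostArr, devArr)
end program acc_sum
```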


As for your question about direct memory address operations, again I'm not clear what you mean. If you mean DMA (direct memory access), that has to do with how data is transferred between the CPU and GPU. Or do you mean pinned memory (which is supported in CUDA Fortran)?
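If pinned memory is what you are after, a minimal sketch looks like this. The program and variable names are hypothetical; the pinned attribute and the pinned= specifier on allocate are CUDA Fortran features. The allocation can silently fall back to ordinary pageable memory, so the sketch checks the returned flag.

```fortran
program pinned_demo
  use cudafor
  implicit none
  integer, parameter :: n = 1000000        ! assumed size, for illustration
  real, pinned, allocatable :: a(:)        ! page-locked host array
  real, device, allocatable :: a_d(:)
  logical :: isPinned
  integer :: istat

  allocate(a(n), stat=istat, pinned=isPinned)
  if (istat /= 0) stop 'host allocation failed'
  if (.not. isPinned) print *, 'warning: fell back to pageable memory'

  allocate(a_d(n))
  a = 1.0
  a_d = a                                  ! host-to-device copy from pinned memory
  a = a_d                                  ! device-to-host copy
  print *, a(1), a(n)
  deallocate(a, a_d)
end program pinned_demo
```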

- Mat
All times are GMT - 7 Hours