PGI User Forum


Problems with FORTRAN Accelerator and subroutines
 
mkcolg



Joined: 30 Jun 2004
Posts: 6120
Location: The Portland Group Inc.

Posted: Mon Aug 01, 2011 11:20 am

Quote:
Is it the same thing as having a kernel which copies every value from U to U_Dev? What I'd like to know is whether there is only one thread working or if the compiler spawns many threads to manage the process.
This is a host-to-device copy, so there are no device threads in use.

Since the data is contiguous, the host-to-device memory copy is performed in a single DMA transfer. In other words, the data is moved in one big block of memory, not element by element. When copying sub-sections of arrays, the compiler may need to create multiple DMA transfers.
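As a rough illustration (the array names and sizes below are just placeholders, not your code), the difference looks like this in CUDA Fortran:

Code:
program copy_demo
  use cudafor
  implicit none
  real, dimension(1024,1024)         :: U      ! host array
  real, device, dimension(1024,1024) :: U_dev  ! device array

  call random_number(U)

  ! Whole array: both sides are contiguous, so this is one big DMA transfer.
  U_dev = U

  ! Sub-section: the data is strided in memory (Fortran is column-major),
  ! so the runtime may have to issue many smaller transfers.
  U_dev(1:512,:) = U(1:512,:)
end program copy_demo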

Quote:
If I have blocks with fewer than 64 values, are only one or two blocks executed on each multiprocessor?
The number of blocks executed at one time on a multiprocessor (SM) depends on how many resources each block uses. Each SM has a fixed number of registers and a fixed amount of shared memory. The more registers used per thread and the more shared memory used per block, the fewer threads and blocks can be resident on an SM at once. For example, on a device with 16,384 registers per SM, a kernel that uses 32 registers per thread with 256-thread blocks needs 8,192 registers per block, so at most two blocks can be resident per SM from the register limit alone.

The number of registers and the amount of shared memory vary by device. To see what's available on your device, look at the output of 'pgaccelinfo'. To see how many resources your program uses, add the flag "-Mcuda=ptxinfo".
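For example (the source file name here is just a placeholder), a compile line along the lines of

Code:
pgfortran -ta=nvidia -Minfo=accel -Mcuda=ptxinfo swe1d.f90

will print the registers and shared memory each generated kernel uses, as reported by ptxas.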

A useful exercise might be to try to calculate your program's occupancy using NVIDIA's occupancy calculator (http://forums.nvidia.com/index.php?showtopic=31279).

- Mat
nicolaprandi



Joined: 06 Jul 2011
Posts: 27

Posted: Tue Aug 02, 2011 8:38 am

Thanks for the explanation. Today I finished writing the SWE1D code and got a 26x speedup versus the unparallelized CPU version; tomorrow I'll try to add some tweaks. For the moment, I have another small question about the extra threads spawned during kernel calls. Would there be any advantage in creating two kernel launches for each kernel: one to handle the fully occupied thread blocks and a second to handle the partially occupied thread block (which would become fully occupied by changing the block size)?

I suspect it would not help when the number of thread blocks is not a multiple of the number of SMs, since we would need one extra step to launch the last thread block.


Thanks again, Nicola.
mkcolg



Joined: 30 Jun 2004
Posts: 6120
Location: The Portland Group Inc.

Posted: Tue Aug 02, 2011 1:31 pm

Hi Nicola,

Quote:
Today I finished writing the SWE1D code and got a 26x speedup versus the unparallelized CPU version:
That's great!

Quote:
Would there be any advantage in creating two kernel launches for each kernel: one to handle the fully occupied thread blocks and a second to handle the partially occupied thread block (which would become fully occupied by changing the block size)?
Often it does take experimentation to determine the best schedule, so I encourage you to try different ones. In this case, though, I doubt you'll see much difference. As long as you have many blocks, having one block with a few idle threads shouldn't matter much.
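As a rough sketch of the usual single-launch alternative (all the names and the launch configuration below are made up), a bounds check in the kernel lets the extra threads in the last, partially filled block simply do nothing:

Code:
module kernels_m
  use cudafor
contains
  attributes(global) subroutine update(u, n)
    integer, value :: n
    real           :: u(n)
    integer        :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) u(i) = u(i) + 1.0   ! threads past the end just idle
  end subroutine update
end module kernels_m

program main
  use cudafor
  use kernels_m
  implicit none
  integer, parameter :: n = 10000, blocksize = 256
  real, device       :: u_dev(n)
  integer            :: nblocks
  u_dev   = 0.0
  nblocks = (n + blocksize - 1) / blocksize      ! ceiling division
  call update<<<nblocks, blocksize>>>(u_dev, n)
end program main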

- Mat
nicolaprandi



Joined: 06 Jul 2011
Posts: 27

Posted: Thu Aug 04, 2011 8:49 am

Hi Mat, thanks for the answer. I tweaked the code a little further by implementing a reduction (as shown in Mark Harris' slides for CUDA C: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf) and making several other minor changes to improve the MIN calculation over a very large array; by doing so, I reached a 40x speedup.
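For reference, the shared-memory MIN reduction I used looks roughly like the sketch below (the names and the fixed 256-entry shared array are only illustrative, not my actual code, and it assumes a power-of-two block size of at most 256 threads):

Code:
module reduce_m
  use cudafor
contains
  ! Each block reduces its slice of 'a' in shared memory and writes one
  ! partial minimum; the partials are reduced in a second pass or on the host.
  attributes(global) subroutine min_reduce(a, partial, n)
    integer, value :: n
    real           :: a(n), partial(*)
    real, shared   :: smem(256)
    integer        :: i, tid, s
    tid = threadIdx%x
    i   = (blockIdx%x - 1) * blockDim%x + tid
    smem(tid) = huge(0.0)
    if (i <= n) smem(tid) = a(i)
    call syncthreads()
    s = blockDim%x / 2
    do while (s > 0)
       if (tid <= s) smem(tid) = min(smem(tid), smem(tid + s))
       call syncthreads()
       s = s / 2
    end do
    if (tid == 1) partial(blockIdx%x) = smem(1)
  end subroutine min_reduce
end module reduce_m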

New day, new question: is it possible to export a value from a kernel to the host when (inside the kernel) we meet a certain condition?

In particular, what I would like to achieve is the following: I pass the variable "t" into a kernel, update it by adding the variable "Dt" (calculated in a previous kernel), and change the result if "t" is greater than "tsave". If that condition applies, I need to set a scalar flag equal to 1 in order to save the data I need later on, inside the do-loop that calls the kernels. Is there a way to export that scalar flag only when "t" is greater than "tsave"? If I copy the variable back to the host every time, the code's performance drops by 20-30% because of all the extra memory transfers.

I tried to mix CUDA Fortran with the PGI Accelerator model (using the mirror and reflected directives) with no success. I also tried the cudaEvent functions, but it looks like they can only be called from the host: did I do something wrong with them?


Have a good day, Nicola.
mkcolg



Joined: 30 Jun 2004
Posts: 6120
Location: The Portland Group Inc.

Posted: Thu Aug 04, 2011 3:37 pm

Quote:
is it possible to export a value from a kernel to the host when (inside the kernel) we meet a certain condition?
There isn't a direct way in CUDA Fortran to have a kernel 'push' data back to the host. Data movement is controlled by the host.
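As a rough illustration of what I mean (the names below are made up), the flag has to live in device memory and the host decides when to read it back; whether you read it every step or only every few steps is then a host-side choice:

Code:
module flag_m
  use cudafor
contains
  attributes(global) subroutine step(t, dt, tsave, flag)
    real    :: t, dt, tsave       ! device-resident scalars
    integer :: flag
    if (threadIdx%x == 1 .and. blockIdx%x == 1) then
       t = t + dt
       if (t > tsave) flag = 1    ! set the flag on the device
    end if
  end subroutine step
end module flag_m

program main
  use cudafor
  use flag_m
  implicit none
  real,    device :: t_d, dt_d, tsave_d
  integer, device :: flag_d
  integer         :: flag
  t_d = 0.0 ; dt_d = 0.5 ; tsave_d = 0.3 ; flag_d = 0
  call step<<<1,1>>>(t_d, dt_d, tsave_d, flag_d)
  flag = flag_d                   ! the host initiates this tiny copy
  if (flag == 1) then
     print *, 'time to save'      ! copy back whatever fields you need here
  end if
end program main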

Quote:
I tried to mix CUDA Fortran with the PGI Accelerator model (using the mirror and reflected directives) with no success.
You can mix the two, and the PGI Accelerator model can access CUDA Fortran device variables. However, support for CUDA Fortran accessing PGI Accelerator model device variables is still in progress. If you have an example of what you are trying to do, I can put in a feature request.
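As a rough sketch of the direction that is supported (the names below are made up, and this assumes a current compiler release), a CUDA Fortran 'device' array can be used inside an accelerator region without copy clauses, since it already lives on the GPU:

Code:
module data_m
  use cudafor
  real, device, allocatable :: u_dev(:)   ! CUDA Fortran device array
end module data_m

subroutine scale(n)
  use data_m
  implicit none
  integer :: n, i
  !$acc region
  do i = 1, n
     u_dev(i) = 2.0 * u_dev(i)
  end do
  !$acc end region
end subroutine scale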

Thanks,
Mat