PGI User Forum

Problems with FORTRAN Accelerator and subroutines
nicolaprandi



Joined: 06 Jul 2011
Posts: 27

Posted: Mon Jul 25, 2011 8:21 am

Hi again! I worked a little bit more on the CUDA FORTRAN code and obtained the same results as those given by the original program (which uses the CPU) after one time-step: that's great! But when I try to run more than a few time-steps in emulation mode, the program crashes (I used a do loop with a "c" index in place of the do-while loop in order to test the program). If I remove the emulation-mode flag, the compiler gives me the following error:

Code:
...AppData\Local\Temp\pgcudafor2aBeDrVBDN9Jm.gpu(4): Error: Formal parameter space overflowed (256 bytes max) in function kernel


Any suggestions about it? As usual, I've uploaded the latest version of my code to MediaFire:

http://www.mediafire.com/?f87o2dc2jkq6lx8


Thanks, Nicola.
mkcolg



Joined: 30 Jun 2004
Posts: 6206
Location: The Portland Group Inc.

Posted: Mon Jul 25, 2011 10:02 am

Hi Nicola,

Quote:
I obtained the same results as those given by the original program (which uses the CPU) after one time-step: that's great!
Unfortunately, I think this is just luck. When I run your code I get an error in Fluxes_kernel when accessing the ULR array. You initialize this array in Reconstruction_kernel, but since in my case not every thread has executed, parts of ULR contain garbage by the time it's accessed in Fluxes_kernel.

There isn't any global synchronization in CUDA except between different kernel calls. There is limited thread synchronization, but only between threads in the same block. You'll need to revisit your algorithm and try to remove the dependencies. Also remember that different blocks can be executing different time steps, so you can't reinitialize your global arrays. You'll need to try to privatize them (i.e., give each thread its own copy, or add an extra dimension for each time step). Do the values of ULR and the other global arrays change from time step to time step? If not, try pre-computing them.
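
For illustration, here's a rough sketch of that restructuring (not your actual code: the kernel bodies, array sizes, and loop count below are placeholders). The time-step loop stays on the host, and because kernels launched in the same stream run in order, each launch boundary acts as a global synchronization point across all blocks:

Code:
module sync_demo
  use cudafor
  implicit none
contains

  attributes(global) subroutine Reconstruction_kernel(ULR, Nf)
    integer, value :: Nf
    real :: ULR(Nf)                       ! device array (non-value dummies are device by default)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= Nf) ULR(i) = real(i)         ! placeholder for the real reconstruction
  end subroutine Reconstruction_kernel

  attributes(global) subroutine Fluxes_kernel(ULR, F, Nf)
    integer, value :: Nf
    real :: ULR(Nf), F(Nf)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= Nf) F(i) = 2.0 * ULR(i)      ! safe: ULR is fully written by this point
  end subroutine Fluxes_kernel

end module sync_demo

program timestep
  use cudafor
  use sync_demo
  implicit none
  integer, parameter :: Nt = 64, Nf = 513
  integer, parameter :: nBlocks = (Nf + Nt - 1) / Nt   ! = 9
  real, device :: ULR_d(Nf), F_d(Nf)
  integer :: c, istat
  do c = 1, 100                           ! time-step loop stays on the host
     ! The second launch starts only after the first has finished on every
     ! block, which is the global synchronization you need.
     call Reconstruction_kernel<<<nBlocks, Nt>>>(ULR_d, Nf)
     call Fluxes_kernel<<<nBlocks, Nt>>>(ULR_d, F_d, Nf)
  end do
  istat = cudaDeviceSynchronize()         ! wait before copying results back
end program timestep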

Quote:
Error: Formal parameter space overflowed (256 bytes max) in function kernel
CUDA limits the amount of data that can be passed in via arguments, and you're over this limit. Instead of passing in all your arrays, declare them in your Main_Kernel module. They can then be accessed directly by all device routines in your module, so there's no need to pass them in.
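
Roughly like this (a sketch only; Main_Kernel, ULR, and Fluxes_kernel come from your code, but the other array names and the kernel body are placeholders):

Code:
module Main_Kernel
  use cudafor
  implicit none
  ! Device module data: visible to every global/device routine in this module,
  ! so it no longer goes through the kernel argument list and the 256-byte
  ! formal-parameter limit is not an issue.
  real, device, allocatable :: U_d(:), ULR_d(:), F_d(:)
contains

  attributes(global) subroutine Fluxes_kernel(Nf)
    integer, value :: Nf                  ! only small scalars are passed as arguments
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= Nf) F_d(i) = ULR_d(i) + U_d(i)   ! module arrays accessed directly
  end subroutine Fluxes_kernel

end module Main_Kernel


From the host you allocate and fill the module arrays as before (e.g. allocate(ULR_d(Nf)) followed by an assignment from the host array), and the launch shrinks to something like call Fluxes_kernel<<<nBlocks,Nt>>>(Nf).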

Hope this helps,
Mat
nicolaprandi



Joined: 06 Jul 2011
Posts: 27

Posted: Mon Jul 25, 2011 11:08 am

Thanks, Mat, tomorrow I'll check the code and try to fix the access to the ULR array. On my way home, I came up with some other questions:

- About global synchronization. As you said, there is no general procedure for doing so except with different kernel calls. If I want to achieve it, do I need to move the subroutines outside the main kernel and create a kernel for each one of them? By doing so, would I lose some performance?

- The do-loop (with index "it") should complete all "Nt" operations within a single thread block. How is this supposed to work with the (Nf+Nt-1)/Nt = 9 threads? Supposing the code worked as it is, would it execute the Nt operations 9 times?

- I noticed a strange behavior when running the code in debug mode (with the emulation flag). It executes the main loop with index "c" equal to 1 for about 13-14 threads (index "it"), then changes to c = 2 and "it" starts over from 1. Shouldn't it reach it = 64 before c = 2? Does this happen because there is no thread-block synchronization?


Thanks in advance for answering, have a good day.

Nicola
mkcolg



Joined: 30 Jun 2004
Posts: 6206
Location: The Portland Group Inc.

Posted: Mon Jul 25, 2011 12:17 pm

Quote:
If I want to achieve it, do I need to move the subroutines outside the main kernel and create a kernel for each one of those?
If you can't remove the dependencies, then yes, this is the only way to guarantee global synchronization.

Quote:
By doing so, would I lose some performance?
Yes, but how much will depend upon the size of Nc. Right now it's relatively small, so you can't take full advantage of the GPU. If Nc were very large, then you might not lose as much.

Quote:
The do-loop (with index "it") should complete all "Nt" operations within a single thread block. How is this supposed to work with the (Nf+Nt-1)/Nt = 9 threads? Supposing the code worked as it is, would it execute the Nt operations 9 times?
Sorry, I'm not sure what you're asking here. "Nt" defines the number of threads per block. "(Nf+Nt-1)/Nt" defines the number of blocks. Given Nt is 64 and Nf is Nc (512) + 1, or 513, you will have 9 blocks of 64 threads each.
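
In other words (a quick sketch of the arithmetic; the kernel name and the guard inside it are placeholders):

Code:
integer, parameter :: Nc = 512                        ! number of cells
integer, parameter :: Nf = Nc + 1                     ! 513 faces
integer, parameter :: Nt = 64                         ! threads per block
integer, parameter :: nBlocks = (Nf + Nt - 1) / Nt    ! (513 + 63) / 64 = 9 blocks
! call My_kernel<<<nBlocks, Nt>>>(...) launches 9 * 64 = 576 threads in total,
! so the 63 extra threads need a guard inside the kernel, e.g. "if (i <= Nf) then".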

Quote:
I noticed a strange behavior when running the code in debug mode (with the emulation flag). It executes the main loop with index "c" equal to 1 for about 13-14 threads (index "it"), then changes to c = 2 and "it" starts over from 1. Shouldn't it reach it = 64 before c = 2? Does this happen because there is no thread-block synchronization?
It's more likely an artifact of how emulation mode is performed. We use OpenMP tasks to simulate device threads. Upon launch of your code, N OpenMP threads are created (N defaults to the number of cores on your system). When you launch a kernel, an OpenMP task is added to a queue for each device thread. Each OpenMP thread then starts executing a device thread from this queue. When a "syncthreads" is encountered, the device thread is put back on the queue and a new device thread starts up. So what you're actually looking at when debugging are OpenMP threads, each of which may be executing one or more device threads.

- Mat
nicolaprandi



Joined: 06 Jul 2011
Posts: 27

Posted: Mon Aug 01, 2011 7:54 am

Hi again, over the past few days I made a little more progress and finished writing a working CUDA FORTRAN code: it was necessary to remove the outer kernel in order to synchronize the data between the inner kernels.

I have some questions involving optimization:

- If I have:

Code:
U_Dev = U


is it the same thing as having a kernel which copies every value from U to U_Dev? What I'd like to know is whether there is only one thread working or whether the compiler spawns many threads to manage the process.
- If I have blocks with fewer than 64 values, are only one or two blocks executed on each multiprocessor? I have found no clear explanation about it; it looks like "Several concurrent blocks can reside on one SM depending on the blocks' memory requirements and the SM's memory resources" (from NVIDIA's slides).


Thanks, Nicola.