PGI User Forum


atomicadd for double precision in CUDA Fortran
mkcolg



Joined: 30 Jun 2004
Posts: 6142
Location: The Portland Group Inc.

Posted: Wed Apr 06, 2011 9:49 am

Quote:
1. support multi Modules for global & device subroutines
This is our most requested feature. The problem is that there is no linker for device code, so there is no way to statically associate symbols from multiple objects. Secondly, CUDA doesn't support true function calling on the device. Currently, all called device routines must be inlined at compile time.

These are difficult challenges but we are working on them. Dr. Michael Wolfe in this article (http://www.pgroup.com/lit/articles/insider/v2n3a1.htm) discusses these challenges, some of our solutions and future directions.

Quote:
2. minor request -- support character strings in emu mode.
I added a feature request (TPR#17781).

- Mat
tlstar



Joined: 31 Mar 2011
Posts: 22

Posted: Wed Apr 06, 2011 11:25 am

Quote:
Secondly, CUDA doesn't support true function calling on the device. Currently, all called device routines must be inlined at compile time.


Ah, so there is no stack for function calls in CUDA? That is a really hard job...
It's really nice to work together with a compiler expert.


I have more questions.

In emu mode (-Mcuda=emu -Mbounds), my code works very well.
But on the real CUDA device, it fails with this message:
copyout Memcpy (host=0x7c67e0, dev=0x1e620000, size=131072) FAILED: 4(unspecified launch failure)

What is the most likely reason for this error?
What is the difference between emu and real CUDA?

How can I stop the compiler from optimizing the kernel code? -O0 does not disable all optimizations.


Last edited by tlstar on Wed Apr 06, 2011 12:10 pm; edited 1 time in total
mkcolg




Posted: Wed Apr 06, 2011 12:10 pm

Hi tlstar,

Your kernel is most likely seg faulting due to an out-of-bounds error when accessing energy_inter. I'm guessing that you have more threads than elements in the array.

To fix this, add a guard before accessing the array, e.g. "IF (i .LE. ATOMIC_RAYS) then".

Also, your launch configuration could be better. I would suggest making the number of blocks variable and keeping the number of threads per block fixed. The number of blocks is capped at 64k, while the number of threads per block is capped at 512 or 1024 depending on your device. Your current configuration will break as ATOMIC_RAYS becomes large. Something like the following would be better:

Code:
CALL raycast<<<(ATOMIC_RAYS+GPU_CORES-1)/GPU_CORES, GPU_CORES>>>(point_dev, cell_dev, simul_dev,energy_inter_dev)
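
For reference, the guard and the rounded-up block count can be combined in one small program. This is only a minimal sketch, not the original raycast kernel: the names raycast_guarded and THREADS_PER_BLOCK, the single-array argument list, and the values of ATOMIC_RAYS and the block size are all assumptions made for illustration.

```fortran
module raycast_sketch
  use cudafor
  implicit none
  integer, parameter :: ATOMIC_RAYS = 100000        ! assumed problem size
  integer, parameter :: THREADS_PER_BLOCK = 256     ! assumed fixed block size
contains
  ! Hypothetical stand-in for the real kernel: one thread per array element.
  attributes(global) subroutine raycast_guarded(energy_inter, n)
    integer, value :: n
    real :: energy_inter(n)
    integer :: i
    ! Global 1-based index of this thread.
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    ! Guard: threads in the last block may fall past the end of the array.
    if (i .le. n) then
      energy_inter(i) = energy_inter(i) + 1.0
    end if
  end subroutine raycast_guarded
end module raycast_sketch

program launch_sketch
  use cudafor
  use raycast_sketch
  implicit none
  real, allocatable :: energy_inter(:)
  real, device, allocatable :: energy_inter_dev(:)
  integer :: nblocks

  allocate(energy_inter(ATOMIC_RAYS), energy_inter_dev(ATOMIC_RAYS))
  energy_inter = 0.0
  energy_inter_dev = energy_inter          ! host-to-device copy

  ! Round up so that nblocks * THREADS_PER_BLOCK >= ATOMIC_RAYS.
  nblocks = (ATOMIC_RAYS + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK
  call raycast_guarded<<<nblocks, THREADS_PER_BLOCK>>>(energy_inter_dev, ATOMIC_RAYS)

  energy_inter = energy_inter_dev          ! device-to-host copy
  print *, 'all elements updated: ', all(energy_inter == 1.0)
end program launch_sketch
```

Because the block count is rounded up, the last block usually contains threads whose index exceeds ATOMIC_RAYS, which is why the guard is needed even with a sensible launch configuration.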



Hope this helps,
Mat
tlstar




Posted: Wed Apr 06, 2011 12:18 pm

Thanks for the very prompt reply.

The difference I saw when commenting out the results output line may be due to the compiler's "-fast" optimization option.

How can I stop the compiler from optimizing the kernel code? -O0 does not disable all optimizations.

And what is the difference between emu and real CUDA?
mkcolg




Posted: Thu Apr 07, 2011 10:28 am

Hi tlstar,

Quote:
The difference I saw when commenting out the results output line may be due to the compiler's "-fast" optimization option.
Possible, but given the code you posted earlier, it's more likely a programming error. I would need a reproducing example to be sure. Please feel free to send the code to PGI Customer Service (trs@pgroup.com) and ask them to send it to me.

Quote:

And what is the difference between emu and real CUDA?
Emulation mode (-Mcuda=emu) generates a CPU version of the code that uses OpenMP Tasks to simulate an NVIDIA device. It's best used for debugging since the PGI debugger (pgdbg) is OpenMP capable.

However, since the code is running on the CPU, its behavior can still differ from running on the GPU. For example, on a CPU, if you write beyond the end of an array, the code most likely won't seg fault. You may stomp over another variable's data and cause other problems, but not seg fault. On the GPU, accessing memory even one element beyond the end of an array will trigger a seg fault. Adding array bounds checking (-Mbounds) in emulation mode should help find these errors.
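
On the real device, a fault inside a kernel is often only reported by a later call, such as the copy back to the host, because kernel launches are asynchronous. One way to localize such a failure is to check the error status right after the launch. The following is a minimal sketch under stated assumptions: the fill kernel and the module and program names are hypothetical, and cudaDeviceSynchronize assumes a CUDA 4.0 or later toolkit (older toolkits provide cudaThreadSynchronize instead).

```fortran
module fill_sketch
  use cudafor
  implicit none
contains
  ! Hypothetical kernel used only to have something to launch.
  attributes(global) subroutine fill(a, n)
    integer, value :: n
    real :: a(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = 1.0      ! guard against out-of-bounds threads
  end subroutine fill
end module fill_sketch

program check_launch_sketch
  use cudafor
  use fill_sketch
  implicit none
  integer, parameter :: n = 1000
  real, device :: a_dev(n)
  real :: a(n)
  integer :: istat

  call fill<<<(n + 255) / 256, 256>>>(a_dev, n)

  ! Errors detected at launch time (e.g. an invalid configuration).
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) print *, 'launch error: ', trim(cudaGetErrorString(istat))

  ! Wait for the kernel to finish; faults during execution surface here.
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) print *, 'kernel error: ', trim(cudaGetErrorString(istat))

  a = a_dev                     ! device-to-host copy
end program check_launch_sketch
```

With these checks in place, an out-of-bounds access in the kernel should be reported at the synchronization point rather than at the later copyout.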

- Mat
All times are GMT - 7 Hours