PGI User Forum


Using constant memory in Fortran CUDA and with multiple GPUs

 
MaciejG



Joined: 04 Jun 2013
Posts: 3

Posted: Tue Jun 04, 2013 4:42 pm    Post subject: Using constant memory in Fortran CUDA and with multiple GPUs

Hello,

I'm developing a program in CUDA Fortran and trying to use multiple GPUs from a single host thread. The original code (single GPU only) used global and constant memory, and when adding support for multiple GPUs I could not find a way to specify on which device to place Fortran variables with the "constant" attribute.

I have tried this:

integer, constant :: iconst   ! module-scope constant-memory variable

DO dev = 0, maxdev
   ignore = cudaSetDevice(dev)
   iconst = 1   ! host assignment copies the value to constant memory
END DO

This compiles and runs, but trying to access iconst from a kernel launched on any of the higher-numbered devices results in an "unspecified launch failure".

Is there a way to specify the placement of variables in constant memory on a specific device? I looked through the user manual and "CUDA Fortran for Scientists and Engineers", but there is little information on supporting multiple GPUs in general.

Thanks,

Maciej
mkcolg



Joined: 30 Jun 2004
Posts: 6141
Location: The Portland Group Inc.

Posted: Wed Jun 05, 2013 9:38 am

Hi Maciej,

Did you set up the peer-to-peer communication first? It's required in order to use GPUDirect.

My article on multi-GPU programming using CUDA Fortran has a section on GPUDirect (part 4), including the set-up code: http://www.pgroup.com/lit/articles/insider/v3n3a2.htm. While I don't use constant memory in that example, I just went back and tried adding some constant variables, and it worked as expected. If you continue to encounter issues, let me know and we can work through them.
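For reference, the core of the peer-to-peer set-up looks something like this (a minimal sketch for two devices; error checking omitted, see part 4 of the article for the full code):

program p2p_setup
use cudafor
implicit none
integer :: istat, canAccess

! check whether device 1 can directly access memory on device 0
istat = cudaDeviceCanAccessPeer(canAccess, 1, 0)

if (canAccess == 1) then
   ! from device 1's context, enable direct access to device 0
   istat = cudaSetDevice(1)
   istat = cudaDeviceEnablePeerAccess(0, 0)
end if
end program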

- Mat
MaciejG



Joined: 04 Jun 2013
Posts: 3

Posted: Wed Jun 05, 2013 4:40 pm

Hi Mat,

Thanks for your reply.

I'm not sure if GPUDirect is actually relevant to what I am trying to achieve. I understand GPUDirect is required if a kernel running on device 1 is trying to access constant memory on device 0 - is that correct?

What I am trying to do is have a kernel running on device 0 access constant memory on device 0, and kernels on device 1 access constant memory on device 1. But it is not clear to me how to specify (we're talking Fortran here) that a variable declared with the "constant" attribute is allocated in the constant memory of device 1 or 2 instead of device 0.

Is there a way of achieving this in CUDA Fortran (PGI 13.2 and CUDA 5.0), or is it something that is not currently supported?

Cheers,

Maciej
mkcolg



Joined: 30 Jun 2004
Posts: 6141
Location: The Portland Group Inc.

Posted: Thu Jun 06, 2013 11:17 am

I believe there are actually multiple contexts created, hence you need to establish peer-to-peer access so you can manage them. Granted, I've only done a little work with using multiple GPUs from a single host thread, so there may be a better way, but using peer-to-peer seems to work.

Personally, I much prefer using MPI and establishing a single GPU context in each MPI process. I find it logically easier to manage, cleaner in implementation, and it scales better. Of course, you do what's best for your program.
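If you go the MPI route, each rank just binds to one device at start-up, something like this (a minimal sketch; error checking omitted):

program mpi_one_gpu_per_rank
use cudafor
use mpi
implicit none
integer :: ierr, rank, ndev, istat

call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

! bind each MPI rank to one device, round-robin if ranks > devices
istat = cudaGetDeviceCount(ndev)
istat = cudaSetDevice(mod(rank, ndev))

! each rank now has a single device context, so constant-memory
! variables behave exactly as in the single-GPU code

call MPI_Finalize(ierr)
end program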

- Mat
MaciejG



Joined: 04 Jun 2013
Posts: 3

Posted: Thu Jun 06, 2013 9:56 pm

mkcolg wrote:
I believe there are actually multiple contexts created, hence you need to establish peer-to-peer access so you can manage them.

I could do that, but it would just let me access constant memory on device A from a kernel running on device B, thereby negating the performance benefits of using constant memory. ;)

mkcolg wrote:
Personally, I much prefer using MPI and establishing a single GPU context in each MPI process.

That is something we've been considering for later. I hoped that a pipelined copy between two GPUs accessed from the same host thread would be a bit faster than MPI, so that we could reap the benefits of multiple GPUs even for moderately sized problems.

Thanks for your help anyway. I've decided to refactor the code so that scalar constants become kernel arguments passed by value, while array constants move to global memory.
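In case it helps anyone else, the refactored kernels end up looking roughly like this (a sketch with made-up names):

module kernels
use cudafor
contains
   attributes(global) subroutine scale(a, n, c)
      real :: a(*)       ! former constant array, now a global-memory argument
      integer, value :: n
      real, value :: c   ! former constant scalar, now passed by value
      integer :: i
      i = (blockIdx%x - 1)*blockDim%x + threadIdx%x
      if (i <= n) a(i) = a(i)*c
   end subroutine
end module

After cudaSetDevice(dev), a launch like call scale<<<grid, tBlock>>>(a_d, n, 2.0) then behaves the same on every device.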