PGI User Forum
accelerate a single loop with mpi and gpu
PGI User Forum Forum Index -> Accelerator Programming
brush

Joined: 26 Jun 2012
Posts: 44

Posted: Sun Nov 11, 2012 4:06 pm    Post subject: accelerate a single loop with MPI and GPU

Hi,

If I have a single large loop containing lots of computations that is already divided among MPI tasks, is there an effective way to also use a GPU?

Or is it only possible to either (a) divide the loop into MPI tasks or (b) run the loop iterations in parallel on the GPU?

Similarly, to use both MPI and GPUs do I need nested loops?

Thanks,
Ben
mkcolg

Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Wed Nov 14, 2012 8:40 am

Hi Ben,

You would use MPI to divide your work across multiple accelerators and then use OpenACC to move the individual MPI process's work over to the accelerator.

This article may help: http://www.pgroup.com/lit/articles/insider/v4n1a3.htm.

- Mat
brush

Posted: Wed Jan 30, 2013 12:04 am

How do you deal with having a lot more MPI processes than GPUs? For example, if I'm running on nodes with 8 cores per GPU and every core runs an MPI process, they're all fighting over that one GPU.

I'm having trouble understanding how you can utilize all 8 cores and the GPU efficiently. Even if you had just one MPI process per node and used something like OpenMP to distribute work among the cores on the node, it still seems you'd have the same problem.

Regarding the article:
At the very bottom, the table shows the processors used with run times and such. There look to be 48 cores total. Are they all being used in the GPU run? Also, on that system you have 2 cores per GPU, but on computers like Titan at Oak Ridge, where you have 16 cores per GPU, will you suffer in performance?

Thanks,
Ben
mkcolg

Posted: Wed Jan 30, 2013 11:02 am

Hi Ben,
Quote:

For example, if I'm running on nodes with 8 cores per GPU and every core runs an MPI process, they're all fighting over that one GPU.
It's a problem. Unless you have a K20, NVIDIA doesn't support multiple host processes (MPI) using the same GPU. It may work; it's just not supported. And even if it does work, you've serialized the GPU portion of the code. This situation works only if the MPI processes use the GPU infrequently and not at the same time.

Quote:
I'm having trouble understanding how you can utilize all 8 cores and the GPU efficiently. Even if you had just one MPI process per node and used something like OpenMP to distribute work among the cores on the node, it still seems you'd have the same problem.
You can do a hybrid model (MPI+OMP), but it is difficult to code. Typically, though, using just MPI processes (one per core) is fine. You may lose a bit of time since the processes don't share memory, but it may not be much.

Quote:
At the very bottom, the table shows the processors used with run times and such. There look to be 48 cores total. Are they all being used in the GPU run? Also, on that system you have 2 cores per GPU, but on computers like Titan at Oak Ridge, where you have 16 cores per GPU, will you suffer in performance?
On my runs, I had one GPU per MPI process. On my test system, although I had 4 host cores, I only used two MPI processes since I had 2 GPUs on that box. On the cluster run, I had 8 nodes, with each node having 3 MPI processes and 3 GPUs. For the pure host versions (no GPUs), I used a hybrid MPI/OpenMP version of the code: the same number of MPI processes, with OpenMP then used to fully populate the host cores.

- Mat
brush

Posted: Tue Apr 09, 2013 6:34 pm

Not entirely related to the previous posts, but:

What if I had an OpenMP + GPU (either CUDA or directives) code, and again I just have one big loop that needs to be parallelized (nested inside it is a smaller, but mostly insignificant, loop)?

Would it be viable to have one OpenMP thread, say the master thread, execute half the entire loop with a call to a GPU kernel, and then have the other 3 OpenMP threads take on the work of executing the other half of the loop? It wouldn't make sense to have the loop split between the threads and then have every thread call the GPU.

If I had, say, 2 GPUs and 4 processors, then maybe the better way would be to split the work of the loop between 2 OpenMP threads, and then allow each OpenMP thread to call one of the two GPUs to do the actual work. Is that reasonable?

What would be an example of a code structure that would be able to fully take advantage of both OpenMP and GPU parallelization? Just large nested loops?