PGI User Forum


accelerate a single loop with mpi and gpu
 
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Wed Apr 10, 2013 9:48 am

Hi Brush,

You certainly could do this with some effort. However, now you have a load balancing problem. You'd need to be able to calculate the amount of work being performed and what resources are available, and then divide the work accordingly. You might be able to do this for a specific system with a specific workload, but a general solution will be very difficult to get correct.

Also, with a GPU you really want to bring the data over once and use it across multiple kernel calls. By introducing the CPU into the mix, you'll need to move data back more often.

It might work better to use MPI and then have each process use the GPU or OpenMP depending upon the resources available to it. That seems to me easier to manage, though you still have a load balancing problem. Maybe you can arrange the problem so that there is limited synchronization between processes and a variable work flow, such as a producer/consumer model where the master process queues the work and each slave process takes work as its resources become free (see the sketch below).
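
As a very rough illustration of the producer/consumer idea, the skeleton below has a master rank handing out chunk indices on demand while the workers pull work as they finish. It is only a sketch: the tags, chunk count, and process_chunk routine are placeholders I made up, and in practice the worker would run the GPU (OpenACC) or OpenMP version of the work inside process_chunk.

Code:
/* Rough sketch of the producer/consumer idea in C with MPI.
 * The tags, chunk count, and process_chunk() routine are
 * placeholders, not from a real application. */
#include <mpi.h>
#include <stdio.h>

#define TAG_WORK 1
#define TAG_DONE 2
#define NCHUNKS  1000

static void process_chunk(int chunk)
{
    /* placeholder: the real code would run the GPU or OpenMP
     * version of the work for this chunk */
    printf("processing chunk %d\n", chunk);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                 /* master: hand out chunks on demand */
        int next = 0, active = size - 1, msg;
        MPI_Status st;
        while (active > 0) {
            /* a worker asks for work by sending back its last finished chunk */
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < NCHUNKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE,
                         MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                         /* worker: pull chunks until told to stop */
        int chunk = -1;
        MPI_Status st;
        for (;;) {
            MPI_Send(&chunk, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            MPI_Recv(&chunk, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            process_chunk(chunk);    /* GPU or OpenMP, depending on resources */
        }
    }

    MPI_Finalize();
    return 0;
}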

- Mat
brush



Joined: 26 Jun 2012
Posts: 44

Posted: Thu Apr 11, 2013 1:11 pm

Why is more data transferred back and forth when more CPUs are introduced?
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Thu Apr 11, 2013 1:29 pm

Quote:
Why is more data transferred back and forth when more CPUs are introduced?
My assumption was that you are splitting up a large loop where a portion is run on the host and a portion on the GPU; in that case you would need to synchronize the host and device copies of the data at the end of the loop. Having a GPU-only version gives you more opportunity to copy the data over to the device once and then re-use it in other accelerated loops without synchronizing the data (see the data-region sketch below).
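
For example, with an OpenACC data region along these lines the data stays resident on the device across every accelerated loop and is copied back only when the region ends. The routine and array names here are made up purely for illustration:

Code:
/* Sketch: keep the arrays resident on the GPU across several
 * accelerated loops so they are copied over and back only once.
 * Names and sizes are made up for illustration. */
void solve(float *a, float *b, int n, int nsteps)
{
    #pragma acc data copy(a[0:n]) copyin(b[0:n])
    {
        for (int step = 0; step < nsteps; ++step) {
            #pragma acc parallel loop present(a, b)
            for (int i = 0; i < n; ++i)
                a[i] += b[i];

            #pragma acc parallel loop present(a)
            for (int i = 0; i < n; ++i)
                a[i] *= 0.5f;
            /* no host/device synchronization between loops or steps;
             * running part of either loop on the host would force an
             * 'update' here every iteration */
        }
    }   /* 'a' is copied back to the host only here */
}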

There's no steadfast rule here, though. I'm sure you can come up with methods where you have one very large OpenMP parallel section that divides the work across the CPU and GPU with limited synchronization. If you can get it to work for your algorithm, then please do so. To me, though, this is not natural OpenMP programming and better fits MPI.

- Mat
brush



Joined: 26 Jun 2012
Posts: 44

Posted: Sun Apr 14, 2013 3:36 pm

Thanks Mat. So if I had two GPUs, would it make sense to split up the work of a single large loop between the two GPUs? And if not, why not? Would OMP or MPI be better for doing this?
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Mon Apr 15, 2013 9:17 am

Quote:
So if I had two GPUs, would it make sense to split up the work of a single large loop between the two GPUs?
The bottleneck here is the PCIe bus. Both GPUs share this one bus, which essentially serializes your data copies. Hence, the benefit of splitting a single loop across multiple GPUs will depend upon how much compute is done versus how much data needs to be copied. If there is little to no data movement, then you can expect a near 2x speed-up going from one to two GPUs. However, as you add data movement the speed-up diminishes.

Quote:
Would OMP or MPI be better for doing this?
In general, you want to move your data over to the GPU once, perform all your computation on the device, and then move it back once. Most likely you'll need to perform some data synchronization in between, but ideally this is kept to a minimum.

This is why I prefer MPI over OpenMP for multi-GPU programming using OpenACC. MPI more naturally fits this model since the data is already decomposed across processes, each process can manage its own device data, and data synchronization is typically limited. A rough outline of this approach is sketched below.
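
Here one MPI rank is bound to one GPU, and each rank keeps its own slice of the data resident on its device. The rank-to-device mapping, array, and sizes are placeholders, and error handling is omitted:

Code:
/* Sketch: one MPI rank per GPU.  Each rank binds itself to a device,
 * owns its slice of the data, and keeps it resident on its GPU.
 * The rank-to-device mapping and array contents are placeholders. */
#include <mpi.h>
#include <openacc.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* bind this rank to one of the GPUs on the node */
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    if (ngpus > 0)
        acc_set_device_num(rank % ngpus, acc_device_nvidia);

    /* each rank allocates and computes only its own slice */
    int nlocal = 1000000;
    float *a = (float *)malloc(nlocal * sizeof(float));
    for (int i = 0; i < nlocal; ++i)
        a[i] = (float)(rank + i);

    #pragma acc data copy(a[0:nlocal])
    {
        #pragma acc parallel loop
        for (int i = 0; i < nlocal; ++i)
            a[i] *= 2.0f;
        /* any halo exchange would go here, with 'acc update host/device'
         * around the MPI sends and receives */
    }

    free(a);
    MPI_Finalize();
    return 0;
}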

With OpenMP, there is an assumption of data coherence in shared memory. However, when you start using multiple devices, each with its own discrete memory, you, the programmer, now need to make sure the host's shared memory maintains this coherence. Second, OpenMP makes it more difficult to have a whole-program view of your device data since OpenMP's parallelism is more fine grained. You could have a single parallel region in your main routine that would fix this, but again that is not normally how OpenMP is used.

Now, if you are only splitting a single loop across multiple GPUs, then I can see using OpenMP instead of MPI (a rough sketch follows below). However, I would question whether you would see a benefit from using the GPU at all, and whether using two would be worth the extra programming effort. If 99% of your program time is spent in this loop and there is limited data movement, then it's absolutely worth using multiple GPUs. If 5% of your program time is spent in this loop, then absolutely not.
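
If you did go that route, it could look roughly like the sketch below, with one OpenMP thread per GPU, each taking a chunk of the loop. The array, the chunking, and the assumption of at least one NVIDIA GPU are mine, purely for illustration; note that every chunk is copied out and back on each call, which is exactly the extra data movement I mentioned above.

Code:
/* Sketch: one OpenMP thread per GPU, each thread taking a chunk of a
 * single large loop.  Assumes at least one NVIDIA GPU is available;
 * the array name and the work are placeholders. */
#include <omp.h>
#include <openacc.h>

void scale(float *a, int n)
{
    int ngpus = acc_get_num_devices(acc_device_nvidia);   /* assume >= 1 */

    #pragma omp parallel num_threads(ngpus)
    {
        int tid   = omp_get_thread_num();
        int chunk = (n + ngpus - 1) / ngpus;
        int start = tid * chunk;
        int end   = (start + chunk > n) ? n : start + chunk;

        acc_set_device_num(tid, acc_device_nvidia);

        /* each thread copies only its own piece to its own GPU */
        #pragma acc parallel loop copy(a[start:end-start])
        for (int i = start; i < end; ++i)
            a[i] *= 2.0f;
    }
    /* the host copy of 'a' is coherent again here, at the cost of
     * copying every chunk out and back on this one call */
}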

Again, there is not one way of doing this. You need to evaluate your particular algorithm as to what is the best solution.

- Mat
Page 2 of 5

 