PGI User Forum


accelerate a single loop with mpi and gpu
mkcolg (Joined: 30 Jun 2004 | Posts: 6141 | Location: The Portland Group Inc.)
Posted: Thu May 23, 2013 8:51 am

Hi Ben,

Quote:
Is it that transferring half of the array still takes about the same time as transferring the entire array?
If both threads transfer half the array at about the same time, then yes, it typically takes about the same amount of time as if one thread transferred the whole array. So your compute time should roughly halve, but the overall data transfer time will stay about the same.

If you can interleave data transfers and compute, then you might be able to maximize the data bandwidth. Though this is tough to do in an OpenMP context, given there's typically tighter synchronization between threads. Eventually, you'll also be able to use the OpenACC async clauses, which might help with interleaving, but unfortunately, we don't quite have async working well enough within OpenMP (hence the PGI_ACC_SYNCHRONOUS variable). Async works fine in a serial and MPI context, though.

- Mat
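
As an illustration of the interleaving Mat describes, here is a minimal sketch (not from the original post; the block count, array name, and doubling kernel are assumptions for the example) that splits an array into blocks and overlaps transfers with compute using OpenACC async queues:

Code:
/* Illustrative sketch only: overlap host<->device transfers with compute
 * by issuing each block's update and kernel on its own async queue.
 * NBLOCKS, the array name, and the doubling kernel are assumptions. */
#define NBLOCKS 4

void scale_in_blocks(float *a, int n)
{
    int bs = n / NBLOCKS;                    /* assume n divides evenly */

    #pragma acc data create(a[0:n])
    {
        for (int b = 0; b < NBLOCKS; ++b) {
            int off = b * bs;

            /* stage this block's input on queue b */
            #pragma acc update device(a[off:bs]) async(b)

            /* compute on the same queue so it runs after the upload */
            #pragma acc parallel loop present(a[0:n]) async(b)
            for (int i = off; i < off + bs; ++i)
                a[i] *= 2.0f;

            /* copy this block's result back, still on queue b */
            #pragma acc update host(a[off:bs]) async(b)
        }
        #pragma acc wait                     /* drain all queues */
    }
}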
brush (Joined: 26 Jun 2012 | Posts: 44)
Posted: Wed May 29, 2013 4:01 pm

Could you clarify the usage of acc_set_device_num(devicenum, devicetype)?

For the device number, are the GPUs numbered 0, 1, 2, ... or 1, 2, 3, ...? I thought it was the former, but according to this link (http://www.catagle.com/26-23/pgi_accel_prog_model_1_2.htm), passing in 0 gives default behavior, not the first GPU. Is the CUDA Device Number, as displayed by pgaccelinfo, the number I need to pass as my argument to get that device?

What does a devicetype of 0 or 1 do? (I didn't understand the documentation linked above.)

Thanks,
Ben
mkcolg (Joined: 30 Jun 2004 | Posts: 6141 | Location: The Portland Group Inc.)
Posted: Fri May 31, 2013 11:02 am

Hi Ben,

The GPUs are numbered starting at 0.

For us, the default behavior is to use the lowest-numbered device on which the binary will run. Typically this is device zero, though it could be something higher. The device information, including the numbering, can be found by running the "pgaccelinfo" utility.

For the devicetype, you should use the enumerated names such as ACC_DEVICE_NVIDIA, since the numbering may not be consistent between compilers. You can see the PGI list by viewing the header file "include/accel.h" (located in your PGI installation directory).

From 13.5's accel.h:
Code:
typedef enum{
        acc_device_none = 0,
        acc_device_default = 1,
        acc_device_host = 2,
        acc_device_not_host = 3,
        acc_device_nvidia = 4,
        acc_device_radeon = 5,
        acc_device_xeonphi = 6,
        acc_device_pgi_opencl = 7,
        acc_device_nvidia_opencl = 8,
        acc_device_opencl = 9
    }acc_device_t;


Hope this helps,
Mat
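
As a concrete example, here is a minimal sketch (not from the original post; the rank-to-device mapping and the use of openacc.h are assumptions) of having each MPI rank select its own NVIDIA device using the 0-based numbering described above:

Code:
/* Illustrative sketch: map each MPI rank to an NVIDIA device by
 * rank modulo the device count.  Assumes the OpenACC runtime API
 * (openacc.h) and MPI are available. */
#include <mpi.h>
#include <openacc.h>

int main(int argc, char **argv)
{
    int rank, ndevices;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* devices are numbered 0..N-1 */
    ndevices = acc_get_num_devices(acc_device_nvidia);
    if (ndevices > 0)
        acc_set_device_num(rank % ndevices, acc_device_nvidia);

    /* ... each rank's offloaded work goes here ... */

    MPI_Finalize();
    return 0;
}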
brush (Joined: 26 Jun 2012 | Posts: 44)
Posted: Tue Jul 16, 2013 4:21 pm

mkcolg wrote:
Hi Ben,
Quote:

For example, if I'm running on nodes with 8 cores per GPU and every core runs an MPI process, they're all fighting over that one GPU.
It's a problem. Unless you have a K20, NVIDIA doesn't support multiple host processes (MPI) using the same GPU. It may work; it's just not supported. Even if it does work, you've serialized the GPU portion of the code. This situation works only if the MPI processes use the GPU infrequently and not at the same time.


Hi Mat,

If you do have a K20 GPU, is there anything special you need to do to make it possible for multiple MPI processes to make calls to the same GPU simultaneously? Or should it work automatically? (ANSWER EDITED IN BELOW!)

Ben

EDIT: I did my homework this time. From the link below:
"All it takes is a Tesla K20 GPU with a CUDA 5 installation and setting an environment variable to let multiple MPI ranks share the GPU. Hyper-Q is then ready to use."
http://blogs.nvidia.com/blog/2012/08/23/unleash-legacy-mpi-codes-with-keplers-hyper-q/
TheMatt (Joined: 06 Jul 2009 | Posts: 317 | Location: Greenbelt, MD)
Posted: Fri Jul 19, 2013 10:29 am

brush wrote:

If you do have a K20 GPU, is there anything special you need to do to make it possible for multiple MPI processes to make calls to the same GPU simultaneously? Or should it work automatically? (ANSWER EDITED IN BELOW!)

Ben

EDIT: I did my homework this time. From the link below:
"All it takes is a Tesla K20 GPU with a CUDA 5 installation and setting an environment variable to let multiple MPI ranks share the GPU. Hyper-Q is then ready to use."
http://blogs.nvidia.com/blog/2012/08/23/unleash-legacy-mpi-codes-with-keplers-hyper-q/

Here's a question, though: what's the environment variable?