PGI User Forum


If I use multiple GPUs, can I set device variables globally?
bsb3166



Joined: 27 Jun 2011
Posts: 10

Posted: Thu Jul 21, 2011 4:47 pm    Post subject: If I use multiple GPUs, can I set device variables globally?

Am I right?

If I try to use multiple GPUs in an MPI environment (each CPU has access to one GPU), device memory should be allocated after cudasetdevice(gpuid) or cula_selectdevice(gpuid). Otherwise, copying data from the CPUs to the GPUs would result in an error, because before the device ID is specified, all device memory is allocated on device 0 by default.

If, in my project, I want to use MPI, multiple GPUs, CULA device routines, and CUDA Fortran kernel functions, can I declare all the device variables in a module to make them global, and then use that module in other subroutines?

Thank you very much.
mkcolg



Joined: 30 Jun 2004
Posts: 6123
Location: The Portland Group Inc.

Posted: Fri Jul 22, 2011 8:51 am

Hi bsb3166,

I may not fully understand the question but I'll do my best to answer.

Quote:
If I try to use multiple GPUs in an MPI environment (each CPU has access to one GPU), device memory should be allocated after cudasetdevice(gpuid) or cula_selectdevice(gpuid). Otherwise, copying data from the CPUs to the GPUs would result in an error, because before the device ID is specified, all device memory is allocated on device 0 by default.
In 10.8 or later, the device context is created at the time of first use. So after you set the device number via cudasetdevice, the device will be associated the first time you allocate an array, copy data to the device, or launch a kernel. Note that you shouldn't use cula_selectdevice since it will attempt to create a different context. For details on using CULA from CUDA Fortran, the following two articles may be helpful.

Using GPU-enabled Math Libraries with PGI Fortran
Using the CULA GPU-enabled LAPACK Library with CUDA Fortran
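For example, here's a minimal sketch of that ordering in an MPI program (the rank-to-device mapping and the array name are just illustrative, not from your code):

Code:
! Sketch: bind each MPI rank to a GPU, then let the first allocation
! create the context on that device.
program set_device_per_rank
   use cudafor
   use mpi
   implicit none
   integer :: ierr, rank, ndev, istat
   real, device, allocatable :: a_dev(:)   ! illustrative device array

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   istat = cudaGetDeviceCount(ndev)
   istat = cudaSetDevice(mod(rank, ndev))  ! one GPU per rank

   allocate(a_dev(1024))   ! first use: the context is created on this device
   ! ... kernels, copies, and library calls for this rank ...

   deallocate(a_dev)
   call MPI_Finalize(ierr)
end program set_device_per_rank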

Quote:
If, in my project, I want to use MPI, multiple GPUs, CULA device routines, and CUDA Fortran kernel functions, can I declare all the device variables in a module to make them global, and then use that module in other subroutines?
You can have device module data that is accessible to the host and to all device subroutines within the same module. If you have release 11.4 or later and a device with compute capability 2.0 or higher, device module data is also accessible by device routines in other modules.

Keep in mind, though, that this data is not global across MPI processes. Each MPI process will have its own copy of the data, so it's up to your program to keep the data coherent. That said, this is normal for MPI programming.
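As a rough sketch of what I mean by device module data (the module name, array, and kernel below are made up for illustration):

Code:
module dev_data
   use cudafor
   implicit none
   real, device, allocatable :: a_dev(:)   ! device module data, visible to the
                                           ! host and to kernels in this module
contains
   attributes(global) subroutine scale(n, s)
      integer, value :: n
      real, value :: s
      integer :: i
      i = threadIdx%x + (blockIdx%x - 1) * blockDim%x
      if (i <= n) a_dev(i) = a_dev(i) * s   ! kernel reads/writes the module array
   end subroutine scale
end module dev_data

Any host routine that does "use dev_data" can allocate a_dev, assign to it, and launch scale; with 11.4 and a compute capability 2.0 device, device routines in other modules can reference a_dev as well.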

Hope this helps,
Mat
bsb3166



Joined: 27 Jun 2011
Posts: 10

Posted: Fri Jul 22, 2011 11:19 am

Dear Mat,

Quote:
I may not fully understand the question but I'll do my best to answer.


Quote:
In 10.8 or later, the device context is created at the time of first use. So after you set the device number via cudasetdevice, the device will be associated the first time you allocate an array, copy data to the device, or launch a kernel. Note that you shouldn't use cula_selectdevice since it will attempt to create a different context. For details on using CULA from CUDA Fortran, the following two articles may be helpful.


The problem I ran into is with a simple test code that combines MPI, CUDA Fortran, and CULA routines. I run it on 4 CPUs, each connected to its own GPU over PCIe.

This version gives a segmentation fault (can't find memory address):
Code:
real,device :: sd(3)

cula_status = cula_selectdevice(gpuid)
call check_status(cula_status)

cula_status = cula_initialize()
call check_status(cula_status)

info=cula_device_cgesvd('a','a', M, N, ad, LDA, sd,ud, LDU,vtd, LDVT)


This version works fine:
Code:
real,allocatable,device :: sd(:)

cula_status = cula_selectdevice(gpuid)
call check_status(cula_status)

allocate( sd(3) )

cula_status = cula_initialize()
call check_status(cula_status)

info=cula_device_cgesvd('a','a', M, N, ad, LDA, sd,ud, LDU,vtd, LDVT)


The difference is that in the first version I declare sd(3) with a fixed size, so it is not allocatable, before I call cula_selectdevice(); in the second version I declare sd(:) as allocatable and only allocate the memory after calling the select-device routine.

My guess is that the segmentation fault comes from real,device :: sd(3) trying to allocate memory before a device has been selected for each of the 4 CPU processes (every GPU is supposed to get its own copy of sd(3)).

This made me think about another segmentation fault in my project, which uses the same MPI + CUDA Fortran + CULA combination.

I have several device variables declared in a module as allocatable.
Code:
module acm_dev
    use numz, only : b4
    use cudafor
   
     complex(b4), device, allocatable :: c_dev(:,:),b_dev(:,:)
     complex(b4), device, allocatable :: eps_dev(:),cnray_dev(:)
     
     complex(b4), device, allocatable :: base_dev(:,:) ! constant
     complex(b4), device, allocatable :: material_dev(:) ! constant
     complex(b4), device, allocatable :: ei_dev(:) ! constant
     
     integer, device, allocatable :: gene_dev(:,:)

end module acm_dev


and I have a subroutine to allocate them and then copy the data from CPU to GPU.
Code:
subroutine laser_sub_1( )
        use acm_dev
        use cudafor
        allocate( c_dev(nmax,nmax), b_dev(nmax,1) )
        allocate( eps_dev(nmax), cnray_dev(nmax), ei_dev(nmax) )     
        allocate( base_dev(nmax,nmax), material_dev(0:3) )
        allocate( gene_dev(gene_size,pop_size) )

        c_dev(1:nmax,1:nmax) = c(1:nmax,1:nmax)
        b_dev = b(1:nmax,1:1)
        eps_dev  = eps(1:nmax)
        cnray_dev = cnray(1:nmax)
        ei_dev = ei(1:nmax) 
        base_dev = base(1:nmax,1:nmax)
        material_dev = material(0:3)
        gene_dev = gene(1:gene_size,1:pop_size)
end subroutine laser_sub_1


I only define the subroutine laser_sub_1; I never call it. Even so, it seems to affect my program's execution. The program runs on 4 cores with 4 GPUs.

When those assignment statements that copy data from CPU to GPU are present, the program runs about halfway and then breaks down. The error is a segmentation fault, memory address not mapped.

When I comment out those assignment statements, the program runs perfectly.

My project extends a CPU-only PGA code to a GPU version to boost performance, and the CPU version has no memory faults, so I guess it might be a similar issue to the simple case I described above.

It still really confuses me: there should be no difference if I just leave those lines there without using them, but apparently there is.

If I change my subroutine

from
Code:
module acm_dev
    use numz, only : b4
    use cudafor
   
     complex(b4), device, allocatable :: c_dev(:,:),b_dev(:,:)
     complex(b4), device, allocatable :: eps_dev(:),cnray_dev(:)
     
     complex(b4), device, allocatable :: base_dev(:,:) ! constant
     complex(b4), device, allocatable :: material_dev(:) ! constant
     complex(b4), device, allocatable :: ei_dev(:) ! constant
     
     integer, device, allocatable :: gene_dev(:,:)

end module acm_dev

(The module acm_dev is in a separate file; the device variables are global allocatables.)
Code:
subroutine laser_sub_1( )
        use acm_dev
        use cudafor
        allocate( c_dev(nmax,nmax), b_dev(nmax,1) )
        allocate( eps_dev(nmax), cnray_dev(nmax), ei_dev(nmax) )     
        allocate( base_dev(nmax,nmax), material_dev(0:3) )
        allocate( gene_dev(gene_size,pop_size) )

        c_dev(1:nmax,1:nmax) = c(1:nmax,1:nmax)
        b_dev = b(1:nmax,1:1)
        eps_dev  = eps(1:nmax)
        cnray_dev = cnray(1:nmax)
        ei_dev = ei(1:nmax) 
        base_dev = base(1:nmax,1:nmax)
        material_dev = material(0:3)
        gene_dev = gene(1:gene_size,1:pop_size)
end subroutine laser_sub_1


to

Code:
     subroutine laser_sub_1( )
        use numz, only : b4   ! kind parameter, as in the module version above
        use cudafor           ! needed for cudasetdevice

        complex(b4), device, allocatable :: c_dev(:,:),b_dev(:,:)
        complex(b4), device, allocatable :: eps_dev(:),cnray_dev(:)
       
        complex(b4), device, allocatable :: base_dev(:,:) ! constant
        complex(b4), device, allocatable :: material_dev(:) ! constant
        complex(b4), device, allocatable :: ei_dev(:) ! constant
       
        integer, device, allocatable :: gene_dev(:,:)   
         
        info=cudasetdevice(gpuid)       
 
        allocate( c_dev(nmax,nmax), b_dev(nmax,1) )
        allocate( eps_dev(nmax), cnray_dev(nmax), ei_dev(nmax) )     
        allocate( base_dev(nmax,nmax), material_dev(0:3) )
        allocate( gene_dev(gene_size,pop_size) )


 
        c_dev(1:nmax,1:nmax) = c(1:nmax,1:nmax)
        b_dev = b(1:nmax,1:1)
        eps_dev  = eps(1:nmax)
        cnray_dev = cnray(1:nmax)
        ei_dev = ei(1:nmax) 
        base_dev = base(1:nmax,1:nmax)
        material_dev = material(0:3)
        gene_dev = gene(1:gene_size,1:pop_size)
       
     end subroutine laser_sub_1


my program works fine again.

The difference is that the latter subroutine follows the steps in order: first declare the allocatable device variables, then set the device for each CPU process, allocate the memory, and copy from CPU to GPU. (The code is MPI code, and again this subroutine just sits there; I never call it.)

Thank you, Mat. This thread has gotten very long; thank you for your patience.

Chong
mkcolg



Joined: 30 Jun 2004
Posts: 6123
Location: The Portland Group Inc.

Posted: Fri Jul 22, 2011 2:21 pm

Hi Chong,

Quote:
cula_status = cula_selectdevice(gpuid)
I haven't done much with CULA myself, but I don't think you want to use this function. I suspect it creates a separate context, performs its own initialization, and either wipes out or prevents ours from being performed. I'm not sure, but I'd try just using cudaSetDevice and the CULA CUDA Fortran module (i.e. 'use cula') as shown in the articles I posted about.
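Something along these lines, for example (just a sketch; I'm assuming the 'cula' module exposes cula_initialize and the cula_device_* routines as in those articles, and that gpuid and check_status come from your code):

Code:
subroutine init_cula_on_gpu(gpuid)
   use cudafor
   use cula                          ! CULA's CUDA Fortran interface module
   implicit none
   integer, intent(in) :: gpuid
   integer :: istat, cula_status
   real, device, allocatable :: sd(:)

   istat = cudaSetDevice(gpuid)      ! bind this MPI process to its GPU first
   cula_status = cula_initialize()   ! CULA then initializes in that context
   call check_status(cula_status)

   allocate(sd(3))                   ! device memory now lands on gpuid
   ! ... cula_device_cgesvd and other device-interface calls go here ...
end subroutine init_cula_on_gpu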
Quote:

module acm_dev is in a separate file. Global allocatable device variables.
What type of GPUs do you have? To support this feature you need the unified memory model, which is available only on a Fermi card (i.e. compute capability 2.0).

- Mat
bsb3166



Joined: 27 Jun 2011
Posts: 10

Posted: Mon Jul 25, 2011 10:44 am

mkcolg wrote:
I'm not sure, but I'd try just using cudaSetDevice and the CULA CUDA Fortran module (i.e. 'use cula') as shown in the articles I posted about.


I posted on the CULA forum to ask whether the approach below
Code:
cuda_status = cudasetdevice(gpuid)
call check_status_cuda(cuda_status )

cula_status = cula_initialize()
call check_status(cula_status)


is OK for running CULA on multiple GPUs. A CULA engineer replied that it should work, and said that culaSelectDevice is merely a passthrough for cudaSetDevice, so that you are not forced to invoke the CUDA toolkit if you only need CULA's host interface.

So I tried again, using cudasetdevice(gpuid) instead of cula_selectdevice(gpuid), but the same problem occurred.

This brings up the question: does cula_initialize() create a separate context as well?
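To narrow this down, I'm thinking of checking which device is current before and after cula_initialize() with cudaGetDevice from the cudafor module. A rough sketch (assuming the cula interface module, and reusing the gpuid and check_status from my code above):

Code:
subroutine check_current_device(gpuid)
   use cudafor
   use cula                            ! assuming CULA's Fortran interface module
   implicit none
   integer, intent(in) :: gpuid
   integer :: istat, cula_status, dev_before, dev_after

   istat = cudaSetDevice(gpuid)
   istat = cudaGetDevice(dev_before)   ! expect gpuid here

   cula_status = cula_initialize()
   call check_status(cula_status)

   istat = cudaGetDevice(dev_after)    ! if this differs from dev_before,
                                       ! something changed the current device
   print *, 'device before/after cula_initialize:', dev_before, dev_after
end subroutine check_current_device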