PGI User Forum


alloc of pinned memory has to be _after_ setting device

 
TroelsH

Joined: 24 Mar 2010
Posts: 9

Posted: Thu Aug 19, 2010 4:55 am    Subject: alloc of pinned memory has to be _after_ setting device

I have been experimenting with changing devices in an MPI-parallelized CUDA Fortran code. For a while I kept running into a seg fault when trying to transfer certain arrays to the GPU after changing device.

It turns out that not only does all device memory have to be reallocated (logical, since we are clearing the GPU), but pinned memory has to be reallocated as well. Is that the expected behavior?

The seg fault happens when accessing the pinned data in any way, whether copying it to the device or touching it on the host side.

The array is still marked as allocated, though, and keeps its shape.

I believe the correct behavior would be either that the pinned array is marked as unallocated or that its data is still available.

I tested with version 10.8 of the compiler. My workaround is to select the device at the very beginning of the program, but it would be nice to have a consistent data state (i.e. either unaffected by resetting the device or automatically unallocated).

For illustration, the following program works fine:

Code:

PROGRAM test_set_device
  USE cudafor
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr
  ierr = cudaThreadExit(); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ierr = cudaSetDevice(0); if (ierr > 0) print *,cudaGetErrorString(ierr)
  allocate( x(10))
  allocate(gx(10))
  print *, allocated(x), shape(x)
  gx = x
END


while this one seg faults at the "gx = x" line:

Code:

PROGRAM test_set_device
  USE cudafor
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr
  allocate( x(10))
  ierr = cudaThreadExit(); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ierr = cudaSetDevice(0); if (ierr > 0) print *,cudaGetErrorString(ierr)
  allocate(gx(10))
  print *, allocated(x), shape(x)
  gx = x
END


and this one seg faults at the "y = x(1)" line:

Code:

PROGRAM test_set_device
  USE cudafor
  real :: y
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr
  allocate( x(10))
  x(1) = 1
  ierr = cudaThreadExit(); if (ierr > 0) print *,cudaGetErrorString(ierr)
  ierr = cudaSetDevice(0); if (ierr > 0) print *,cudaGetErrorString(ierr)
  allocate(gx(10))
  print *, allocated(x), shape(x)
  y = x(1)
END
mkcolg

Joined: 30 Jun 2004
Posts: 6146
Location: The Portland Group Inc.

Posted: Thu Aug 19, 2010 11:29 am

Hi TroelsH,

While pinned memory lives on the host side, it is managed by the CUDA driver. When you destroy your context via the cudaThreadExit call, the CUDA driver destroys this data as well. Hence, this behavior is expected.

The simple workaround is to not use pinned memory here. Yes, you will lose some performance, but x's data will be managed by the host and will not be destroyed when you change context.

A question for you: why are you calling cudaThreadExit? It destroys all created contexts. Are you trying to use OpenMP to drive multiple GPUs and want to share 'x' across them? If so, try setting the devices in parallel first, before allocating any data.

For example:
Code:
% cat test.cuf
PROGRAM test_set_device
  USE cudafor
  use omp_lib
  real, pinned, allocatable, dimension(:) :: x
  real, device, allocatable, dimension(:) :: gx
  integer :: ierr, tnum

! Create your device context in parallel
!$omp parallel private(tnum)
  tnum = omp_get_thread_num()
  print *, 'TNUM:', tnum
  ierr = cudaSetDevice(tnum); if (ierr > 0) print *,cudaGetErrorString(ierr)
!$omp end parallel

! Perform initialization
  allocate(x(10))
  x= 10.1

! Execute the main problem in parallel
!$omp parallel private(tnum)
  tnum = omp_get_thread_num()
  allocate(gx(10))
  gx = x
  print *, tnum, allocated(x), shape(x), x
!$omp end parallel

END

% pgf90 -fast test.cuf -o test.out -V10.8 -mp
% setenv OMP_NUM_THREADS 4
% test.out
 TNUM:            0
 TNUM:            3
 TNUM:            2
 TNUM:            1
            0  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
            3  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
            2  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
            1  T           10    10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000        10.10000
    10.10000        10.10000        10.10000
%               


Hope this helps,
Mat
TroelsH

Joined: 24 Mar 2010
Posts: 9

Posted: Fri Aug 20, 2010 9:31 am

Hi Mat,

Thanks for your answer. It makes good sense.

We have four GPUs per node and use MPI for the parallelization. I use cudaThreadExit + cudaSetDevice to make sure that each MPI process gets the correct (and unique!) GPU device.
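A minimal sketch of that device-per-rank setup (not from the original thread; it assumes a node-contiguous rank ordering, and the mod(rank, ndev) mapping is only a placeholder for however local ranks are assigned). The key point is that cudaSetDevice comes before any pinned allocation:

Code:

```fortran
PROGRAM mpi_set_device
  USE cudafor
  USE mpi
  IMPLICIT NONE
  integer :: ierr, rank, ndev, dev
  real, pinned, allocatable, dimension(:) :: x

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  ierr = cudaGetDeviceCount(ndev)
  dev  = mod(rank, ndev)       ! placeholder: map this rank to a device on its node
  ierr = cudaSetDevice(dev)    ! select the device BEFORE allocating pinned memory
  if (ierr > 0) print *, cudaGetErrorString(ierr)

  allocate(x(10))              ! pinned allocation now belongs to this context
  ! ... main work ...
  deallocate(x)
  call MPI_Finalize(ierr)
END PROGRAM mpi_set_device
```

Because the device is selected once per process and never reset, the pinned allocation survives for the lifetime of the context.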

I still think that if the CUDA driver destroys the pinned arrays, CUDA Fortran should mark them as deallocated, which is not the case now.

best,

Troels
mkcolg

Joined: 30 Jun 2004
Posts: 6146
Location: The Portland Group Inc.

Posted: Fri Aug 20, 2010 12:53 pm

Quote:
they should be marked as deallocated by CUDA Fortran, which is not the case now.
I agree and have added a feature request (TPR#17189) to perform garbage collection of device and pinned memory after a call to cudaThreadExit is made.

Thanks,
Mat
Back to top
View user's profile