PGI User Forum


Memcpy and seg fault problems when combining OpenMP and CUDA

 
alfvenwave
Joined: 08 Apr 2010   Posts: 79

Posted: Tue Apr 13, 2010 3:27 am    Post subject: Memcpy and seg fault problems when combining OpenMP and CUDA

Hi.

Can anyone explain this? I have a simple code below. I am attempting to transfer some largish data arrays to 3 GPUs, having attached a single OpenMP thread to each. The code produces the expected result if I don't compile with OpenMP, and it also works fine with three OpenMP threads if I decrease the size of the arrays being transferred. My problem is that the arrays being transferred are not actually that big, and should easily fit in the device memory of a C1060 card. The host machine has 24GB of RAM, so that shouldn't be a problem either. You can see in the code that transferring the data in a loop outside the OpenMP parallel region works fine (Transfer loop 1), but transferring the data inside the OpenMP loop (Transfer loop 2) causes a segmentation fault. This is what I get:

export OMP_NUM_THREADS=3
./a.out

Start....
Transfer loop 1 has completed.
Segmentation fault

Here is the code. Can anyone please help?


Thanks,

Rob.

Code:
subroutine memtransfer( Fh )

   use cudafor

   implicit none

   real         :: Fh(20000000)
   real, device :: Fd(20000000)
   integer      :: iflag,idev

   Fd = Fh

!  -> Call device kernel, change Fd in some way....
!     Fill with device id for test....

   iflag = cudaGetDevice(idev)

   Fd    = (idev+1)**2

!  -> Transfer data back to OpenMP host thread:

   Fh = Fd

end

program memexample

   use cudafor

   implicit none

   real         :: Fh(20000000), Fsum(20000000)
   real, device :: Fd(20000000)

   integer      :: i,iflag

   Fh = 0.0

   print*, 'Start....'

   do i = 0,2
      iflag = cudaSetDevice(i)
      Fd = Fh
   enddo

   print*, 'Transfer loop 1 has completed.'

!$OMP PARALLEL PRIVATE(i,Fh) SHARED(Fsum)
!$OMP DO

      do i=0,2

         iflag = cudaSetDevice(i)
         call memtransfer(Fh)

         iflag = cudaThreadSynchronize()

!        Sum up results:

         Fsum = Fsum + Fh

      enddo

!$OMP END DO
!$OMP END PARALLEL

   print*, 'Transfer loop 2 has completed.'

   print*, 'Result (should be = 14.0 for 3 OpenMP threads, 3.0 if no openMP):', Fsum(1)

end

alfvenwave
Posted: Tue Apr 13, 2010 3:43 am

Some extra info: the largest arrays I can transfer without causing a crash are 2618746 elements long. At 4 bytes per single-precision float, that's 2618746 x 4 = 10,474,984 bytes, i.e. just under 10MB. It's my understanding that a C1060 has ~4GB of device memory, so is there a 10MB data transfer limit I should know about?

Rob.

alfvenwave
Posted: Tue Apr 13, 2010 4:14 am

I think I might have solved my own problem. Through sheer luck I have found that if I increase the stack limit (ulimit -s) above its default of 10240KB, I can transfer larger arrays. Does anyone have any idea why the stack limit imposes a 10MB cap inside the OpenMP loop, but not outside it?

Rob.

mkcolg
Joined: 30 Jun 2004   Posts: 6134   Location: The Portland Group Inc.

Posted: Tue Apr 13, 2010 12:57 pm

Hi Rob,

You're hitting the OpenMP per-thread stack size limit when entering the OpenMP region. The compiler tries to allocate a private copy of "Fh" on each thread's stack, and this overflows the stack. The default OpenMP stack size is 8MB, but it can be increased using the environment variable OMP_STACKSIZE. Also, on Linux and OS X, if your shell's stack size limit is set higher than 8MB, that value is used instead. The exception is a limit of "unlimited", in which case the default 8MB is used.
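
For example, from a bash shell (the sizes here are just illustrative; each private copy of "Fh" is 20,000,000 reals, i.e. 80MB, so you need something comfortably above that):
Code:

# raise the per-thread OpenMP stack size, then run
export OMP_STACKSIZE=128M
./a.out

# ...or raise the shell's stack limit instead (value in KB)
ulimit -s 131072
./a.out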

Also, this code isn't doing what you want:
Code:

   do i = 0,2
      iflag = cudaSetDevice(i)
      Fd = Fh
   enddo
When a static device variable is first declared (in this case, Fd is declared at the start of your program), its space is allocated on the current device. So although you change the device number inside the loop, Fd only ever gets allocated on the default device; there aren't three copies.

However, in "memtransfer" the local "Fd" array (which is distinct from the main program's Fd) will be allocated three times, once for each device, since the routine is called from within the OpenMP region after each thread has set its device.

Memory cannot be shared between nor split across multiple devices. Hence, when working with OpenMP and CUDA, it's best to encapsulate your CUDA code within subroutines called from the OpenMP region.

The pseudo-code for a mixed OpenMP/CUDA program would be something like this:
Code:

- Start an OpenMP Region
-- Set the CUDA Device Number for each thread
-- Divide the work amongst the threads.
-- Call a routine containing the CUDA code.
- End the OpenMP Region
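
Here's a minimal CUDA Fortran sketch of that pattern. The names ("process_on_device", "multi_gpu") are just illustrative, and I'm using an allocatable device array so that each thread's allocation lands on the device it selected:
Code:

! Illustrative sketch: one OpenMP thread per device, with the CUDA
! code encapsulated in a subroutine called from the parallel region.
subroutine process_on_device( Fh, n )
   use cudafor
   implicit none
   integer :: n
   real    :: Fh(n)
   real, device, allocatable :: Fd(:)
   integer :: istat, idev

   allocate(Fd(n))              ! allocated on this thread's current device
   Fd = Fh                      ! host -> device copy
   istat = cudaGetDevice(idev)
   Fd = (idev+1)**2             ! stand-in for a real kernel launch
   Fh = Fd                      ! device -> host copy
   deallocate(Fd)
end subroutine process_on_device

program multi_gpu
   use cudafor
   implicit none
   integer, parameter :: n = 20000000
   real, allocatable  :: Fh(:,:)   ! one column per device, heap-allocated
   integer :: i, istat, ndev

   istat = cudaGetDeviceCount(ndev)
   allocate(Fh(n,ndev))            ! heap, not stack, so no per-thread stack copies
   Fh = 0.0

!$OMP PARALLEL DO PRIVATE(istat)
   do i = 1, ndev
      istat = cudaSetDevice(i-1)   ! bind this thread to one device
      call process_on_device(Fh(:,i), n)
   enddo
!$OMP END PARALLEL DO

   print *, 'Per-device results:', Fh(1,:)
end program multi_gpu

Since "Fh" is heap-allocated and shared here (each thread writes its own column), you also avoid the per-thread stack copies that were causing your seg fault.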


Hope this helps,
Mat

alfvenwave
Posted: Wed Apr 14, 2010 12:55 am

Thank you so much for pointing that out, Mat. Clearly I've got a lot to learn!

Thanks again,

Rob.