PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

Problems on OpenMP and multi-GPU
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling
View previous topic :: View next topic  
Author Message
Weixiao



Joined: 28 Mar 2012
Posts: 3

PostPosted: Mon Aug 13, 2012 3:04 pm    Post subject: Problems on OpenMP and multi-GPU Reply with quote

Greetings,

I'm a researcher on molecular dynamics, and I'm trying to write a CUDA-Fortran code to get the sum of thousands of pair potentials. As the number of pairs is huge in my system, I have to use multiple GPUs. The idea is to separate the whole system into two parts. I have two GPUs in my computational node, TESLA C2070 and GT 440, and I try to make them work together by OpenMP.

Now I get an error:
0: copyout Memcpy (host=0x7f49880016c0, dev=0x200300800, size=56) FAILED: 4(unspecified launch failure)

The main program is:

Code:
!$omp parallel &
   !$omp private(iCPU, iStart, iEnd, uTot, nBlockSP, prop)
   !$omp master
   nCPUs = omp_get_num_threads()
   print*,'# of CPU(s):',nCPUs
   !$omp end master
   !$omp barrier
   
   iCPU = omp_get_thread_num()
   istat = cudaSetDevice(iCPU)
   call iniGPU() !! copy some parameters to the device
   call uptAllHostToDev() !! copy coordinates to the device
   !$omp barrier
   

   if(iCPU == 0)then
      iStart = 1
      iEnd = nPartA
      nBlockSP = nBlocksA
   else
      iStart = nPartA + 1
      iEnd = nPartsTot
      nBlockSP = nBlocksB
   end if
   
   call getFullEnSP(iStart, iEnd, nBlockSP, uTot)
   print*,iCPU, uTot
   
   istat=cudaThreadExit()
!$omp end parallel


where nBlocksA=7(on device 0: TESLA) and nBlocksB=2(on device 1 GT 440). And the subroutine 'getFullEnSP' is defined as:

Code:
subroutine getFullEnSP(iStart, iEnd, nBlockSP, uTot)
   
   real*8 :: uTot
   integer :: nBlockSP,iStart,iEnd
   
   integer :: iPart
   integer, device :: d_iPart, d_iStart, d_iEnd
   real*8, allocatable, dimension(:), device :: d_block
   real*8, allocatable, dimension(:) :: blockEnergy
   integer :: ierr
   
   allocate(blockEnergy(nBlockSP),d_block(nBlockSP))
   uTot = 0.d0
   blockEnergy = 0.d0
   d_block = 0.d0
   !! GPU total energy
   do iPart=1, nPartsTot
      d_iPart = iPart
      d_iStart = iStart
      d_iEnd = iEnd
      call getPartEnSP<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_block)
      blockEnergy(:) = d_block(:) !!!***ERROR IS HERE!***
      uTot = uTot + sum(blockEnergy)
   end do
   deallocate(blockEnergy,d_block)
   
   uTot = uTot / 2.d0
   print*,uTot
end subroutine


According to the error information, it is easy to locate the error in the subroutine:

blockEnergy(:) = d_block(:)
(Here 7*size(real*8) is 56.)

It's OK if I use only one device, but once I try to use 2, here comes the error.

Does anyone has suggestions?

Thanks in advance
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6141
Location: The Portland Group Inc.

PostPosted: Tue Aug 14, 2012 9:49 am    Post subject: Reply with quote

Hi Weixiao,
Quote:

call getPartEnSP<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_block)
blockEnergy(:) = d_block(:) !!!***ERROR IS HERE!***

The error is most likely coming from your kernel and not the copy. Add a call to "cudaGetLastError" just after your kernel launch to see if you can catch this error.

Since the code does work with a single OpenMP thread, I'm guessing that there is some shared global device data that's being used by both threads. This of course causes problems since multiple GPU context can't share data.

If you can post a complete example of your code or send the full version to PGI Customer service (trs@pgroup.com), I can help find the problem. Otherwise, look at your kernel to see if it's accessing any device module variables that have not been made OpenMP private.

- Mat
Back to top
View user's profile
Weixiao



Joined: 28 Mar 2012
Posts: 3

PostPosted: Tue Aug 14, 2012 10:44 am    Post subject: Code sent Reply with quote

Hi, Mat!

Thanks a lot for your reply. I have sent the code, and please help to check it.

Best,
Weixiao.
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6141
Location: The Portland Group Inc.

PostPosted: Tue Aug 14, 2012 2:39 pm    Post subject: Reply with quote

Hi Weixiao,

I see a number of fundamental issues. First, you initialize the device and allocated data before entering the OpenMP region and setting the Cuda device. You need to wait to use the device before you set the context, otherwise a context is implicitly set.

Second, re-evaluate how you are doing your domain decomposition. It looks like of the you divide up the 4000 blocks with the first 3200 on the first GPU and 800 on the second. More importantly, you create an device arrays of the partitioned blocks but don't adjust you indexing. Hence each openMP thread is is looping through 1 to nPartTot but each array is only a subset of nPartTot.

That's all the time I have for today, and unfortunately the code still breaks. Hopefully you can work out the last problems.

- Mat
Back to top
View user's profile
Weixiao



Joined: 28 Mar 2012
Posts: 3

PostPosted: Wed Aug 15, 2012 12:55 pm    Post subject: A naive way to solve the problem Reply with quote

Hi, Mat!

Thanks for your last reply. I have fixed the bug that to initialize the device too early.

I found a naive way to solve the last problem:

Copy the kernel function 'getPartEnSP' and rename it as 'getPartEnSP2', then use

Code:
do iPart=1, nPartsTot
      d_iPart = iPart
      if(omp_get_thread_num == 0) then
         call getPartEnSP<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_nBlockSP, d_block)
      else
         call getPartEnSP2<<<nBlockSP,nThreadsPerBlock>>>(d_iPart, d_iStart, d_iEnd, d_nBlockSP, d_block)
      end if
      blockEnergy(:) = d_block(:)
      uTot = uTot + sum(blockEnergy)

end do


to run then seperately.

Although it is not an intelligent way, it could work somehow. Do you have any idea about reusing the kernel instead of renaming it?

Best,
Weixiao.
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group