PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

Is it possible to use both OpenMP + CUDA in PGI fortran ?

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
yangmh



Joined: 01 Jul 2005
Posts: 5

PostPosted: Thu Dec 16, 2010 5:56 pm    Post subject: Is it possible to use both OpenMP + CUDA in PGI fortran ? Reply with quote

Dear All,

My jobs could be described like this,

do i=1,n
! copy an arry from host to device
dA = A(*,i)
! perform calculations on device
call sub1 <<<dim,block>>> (dA)
enddo

Because the device memory can't save the whole arry A, the calculation should devided into "n" stpes. Because the copying is a bottleneck, I am trying to rewritte the code liking,

! copy a part of data at first
dA1=A(*,1)

do i=1,n

! begin openmp
!omp parallel ....
!omp do
do icore=1,ncore
if (icore.eq.1) then
! copy an arry from host to device
if (i.lt.n) then
if (mod(i,2).eq.1) then
dA2 = A(*,i+1)
else
dA1 = A(*,i+1)
endif
endif
elseif (core.eq.2) then
if (mod(i,2).eq.1) then
call sub1 <<<dim,block>>> (dA1)
else
call sub1 <<<dim,block>>> (dA2)
endif
endif
!omp end do
!omp end parallel
enddo

Here I use two device arrays, and hope to overlap the data transfer and GPU calculation, liking MPI pragraming.

However, the program is failed, I don't know does PGI fortran could support openmp+cuda ? does anyone has suggestions,

Thanks in advance
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

PostPosted: Thu Dec 16, 2010 6:20 pm    Post subject: Reply with quote

Hi yangmh,

Yes, OpenMP can be used with CUDA Fortran and works quite well. Actually my next PGInsider article will be a tutorial on it.

First you must understand that a CPU process/thread can only be attached to a single GPU so in order for a program to use multiple GPUs it must be done from OpenMP, MPI, or other higher parallel layer. Because of this, you need to have the CPU parallel region encompass the CUDA code. Have each process/thread attach to a device and then divide the problem between the devices. In other words, if you have two threads, you'd have two dA1's each holding half of A.

Here's some sample code from the OpenMP CUDA Fortran program I wrote for my next article:

Code:
subroutine lifeMain ()
   
#ifdef _CUDA
   use cudafor
#endif
   use omp_lib
   implicit none

   integer :: Nx, Ny, Nt, temp, count, steps, i, j, angle, nthds, tnum
        integer :: numdev, devnum, istat, myseed, blocksize
   integer, allocatable, DEVICE dimension(:,:) :: dOld, dNew
   integer, allocatable, dimension(:,:) :: A, B
   integer :: start, end
   real, allocatable, dimension(:,:) :: rand
   character(len=80) :: rfile
   integer :: alive
#ifdef _CUDA
       type(dim3) :: dimGrid, dimBlock
#else
        integer,dimension(3) :: dimGrid,dimBlock
#endif
   ! account for the halo
   Nx = NXSIZE+2
   Ny = NYSIZE+2
   myseed = SEED
   allocate(A(Nx,Ny),B(Nx,Ny),rand(Nx,Ny))
   call random_seed(myseed)
   call random_number(rand)
   A=0
   do i=2,Nx-1
      do j=2,Ny-1
        if (rand(i,j) < initPercent) then
           A(i,j) = 1
        else
                A(i,j) = 0
             endif
           enddo
        enddo
        B=A

   ! Set the initial conditions
   count = 0
   steps = 0
   temp = 0
   alive = 1      
#ifdef _CUDA
        istat = cudaGetDeviceCount(numdev)
#endif

!$omp parallel &
!$omp   shared(A,B,count,Nx,Ny,numdev,alive,steps), &
!$omp   private(tnum,dOld,dNew,dimGrid,dimBlock,devnum,istat,Nt, &
!$omp           i,j,blocksize,start,end)

#ifdef _OMP
   nthds = omp_get_num_threads()   
   tnum = omp_get_thread_num()
#else
        nthds = 1
        tnum=0
#endif

!$omp master
   
!$omp end master
!$omp barrier

   blocksize =  NXSIZE/nthds
   Nt = blocksize+2
   start = (tnum*blocksize)+1
   end = start + blocksize + 1

#ifdef _CUDA
        devnum = mod(tnum,numdev)
        istat = cudaSetDevice(devnum)
#endif
      allocate(dNew(Nt,Ny), dOld(Nt,Ny))   

        dOld=0
   dNew=0
#ifdef _CUDA
   dimBlock = dim3(16,16,1)
        dimGrid = dim3((Nt+15)/16,(ny+15)/16,1)
#endif

   do, while (steps.lt.16384.and.count.lt.10.and.alive.gt.0)

       dOld(1:Nt,1:Ny) = A(start:end,1:Ny)
            dNew=dOld
#ifdef _CUDA
            call life_kernel<<<dimGrid,dimBlock>>>(dOld,dNew,Nt,Ny)
#else
            call life(dOld,dNew,Nt,Ny)
#endif
       B(start+1:end-1,1:Ny) = dNew(2:Nt-1,1:Ny)
!$omp barrier

!$omp master
            A=B
       alive = sum(A)
!            if (mod(steps,100).eq.0) then
            print *, 'Step:', steps, ' Alive:', alive
!            endif
            if (temp.lt.(alive+2).and.temp.gt.(alive-2)) then
               count = count + 1
       else
          count = 0
            endif
            temp = alive
       steps = steps + 1

!   pause   

!$omp end master
!$omp barrier
    
   end do
   
   deallocate(dOld)
   deallocate(dNew)
#ifdef _CUDA
   istat=cudaThreadExit()
#endif

!$omp end parallel
 
   alive = sum(A)
   print *, 'Step:', steps, ' Alive:',alive
   print *, 'Init%:', initPercent, ' Actual%:', real(alive)/real(NXSIZE*NYSIZE)
   close(21)
   deallocate(A,B)
      
   stop

end subroutine lifeMain


Hope this helps,
Mat
Back to top
View user's profile
yangmh



Joined: 01 Jul 2005
Posts: 5

PostPosted: Thu Dec 16, 2010 11:22 pm    Post subject: Reply with quote

Dear Mat,

Thansk for your reply. From you code, I understand how to programming with OpenMP and CUDA.

In my work, only one GPU device is used. I plan to use two CPU threads. These two threads attach to the same GPU device. one CPU thread copies data from CPU memory to GPU memory (dA1) and another CPU thread operaters the data have been copied to GPU memory (dA2). In next step, the data copied from CPU memory in saved in GPU memory (dA1) and the GPU calculation is performed with GPU memory (dA1). Something likes MPI_ISEND to overlap the data transfer and the calculations.

I don't know how to do such a work.

Thanks for your help

Yangmh
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

PostPosted: Fri Dec 17, 2010 1:16 pm    Post subject: Reply with quote

Hi yangmh,

I don't think it's currently possible for multiple host threads to share a single device context but I'll confirm with my contacts at NVIDIA.

See: http://stackoverflow.com/questions/3300605/can-i-share-cuda-gpu-device-memory-between-host-processes

- Mat
Back to top
View user's profile
yangmh



Joined: 01 Jul 2005
Posts: 5

PostPosted: Fri Dec 17, 2010 5:30 pm    Post subject: Reply with quote

Mat,

I see. Thanks very much.

-YangMH
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group