PGI User Forum

combine the OpenMP with the OpenACC
Chia-WenChan52986



Joined: 25 Feb 2014
Posts: 11

Posted: Wed Apr 16, 2014 12:22 am    Post subject: combine the OpenMP with the OpenACC

I wrote a simple program that does matrix multiplication. For further study, I added a loop outside the matrix-multiply kernel. In this test, I would like to use OpenMP to parallelize the outer loop and OpenACC to parallelize the kernel loops. The code produces some errors. Can someone help me solve this problem?
The code follows. Thank you very much.

Code:

program main
    use accel_lib
    integer :: n        ! size of the vector   
    real,dimension(:,:),allocatable :: a ,b,c,c1
    real,dimension(:),allocatable :: csum 
     integer :: i,j,k,kk,mk
    integer :: t1,t2,thn
    real :: diff,st,pt,speedup
!$ integer:: omp_get_num_threads
!$ integer:: omp_get_num_procs
!$ thn=omp_get_num_procs()
!$ write(*,*) "The number of available processors/threads in the system: ",thn
thn=1    ! when OpenMP is not used
!$ write(*,*) "Enter the number of threads"
!$ read(*,*) thn
!$ call omp_set_num_threads(thn)             ! set the number of threads
!$call acc_init( acc_device_nvidia )
    n =512
    mk=16
    allocate(a(n,n),b(n,n),c(n,n),c1(n,n),csum(mk))
    do i=1,n
        do j=1,n
             a(i,j)=(i+j)/(i)
             b(i,j)=2*(i+j)/(i)
        end do
    end do   
    call system_clock( count=t1 )   
!$omp parallel do  shared(n,a,b),private(c,kk)
    do kk=1,mk                          !CPU processing
        c=0.0d0   
        do i=1,n
            do j=1,n
                do k=1,n
                    c(i,j)=c(i,j)+a(i,k)*b(k,j)*kk
                end do
            end do
        end do
        csum(kk)=sum(c)
    end do
!$omp end parallel do           
    write(*,*)  csum   
    csum=0.0d0
    call system_clock( count=t2 )
    st= (t2-t1)/1.0d6 
    print *, 'CPU time: ', st,  ' seconds'     
    call system_clock( count=t1 )
   
    call acc_init( acc_device_nvidia )       
!$omp parallel do  shared(n,a,b),private(c1,kk)
    do kk=1,mk                            !GPU processing   
        c1=0.0d0
        call obj(n,a,b,c1,kk)
        csum(kk)=sum(c1)
    end do 
!$omp end parallel do     
    write(*,*)  csum
    call system_clock( count=t2 )
    pt=(t2-t1)/1.0d6   
    print *, 'GPU time: ', pt,  ' seconds'   
    speedup=st/pt
    print *, 'speedup: ', speedup 
end program   

subroutine obj(n,a,b,c1,kk)
    implicit none
    integer, intent(in)::n,kk
    real, intent(in)::a(n,n),b(n,n)
    real, intent(out)::c1(n,n)
    integer::i,j,k

!$acc parallel loop
        do j=1,n
            do i=1,n
                do k=1,n
                    c1(i,j)=c1(i,j)+a(i,k)*b(k,j)*kk
                end do
            end do
        end do
!$acc end parallel loop

    return
end subroutine obj
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Wed Apr 16, 2014 5:43 pm

What's the error you're seeing? There's a performance problem in that you initialize the accelerator twice, but removing the second one fixes it. Also, since you have all the threads using the same accelerator, your relative speed-up will decrease as there's more contention on the device:


Code:
% pgf90 -mp -acc mp.f90 -V14.3 -Minfo=accel ; a.out
obj:
     70, Accelerator kernel generated
         71, !$acc loop gang ! blockidx%x
         72, !$acc loop vector(256) ! threadidx%x
     70, Generating present_or_copyin(b(:n,:n))
         Generating present_or_copyin(a(:n,:n))
         Generating present_or_copy(c1(:n,:n))
         Generating NVIDIA code
     72, Loop is parallelizable
     73, Complex loop carried dependence of 'c1' prevents parallelization
         Loop carried dependence of 'c1' prevents parallelization
         Loop carried backward dependence of 'c1' prevents vectorization
 The number of available processors/threads in the system:            12
 Enter the number of threads
2
   2.4596063E+09   4.9192125E+09   7.3787965E+09   9.8384251E+09
   1.2298042E+10   1.4757593E+10   1.7217190E+10   1.9676850E+10
   2.2136357E+10   2.4596085E+10   2.7055665E+10   2.9515186E+10
   3.1974867E+10   3.4434380E+10   3.6893897E+10   3.9353700E+10
 CPU time:     2.648998      seconds
   2.4596063E+09   4.9192125E+09   7.3787965E+09   9.8384251E+09
   1.2298042E+10   1.4757593E+10   1.7217190E+10   1.9676850E+10
   2.2136357E+10   2.4596085E+10   2.7055665E+10   2.9515186E+10
   3.1974867E+10   3.4434380E+10   3.6893897E+10   3.9353700E+10
 GPU time:    8.2851999E-02  seconds
 speedup:     31.97265
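
The two points in the reply above (initialize the device once, and expect contention when every OpenMP thread offloads to the same accelerator) can be sketched roughly as follows. This is a minimal sketch, not the code that produced the output above: the names `matmul_sketch`, `scaled_product_sum`, and `mp_acc_sketch`, the smaller sizes, the scalar accumulator `tmp`, and the explicit data clauses are all assumptions added for illustration. It uses the standard `openacc` module instead of PGI's `accel_lib` and assumes a compiler with both OpenMP and OpenACC enabled (for example the `-mp -acc` flags shown above).

```fortran
! Sketch only: names and sizes are illustrative, not taken from the thread.
module matmul_sketch
    implicit none
contains
    ! Returns sum(kk * matmul(a, b)). The work array c is local, so every
    ! OpenMP thread that calls this function gets its own copy.
    function scaled_product_sum(n, a, b, kk) result(total)
        integer, intent(in) :: n, kk
        real, intent(in) :: a(n,n), b(n,n)
        real :: total
        real, allocatable :: c(:,:)
        real :: tmp
        integer :: i, j, k

        allocate(c(n,n))
        ! Accumulating into a scalar avoids carrying a dependence on c
        ! through the k loop.
        !$acc parallel loop collapse(2) private(tmp) copyin(a, b) copyout(c)
        do j = 1, n
            do i = 1, n
                tmp = 0.0
                !$acc loop seq
                do k = 1, n
                    tmp = tmp + a(i,k) * b(k,j)
                end do
                c(i,j) = tmp * kk
            end do
        end do
        total = sum(c)
        deallocate(c)
    end function scaled_product_sum
end module matmul_sketch

program mp_acc_sketch
    use openacc          ! standard module; the thread's code uses accel_lib
    use matmul_sketch
    implicit none
    integer, parameter :: n = 256, mk = 8   ! assumed sizes
    real, allocatable :: a(:,:), b(:,:)
    real :: csum(mk)
    integer :: i, j, kk

    allocate(a(n,n), b(n,n))
    do j = 1, n
        do i = 1, n
            a(i,j) = (i + j) / i        ! same expressions as in the thread
            b(i,j) = 2 * (i + j) / i
        end do
    end do

    ! Initialize the device once, before the OpenMP region starts.
    call acc_init(acc_device_nvidia)

    ! All threads offload to the same device and so compete for it.
    !$omp parallel do shared(a, b, csum) private(kk)
    do kk = 1, mk
        csum(kk) = scaled_product_sum(n, a, b, kk)
    end do
    !$omp end parallel do

    write(*,*) csum
end program mp_acc_sketch
```

Because each call copies `a` and `b` to the device again, the sketch still pays for repeated transfers; it only illustrates where the single `acc_init` call goes and why threads sharing one device limits the relative speed-up.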
Chia-WenChan52986



Joined: 25 Feb 2014
Posts: 11

Posted: Wed Apr 16, 2014 6:58 pm

Thanks for your help. I used the compile command you gave me, and the compiler output is the same. However, I get the following error when I run the executable:

Code:
 The number of available processors/threads in the system:             8
 Enter the number of threads
2
   2.4596063E+09   4.9192125E+09   7.3787965E+09   9.8384251E+09
   1.2298042E+10   1.4757593E+10   1.7217190E+10   1.9676850E+10
   2.2136357E+10   2.4596085E+10   2.7055665E+10   2.9515186E+10
   3.1974867E+10   3.4434380E+10   3.6893897E+10   3.9353700E+10
 CPU time:     2.853000      seconds
call to cuModuleLoadData returned error 201: Invalid context


Can you tell me why? I appreciate your help.
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Thu Apr 17, 2014 7:52 am

What device do you have? Older cards don't support multiple host context creation.

- Mat
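
One approach sometimes tried when only the thread that called `acc_init` appears to have a usable device context is to have every OpenMP thread select its device itself, inside the parallel region. The sketch below shows only that pattern. It is an assumption for illustration, not a fix confirmed anywhere in this thread; the program name `per_thread_device_sketch` and the round-robin mapping are hypothetical, and zero-based device numbering is assumed.

```fortran
! Sketch only: not a confirmed fix for the "Invalid context" error above.
program per_thread_device_sketch
    use openacc
    use omp_lib
    implicit none
    integer :: ndev, tid, devnum

    ! Ask the OpenACC runtime how many NVIDIA devices it can see.
    ndev = acc_get_num_devices(acc_device_nvidia)
    if (ndev < 1) then
        write(*,*) "No NVIDIA device visible to the OpenACC runtime"
        stop
    end if

    !$omp parallel private(tid, devnum)
    tid = omp_get_thread_num()
    ! Round-robin mapping of threads to devices (zero-based numbering assumed).
    devnum = mod(tid, ndev)
    ! Each thread selects its device before it runs any OpenACC region.
    call acc_set_device_num(devnum, acc_device_nvidia)
    !$omp critical
    write(*,*) "thread", tid, "uses device", devnum
    !$omp end critical
    !$omp end parallel
end program per_thread_device_sketch
```

With a single GPU every thread maps to device 0, so this changes only which thread performs the device selection, not how much hardware is available.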
Chia-WenChan52986



Joined: 25 Feb 2014
Posts: 11

Posted: Mon Apr 21, 2014 10:57 pm

Hi Mat,
I use a GeForce GTX 780 Ti.
All times are GMT - 7 Hours
Page 1 of 2

 