PGI User Forum


Why is my OpenACC code slower than OpenMP?

 
catfishwolf



Joined: 31 Mar 2013
Posts: 8

Posted: Fri Jul 26, 2013 10:25 am    Post subject: Why is my OpenACC code slower than OpenMP?

Hi Everyone,

I am a newbie in accelerator programming. I ran into a problem when trying to compare the execution time of a simple one-dimensional vector addition accelerated with OpenMP against the same computation accelerated with OpenACC. To my surprise, the OpenMP version is far faster than the OpenACC version, no matter how big the array is. With the array size set to 2**26, OpenMP takes 73 ms while OpenACC needs 396 ms to complete the same computation.

Can anyone tell me what is wrong in my code? The code I used for this experiment is attached below.

Thanks,
Li

Code:
    subroutine saxpy_openmp(n,a,x,y)
    implicit none
    integer :: n,i
    real, intent(in) :: x(n),a
    real, intent(inout) :: y(n)
    !$omp parallel do
      do i=1,n
        y(i)=a*x(i)+y(i)     
      enddo
    !$omp end parallel do
    end subroutine saxpy_openmp
   
    subroutine saxpy(n,a,x,y)
    implicit none
    integer :: n, i
    real, intent(in) :: x(n), a
    real, intent(inout) :: y(n)
      do i=1,n
        y(i)=a*x(i)+y(i)   
      enddo
    end subroutine saxpy

    subroutine saxpy_openacc(m,a,x,y1)
    implicit none
    integer :: m, i
    real :: x(m), a
    real :: y1(m)
    !$acc kernels loop present(x,y1) ! x and y1 are already on the device via the caller's data region
      do i=1,m
        y1(i)=a*x(i)+y1(i)   
      enddo

    end subroutine saxpy_openacc   
   
    program p
     use lapack95
     use blas95
     use omp_lib
     use accel_lib
     implicit none
     integer, parameter :: m=2**26 ! don't set the power of 2 to exceed 26; m must be a parameter to dimension the arrays below
     real :: x(m),y1(m),y2(m),y3(m)
     integer :: r1,r0
     integer :: i,j
       
     do i=1,m
      y1(i)=1.0   
      y2(i)=1.0
      y3(i)=1.0
      x(i)=1.0
     enddo
     
     call system_clock(r0)
     call saxpy_openmp(m,2.0,x,y2)
     call system_clock(r1)     
     print*,' time: ',r1-r0
     do i=1,10
       print*,y2(i)     
     enddo
     
     call system_clock(r0)
     call saxpy(m,2.0,x,y3)
     call system_clock(r1)
     print*,' time: ',r1-r0
     do i=1,10
       print*,y3(i)     
     enddo   
     
     call acc_init( acc_device_nvidia )
     call system_clock(r0)   
     !$acc data copy(x(:),y1(:)) ! copy x and y1 to the device here; both are copied back at end data
     call saxpy_openacc(m,2.0,x,y1)
     !$acc end data
     call system_clock(r1)
     print*,' time: ',r1-r0
     do i=1,10
       print*,y1(i)     
     enddo     

    end program


Code:
-g -Bstatic -Mbackslash -mp -acc -I"C:\Program Files (x86)\Intel\Composer XE 2013\mkl\include" -I"C:\Program Files (x86)\Intel\Composer XE 2013\mkl\interfaces\lapack95\lapack95\include\intel64\lp64" -I"C:\Program Files (x86)\Intel\Composer XE 2013\mkl\interfaces\blas95\lib95\include\intel64\lp64" -I"c:\program files\pgi\win64\12.10\include" -I"C:\Program Files\PGI\Microsoft Open Tools 10\include" -I"C:\Program Files\PGI\Microsoft Open Tools 10\PlatformSDK\include" -I"C:\Program Files\PGI\win64\2012\cuda\4.2\include" -fastsse -Mipa=fast,inline -tp=bulldozer-64 -ta=nvidia,nowait,host -Minform=warn -Minfo=accel
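
By the way, the times I print are raw system_clock counts, which I read as milliseconds on my machine. A minimal sketch of making the units explicit (the rate variable is new here; count_rate is a standard optional argument of system_clock):

Code:
     ! Sketch: convert system_clock counts to milliseconds.
     integer :: r0, r1, rate
     call system_clock(count_rate=rate)  ! counts per second on this system
     call system_clock(r0)
     call saxpy_openmp(m,2.0,x,y2)
     call system_clock(r1)
     print*,' time (ms): ', 1000.0*real(r1-r0)/real(rate)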


Last edited by catfishwolf on Fri Jul 26, 2013 12:28 pm; edited 1 time in total
mkcolg



Joined: 30 Jun 2004
Posts: 6138
Location: The Portland Group Inc.

Posted: Fri Jul 26, 2013 11:21 am

Hi catfishwolf,

This is not too surprising. The problem here is the time spent on device data allocation, deallocation, and movement. So while your compute time goes down by quite a bit, the data overhead dominates the overall time.

While you'll see saxpy used as an example for OpenACC, it's actually not a great example for performance since there's not enough computation to justify the data costs. If I modify your example so that each routine is executed many times (I'm using 100 iterations below), then you'll see the GPU giving some speed-up.

Code:

   % cat test.f90
    subroutine saxpy_openmp(n,a,x,y)
    implicit none
    integer :: n,i
    real, intent(in) :: x(n),a
    real, intent(inout) :: y(n)
    !$omp parallel do
      do i=1,n
        y(i)=a*x(i)+y(i)     
      enddo
    !$omp end parallel do
    end subroutine saxpy_openmp
   
    subroutine saxpy(n,a,x,y)
    implicit none
    integer :: n, i
    real, intent(in) :: x(n), a
    real, intent(inout) :: y(n)
      do i=1,n
        y(i)=a*x(i)+y(i)   
      enddo
    end subroutine saxpy

    subroutine saxpy_openacc(m,a,x,y1)
    implicit none
    integer :: m, i
    real :: x(m), a
    real :: y1(m)
    !$acc kernels loop present(x,y1)
      do i=1,m
        y1(i)=a*x(i)+y1(i)   
      enddo

    end subroutine saxpy_openacc   
   
    program p
!     use lapack95
!     use blas95
     use omp_lib
     use accel_lib
     implicit none
     integer,parameter :: m=2**26 !don't set the power of 2 to exceed 26
     real :: x(m),y1(m),y2(m),y3(m)
     integer :: r1,r0
     integer :: i,j, iter
       
     do i=1,m
      y1(i)=1.0   
      y2(i)=1.0
      y3(i)=1.0
      x(i)=1.0
     enddo
     
     call system_clock(r0)
     do iter=1,100
     call saxpy_openmp(m,2.0,x,y2)
     enddo
     call system_clock(r1)     
     print*,' time: ',r1-r0
     do i=1,10
       print*,y2(i)     
     enddo
     
     call system_clock(r0)
     do iter=1,100
     call saxpy(m,2.0,x,y3)
     enddo
     call system_clock(r1)
     print*,' time: ',r1-r0
     do i=1,10
       print*,y3(i)     
     enddo   
     
     call acc_init( acc_device_nvidia )
     call system_clock(r0)   
     !$acc data copyin(x(:)), copy(y1(:))
     do iter=1,100
     call saxpy_openacc(m,2.0,x,y1)
     enddo
     !$acc end data
     call system_clock(r1)
     print*,' time: ',r1-r0
     do i=1,10
       print*,y1(i)     
     enddo     

    end program
% pgf90 -acc -Minfo=accel -fast test.f90 -V13.7 -mp ; a.out
saxpy_openacc:
     28, Generating present(x(:))
         Generating present(y1(:))
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     29, Loop is parallelizable
         Accelerator kernel generated
         29, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
p:
     75, Generating copy(y1(:))
         Generating copyin(x(:))
  time:       2796251
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
  time:       4135116
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
  time:       1254468
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
    201.0000   
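
To see how much of the OpenACC time is data movement versus compute, you can put timers around each phase. Here's a rough sketch (the t0 through t3 timer variables are new; the kernels launch synchronously since no async clause is used):

Code:
     integer :: t0, t1, t2, t3
     call acc_init(acc_device_nvidia)      ! one-time device init, kept outside the timers
     call system_clock(t0)
     !$acc data copyin(x(:)), copy(y1(:))  ! host-to-device copies happen here
     call system_clock(t1)                 ! t1-t0 ~ upload time
     do iter=1,100
       call saxpy_openacc(m,2.0,x,y1)
     enddo
     call system_clock(t2)                 ! t2-t1 ~ compute time
     !$acc end data                        ! device-to-host copy of y1 happens here
     call system_clock(t3)                 ! t3-t2 ~ download time

Alternatively, set the environment variable PGI_ACC_TIME to 1 and the runtime will print a breakdown of data transfer and kernel times when the program exits.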
catfishwolf



Joined: 31 Mar 2013
Posts: 8

Posted: Fri Jul 26, 2013 11:37 am

Hi mkcolg,

I felt relieved after seeing your result, but I still get a different outcome: the OpenACC part takes more than 10 seconds to complete. Does it have anything to do with my Visual Studio environment settings (https://www.dropbox.com/s/fqsuajcw77j05e4/saxpy.rar) or with my hardware (CPU: AMD FX-4100 @ 4.0 GHz, GPU: GeForce GT 610)?

Li

Code:

  time:       7451000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
  time:       7490001
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
  time:      10263000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
    201.0000
Press any key to continue . . .

Code:
-g -Bstatic -Mbackslash -mp -acc -fastsse -Mipa=fast,inline -tp=bulldozer-64 -ta=nvidia,nowait,host -Minform=warn -Minfo=accel
mkcolg



Joined: 30 Jun 2004
Posts: 6138
Location: The Portland Group Inc.

Posted: Fri Jul 26, 2013 12:44 pm

Hi Li,

A GT 610 is a fairly weak card, which most likely accounts for the difference. I'm running a Tesla M2090 with 512 cores at a 1301MHz clock versus your GT 610, which has 48 cores running at 810MHz. Also, my card's memory is GDDR5 versus your card's DDR3.

Your card is fine for development, but you'll want to move to a Tesla card for production runs.

- Mat