PGI User Forum


Performance of PGI OpenACC directives
 
PGI User Forum Forum Index -> Accelerator Programming
sjz



Joined: 09 Jan 2013
Posts: 9

Posted: Wed Jan 09, 2013 9:48 am    Post subject: Performance of PGI OpenACC directives

We tried to use the PGI Fortran compiler with OpenACC to port climate and weather physics models. Somehow we cannot get the time spent in the OpenACC directives below 0.1 seconds, even for the vector-add code with small arrays. Is this a typical time needed for setting up the communication between the CPU and the GPU?

Thanks,

sjz

*************************************************************
Here is the hardware specification for a node:

2 hex-core 2.8 GHz Intel Xeon Westmere processors (4 flops per clock)
48 GB of memory per node
2 NVIDIA M2070 GPUs, each connected through a dedicated x16 PCIe Gen2 connection
Interconnect: InfiniBand QDR

Here are the compile and run environment and commands:

module load comp/pgi-12.4.0
module load other/mpi/openmpi/1.4.5-pgi-12.4.0

pgf90 -o vecadd_openacc -acc -ta=nvidia,fastmath vecadd_openacc.F90


pgcudainit &


Here are the performance numbers:


szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 1000
110842 microseconds on gpu
2 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 10000
105249 microseconds on gpu
22 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 100000
110235 microseconds on gpu
232 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 1000000
110693 microseconds on gpu
2206 microseconds on host
0 errors found


Here is the source code:

szhou@discover25:~/test_gpu/acc_cuda_fortran/acc> cat vecadd_openacc.F90
module vecaddmod

implicit none

contains

subroutine vecaddgpu( r, a, b, n )
real, dimension(:) :: r, a, b
integer :: n
integer :: i

!$acc kernels do copyin(a(1:n),b(1:n)) copyout(r(1:n)) gang vector(256)
do i = 1, n
r(i) = a(i) + b(i)
! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
enddo

end subroutine

subroutine vecaddcpu( r, a, b, n )
real, dimension(:) :: r, a, b
integer :: n
integer :: i

do i = 1, n
r(i) = a(i) + b(i)
! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
enddo

end subroutine

end module

program main

use vecaddmod

implicit none

integer :: n, i, errs, argcount
integer :: cpu_s, cpu_e, gpu_s, gpu_e
real, dimension(:), allocatable :: a, b, r, e
character*10 :: arg1

argcount = command_argument_count()
n = 1000000 ! default value
if( argcount >= 1 )then
call get_command_argument( 1, arg1 )
read( arg1, '(i10)' ) n
if( n <= 0 ) n = 100000
endif

allocate( a(n), b(n), r(n), e(n) )
do i = 1, n
a(i) = i
b(i) = 1000*i
enddo

! compute on the GPU
call system_clock (count=gpu_s)
call vecaddgpu( r, a, b, n )
call system_clock (count=gpu_e)

! compute on the host to compare
call system_clock (count=cpu_s)
call vecaddcpu( e, a, b, n )
call system_clock (count=cpu_e)

print *, gpu_e - gpu_s, ' microseconds on gpu'
print *, cpu_e - cpu_s, ' microseconds on host'

! compare results
errs = 0
do i = 1, n
if( abs((r(i) - e(i))/ e(i)) > 1.1 )then
errs = errs + 1
endif
enddo

print *, errs, ' errors found'
if( errs /= 0 ) call exit(errs)

end program


*********************************************
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

Posted: Wed Jan 09, 2013 11:01 am

Hi sjz,
Quote:
Is this a typical time needed for setting up the communication between the CPU and the GPU?
Typically there is a ~1 second per-device warm-up cost on Linux, but this can be removed by running pgcudainit to hold the devices open (which you are already doing here).

Next there is ~0.1 second cost to establish a context between the host and the device.

Finally, there is some overhead in copying the kernel code itself over to the device, as well as any arguments. This cost varies depending upon the kernel.

What you can do here is call "acc_init" before your timers to remove the initialization time. It's still part of your overall time, but hopefully in a larger application this overhead would be meaningless.

Hope this helps,
Mat

Code:
% cat vecadd_openacc.F90
module vecaddmod

implicit none

contains

subroutine vecaddgpu( r, a, b, n )
real, dimension(:) :: r, a, b
integer :: n
integer :: i

!$acc kernels do copyin(a(1:n),b(1:n)) copyout(r(1:n)) gang vector(256)
do i = 1, n
r(i) = a(i) + b(i)
! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
enddo

end subroutine

subroutine vecaddcpu( r, a, b, n )
real, dimension(:) :: r, a, b
integer :: n
integer :: i

do i = 1, n
r(i) = a(i) + b(i)
! r(i) = sqrt(a(i)) + b(i)*a(i)+sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
enddo

end subroutine

end module

program main

use vecaddmod
use openacc

implicit none

integer :: n, i, errs, argcount
integer :: cpu_s, cpu_e, gpu_s, gpu_e
real, dimension(:), allocatable :: a, b, r, e
character*10 :: arg1

argcount = command_argument_count()
n = 1000000 ! default value
if( argcount >= 1 )then
call get_command_argument( 1, arg1 )
read( arg1, '(i10)' ) n
if( n <= 0 ) n = 100000
endif

call acc_init(acc_get_device_type())

allocate( a(n), b(n), r(n), e(n) )
do i = 1, n
a(i) = i
b(i) = 1000*i
enddo

! compute on the GPU
call system_clock (count=gpu_s)
call vecaddgpu( r, a, b, n )
call system_clock (count=gpu_e)

! compute on the host to compare
!
call system_clock (count=cpu_s)
call vecaddcpu( e, a, b, n )
call system_clock (count=cpu_e)

print *, gpu_e - gpu_s, ' microseconds on gpu'
print *, cpu_e - cpu_s, ' microseconds on host'

! compare results
errs = 0
do i = 1, n
if( abs((r(i) - e(i))/ e(i)) > 1.1 )then
errs = errs + 1
endif
enddo

print *, errs, ' errors found'
if( errs /= 0 ) call exit(errs)

end program
% pgf90 -acc -ta=nvidia,4.2 -Minfo=accel vecadd_openacc.F90 -o vecadd_openacc
vecaddgpu:
     12, Generating present_or_copyout(r(:n))
         Generating present_or_copyin(b(:n))
         Generating present_or_copyin(a(:n))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     13, Loop is parallelizable
         Accelerator kernel generated
         13, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 9 registers; 64 shared, 0 constant, 0 local memory bytes
             CC 2.0 : 14 registers; 0 shared, 80 constant, 0 local memory bytes
% setenv PGI_ACC_TIME 1
% ./vecadd_openacc 100000
         8169  microseconds on gpu
          195  microseconds on host
            0  errors found

Accelerator Kernel Timing data
vecadd_openacc.F90
  vecaddgpu
    12: region entered 1 time
        time(us): total=8,165
                  kernels=33 data=2,598
        13: kernel launched 1 times
            grid: [391]  block: [256]
            time(us): total=33 max=33 min=33 avg=33
acc_init.c
  acc_init
    50: region entered 1 time
        time(us): init=101,173
sjz



Joined: 09 Jan 2013
Posts: 9

Posted: Thu Jan 10, 2013 11:58 am    Post subject: Follow-up on PGI OpenACC performance

Hi Mat,


Hi sjz,
Quote:
Is this a typical time needed for setting up the communication between the CPU and the GPU?
Typically there is a ~1 second per-device warm-up cost on Linux, but this can be removed by running pgcudainit to hold the devices open (which you are already doing here).

----> We did this.


Next there is ~0.1 second cost to establish a context between the host and the device.

Finally, there is some overhead in copying the kernel code itself over to the device, as well as any arguments. This cost varies depending upon the kernel.

----> Do these two costs occur each time this kernel is called? Would these costs be smaller if we used CUDA Fortran directly? Thanks, SJZ

What you can do here is call "acc_init" before your timers to remove the initialization time. It's still part of your overall time, but hopefully in a larger application this overhead would be meaningless.

Hope this helps,
Mat
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

Posted: Thu Jan 10, 2013 12:15 pm

Quote:
----> Do these two costs occur each time this kernel is called? Would these costs be smaller if we used CUDA Fortran directly?

For the copying of kernels, if the kernel is called multiple times in succession, then the cost to copy the kernel to the device occurs only once. However, if there are many other kernels in between calls, then there is the potential that the kernel code needs to be copied over again.
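To see roughly where that one-time cost shows up in the example from this thread, here is a rough, untested sketch (not from the posted runs; it assumes the vecaddmod module posted earlier) that times two back-to-back calls to vecaddgpu. The first call carries the one-time kernel upload, while both calls still pay the copyin/copyout data transfers.

Code:
! Rough sketch only: assumes the vecaddmod module posted earlier in this thread.
! The first call includes the one-time kernel upload; the second call is mostly
! data transfer plus the kernel itself.
program time_twice
use vecaddmod
use openacc
implicit none
integer, parameter :: n = 100000
integer :: t0, t1, t2, i
real, dimension(n) :: a, b, r

call acc_init( acc_get_device_type() )   ! pay context creation before timing

do i = 1, n
a(i) = i
b(i) = 1000*i
enddo

call system_clock( count=t0 )
call vecaddgpu( r, a, b, n )   ! first call
call system_clock( count=t1 )
call vecaddgpu( r, a, b, n )   ! second call
call system_clock( count=t2 )

print *, t1 - t0, ' clock counts, first call'
print *, t2 - t1, ' clock counts, second call'
end program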

Arguments are copied over each time, but until the size of your argument list grows beyond 256 bytes on older devices or 1024 bytes on newer ones (this is a CUDA limit), it will have very little impact. In order to support larger argument lists, we will wrap the arguments up into a single struct, copy the struct to the device, and then pass a pointer to the struct as an argument. This can have some impact on performance.

One thing to keep in mind: yes, there is some overhead here, but it really is quite small (10-100 us). In my opinion, if you are writing kernels where this overhead greatly impacts your performance, then you may not want to put those algorithms on an accelerator in the first place. Not every algorithm works well on an accelerator.

We only use vecadd in our examples because it's easy to illustrate the mechanics of OpenACC, but it isn't really a good algorithm for an accelerator since there's not enough computation to make it worthwhile.
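For what it's worth, the commented-out line already in your posted source is a quick way to give the kernel more work per element, so the fixed launch and transfer overhead matters relatively less. For example (same directive, heavier loop body; the routine name here is just for illustration):

Code:
! Same vecaddgpu kernel, but with the heavier expression that is commented
! out in the posted source, to raise the arithmetic work per element.
subroutine vecaddgpu_heavy( r, a, b, n )
real, dimension(:) :: r, a, b
integer :: n
integer :: i

!$acc kernels do copyin(a(1:n),b(1:n)) copyout(r(1:n)) gang vector(256)
do i = 1, n
r(i) = sqrt(a(i)) + b(i)*a(i) + sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
enddo
end subroutine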

- Mat
sjz



Joined: 09 Jan 2013
Posts: 9

Posted: Fri Jan 11, 2013 11:35 am

"In order to support larger argument lists, we will wrap the arguments up into a single struct, copy the struct to the device, and then pass a pointer to the struct as an argument. This can have some impact on performance. "

Do you have sample code showing this trick?


Thanks
Page 1 of 2

 