sjz
Joined: 09 Jan 2013 Posts: 6
Posted: Wed Jan 09, 2013 9:48 am Post subject: performance of PGI OpenACC directives
We tried to use the PGI Fortran compiler's OpenACC support to port climate and weather physics models. Somehow we cannot reduce the time spent in the OpenACC directives below 0.1 second, even with a simple vector-add code and small arrays. Is this a typical time needed for setting up the communication between the CPU and GPU?
Thanks,
sjz
*************************************************************
Here is the hardware specification for a node:
2 hex-core 2.8 GHz Intel Xeon Westmere processors (4 flops per clock)
48 GB of memory per node
2 NVIDIA M2070 GPUs, each connected through a dedicated x16 PCIe Gen2 connection
Interconnect: InfiniBand QDR
Here are the compilation and run environment and commands:
module load comp/pgi-12.4.0
module load other/mpi/openmpi/1.4.5-pgi-12.4.0
pgf90 -o vecadd_openacc -acc -ta=nvidia,fastmath vecadd_openacc.F90
pgcudainit &
Here are the performance numbers:
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 1000
110842 microseconds on gpu
2 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 10000
105249 microseconds on gpu
22 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 100000
110235 microseconds on gpu
232 microseconds on host
0 errors found
szhou@warp01a002:~/test_gpu/acc_cuda_fortran/acc> ./vecadd_openacc 1000000
110693 microseconds on gpu
2206 microseconds on host
0 errors found
Here is the source code:
szhou@discover25:~/test_gpu/acc_cuda_fortran/acc> cat vecadd_openacc.F90
module vecaddmod
  implicit none
contains
  subroutine vecaddgpu( r, a, b, n )
    real, dimension(:) :: r, a, b
    integer :: n
    integer :: i
!$acc kernels do copyin(a(1:n),b(1:n)) copyout(r(1:n)) gang vector(256)
    do i = 1, n
      r(i) = a(i) + b(i)
      ! r(i) = sqrt(a(i)) + b(i)*a(i) + sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
    enddo
  end subroutine
  subroutine vecaddcpu( r, a, b, n )
    real, dimension(:) :: r, a, b
    integer :: n
    integer :: i
    do i = 1, n
      r(i) = a(i) + b(i)
      ! r(i) = sqrt(a(i)) + b(i)*a(i) + sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
    enddo
  end subroutine
end module

program main
  use vecaddmod
  implicit none
  integer :: n, i, errs, argcount
  integer :: cpu_s, cpu_e, gpu_s, gpu_e
  real, dimension(:), allocatable :: a, b, r, e
  character(len=10) :: arg1
  argcount = command_argument_count()
  n = 1000000  ! default value
  if( argcount >= 1 )then
    call get_command_argument( 1, arg1 )
    read( arg1, '(i10)' ) n   ! '(i10)' is standard Fortran; '(i)' is a vendor extension
    if( n <= 0 ) n = 100000
  endif
  allocate( a(n), b(n), r(n), e(n) )
  do i = 1, n
    a(i) = i
    b(i) = 1000*i
  enddo
  ! compute on the GPU; system_clock counts are printed as microseconds,
  ! which assumes a count rate of 1,000,000 ticks per second
  call system_clock (count=gpu_s)
  call vecaddgpu( r, a, b, n )
  call system_clock (count=gpu_e)
  ! compute on the host to compare
  call system_clock (count=cpu_s)
  call vecaddcpu( e, a, b, n )
  call system_clock (count=cpu_e)
  print *, gpu_e - gpu_s, ' microseconds on gpu'
  print *, cpu_e - cpu_s, ' microseconds on host'
  ! compare results with a small relative tolerance
  errs = 0
  do i = 1, n
    if( abs((r(i) - e(i)) / e(i)) > 1.0e-5 )then
      errs = errs + 1
    endif
  enddo
  print *, errs, ' errors found'
  if( errs /= 0 ) call exit(errs)
end program
*********************************************
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Wed Jan 09, 2013 11:01 am Post subject: |
Hi sjz,
Quote:
Is this a typical time needed for setting up the communication between the CPU and GPU?

Typically there is a ~1 second per-device warm-up cost on Linux, but this can be removed by running pgcudainit to hold the devices open (which you do here).
Next, there is a ~0.1 second cost to establish a context between the host and the device.
Finally, there is some overhead in copying the kernel code itself over to the device, as well as any arguments. This cost varies depending upon the kernel.
What you can do here is call "acc_init" before your timers to remove the initialization time. It's still part of your overall run time, but in a larger application this overhead should be negligible.
Hope this helps,
Mat
Code:
% cat vecadd_openacc.F90
module vecaddmod
  implicit none
contains
  subroutine vecaddgpu( r, a, b, n )
    real, dimension(:) :: r, a, b
    integer :: n
    integer :: i
!$acc kernels do copyin(a(1:n),b(1:n)) copyout(r(1:n)) gang vector(256)
    do i = 1, n
      r(i) = a(i) + b(i)
      ! r(i) = sqrt(a(i)) + b(i)*a(i) + sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
    enddo
  end subroutine
  subroutine vecaddcpu( r, a, b, n )
    real, dimension(:) :: r, a, b
    integer :: n
    integer :: i
    do i = 1, n
      r(i) = a(i) + b(i)
      ! r(i) = sqrt(a(i)) + b(i)*a(i) + sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
    enddo
  end subroutine
end module

program main
  use vecaddmod
  use openacc
  implicit none
  integer :: n, i, errs, argcount
  integer :: cpu_s, cpu_e, gpu_s, gpu_e
  real, dimension(:), allocatable :: a, b, r, e
  character(len=10) :: arg1
  argcount = command_argument_count()
  n = 1000000  ! default value
  if( argcount >= 1 )then
    call get_command_argument( 1, arg1 )
    read( arg1, '(i10)' ) n
    if( n <= 0 ) n = 100000
  endif
  ! initialize the device before the timers so the one-time
  ! context-creation cost is not counted against the kernel
  call acc_init(acc_get_device_type())
  allocate( a(n), b(n), r(n), e(n) )
  do i = 1, n
    a(i) = i
    b(i) = 1000*i
  enddo
  ! compute on the GPU
  call system_clock (count=gpu_s)
  call vecaddgpu( r, a, b, n )
  call system_clock (count=gpu_e)
  ! compute on the host to compare
  call system_clock (count=cpu_s)
  call vecaddcpu( e, a, b, n )
  call system_clock (count=cpu_e)
  print *, gpu_e - gpu_s, ' microseconds on gpu'
  print *, cpu_e - cpu_s, ' microseconds on host'
  ! compare results with a small relative tolerance
  errs = 0
  do i = 1, n
    if( abs((r(i) - e(i)) / e(i)) > 1.0e-5 )then
      errs = errs + 1
    endif
  enddo
  print *, errs, ' errors found'
  if( errs /= 0 ) call exit(errs)
end program
% pgf90 -acc -ta=nvidia,4.2 -Minfo=accel vecadd_openacc.F90 -o vecadd_openacc
vecaddgpu:
12, Generating present_or_copyout(r(:n))
Generating present_or_copyin(b(:n))
Generating present_or_copyin(a(:n))
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
13, Loop is parallelizable
Accelerator kernel generated
13, !$acc loop gang, vector(256) ! blockidx%x threadidx%x
CC 1.0 : 9 registers; 64 shared, 0 constant, 0 local memory bytes
CC 2.0 : 14 registers; 0 shared, 80 constant, 0 local memory bytes
% setenv PGI_ACC_TIME 1
% ./vecadd_openacc 100000
8169 microseconds on gpu
195 microseconds on host
0 errors found
Accelerator Kernel Timing data
vecadd_openacc.F90
vecaddgpu
12: region entered 1 time
time(us): total=8,165
kernels=33 data=2,598
13: kernel launched 1 times
grid: [391] block: [256]
time(us): total=33 max=33 min=33 avg=33
acc_init.c
acc_init
50: region entered 1 time
time(us): init=101,173
sjz
Joined: 09 Jan 2013 Posts: 6
Posted: Thu Jan 10, 2013 11:58 am Post subject: follow-up on PGI OpenACC performance
Hi, Mat:

Quote:
Typically there is a ~1 second per-device warm-up cost on Linux, but this can be removed by running pgcudainit to hold the devices open (which you do here).

----> We did this.

Quote:
Next, there is a ~0.1 second cost to establish a context between the host and the device. Finally, there is some overhead in copying the kernel code itself over to the device, as well as any arguments. This cost varies depending upon the kernel.

----> Are these two costs incurred each time the kernel is called? Would these two costs be smaller if we used CUDA Fortran directly? Thanks, SJZ
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Thu Jan 10, 2013 12:15 pm Post subject: |
Quote:
----> Are these two costs incurred each time the kernel is called? Would these two costs be smaller if we used CUDA Fortran directly?
For the copying of kernels: if the same kernel is called multiple times in succession, the cost of copying it to the device is paid only once. However, if many other kernels run in between calls, the kernel code may need to be copied over again.
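One quick way to see this for yourself is to time two back-to-back calls: the first absorbs the one-time cost, the second shows the recurring cost. A rough sketch, reusing the vecaddgpu routine and system_clock pattern from your code:
Code:
! time two successive calls: the first includes the one-time
! kernel-copy cost, the second shows the steady-state cost
! (data is still transferred on every call because of the
!  copyin/copyout clauses on the kernels directive)
call system_clock (count=gpu_s)
call vecaddgpu( r, a, b, n )   ! first call: includes kernel copy
call system_clock (count=gpu_e)
print *, gpu_e - gpu_s, ' microseconds, first gpu call'
call system_clock (count=gpu_s)
call vecaddgpu( r, a, b, n )   ! second call: kernel already resident
call system_clock (count=gpu_e)
print *, gpu_e - gpu_s, ' microseconds, second gpu call'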
Arguments are copied over on each launch, but unless your argument list grows beyond 256 bytes on older devices or 1024 bytes on newer ones (this is a CUDA limit), it has very little impact. In order to support larger argument lists, we wrap the arguments up into a single struct, copy the struct to the device, and then pass a pointer to the struct as an argument. This can have some impact on performance.
One thing to keep in mind: yes, there is some overhead here, but it really is quite small (10-100 microseconds). In my opinion, if you are writing kernels where this overhead greatly impacts your performance, then you may not want to put these algorithms on an accelerator. Not every algorithm works well on an accelerator.
We only use vecadd in our examples because it makes the mechanics of OpenACC easy to illustrate, but it isn't really a good algorithm for an accelerator since there's not enough computation per element to make the data movement worthwhile.
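As a rough illustration, one way to give the example a more realistic amount of work is simply to uncomment the heavier expression already present in the loop; with more arithmetic per element, the transfer and launch overhead starts to amortize:
Code:
do i = 1, n
  ! heavier per-element work, uncommented from the example above
  r(i) = sqrt(a(i)) + b(i)*a(i) + sin(a(i))*exp(-b(i))*sqrt(a(i)+b(i))
enddo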
- Mat
sjz
Joined: 09 Jan 2013 Posts: 6
Posted: Fri Jan 11, 2013 11:35 am Post subject: |
"In order to support larger argument lists, we will wrap the arguments up into a single struct, copy the struct to the device, and then pass a pointer to the struct as an argument. This can have some impact on performance. "
Do you have a sample code on this trick?
Thanks |