elephant
Joined: 24 Feb 2011 Posts: 22
Posted: Wed Jun 22, 2011 5:18 am    Post subject: MAXVAL with PGI Accelerator
Hello
A) Is it possible to search for the maximum value in an array using the PGI Accelerator model?
The code looks like this:
Code:
MAX_VALUE = maxval(ARRAY(1:1000000))
B) Another question: if I have a loop that is not well suited for the GPU but I still put the "!$acc region" directive around it, and the compiler tells me that the loop is executed sequentially ("!$acc do seq"), is the loop executed on the host or on the device? Why do I still get a performance increase compared to execution without the accelerator region? Is a loop that executes sequentially on the device faster than one that executes sequentially on the host?
C) Can the compiler and the GPU deal well with a loop that looks, for example, like this:
Code:
do i=1,1000000
   A1 = Array(kc,1) + Array(kc,6) + ...
   A2 = Array(kc,2) + ...
   B = 1./A1
   C = B*3.25
   K(i) = B*C*.....
end do
Is it bad or OK to generate such "private" variables like A1, B, ... that are used later in the loop?
If I added an index to all of my variables, I would exceed the memory of the GPU.
D) Is there a good strategy to reduce memory usage? I ran some test cases where I dynamically allocated memory using the "real, dimension(:,:), allocatable, device :: x" declaration for arrays that are only used temporarily during the calculation and do not have to be copied to or from the device. I found that allocating during execution is much faster than using a statically allocated array (used only locally on the GPU). However, the deallocation of the array on the GPU was the bottleneck and slowed down my performance (worse than running the code on the CPU).
How can I also deallocate a local array quickly? Otherwise I have to reuse it for several purposes (overwrite it), which would make my code completely unreadable for others.
Any tips on that would be appreciated, since I am running into memory space problems.
Thank you very much in advance!
mkcolg
Joined: 30 Jun 2004 Posts: 4997 Location: The Portland Group Inc.
Posted: Wed Jun 22, 2011 10:22 am
Hi elephant,
Quote:
A) Is it possible to search for the maximum value in an array using the PGI Accelerator model?
Yes. The compiler is able to recognize and parallelize most reductions, including MAXVAL. Here's an example:
Code:
% cat maxval.f90
program testmaxval
   real, allocatable, dimension(:) :: arr
   real :: maxv
   integer :: i, N
   N = 10000000
   allocate(arr(N))
   !$acc region local(arr)
   do i=1,N
      arr(i) = real(i)/N
   end do
   maxv = maxval(arr)
   !$acc end region
   print *, maxv
   deallocate(arr)
end program testmaxval
% pgf90 maxval.f90 -Minfo=accel -ta=nvidia
testmaxval:
10, Generating local(arr(:))
Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
11, Loop is parallelizable
Accelerator kernel generated
11, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
CC 1.0 : 3 registers; 40 shared, 12 constant, 0 local memory bytes; 100% occupancy
CC 1.3 : 3 registers; 40 shared, 12 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 8 registers; 8 shared, 52 constant, 0 local memory bytes; 100% occupancy
14, Loop is parallelizable
Accelerator kernel generated
14, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
Max reduction generated for arr$r <<<<< HERE Compiler is parallelizing maxval
CC 1.0 : 6 registers; 1080 shared, 20 constant, 0 local memory bytes; 100% occupancy
CC 1.3 : 6 registers; 1080 shared, 20 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 8 registers; 1032 shared, 64 constant, 0 local memory bytes; 100% occupancy
% a.out
1.000000
| Quote: | | "!$acc do seq"... is the loop executed on the host or device? | Device, albeit very slowly since it will use only a single thread.
Quote:
Why do I still get a performance increase compared to execution without the accelerator region?
If this is the only accelerator routine, then I don't know, since I would expect it to run slowly. I would need an example to better understand why.
If there are other parallelized routines that use the same data, then sequential routines can still help improve performance by eliminating the need to copy data to/from the host.
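As a rough sketch of what I mean (simplified, untested code with made-up names, not taken from your program): a sequential loop and a parallel loop placed inside one data region share the device copy of the array, so nothing has to travel back through the host between the two kernels.
Code:
! Sketch only: the first loop has a loop-carried dependence, so the
! compiler will schedule it as "!$acc do seq"; the second loop is fully
! parallel. Because both compute regions sit inside one data region,
! "a" stays resident on the device between the two kernels.
program seqsketch
   implicit none
   integer, parameter :: n = 1000000
   real, dimension(n) :: a, b
   integer :: i
   a = 1.0
   !$acc data region copy(a) copyout(b)
   !$acc region
   do i = 2, n
      a(i) = a(i) + a(i-1)   ! runs sequentially on the device
   end do
   !$acc end region
   !$acc region
   do i = 1, n
      b(i) = 2.0 * a(i)      ! runs in parallel on the device
   end do
   !$acc end region
   !$acc end data region
   print *, a(n), b(n)
end program seqsketch
By itself the first kernel would be slower than the host version, but it saves copying "a" out and back in around the parallel kernel.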
Quote:
Is a loop that executes sequentially on the device faster than one that executes sequentially on the host?
Comparing just a single-threaded, sequential kernel with a host-side counterpart, I would expect the host to run much faster. Besides the overhead of copying data to the device, the individual thread processors on the device are usually much slower. Again, I would need an example to understand why you are seeing the contrary.
Quote:
Can the compiler and the GPU deal well with a loop that looks, for example, like this:
It should.
Quote:
Is it bad or OK to generate such "private" variables like A1, B, ... that are used later in the loop? If I added an index to all of my variables, I would exceed the memory of the GPU.
Scalars are privatized by default, and in general they are a good thing since they can be stored in registers, which are quick to access. The only caveat is that if you use too many registers, you must reduce the number of threads per block.
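To illustrate (again only a sketch with placeholder names such as Array, K, and n, not your actual code): the scalar temporaries are privatized automatically, and if the -Minfo output shows that register use is limiting occupancy, you can reduce the block size with a vector clause on the loop directive.
Code:
! Sketch only: A1, B, and C are scalar temporaries, so each thread gets
! its own private copy held in registers. The vector(128) clause asks
! for 128 threads per block; shrinking the block is one way to relieve
! register pressure.
program privscalars
   implicit none
   integer, parameter :: n = 1000000
   real, allocatable, dimension(:,:) :: Array
   real, allocatable, dimension(:)   :: K
   real :: A1, B, C
   integer :: i
   allocate(Array(n,7), K(n))
   Array = 1.0
   !$acc region
   !$acc do parallel, vector(128)
   do i = 1, n
      A1 = Array(i,1) + Array(i,6)
      B  = 1.0 / A1
      C  = B * 3.25
      K(i) = B * C
   end do
   !$acc end region
   print *, K(1)
end program privscalars
The -Minfo=accel output reports registers and occupancy for each schedule, so you can see whether the smaller block actually helps.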
Quote:
D) Is there a good strategy to reduce memory usage?
We just added a new feature in 11.6 that allows shared-memory automatic arrays. It lets you define at runtime, as the third argument of the kernel launch, the size in bytes of the total amount of shared memory, which is then mapped to automatics in the kernel. Here's an example:
Code:
% cat automatic.cuf
module m
contains
   attributes(global) subroutine ss1( a, b, k )
      implicit none
      integer, value :: k
      integer, dimension(:,:) :: a, b
      integer(8), shared, dimension(blockdim%x, blockdim%y) :: s1
      integer, shared, dimension(k+1, k) :: s2
      integer ti, tj, i, j
      ti = threadidx%x
      tj = threadidx%y
      i = (blockidx%x-1)*blockdim%x + ti
      j = (blockidx%y-1)*blockdim%y + tj
      s1(ti,tj) = a(i,j)
      s2(ti,tj) = b(i,j)
      call syncthreads()
      a(i,j) = s2(tj,ti)
      b(i,j) = s1(tj,ti)
   end subroutine
end module

program p
   use m
   use cudafor
   implicit none
   integer, dimension(:,:), allocatable, device :: da, db
   integer, dimension(:,:), allocatable :: ha0, hb0, ha1, hb1, haa, hbb
   type(dim3) :: grid, block
   integer :: n, i, j, ierr
   n = 128
   allocate(haa(n,n), hbb(n,n))
   allocate(ha0(n,n), hb0(n,n))
   allocate(ha1(n,n), hb1(n,n))
   allocate(da(n,n), db(n,n))
   do j = 1,n
      do i = 1,n
         haa(i,j) = i*1000 + j
         hbb(i,j) = -i*1000 - j
      enddo
   enddo
   da = haa
   db = hbb
   grid = dim3(16,16,1)
   block = dim3(8,8,1)
   print *, 'calling ss1'
   ! The third argument, '900', is the size in bytes
   ! of the shared memory segment which maps to the
   ! kernel's shared automatics
   call ss1<<<grid,block,900>>>( da, db, 8 )
   print *, 'back from ss1'
   ierr = cudathreadsynchronize()
   if( ierr .ne. 0 )then
      print *, 'ss1 launch'
      print *, cudageterrorstring( ierr )
   endif
   print *,'last error is ', ierr
   ha1 = da
   hb1 = db
end program
% pgf90 -fast automatic.cuf; a.out
calling ss1
back from ss1
last error is 0
Hope this helps,
Mat
elephant
Joined: 24 Feb 2011 Posts: 22
Posted: Thu Jul 14, 2011 8:56 am
Hi Mat,
I'm still working on this MAXVAL issue! Can you tell me why this example is not working?
Code:
program main
   use accel_lib
   implicit none
   integer :: k, i
   integer, parameter :: kend=1000000
   double precision :: MAXv
   double precision,allocatable,dimension(:,:) :: Arr
   !$acc mirror(Arr)
   double precision,allocatable,dimension(:) :: MX_GPU
   !$acc mirror(MX_GPU)
   allocate(Arr(kend,7))
   allocate(MX_GPU(kend))
   do i=1,7
      do k=1,kend
         Arr(k,i) = sin(k*1.0)+cos(i*1.0)+2.1000000
      end do
   end do
   !$acc update device(Arr)
   !$acc region
   DO K = 1, KEND
      MX_GPU(K) = 1./Arr(K,7)
   END DO
   MAXv = maxval(MX_GPU)
   !$acc end region
   print*
   print*,'MaxValue correct = 0.5394028083693713'
   print*,'MaxValue = ',MAXv
end program main
When I run it on the CPU it works fine, but on the GPU it does not:
Code:
MaxValue correct = 0.5394028083693713
MaxValue = -6.8642848977706330E-145
Thank you
mkcolg
Joined: 30 Jun 2004 Posts: 4997 Location: The Portland Group Inc.
Posted: Thu Jul 14, 2011 10:16 am
Hi elephant,
It looks like there is a problem with using a 'mirror' array in MAXVAL when it is contained in a region that has multiple kernels. I have created a report (TPR#18009) and sent it to our engineers for further investigation.
The workaround is to put the MAXVAL in its own ACC region, or to declare MX_GPU as 'local' instead of 'mirror'.
Example using multiple ACC regions:
Code:
% cat maxv2.f90
program main
   use accel_lib
   implicit none
   integer :: k, i
   integer, parameter :: kend=1000000
   double precision :: MAXv
   double precision,allocatable,dimension(:,:) :: Arr
   !$acc mirror(Arr)
   double precision,allocatable,dimension(:) :: MX_GPU
   !$acc mirror(MX_GPU)
   allocate(Arr(kend,7))
   allocate(MX_GPU(kend))
   do i=1,7
      do k=1,kend
         Arr(k,i) = sin(k*1.0)+cos(i*1.0)+2.1000000
      end do
   end do
   MX_GPU=0.
   !$acc update device(Arr)
   !$acc region
   DO K = 1, KEND
      MX_GPU(K) = 1./Arr(K,7)
   END DO
   !$acc end region
   !$acc region
   MAXv = maxval(MX_GPU)
   !$acc end region
   print*
   print*,'MaxValue correct = 0.5394028083693713'
   print*,'MaxValue = ',MAXv
end program main
% pgf90 -ta=nvidia maxv2.f90 -V11.7
% a.out
MaxValue correct = 0.5394028083693713
MaxValue = 0.5394028083693713
Example using a local clause:
Code:
% cat maxv1.f90
program main
   use accel_lib
   implicit none
   integer :: k, i
   integer, parameter :: kend=1000000
   double precision :: MAXv
   double precision,allocatable,dimension(:,:) :: Arr
   !$acc mirror(Arr)
   double precision,allocatable,dimension(:) :: MX_GPU
   allocate(Arr(kend,7))
   allocate(MX_GPU(kend))
   do i=1,7
      do k=1,kend
         Arr(k,i) = sin(k*1.0)+cos(i*1.0)+2.1000000
      end do
   end do
   MX_GPU=0.
   !$acc update device(Arr)
   !$acc region local(MX_GPU)
   DO K = 1, KEND
      MX_GPU(K) = 1./Arr(K,7)
   END DO
   MAXv = maxval(MX_GPU)
   !$acc end region
   print*
   print*,'MaxValue correct = 0.5394028083693713'
   print*,'MaxValue = ',MAXv
end program main
% pgf90 -ta=nvidia maxv1.f90 -V11.7
% a.out
MaxValue correct = 0.5394028083693713
MaxValue = 0.5394028083693713
Hope this helps,
Mat
jtull
Joined: 30 Jun 2004 Posts: 234
Posted: Fri May 17, 2013 5:18 pm
Hello,
TPR 18009 has been determined to be fixed in release 13.5, though it may have been fixed in an earlier release.
Thanks for your problem submission.
regards,
dave