PGI User Forum


nested loop with re-used variable optimization

 
acolomb

Joined: 18 Nov 2011
Posts: 3

Posted: Fri Nov 18, 2011 4:56 am    Post subject: nested loop with re-used variable optimization

Hello,
I started looking at the PGI Accelerator programming model only a week ago. I am checking the feasibility of porting a large MPI-parallelized code to the GPU.
I compare the accelerated loop against the traditional host-executed loop, which uses a Fortran intrinsic function.
Just by adding the !$acc directives I get an enormous speedup; however, it appears that the innermost loop is not parallelizable, because the variable idx_rot is re-used after the loop.

I post the bulk of the code below; it contains calls to other functions and data I/O that are not relevant to the GPU operations.

Code:
module lapo_pixel
integer*4:: nlatzones                         !Number of rows of  the grid
                                                 
integer*4:: nvox                              !Total number of voxels
real*4,dimension(:),allocatable:: x_pixel,y_pixel,z_pixel,k_pixel   !Number of voxels
real*4,dimension(:),allocatable:: x_pixct,y_pixct,z_pixct,k_pixct
real*4,dimension(:),allocatable:: th_pixel,phi_pixel

end module

program gpunight
use lapo_pixel
use accel_lib
implicit none
real*4, dimension(:),allocatable :: sigll,zigll
real*4, dimension(:), allocatable ::x,y,z,phigd,phird,dr
real*4, dimension(:), allocatable :: x_pix,y_pix,z_pix
real*4,dimension(:), allocatable :: xtemp,ytemp,dr_min
integer*4 :: nproc,npts2d,rows,nring_tot,ct,iring,npts3d,ipt
integer*4,dimension(:), allocatable :: idx_rot
real*4 :: dphi,pi,phi,radius,temp,dr2,mindist   ! mindist holds a distance, so it must be real
integer*4 :: c0, c1, c2, c3, cgpu, chost, i


pi=3.141592653589793;
nproc=8;
rows=168;
npts2D=rows*nproc;

print*, 'npts2d:',rows*nproc,npts2d,'rows:',rows,'nproc',nproc

allocate(sigll(npts2d),zigll(npts2d))
allocate(xtemp(npts2d),ytemp(npts2d))
call load_grid_bdr(sigll,zigll,dphi,radius,rows,nproc)

nring_tot=ceiling(2*pi/dphi)
npts3d=npts2d*(nring_tot+1)

allocate(x(npts3d),y(npts3d),z(npts3d),dr(npts3d))
ct=0
do iring=0,nring_tot
   phi=dphi*iring;
   call cyl2cart(xtemp,ytemp,sigll,zigll,phi,rows*4);
   x(ct*npts2D+1:(ct+1)*npts2D)=xtemp;
   y(ct*npts2D+1:(ct+1)*npts2D)=ytemp;
   z(ct*npts2D+1:(ct+1)*npts2D)=zigll;
   ct=ct+1
enddo

call pixel_grid(radius)

allocate(x_pix(nvox),y_pix(nvox),z_pix(nvox))
allocate(idx_rot(nvox),dr_min(nvox))

call acc_init( acc_device_nvidia )

x_pix=x_pixct;y_pix=y_pixct;z_pix=z_pixct

print*, 'size x & xpix',size(x),size(x_pixct),npts3d,nvox
call system_clock( count=c1 )

!$acc region
!$acc do private(dr2,mindist)
do ipt=1,nvox
   mindist=1.0e15          ! large sentinel for the running minimum distance
   do i=1,npts3D
      dr2=sqrt((-x_pix(ipt)+x(i))**2+(-y_pix(ipt)+y(i))**2+(-z_pix(ipt)+z(i))**2);
      if (dr2<=mindist) then
          mindist=dr2;      ! update the running minimum (matches the host minloc loop below)
          idx_rot(ipt)=i;
      end if
   end do
end do
!$acc end region

call system_clock( count=c2 )
cgpu = c2 - c1


do ipt=1,nvox
   dr=sqrt((-x_pixct(ipt)+x)**2+(-y_pixct(ipt)+y)**2+(-z_pixct(ipt)+z)**2);   
   idx_rot(ipt)=minloc(dr,1)
enddo

call system_clock( count=c3 )
chost = c3 - c2
 print *, cgpu, ' microseconds on GPU'
 print *, chost, ' microseconds on host'

endprogram gpunight


Compiled with: pgfortran lapo_mesh.f90 -o xlapo.exe -O3 -fastsse -Minline -g -ta=nvidia,time -Minfo

That gives the following output concerning the GPU and host loops:
Code:
     73, Generating copyin(x(1:(nring_tot+1)*1344))
         Generating copyin(y_pix(1:nvox))
         Generating copyin(y(1:(nring_tot+1)*1344))
         Generating copyin(z_pix(1:nvox))
         Generating copyin(z(1:(nring_tot+1)*1344))
         Generating copy(idx_rot(1:nvox))
         Generating copyin(x_pix(1:nvox))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     75, Loop is parallelizable
     77, Loop carried reuse of 'idx_rot' prevents parallelization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         75, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             Using register for 'idx_rot'
             Using register for 'x_pix'
             Using register for 'y_pix'
             Using register for 'z_pix'
         77, !$acc do seq(256)
             Cached references to size [256] block of 'x'
             Cached references to size [256] block of 'y'
             Cached references to size [256] block of 'z'
             CC 1.0 : 14 registers; 3180 shared, 20 constant, 0 local memory bytes; 66% occupancy
             CC 1.3 : 14 registers; 3180 shared, 20 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 24 registers; 3092 shared, 112 constant, 0 local memory bytes; 83% occupancy
     90, Loop not vectorized/parallelized: contains call
     92, Generated 2 alternate versions of the loop
         Generated vector sse code for the loop
         Generated 3 prefetch instructions for the loop


The program output and the timing analysis are:

Code:
     73: region entered 1 time
         time(us): total=464729 init=1 region=464728
                   kernels=461062 data=2565
         w/o init: total=464728 max=464728 min=464728 avg=464728
         77: kernel launched 1 times
             grid: [41]  block: [256]
             time(us): total=461062 max=461062 min=461062 avg=461062
acc_init.c
  acc_init
     41: region entered 1 time
         time(us): init=2387859
 464740 microseconds on GPU
 37325989 microseconds on host


Already satisfactory, I would say, but I have a few questions:
1 - Can the sequential loop inside the accelerated region be further optimised?
2 - Would inserting a data region around the compute region improve the data-management performance?
3 - I noticed in a previous post the lack of the MINLOC/MAXLOC functions; are they going to be available in the near future?

Thanks in advance
mkcolg

Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Fri Nov 18, 2011 11:32 am

Hi acolomb,

Quote:
1 - Can the sequential loop inside the accelerated region be further optimised?
Yes, but you need to manually privatize it by adding another dimension for the "i" index and then do a max reduction from the 2D temp array back to idx_rot. Something like the following. Note that dr2 and mindist are private by default, so I removed this clause. Also, I don't know whether the code is faster in this case; it will depend on whether the increased parallelism helps more than the added cost of performing the reduction.
Code:
% cat test_2.f90
module lapo_pixel
integer*4:: nlatzones                         !Number of rows of  the grid
                                                 
integer*4:: nvox                              !Total number of voxels
real*4,dimension(:),allocatable:: x_pixel,y_pixel,z_pixel,k_pixel   !Number of voxels
real*4,dimension(:),allocatable:: x_pixct,y_pixct,z_pixct,k_pixct
real*4,dimension(:),allocatable:: th_pixel,phi_pixel

end module

program gpunight
use lapo_pixel
use accel_lib
implicit none
real*4, dimension(:),allocatable :: sigll,zigll
real*4, dimension(:), allocatable ::x,y,z,phigd,phird,dr
real*4, dimension(:), allocatable :: x_pix,y_pix,z_pix
real*4,dimension(:), allocatable :: xtemp,ytemp,dr_min
integer*4 :: nproc,npts2d,rows,nring_tot,ct,iring,npts3d,ipt
integer*4,dimension(:), allocatable :: idx_rot
integer*4,dimension(:,:), allocatable :: idx_rot2
real*4 :: dphi,pi,phi,radius,temp,dr2,mindist   ! mindist holds a distance, so it must be real
integer*4 :: c0, c1, c2, c3, cgpu, chost, i


pi=3.141592653589793;
nproc=8;
rows=168;
npts2D=rows*nproc;

print*, 'npts2d:',rows*nproc,npts2d,'rows:',rows,'nproc',nproc

allocate(sigll(npts2d),zigll(npts2d))
allocate(xtemp(npts2d),ytemp(npts2d))
call load_grid_bdr(sigll,zigll,dphi,radius,rows,nproc)

nring_tot=ceiling(2*pi/dphi)
npts3d=npts2d*(nring_tot+1)

allocate(x(npts3d),y(npts3d),z(npts3d),dr(npts3d))
ct=0
do iring=0,nring_tot
   phi=dphi*iring;
   call cyl2cart(xtemp,ytemp,sigll,zigll,phi,rows*4);
   x(ct*npts2D+1:(ct+1)*npts2D)=xtemp;
   y(ct*npts2D+1:(ct+1)*npts2D)=ytemp;
   z(ct*npts2D+1:(ct+1)*npts2D)=zigll;
   ct=ct+1
enddo

call pixel_grid(radius)

allocate(x_pix(nvox),y_pix(nvox),z_pix(nvox))
allocate(idx_rot(nvox),dr_min(nvox))
allocate(idx_rot2(npts3D,nvox))

call acc_init( acc_device_nvidia )

x_pix=x_pixct;y_pix=y_pixct;z_pix=z_pixct

print*, 'size x & xpix',size(x),size(x_pixct),npts3d,nvox
call system_clock( count=c1 )

mindist=1.0e15   ! fixed threshold, hoisted out of the region (not updated in this scheme)
!$acc region local(idx_rot2)
!$acc do
do ipt=1,nvox
   do i=1,npts3D
      dr2=sqrt((-x_pix(ipt)+x(i))**2+(-y_pix(ipt)+y(i))**2+(-z_pix(ipt)+z(i))**2);
      if (dr2<=mindist) then
          idx_rot2(i, ipt)=i
      else
          idx_rot2(i, ipt)=0
      end if
   end do
end do
do ipt=1,nvox
   idx_rot(ipt)=maxval(idx_rot2(:,ipt))
enddo
!$acc end region

call system_clock( count=c2 )
cgpu = c2 - c1


do ipt=1,nvox
   dr=sqrt((-x_pixct(ipt)+x)**2+(-y_pixct(ipt)+y)**2+(-z_pixct(ipt)+z)**2);   
   idx_rot(ipt)=minloc(dr,1)
enddo

call system_clock( count=c3 )
chost = c3 - c2
 print *, cgpu, ' microseconds on GPU'
 print *, chost, ' microseconds on host'

endprogram gpunight
% pgf90 -ta=nvidia -fast -c test_2.f90 -Minfo=accel
gpunight:
     65, Generating copyout(idx_rot(1:nvox))
         Generating copyin(x_pix(1:nvox))
         Generating copyin(x(1:(nring_tot+1)*1344))
         Generating copyin(y_pix(1:nvox))
         Generating copyin(y(1:(nring_tot+1)*1344))
         Generating copyin(z_pix(1:nvox))
         Generating copyin(z(1:(nring_tot+1)*1344))
         Generating local(idx_rot2(:,:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     67, Loop is parallelizable
     68, Loop is parallelizable
         Accelerator kernel generated
         67, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
             Cached references to size [16] block of 'x_pix'
             Cached references to size [16] block of 'y_pix'
             Cached references to size [16] block of 'z_pix'
         68, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             Using register for 'idx_rot2'
             Cached references to size [16] block of 'x'
             Cached references to size [16] block of 'y'
             Cached references to size [16] block of 'z'
             CC 1.0 : 11 registers; 536 shared, 12 constant, 0 local memory bytes; 66% occupancy
             CC 2.0 : 13 registers; 432 shared, 128 constant, 0 local memory bytes; 100% occupancy
     77, Loop is parallelizable
         Accelerator kernel generated
         77, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.0 : 11 registers; 72 shared, 8 constant, 0 local memory bytes; 66% occupancy
             CC 2.0 : 20 registers; 8 shared, 80 constant, 0 local memory bytes; 100% occupancy
     78, Loop is parallelizable

Quote:
2 - Would inserting a data region around the compute region improve the data-management performance?
Not in this case, since you only have the one compute region. Data regions should be used when you have multiple compute regions needing the same data, and/or a compute region that is reused multiple times (such as in a time-step loop); see the sketch below.
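
For illustration, a minimal sketch of a data region in the PGI Accelerator model. The outer time-step loop with istep and nsteps is hypothetical, added only to show the case where a data region pays off:
Code:
!$acc data region copyin(x, y, z, x_pix, y_pix, z_pix)
do istep = 1, nsteps        ! istep/nsteps are hypothetical
   !$acc region
   do ipt = 1, nvox
      ! ... same compute kernel as above; x, y, z, x_pix, y_pix,
      ! z_pix stay resident on the device across all iterations ...
   end do
   !$acc end region
end do
!$acc end data region

With the data region, the input arrays are copied to the device once, instead of at every entry to the compute region.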
Quote:
3 - I noticed in a previous post the lack of the MINLOC/MAXLOC functions; are they going to be available in the near future?
Probably not in the near future. MINLOC/MAXLOC are quite complex operations to perform in a general way; hence, the operation must be performed by a runtime function call, which doesn't exist on the device. Our engineers are working on solutions, but at this time it has been given a lower priority. In the meantime, the search can be written out by hand, as sketched below.
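
A minimal hand-rolled equivalent of MINLOC(dr,1) for one voxel, assuming dr has already been computed; dmin and imin are hypothetical local scalars (real*4 and integer*4). It is the same running-minimum pattern as the original kernel, with the search running sequentially on the accelerator:
Code:
dmin = 1.0e15            ! running minimum distance
imin = 0                 ! index of the current minimum
do i = 1, npts3d
   if (dr(i) < dmin) then
      dmin = dr(i)
      imin = i
   end if
end do
idx_rot(ipt) = imin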

Hope this helps,
Mat
acolomb

Joined: 18 Nov 2011
Posts: 3

Posted: Thu Nov 24, 2011 3:05 am

Hello Mat,

Thanks for the help. After a little while of testing, I confirm that the original loop pattern performs much better. The array idx_rot2(npts3D,nvox) is, in practical cases, too big (5-10 GB) to fit in the device memory and therefore produces an error when allocated.
When the problem size is reduced, the cost of the data transfers exceeds the benefit of the extra parallelism.
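For scale: idx_rot2 needs npts3D*nvox*4 bytes for its integer*4 elements. With, say, npts3D = 250,000 and nvox = 10,000 (sizes chosen purely for illustration), that is already 10 GB.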
Concluding, it is better to stay with the original structure, as the speedup is already impressive!

Thanks