|
| View previous topic :: View next topic |
| Author |
Message |
Pebbles
Joined: 24 Sep 2010 Posts: 13
|
Posted: Thu Jan 06, 2011 8:27 am Post subject: Improve Kernel Performance |
|
|
When I run the following kernel code the performance is not as good as I would expect. On the CPU it runs in 93.22 seconds; with CUDA Fortran 7.12 seconds; and with the PGI accelerator 2.00 seconds. I am attempting to parallelize approximately 1,054,200 loops but the results don't seem to scale. How can I know what order of magnitude of improvement I should expect by parallelizing the code? Also is there a way to tell how may threads are actually executing the code? How could the kernel performance be improved?
Thanks for your help
module NORMALIZE_KERNEL
use cudafor
contains
attributes(global) subroutine NORMALIZE_IMAGE_KERNEL(P_RAD, M_RAD, IEND_MEMBER, END_MEMBER, DOTPR_MIN, DOTPR_MAX, DOTPR, NWL, NX, NY, NB)
real, device :: M_RAD(NWL, NB), P_RAD(NX, NY, NWL), DOTPR_MIN(0:255), DOTPR_MAX(0:255), DOTPR(NX, NY)
integer, device :: IEND_MEMBER(NX, NY), END_MEMBER(0:255,2)
integer, value :: NWL, NB, NX, NY, KMAX
integer :: tx, ibk, iwl, i, kk, iy, ix
real :: bl, ibk0, dotpr_matl, iem, dotpr0
real, parameter :: EPSMIN4 = 1.1754944E-38
real, parameter :: CLOSENESS_OF_FIT = 0.90
tx = threadidx%x
i = ( blockidx%x-1 ) * blockdim%x + tx
if ( i .le. NX*NY ) then
kk = (i-1)/NX
iy = kk+1
ix = i-kk*NX
bl=0.0
do iwl = 1,NWL
if ( P_RAD(ix,iy,iwl) > 0.0 ) bl = bl + P_RAD(ix,iy,iwl)**2
end do
if ( bl < EPSMIN4 ) then
IEND_MEMBER(ix,iy) = 255
ibk0 = 255
bl = 1.0
else
IEND_MEMBER(ix,iy) = 254
ibk0=254
end if
bl = sqrt(bl)
do iwl = 1,NWL
P_RAD(ix,iy,iwl) = P_RAD(ix,iy,iwl)/bl
end do
dotpr_matl = 0.0
do ibk = 1,NB
iem = END_MEMBER(ibk,1)
dotpr0 = 0.0
do iwl = 1,NWL
dotpr0 = dotpr0 + P_RAD(ix,iy,iwl) * M_RAD(iwl,ibk)
end do
if ( dotpr0 >= CLOSENESS_OF_FIT .AND. dotpr0 > dotpr_matl .AND. iem/=39 .AND. iem/=49 .AND. iem/=17 .AND. iem/=12 .AND. iem/=13 .AND. iem/=16 .AND. iem/=18) then
dotpr_matl = dotpr0
IEND_MEMBER(ix,iy) = END_MEMBER(ibk,1)
DOTPR(ix,iy) = dotpr0
ibk0 = ibk
end if
end do
if ( IEND_MEMBER(ix,iy) == 0 ) END_MEMBER(0,2) = END_MEMBER(0,2) + 1
do ibk = 1, NB
if ( IEND_MEMBER(ix,iy) == END_MEMBER(ibk,1) ) END_MEMBER(ibk,2) = END_MEMBER(ibk,2) + 1
end do
if ( DOTPR(ix,iy) < DOTPR_MIN(ibk0) ) DOTPR_MIN(ibk0) = DOTPR(ix,iy)
if ( DOTPR(ix,iy) > DOTPR_MAX(ibk0) ) DOTPR_MAX(ibk0) = DOTPR(ix,iy)
end if
call syncthreads()
end subroutine
end module NORMALIZE_KERNEL
*****************************************************
PROGRAM TESTPGI
USE CONSTANTSPGI
USE NORMALIZE_KERNEL
REAL , PARAMETER :: CLOSENESS_OF_FIT=0.90
INTEGER, PARAMETER :: NB = 64
INTEGER :: I, J, K, KK, IX, IY, IWL, OUTERLOOP, OUTERLOOPMAX, NX, NY, NWL, END_MEMBER(0:255,2)
INTEGER (KIND=KI4), ALLOCATABLE :: IEND_MEMBER(:,:)
REAL, ALLOCATABLE :: P_RAD(:,:,:), M_RAD(:,:), DOTPR(:,:)
REAL :: EXECTIME(5), BL, DOTPR_MATL, DOTPR0
REAL, DEVICE, ALLOCATABLE :: M_RAD_DEV(:,:), P_RAD_DEV(:,:,:), DOTPR_DEV(:,:)
REAL, DEVICE :: DOTPR_MIN(0:255), DOTPR_MAX(0:255)
INTEGER (KIND=KI4), DEVICE, ALLOCATABLE :: IEND_MEMBER_DEV(:,:)
INTEGER, DEVICE :: END_MEMBER_DEV(0:255,2)
DATA ((END_MEMBER(I,J),J=1,2),I=0,NB) / 0,0, &
1,0, 2,0, 3,0, 4,0, 5,0, 6,0, 7,0, 8,0, 9,0, 10,0, 11,0, 12,0, 13,0, 14,0, 15,0, 16,0, &
17,0, 18,0, 19,0, 20,0, 21,0, 22,0, 23,0, 24,0, 25,0, 26,0, 27,0, 28,0, 29,0, 30,0, 31,0, 32,0, 33,0, &
34,0, 35,0, 36,0, 37,0, 38,0, 39,0, 40,0, 41,0, 42,0, 43,0, 44,0, 45,0, 46,0, 47,0, 48,0, 49,0, 50,0, &
51,0, 52,0, 53,0, 54,0, 55,0, 56,0, 57,0, 58,0, 59,0, 60,0, 61,0, 62,0, 63,0, 64,0/
! 65,0, 66,0, 67,0, 68,0, 69,0, 70,0, 71,0, 72,0, 73,0, &
! 74,0, 75,0, 76,0, 77,0, 78,0, 79,0, 80,0, 81,0/
NX = 753
!NY = 1924
NY = 1400
NWL = 224
OUTERLOOPMAX = NX * NY
ALLOCATE(M_RAD(NWL,NB),STAT=IOS)
ALLOCATE(P_RAD(NX,NY,NWL),STAT=IOS)
ALLOCATE(IEND_MEMBER(NX,NY),DOTPR(NX,NY),STAT=IOS)
DO I = 1,NWL
DO J = 1,NX
DO K = 1,NY
P_RAD(J,K,I) = .005 ** 2
END DO
END DO
END DO
DO I = 1,NWL
DO J = 1,NB
M_RAD(I,J) = .003 ** 2
END DO
END DO
DO IBK = 1,NB
DOTPR_MIN(IBK) = 1.0
DOTPR_MAX(IBK) = -1.0
END DO
CALL CPU_TIME(EXECTIME(1))
ALLOCATE(M_RAD_DEV(NWL,NB),STAT=IOS)
ALLOCATE(P_RAD_DEV(NX,NY,NWL),STAT=IOS)
ALLOCATE(IEND_MEMBER_DEV(NX,NY),DOTPR_DEV(NX,NY),STAT=IOS)
IF (IOS/=0) THEN
STOP 'ALLOCATE Error 6'
END IF
P_RAD_DEV = P_RAD(1:NX, 1:NY, 1:NWL)
END_MEMBER_DEV = END_MEMBER(0:255, 1:2)
IEND_MEMBER_DEV = IEND_MEMBER(1:NX, 1:NY)
DOTPR_DEV = DOTPR(1:NX, 1:NY)
call NORMALIZE_IMAGE_KERNEL<<<(OUTERLOOPMAX-1)/512+1,512>>>(P_RAD_DEV, M_RAD_DEV, IEND_MEMBER_DEV, END_MEMBER_DEV, DOTPR_MIN, DOTPR_MAX, DOTPR_DEV, NWL, NX, NY, NB)
END_MEMBER(1:NBCKGND, 1:2) = END_MEMBER_DEV
IEND_MEMBER(1:NX, 1:NY) = IEND_MEMBER_DEV
CALL CPU_TIME(EXECTIME(2))
print *, "Time ", EXECTIME(2) - EXECTIME(1)
DEALLOCATE(P_RAD,STAT=IOS)
DEALLOCATE(M_RAD,STAT=IOS)
END PROGRAM TESTPGI |
|
| Back to top |
|
 |
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Thu Jan 06, 2011 6:36 pm Post subject: |
|
|
Hi Pebbles,
| Quote: | | How can I know what order of magnitude of improvement I should expect by parallelizing the code? | There really isn't a good way of doing this since there are too many factors.
| Quote: |
Also is there a way to tell how may threads are actually executing the code? | In the PGI Accelerator model, using the flag "-ta=nvidia,time" will give profile information including the kernel schedule. For CUDA Fortran, you explicitly set the number of threads when you launch your kernel.
| Quote: |
How could the kernel performance be improved? | The only specific thing I see off hand is that you should remove the extraneous call to syncthreads.
Try running your program using the PGI profile utility "pgcollect" with the flag "-cuda=gmem" of "-cuda=branch". This will gather Host and GPU profile information that you can then review within the PGI profiler PGPROF. (for details please refer to the PGI Tools Guide). This should give you a better idea of where your performance bottlenecks are.
While not necessarily performance relevant, I would suggest modifying your kernel schedule to be two dimensions. i.e. change your kernel launch schedule to:
| Code: | ...
INTEGER, DEVICE :: END_MEMBER_DEV(0:255,2)
TYPE(dim3) :: dimGrid, dimBlock
...
dimBlock = dim3(16,16,1)
dimGrid = dim3((NX+15)/16,(NY+15)/16,1)
call NORMALIZE_IMAGE_KERNEL<<<dimGrid,dimBlock>>>(P_RAD_DEV, M_RAD_DEV, IEND_MEMBER_DEV, END_MEMBER_DEV, &
DOTPR_MIN, DOTPR_MAX, DOTPR_DEV, NWL, NX, NY, NB)
... |
And then you can simply your device kernel index:
| Code: |
ix = ( blockidx%x-1 ) * blockdim%x + threadidx%x
iy = ( blockidx%y-1 ) * blockdim%y + threadidx%y
if ( ix .le. NX .and. iy.le.NY ) then
|
Hope this helps,
Mat |
|
| Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2002 phpBB Group
|