PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

Improve Kernel Performance

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling
View previous topic :: View next topic  
Author Message
Pebbles



Joined: 24 Sep 2010
Posts: 13

PostPosted: Thu Jan 06, 2011 8:27 am    Post subject: Improve Kernel Performance Reply with quote

When I run the following kernel code the performance is not as good as I would expect. On the CPU it runs in 93.22 seconds; with CUDA Fortran 7.12 seconds; and with the PGI accelerator 2.00 seconds. I am attempting to parallelize approximately 1,054,200 loops but the results don't seem to scale. How can I know what order of magnitude of improvement I should expect by parallelizing the code? Also is there a way to tell how may threads are actually executing the code? How could the kernel performance be improved?

Thanks for your help

module NORMALIZE_KERNEL
use cudafor

contains

attributes(global) subroutine NORMALIZE_IMAGE_KERNEL(P_RAD, M_RAD, IEND_MEMBER, END_MEMBER, DOTPR_MIN, DOTPR_MAX, DOTPR, NWL, NX, NY, NB)

real, device :: M_RAD(NWL, NB), P_RAD(NX, NY, NWL), DOTPR_MIN(0:255), DOTPR_MAX(0:255), DOTPR(NX, NY)
integer, device :: IEND_MEMBER(NX, NY), END_MEMBER(0:255,2)

integer, value :: NWL, NB, NX, NY, KMAX
integer :: tx, ibk, iwl, i, kk, iy, ix
real :: bl, ibk0, dotpr_matl, iem, dotpr0
real, parameter :: EPSMIN4 = 1.1754944E-38
real, parameter :: CLOSENESS_OF_FIT = 0.90

tx = threadidx%x

i = ( blockidx%x-1 ) * blockdim%x + tx

if ( i .le. NX*NY ) then
kk = (i-1)/NX
iy = kk+1
ix = i-kk*NX

bl=0.0
do iwl = 1,NWL
if ( P_RAD(ix,iy,iwl) > 0.0 ) bl = bl + P_RAD(ix,iy,iwl)**2
end do
if ( bl < EPSMIN4 ) then
IEND_MEMBER(ix,iy) = 255
ibk0 = 255
bl = 1.0
else
IEND_MEMBER(ix,iy) = 254
ibk0=254
end if
bl = sqrt(bl)
do iwl = 1,NWL
P_RAD(ix,iy,iwl) = P_RAD(ix,iy,iwl)/bl
end do

dotpr_matl = 0.0
do ibk = 1,NB
iem = END_MEMBER(ibk,1)
dotpr0 = 0.0
do iwl = 1,NWL
dotpr0 = dotpr0 + P_RAD(ix,iy,iwl) * M_RAD(iwl,ibk)
end do

if ( dotpr0 >= CLOSENESS_OF_FIT .AND. dotpr0 > dotpr_matl .AND. iem/=39 .AND. iem/=49 .AND. iem/=17 .AND. iem/=12 .AND. iem/=13 .AND. iem/=16 .AND. iem/=18) then
dotpr_matl = dotpr0
IEND_MEMBER(ix,iy) = END_MEMBER(ibk,1)
DOTPR(ix,iy) = dotpr0
ibk0 = ibk
end if
end do

if ( IEND_MEMBER(ix,iy) == 0 ) END_MEMBER(0,2) = END_MEMBER(0,2) + 1
do ibk = 1, NB
if ( IEND_MEMBER(ix,iy) == END_MEMBER(ibk,1) ) END_MEMBER(ibk,2) = END_MEMBER(ibk,2) + 1
end do
if ( DOTPR(ix,iy) < DOTPR_MIN(ibk0) ) DOTPR_MIN(ibk0) = DOTPR(ix,iy)
if ( DOTPR(ix,iy) > DOTPR_MAX(ibk0) ) DOTPR_MAX(ibk0) = DOTPR(ix,iy)
end if

call syncthreads()

end subroutine

end module NORMALIZE_KERNEL

*****************************************************

PROGRAM TESTPGI

USE CONSTANTSPGI
USE NORMALIZE_KERNEL

REAL , PARAMETER :: CLOSENESS_OF_FIT=0.90

INTEGER, PARAMETER :: NB = 64
INTEGER :: I, J, K, KK, IX, IY, IWL, OUTERLOOP, OUTERLOOPMAX, NX, NY, NWL, END_MEMBER(0:255,2)
INTEGER (KIND=KI4), ALLOCATABLE :: IEND_MEMBER(:,:)
REAL, ALLOCATABLE :: P_RAD(:,:,:), M_RAD(:,:), DOTPR(:,:)
REAL :: EXECTIME(5), BL, DOTPR_MATL, DOTPR0
REAL, DEVICE, ALLOCATABLE :: M_RAD_DEV(:,:), P_RAD_DEV(:,:,:), DOTPR_DEV(:,:)
REAL, DEVICE :: DOTPR_MIN(0:255), DOTPR_MAX(0:255)
INTEGER (KIND=KI4), DEVICE, ALLOCATABLE :: IEND_MEMBER_DEV(:,:)
INTEGER, DEVICE :: END_MEMBER_DEV(0:255,2)

DATA ((END_MEMBER(I,J),J=1,2),I=0,NB) / 0,0, &
1,0, 2,0, 3,0, 4,0, 5,0, 6,0, 7,0, 8,0, 9,0, 10,0, 11,0, 12,0, 13,0, 14,0, 15,0, 16,0, &
17,0, 18,0, 19,0, 20,0, 21,0, 22,0, 23,0, 24,0, 25,0, 26,0, 27,0, 28,0, 29,0, 30,0, 31,0, 32,0, 33,0, &
34,0, 35,0, 36,0, 37,0, 38,0, 39,0, 40,0, 41,0, 42,0, 43,0, 44,0, 45,0, 46,0, 47,0, 48,0, 49,0, 50,0, &
51,0, 52,0, 53,0, 54,0, 55,0, 56,0, 57,0, 58,0, 59,0, 60,0, 61,0, 62,0, 63,0, 64,0/
! 65,0, 66,0, 67,0, 68,0, 69,0, 70,0, 71,0, 72,0, 73,0, &
! 74,0, 75,0, 76,0, 77,0, 78,0, 79,0, 80,0, 81,0/

NX = 753
!NY = 1924
NY = 1400
NWL = 224

OUTERLOOPMAX = NX * NY

ALLOCATE(M_RAD(NWL,NB),STAT=IOS)
ALLOCATE(P_RAD(NX,NY,NWL),STAT=IOS)
ALLOCATE(IEND_MEMBER(NX,NY),DOTPR(NX,NY),STAT=IOS)

DO I = 1,NWL
DO J = 1,NX
DO K = 1,NY
P_RAD(J,K,I) = .005 ** 2
END DO
END DO
END DO

DO I = 1,NWL
DO J = 1,NB
M_RAD(I,J) = .003 ** 2
END DO
END DO

DO IBK = 1,NB
DOTPR_MIN(IBK) = 1.0
DOTPR_MAX(IBK) = -1.0
END DO

CALL CPU_TIME(EXECTIME(1))
ALLOCATE(M_RAD_DEV(NWL,NB),STAT=IOS)
ALLOCATE(P_RAD_DEV(NX,NY,NWL),STAT=IOS)
ALLOCATE(IEND_MEMBER_DEV(NX,NY),DOTPR_DEV(NX,NY),STAT=IOS)
IF (IOS/=0) THEN
STOP 'ALLOCATE Error 6'

END IF

P_RAD_DEV = P_RAD(1:NX, 1:NY, 1:NWL)
END_MEMBER_DEV = END_MEMBER(0:255, 1:2)
IEND_MEMBER_DEV = IEND_MEMBER(1:NX, 1:NY)
DOTPR_DEV = DOTPR(1:NX, 1:NY)

call NORMALIZE_IMAGE_KERNEL<<<(OUTERLOOPMAX-1)/512+1,512>>>(P_RAD_DEV, M_RAD_DEV, IEND_MEMBER_DEV, END_MEMBER_DEV, DOTPR_MIN, DOTPR_MAX, DOTPR_DEV, NWL, NX, NY, NB)

END_MEMBER(1:NBCKGND, 1:2) = END_MEMBER_DEV
IEND_MEMBER(1:NX, 1:NY) = IEND_MEMBER_DEV

CALL CPU_TIME(EXECTIME(2))
print *, "Time ", EXECTIME(2) - EXECTIME(1)
DEALLOCATE(P_RAD,STAT=IOS)
DEALLOCATE(M_RAD,STAT=IOS)

END PROGRAM TESTPGI
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6206
Location: The Portland Group Inc.

PostPosted: Thu Jan 06, 2011 6:36 pm    Post subject: Reply with quote

Hi Pebbles,

Quote:
How can I know what order of magnitude of improvement I should expect by parallelizing the code?
There really isn't a good way of doing this since there are too many factors.
Quote:

Also is there a way to tell how may threads are actually executing the code?
In the PGI Accelerator model, using the flag "-ta=nvidia,time" will give profile information including the kernel schedule. For CUDA Fortran, you explicitly set the number of threads when you launch your kernel.
Quote:

How could the kernel performance be improved?
The only specific thing I see off hand is that you should remove the extraneous call to syncthreads.

Try running your program using the PGI profile utility "pgcollect" with the flag "-cuda=gmem" of "-cuda=branch". This will gather Host and GPU profile information that you can then review within the PGI profiler PGPROF. (for details please refer to the PGI Tools Guide). This should give you a better idea of where your performance bottlenecks are.


While not necessarily performance relevant, I would suggest modifying your kernel schedule to be two dimensions. i.e. change your kernel launch schedule to:


Code:
...
INTEGER, DEVICE :: END_MEMBER_DEV(0:255,2)
TYPE(dim3) :: dimGrid, dimBlock
...
dimBlock = dim3(16,16,1)
dimGrid = dim3((NX+15)/16,(NY+15)/16,1)

call NORMALIZE_IMAGE_KERNEL<<<dimGrid,dimBlock>>>(P_RAD_DEV, M_RAD_DEV, IEND_MEMBER_DEV, END_MEMBER_DEV,  &
                                                  DOTPR_MIN, DOTPR_MAX, DOTPR_DEV, NWL, NX, NY, NB)
...


And then you can simply your device kernel index:
Code:

ix = ( blockidx%x-1 ) * blockdim%x + threadidx%x
iy = ( blockidx%y-1 ) * blockdim%y + threadidx%y

if ( ix .le. NX .and. iy.le.NY ) then


Hope this helps,
Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Programming and Compiling All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group