PGI User Forum


kernel launch overhead
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Wed Dec 07, 2011 4:23 am    Post subject: kernel launch overhead

Hi,

I am trying to better understand the time overhead required to launch kernels.
I have written the following test code:
Code:

program main
#ifdef _MCUDA_
USE cudafor
#endif
  implicit none
  integer*4 :: N,nc,i,k,itime,nt,nargs,ierr,iwait
  real*8, allocatable :: a(:), b(:)
  character*16 arg
  integer*4 :: dt1(8), dt2(8), it1, it2,istat, &
       idummy,icountrate,icountmax
  real*8 :: rt1,rt2

  N=1E4        !number of parallel threads
  nt=1000      !number of timing iterations
  nc=10        !number of compute iterations
  iwait=1      !wait for completion

  nargs = command_argument_count()
    if( nargs >= 1 ) then
       call getarg( 1, arg )
       read(arg,'(i)') N
    endif

    if( nargs >= 2 ) then
       call getarg( 2, arg )
       read(arg,'(i)') iwait
    endif

 allocate(a(N),b(N))

!--------------------------------------------
!test1: compute loop inside kernel
!$acc data region local(a,b)
!init
a=0.01
!$acc region
do i=1,N
b(i)=0.1
end do
!$acc end region

!$acc update device(a)
#ifdef _MCUDA_
 istat=cudaThreadSynchronize()
#endif
CALL SYSTEM_CLOCK(COUNT=it1,COUNT_RATE=icountrate,COUNT_MAX=icountmax)

!iteration loop to have some statistic
do itime=1,nt
   !initialization
#ifdef _MCUDA_
   IF (iwait==1) istat=cudaThreadSynchronize()
#endif
   !$acc region do kernel parallel, vector(256)
   do i=1,N
      do k=1,nc  !compute loop
         a(i)=a(i)*0.01+exp(b(i)*b(i))
      end do
   end do
   !$acc end region
end do


#ifdef _MCUDA_
 istat=cudaThreadSynchronize()
#endif
CALL SYSTEM_CLOCK(COUNT=it2)
!$acc update host(a)
!$acc end data region



 rt1 = ( REAL(it2) - REAL(it1) ) / REAL(icountrate)

write(*,"(A7,I,A7,I,A7,I)") ' N=', N, ' , nt=',nt,' , nc=',nc
print*, '1: sum(a)=',sum(a)
write(*,"(A,F10.2)")  ' 1: time per step (us) =', rt1/nt * 1E6

!--------------------------------------------
!test2: compute loop outside kernel
a=0.01

!$acc data region local(a,b)
!$acc update device(a)
#ifdef _MCUDA_
 istat=cudaThreadSynchronize()
#endif
CALL SYSTEM_CLOCK(COUNT=it1,COUNT_RATE=icountrate,COUNT_MAX=icountmax)
!iteration loop to have some statistic
do itime=1,nt
   do k=1,nc  !compute loop
#ifdef _MCUDA_
      IF (iwait==1) istat=cudaThreadSynchronize()
#endif
   !$acc region do kernel parallel, vector(256)
   do i=1,N
      a(i)=a(i)*0.01+exp(b(i)*b(i))
   end do
   !$acc end region
   end do
end do
#ifdef _MCUDA_
 istat=cudaThreadSynchronize()
#endif
CALL SYSTEM_CLOCK(COUNT=it2)
!$acc update host(a)
!$acc end data region

rt2 = ( REAL(it2) - REAL(it1) ) / REAL(icountrate)
!print time
print*, '2: sum(a)=',sum(a)
write(*,"(A,F10.2)")  ' 2: time per step (us) =', rt2/nt * 1E6
write(*,"(A,F10.2)")  ' Mean kernel overhead per launch (us)=', abs(rt1-rt2)/(real(nt*nc))*1E6

end program main

which I compile with the following command:
pgf90 -ta=nvidia -O3 -Minfo=accel -Mcuda -D_MCUDA_ -Mpreprocess -o kernel_overhead_timing kernel_overhead_timing.f90

If I now run it with 10000 parallel threads, I get an overhead of about 30 us:
./kernel_overhead_timing 10000 1
N= 10000 , nt= 1000 , nc= 10
1: sum(a)= 10202.52694098092
1: time per step (us) = 53.76
2: sum(a)= 10202.52694098092
2: time per step (us) = 334.37
Mean kernel overhead per launch (us)= 28.06

Note that for this first experiment, with the second argument set to 1, there is a call to cudaThreadSynchronize() before each kernel launch.

If I now run with option 0:
./kernel_overhead_timing 10000 0
N= 10000 , nt= 1000 , nc= 10
1: sum(a)= 10202.52694098092
1: time per step (us) = 26.44
2: sum(a)= 10202.52694098092
2: time per step (us) = 61.95
Mean kernel overhead per launch (us)= 3.55

I get 3.55 us, which seems to indicate that the different kernel executions have been overlapping.

For code using directives, are there cases where an equivalent of cudaThreadSynchronize() is issued? In such cases one should expect about 30 us of additional time per launch when using multiple kernels.
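
As a cross-check on the SYSTEM_CLOCK numbers, here is a minimal sketch of timing a single region with CUDA events; this assumes the cudafor interfaces cudaEventCreate/cudaEventRecord/cudaEventSynchronize/cudaEventElapsedTime, and the variable names are illustrative:
Code:

#ifdef _MCUDA_
  type(cudaEvent) :: estart, estop
  real*4 :: msec
  istat = cudaEventCreate(estart)
  istat = cudaEventCreate(estop)
  istat = cudaEventRecord(estart, 0)
  !$acc region do kernel parallel, vector(256)
  do i=1,N
     a(i)=a(i)*0.01+exp(b(i)*b(i))
  end do
  !$acc end region
  istat = cudaEventRecord(estop, 0)
  istat = cudaEventSynchronize(estop)
  ! elapsed time between the two events, in milliseconds
  istat = cudaEventElapsedTime(msec, estart, estop)
  write(*,*) 'event time (us) =', msec*1E3
#endif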

Thanks,

Xavier
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Wed Dec 07, 2011 3:43 pm

Hi Xavier,

Host code will block until the kernel is done executing, so there is no need to add a call to "cudaThreadSynchronize". I think the difference you're seeing between the two runs is the overhead of calling this function.
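
To make that concrete, here is a minimal sketch of the point, assuming the default synchronous behavior of the PGI Accelerator model described above:
Code:

!$acc region do kernel parallel, vector(256)
do i=1,N
   a(i)=a(i)*0.01+exp(b(i)*b(i))
end do
!$acc end region
! Host execution resumes here only after the kernel has finished,
! so an explicit istat = cudaThreadSynchronize() at this point adds
! only its own call overhead, which is what the two runs measure.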

I ran your program and set the "CUDA_PROFILE" environment variable to 1 to get a device profile. Using a Perl script I wrote to sum the values, I see very little difference in the GPU time between the two runs.
Code:

% pgf90 -ta=nvidia -O3  -Mcuda -D_MCUDA_ -Mpreprocess test.f90 -V11.10
% a.out 10000 1
     N=       10000  , nt=        1000  , nc=          10
 1: sum(a)=    10202.52694098092     
 1: time per step (us) =     45.08
 2: sum(a)=    10202.52694098092     
 2: time per step (us) =    201.89
 Mean kernel overhead per launch (us)=     15.68
% perl totalProf.pl cuda_profile_0.log
                             main_37_gpu   0.000007   0.000020
                              memcpyDtoH   0.000029   0.000083
                              memcpyHtoD   0.000031   0.000048
                             main_55_gpu   0.026302   0.004748
                             main_96_gpu   0.060909   0.037238
                                 Totals:   0.087277   0.042137

% a.out 10000 0
     N=       10000  , nt=        1000  , nc=          10
 1: sum(a)=    10202.52694098092     
 1: time per step (us) =     28.91
 2: sum(a)=    10202.52694098092     
 2: time per step (us) =     81.23
 Mean kernel overhead per launch (us)=      5.23
% perl totalProf.pl cuda_profile_0.log
                             main_37_gpu   0.000007   0.000021
                              memcpyDtoH   0.000029   0.000086
                              memcpyHtoD   0.000031   0.000046
                             main_55_gpu   0.026188   0.004331
                             main_96_gpu   0.059881   0.034520
                                 Totals:   0.086135   0.039004


The main overhead in launching a kernel has to do with copying the kernel code itself to the GPU as well as copying the arguments. Hence, the overhead can vary quite a bit. One thing you can look at is the profile output from the "-ta=nvidia,time" flag. It will give you initialization times.

Code:
% pgf90 -ta=nvidia,time -O3 -Mcuda -D_MCUDA_ -Mpreprocess test.f90 -V11.10
% a.out 10000 0
     N=       10000  , nt=        1000  , nc=          10
 1: sum(a)=    10202.52694098092     
 1: time per step (us) =     48.39
 2: sum(a)=    10202.52694098092     
 2: time per step (us) =    276.12
 Mean kernel overhead per launch (us)=     22.77

Accelerator Kernel Timing data
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    106: region entered 1 time
        time(us): init=0
                  data=55
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    95: region entered 10000 times
        time(us): total=274319 init=817 region=273502
                  kernels=145376 data=0
        w/o init: total=273502 max=61 min=26 avg=27
        96: kernel launched 10000 times
            grid: [40]  block: [256]
            time(us): total=145376 max=48 min=13 avg=14
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    84: region entered 1 time
        time(us): init=0
                  data=48
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    83: region entered 1 time
        time(us): total=276442 init=1 region=276441
        w/o init: total=276441 max=276441 min=276441 avg=276441
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    68: region entered 1 time
        time(us): init=0
                  data=52
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    54: region entered 1000 times
        time(us): total=48202 init=76 region=48126
                  kernels=34788 data=0
        w/o init: total=48126 max=63 min=46 avg=48
        55: kernel launched 1000 times
            grid: [40]  block: [256]
            time(us): total=34788 max=45 min=34 avg=34
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    42: region entered 1 time
        time(us): init=0
                  data=57
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    36: region entered 1 time
        time(us): total=80 init=1 region=79
                  kernels=38 data=0
        w/o init: total=79 max=79 min=79 avg=79
        37: kernel launched 1 times
            grid: [40]  block: [256]
            time(us): total=38 max=38 min=38 avg=38
/home/colgrove/support/xlapillonne/12_07_11/test.f90
  main
    33: region entered 1 time
        time(us): total=170986 init=121122 region=49864
        w/o init: total=49864 max=49864 min=49864 avg=49864


Hope this helps,
Mat
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Thu Dec 08, 2011 5:00 am

Hi Mat,

Thanks for clarifying things.

In fact I am trying to understand why it is faster in this case to call one larger kernel compared to many smaller kernels. So, looking only at the case

Code:

% a.out 10000 1
...
1: time per step (us) =     45.0
...
2: time per step (us) =    201.89

I thought at first that the difference between the two timings was coming from the initialisation time, but the -ta=nvidia,time info seems to indicate that it is in fact very small.
Code:

    95: region entered 10000 times
        time(us): total=274319 init=817 region=273502
                  kernels=145376 data=0

which would mean 817/10000 ≈ 0.08 us per launch. This is way under the 3.5 us I was looking at.

How should one interpret the difference between region=273502 and kernels=145376? What is the code doing during this time?

Xavier
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Thu Dec 08, 2011 10:59 am

Quote:
How should one interpret the difference between region=273502 and kernels=145376, what is the code doing during this time ?
It's due to the fact that the profile information is being gathered on the host side. Hence the region timing can include CPU execution time, and much of this time can be attributed to setting up the call to the device. For this code, calling the kernel fewer times does seem to be better.
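For example, from the profile above: the 10000-launch region averages 27 us per entry while the kernel itself averages 14 us, so roughly 13 us per launch is spent on the host side setting up and completing the call.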

Quote:
In fact I am trying to understand why is it faster in this case to call one larger kernel as compare to many smaller kernels.
In this case, the kernel is the same size (which is actually very small). You're just giving it a different amount of work.

Typically, the large-versus-small kernel trade-off has to do with the amount of resources (registers, constant memory, and shared memory) a kernel consumes. Since there is a finite amount of these resources, the more a single kernel uses, the fewer threads can be run at the same time. This is called "occupancy". (NVIDIA has a nice Excel spreadsheet which helps determine occupancy: http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls) Generally, a program with low occupancy will have lower performance, but a high-occupancy kernel may or may not have higher performance.

- Mat

Occupancy information can be seen in the compiler feedback messages:
Code:
% pgf90 -ta=nvidia -O3 -Mpreprocess test.f90 -V11.10 -Minfo=accel
main:
     33, Generating local(b(:))
         Generating local(a(:))
     36, Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     37, Loop is parallelizable
         Accelerator kernel generated
         37, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.3 : 6 registers; 44 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 4 shared, 56 constant, 0 local memory bytes; 100% occupancy
     42, Generating !$acc update device(a(:))
     54, Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     55, Loop is parallelizable
         Accelerator kernel generated
         55, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.3 : 14 registers; 60 shared, 28 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 18 registers; 4 shared, 84 constant, 0 local memory bytes; 100% occupancy
     56, Complex loop carried dependence of 'a' prevents parallelization
         Loop carried dependence of 'a' prevents parallelization
         Loop carried backward dependence of 'a' prevents vectorization
     68, Generating !$acc update host(a(:))
     83, Generating local(b(:))
         Generating local(a(:))
     84, Generating !$acc update device(a(:))
     95, Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     96, Loop is parallelizable
         Accelerator kernel generated
         96, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 1.3 : 10 registers; 60 shared, 28 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 13 registers; 4 shared, 84 constant, 0 local memory bytes; 100% occupancy
    106, Generating !$acc update host(a(:))
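
As a further experiment (a hedged sketch, not from the compiler output above): varying the vector clause changes the thread-block size the compiler generates, and with it the per-block resource usage and occupancy reported by -Minfo=accel:
Code:

!$acc region do kernel parallel, vector(128)  ! try 64/128/256 and compare the reported occupancy
do i=1,N
   a(i)=a(i)*0.01+exp(b(i)*b(i))
end do
!$acc end region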
JMa



Joined: 30 Nov 2012
Posts: 22

Posted: Wed Jul 16, 2014 1:41 pm

Hi Mat,
In the PVF 2014 project settings, where can I activate the "-ta=nvidia,time" flag?

Actually, what I'm interested in is how to get the kernel launch overhead in PVF. When I turn on "accelerator profiling" in PVF, I only get something like the output below, which does not tell me the kernel launch overhead.

275: compute region reached 1347 times
276: kernel launched 1347 times
grid: [1] block: [128]
device time(us): total=2,698 max=4 min=2 avg=2
elapsed time(us): total=219,000 max=16,000 min=0 avg=162
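
(So far I can only estimate it indirectly: the elapsed average of 162 us minus the device average of 2 us suggests roughly 160 us per launch of host-side and launch overhead, but I would like a more direct measurement.)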


Thanks,
JMa

mkcolg wrote:
The main overhead in launching a kernel has to do with copying the kernel code itself to the GPU as well as copying the arguments. Hence, the overhead can vary quite a bit. One thing you can look at is the profile output from the "-ta=nvidia,time" flag. It will give you initialization times.
...