PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

Issue on profiling accelerator kernels

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Debugging and Profiling
View previous topic :: View next topic  
Author Message
Seyong



Joined: 14 Jun 2011
Posts: 4

PostPosted: Thu Jun 16, 2011 11:02 am    Post subject: Issue on profiling accelerator kernels Reply with quote

Can anyone explain the profiled output shown below?
As you can see, there is a huge gap between "region" time
and the sum of "kernels" time and "data" time.

PGI user guide (PGI2011 User guide, chapter 7, Profiling Accelerator kernels (page 105)) says that the execution time ("regions" time) is divided into kernel execution time ("kernels" time) and data transfer time between the host and GPU ("data" time), but below output shows that something big occurs during the region time.

Can anyone explain why "regions" time can be much bigger than the sum of "kernels" time and "data" time?
(Accelerator regions are called repeatedly in a nested loop, but the loop does not contain any CPU code except for the loop itself, no data are transferred between CPU and GPU.)

[Added later]
1) Manually written CUDA program shows kernel execution times similar to the ones shown below, and each accelerator loop uses many private scalar variables, which may mean that PGI Accelerator compiler generates very inefficient code to handle loop-private data.
2) pgprof shows that about 49 seconds are spent for routine 'sstk', I don't know why it takes so long time (The total execution time of the manual version is less than 8 seconds.)
3) CUDA visual profiler also shows that there are non-trivial time gap between each kernel invocation, even though no data transfers occur between kernel calls.


Test environment:
PGI comipler version: 11.5
GPU: Tesla C2050 with runtime version 4.0
CPU: Intel Xeon CPU E5520 @2.27GHz
OS: Cent OS release 5.6

=====================================
Accelerator Kernel Timing data
/home/f6l/work/pgi/src/CFD/euler3d_cpu.c
main
598: region entered 6000 times
time(us): total=2871953 init=1222 region=2870731
kernels=713274 data=0
w/o init: total=2870731 max=716 min=427 avg=478
599: kernel launched 6000 times
grid: [1512] block: [128]
time(us): total=713274 max=130 min=117 avg=118
/home/f6l/work/pgi/src/CFD/euler3d_cpu.c
main
440: region entered 6000 times
time(us): total=56968526 init=1170 region=56967356
kernels=7977452 data=0
w/o init: total=56967356 max=16848 min=8962 avg=9494
441: kernel launched 6000 times
grid: [1512] block: [128]
time(us): total=7977452 max=1377 min=1243 avg=1329
/home/f6l/work/pgi/src/CFD/euler3d_cpu.c
main
410: region entered 2000 times
time(us): total=3495504 init=384 region=3495120
kernels=334715 data=0
w/o init: total=3495120 max=1980 min=1672 avg=1747
411: kernel launched 2000 times
grid: [1512] block: [128]
time(us): total=334715 max=187 min=166 avg=167
/home/f6l/work/pgi/src/CFD/euler3d_cpu.c
main
402: region entered 2000 times
time(us): total=664865 init=376 region=664489
kernels=244456 data=0
w/o init: total=664489 max=408 min=319 avg=332
403: kernel launched 2000 times
grid: [7560] block: [128]
time(us): total=244456 max=126 min=121 avg=122
/home/f6l/work/pgi/src/CFD/euler3d_cpu.c
main
385: region entered 1 time
time(us): total=489
kernels=61 data=0
386: kernel launched 1 times
grid: [1512] block: [128]
time(us): total=61 max=61 min=61 avg=61
/home/f6l/work/pgi/src/CFD/euler3d_cpu.c
main
377: region entered 1 time
time(us): total=64112060 init=94186 region=64017874
data=6017
w/o init: total=64017874 max=64017874 min=64017874 avg=64017874
Back to top
View user's profile
Seyong



Joined: 14 Jun 2011
Posts: 4

PostPosted: Mon Jun 20, 2011 11:44 am    Post subject: Reply with quote

The above problem disappeared if all private clauses are removed; it seems that the PGI Accelerator compiler has a bug in handling private clauses. For the tested program, both versions with explicit private clauses and without private causes generate the same, correct outputs, but the version with explicit private data clauses takes much longer time than the one without private clauses (63 sec v.s. 9 sec).
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

PostPosted: Tue Jun 21, 2011 4:45 pm    Post subject: Reply with quote

Hi Seyong,

Is it possible for you to post or send to PGI Customer Service (trs@pgroup.com) a specific example?

Thanks,
Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Debugging and Profiling All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group