PGI User Forum

profiling individual subroutines

brush



Joined: 26 Jun 2012
Posts: 44

Posted: Tue Jun 11, 2013 3:13 pm    Post subject: profiling individual subroutines

I've measured the overall speedup of the entire code with "time ./slab", but I want to measure the speedup of an individual subroutine. Since the accelerator directives are only in the subroutine in question, I'd like to know its speedup rather than that of the entire code. I know I can calculate this by assuming that the rest of the code takes the same time to run, but is there a way to measure it directly?
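
For reference, the indirect estimate mentioned above can be sketched roughly as follows. This is only an illustration: the function name and all timings are hypothetical, and it assumes that everything outside the subroutine takes the same time in the CPU-only and accelerated runs, and that the CPU-only time of the subroutine is already known.

```python
def subroutine_speedup(total_cpu_s, total_acc_s, sub_cpu_s):
    """Estimate the speedup of one subroutine from whole-program timings.

    Assumes everything outside the subroutine takes the same time in
    both the CPU-only and the accelerated run.
    """
    rest_s = total_cpu_s - sub_cpu_s   # time spent outside the subroutine
    sub_acc_s = total_acc_s - rest_s   # implied accelerated subroutine time
    if sub_acc_s <= 0:
        raise ValueError("timings are inconsistent with the constant-rest assumption")
    return sub_cpu_s / sub_acc_s


# Hypothetical wall-clock times in seconds, for illustration only.
print(subroutine_speedup(total_cpu_s=200.0, total_acc_s=80.0, sub_cpu_s=150.0))
```

The estimate is only as good as the constant-rest assumption, which is why a direct measurement of the subroutine is preferable.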

With PGI_ACC_TIME=1, I believe the top-level time(us) value below (10,855,571) is the total time spent in accelerator regions, including kernels and data transfers. Since this region makes up essentially the entire subroutine, I expected this to accurately give me the accelerated time of the subroutine.
Code:
Accelerator Kernel Timing data
/home/ben/slab_support/slab.f
  ppush  NVIDIA  devicenum=0
    time(us): 10,855,571
    276: compute region reached 40 times
        276: data copyin reached 520 times
             device time(us): total=3,714,420 max=10,758 min=6 avg=7,143
        277: kernel launched 40 times
            grid: [65535]  block: [128]
             device time(us): total=3,868,258 max=114,120 min=91,532 avg=96,706
            elapsed time(us): total=3,869,863 max=114,154 min=91,566 avg=96,746
        363: data copyout reached 320 times
             device time(us): total=3,272,893 max=10,823 min=10,170 avg=10,227


However, this time seems to be inconsistent with the overall speedup I am observing, leading me to believe that the above profile is missing some time somehow.
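
As a quick sanity check on the profile above, the three "device time" totals can be added up and compared with the reported time(us) value. The sketch below only uses numbers copied from the output; the dictionary labels are just descriptive names chosen here.

```python
# Device-time totals copied from the PGI_ACC_TIME output above (microseconds).
device_time_us = {
    "data copyin (line 276)": 3_714_420,
    "kernel (line 277)": 3_868_258,
    "data copyout (line 363)": 3_272_893,
}
reported_total_us = 10_855_571  # top-level time(us) for ppush

total_us = sum(device_time_us.values())
print("sum matches reported total:", total_us == reported_total_us)

# Break the total down by component, in seconds and as a share of the total.
for name, us in device_time_us.items():
    print(f"{name}: {us / 1e6:.2f} s ({100 * us / total_us:.1f}%)")
```

The three totals add up exactly to the reported value, and the two data transfers together account for about 64% of it. Any host-side time that is not counted as device time would have to be measured by other means.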

I'd like to use the pgprofiler, but I don't know how to find the wallclock time of cpush and ppush from the profile, since it reports functions like __select_nocancel. Apparently the 169 seconds of __select_nocancel are spent somewhere inside ppush or cpush, but I don't know exactly where.

Ben
mkcolg



Joined: 30 Jun 2004
Posts: 6215
Location: The Portland Group Inc.

Posted: Tue Jun 11, 2013 4:15 pm    Post subject:

Hi Ben,

In 13.7, you'll be able to use 'pgcollect' to create a mixed Host/Device profile and then view the results in PGPROF. Hopefully this will give you an easier method to extract the information you are looking for.

Quote:
Since this region makes up essentially the entire subroutine, I expected this to accurately give me the accelerated time of the subroutine.
This gives you the total time spent in this region, including kernel, data, nested regions, and even CPU time.

Quote:
However, this time seems to be inconsistent with the overall speedup I am observing, leading me to believe that the above profile is missing some time somehow.
You may be encountering the pinned memory issue I discuss here: http://www.pgroup.com/userforum/viewtopic.php?t=3902, or some other CUDA/device overhead issue which is not measured by PGI_ACC_TIME. For that level of detail, you'd need to use NVVP (the NVIDIA Visual Profiler).

- Mat
All times are GMT - 7 Hours