PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

Very slow performance of some loops

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
elephant



Joined: 24 Feb 2011
Posts: 22

PostPosted: Fri Jul 22, 2011 6:05 am    Post subject: Very slow performance of some loops Reply with quote

Hi

I am facing performance problems of my code. I have several subroutine calls in a main program. all the subroutines are ported to the GPU. The time spent in the loops of the subroutines should be all in the order of 10^-3 to 10^-4 seconds. When I compile the program and run it several times, there are always three loops that are VERY slow (order of 1-2 seconds). Strange thing is that this times are jumping around, meaning that for one execution loop3 in subroutine1 has 2sec while as when I compile and execute the program again then loop6 in subroutine1 has about 2s...
Please check the following output:
Code:

 **********************************************************
 Time for Sub1,Loop1 [s]        =   8.2000000000000001E-005
 Time for Sub1,Loop2 [s]        =   2.0699999999999999E-004
 Time for Sub1,Loop3 [s]        =    2.107871000000000        !!!
 Time for Sub1,Loop4 [s]        =   1.5999999999999999E-004
 Time for Sub1,Loop5 [s]        =   8.0599999999999997E-004
 Time for Sub1,Loop6 [s]        =   8.3400000000000002E-003
 Time for Sub1,Loop7 [s]        =   2.4399999999999999E-004
 Time for Sub1,Loop8 [s]        =   1.0300000000000000E-004
 Time for Sub1,Loop9 [s]        =   8.2999999999999998E-005
 Time for Sub1,Loop10[s]        =   2.2551999999999999E-002
 **********************************************************
 Time for Sub2,Loop1 [s]        =   1.7500000000000000E-004
 Time for Sub2,Loop2 [s]        =   8.3999999999999995E-005
 Time for Sub2,Loop3 [s]        =   7.0999999999999991E-005
 Time for Sub2,Loop4 [s]        =   2.7799999999999998E-004
 Time for Sub2,Loop5 [s]        =   1.3799999999999999E-004
 Time for Sub2,Loop6 [s]        =   6.3000000000000000E-005
 **********************************************************
 Time for Sub3,Loop1 [s]        =   8.6379999999999998E-003
 Time for Sub3,Loop2 [s]        =   0.2743080000000000        !!!
 Time for Sub3,Loop3 [s]        =   8.5999999999999990E-005
 **********************************************************
 Time for Sub4,Loop1 [s]        =   1.5999999999999999E-004
 Time for Sub4,Loop2 [s]        =   1.7799999999999999E-004
 Time for Sub4,Loop3 [s]        =    2.636425000000000        !!!
 **********************************************************
 
  *********************************************************
 Time for Sub1,Loop1 [s]        =   6.7000000000000002E-005
 Time for Sub1,Loop2 [s]        =   1.5233000000000000E-002
 Time for Sub1,Loop3 [s]        =   1.1600000000000000E-004
 Time for Sub1,Loop4 [s]        =   1.3799999999999999E-004
 Time for Sub1,Loop5 [s]        =   2.4163000000000000E-002
 Time for Sub1,Loop6 [s]        =   3.7700000000000000E-004
 Time for Sub1,Loop7 [s]        =   1.9899999999999999E-004
 Time for Sub1,Loop8 [s]        =   6.9999999999999994E-005
 Time for Sub1,Loop9 [s]        =   6.4999999999999994E-005
 Time for Sub1,Loop10[s]        =    1.978296000000000        !!!
 **********************************************************
 Time for Sub2,Loop1 [s]        =   1.6899999999999999E-004
 Time for Sub2,Loop2 [s]        =   7.4999999999999993E-005
 Time for Sub2,Loop3 [s]        =   7.3999999999999996E-005
 Time for Sub2,Loop4 [s]        =   2.5799999999999998E-004
 Time for Sub2,Loop5 [s]        =   1.3500000000000000E-004
 Time for Sub2,Loop6 [s]        =   6.7999999999999999E-005
 **********************************************************
 Time for Sub3,Loop1 [s]        =   8.6379999999999998E-003
 Time for Sub3,Loop2 [s]        =   0.2743080000000000        !!!
 Time for Sub3,Loop3 [s]        =   8.5999999999999990E-005
 **********************************************************
 Time for Sub4,Loop1 [s]        =   8.0800000000000002E-004
 Time for Sub4,Loop2 [s]        =    1.442023000000000        !!!
 Time for Sub4,Loop3 [s]        =   8.7599999999999993E-004
 **********************************************************

Is it possible that from time to time the launch of a new kernel might lead to a 1 to 2 second breakdown of the program? That the GPU stands still? To me it seems like arbitrary behaviour where the slow loop is. I studyied also the compiler message carefouly and there is no device-host data transfer during execution!
Thank you very much!!!
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

PostPosted: Fri Jul 22, 2011 9:09 am    Post subject: Reply with quote

Hi elephant,

How are you timing the loops? If you are using a CPU timer, then it's possible that the OS is interrupting your process, and this delay is being picked up in your timer.

If using the PGI Accelerator Model, add the "-ta=nvidia,time" option to get some basic profiling information about your kernels. For CUDA Fortran, try using the CUDA events to get more accurate timing information.

Example:
Code:
...
    real :: func_time
    type(cudaEvent) :: func_start, func_end
...
    istat = cudaEventCreate(func_start)
    istat = cudaEventCreate(func_end)    i
    istat = cudaEventRecord(func_start, 0)
    call  mykernel<<<dimGrid>>>(dX, dY, dTemp, N)
    istat = cudaEventRecord(func_end, 0)
    istat = cudaThreadSynchronize()
    istat = cudaEventElapsedTime(func_time, func_start, func_end)
    mytime = func_time/1000
...


Also, try using a profiler to get a better understanding on what's happening. You can use "pgcollect -cuda <myexe>" to gather performance data and then view this data in the PGI profiler, pgprof. Or set the environment variable "CUDA_PROFILE=1" to enable the CUDA profiler to gather information about your program.

Hope this helps,
Mat
Back to top
View user's profile
elephant



Joined: 24 Feb 2011
Posts: 22

PostPosted: Fri Jul 22, 2011 11:30 am    Post subject: Reply with quote

Hey Mat,
thank you for the fast response!
I am using the pgi accelerator model. The way I am capturing the time is as following. I haved used the "-ta=nvidia,time" flag before on some test loops but have noticed that this slowes down the performance. I use the "call system clock" inbetween two accelerated loops, not in a compute region. Do you think this still can cause this problem?
Code:

!$acc region
   loop1
!$acc end region

call sytem clock...

!$acc region
   loop2
!$acc end region

Can I use "pgcollect -cuda <myexe>" and "CUDA_PROFILE=1" even if I dont use CUDA?
Thank you!
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

PostPosted: Fri Jul 22, 2011 2:29 pm    Post subject: Reply with quote

Quote:
Do you think this still can cause this problem?
Because the performance difference varies from routine to routine and run to run, it just seems like it.

Quote:
Can I use "pgcollect -cuda <myexe>" and "CUDA_PROFILE=1" even if I dont use CUDA?
Both work just take off the "-cuda" from pgcollect. The "-cuda" enables GPU hardware counters. We're working on adding this support for the accelerator model, but for now you only get information about how much time each kernel takes, how much time was spent copying data, and initialization time. (It's similar to the -ta=nvidia,time output).

The CUDA prof info is the same, just not aggregated.

- Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group