PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

First run is slower than the following

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
lechat



Joined: 14 Oct 2012
Posts: 10

PostPosted: Sun Oct 21, 2012 1:29 pm    Post subject: First run is slower than the following Reply with quote

Hi there,

I noticed this across my benchmarks: the first time a kernel is run, it takes longer, regardless of the usage of acc_init(0).

For each benchmark I have, I run it several times and take the average time of the executions.
I use clock() to measure time, and the main loop is basically:

Code:
for (int i=0; i<num_tests; i++) {
  time_t begin = clock();
  benchmark;
  time_t end = clock();
}


I found out that the first iteration would take longer than the subsequent ones.
On the beginning I thought it was because I had forgotten the acc_init() in the beginning of my program, but it actually doesn't make much of a difference.

I then supposed that the compiler was re-using the data, instead of copying it over again.
I did a small experiment: I created another input/output arrays, with different data, and performed an "empty" benchmark with the fake data before running my real benchmark.
This way, if there was some sort of clever re-using of data, it wouldn't affect my program: well, it still worked faster.

Then what I did was to change the "warm-up" benchmark to a different thing. I was doing a matrix transpose, but then I wrote a simple array copy with openacc and run that as my warm-up (the benchmark was still the transpose, but the warm-up was the copy).
It didn't make any difference: regardless of which warm-up, the code ran faster than the warm-up-less code.

Apparently, the first openacc invocation that happens STILL needs to perform some initialization.

Here are some numbers for the matrix transpose 8192x8192:
No warm-up: 0.362 seconds (run 5 times, max: 0.39 min: 0.32)

Code:
Accelerator Kernel Timing data
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    9: region entered 1 time
        time(us): total=40,066 init= region=40,065
                  kernels=39,934
        w/o init: total=40,065 max=40,065 min=40,065 avg=40,065
        12: kernel launched 1 times
            grid: [128x2048]  block: [64x4]
            time(us): total=39,934 max=39,934 min=39,934 avg=39,934
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    6: region entered 1 time
        time(us): total=372,274 init=126,802 region=245,472
                  data=204,136
        w/o init: total=245,472 max=245,472 min=245,472 avg=245,472


With warm-up: 0.24s (max 0.25, min 0.23)
Code:
Accelerator Kernel Timing data
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    9: region entered 2 times
        time(us): total=80,044 init= region=80,043
                  kernels=79,892
        w/o init: total=80,043 max=40,088 min=39,955 avg=40,021
        12: kernel launched 2 times
            grid: [128x2048]  block: [64x4]
            time(us): total=79,892 max=39,953 min=39,939 avg=39,946
/home/lechat/tcc/transpose/src/trans_acc.c
  trans
    6: region entered 2 times
        time(us): total=563,089 init=82,602 region=480,487
                  data=398,955
        w/o init: total=480,487 max=242,329 min=238,158 avg=240,243



My questions are:
Is there something I'm not aware?
Is this a standard/expected behaviour?
Is running a warm-up computation the right way to handle this?
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

PostPosted: Mon Oct 22, 2012 12:08 pm    Post subject: Reply with quote

Hi Lechat,

What you're seeing is the cost to copy and load the kernel itself over to the GPU. When you call the same kernel repeatedly in succession, it doesn't need to copy it over again.

Hope this explains what's going on.
- Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group