PGI User Forum

CUDA-x86.

quite puzzled
KevinWoo



Joined: 08 Aug 2012
Posts: 19

Posted: Wed Oct 17, 2012 7:13 am    Post subject: Thanks a lot

Dear Mat,

Thank you very much for sending me the double versions.

I've tried version 1 and it told me that:
6112, Accelerator restriction: induction variable live-out from loop: i
Accelerator restriction: induction variable live-out from loop: k
6113, Accelerator restriction: induction variable live-out from loop: i
Accelerator restriction: induction variable live-out from loop: k
......
Even after some tests, I don't know how to add the "[do private]" directive.

Version 2 is the code I am using. I was supposed to put the !$acc data outside. It seems like the !$acc kernel loop should be !$acc kernels.
mkcolg



Joined: 30 Jun 2004
Posts: 6129
Location: The Portland Group Inc.

Posted: Wed Oct 17, 2012 9:33 am

Quote:
I've tried version 1 and it told me that:
6112, Accelerator restriction: induction variable live-out from loop: i
Accelerator restriction: induction variable live-out from loop: k
6113, Accelerator restriction: induction variable live-out from loop: i
Accelerator restriction: induction variable live-out from loop: k

I'd need to see the full code to tell why, but most likely you're using i and k later in the program without initializing them (or in a conditional branch). To work around this, add the private clause.
Code:
!$acc loop collapse(2) gang private(i,k)
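
For context, here is a minimal sketch of where that directive might sit. The original code is not shown in this thread, so the program name, the array b, the sizes ni and nk, and the loop body are all hypothetical stand-ins; only the !$acc loop line is taken from the post above.

```fortran
program private_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ni = 64, nk = 32   ! hypothetical sizes
  real(dp) :: b(ni, nk)                    ! hypothetical output array
  integer :: i, k

  !$acc kernels copyout(b)
  ! private(i,k) makes the loop indices private to the accelerator loop,
  ! so their final values do not have to be carried back out of it
  ! (an assumed reading of the "live-out" message).
  !$acc loop collapse(2) gang private(i,k)
  do k = 1, nk
     do i = 1, ni
        b(i, k) = real(i, dp) + real(k, dp)
     end do
  end do
  !$acc end kernels

  ! i and k are deliberately not read after the loop nest.
  print *, b(1, 1), b(ni, nk)
end program private_sketch
```

Without accelerator flags the directives are treated as ordinary comments, so the sketch should also build and run as plain Fortran.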


Quote:
I was supposed to put the !$acc data outside.
Yes, you want the data region above the outermost loop so that you don't repeatedly copy data over for each iteration of i and k. However, I made a mistake by adding the "A" arrays in version 2. They are updated in host code, so they shouldn't be part of the data region.

Quote:
It seems like the !$acc kernel loop should be !$acc kernels.
Either works. You just need to remember to add the end kernels directive at the end of the n loop.
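
The two points above (data region outside the outer loop, and a kernels region closed after the n loop) could be laid out roughly as follows. This is only a sketch under assumptions: the arrays a, b, c, the sizes npts and nouter, and the arithmetic are hypothetical, and passing the host-updated array with copyin on each kernels region is just one possible way to keep it out of the data region.

```fortran
program data_region_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: npts = 1000, nouter = 10   ! hypothetical sizes
  real(dp) :: a(npts), b(npts), c(npts)            ! hypothetical arrays
  integer :: i, n

  b = 2.0_dp
  c = 0.0_dp

  ! Data region above the outermost loop: b is copied to the device once,
  ! and c is copied in and out once, instead of on every pass of the i loop.
  !$acc data copyin(b) copy(c)
  do i = 1, nouter
     ! a is updated in host code on every pass, so it is not listed in the
     ! data region; it is sent with each kernels region instead.
     a = real(i, dp)
     !$acc kernels copyin(a)
     do n = 1, npts
        c(n) = c(n) + a(n) * b(n)
     end do
     ! The end directive closes the kernels region right after the n loop.
     !$acc end kernels
  end do
  !$acc end data

  print *, c(1)
end program data_region_sketch
```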

- Mat
KevinWoo




Posted: Fri Nov 23, 2012 5:43 am    Post subject: how come

To get familiar with the acc directives, I've tried adding them to a couple of test programs. Some succeeded, running at almost the same speed as my GPU code (CUDA Fortran), while others failed.
I thought I had found the problem, but unfortunately I am still struggling to understand why and how. Here is an example:
Case 1
  region entered 1638 times
    time(us): total=653,108 init=138 region=652,970 ---A---
              kernels=601,373 ---B---
    w/o init: total=652,970 max=496 min=393 avg=398
    68: kernel launched 1638 times
      grid: [2x32] block: [64x4]
      time(us): total=601,373 max=377 min=361 avg=367

Case 2
  region entered 360 times
    time(us): total=16,644 init=25 region=16,619 ---C---
              kernels=3,954 ---D---
    w/o init: total=16,619 max=190 min=43 avg=46
    2140: kernel launched 360 times
      grid: [1-52] block: [128]
      time(us): total=3,954 max=18 min=10 avg=10

As you can see, in Case 1 the kernels time is nearly equal to the region time (marked A and B), while in Case 2 the region time (C) is about 4 times larger than the kernels time (D). Another difference that may matter is the grid message: one is a fixed [2x32] and the other ranges over [1-52].

Since I added all of those directives myself, I did not expect such different results, and I just can't figure it out.
mkcolg




Posted: Mon Nov 26, 2012 10:55 am

Hi Kevin,

The "region" time is measured from the CPU, while the "kernels" time is measured on the device. The difference between the two is basically the overhead of launching the kernel. The region figure already excludes init, so for Case 1 the overhead per kernel launch is ~31.5 us: (652,970 - 601,373) / 1638. In Case 2 the overhead is just a bit higher at ~35.2 us: (16,619 - 3,954) / 360.

The main difference between Case 1 and Case 2 is that the average kernel time is much smaller in Case 2 (10 us versus 367 us), so the launch overhead dominates the total time.
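
The arithmetic can be reproduced with a small sketch like the one below. It assumes the overhead is simply region minus kernels, since in both profiles total = init + region, so the region figure already excludes init. The program and subroutine names are made up for illustration; the numbers are copied from the profile output earlier in the thread.

```fortran
program launch_overhead
  implicit none
  integer, parameter :: dp = kind(1.0d0)

  ! region time, kernels time (both in microseconds), and launch count,
  ! taken from the two profiles posted in this thread
  call report('Case 1', 652970, 601373, 1638)
  call report('Case 2', 16619, 3954, 360)

contains

  ! Prints the per-launch overhead and the share of the region time that
  ! is overhead rather than kernel execution.
  subroutine report(label, region_us, kernels_us, launches)
    character(len=*), intent(in) :: label
    integer, intent(in) :: region_us, kernels_us, launches
    real(dp) :: overhead_us, per_launch_us, share_pct

    overhead_us   = real(region_us - kernels_us, dp)
    per_launch_us = overhead_us / real(launches, dp)
    share_pct     = 100.0_dp * overhead_us / real(region_us, dp)

    print '(a, ": ", f6.2, " us per launch, ", f5.1, "% of region time")', &
         label, per_launch_us, share_pct
  end subroutine report

end program launch_overhead
```

The per-launch cost comes out at roughly 31.5 us for Case 1 and 35.2 us for Case 2, but it accounts for only about 8% of the region time in Case 1 versus about 76% in Case 2, which is why the two profiles look so different.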

- Mat
KevinWoo




Posted: Mon Nov 26, 2012 5:20 pm    Post subject: Thank you very much

Dear Mat,

That is it. I see your point there and I know how to do it now.

Thank you very much.
All times are GMT - 7 Hours