PGI User Forum


Some troubles with kernel generation in OpenACC
paokara



Joined: 06 Feb 2011
Posts: 24

Posted: Thu Jan 24, 2013 8:37 am    Post subject: Some troubles with kernel generation in OpenACC

Hi,

I am having some trouble with my program. The structure of my code is as follows (in Fortran):

Code:

main function
.
call function1(input arrays)
.
end main

function1(input arrays)
.
!$acc data copy(input and output arrays) , present_or_create(internal arrays)
!$acc kernels

!Then follow about 5 or 6 1D loops
!$acc loop
do i=1.etc
.
.
.

!$acc loop
do i=1.etc
...

! And here is the problem: the first 2D loop
!$acc loop independent gang
  do i=1,N
!$acc loop independent gang vector
    do j=1,M
      ! independent calculations...
    enddo
  enddo


!and then again follow 1D loops
!$acc loop
do i=1.etc
.
.
.

!$acc end kernels
!$acc end data
end function1




My problem is that when I compile my code, I get the correct parallelization for all the 1D loops, for example:
108, Loop is parallelizable
Accelerator kernel generated
108, !$acc loop gang, vector(128) ! blockidx%x threadidx%x

but for the 2D loop the compiler gives me the same parallelization as for the 1D loops.

For example, I expect something like:
57, Loop is parallelizable <--REFERS TO I
59, Loop is parallelizable <-- REFERS TO J
Accelerator kernel generated
57, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
59, !$acc loop gang ! blockidx%y
CC 1.3 : 35 registers; 100 shared, 8 constant, 0 local memory bytes
CC 2.0 : 38 registers; 0 shared, 156 constant, 0 local memory bytes

but I get:
189, Loop is parallelizable <--REFERS TO I
Accelerator kernel generated
189, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
CC 1.3 : 30 registers; 112 shared, 56 constant, 0 local memory bytes
CC 2.0 : 41 registers; 0 shared, 276 constant, 0 local memory bytes
191, Loop is parallelizable <-- REFERS TO J

And when I finally time my program, I see that this 2D loop runs sequentially rather than in parallel.

Now the strange part. If I apply OpenACC only to the part with the 2D loop,
for example:
Code:

function1(input arrays)
.
do i=1.etc
.
.
.


!$acc data copy(input and output arrays) , present_or_create(internal arrays)
!$acc kernels
!$acc loop independent
  do i=1,N
!$acc loop independent
    do j=1,M
      ! independent calculations...
    enddo
  enddo
!$acc end kernels
!$acc end data

do i=1.etc
.
.
.
end function1




I get the correct parallelization. I can't figure out what is going on.

Thanks, Sotiris
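
A minimal, self-contained sketch of the isolated 2D nest may make the comparison easier to reproduce. It does not diagnose the problem described above; it only shows a nest with the outer loop scheduled across gangs and the inner loop across vector lanes. The program name `nest2d`, the array `a`, and the sizes `n` and `m` are hypothetical and do not come from the original code. Assuming the PGI compiler, building with `pgfortran -acc -Minfo=accel` prints the schedule chosen for each loop line.

```fortran
program nest2d
  implicit none
  ! Hypothetical sizes and array, chosen only for illustration.
  integer, parameter :: n = 512, m = 256
  real :: a(m, n)
  integer :: i, j

  a = 0.0

  !$acc data copy(a)
  !$acc kernels
  ! Outer loop distributed across gangs, inner loop across vector lanes.
  !$acc loop independent gang
  do i = 1, n
    !$acc loop independent vector
    do j = 1, m
      ! Stand-in for the independent calculations.
      a(j, i) = real(i) + real(j)
    end do
  end do
  !$acc end kernels
  !$acc end data

  print *, 'a(1,1) =', a(1, 1), ' a(m,n) =', a(m, n)
end program nest2d
```

The inner loop runs over the first array index because Fortran stores arrays in column-major order, so adjacent vector lanes touch adjacent memory locations.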
mkcolg



Joined: 30 Jun 2004
Posts: 5871
Location: The Portland Group Inc.

Posted: Thu Jan 24, 2013 2:46 pm

Hi Sotiris,

Sorry, I can't tell what the issue is from what you posted. Can you please post, or send to PGI Customer Service (trs@pgroup.com), a reproducing example which illustrates the issue?

Thanks,
Mat
paokara



Joined: 06 Feb 2011
Posts: 24

Posted: Fri Jan 25, 2013 1:27 am

Hi Mat and thank you for the help you provide us,

I will post an example of that issue, but let me ask you something else first (about a different version of my code). Suppose I use the first case (all parts in parallel):

Code:

main function
.
call function1(input arrays)
.
end main

function1(input arrays)
.
!$acc data copy(input and output arrays) , present_or_create(internal arrays)
!$acc kernels

!Then follow about 5 or 6 1D loops
!$acc loop
do i=1.etc
.
.
.

!$acc loop
do i=1.etc
...

! And here is the problem: the first 2D loop
!$acc loop independent gang
  do i=1,N
!$acc loop independent gang vector
    do j=1,M
      ! independent calculations...
    enddo
  enddo


!and then again follow 1D loops
!$acc loop
do i=1.etc
.
.
.

!$acc end kernels
!$acc end data
end function1


But at the point where the double loop is placed, I call another function, function2(input arrays), of the form:


Code:

function2

!$acc kernels
!$acc loop independent gang
  do i=1,N
!$acc loop independent gang vector
    do j=1,M
      ! independent calculations...
    enddo
  enddo
!$acc end kernels


and inside my data region I call function2:

Code:

call function2(input_arrays)


In that case the compiler gives me the CORRECT parallelization (2D grid), BUT now I have another problem. Whenever function2 is called, data is moved from host to device and back, and of course I don't need that because I have already placed my data on the device at the beginning of function1. In this case my program is correct (correct results), but all the time is spent in data movement. Is there a possible solution for that?

Thank you,
Sotiris
paokara



Joined: 06 Feb 2011
Posts: 24

Posted: Fri Jan 25, 2013 2:48 am

Mat, forget my previous post. I solved the problem by adding

Code:

present_or_create(my arrays)


inside function2(). So I avoid data movement, I get correct results, and my program works fine with the desired parallelization and execution time. As for my first post and the situation described there, I will post the exact problem in the near future to give you a general idea.

Thanks,
Sotiris
mkcolg



Joined: 30 Jun 2004
Posts: 5871
Location: The Portland Group Inc.

Posted: Mon Jan 28, 2013 11:31 am

Hi Sotiris,

If I understand this correctly, you have a function (function1) which contains an OpenACC data region and a compute region. Near the 2D nested loop, you have a function call (to function2) where you want to use the same data from the local arrays (the ones in the outer data region's create clause).

Since data regions can span across multiple compute regions as well as host code including calls, you can have access to function1's device data from within function2 provided that function2 is called from within a data region that created the array and you use one of the "present" clauses to tell the compiler that the data is already on the device.

- Mat
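
The arrangement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original code: the module name `work_mod`, the array `a`, the sizes `n` and `m`, and the loop bodies are all hypothetical. `function1` owns the data region, and `function2` is called from host code inside that region and marks the array with `present`, so its compute region should reuse the device copy instead of transferring it.

```fortran
module work_mod
  implicit none
contains

  ! Hypothetical stand-in for function2: the compute region declares the
  ! array as already present on the device, so no copy is requested here.
  subroutine function2(a, n, m)
    integer, intent(in) :: n, m
    real, intent(inout) :: a(m, n)
    integer :: i, j

    !$acc kernels present(a)
    !$acc loop independent gang
    do i = 1, n
      !$acc loop independent vector
      do j = 1, m
        a(j, i) = 2.0 * a(j, i)
      end do
    end do
    !$acc end kernels
  end subroutine function2

  ! Hypothetical stand-in for function1: owns the data region.
  subroutine function1(a, n, m)
    integer, intent(in) :: n, m
    real, intent(inout) :: a(m, n)
    integer :: i, j

    !$acc data copy(a)
    !$acc kernels
    !$acc loop independent gang
    do i = 1, n
      !$acc loop independent vector
      do j = 1, m
        a(j, i) = 1.0
      end do
    end do
    !$acc end kernels

    ! Host-side call made inside the data region; a stays on the device.
    call function2(a, n, m)
    !$acc end data
  end subroutine function1

end module work_mod

program present_demo
  use work_mod
  implicit none
  integer, parameter :: n = 64, m = 32
  real :: a(m, n)

  a = 0.0
  call function1(a, n, m)
  print *, 'a(1,1) =', a(1, 1)
end program present_demo
```

With `present`, calling `function2` outside an enclosing data region is a runtime error. The `present_or_create` clause used earlier in the thread instead allocates device memory when the array is not already there, without copying any values, while `present_or_copy` would also transfer the data in that case.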
All times are GMT - 7 Hours