paokara
Joined: 06 Feb 2011 Posts: 19
|
Posted: Thu Jan 24, 2013 8:37 am Post subject: Some troubles with kernel generation in OpenACC |
|
|
Hi,
I am having some trouble with my program. The structure of my code (in Fortran) is the following:
| Code: |
main function
  .
  call function1(input arrays)
  .
end main

function1(input arrays)
  .
  !$acc data copy(input and output arrays), present_or_create(internal arrays)
  !$acc kernels
  ! Then follow about 5 or 6 1D loops
  !$acc loop
  do i=1.etc
  .
  .
  .
  !$acc loop
  do i=1.etc
  ...
  ! And here is the problem: the first 2D loop
  !$acc loop independent gang
  do i=1,N
    !$acc loop independent gang vector
    do j=1,M
      independent calculations..
    enddo
  enddo
  ! and then again follow 1D loops
  !$acc loop
  do i=1.etc
  .
  .
  .
  !$acc end kernels
  !$acc end data
end function1
|
My problem is that when I compile my code, I get the correct parallelization for all the 1D loops,
for example:
108, Loop is parallelizable
Accelerator kernel generated
108, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
but for the 2D loop the compiler gives me the same 1D parallelization.
For example, I expect something like:
57, Loop is parallelizable <--REFERS TO I
59, Loop is parallelizable <-- REFERS TO J
Accelerator kernel generated
57, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
59, !$acc loop gang ! blockidx%y
CC 1.3 : 35 registers; 100 shared, 8 constant, 0 local memory bytes
CC 2.0 : 38 registers; 0 shared, 156 constant, 0 local memory bytes
but I get:
189, Loop is parallelizable <--REFERS TO I
Accelerator kernel generated
189, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
CC 1.3 : 30 registers; 112 shared, 56 constant, 0 local memory bytes
CC 2.0 : 41 registers; 0 shared, 276 constant, 0 local memory bytes
191, Loop is parallelizable <-- REFERS TO J
And when I finally time my program, I see that this 2D loop does not run in parallel but sequentially.
Now the strange part: if I apply OpenACC only to the part with the 2D loop,
for example:
| Code: |
function1(input arrays)
  .
  do i=1.etc
  .
  .
  .
  !$acc data copy(input and output arrays), present_or_create(internal arrays)
  !$acc kernels
  !$acc loop independent
  do i=1,N
    !$acc loop independent
    do j=1,M
      independent calculations..
    enddo
  enddo
  !$acc end kernels
  !$acc end data
  do i=1.etc
  .
  .
  .
end function1
|
I get the correct parallelization. I can't figure out what is going on.
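For reference, the isolated variant can be reduced to a small self-contained test case along the following lines. This is only a sketch: the array names a and b, the sizes n and m, and the loop body are made up for illustration and are not taken from the real code. With the PGI compilers it can be built with something like pgfortran -acc -Minfo=accel to see the schedule chosen for each loop.

```fortran
program nest2d
  implicit none
  ! Hypothetical sizes and arrays, chosen only for illustration.
  integer, parameter :: n = 512, m = 256
  real :: a(m, n), b(m, n)
  integer :: i, j

  ! Fill the input on the host.
  do i = 1, n
    do j = 1, m
      b(j, i) = real(i + j)
    end do
  end do

  ! Data region around a single kernels region holding only the 2D nest.
  !$acc data copyin(b) copyout(a)
  !$acc kernels
  !$acc loop independent
  do i = 1, n
    !$acc loop independent
    do j = 1, m
      a(j, i) = 2.0 * b(j, i)   ! each (j, i) element is independent
    end do
  end do
  !$acc end kernels
  !$acc end data

  ! Compare against the same computation done on the host.
  print *, 'max error:', maxval(abs(a - 2.0 * b))
end program nest2d
```

If the directives are ignored (no -acc flag), the program simply runs serially, which makes it easy to compare results and timings.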
Thanks, Sotiris |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Thu Jan 24, 2013 2:46 pm Post subject: |
|
|
Hi Sotiris,
Sorry, I can't tell what the issue is from what you posted. Can you please post, or send to PGI Customer Service (trs@pgroup.com), a reproducing example that illustrates the issue?
Thanks,
Mat |
|
paokara
Joined: 06 Feb 2011 Posts: 19
|
Posted: Fri Jan 25, 2013 1:27 am Post subject: |
|
|
Hi Mat, and thank you for the help you provide us.
I will post an example of that issue, but let me ask you something else first (about a different version of my code). Suppose I use the first case (all parts in parallel):
| Code: |
main function
  .
  call function1(input arrays)
  .
end main

function1(input arrays)
  .
  !$acc data copy(input and output arrays), present_or_create(internal arrays)
  !$acc kernels
  ! Then follow about 5 or 6 1D loops
  !$acc loop
  do i=1.etc
  .
  .
  .
  !$acc loop
  do i=1.etc
  ...
  ! And here is the problem: the first 2D loop
  !$acc loop independent gang
  do i=1,N
    !$acc loop independent gang vector
    do j=1,M
      independent calculations..
    enddo
  enddo
  ! and then again follow 1D loops
  !$acc loop
  do i=1.etc
  .
  .
  .
  !$acc end kernels
  !$acc end data
end function1
|
but at the point where the double loop is placed, I use another function, function2(input arrays), of the form:
| Code: |
function2
  !$acc kernels
  !$acc loop independent gang
  do i=1,N
    !$acc loop independent gang vector
    do j=1,M
      independent calculations..
    enddo
  enddo
  !$acc end kernels
|
and inside my data region I call function2:
| Code: |
call function2(input_arrays)
|
In that case the compiler gives me the CORRECT parallelization (a 2D grid), BUT now I have another problem. Whenever function2 is called, data is moved from host to device and back, and of course I don't need that because I have already placed my data on the device at the beginning of function1. In this case my program is correct (correct results), but all the time is spent in data movement. Is there a possible solution for that?
Thank you,
Sotiris |
|
paokara
Joined: 06 Feb 2011 Posts: 19
|
Posted: Fri Jan 25, 2013 2:48 am Post subject: |
|
|
Mat, forget my previous post. I solved the problem by adding
| Code: |
present_or_create(my arrays)
|
inside function2(). This way I avoid the data movement, I get correct results, and my program works fine with the desired parallelization and execution time. As for my first post and the situation described there, I will post the exact problem in the near future to give you a general idea.
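To make the fix concrete, here is a minimal sketch of the pattern, not the real code. The array names a and tmp, the sizes, and the arithmetic are placeholders, and the choice of present_or_copy for the input/output array is an assumption (only present_or_create was mentioned above). The data region in function1 puts the arrays on the device, and the present_or_* clauses in function2 find them there instead of transferring them again.

```fortran
module work
  implicit none
contains

  subroutine function2(a, tmp, n, m)
    integer, intent(in) :: n, m
    real, intent(inout) :: a(m, n)     ! input/output array (placeholder)
    real, intent(inout) :: tmp(m, n)   ! internal scratch array (placeholder)
    integer :: i, j
    ! Both arrays are already on the device when called from function1,
    ! so these clauses reuse the device copies and move no data.
    !$acc kernels present_or_copy(a) present_or_create(tmp)
    !$acc loop independent
    do i = 1, n
      !$acc loop independent
      do j = 1, m
        tmp(j, i) = a(j, i) + 1.0
        a(j, i) = 2.0 * tmp(j, i)
      end do
    end do
    !$acc end kernels
  end subroutine function2

  subroutine function1(a, n, m)
    integer, intent(in) :: n, m
    real, intent(inout) :: a(m, n)
    real :: tmp(m, n)                  ! automatic array, device-only use
    ! The data region owns the device copies for the whole call.
    !$acc data copy(a) present_or_create(tmp)
    call function2(a, tmp, n, m)
    !$acc end data
  end subroutine function1

end module work

program main
  use work
  implicit none
  integer, parameter :: n = 64, m = 32   ! hypothetical sizes
  real :: a(m, n)
  a = 1.0
  call function1(a, n, m)
  ! Every element should now be 2.0 * (1.0 + 1.0) = 4.0.
  print *, all(a == 4.0)
end program main
```

Because the clauses fall back to copy/create when the data is not present, function2 in this sketch would still work if called outside any data region, only with the extra transfers.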
Thanks,
Sotiris |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Mon Jan 28, 2013 11:31 am Post subject: |
|
|
Hi Sotiris,
If I understand this correctly, you have a function (function1) which contains an OpenACC data region and a compute region. Near the 2D loop nest, you have a function call (to function2) where you want to use the same data from the local arrays (the ones in the outer data region's create clause).
Since data regions can span multiple compute regions as well as host code, including calls, you can access function1's device data from within function2, provided that function2 is called from within the data region that created the arrays and you use one of the "present" clauses to tell the compiler that the data is already on the device.
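As a rough illustration of that point (the routine name scale_values, the array x, and its size are invented for this sketch), one data region below spans a compute region, some host code, and a call whose compute region uses the present clause:

```fortran
module stages
  implicit none
contains
  subroutine scale_values(x, n)      ! hypothetical helper
    integer, intent(in) :: n
    real, intent(inout) :: x(n)
    integer :: i
    ! present: x must already be on the device; no transfer happens here.
    !$acc kernels present(x)
    !$acc loop independent
    do i = 1, n
      x(i) = 2.0 * x(i)
    end do
    !$acc end kernels
  end subroutine scale_values
end module stages

program span
  use stages
  implicit none
  integer, parameter :: n = 1000     ! hypothetical size
  real :: x(n)
  integer :: i

  ! One data region covering everything below.
  !$acc data copyout(x)

  ! First compute region: initialize x on the device.
  !$acc kernels
  do i = 1, n
    x(i) = real(i)
  end do
  !$acc end kernels

  ! Host code inside the data region; x stays on the device meanwhile.
  print *, 'between compute regions'

  ! Second compute region, inside the called routine.
  call scale_values(x, n)

  !$acc end data   ! x is copied back to the host here

  ! x(i) should be 2*i, so this shows 2.0 and 2000.0.
  print *, x(1), x(n)
end program span
```

Unlike the present_or_* variants, a plain present clause is expected to fail at run time if the routine is ever called outside a data region that holds the array, which can be a useful check.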
- Mat |
|