PGI User Forum


Complex loop carried dependence of 'd'
joshua_pedrick



Joined: 23 Sep 2009
Posts: 4

Posted: Thu Sep 24, 2009 2:04 pm    Post subject: Complex loop carried dependence of 'd'

I'm attempting to compile the following code:

!$ACC REGION
!$ACC& LOCAL(loc1, loc2, nd, nd2, k, knd, total, zero)
!$ACC& COPYIN(a(:), ia(neq + 1), b(neq), ja(:))
!$ACC& COPY(d(neq))
      do nd = 1, neq
         loc1 = ia(nd) - 1
         loc2 = ia(nd+1) - 1
         knd = loc2 - loc1
!        total = b(nd)
         do k = 1, knd
            nd2 = ja(loc1+k)
            d(nd2) = d(nd2) + a(loc1+k)*b(nd)
         end do
      end do
!$ACC END REGION

I get the following errors from the accelerator:

67, Loop carried scalar dependence for 'total' at line 67
Scalar last value needed after loop for 'total' at line 69
73, No parallel kernels found, accelerator region ignored
77, Complex loop carried dependence of 'd' prevents parallelization
82, Complex loop carried dependence of 'd' prevents parallelization

I'm not really sure how to overcome this...
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Thu Sep 24, 2009 5:30 pm

Hi Joshua,

The problem is that nd2 can have the same value in multiple threads. The value stored in "d(nd2)" then depends on whichever thread stores last, which gives non-deterministic results. What you really need the compiler to do here is create a private copy of 'd' for each thread and then perform a summation at the end of the region. We're adding this support, but it won't be available until the next major release.
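
As a rough illustration of the "private copy plus summation" idea, here is a minimal sketch that uses an OpenMP array reduction on the CPU. This is not the PGI Accelerator feature mentioned above, only a stand-in technique with the same shape: each thread accumulates into its own copy of d, and the copies are summed when the loop ends. The small CSR-style matrix and the names d_ref and nnz are made up for the example.

```fortran
program private_copy_reduction
  implicit none
  ! Hypothetical test data: 3 rows, 5 stored nonzeros, CSR-style indexing
  integer, parameter :: neq = 3, nnz = 5
  integer :: ia(neq+1) = (/ 1, 3, 4, 6 /)
  integer :: ja(nnz)   = (/ 1, 2, 2, 1, 3 /)
  real    :: a(nnz)    = (/ 1.0, 2.0, 3.0, 4.0, 5.0 /)
  real    :: b(neq)    = (/ 1.0, 2.0, 3.0 /)
  real    :: d(neq), d_ref(neq)
  integer :: nd, k, nd2

  ! Serial reference result
  d_ref = 0.0
  do nd = 1, neq
    do k = ia(nd), ia(nd+1) - 1
      nd2 = ja(k)
      d_ref(nd2) = d_ref(nd2) + a(k)*b(nd)
    end do
  end do

  ! Same loop nest; reduction(+:d) gives every thread a private,
  ! zero-initialized copy of d and adds the copies together at the end
  d = 0.0
  !$omp parallel do private(k, nd2) reduction(+:d)
  do nd = 1, neq
    do k = ia(nd), ia(nd+1) - 1
      nd2 = ja(k)
      d(nd2) = d(nd2) + a(k)*b(nd)
    end do
  end do
  !$omp end parallel do

  print *, 'd     =', d
  print *, 'd_ref =', d_ref
  if (any(abs(d - d_ref) > 1.0e-5)) error stop 'reduction differs from serial result'
end program private_copy_reduction
```

Built without an OpenMP flag the directives are ordinary comments and the loop simply runs serially; with one (for example -mp with the PGI compilers or -fopenmp with gfortran) the two results should still agree.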

A second issue is that your loops are triangular, meaning that the inner loop bound is calculated within the body of the outer loop ("knd = loc2 - loc1"). GPUs can only work with rectangular loops. To work around this, set the inner loop bound to the maximum value of knd and use an if statement to skip the body when k > knd. For example:
Code:

knd = loc2 - loc1
do k = 1, max_knd
   if (k .le. knd) then
        ..
   end if
end do
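
Filled in for the loop from the original post, the workaround might look like the sketch below. It runs serially on the host and only shows the loop shape; the repeated updates to d(nd2) are still there. The test data are made up, and computing max_knd with maxval over the row lengths is one possible choice, not something the compiler requires.

```fortran
program rectangular_inner_loop
  implicit none
  ! Hypothetical test data: 3 rows, 5 stored nonzeros, CSR-style indexing
  integer, parameter :: neq = 3, nnz = 5
  integer :: ia(neq+1) = (/ 1, 3, 4, 6 /)
  integer :: ja(nnz)   = (/ 1, 2, 2, 1, 3 /)
  real    :: a(nnz)    = (/ 1.0, 2.0, 3.0, 4.0, 5.0 /)
  real    :: b(neq)    = (/ 1.0, 2.0, 3.0 /)
  real    :: d(neq)
  integer :: nd, k, nd2, loc1, knd, max_knd

  ! Longest row, computed once outside the loop nest
  max_knd = maxval(ia(2:neq+1) - ia(1:neq))

  d = 0.0
  do nd = 1, neq
    loc1 = ia(nd) - 1
    knd  = ia(nd+1) - ia(nd)      ! same value as loc2 - loc1
    do k = 1, max_knd             ! fixed bound: the nest is rectangular
      if (k .le. knd) then        ! skip the padding iterations
        nd2 = ja(loc1+k)
        d(nd2) = d(nd2) + a(loc1+k)*b(nd)
      end if
    end do
  end do

  print *, 'max_knd =', max_knd
  print *, 'd       =', d
end program rectangular_inner_loop
```

The cost of this form is wasted iterations on short rows, so it tends to suit matrices whose rows have similar lengths.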


The statement "!$ACC& COPY(d(neq))" tells the compiler to copy in and out only a single element of d. I'm assuming that you want the entire array, so you should use "d(:)".
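
To make the difference concrete, here is a small sketch on a made-up, trivially parallel loop (not the sparse loop from the original post). It uses the whole-array form in the COPY clause and prints the section sizes using standard Fortran, where d(neq) names one element and d(:) names all of them. Writing the clause on the REGION line in free-form source is an assumption of this sketch; without accelerator support enabled the directives are plain comments.

```fortran
program whole_array_copy
  implicit none
  integer, parameter :: neq = 4
  real    :: d(neq)
  integer :: i

  d = 1.0

  ! d(:) covers every element; d(neq:neq) covers only the last one
  print *, 'elements in d(:)       =', size(d(:))
  print *, 'elements in d(neq:neq) =', size(d(neq:neq))

  ! Whole-array copy in and out of the region
  !$ACC REGION COPY(d(:))
  do i = 1, neq
    d(i) = 2.0*d(i)
  end do
  !$ACC END REGION

  print *, 'd =', d
end program whole_array_copy
```

By the same reasoning, "ia(neq + 1)" and "b(neq)" in the COPYIN clause presumably also name single elements, so they may need the same change.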

Scalars are always treated as "LOCAL", so there is no need for the LOCAL clause here. It doesn't hurt; it's just redundant.

Finally, you may want to back up and first evaluate whether this section of code is worthwhile to send to the GPU. I see a lot of memory movement and little computation, so I'd guess that the computational intensity of this loop is less than 1. I like to see an intensity of at least 4 before attempting to accelerate a region, and prefer 10. It might be helpful to walk through the benchAMD tutorial that I wrote (see: http://www.pgroup.com/lit/articles/insider/v1n2a4.htm). It describes the process of determining the computational intensity of your loops.
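
A back-of-the-envelope version of that estimate for this loop is sketched below. It assumes intensity means arithmetic operations divided by array elements moved between host and GPU, which may not match the exact definition used in the tutorial. The sizes neq and nnz are hypothetical placeholders.

```fortran
program intensity_estimate
  implicit none
  ! Hypothetical problem size; substitute the real row and nonzero counts
  integer, parameter :: neq = 1000, nnz = 5000
  integer :: flops, moved
  real    :: intensity

  ! One multiply and one add per stored nonzero
  flops = 2*nnz

  ! Elements moved: a and ja (nnz each), ia (neq+1), b (neq),
  ! and d copied both in and out (2*neq)
  moved = 2*nnz + (neq + 1) + neq + 2*neq

  intensity = real(flops) / real(moved)

  print *, 'operations          =', flops
  print *, 'elements moved      =', moved
  print *, 'estimated intensity =', intensity
end program intensity_estimate
```

Under that assumption the estimate stays below 1 for any size, because the two nnz-length arrays alone already match the operation count.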

Hope this helps,
Mat
joshua_pedrick



Joined: 23 Sep 2009
Posts: 4

Posted: Fri Sep 25, 2009 3:16 pm    Post subject: Thanks

Thanks Mat,
I think I can move the accelerated region up a level and get a lot more work done with less copying (this example is part of an iterative solver for a sparse matrix). I'll look forward to the next major release. :)

Regards,
-Joshua
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Fri Sep 25, 2009 4:08 pm

Hi Joshua,

Great. If you can push the parallelization out and make these loops serial, then it might map to the GPU better.

- Mat
joshua_pedrick



Joined: 23 Sep 2009
Posts: 4

Posted: Tue Sep 29, 2009 11:42 am    Post subject: Complex loop carried dependence of 'd'

Quote:
What you really need the compiler to do here is to create a private copy of 'd' for each thread and then perform a summation at the end of region. While we're adding this support, it won't be available until the next major release.


I'm finding that I'm simply not able to get this code to work without the summation option. Any idea when the next major release might be available?
All times are GMT - 7 Hours