PGI User Forum

Loop carried reuse prevents parallelization

 
TheMatt



Joined: 06 Jul 2009
Posts: 322
Location: Greenbelt, MD

Posted: Tue Sep 01, 2009 8:19 am

As I'm trying to learn to rewire my brain for parallel thinking, I've been trying various things to reduce the number of "loop carried dependence", "loop carried reuse" and other issues reported by -Minfo=accel. One particular loop has been stymieing me, so I'm coming here to try and figure it out.

To wit, the loop:
Code:
217       do i=1,m
218        do k=0,np
219         fsdir(i)=tda(i,k,2)
220        enddo
221       enddo
where those are line numbers, not statement labels.

By the time the code gets to here, tda has been constructed, and fsdir has not appeared anywhere else (and never does again). Also, tda(:,:,:) is local to the whole !$acc region and fsdir(:) is copyout.

When the compiler gets here it says:
Code:
    217, Loop is parallelizable
    218, Loop carried reuse of fsdir prevents parallelization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
        217, !$acc do parallel, vector(256)
             Using register for 'fsdir'
        218, !$acc do seq
I guess I'm confused as to why both loops are not scheduled as "parallel, vector(16)", as I'm used to seeing in cases like this. Is it because fsdir(:) is a copyout array and as such has internal restrictions regarding memory layout or the like? (And, of course, it may be that this is faster than the 16x16 method; I'm just wondering about that 'loop carried reuse' issue.)
mkcolg



Joined: 30 Jun 2004
Posts: 6218
Location: The Portland Group Inc.

Posted: Thu Sep 03, 2009 9:45 am

Hi Matt,

The outer "i" loop is being parallelized. The inner loop is not, however, because every iteration of the k loop assigns to the same element of fsdir (i.e. loop carried reuse). If the k loop were parallelized, all "k" threads would try to assign their values to the same spot, leading to non-deterministic results. To parallelize the k loop, you'll need to make fsdir a two-dimensional array.
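
As a minimal sketch of the point above (an illustration, not code from the original program): run sequentially, the loop as posted keeps only the last write, so fsdir(i) ends up equal to tda(i,np,2). The module and subroutine names below are hypothetical, and the array shapes are assumed from the posted loop bounds. The second routine shows the two-dimensional variant, in which each (i,k) pair has its own element, so no two iterations write to the same location.

```fortran
module fsdir_sketch
  implicit none
contains

  ! Loop as posted: every k iteration overwrites the same fsdir(i),
  ! so when run sequentially only the last write (k = np) survives.
  subroutine fsdir_as_posted(tda, m, np, fsdir)
    integer, intent(in)  :: m, np
    real,    intent(in)  :: tda(m, 0:np, 2)
    real,    intent(out) :: fsdir(m)
    integer :: i, k
    do i = 1, m
      do k = 0, np
        fsdir(i) = tda(i, k, 2)
      end do
    end do
  end subroutine fsdir_as_posted

  ! Two-dimensional variant (hypothetical name fsdir2): each (i,k)
  ! iteration writes a distinct element, so neither loop carries reuse.
  subroutine fsdir_two_dim(tda, m, np, fsdir2)
    integer, intent(in)  :: m, np
    real,    intent(in)  :: tda(m, 0:np, 2)
    real,    intent(out) :: fsdir2(m, 0:np)
    integer :: i, k
    do i = 1, m
      do k = 0, np
        fsdir2(i, k) = tda(i, k, 2)
      end do
    end do
  end subroutine fsdir_two_dim

end module fsdir_sketch
```

If the last value really is all that is wanted, the k loop could be dropped entirely in favor of fsdir(i) = tda(i,np,2); whether that matches the intent of the original code is an assumption.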

Note that we are working on adding support for reductions within accelerator regions. My guess is that your code is more like "fsdir(i) = fsdir(i) + tda(i,k,2)", in which case we should be able to parallelize the inner loop once this support has been added.
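
For illustration only, here is a sketch of the summation form guessed at above, written with a scalar accumulator. The directives use OpenACC syntax, which postdates this thread and is not the PGI Accelerator syntax discussed here; treat them as an assumption about how such a reduction might be expressed. Without OpenACC enabled at compile time they are plain comments. The module, subroutine, and variable names are hypothetical.

```fortran
module fsdir_reduction_sketch
  implicit none
contains

  ! Sums tda(i,k,2) over k for each i. The scalar s is private to each
  ! i iteration and the k loop is a sum reduction into it, so all k
  ! contributions are combined instead of overwriting one another.
  subroutine fsdir_sum(tda, m, np, fsdir)
    integer, intent(in)  :: m, np
    real,    intent(in)  :: tda(m, 0:np, 2)
    real,    intent(out) :: fsdir(m)
    real    :: s
    integer :: i, k

    !$acc parallel loop copyin(tda) copyout(fsdir)
    do i = 1, m
      s = 0.0
      !$acc loop reduction(+:s)
      do k = 0, np
        s = s + tda(i, k, 2)
      end do
      fsdir(i) = s
    end do
  end subroutine fsdir_sum

end module fsdir_reduction_sketch
```

In serial execution the result equals sum(tda(i,:,2)) for each i; in parallel, the order of floating-point additions may differ, so the last bits of the result can vary.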

- Mat
TheMatt



Joined: 06 Jul 2009
Posts: 322
Location: Greenbelt, MD

Posted: Thu Sep 03, 2009 12:06 pm

You know, you are right and that is actually what the code is doing (in some ways). Guess I've found a place to redo a bit of coding!

Thanks,
Matt
All times are GMT - 7 Hours