PGI User Forum


"Unexpected flow graph", "exposed use"

 
PGI User Forum Forum Index -> Accelerator Programming
dcwarren



Joined: 18 Jun 2012
Posts: 29

Posted: Thu Jan 31, 2013 9:21 pm    Post subject: "Unexpected flow graph", "exposed use"

I have a bunch of questions this time.

In attempting to compile an OpenACC code, I'm getting a message telling me that the compiler failed to translate the accelerator region due to an "Unexpected flow graph". I think I understand in broad terms what this means, but I would appreciate a more specific explanation.

The same set of compiler outputs contains repeated mentions of
Code:
"Loop carried dependence due to exposed use of [array] prevents parallelization"
My first interpretation was that multiple threads were trying to update the same array, something that could be handled with atomics in CUDA. An alternative to atomics, which I implemented in the accelerator code, was to (1) create a special array just for the accelerator region, (2) zero it out before the OpenACC kernel, (3) perform a sum reduction over the special array after the kernel, and (4) add it back to the global array. However, that returns the same error message. So what is responsible for the message?

Lastly, there are several references to "Accelerator restriction: induction variable live-out from loop: i". Some of these line numbers point to loops where the induction variable has been declared private; this suggests I don't understand how the private declaration works, or what a live-out variable is. There are weirder instances of this message, though: sometimes it points to subroutine calls that don't use that induction variable (edit to add: the subroutine is being inlined; I know OpenACC doesn't handle subprogram calls right now). What's going on there?

Thanks for any/all the advice you can give.
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Fri Feb 01, 2013 10:29 am

Quote:
"Unexpected flow graph"
This is a compiler error. Can you please send PGI Customer Service (trs@pgroup.com) a reproducing example?

Quote:
"Loop carried dependence due to exposed use of [array] prevents parallelization". So what is responsible for the message?
This means that one or more of the array's elements is being written to by more than one thread.

For your special array, have you manually privatized it (i.e., added extra dimensions for each level of parallelization)? Or is every iteration of the loop using the same elements of the array (i.e., it's a local scratch array)?

Quote:
perform a sum reduction over the special array after the kernel,
Ideally, you want to use scalars for reductions so that you can utilize the "reduction" clause with a "loop" directive.
Quote:
or what a live-out variable is.
A "live-out" scalar is one whose value must be stored back to memory for later use. The problem is: which thread's value gets stored back?
The obvious cases are when the variable is used on the right-hand side of an expression, or in a subroutine call, after the end of the compute region. It can also occur when the variable has static storage (i.e., it has the SAVE attribute, is used in a contained routine, is a module variable, or is passed in as an argument). Branching can also create cases where the variable may or may not be assigned, causing it to remain "live" (i.e., its value is still needed).

Hope this helps,
Mat
dcwarren



Joined: 18 Jun 2012
Posts: 29

Posted: Fri Feb 01, 2013 1:05 pm

mkcolg wrote:
For your special array, have you manually privatized it (i.e., added extra dimensions for each level of parallelization)? Or is every iteration of the loop using the same elements of the array (i.e., it's a local scratch array)?

I'm treating it as a local scratch array (that is, I'm trying to give every thread its own copy of the array, then reduce all the copies), but it occurs to me that far more than 64 kB of memory would be in use with any significant number of threads per block (apologies for the CUDA terms, but that's obviously the conceptual framework I'm coming from). This probably means the compiler is shifting all those arrays back into device global memory, which will likely cause slowdowns as various threads work with non-contiguous chunks of global memory.

I think adding an extra dimension for threads -- and then doing a reduction along that dimension -- would work best, but how can I get OpenACC to do that? With CUDA, it'd be a cinch (pseudocode-wise, at least).

mkcolg wrote:
A "live-out" scalar is one whose value must be stored back to memory for later use. The problem is: which thread's value gets stored back?

That matches my working definition of "live-out", and it means the OpenACC compiler isn't parsing my code the way I expect. Almost all of the messages I'm getting have to do with induction variables or variables that *should* be local to the accelerator region.

Edit: I suppose trying to diagnose the phantom "live-out" messages -- those pointing to a line that didn't involve the variable mentioned -- would require seeing the code.

Quote:
Hope this helps,
Mat

I'm appreciative of all the answers you're giving. And hopefully future Google-searchers are as well.
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Fri Feb 01, 2013 3:16 pm

Quote:
Edit: I suppose trying to diagnose the phantom "live-out" messages -- those pointing to a line that didn't involve the variable mentioned -- would require seeing the code.
Feel free to send the code to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me. I'll see what I can find out.

Quote:
I think adding an extra dimension for threads -- and then doing a reduction along that dimension -- would work best, but how can I get OpenACC to do that?
The extra dimension would correspond to the loop's iteration count, which in turn gets translated into blocks and threads. Granted, there's not necessarily a one-to-one correspondence, so you may waste some memory. You can also use the "private" clause on the gang (block) or vector (thread) loop, in which case only the needed number of copies of the array is created. You save some memory but lose some explicit control.

Quote:
I'm appreciative of all the answers you're giving. And hopefully future Google-searchers are as well.
You're welcome!

- Mat