PGI User Forum


How to reduce branch divergence?

 
slee91955



Joined: 31 Jul 2015
Posts: 7

Posted: Wed Sep 16, 2015 10:58 pm    Post subject: How to reduce branch divergence?

Hello.
I have a C code that I am parallelizing with OpenACC (target: NVIDIA GPU).
The code contains many conditional instructions (i.e. if-else).
These create a lot of branches, and CUDA threads end up in the 'inactive' state for over 80% of the total running time.
I have also heard that NVIDIA GPUs have no branch prediction or speculative execution like a CPU does.
So I'm looking for an optimization method to reduce branch divergence and improve code locality.

I can think of two approaches.

1. Use the CUDA grid, block, and thread indices in my code to group related conditions together.
Can I determine the block and thread index at the source-code level with PGI OpenACC?
To try this method, I would need a way to assign work to specific block and thread indices based on the conditions.
e.g.:
Code:
       if (a == 0) { /* run on blockIdx.x == 10, threadIdx.x == 10 */ }
       else        { /* run on blockIdx.x == 20, threadIdx.x == 20 */ }



2. Optimize the conditional code at compile time.
Is there a compiler switch that optimizes conditional code by joining branches together?
(e.g. GNU gcc's -freorder-blocks-and-partition switch)
I'm compiling with -O3 now, but profiling with nvvp shows a lot of time spent on conditionals, which leaves many threads in the 'inactive' state.



Thanks as always for your help. :)
mkcolg



Joined: 30 Jun 2004
Posts: 6697
Location: The Portland Group Inc.

Posted: Thu Sep 17, 2015 8:26 am

Quote:
Can I determine the block and thread index at the source-code level with PGI OpenACC?
No, and even if we did expose them, I'd highly recommend that you *not* use them. The main benefit of OpenACC is performance portability across multiple accelerators; putting in non-portable, target-specific API calls would defeat that benefit.

If this is something you really need to do, I'd suggest writing this particular kernel in CUDA C. OpenACC is interoperable with CUDA C, so the two work well together.
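As a very rough sketch of how the two fit together (the kernel, launcher, and array names here are just placeholders, not your code): the CUDA side goes in a .cu file built with nvcc, where you control the block and thread indexing directly, and the OpenACC side hands it device pointers through "host_data use_device".

Code:
    /* kernel.cu -- built with nvcc; block/thread indexing is under your control here */
    extern "C" __global__ void branchy_kernel(const int *a, float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (a[i] == 0) x[i] = x[i] * 2.0f;
            else           x[i] = x[i] + 1.0f;
        }
    }

    extern "C" void launch_branchy(const int *a, float *x, int n)
    {
        branchy_kernel<<<(n + 255) / 256, 256>>>(a, x, n);
    }

    /* main.c -- built with the OpenACC compiler; OpenACC manages the data
       and passes the device addresses to the CUDA launcher */
    extern void launch_branchy(const int *a, float *x, int n);

    void run(const int *a, float *x, int n)
    {
        #pragma acc data copyin(a[0:n]) copy(x[0:n])
        {
            #pragma acc host_data use_device(a, x)
            launch_branchy(a, x, n);
        }
    }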

Quote:
Is there a compiler switch that optimizes conditional code by joining branches together?
Branch optimizations are enabled by default, but we don't have compiler flags that let you enable or disable them.


So what can be done here? Branching can be a major bottleneck on a GPU because of branch divergence. Threads in a warp execute in SIMT (single instruction, multiple threads) fashion, so all threads in the warp step through the same instruction stream. If one or more threads in a warp take a different branch, all of the threads execute every taken branch and simply mask off (ignore) the instructions on the paths they didn't take. If they all take the same branch, there's no penalty. If they all take different branches, the code can be as much as 32x slower.
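For example (an illustrative snippet, not your actual code): in a loop like the one below, neighboring iterations land in the same warp, so if a[i] flips between the two cases from one element to the next, every warp ends up stepping through both branch bodies.

Code:
    void divergent(float *x, const int *a, int n)
    {
        #pragma acc parallel loop copyin(a[0:n]) copy(x[0:n])
        for (int i = 0; i < n; ++i) {
            if (a[i] == 0) x[i] = x[i] * 2.0f + 1.0f;   /* "case 0" work */
            else           x[i] = x[i] / 3.0f - 1.0f;   /* "case 1" work */
        }
        /* If a[i] alternates from element to element, each warp executes
           both bodies and masks off the one each thread doesn't need. */
    }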

From the compiler's perspective, there's not much that can be done to help with branch divergence. The compiler has no way of knowing at compile time which thread will take which branch (and this can change depending on the data set). Optimizations such as branch prediction can help order the branches so the ones taken more often are checked first, but that wouldn't help much with heavily divergent threads (and to be accurate it really requires profile-guided feedback, which isn't available on the GPU).

Really this is an algorithmic issue. Are you able to reorganize your code to either reduce the number of branches or ensure that consecutive loop iterations take the same branch?
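One common pattern (just a sketch, reusing the same hypothetical two-way branch as above) is to partition the indices by condition in a cheap pre-pass, then run one branch-free loop per case, so every iteration in a given loop takes the same path.

Code:
    #include <stdlib.h>

    void partitioned(float *x, const int *a, int n)
    {
        /* Pre-pass on the host: gather the indices belonging to each case. */
        int *idx0 = (int *)malloc(n * sizeof(int));
        int *idx1 = (int *)malloc(n * sizeof(int));
        int n0 = 0, n1 = 0;
        for (int i = 0; i < n; ++i) {
            if (a[i] == 0) idx0[n0++] = i;
            else           idx1[n1++] = i;
        }

        #pragma acc data copy(x[0:n]) copyin(idx0[0:n0], idx1[0:n1])
        {
            /* Every iteration in each loop takes the same path: no divergence. */
            #pragma acc parallel loop
            for (int j = 0; j < n0; ++j)
                x[idx0[j]] = x[idx0[j]] * 2.0f + 1.0f;   /* "case 0" work */

            #pragma acc parallel loop
            for (int j = 0; j < n1; ++j)
                x[idx1[j]] = x[idx1[j]] / 3.0f - 1.0f;   /* "case 1" work */
        }

        free(idx0);
        free(idx1);
    }

The trade-off is that the gathered x[idx[j]] accesses are no longer coalesced, so whether this wins depends on how much work each branch body does.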

- Mat
rikisyo



Joined: 20 Jun 2013
Posts: 16

Posted: Fri Sep 18, 2015 8:52 am

It is highly likely that you can recover your block and thread indices from your loop index variables.

Write an outer loop and force it to be mapped to gangs. Explicitly specify the gang size. Inside the gang loop, write another loop and force it to be mapped to the vector level (the CUDA thread level), so that you have a direct correspondence between OpenACC loop levels and the CUDA grid. The index of the gang loop is then the block index, and the index of the vector loop is the thread index. This is equivalent to a 1D CUDA grid arrangement.

Index recovery is perfectly legal in OpenACC and is actually very useful for reducing warp divergence. However, it does make your code less portable, as the code may not run very efficiently on other architectures with different grid arrangements.

To make this scheme slightly more portable, I prefer to collapse all the loops into a single one and do the index recovery inside it, instead of writing two loops and explicitly mapping them to the gang and vector levels.
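A minimal sketch of both mappings (VLEN, work(), and x[] are placeholders I made up for illustration; the routine directive is there so work() can be called from device code):

Code:
    #define VLEN 128

    #pragma acc routine seq
    static float work(float v, int b, int t) { return v + (float)(b * VLEN + t); }

    /* Explicit gang/vector mapping: equivalent to a 1D CUDA grid. */
    void map_explicit(float *x, int n)
    {
        int nblocks = (n + VLEN - 1) / VLEN;
        #pragma acc parallel loop gang vector_length(VLEN) copy(x[0:n])
        for (int b = 0; b < nblocks; ++b) {        /* plays the role of blockIdx.x  */
            #pragma acc loop vector
            for (int t = 0; t < VLEN; ++t) {       /* plays the role of threadIdx.x */
                int i = b * VLEN + t;
                if (i < n) x[i] = work(x[i], b, t);
            }
        }
    }

    /* More portable: one collapsed loop, with the indices recovered arithmetically. */
    void map_collapsed(float *x, int n)
    {
        #pragma acc parallel loop copy(x[0:n])
        for (int i = 0; i < n; ++i) {
            int b = i / VLEN;                      /* recovered "block" index  */
            int t = i % VLEN;                      /* recovered "thread" index */
            x[i] = work(x[i], b, t);
        }
    }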