PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

GPU-enabling a loop

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
szczelba



Joined: 29 Jun 2010
Posts: 26

PostPosted: Fri Oct 01, 2010 4:09 am    Post subject: GPU-enabling a loop Reply with quote

Hello everyone,

I have a piece of code that I have to accelerate using the GPU. There is a loop similar to the one shown below. I've put an !$acc markers do execute it on the GPU.

Code:
!$acc region do local(n, nbl, ijk)
     do nbl=1,nblcks
       do n=k,l
         ijk=ijkpr(n)
         vvect(ijk,1)=vvect(ijk,1)*rrnorm
       enddo
     enddo
!$acc end region


Unfortunately, during compilation I have a following error:
Quote:
56, No Parallel kernels found, acceleration region ignored.

It seems to be a problem with the ijk variable, which is used to access the vvect array elements in consecutive iterations. It is not sequential but depends on the ijkpr table.

How can I change this loop to make it GPU-enabled?

Thanks for any ideas!
Jacek
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

PostPosted: Fri Oct 01, 2010 3:30 pm    Post subject: Reply with quote

Hi Jacek,

In order to parallelize your code, you need to make sure that all iterations of the loops are independent. In your case, 'vvect' has two dependencies.

The first is that you are using a computed index, ijk. If any of the values of ijk are the same for multiple threads, then the value of vvect(ijk,1) will depend on the order in which the threads are run. Since threads can be run in any order, your results will be non-deterministic. If you know that all values of ijk are unique, then you could use the "independent" clause to over ride the compilers dependency analysis and have the loop parallelized anyway.

However, even if all the values of ijkpr are unique, because multiple iteration of the outer loop access the same elements of ijkpr, you will definitely have multiple threads accessing the same elements of vvect. Again, the value of vvect will change from run to run.

Now you may think that due to the associativity of your calculation, the order in which vvect*rnorm is performed doesn't really matter. The problem here is that this is not an atomic operation. Hence, you could the following scenario:

Code:
Thread 1 reads vvect(1,1) from memory
Thread 2 reads vvect(1,1) from memory
Thread 2 calculates vvect(1,1)*rnorm
Thread 2 writes the result back to memory
Thread 1 calculates vvect(1,1)*rnorm
Thread 1 writes the result back to memory


Though, all hope is not lost. Assuming you have enough computation, what you can do is manually privatiaze vvect (i.e. create another array with an additional dimension indexed using the parallel loop). You then need to perform a reduction to populate the original vvect array. Something like:

Code:
!$acc region copy(vvectP)
!$acc do kernel
     do nbl=1,nblcks
       do n=k,l
         ijk=ijkpr(n)
         vvectTMP(nbl,ijk,1)=vvectTMP(nbl,ijk,1)*rrnorm
       enddo
     enddo
!$acc end region

     do nbl=1,nblcks
       do n=k,l
         ijk=ijkpr(n)
         vvect(ijk,1) = vvect(ijk,1)+vvectTMP(nbl,ijk,1)
       enddo
     enddo


On side note, you do not need to use "local(n, nbl, ijk)". Scalars are privatized (i.e. each thread has their own copy) by default. "local" is only needed for arrays that are local to the device.

Hope this helps,
Mat
Back to top
View user's profile
szczelba



Joined: 29 Jun 2010
Posts: 26

PostPosted: Mon Oct 04, 2010 1:37 am    Post subject: Reply with quote

Mat, thanks for your clear message.

There are some problems, though.
Firstly, I can not afford to double the loop. In your solution, you've divided the calculations on both, the GPU and the CPU. The CPU part has similar size to the original loop, so the acceleration will be too small (if any).
Additionally, creating a temporal array will consume a lot of memory and probably also some CPU time. I'm fighting for any milisecond here. ;)

I tried also the "independent" option. Assuming that ijkpr() has a unique values, for this loop it increased the calculation time almost 20 times (strange!).

Code:

!$acc region local(n, ijk, nbl)
!$acc do independent
     do nbl=1,nblcks
       do n=k,l
         ijk=ijkpr(n)
         vvect(ijk,1)=vvect(ijk,1)*rrnorm        
       enddo
     enddo
!$acc end region


While this code's execution time decreased when 'independent':

Code:

!$acc region local(n, ijk, nbl)
!$acc do independent
     do nbl=1,nblcks
       do n=k,l
         ijk=ijkpr(n)
         vvect(n,1)=vvect(n,1)*rrnorm
       enddo
     enddo
!$acc end region


I was thinking if it is possible to somehow divide the ijkpr() array and this way make the threads independent?

Any other ideas?
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

PostPosted: Mon Oct 04, 2010 2:12 am    Post subject: Reply with quote

Quote:
Firstly, I can not afford to double the loop. In your solution, you've divided the calculations on both, the GPU and the CPU. The CPU part has similar size to the original loop, so the acceleration will be too small (if any).
Additionally, creating a temporal array will consume a lot of memory and probably also some CPU time. I'm fighting for any milisecond here. ;)
I understand but getting very fast wrong answers doesn't help you either.

Quote:
I was thinking if it is possible to somehow divide the ijkpr() array and this way make the threads independent?
The problem with using ijkpr, besides not being able to guarantee independence, is that presumably the elements are not in consecutive order. This will cause non-contiguous memory access (memory divergence) and slow down you code. If you can use 'n' for vvect's index and remove ijkpr then you will improve memory access.
Code:

!$acc region local(n, ijk, nbl)
!$acc do independent
     do nbl=1,nblcks
       do n=k,l
         ijk=ijkpr(n)
         vvect(n,1)=vvect(n,1)*rrnorm
       enddo
     enddo
!$acc end region

Assuming your algorithm allows it, using 'n' instead of ijk will help. Though your code is still not parallel since multiple iterations of the 'nbl' loop will access the same elements of vvect. What you could do is interchange the n and nbl loops and then run nbl sequentially.

Code:

!$acc region
!$acc do kernel
     do n=k,l
       do nbl=1,nblcks
         ijk=ijkpr(n)
         vvect(n,1)=vvect(n,1)*rrnorm
       enddo
     enddo
!$acc end region


- Mat
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group