PGI User Forum

Loop tuning

 
elephant



Joined: 24 Feb 2011
Posts: 22

Posted: Thu Sep 01, 2011 5:03 am    Post subject: Loop tuning

Hi

I am porting a large code to the GPU with the PGI Accelerator model. It is currently running at 14x...
Now I want to do some fine tuning. I have 5 loops whose performance is not that good yet. The loops look like this:
Code:

!$acc region

         do NN=1,Number_of_nodes
            do CI=1,Number_of_cells_per_node(NN)
      
             face1 = face1(RNODE2(NN,CI))*I_GPU(RNODE1(NN,CI),1)    &
                    +face2(RNODE2(NN,CI))*J_GPU(RNODE1(NN,CI),1)    &
                    +face3(RNODE2(NN,CI))*K_GPU(RNODE1(NN,CI),1)
             face2 = face1(RNODE2(NN,CI))*I_GPU(RNODE1(NN,CI),2)    &
                    +face2(RNODE2(NN,CI))*J_GPU(RNODE1(NN,CI),2)    &
                    +face3(RNODE2(NN,CI))*K_GPU(RNODE1(NN,CI),2)
             face3 = face1(RNODE2(NN,CI))*I_GPU(RNODE1(NN,CI),3)    &
                    +face2(RNODE2(NN,CI))*J_GPU(RNODE1(NN,CI),3)    &
                    +face3(RNODE2(NN,CI))*K_GPU(RNODE1(NN,CI),3)
             
             
             ARRAY(NN,1) = ARRAY(NN,1) +(face1*F(RNODE1(NN,CI),1)     &
                                       + face2*G(RNODE1(NN,CI),1)     &
                                       + face3*H(RNODE1(NN,CI),1)     &
                                       + 0.125/DXX(RNODE1(NN,CI))     &
                                       * DELTA(RNODE1(NN,CI),1))                 
            end do       
         end do
         
!$acc end region



Number_of_nodes is a big number (2 million)
Number_of_cells_per_node(NN) has values from 1 to 10 (mostly 8)
RNODE2(:,:) has values from 1 to 8
RNODE1(:,:) has values from 1 to 2M

And the -Minfo output:
Code:

    140, Generating compute capability 2.0 binary
    141, Loop is parallelizable
         Accelerator kernel generated
        141, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             Using register for 'rnode_cnt_gpu'
             CC 2.0 : 63 registers; 4 shared, 488 constant, 0 local memory bytes; 33% occupancy
    142, Complex loop carried dependence of 'dq_gpu' prevents parallelization
         Loop carried dependence of 'dq_gpu' prevents parallelization
         Loop carried backward dependence of 'dq_gpu' prevents vectorization
         Inner sequential loop scheduled on accelerator

And the output with the ,time option is:
Code:

    140: region entered 20 times
        time(us): total=177978 init=4 region=177974
                  kernels=176084 data=0
        w/o init: total=177974 max=8967 min=8854 avg=8898
        141: kernel launched 20 times
            grid: [3182]  block: [256]
            time(us): total=176084 max=8880 min=8766 avg=8804

Is there anything I can do to gain performance? I guess the inner loop is the bottleneck, right?
I would be grateful for any tips on tuning this loop.
Thank you so much!
mkcolg



Joined: 30 Jun 2004
Posts: 6119
Location: The Portland Group Inc.

Posted: Tue Sep 06, 2011 11:03 am

Hi elephant,

What I'd try is assigning the RNODE2(NN,CI) and RNODE1(NN,CI) values to temp variables and replacing each instance within the loop with the temp variable. For example:
Code:
!$acc region

         do NN=1,Number_of_nodes
            do CI=1,Number_of_cells_per_node(NN)
             rn1 = RNODE1(NN,CI)
             rn2 = RNODE2(NN,CI)
             face1 = face1(rn2)*I_GPU(rn1,1)    &
                    +face2(rn2)*J_GPU(rn1,1)    &
                    +face3(rn2)*K_GPU(rn1,1)
... continues
 


The code is using a lot of registers. My best guess is that many of these registers are being used to hold the address calculations for each of the RNODE look-ups. Granted, the compiler may already recognize the redundant look-ups and replace them with temp variables, in which case manually replacing them won't matter. It's worth a try, though.
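
As a rough illustration of the look-up hoisting described above, here is a small, self-contained sketch that runs on the host. The array names fa, fb, fc, the sizes, and the test data are all made up for the example (they are not from the original code), and the accelerator directives are left out so that it compiles with any Fortran compiler. It only checks that reading RNODE1/RNODE2 once per iteration gives the same result as indexing through them every time. Whether it actually lowers register use on the GPU has to be measured.

```fortran
program hoist_indices
  implicit none
  ! Hypothetical sizes, chosen only to keep the example tiny.
  integer, parameter :: n_nodes = 4, max_cells = 3, n_faces = 8
  integer :: rnode1(n_nodes, max_cells), rnode2(n_nodes, max_cells)
  integer :: n_cells(n_nodes)
  real :: fa(n_faces), fb(n_faces), fc(n_faces)
  real :: i_gpu(n_nodes, 3), j_gpu(n_nodes, 3), k_gpu(n_nodes, 3)
  real :: direct(n_nodes), hoisted(n_nodes)
  integer :: nn, ci, rn1, rn2

  ! Fill the arrays with arbitrary but deterministic test data.
  do nn = 1, n_nodes
     n_cells(nn) = 1 + mod(nn, max_cells)
     do ci = 1, max_cells
        rnode1(nn, ci) = 1 + mod(nn + ci, n_nodes)
        rnode2(nn, ci) = 1 + mod(nn * ci, n_faces)
     end do
     i_gpu(nn, :) = real(nn)
     j_gpu(nn, :) = real(nn) + 0.5
     k_gpu(nn, :) = real(nn) + 0.25
  end do
  do nn = 1, n_faces
     fa(nn) = real(nn)
     fb(nn) = 2.0 * real(nn)
     fc(nn) = 3.0 * real(nn)
  end do

  ! Version 1: index through rnode1/rnode2 in every term.
  direct = 0.0
  do nn = 1, n_nodes
     do ci = 1, n_cells(nn)
        direct(nn) = direct(nn)                               &
             + fa(rnode2(nn, ci)) * i_gpu(rnode1(nn, ci), 1)  &
             + fb(rnode2(nn, ci)) * j_gpu(rnode1(nn, ci), 1)  &
             + fc(rnode2(nn, ci)) * k_gpu(rnode1(nn, ci), 1)
     end do
  end do

  ! Version 2: read each index once into a scalar, then reuse it.
  hoisted = 0.0
  do nn = 1, n_nodes
     do ci = 1, n_cells(nn)
        rn1 = rnode1(nn, ci)
        rn2 = rnode2(nn, ci)
        hoisted(nn) = hoisted(nn)       &
             + fa(rn2) * i_gpu(rn1, 1)  &
             + fb(rn2) * j_gpu(rn1, 1)  &
             + fc(rn2) * k_gpu(rn1, 1)
     end do
  end do

  ! Both versions perform the same arithmetic in the same order.
  if (maxval(abs(direct - hoisted)) > 0.0) stop 1
  print *, 'direct and hoisted results match'
end program hoist_indices
```

In the real code the second loop nest would sit inside the !$acc region, with rn1 and rn2 as plain scalars.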

Next, I'd use a temp variable to accumulate. Otherwise, you're storing "ARRAY(NN,1)" to global memory after each iteration of the inner loop.
Code:

!$acc region

         do NN=1,Number_of_nodes
            tempsum = ARRAY(NN,1)
            do CI=1,Number_of_cells_per_node(NN)
...
                tempsum = tempsum + (face1*F(RNODE1(NN,CI),1)     &
....
            enddo
            ARRAY(NN,1) = tempsum
         enddo
!$acc end region
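
A minimal host-side sketch of the accumulator pattern above, for anyone who wants to check the transformation in isolation. The contrib array is a hypothetical stand-in for the whole face1*F(...) + ... expression, and the sizes and data are invented. The directives are again omitted, so this only demonstrates that the two loop forms give the same sums, not the GPU speed difference.

```fortran
program scalar_accumulator
  implicit none
  ! Hypothetical sizes and data, for illustration only.
  integer, parameter :: n_nodes = 5, max_cells = 4
  integer :: n_cells(n_nodes)
  real :: contrib(n_nodes, max_cells)   ! stands in for the per-cell term
  real :: array_direct(n_nodes, 1), array_temp(n_nodes, 1)
  real :: tempsum
  integer :: nn, ci

  do nn = 1, n_nodes
     n_cells(nn) = 1 + mod(nn, max_cells)
     do ci = 1, max_cells
        contrib(nn, ci) = 0.5 * real(nn + ci)
     end do
  end do
  array_direct = 1.0
  array_temp = 1.0

  ! Original form: the array element is read and written on every
  ! inner iteration.
  do nn = 1, n_nodes
     do ci = 1, n_cells(nn)
        array_direct(nn, 1) = array_direct(nn, 1) + contrib(nn, ci)
     end do
  end do

  ! Accumulator form: one read before the inner loop, one write after it.
  do nn = 1, n_nodes
     tempsum = array_temp(nn, 1)
     do ci = 1, n_cells(nn)
        tempsum = tempsum + contrib(nn, ci)
     end do
     array_temp(nn, 1) = tempsum
  end do

  if (maxval(abs(array_direct - array_temp)) > 0.0) stop 1
  print *, 'array and scalar accumulation match'
end program scalar_accumulator
```

Because tempsum is a scalar that is assigned before it is used in each NN iteration, each GPU thread can keep its own copy in a register.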



The next thing to try is setting maxregcount to 16 (-Mcuda=maxregcount:16). This should boost the occupancy from 33% to 100%. However, higher occupancy doesn't always mean better performance, since fewer registers can mean more global memory fetches. You don't have a lot of data reuse, though, so it may be OK.
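
To see roughly where the 33% and 100% figures come from, here is a back-of-the-envelope occupancy estimate. It is a sketch under stated assumptions, not the compiler's or the CUDA occupancy calculator's exact method. The assumed limits for a compute capability 2.0 multiprocessor are 32768 registers, 1536 resident threads, at most 8 resident blocks, and registers allocated per warp in units of 64. Shared memory is ignored, since the kernel only uses 4 bytes. The block size of 256 and the 63 registers per thread are taken from the -Minfo output above.

```fortran
program occupancy_estimate
  implicit none
  ! Assumed per-multiprocessor limits for compute capability 2.0.
  integer, parameter :: regs_per_sm = 32768
  integer, parameter :: max_threads_per_sm = 1536
  integer, parameter :: max_blocks_per_sm = 8
  integer, parameter :: warp_size = 32
  integer, parameter :: reg_alloc_unit = 64   ! assumed allocation granularity
  integer, parameter :: block_size = 256      ! vector(256) in the -Minfo output
  integer, parameter :: candidates(4) = (/ 63, 32, 20, 16 /)
  integer :: i

  do i = 1, size(candidates)
     print '(a,i3,a,i4,a)', 'registers/thread = ', candidates(i), &
          '  ->  estimated occupancy ', occupancy_percent(candidates(i)), '%'
  end do

  ! Sanity checks against the figures quoted in the thread.
  if (occupancy_percent(63) /= 33) stop 1
  if (occupancy_percent(16) /= 100) stop 1

contains

  ! Estimated occupancy in percent (truncated), limited by registers,
  ! resident threads, and resident blocks.
  integer function occupancy_percent(regs_per_thread)
    integer, intent(in) :: regs_per_thread
    integer :: warps_per_block, regs_per_warp, regs_per_block
    integer :: blocks_by_regs, blocks_by_threads, resident_blocks

    warps_per_block = block_size / warp_size
    ! Round the per-warp register count up to the allocation unit.
    regs_per_warp = ((regs_per_thread * warp_size + reg_alloc_unit - 1) &
         / reg_alloc_unit) * reg_alloc_unit
    regs_per_block = regs_per_warp * warps_per_block
    blocks_by_regs = regs_per_sm / regs_per_block
    blocks_by_threads = max_threads_per_sm / block_size
    resident_blocks = min(blocks_by_regs, blocks_by_threads, max_blocks_per_sm)
    occupancy_percent = (100 * resident_blocks * block_size) / max_threads_per_sm
  end function occupancy_percent

end program occupancy_estimate
```

Under these assumptions, 63 registers per thread allow only 2 resident blocks of 256 threads (512 of 1536 threads, about 33%), while 16 registers per thread allow 6 blocks (100%). The same model suggests that a limit somewhat above 16 could still reach full occupancy, which may be worth trying if 16 causes too much spilling.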

Hope this helps,
Mat
All times are GMT - 7 Hours