PGI User Forum


unordered array access is faster than ordered access???

 
elephant



Joined: 24 Feb 2011
Posts: 22

Posted: Mon Sep 12, 2011 3:48 am    Post subject: unordered array access is faster than ordered access???

Hi!
I am a little bit confused about a certain issue:
I am porting an unstructured grid application. In order to get coalesced memory access, I generated a new vector (Q_GPU_kc) that is ordered with respect to the cells (kc).
The original vector (Q_GPU) is ordered with respect to the nodes.
PCELL_GPU is an array that contains the information about which nodes belong to a cell.

Q_GPU_kc, the vector ordered with respect to the cells, was created as follows:
Code:

!$acc region
      DO i=1,8
         DO kc=1,kcend
            Q_GPU_kc(kc,i,1) = Q_GPU(pcell_GPU(kc,i),1)
            Q_GPU_kc(kc,i,2) = Q_GPU(pcell_GPU(kc,i),2)
            Q_GPU_kc(kc,i,3) = Q_GPU(pcell_GPU(kc,i),3)
            Q_GPU_kc(kc,i,4) = Q_GPU(pcell_GPU(kc,i),4)
            Q_GPU_kc(kc,i,5) = Q_GPU(pcell_GPU(kc,i),5)
            Q_GPU_kc(kc,i,6) = Q_GPU(pcell_GPU(kc,i),6) 
         END DO
      END DO
!$acc end region


I thought that using this ordered vector inside a loop indexed over kc would improve my performance. However, the opposite happened...

Is it because a kernel has problems dealing with 3-dimensional arrays? The memory access of the loop using Q_GPU_kc should be very efficient, no?
I must also say that the grid I am using is not fully unstructured. However, in my opinion I should still see better performance with the Q_GPU_kc vector.
Do you have any explanation for this?

The code with the ordered vector is the following:
Code:

!$acc region
      DO kc =1, kcend
         DELTA_Q_T(KC,1) =                               &
                Q_GPU_kc(KC,1,2)*S_X_GPU(kc,1)           &
               +Q_GPU_kc(KC,2,2)*S_X_GPU(kc,2)           &
               +Q_GPU_kc(KC,3,2)*S_X_GPU(kc,3)           &
               +Q_GPU_kc(KC,4,2)*S_X_GPU(kc,4)           &
               +Q_GPU_kc(KC,5,2)*S_X_GPU(kc,5)           &
               +Q_GPU_kc(KC,6,2)*S_X_GPU(kc,6)           &
               +Q_GPU_kc(KC,7,2)*S_X_GPU(kc,7)           &
               +Q_GPU_kc(KC,8,2)*S_X_GPU(kc,8)           &
               +Q_GPU_kc(KC,1,3)*S_Y_GPU(kc,1)           &
               +Q_GPU_kc(KC,2,3)*S_Y_GPU(kc,2)           &
               +Q_GPU_kc(KC,3,3)*S_Y_GPU(kc,3)           &
               +Q_GPU_kc(KC,4,3)*S_Y_GPU(kc,4)           &
               +Q_GPU_kc(KC,5,3)*S_Y_GPU(kc,5)           &
               +Q_GPU_kc(KC,6,3)*S_Y_GPU(kc,6)           &
               +Q_GPU_kc(KC,7,3)*S_Y_GPU(kc,7)           &
               +Q_GPU_kc(KC,8,3)*S_Y_GPU(kc,8)           &
               +Q_GPU_kc(KC,1,4)*S_Z_GPU(kc,1)           &
               +Q_GPU_kc(KC,2,4)*S_Z_GPU(kc,2)           &
               +Q_GPU_kc(KC,3,4)*S_Z_GPU(kc,3)           &
               +Q_GPU_kc(KC,4,4)*S_Z_GPU(kc,4)           &
               +Q_GPU_kc(KC,5,4)*S_Z_GPU(kc,5)           &
               +Q_GPU_kc(KC,6,4)*S_Z_GPU(kc,6)           &
               +Q_GPU_kc(KC,7,4)*S_Z_GPU(kc,7)           &
               +Q_GPU_kc(KC,8,4)*S_Z_GPU(kc,8)

      END DO
!$acc end region


     47, Generating compute capability 2.0 binary
     49, Loop is parallelizable
         Accelerator kernel generated
         49, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 2.0 : 63 registers; 4 shared, 208 constant, 0 local memory bytes; 33% occupancy


    47: region entered 20 times
        time(us): total=77350 init=5 region=77345
                  kernels=75924 data=0
        w/o init: total=77345 max=3945 min=3847 avg=3867
        49: kernel launched 20 times
            grid: [3063]  block: [256]
            time(us): total=75924 max=3807 min=3778 avg=3796

-------------------------------------------------------------------------------------
The code with the "unordered", original vector is the following:
Code:

!$acc region
      DO kc =1, kcend   
         DELTA_Q_T(KC,1) =                                       &
                Q_GPU(PCELL_GPU(KC,1),2)*S_X_GPU(kc,1)           &
               +Q_GPU(PCELL_GPU(KC,2),2)*S_X_GPU(kc,2)           &
               +Q_GPU(PCELL_GPU(KC,3),2)*S_X_GPU(kc,3)           &
               +Q_GPU(PCELL_GPU(KC,4),2)*S_X_GPU(kc,4)           &
               +Q_GPU(PCELL_GPU(KC,5),2)*S_X_GPU(kc,5)           &
               +Q_GPU(PCELL_GPU(KC,6),2)*S_X_GPU(kc,6)           &
               +Q_GPU(PCELL_GPU(KC,7),2)*S_X_GPU(kc,7)           &
               +Q_GPU(PCELL_GPU(KC,8),2)*S_X_GPU(kc,8)           &
               +Q_GPU(PCELL_GPU(KC,1),3)*S_Y_GPU(kc,1)           &
               +Q_GPU(PCELL_GPU(KC,2),3)*S_Y_GPU(kc,2)           &
               +Q_GPU(PCELL_GPU(KC,3),3)*S_Y_GPU(kc,3)           &
               +Q_GPU(PCELL_GPU(KC,4),3)*S_Y_GPU(kc,4)           &
               +Q_GPU(PCELL_GPU(KC,5),3)*S_Y_GPU(kc,5)           &
               +Q_GPU(PCELL_GPU(KC,6),3)*S_Y_GPU(kc,6)           &
               +Q_GPU(PCELL_GPU(KC,7),3)*S_Y_GPU(kc,7)           &
               +Q_GPU(PCELL_GPU(KC,8),3)*S_Y_GPU(kc,8)           &
               +Q_GPU(PCELL_GPU(KC,1),4)*S_Z_GPU(kc,1)           &
               +Q_GPU(PCELL_GPU(KC,2),4)*S_Z_GPU(kc,2)           &
               +Q_GPU(PCELL_GPU(KC,3),4)*S_Z_GPU(kc,3)           &
               +Q_GPU(PCELL_GPU(KC,4),4)*S_Z_GPU(kc,4)           &
               +Q_GPU(PCELL_GPU(KC,5),4)*S_Z_GPU(kc,5)           &
               +Q_GPU(PCELL_GPU(KC,6),4)*S_Z_GPU(kc,6)           &
               +Q_GPU(PCELL_GPU(KC,7),4)*S_Z_GPU(kc,7)           &
               +Q_GPU(PCELL_GPU(KC,8),4)*S_Z_GPU(kc,8)
      END DO
!$acc end region


     49, Loop is parallelizable
         Accelerator kernel generated
         49, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             CC 2.0 : 54 registers; 4 shared, 232 constant, 0 local memory bytes; 33% occupancy


    47: region entered 20 times
        time(us): total=59540 init=3 region=59537
                  kernels=58114 data=0
        w/o init: total=59537 max=3034 min=2950 avg=2976
        49: kernel launched 20 times
            grid: [3063]  block: [256]
            time(us): total=58114 max=2930 min=2876 avg=2905


Thank you very much!
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Wed Sep 14, 2011 11:03 am

Hi Mark,

I sent a response to the email you sent PGI Customer Support but haven't heard back from you yet. In case you missed the mail, I've posted it below:

Can you send me both versions of the code? I can guess what the problem is, but if I have the code I can profile it and look at the generated GPU code to give you a better answer.

As for my guesses: the initialization code has a very small outer loop (8 iterations), so the problem there is most likely due to scheduling rather than anything else. I'd try inverting the i and kc loops and making the i loop sequential. Also, I'd cache the index fetched from pcell_GPU. Often the compiler will do this optimization, but it's not guaranteed, so I like to make sure.

For example:
Code:

!$acc region
!$acc do parallel, vector(256), kernel
      DO kc=1,kcend
         DO i=1,8                       ! small inner loop, run sequentially per thread
            idx = pcell_GPU(kc,i)       ! cache the indirect index once
            Q_GPU_kc(kc,i,1) = Q_GPU(idx,1)
            Q_GPU_kc(kc,i,2) = Q_GPU(idx,2)
            Q_GPU_kc(kc,i,3) = Q_GPU(idx,3)
            Q_GPU_kc(kc,i,4) = Q_GPU(idx,4)
            Q_GPU_kc(kc,i,5) = Q_GPU(idx,5)
            Q_GPU_kc(kc,i,6) = Q_GPU(idx,6)
         END DO
      END DO
!$acc end region


In C there are no true multi-dimensional arrays, so all multi-dimensional Fortran arrays need to be linearized. I'd like to see the generated kernel code (-ta=nvidia,keepgpu) to look at the indexing. I think the data should be accessed contiguously, but the kernel probably needs more calculations to determine the index. Notice that the first example uses 63 registers versus 54 in the second. Most likely this is due to the increased number of index calculations.
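
To illustrate (a minimal sketch with hypothetical bounds, assuming Q_GPU_kc is declared as Q_GPU_kc(kcend,8,6)), here is the index arithmetic a kernel must do for a 3-D Fortran array. Fortran is column-major, so the first index varies fastest; consecutive kc values (consecutive threads) still touch consecutive addresses, but each reference costs extra multiply-adds:
Code:

PROGRAM layout_sketch
   IMPLICIT NONE
   INTEGER, PARAMETER :: kcend = 1000   ! hypothetical extent
   INTEGER :: kc, i, m, offset
   kc = 5; i = 3; m = 2
   ! Linear offset of Q_GPU_kc(kc,i,m) when declared Q_GPU_kc(kcend,8,6):
   offset = (kc-1) + (i-1)*kcend + (m-1)*kcend*8
   ! Adjacent kc -> adjacent addresses (coalesced), but a 3-D reference
   ! needs two more multiply-adds than a 2-D one, which shows up as
   ! extra registers and instructions in the generated kernel.
   PRINT *, 'linear offset of (5,3,2):', offset
END PROGRAM layout_sketch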

I think I'd try re-rolling the second example and adding an "i" loop back in. Like above, make the i loop sequential and cache the value fetched from pcell_GPU. Hopefully this will reduce the number of registers required and increase your occupancy.
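
For example, a sketch of what the re-rolled kernel might look like (the scalars tmp and idx are hypothetical names I've introduced; the compiler should privatize them per thread):
Code:

!$acc region
!$acc do parallel, vector(256), kernel
      DO kc = 1, kcend
         tmp = 0.0
         DO i = 1, 8                    ! runs sequentially within each thread
            idx = pcell_GPU(kc,i)       ! cache the indirect index once
            tmp = tmp + Q_GPU(idx,2)*S_X_GPU(kc,i)   &
                      + Q_GPU(idx,3)*S_Y_GPU(kc,i)   &
                      + Q_GPU(idx,4)*S_Z_GPU(kc,i)
         END DO
         DELTA_Q_T(kc,1) = tmp
      END DO
!$acc end region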

Again, these are just guesses, but worth trying.

- Mat