PGI User Forum

compiler output -Minfo: one loop slower than the other

 
elephant

Posted: Thu Jun 30, 2011 7:56 am    Post subject: compiler output -Minfo: one loop slower than the other

Hi,
I am porting a large CFD code to the GPU.
My strategy is to take a subroutine (sub1), rewrite it as a stand-alone program, accelerate it using the PGI Accelerator model, and then rewrite it as a subroutine again and call it from a "dummy main program" inside a loop that simulates the iterations.

In the stand-alone version, sub1 achieved a speedup of 117x. I declared the arrays static and used the !$acc data region directive to ensure that no data transfer occurs during the calculation.

For the main-program version, where sub1 is a subroutine again, I use the exact same code for sub1, although the arrays are now declared allocatable with the !$acc mirror directive. Before entering the loop where sub1 is called, I allocate the arrays and transfer the data to the GPU using !$acc update device (a stripped-down sketch is shown below).
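For illustration, the setup looks roughly as follows; the module name, array shapes, and iteration count are placeholders, only Q_T, pcell_T, and sub1 are taken from the real code:

Code:

MODULE gpu_data                             ! placeholder module name
   REAL,    ALLOCATABLE :: Q_T(:,:)         ! shapes are placeholders
   INTEGER, ALLOCATABLE :: pcell_T(:,:)
!$acc mirror(Q_T, pcell_T)                  ! keep device copies of the allocatable arrays
END MODULE gpu_data

PROGRAM dummy_main
   USE gpu_data
   IMPLICIT NONE
   INTEGER :: iter

   ALLOCATE(Q_T(100000,5), pcell_T(100000,8))   ! allocates host and device copies
   ! ... initialize Q_T and pcell_T on the host ...

!$acc update device(Q_T, pcell_T)           ! one-time transfer before the iteration loop

   DO iter = 1, 1000                        ! simulates the CFD iterations
      CALL sub1()                           ! sub1 works on the mirrored device arrays
   END DO
END PROGRAM dummy_main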

My problem is that sub1 now only reaches 45x.

I checked the compiler output information (-Minfo): there is no data transfer during the calculation either, so that cannot be the reason for the slowdown.

sub1 consists of 6 loops. When I measured the time spent in each loop, I saw that the first two loops were almost identical, whereas for loop 3 there was a huge difference (and also a slowdown for the other loops).

The -Minfo output for loop 3 shows (CC 2.0):

... for the stand-alone version (117x):
register: 47
shared: 4
constant: 112
local: 0
occupancy: 33%

... and for the version where sub1 is called from a main program (45x):
register: 41
shared: 4
constant: 304
local: 0
occupancy: 50%


It seems weird to me that the faster loop has the lower occupancy.

Can you explain, with the information provided, why this could happen?
To measure the time of loop 3, I captured the time before and after the loop, so I don't think the difference in array allocation can cause the slowdown: by the time the loop is entered, the arrays are allocated on the device and the data is already there in both cases.
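Roughly, the timing around loop 3 looks like this (the region directives are as in the real code; the timing variables are just for illustration, and since the region has no async clause the kernel has finished before the second timestamp is taken):

Code:

INTEGER :: c0, c1, crate

CALL system_clock(count_rate=crate)
CALL system_clock(c0)

!$acc region
   DO kc = 1, kcend
      ! ... body of loop 3 ...
   END DO
!$acc end region

CALL system_clock(c1)
PRINT *, 'loop 3 time (s): ', REAL(c1 - c0) / REAL(crate)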

Or could it be due to this difference: that at compile time the compiler does not know the size of the arrays and therefore cannot optimally decide where to place each array (register, shared, constant, ...)?

Thank you very much!
elephant

Posted: Thu Jun 30, 2011 8:05 am

By the way, loop 3 looks like this:
Code:

!$acc region

    DO kc = 1, kcend

       inttemp1 = pcell_T(kc,1)
       inttemp2 = pcell_T(kc,2)
       inttemp3 = pcell_T(kc,3)
       inttemp4 = pcell_T(kc,4)
       inttemp5 = pcell_T(kc,5)
       inttemp6 = pcell_T(kc,6)
       inttemp7 = pcell_T(kc,7)
       inttemp8 = pcell_T(kc,8)

       ArrTemp4(kc,1,1) =  Q_T(inttemp1,1)   &
                          +Q_T(inttemp2,1)   &
                          +Q_T(inttemp3,1)   &
                          +Q_T(inttemp4,1)   &
                          +Q_T(inttemp5,1)   &
                          +Q_T(inttemp6,1)   &
                          +Q_T(inttemp7,1)   &
                          +Q_T(inttemp8,1)

       ArrTemp4(kc,2,1) =  Q_T(inttemp1,2)   &
                          +Q_T(inttemp2,2)   &
                          +Q_T(inttemp3,2)   &
                          +Q_T(inttemp4,2)   &
                          +Q_T(inttemp5,2)   &
                          +Q_T(inttemp6,2)   &
                          +Q_T(inttemp7,2)   &
                          +Q_T(inttemp8,2)

       ArrTemp4(kc,3,1) =  Q_T(inttemp1,3)   &
                          +Q_T(inttemp2,3)   &
                          +Q_T(inttemp3,3)   &
                          +Q_T(inttemp4,3)   &
                          +Q_T(inttemp5,3)   &
                          +Q_T(inttemp6,3)   &
                          +Q_T(inttemp7,3)   &
                          +Q_T(inttemp8,3)

       ArrTemp4(kc,4,1) =  Q_T(inttemp1,4)   &
                          +Q_T(inttemp2,4)   &
                          +Q_T(inttemp3,4)   &
                          +Q_T(inttemp4,4)   &
                          +Q_T(inttemp5,4)   &
                          +Q_T(inttemp6,4)   &
                          +Q_T(inttemp7,4)   &
                          +Q_T(inttemp8,4)

       ArrTemp4(kc,5,1) =  Q_T(inttemp1,5)   &
                          +Q_T(inttemp2,5)   &
                          +Q_T(inttemp3,5)   &
                          +Q_T(inttemp4,5)   &
                          +Q_T(inttemp5,5)   &
                          +Q_T(inttemp6,5)   &
                          +Q_T(inttemp7,5)   &
                          +Q_T(inttemp8,5)

       ArrTemp4(kc,6,1) =  CMU_GPU(inttemp1)   &
                          +CMU_GPU(inttemp2)   &
                          +CMU_GPU(inttemp3)   &
                          +CMU_GPU(inttemp4)   &
                          +CMU_GPU(inttemp5)   &
                          +CMU_GPU(inttemp6)   &
                          +CMU_GPU(inttemp7)   &
                          +CMU_GPU(inttemp8)

       ArrTemp4(kc,7,1) =  CMUT_GPU(inttemp1)   &
                          +CMUT_GPU(inttemp2)   &
                          +CMUT_GPU(inttemp3)   &
                          +CMUT_GPU(inttemp4)   &
                          +CMUT_GPU(inttemp5)   &
                          +CMUT_GPU(inttemp6)   &
                          +CMUT_GPU(inttemp7)   &
                          +CMUT_GPU(inttemp8)

       ArrTemp4(kc,8,1) =  QT_T(inttemp1,1)   &
                          +QT_T(inttemp2,1)   &
                          +QT_T(inttemp3,1)   &
                          +QT_T(inttemp4,1)   &
                          +QT_T(inttemp5,1)   &
                          +QT_T(inttemp6,1)   &
                          +QT_T(inttemp7,1)   &
                          +QT_T(inttemp8,1)

       ArrTemp4(kc,9,1) =  Y_GPU(inttemp1)   &
                          +Y_GPU(inttemp2)   &
                          +Y_GPU(inttemp3)   &
                          +Y_GPU(inttemp4)   &
                          +Y_GPU(inttemp5)   &
                          +Y_GPU(inttemp6)   &
                          +Y_GPU(inttemp7)   &
                          +Y_GPU(inttemp8)

       ArrTemp4(kc,10,1) =  Z_GPU(inttemp1)   &
                           +Z_GPU(inttemp2)   &
                           +Z_GPU(inttemp3)   &
                           +Z_GPU(inttemp4)   &
                           +Z_GPU(inttemp5)   &
                           +Z_GPU(inttemp6)   &
                           +Z_GPU(inttemp7)   &
                           +Z_GPU(inttemp8)

       ArrTemp4(kc,11,1) =  DQT_T(inttemp1,1)   &
                           +DQT_T(inttemp2,1)   &
                           +DQT_T(inttemp3,1)   &
                           +DQT_T(inttemp4,1)   &
                           +DQT_T(inttemp5,1)   &
                           +DQT_T(inttemp6,1)   &
                           +DQT_T(inttemp7,1)   &
                           +DQT_T(inttemp8,1)

    END DO

!$acc end region
mkcolg

Location: The Portland Group Inc.

Posted: Wed Jul 06, 2011 10:56 am

Hi elephant,

While I can't tell without doing an in-depth investigation, my best guess is that it's the array descriptors that account for the difference. With static arrays, the compiler is able to optimize the address calculations, but this is more difficult with dynamic arrays. Improving this is something our engineers are investigating, however.

I'd be interested in seeing the output from the basic profiling information (i.e. -ta=nvidia,time), in particular the actual schedule used. I believe the increase in occupancy is due to the decrease in register usage and therefore an increase in the number of threads per block. It's possible that the increased number of threads causes other resource constraints and that it's better to reduce the number via the "!$acc do vector(nnn)" directive, as sketched below.
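For your loop 3 that would look something like the following; the vector width of 128 is only an example, and the best value would need to be found by experiment:

Code:

!$acc region
!$acc do vector(128)     ! limit the number of threads per block (128 is illustrative)
   DO kc = 1, kcend
      ! ... body of loop 3 unchanged ...
   END DO
!$acc end region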

- Mat