PGI User Forum

CUDA-x86.

Performance decrease with PGI 12.1
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Fri Feb 17, 2012 3:14 am    Post subject: Performance decrease with PGI 12.1

Hi,

I am seeing a performance decrease (about 1.4x) in one of my codes when going from PGI 11.10 to 12.1.

Looking at the compiler feedback for the most time-consuming kernel, I can see that the new version seems to use registers differently:


***
PGI 11.10:
Code:

    977, Loop is parallelizable
         Accelerator kernel generated
        977, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             Cached references to size [5] block of 'tinc'
             Using register for 'tfh'
             Using register for 'tfm'
             Using register for 'fr_land'
             Using register for 't_g'
             Using register for 'lo_ice'
             Using register for 'h_ice'
             Using register for 'gz0'
             Using register for 'hhl'
             Non-stride-1 accesses for array 'grad'
             Using register for 'z0m'
             Using register for 'tcm'
             Using register for 'tp'
             Using register for 'lcircterm'
             Using register for 'd_pat'
             Using register for 'l_pat'
             Using register for 'lay'
             Using register for 'ps'
             Using register for 'qd'
             Using register for 'ql'
             Using register for 'pr'
             Using register for 'frc'
             Using register for 'src'
             Using register for 'qc'
             Non-stride-1 accesses for array 'tinv'
             CC 1.3 : 124 registers; 60 shared, 1792 constant, 168 local memory bytes; 9% occupancy
             CC 2.0 : 63 registers; 44 shared, 1688 constant, 0 local memory bytes; 25% occupancy
 




*****
PGI 12.1
Code:

    977, Loop is parallelizable
         Accelerator kernel generated
        977, !$acc do parallel, vector(96) ! blockidx%x threadidx%x
             Cached references to size [5] block of 'tinc'
             Non-stride-1 accesses for array 'grad'
             Non-stride-1 accesses for array 'tinv'
             CC 1.3 : 124 registers; 60 shared, 1792 constant, 168 local memory bytes; 9% occupancy
             CC 2.0 : 63 registers; 44 shared, 1688 constant, 0 local memory bytes; 25% occupancy


Note that the "Non-stride-1 accesses for array 'grad'" message should not be an issue here, as this is just a private coefficient array with 4 elements.


Any suggestions on how I could re-activate the previous optimization?
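The occupancy figures are the same in both listings (9% on CC 1.3, 25% on CC 2.0). As a rough cross-check, here is a minimal sketch that reproduces them from the reported register counts and the `vector(96)` block size. The per-multiprocessor limits and allocation granularities are assumptions taken from NVIDIA's published occupancy rules for these compute capabilities, not something stated in this thread, and shared memory is ignored because 60 and 44 bytes are far below any limit.

```python
import math

WARP_SIZE = 32
WARP_GRANULARITY = 2  # assumption: warps are allocated in pairs on both devices

# ASSUMED per-multiprocessor limits (from NVIDIA's occupancy rules, not from the thread).
LIMITS = {
    "1.3": dict(regs_per_sm=16384, max_warps=32, max_blocks=8,
                reg_unit=512, per_block_alloc=True),
    "2.0": dict(regs_per_sm=32768, max_warps=48, max_blocks=8,
                reg_unit=64, per_block_alloc=False),
}


def round_up(value, unit):
    """Round value up to the next multiple of unit."""
    return math.ceil(value / unit) * unit


def occupancy(cc, regs_per_thread, threads_per_block):
    """Fraction of the maximum resident warps, limited by registers only."""
    lim = LIMITS[cc]
    warps = math.ceil(threads_per_block / WARP_SIZE)
    alloc_warps = round_up(warps, WARP_GRANULARITY)
    if lim["per_block_alloc"]:
        # Registers are rounded up once per block.
        regs_per_block = round_up(regs_per_thread * WARP_SIZE * alloc_warps,
                                  lim["reg_unit"])
    else:
        # Registers are rounded up per warp, then multiplied by allocated warps.
        regs_per_block = round_up(regs_per_thread * WARP_SIZE,
                                  lim["reg_unit"]) * alloc_warps
    blocks = min(lim["regs_per_sm"] // regs_per_block,
                 lim["max_warps"] // warps,
                 lim["max_blocks"])
    return blocks * warps / lim["max_warps"]


print(f"CC 1.3: {occupancy('1.3', 124, 96):.0%}")
print(f"CC 2.0: {occupancy('2.0', 63, 96):.0%}")
```

Under these assumptions the register count is the limiting resource on both devices, and since it is unchanged between 11.10 and 12.1, occupancy alone would not explain the slowdown.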

Thanks,

Xavier
mkcolg



Joined: 30 Jun 2004
Posts: 6072
Location: The Portland Group Inc.

Posted: Fri Feb 17, 2012 12:17 pm

Hi Xavier,

Can you please send a reproducing example to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me? I'll need the code in order to investigate what's happening.

Thanks,
Mat
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Mon Feb 27, 2012 2:54 am

Hi Mat,

I sent a reproducing example to trs@pgroup.com one week ago. Did you receive it?

Xavier
mkcolg



Joined: 30 Jun 2004
Posts: 6072
Location: The Portland Group Inc.

Posted: Mon Feb 27, 2012 3:54 pm

Hi Xavier,

Sorry about that. They got it but didn't forward it on to me.

I took a look at the code, and it appears to me that the performance difference is caused by the CUDA version being used. We switched from CUDA 3.2 to CUDA 4.0 as the default device tool chain. I measured the following kernel times for the loop at line 977 (times are in microseconds):

17957 11.10 with CUDA 3.2 (default)
29402 11.10 with CUDA 4.0 (-ta=nvidia,4.0)
28076 12.2 with CUDA 4.0 (default)
17921 12.2 with CUDA 3.2 (-ta=nvidia,3.2)
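
To make the comparison explicit, here is a small sketch that computes the ratios from the four timings above. The helper names are only illustrative.

```python
# Kernel times in microseconds for the loop at line 977, copied from the list above.
# Keys are (PGI version, CUDA version).
kernel_times_us = {
    ("11.10", "3.2"): 17957,
    ("11.10", "4.0"): 29402,
    ("12.2", "4.0"): 28076,
    ("12.2", "3.2"): 17921,
}


def cuda_slowdown(pgi_version):
    """CUDA 4.0 kernel time divided by CUDA 3.2 kernel time, same PGI release."""
    return (kernel_times_us[(pgi_version, "4.0")]
            / kernel_times_us[(pgi_version, "3.2")])


def compiler_effect(cuda_version):
    """PGI 12.2 kernel time divided by PGI 11.10 kernel time, same CUDA version."""
    return (kernel_times_us[("12.2", cuda_version)]
            / kernel_times_us[("11.10", cuda_version)])


for pgi in ("11.10", "12.2"):
    print(f"PGI {pgi}: CUDA 4.0 / CUDA 3.2 = {cuda_slowdown(pgi):.2f}x")
for cuda in ("3.2", "4.0"):
    print(f"CUDA {cuda}: PGI 12.2 / PGI 11.10 = {compiler_effect(cuda):.2f}x")
```

With the CUDA version held fixed, the two PGI releases are within about 5% of each other, while switching from CUDA 3.2 to 4.0 costs roughly 1.6x with either compiler.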

I also looked at the PGI-generated CUDA kernels and see only minor differences. We'll need to contact NVIDIA, since it seems to be an issue with their back-end tools. Do you mind if we share your code with them?

FYI, this issue is being tracked as TPR#18489.

Note that CUDA 3.2 does not ship with PGI 2012 so I needed to add a soft link from the "$PGI/2011/cuda/3.2/" directory to "$PGI/2012/cuda/".
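
For anyone repeating this, a minimal sketch of the link step follows. It assumes the link is created as `3.2` inside the 2012 `cuda` directory and that the install root is passed in explicitly; the function name and the mock directory layout in the demo are hypothetical, and the demo runs in a temporary directory and does not touch a real PGI installation.

```python
import tempfile
from pathlib import Path


def link_cuda_toolchain(pgi_root, cuda_version="3.2",
                        src_release="2011", dst_release="2012"):
    """Link <root>/<dst_release>/cuda/<version> to the copy under <src_release>.

    The link name (the CUDA version) is an assumption; adjust it if your
    installation is laid out differently.
    """
    src = Path(pgi_root) / src_release / "cuda" / cuda_version
    link = Path(pgi_root) / dst_release / "cuda" / cuda_version
    if not src.is_dir():
        raise FileNotFoundError(f"no CUDA {cuda_version} directory at {src}")
    if not (link.is_symlink() or link.exists()):
        link.symlink_to(src, target_is_directory=True)
    return link


# Demo on a mock layout in a temporary directory.
with tempfile.TemporaryDirectory() as mock_root:
    (Path(mock_root) / "2011" / "cuda" / "3.2").mkdir(parents=True)
    (Path(mock_root) / "2012" / "cuda").mkdir(parents=True)
    created = link_cuda_toolchain(mock_root)
    print(created.name, "is a symlink:", created.is_symlink())
```

After the link is in place, building with `-ta=nvidia,3.2` selects the older tool chain, as in the timings above.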

- Mat
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Tue Feb 28, 2012 2:07 am

Hi,

Sure, you can share the test code with them.
Also, there was a second question in that e-mail. Would it be possible to get a comment on it?

Thanks,

Xavier
All times are GMT - 7 Hours
Page 1 of 3