PGI User Forum


Performance decrease with PGI 12.1
mkcolg



Joined: 30 Jun 2004
Posts: 6215
Location: The Portland Group Inc.

Posted: Tue Feb 28, 2012 10:58 am

Hi Xavier,

Quote:
Also there was a second question on this e-mail, would it be possible to have a comment on it ?
I'm assuming you mean the question about the non-stride-1 messages for grad. Here the -Minfo message is correct: for private arrays, one large block of memory is allocated and then partitioned across threads, i.e. thread 0 gets the first four elements, thread 1 the next four, and so forth. Hence the non-stride-1 accesses.

I had not really questioned this until now, but since the compiler controls how grad is created, can't it make the accesses contiguous? I just talked with Michael Wolfe and he agreed to investigate whether this can be improved. I have added TPR#18490 to track this request.
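To make the stride argument concrete, here is a minimal sketch in C of the index arithmetic only. It is not PGI's actual implementation. The function names are hypothetical, and the interleaved layout is just one assumed way the accesses could be made contiguous across threads.

```c
#include <assert.h>
#include <stddef.h>

/* Blocked layout (as described in the post): thread t owns a contiguous
   chunk of elems_per_thread elements inside one large allocation. */
size_t blocked_offset(size_t t, size_t i, size_t elems_per_thread)
{
    assert(i < elems_per_thread);
    return t * elems_per_thread + i;
}

/* Interleaved layout (an assumed alternative, not documented PGI behavior):
   element i of every thread's private copy is stored side by side. */
size_t interleaved_offset(size_t t, size_t i, size_t num_threads)
{
    assert(t < num_threads);
    return i * num_threads + t;
}

/* Distance, in elements, between the locations that threads t and t+1
   touch when both access element i of their private copy. */
size_t blocked_stride(size_t t, size_t i, size_t elems_per_thread)
{
    return blocked_offset(t + 1, i, elems_per_thread)
         - blocked_offset(t, i, elems_per_thread);
}

size_t interleaved_stride(size_t t, size_t i, size_t num_threads)
{
    assert(t + 1 < num_threads);
    return interleaved_offset(t + 1, i, num_threads)
         - interleaved_offset(t, i, num_threads);
}
```

With four elements per thread, blocked_stride(0, 2, 4) is 4, so neighboring threads are four elements apart for the same index, while interleaved_stride(0, 2, 32) is 1.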

Best Regards,
Mat
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Wed Feb 29, 2012 12:44 am

Hi Mat

No, this is fine; the non-stride-1 index access is probably not a big issue here.

The question was related to the email I sent you with this code:

Quote:
2) I had an old open TPR18188 from some time last year with the same test code. If I try to compile this code with -Mcuda, the compiler crashes during compilation with the following error:
PGF90-F-0701-Error reading temp file - nmptrs (/project/s83/lapixa/GPU/tmp/PGI/turb_standalone_2/src/turb_standalone.f90)
PGF90/x86-64 Linux 12.1-0: compilation aborted


Is there any update about this TPR18188 ?

Thanks,

Xavier
mkcolg



Joined: 30 Jun 2004
Posts: 6215
Location: The Portland Group Inc.

Posted: Wed Feb 29, 2012 2:41 pm

Quote:
Is there any update about this TPR18188 ?
It looks like this may have been fixed in 12.2. However, I need to check with our engineers why the TPR is still open. There may be additional issues that they are working on.

With 11.10, I see the original ICE:
Code:

pgf90 -ta=nvidia -Mcuda -V11.10 -r8 -Kieee -Mbyteswapio -Mfree -Mmpi -Mpreprocess -Minform=inform -D__COSMO__ -c turb_standalone.f90 -o ../obj/turb_standalone.o
PGF90-S-0000-Internal compiler error. size_of:bad dtype    1892 (turb_standalone.f90: 1021)
PGF90-S-0000-Internal compiler error. size_of: bad dtype      697 (turb_standalone.f90: 1021)


With 12.1, I see the "temp" file error:
Code:

pgf90 -ta=nvidia -Mcuda -V12.1 -r8 -Kieee -Mbyteswapio -Mfree -Mmpi -Mpreprocess -Minform=inform -D__COSMO__ -c turb_standalone.f90  -o ../obj/turb_standalone.o
PGF90-F-0701-Error reading temp file - lab (turb_standalone.f90)
PGF90/x86-64 Linux 12.1-0: compilation aborted


With 12.2, the code compiles:
Code:
pgf90 -ta=nvidia -Mcuda -V12.2 -r8 -Kieee -Mbyteswapio -Mfree -Mmpi -Mpreprocess -Minform=inform -D__COSMO__ -c turb_standalone.f90  -o ../obj/turb_standalone.o
pgf90 -ta=nvidia -Mcuda -V12.2 -o turb_standalone_1  ../obj/*.o


- Mat
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Wed May 09, 2012 12:52 am

Hi,

Is there any news concerning the performance issue reported in the first post here ?

Quote:

I took a look at the code and it appears to me that the performance difference is being caused by the CUDA version being used. We switched from using CUDA 3.2 to CUDA 4.0 as the default device tool chain. I show the following kernel times for the loop at line 977 (Times are in microseconds).

Time (us)   PGI version   CUDA version
17957       11.10         3.2 (default)
29402       11.10         4.0 (-ta=nvidia,4.0)
28076       12.2          4.0 (default)
17921       12.2          3.2 (-ta=nvidia,3.2)

I also looked at the PGI generated CUDA kernels and see only minor differences. We'll need to contact NVIDIA since it seems to be an issue with their back-end tools. Do you mind if we share your code with them?



Did you have a chance to share the code I sent you with NVIDIA, and to get some feedback?

Best regards,

Xavier
mkcolg



Joined: 30 Jun 2004
Posts: 6215
Location: The Portland Group Inc.

Posted: Wed May 09, 2012 4:42 pm

Hi Xavier,

Using CUDA 4.1, the time reduces to 23011 microseconds. Not quite as good as CUDA 3.2, but better than 4.0.
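To put these kernel times on a common scale, a small helper in C can express each one as a ratio against a CUDA 3.2 baseline. This is only an arithmetic sketch. The function name is hypothetical, and which baseline to pair with the CUDA 4.1 figure is an assumption, since the compiler version used for that run is not stated.

```c
#include <assert.h>

/* Ratio of a kernel time to a baseline time, both in microseconds.
   Values above 1.0 mean the kernel is slower than the baseline. */
double slowdown(double time_us, double baseline_us)
{
    assert(baseline_us > 0.0);
    return time_us / baseline_us;
}
```

For example, slowdown(29402.0, 17957.0) is roughly 1.64 for 11.10 with CUDA 4.0, and slowdown(23011.0, 17957.0) is roughly 1.28 if the CUDA 4.1 time is compared against the same 11.10 / CUDA 3.2 baseline.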

However, in looking at the code, I think this might be a good candidate for the OpenACC Parallel construct. The PGI Accelerator Model, as well as the OpenACC Kernels construct, only works well on tightly nested loops. Since this code has conditional code surrounding the inner loops, the inner loops can't be parallelized. OpenACC "Parallel", however, will allow you to use a "gang" (in CUDA terms, a block) to parallelize the outer loop and a "vector" (in CUDA terms, the threads in a block) to parallelize the inner loops.
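As a rough illustration of that gang/vector split, here is a minimal sketch in C using OpenACC directives. The kernel is hypothetical and is not taken from turb_standalone.f90 (which is Fortran). It only shows an outer loop mapped to gangs, with a conditional surrounding inner loops that are mapped to vector lanes. Without an OpenACC-enabled compiler the pragmas are ignored and the code runs serially.

```c
#include <assert.h>

/* Hypothetical kernel: rows flagged active are scaled by factor,
   all other rows are copied unchanged. Arrays are row-major. */
void scale_active_rows(int nrows, int ncols, const int *active,
                       const double *in, double *out, double factor)
{
    assert(nrows >= 0 && ncols >= 0);

    /* Outer loop: one gang (CUDA block) per row. */
    #pragma acc parallel loop gang copyin(active[0:nrows], in[0:nrows*ncols]) copyout(out[0:nrows*ncols])
    for (int i = 0; i < nrows; ++i) {
        /* The conditional sits between the loops, so the nest is not tight. */
        if (active[i]) {
            /* Inner loop: vector lanes (CUDA threads within the block). */
            #pragma acc loop vector
            for (int j = 0; j < ncols; ++j)
                out[i * ncols + j] = factor * in[i * ncols + j];
        } else {
            #pragma acc loop vector
            for (int j = 0; j < ncols; ++j)
                out[i * ncols + j] = in[i * ncols + j];
        }
    }
}
```

Both branches write every element of their row, so the copyout clause never returns uninitialized data. Whether a given compiler release accepts and schedules this construct as written is something to verify against that release.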

"Parallel" is still in development but let me check where we're at and I'll see what I can do with your code.

- Mat
All times are GMT - 7 Hours