PGI User Forum
PGI Acc on Fermi: Does the compiler disable caching?

xray
Joined: 21 Jan 2010
Posts: 84

Posted: Thu Mar 17, 2011 5:27 am    Post subject: PGI Acc on Fermi: Does the compiler disable caching?

Hi,
I implemented a SAXPY program (y[i] = a*x[i] + y[i]) using CUDA C and PGI Accelerator C. For CUDA, I fixed the number of threads per block at 256 and varied the vector size (from 1024 to 16776960). I did the same for the PGI Accelerator implementation (schedule: parallel, vector(256), and the same vector sizes).
I compared the results of CUDA and PGI Acc obtained on an NVIDIA C2050 GPU:
1) GFlops of the whole program, i.e. kernel execution and data transfer: the PGI Accelerator values were always a bit below the GFlops of the CUDA implementation.
2) GFlops of the kernel only: PGI Accelerator achieves more GFlops than CUDA up to a vector size of about 3,000,000. Beyond that, its GFlops number is below CUDA's.
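
For readers following along, the PGI Accelerator variant described above might look roughly like the sketch below. It is not the poster's actual code: the function names `saxpy` and `saxpy_gflops` are hypothetical, and the directive spelling and the `[0:n-1]` bounds follow the PGI Accelerator model syntax as best recalled. The helper assumes the usual convention of counting 2 flops per element.

```c
#include <stddef.h>

/* SAXPY: y[i] = a*x[i] + y[i].
 * The pragmas use PGI Accelerator model directives (assumed spelling).
 * A compiler that does not recognize them ignores the pragmas and simply
 * runs the loop on the host. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
#pragma acc region copyin(x[0:n-1]) copy(y[0:n-1])
    {
#pragma acc for parallel, vector(256)
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
}

/* One multiply and one add per element gives 2*n flops in total.
 * `seconds` is the measured time (kernel only, or kernel plus transfers). */
double saxpy_gflops(size_t n, double seconds)
{
    return 2.0 * (double)n / seconds / 1e9;
}
```

Passing the kernel-only time or the kernel-plus-transfer time to `saxpy_gflops` yields the two metrics compared in (1) and (2).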

Issue (2) is the one I don't understand: Why is PGI Accelerator faster for a certain vector size?
My assumption/question: Is it possible that the internal optimizations may disable Fermi's caching for small vector sizes?

Cheers, Sandra

mkcolg
Joined: 30 Jun 2004
Posts: 5871
Location: The Portland Group Inc.

Posted: Thu Mar 17, 2011 9:44 am

Hi Sandra,

Since there isn't any data reuse, I doubt that caching is the culprit. Though I don't know what the problem could be. Can you send me your code?
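
The "no data reuse" point can be made concrete with a little arithmetic: each element of x is read once, and each element of y is read once and written once, so the kernel is limited by memory bandwidth rather than by anything a cache could help with. The sketch below is only an illustration; the name `saxpy_gflops_bound` is hypothetical, and the 144 GB/s figure in the note that follows is an assumed nominal peak for the C2050, not a measured value.

```c
/* Per element, single-precision SAXPY reads x[i], reads y[i] and writes
 * y[i]: 3 * 4 = 12 bytes of traffic for 2 flops. No value is reused. */
#define SAXPY_BYTES_PER_ELEM (3 * sizeof(float))
#define SAXPY_FLOPS_PER_ELEM 2

/* Upper bound on kernel GFlops for a given sustained bandwidth in GB/s. */
double saxpy_gflops_bound(double bandwidth_gb_per_s)
{
    return bandwidth_gb_per_s * SAXPY_FLOPS_PER_ELEM / SAXPY_BYTES_PER_ELEM;
}
```

Under the assumed 144 GB/s peak, `saxpy_gflops_bound(144.0)` gives 24 GFlops, so kernel-only numbers well below that are plausibly bandwidth effects rather than cache effects.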

FYI, Dr. Michael Wolfe will be giving a training at your site next week. While you probably don't need the training, it might be nice to stop by and introduce yourself. I can send details if needed.

Also, I saw that Michael added the profiling routines that you requested (TPR#17668) to April's 11.4 release.

Best Regards,
Mat

xray
Joined: 21 Jan 2010
Posts: 84

Posted: Fri Mar 18, 2011 12:55 am

Hi Mat,
Thanks for your information, but Michael's coming is organized by my department, so I already know :-)

Of course, caching does not help SAXPY. But, as far as I know, enabling or disabling caching changes the access pattern to global memory. If caching is enabled, a whole cache line is always requested, whereas without caching, smaller segments are transferred, which might (!) be good if you don't benefit from caches (as in SAXPY).
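
To see when the transfer granularity would matter, one can count the bytes fetched by a warp under both settings. The sketch below assumes 128-byte cache lines when L1 caching is on and 32-byte segments when loads bypass L1 (figures recalled from NVIDIA's Fermi documentation), a base address aligned to the line size, and a hypothetical helper name `bytes_fetched`.

```c
#include <stddef.h>

/* Bytes fetched when `threads` consecutive threads each load one 4-byte
 * float, with a stride of `stride` elements between neighbouring threads,
 * and memory is transferred in aligned segments of `seg_bytes` bytes.
 * Addresses only increase, so a new segment is counted whenever the
 * segment index changes. */
size_t bytes_fetched(size_t threads, size_t stride, size_t seg_bytes)
{
    size_t segments = 0;
    size_t last = (size_t)-1;  /* sentinel: no segment seen yet */
    for (size_t t = 0; t < threads; ++t) {
        size_t seg = (t * stride * sizeof(float)) / seg_bytes;
        if (seg != last) {
            ++segments;
            last = seg;
        }
    }
    return segments * seg_bytes;
}
```

For a unit-stride warp of 32 threads, as in SAXPY, both `bytes_fetched(32, 1, 128)` and `bytes_fetched(32, 1, 32)` come to 128 bytes, so the granularity makes no difference there. Only scattered access (for example a stride of 32 elements) makes the 128-byte case fetch more.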

I admit it does not seem very plausible to me, either, but it was the only thing I could think of. So there aren't any other optimizations for smaller data sizes?

BTW: Is it possible with your compiler (using a flag) to disable/enable Fermi-caching manually?

mkcolg
Joined: 30 Jun 2004
Posts: 5871
Location: The Portland Group Inc.

Posted: Fri Mar 18, 2011 10:34 am

Quote:
Is it possible with your compiler (using a flag) to disable/enable Fermi-caching manually?
The compiler will still generate software caching, where applicable, for a Fermi, even though it has less effect than it did on a Tesla. Also, we do not modify the hardware caching.

Quote:
Michael's coming is organized by my department, so I already know :-)
Great. I gave him your name and I'm sure he'd enjoy talking to you about your experiences using the PGI Accelerator model as well as any area in which we could improve.

- Mat

xray
Joined: 21 Jan 2010
Posts: 84

Posted: Tue Mar 22, 2011 4:25 am

mkcolg wrote:
The compiler will still generate software caching, where applicable, for a Fermi, even though it has less effect than it did on a Tesla. Also, we do not modify the hardware caching.


Can you say a few more words about your software caching? Among other things, how much data is cached, and what is the optimal access pattern? Anything special?

Second, hardware caching cannot be influenced by the user, right? Is there a special reason?

All times are GMT - 7 Hours