PGI User Forum


CUDA Fortran vs. CUDA C on Fermi

 
xray



Joined: 21 Jan 2010
Posts: 85

Posted: Wed Apr 13, 2011 12:01 am    Post subject: CUDA Fortran vs. CUDA C on Fermi

Hi,
I have implemented several tuned versions of my program in CUDA C, and have now done the same in CUDA Fortran. On an NVIDIA Tesla S1070 (cc 1.3) and an NVIDIA GeForce GT220 (cc 1.2), I get almost the same performance from C and Fortran (for both single and double precision). However, if I run both versions on a Fermi GPU (C2050), CUDA Fortran is suddenly slower (and the gap is even worse for single precision than for double precision).
Do you have an explanation why there is a performance difference only on Fermi?

BTW: For CUDA Fortran I use pgf90 11.1 and -Mcuda=fastmath,cuda3.2.

Bye, Sandra
Michael Wolfe



Joined: 19 Jan 2010
Posts: 42

Posted: Wed Apr 13, 2011 3:24 pm

Try compiling with -Minfo when you build the CUDA Fortran application, and compare the registers used for compute capability 1.3 and 2.0. For compute capability 2.0 (64-bit mode), pointers are 64 bits wide and take two GPU registers, whereas for compute capability 1.3 pointers are only 32 bits, since the maximum memory size is 4 GB. The CUDA Fortran compiler may not be optimizing pointer usage as much as it should. Let us know what you get; we'd be really interested if there's a big difference.
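
To make the pointer-width argument concrete, here is a minimal arithmetic sketch. It assumes a hypothetical kernel that keeps 8 addresses live at once; that count and the helper name `registers_for_pointers` are illustrative and not taken from the application discussed here. It only shows how the same set of live pointers costs twice as many 32-bit registers under 64-bit addressing.

```python
# Rough register-pressure arithmetic for pointer width (illustrative only).
# GPU registers are 32 bits wide, so a 64-bit pointer occupies two of them.

REGISTER_BITS = 32


def registers_for_pointers(num_pointers: int, pointer_bits: int) -> int:
    """Registers needed to keep `num_pointers` addresses live at once."""
    regs_per_pointer = -(-pointer_bits // REGISTER_BITS)  # ceiling division
    return num_pointers * regs_per_pointer


# Hypothetical kernel that keeps 8 array addresses live at the same time.
live_pointers = 8
regs_32bit = registers_for_pointers(live_pointers, 32)  # 32-bit addressing
regs_64bit = registers_for_pointers(live_pointers, 64)  # 64-bit addressing
print(f"32-bit pointers: {regs_32bit} registers")
print(f"64-bit pointers: {regs_64bit} registers")
```

A real compiler can reuse registers or recompute addresses, so the actual difference may well be smaller or larger than this simple doubling suggests.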
xray



Joined: 21 Jan 2010
Posts: 85

Posted: Thu Apr 14, 2011 8:32 am

I know that -Minfo=accel gives this kind of information for the PGI Accelerator programming model. But if I use -Minfo (with or without "accel") for CUDA Fortran, I get no output. What am I missing?
Is there another -Minfo option for CUDA Fortran? (The compiler reference does not list one.)
Cheers, Sandra
TheMatt



Joined: 06 Jul 2009
Posts: 317
Location: Greenbelt, MD

Posted: Thu Apr 14, 2011 9:30 am

xray wrote:
I know that -Minfo=accel gives this kind of information for the PGI Accelerator programming model. But if I use -Minfo (with or without "accel") for CUDA Fortran, I get no output. What am I missing?
Is there another -Minfo option for CUDA Fortran? (The compiler reference does not list one.)
Cheers, Sandra

For CUDA Fortran, you can use the ptxinfo suboption of -Mcuda, i.e. -Mcuda=fastmath,cuda3.2,ptxinfo. This reports the local memory (lmem), shared memory (smem), constant memory (cmem), and registers used for each compute capability.

Matt
xray



Joined: 21 Jan 2010
Posts: 85

Posted: Fri Apr 15, 2011 7:34 am

Okay, here are my results:

With CUDA Fortran, my application uses 60 registers when compiled for cc 2.0 and only 27 registers when compiled for cc 1.3/1.2, which is what Michael expected. If I run the executable compiled for cc 1.3, I get the same performance as for the CUDA C run. Are there any further flags (besides fastmath) that I could specify for cc 2.0 so that it runs faster?

I also just realized that our CUDA C version wasn't compiled for cc 2.0 either, but for cc 1.3 (for the best-effort approach). If I compile it for cc 2.0 with -ftz=true -prec-sqrt=false -prec-div=false, the application still runs slower than the cc 1.3 build and takes approximately the same time as the CUDA Fortran version compiled for cc 2.0 with fastmath. Without these flags it is even slower than the CUDA Fortran version. The CUDA C cc 2.0 version uses 60 registers without these flags and 36 registers with them. The original cc 1.3 version uses 27 registers.

Thus, the results, runtimes, and register counts are comparable again between CUDA Fortran and CUDA C.
But I think it is still unexpected to see such a large slowdown from changing the compute capability from 1.3 to 2.0 on Fermi...
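
One possible contributing factor (a guess, not a diagnosis of this application) is that higher register use limits how many threads can be resident on a multiprocessor at once. The sketch below works through that bound for the register counts reported in this post. The per-multiprocessor limits are the values listed in the CUDA programming guide for cc 1.3 and cc 2.0; the function name is made up for illustration, and the calculation deliberately ignores register allocation granularity, block size, and shared memory.

```python
# Upper bound on resident threads per multiprocessor imposed by register use.
# Ignores register allocation granularity, block size, and shared memory,
# so real occupancy can only be equal to or lower than this estimate.

# Per-multiprocessor limits by compute capability.
LIMITS = {
    "1.3": {"registers": 16384, "max_threads": 1024},
    "2.0": {"registers": 32768, "max_threads": 1536},
}


def register_limited_occupancy(cc: str, regs_per_thread: int) -> float:
    """Fraction of the maximum resident threads that the register file allows."""
    limits = LIMITS[cc]
    threads = min(limits["registers"] // regs_per_thread, limits["max_threads"])
    return threads / limits["max_threads"]


# Register counts reported in this thread.
for cc, regs in [("1.3", 27), ("2.0", 36), ("2.0", 60)]:
    bound = register_limited_occupancy(cc, regs)
    print(f"cc {cc}, {regs} registers/thread: at most {bound:.0%} occupancy")
```

Under these assumptions the 27-register cc 1.3 build and the 36-register cc 2.0 build both come out at roughly 59%, while the 60-register cc 2.0 build drops to roughly 36%. Whether that actually explains the measured slowdown would need profiling.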
All times are GMT - 7 Hours