PGI User Forum


PGI Accelerator on NVIDIA S1070 and S2050 Fermi

 
PGI User Forum Forum Index -> Accelerator Programming
sindimo
Joined: 30 Nov 2010
Posts: 29
Location: Saudi Aramco

Posted: Tue Feb 01, 2011 12:19 am    Post subject: PGI Accelerator on NVIDIA S1070 and S2050 Fermi

Dear Mat,

We have some code being accelerated on an NVIDIA S1070 GPU (compute capability 1.3) using PGI 10.9 directives (we are not using PGI 11.1 because we get internal compiler errors for some reason, but it compiles fine on 10.9).

The NVIDIA driver we are using is for CUDA 3.1, since the PGI 10.9 manual does not yet certify the compiler to run with a CUDA 3.2 driver.

The code runs fine and produces correct results on the S1070 GPU.

Next we wanted to run the code on an S2050 GPU (Fermi, compute capability 2.0). We noticed that when you compile the code, the compiler by default creates a binary for compute capability 1.3, even though we are running on a Fermi. To override the default, we use the cc20 flag ("-ta=nvidia,time,cuda3.0,cc20"), which produces a compute capability 2.0 binary.

If we don't use the cc20 flag on Fermi, the code produces totally bizarre results.

When we use the cc20 flag on Fermi, the code runs but produces results that are a bit off from what we expect (the run on the S1070 produces the correct expected results, though).

Are any special settings needed on the PGI side to get things working correctly on Fermi, other than compiling with the cc20 option?

I found this PGI posting by Michael Wolfe and that's where I got the cc20 option from:
http://www.pgroup.com/lit/articles/insider/v2n2a1.htm

We have NVIDIA involved in this as well but no answers so far.

Thank you for your help.

Mohamad Sindi
TheMatt
Joined: 06 Jul 2009
Posts: 317
Location: Greenbelt, MD

Posted: Tue Feb 01, 2011 5:58 am

sindimo,

In my experience, when the GPU results are a bit off, the first thing to try is the nofma option in your -ta/-Mcuda option list. I've found that, on Teslas at least, nofma seems to help with accuracy, though at the cost of some performance. (NB: I don't have access to a Fermi plus the PGI compiler yet, so I'm not sure whether nofma has as large an effect there, if any.) I'm also not sure what effect -Kieee would have on your code, but you might want to try it as well.

Also, the example -ta settings you list have "cuda3.0". Is there a reason you are using that and not "cuda3.1"? If nothing else, I seem to recall the cuda3.1 PTX assembler was faster, which is nice.

If these don't work, real/PGI Mat will know more.

Matt
sindimo
Joined: 30 Nov 2010
Posts: 29
Location: Saudi Aramco

Posted: Tue Feb 01, 2011 6:59 am

Thanks Matt for your feedback.

We already tried the nofma option (we looked it up in the PGI manual) and it didn't help, and we're already using the -Kieee flag during compilation. Using nofma also slowed the run down a bit.

As for 3.0 vs. 3.1: the PGI compiler seems to misbehave during compilation when I use cuda3.1 with cc20, while cuda3.0 works better. In the compiler feedback below, the registers, shared memory, etc. are all reported as zero with 3.1, while with 3.0 they show correct values. Both binaries still run and produce the same results, but the 3.0 run seems slightly faster than the 3.1 one for some reason.

3.0
Code:

Accelerator kernel generated
        278, !$acc do vector(32)
        283, !$acc do parallel
             Cached references to size [32] block of 'jeven'
             Cached references to size [32] block of 'jodd'
             CC 2.0 : 62 registers; 1028 shared, 960 constant, 592 local memory bytes; 16 occupancy


3.1
Code:

   Accelerator kernel generated
        278, !$acc do vector(32)
        283, !$acc do parallel
             Cached references to size [32] block of 'jeven'
             Cached references to size [32] block of 'jodd'
             CC 2.0 : 0 registers; 0 shared, 0 constant, 0 local memory bytes; 16 occupancy


Thanks
mkcolg
Joined: 30 Jun 2004
Posts: 6129
Location: The Portland Group Inc.

Posted: Tue Feb 01, 2011 9:33 am

Hi Mohamad Sindi,

It sounds like there are multiple points of failure, so it would be best if you could send a report with a reproducing example to PGI Customer Service (trs@pgroup.com). Ask them to forward the mail to me.
Quote:

We noticed that when you compile the code, the compiler by default creates a binary that is compute capability 1.3 even though we are running on a Fermi.
The default is to produce multiple compute capabilities. It should have produced cc13 and cc20.

Thanks,
Mat