PGI User Forum

PGI 14.1 and K20x Cards: Best Mcuda flag to use?

 
TheMatt



Joined: 06 Jul 2009
Posts: 322
Location: Greenbelt, MD

Posted: Wed Feb 05, 2014 7:18 am    Post subject: PGI 14.1 and K20x Cards: Best Mcuda flag to use?

In my investigations of using a K20x card, I've found the best compilation strategy is:
Code:
-Mcuda=nofma,5.0,cc35,ptxinfo -Mcuda=maxregcount:72

When I do this, the code compiles and generates:
Code:
pgfortran -fast -r4 -Mextend -Mpreprocess -Ktrap=fp -Kieee -tp=sandybridge-64 -Mcuda=nofma,5.0,cc35,ptxinfo -Mcuda=maxregcount:72  -DNITERS=6 -DBIG -DGPU_PRECISION=8 -c src/sorad.F90
ptxas info    : 3248 bytes gmem, 576 bytes cmem[3]
ptxas info    : Compiling entry function 'soradmod_sorad_' for 'sm_35'
ptxas info    : Function properties for soradmod_sorad_
    18704 bytes stack frame, 1168 bytes spill stores, 1336 bytes spill loads
ptxas info    : Used 72 registers, 344 bytes cmem[0], 280 bytes cmem[2]
leading to these timings:
Code:
 ----- Timings -----
 Time in Milliseconds
    Total :   2831.156 +/-      3.349
   Kernel :   2425.110 +/-      1.852
Data Xfer :    382.362 +/-      2.245
By trial and error I found that 72 registers per thread gives the best performance, so I've been forcing that with maxregcount.
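When tuning maxregcount by trial and error, it can help to pull the register and spill numbers out of the ptxinfo output automatically. The following is a minimal sketch, not part of the original post: `summarize_ptxas` is a hypothetical helper, and its regular expressions only cover the line shapes that appear in this thread.

```python
import re

# ptxas output copied from the cc35 / CUDA 5.0 build shown above.
PTXAS_OUTPUT = """\
ptxas info    : 3248 bytes gmem, 576 bytes cmem[3]
ptxas info    : Compiling entry function 'soradmod_sorad_' for 'sm_35'
ptxas info    : Function properties for soradmod_sorad_
    18704 bytes stack frame, 1168 bytes spill stores, 1336 bytes spill loads
ptxas info    : Used 72 registers, 344 bytes cmem[0], 280 bytes cmem[2]
"""


def summarize_ptxas(text):
    """Return target, register count, and spill traffic found in ptxas text.

    Hypothetical helper: keys are only present when the matching line exists.
    """
    summary = {}

    # Target architecture, e.g. 'sm_35'.
    m = re.search(r"for '(sm_\d+)'", text)
    if m:
        summary["target"] = m.group(1)

    # Registers actually used per thread.
    m = re.search(r"Used (\d+) registers", text)
    if m:
        summary["registers"] = int(m.group(1))

    # Stack frame size and spill traffic, all in bytes.
    m = re.search(
        r"(\d+) bytes stack frame, (\d+) bytes spill stores, "
        r"(\d+) bytes spill loads",
        text,
    )
    if m:
        summary["stack_bytes"] = int(m.group(1))
        summary["spill_stores"] = int(m.group(2))
        summary["spill_loads"] = int(m.group(3))

    return summary


if __name__ == "__main__":
    print(summarize_ptxas(PTXAS_OUTPUT))
```

Running the same function over the output of each maxregcount setting gives one record per build, which makes it easier to see where spilling starts to grow.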

But, I decided to look and see if 14.1 has better/newer settings, to wit:
Code:
$ pgfortran -Mcuda=help
...
    emu             Enable emulation mode
    tesla           Compile for Tesla architecture
    tesla+          Compile for Tesla architecture and above
    cc1x            Compile for compute capability 1.x
    cc1+            Compile for compute capability 1.x and above
    fermi           Compile for Fermi architecture
    fermi+          Compile for Fermi architecture and above
    cc2x            Compile for compute capability 2.x
    cc2+            Compile for compute capability 2.x and above
    kepler          Compile for Kepler architecture
    kepler+         Compile for Kepler architecture and above
    cc3x            Compile for compute capability 3.x
    cc3+            Compile for compute capability 3.x and above
    ...

I see that cc35 isn't here (although it is in the man page), so I wondered is cc35 discouraged? I tried compiling with -Mcuda=kepler, thinking it might detect the card correctly, but I got this:
Code:
pgfortran -fast -r4 -Mextend -Mpreprocess -Ktrap=fp -Kieee -tp=sandybridge-64 -Mcuda=nofma,5.0,kepler,ptxinfo -Mcuda=maxregcount:72  -DNITERS=6 -DBIG -DGPU_PRECISION=8 -c src/sorad.F90
ptxas warning : Too big maxrregcount value specified 72, will be ignored
ptxas info    : 3248 bytes gmem, 576 bytes cmem[3]
ptxas info    : Compiling entry function 'soradmod_sorad_' for 'sm_30'
ptxas info    : Function properties for soradmod_sorad_
    18768 bytes stack frame, 1656 bytes spill stores, 1844 bytes spill loads
ptxas info    : Used 63 registers, 344 bytes cmem[0], 284 bytes cmem[2]
...
 ----- Timings -----
 Time in Milliseconds
    Total :   3166.526 +/-      1.995
   Kernel :   2492.151 +/-      1.438
Data Xfer :    650.974 +/-      2.208

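As you can see, it targeted cc30 (sm_30), not cc35, and so led to slower timings. It also capped the kernel at 63 registers, the per-thread limit on sm_30, which is why ptxas ignored my maxregcount of 72. This makes sense, I suppose, since Kepler is not just cc35 but cc30 too, but I guess I thought "kepler" might notice a cc35 card and target it.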
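The ptxas warning in the kepler build follows from the per-thread register limit of the target. As a rough sketch (not from the original post), the check below uses the limits from NVIDIA's published compute-capability tables, 63 registers per thread for sm_30 and 255 for sm_35. `check_maxregcount` is a hypothetical helper, and only the two targets seen in this thread are listed.

```python
# Maximum registers per thread for the two targets that appear in this
# thread. Values are taken from NVIDIA's compute-capability tables, not
# from the forum post itself.
MAX_REGS_PER_THREAD = {
    "sm_30": 63,
    "sm_35": 255,
}


def check_maxregcount(target, requested):
    """Return (honored, limit) for a requested maxregcount on a target.

    honored is False when the request exceeds the hardware limit, which is
    the situation where ptxas prints "Too big maxrregcount value specified".
    """
    limit = MAX_REGS_PER_THREAD[target]
    return requested <= limit, limit


if __name__ == "__main__":
    for target in ("sm_30", "sm_35"):
        honored, limit = check_maxregcount(target, 72)
        status = "honored" if honored else "ignored"
        print(f"{target}: maxregcount 72 is {status} (limit {limit})")
```

This matches the two builds above: on sm_35 the kernel used the requested 72 registers, while on sm_30 it was held to 63 and spilled more.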

Also, as a note, I'm using cuda50 here because cuda55 leads to worse performance:
Code:
pgfortran -fast -r4 -Mextend -Mpreprocess -Ktrap=fp -Kieee -tp=sandybridge-64 -Mcuda=nofma,5.5,cc35,ptxinfo -Mcuda=maxregcount:72  -DNITERS=6 -DBIG -DGPU_PRECISION=8 -c src/sorad.F90
ptxas info    : 3259 bytes gmem, 576 bytes cmem[3]
ptxas info    : Compiling entry function 'soradmod_sorad_' for 'sm_35'
ptxas info    : Function properties for soradmod_sorad_
    18584 bytes stack frame, 572 bytes spill stores, 544 bytes spill loads
ptxas info    : Used 72 registers, 344 bytes cmem[0], 276 bytes cmem[2]
...
 ----- Timings -----
 Time in Milliseconds
    Total :   3501.825 +/-      8.523
   Kernel :   2824.677 +/-      8.268
Data Xfer :    653.633 +/-      0.822

Looks like CUDA 5.5 does the "spill" heuristics differently: it spills less (572 bytes of spill stores versus 1168), yet the kernel is slower for me. Hmm...
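To put the three sets of timings side by side, the sketch below computes each timer's slowdown relative to the cc35 / CUDA 5.0 build. The numbers are the mean values copied from this post; the run labels and the `slowdown_percent` helper are illustrative names, and the +/- spreads are left out for brevity.

```python
# Mean timings in milliseconds, copied from the three runs in this post.
RUNS = {
    "cc35, CUDA 5.0": {"total": 2831.156, "kernel": 2425.110, "xfer": 382.362},
    "kepler (sm_30), CUDA 5.0": {"total": 3166.526, "kernel": 2492.151, "xfer": 650.974},
    "cc35, CUDA 5.5": {"total": 3501.825, "kernel": 2824.677, "xfer": 653.633},
}

BASELINE = "cc35, CUDA 5.0"


def slowdown_percent(run, timer, baseline=BASELINE):
    """Percentage by which `timer` in `run` exceeds the baseline run."""
    base = RUNS[baseline][timer]
    return 100.0 * (RUNS[run][timer] - base) / base


if __name__ == "__main__":
    for name in RUNS:
        parts = ", ".join(
            f"{timer} {slowdown_percent(name, timer):+.1f}%"
            for timer in ("total", "kernel", "xfer")
        )
        print(f"{name}: {parts}")
```

Splitting the comparison by timer shows how much of each gap in Total comes from the kernel itself and how much from data transfer, which also differs noticeably between the runs.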

Thanks,
Matt
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Wed Feb 05, 2014 1:29 pm

Hi Matt,

Quote:
I see that cc35 isn't here (although it is in the man page), so I wondered is cc35 discouraged?
The x in "cc3x" is meant to indicate "insert number here", e.g. "cc30" or "cc35". I'll see if we can make this clearer in the help output. So yes, cc35 is still supported, and encouraged if you have a CC 3.5 device.
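The "insert number here" reading of cc3x can be illustrated with a small sketch that builds the -Mcuda option from a device's compute capability. This is not a PGI tool: `mcuda_flag` is a hypothetical helper, the surrounding sub-options are simply the ones used earlier in this thread, and only cc30 and cc35 are accepted because those are the only values named here.

```python
# Compute capabilities explicitly named in this thread. Other values may be
# valid for the compiler, but that is not confirmed here.
CONFIRMED = {(3, 0), (3, 5)}


def mcuda_flag(major, minor, cuda_version="5.0"):
    """Build the -Mcuda option used in this thread for a given capability.

    Replaces the 'x' in 'cc3x' with the minor version, e.g. (3, 5) -> cc35.
    """
    if (major, minor) not in CONFIRMED:
        raise ValueError(f"cc{major}{minor} is not confirmed in this thread")
    target = f"cc{major}{minor}"
    return f"-Mcuda=nofma,{cuda_version},{target},ptxinfo"


if __name__ == "__main__":
    # A K20x is a compute capability 3.5 device.
    print(mcuda_flag(3, 5))
```

For the K20x in the original question this reproduces the cc35 form of the option rather than the generic kepler one.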

Though, why CUDA 5.5 is giving slower performance is a mystery.

- Mat
All times are GMT - 7 Hours