PGI User Forum

Fast math (exp,pow,log)?

 
ofuhrer



Joined: 18 Feb 2008
Posts: 16

Posted: Sun Jan 17, 2010 3:58 am    Post subject: Fast math (exp,pow,log)?

Hi,

I am using the CrayPat profiling tools on a code running on 984 cores of an XT5. The sampling profile gives me the output below. Everything looks quite straightforward and understandable. Some of the routines seem to be quite heavy on the intrinsic EXP, POW and LOG functions. Is there a way to try out optimized/accelerated math functions to see their impact on the code? How would I have to compile and link my code in order to try them out?

Thanks,
Oli

Code:

 100.0% | 82937 |       -- |     -- |Total
|------------------------------------------------
|  54.5% | 45213 |       -- |     -- |USER
||-----------------------------------------------
||  15.2% | 12567 |  1252.56 |   9.1% |fast_waves_rk_fast_waves_runge_kutta_
||   4.9% |  4028 |   604.77 |  13.1% |src_turbdiff_turbdiff_
||   4.6% |  3800 |   364.13 |   8.8% |numeric_utilities_interpol_sl_tricubic_
||   4.1% |  3368 |   448.57 |  11.8% |src_slow_tendencies_rk_complete_tendencies_uvwtpp_
||   2.4% |  2008 |   445.34 |  18.2% |environment_putbuf_
||   2.1% |  1782 |   245.12 |  12.1% |src_advection_rk_advection_
||   2.1% |  1769 |   255.89 |  12.6% |src_runge_kutta_org_runge_kutta_
||   1.6% |  1362 |   357.93 |  20.8% |environment_getbuf_
||   1.6% |  1332 |   534.36 |  28.7% |src_gscp_hydci_pp_gr_
||   1.4% |  1138 |   146.15 |  11.4% |src_advection_rk_adv_upwind5_lat_
||   1.4% |  1131 |   138.14 |  10.9% |lmorg_initialize_loop
||   1.2% |  1004 |   192.37 |  16.1% |src_advection_rk_adv_upwind5_lon_
||===============================================
|  26.4% | 21870 |       -- |     -- |ETC
||-----------------------------------------------
||   5.1% |  4213 |  1460.19 |  25.8% |__mth_i_dexp
||   4.2% |  3509 |  1248.63 |  26.3% |PtlEQPeek
||   2.8% |  2338 |   259.03 |  10.0% |__c_mzero8
||   2.1% |  1752 |   717.68 |  29.1% |fast_nal_poll
||   2.0% |  1665 |   374.32 |  18.4% |__mth_i_dpowd
||   1.6% |  1332 |   528.43 |  28.4% |PtlEQGet
||   1.5% |  1203 |   499.00 |  29.3% |__mth_i_dlog
||   1.2% |   977 |   343.05 |  26.0% |PtlEQGet_internal
||   1.1% |   899 |   432.91 |  32.5% |MPIDI_CRAY_smpdev_progress
||===============================================
|  19.1% | 15854 |       -- |     -- |MPI
||-----------------------------------------------
||   9.1% |  7544 | 70473.83 |  90.4% |mpi_recv_
||   4.6% |  3823 |   227.10 |   5.6% |mpi_allgather_
||   1.9% |  1561 |    84.60 |   5.1% |mpi_scatter_
||   1.5% |  1213 |   156.15 |  11.4% |mpi_allreduce_
|================================================
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Tue Jan 19, 2010 10:23 am

Hi Oli,

Which flags are you using? Given the symbol names, I'm assuming that you have "-Kieee" enabled. To get the optimized versions ("__fmth_i_dexp" instead of "__mth_i_dexp"), please remove this flag. The "fast" versions are slightly less precise (~1 ULP) but should improve performance.

For some other operations (sqrt, rsqrt, div), you can also try the flag "-Mfprelaxed".

If you wish to try out your own versions of these routines, you would need to rename the intrinsic calls to match the names of your versions. A preprocessor directive such as "#define EXP MYEXP" at the top of each of your source files should do the substitution for you.

Hope this helps,
Mat
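
To get a feel for what an error of about 1 ULP means for a double-precision exp, a candidate routine can be compared against a high-precision reference. The sketch below is only an illustration in Python, not PGI's implementation: `exp_ulp_error` and `one_ulp_off` are hypothetical names, and `one_ulp_off` is merely a stand-in for a slightly less precise routine. It assumes Python 3.9 or later for `math.ulp` and `math.nextafter`.

```python
import math
from decimal import Decimal, getcontext

# Far more digits than a 64-bit double carries, so the reference is "exact enough".
getcontext().prec = 50


def exp_ulp_error(candidate_exp, x):
    """Error of candidate_exp(x) against a high-precision exp(x), in ULPs."""
    exact = Decimal(x).exp()                 # Decimal(x) converts the double exactly
    approx = Decimal(candidate_exp(x))
    ulp = Decimal(math.ulp(float(exact)))    # spacing of doubles near the true value
    return float(abs(approx - exact) / ulp)


def one_ulp_off(x):
    """Stand-in for a less precise routine: the library result nudged up by one ULP."""
    return math.nextafter(math.exp(x), math.inf)


# Worst observed error over a few sample arguments (hypothetical test points).
points = [-20.0, -1.5, 0.1, 1.0, 7.25, 30.0]
for name, func in [("math.exp", math.exp), ("one_ulp_off", one_ulp_off)]:
    worst = max(exp_ulp_error(func, x) for x in points)
    print(f"{name}: worst error {worst:.3f} ULP")
```

The same measurement could be applied to values written out by the compiled code, using the program's own exp results in place of `candidate_exp`.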
ofuhrer



Joined: 18 Feb 2008
Posts: 16

Posted: Tue Jan 19, 2010 3:24 pm

Dear Mat,

Yes, this helps! The options I am using are...

-Kieee -Mbyteswapio -Mfree -Mpreprocess -Mcache_align -Mflushz -Mlre -Mprefetch -Mpreprocess -Mscalarsse -Mvect=noassoc -Mvect=sse -O3 -Mipa=fast,inline -fastsse

I will try out the suggestions you have made and post results as soon as I have them. Do you have any experience in using the ACML, as we are running on Quad-core AMD Opterons on a Cray XT4?

Cheers,
Oli
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Tue Jan 19, 2010 4:32 pm

Hi Oli,

Quote:
Do you have any experience in using the ACML, as we are running on Quad-core AMD Opterons on a Cray XT4?


I've used it, but have not done much performance testing with it.

- Mat
ofuhrer



Joined: 18 Feb 2008
Posts: 16

Posted: Wed Jan 20, 2010 6:13 am

Mat,

I tried removing -Kieee and adding -Mfprelaxed to my compilation options and got the following results...

Code:

 100.0% | 78197 |       -- |     -- |Total
|------------------------------------------------
|  56.8% | 44386 |       -- |     -- |USER
||-----------------------------------------------
||  16.0% | 12523 |  1293.95 |   9.4% |fast_waves_rk_fast_waves_runge_kutta_
||   4.8% |  3766 |   541.95 |  12.6% |src_turbdiff_turbdiff_
||   4.7% |  3700 |   384.86 |   9.4% |numeric_utilities_interpol_sl_tricubic_
||   4.3% |  3365 |   409.66 |  10.9% |src_slow_tendencies_rk_complete_tendencies_uvwtpp_
||   2.6% |  2050 |   459.89 |  18.3% |environment_putbuf_
||   2.3% |  1781 |   249.64 |  12.3% |src_runge_kutta_org_runge_kutta_
||   2.1% |  1650 |   230.54 |  12.3% |src_advection_rk_advection_
||   1.8% |  1388 |   358.70 |  20.6% |environment_getbuf_
||   1.6% |  1279 |   515.36 |  28.8% |src_gscp_hydci_pp_gr_
||   1.5% |  1134 |   137.31 |  10.8% |lmorg_initialize_loop
||   1.4% |  1084 |   108.88 |   9.1% |src_advection_rk_adv_upwind5_lat_
||   1.2% |   954 |   208.70 |  18.0% |src_advection_rk_adv_upwind5_lon_
||   1.0% |   767 |   250.12 |  24.6% |src_relaxation_sardass_
||   1.0% |   744 |   145.97 |  16.4% |src_slow_tendencies_rk_implicit_vert_diffusion_uvwt_
||===============================================
|  23.2% | 18152 |       -- |     -- |ETC
||-----------------------------------------------
||   4.1% |  3213 |  1201.95 |  27.3% |PtlEQPeek
||   3.1% |  2386 |   295.83 |  11.0% |__c_mzero8
||   2.7% |  2083 |   806.15 |  27.9% |__fmth_i_dexp
||   2.2% |  1695 |   619.89 |  26.8% |fast_nal_poll
||   1.6% |  1232 |   483.04 |  28.2% |PtlEQGet
||   1.2% |   916 |   364.49 |  28.5% |__fmth_i_dlog
||   1.2% |   910 |   330.43 |  26.7% |PtlEQGet_internal
||   1.1% |   834 |   313.52 |  27.3% |MPIDI_CRAY_smpdev_progress
||===============================================
|  20.0% | 15659 |       -- |     -- |MPI
||-----------------------------------------------
||   9.1% |  7147 | 66181.37 |  90.3% |mpi_recv_
||   5.1% |  4016 |   234.95 |   5.5% |mpi_allgather_
||   2.0% |  1567 |    79.37 |   4.8% |mpi_scatter_
||   1.7% |  1320 |   140.20 |   9.6% |mpi_allreduce_
|================================================


Obviously we are using the fast math routines now. The powd routine dropped below the 1% cutoff, and exp and log both improved. So things seem to be working. Now I will have to investigate the impact on the results. People running meteorological codes are usually quite hesitant to use fast math libraries.
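
The sample counts in the two profiles above can be turned into relative reductions. A small Python sketch, assuming both runs used the same workload and sampling rate so that raw sample counts are comparable (the function name `reduction_percent` is made up for illustration):

```python
def reduction_percent(before, after):
    """Relative drop in sample count, in percent of the original count."""
    return 100.0 * (before - after) / before


# Sample counts copied from the two profiles (first with -Kieee, then without it).
samples = {
    "Total": (82937, 78197),
    "exp (__mth_i_dexp -> __fmth_i_dexp)": (4213, 2083),
    "log (__mth_i_dlog -> __fmth_i_dlog)": (1203, 916),
}

for name, (before, after) in samples.items():
    print(f"{name}: {before} -> {after} samples, "
          f"{reduction_percent(before, after):.1f}% fewer")
```

__mth_i_dpowd had 1665 samples in the first profile and is not listed in the second one, so its reduction cannot be computed from the output shown.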

Concerning the ACML, I'm still trying to get it to run.

Thanks again,
Oli
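
For a first look at the impact on the results, output fields from a run with -Kieee and a run without it can be compared point by point. A minimal Python sketch under stated assumptions: the fields are already available as flat sequences of floats, and `max_relative_difference` and its `floor` parameter are hypothetical names, not part of any tool mentioned in this thread.

```python
def max_relative_difference(reference, candidate, floor=1e-300):
    """Largest |candidate - reference| / |reference| over two equally long sequences.

    `floor` guards against division by zero where the reference value is 0.
    """
    if len(reference) != len(candidate):
        raise ValueError("fields must have the same number of points")
    worst = 0.0
    for ref, cand in zip(reference, candidate):
        rel = abs(cand - ref) / max(abs(ref), floor)
        worst = max(worst, rel)
    return worst


# Made-up values standing in for one field from each run.
field_ieee = [273.15, 280.0, 1013.25]
field_fast = [273.15, 280.00000000000006, 1013.25]
print(max_relative_difference(field_ieee, field_fast))
```

Small differences tend to grow over a long integration in this kind of model, so a comparison after a single time step and another after a full run would answer different questions.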
All times are GMT - 7 Hours