PGI User Forum


Terrible HPL performance

 
gormanly

Joined: 09 Nov 2004
Posts: 9

Posted: Wed Aug 20, 2008 8:04 am    Post subject: Terrible HPL performance

I've been benchmarking some new Sun Fire X4600 M2 servers with HPL, using our normal software stack of PGI (with the bundled MPICH and ACML) on RHEL 4.6, and the performance is awful.

The hardware is 8x Opteron 8356 with 64 GB RAM, which gives an Rpeak of 294.4 GFLOPS.

With PGI 7.2, we get an Rmax of 58.18 GFLOPS.

I tried out a different compiler and MPI, Sun Studio Express 07/08 and Sun HPC ClusterTools 8.0 EA2, and the Rmax with the same input file is 169.6 GFLOPS.

Compiler options were:
PGI: -tp barcelona-64 -fastsse -O3 -Munroll
Studio: -fast -xtarget=barcelona -m64 -xvector=simd
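
For reference, the PGI flags above go into the HPL Make.<arch> build file in the usual way; a minimal sketch of the relevant lines (wrapper names and values are only illustrative, not our exact setup):

Code:

# Sketch of the compile section of an HPL Make.<arch> for the PGI build
# (illustrative only)
CC           = mpicc
CCFLAGS      = $(HPL_DEFS) -tp barcelona-64 -fastsse -O3 -Munroll
LINKER       = mpif77
LINKFLAGS    = -tp barcelona-64 -fastsse -O3 -Munroll
# The BLAS library itself is pulled in separately via the LAdir/LAlib variables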

Any suggestions?
gormanly

Joined: 09 Nov 2004
Posts: 9

Posted: Wed Aug 20, 2008 8:40 am

Hmm, possibly it's the MPI implementation, as the run time and CPU utilization are wildly different between processes:

Code:

top - 16:33:34 up 14 days, 16 min,  2 users,  load average: 25.06, 25.54, 25.91
Tasks: 455 total,  26 running, 429 sleeping,   0 stopped,   0 zombie
Cpu(s): 71.9% us,  3.9% sy,  0.0% ni, 23.7% id,  0.0% wa,  0.0% hi,  0.5% si
Mem:  65910064k total, 55307004k used, 10603060k free,   272532k buffers
Swap: 20482832k total,        0k used, 20482832k free,  3703764k cached

  PID USER      PR  NI %CPU    TIME+  %MEM  VIRT  RES  SHR S COMMAND           
 8731 atg       25   0  100  84:20.77  2.4 1603m 1.5g 1780 R xhpl               
 8623 atg       23   0  100  80:26.38  2.4 1564m 1.5g 1780 R xhpl               
 8839 atg       23   0  100  81:33.91  2.4 1564m 1.5g 1780 R xhpl               
 9055 atg       23   0  100  79:20.34  2.4 1564m 1.5g 1780 R xhpl               
 9136 atg       16   0   99  72:53.32  2.4 1607m 1.5g 1780 R xhpl               
 9271 atg       23   0   99  83:59.40  2.4 1564m 1.5g 1780 R xhpl               
 8503 atg       25   0   99  82:25.96  2.4 1602m 1.5g 1780 R xhpl               
 8785 atg       19   0   98  73:19.73  2.4 1598m 1.5g 1780 R xhpl               
 9163 atg       25   0   98  62:56.78  2.4 1597m 1.5g 1780 S xhpl               
 8812 atg       17   0   97  83:03.70  2.4 1574m 1.5g 1800 R xhpl               
 8758 atg       18   0   97  82:03.88  2.4 1598m 1.5g 1780 R xhpl               
 8677 atg       16   0   90  75:13.70  2.4 1572m 1.5g 1780 S xhpl               
 9244 atg       17   0   89  87:08.42  2.4 1574m 1.5g 1796 R xhpl               
 8893 atg       16   0   86  72:39.59  2.4 1572m 1.5g 1780 R xhpl               
 8920 atg       16   0   85  66:57.44  2.4 1596m 1.5g 1780 S xhpl               
 9001 atg       19   0   84  76:36.71  2.4 1599m 1.5g 1780 R xhpl               
 8496 atg       16   0   84  71:25.74  2.4 1597m 1.5g 1868 S xhpl               
 8596 atg       17   0   81  66:07.80  2.4 1574m 1.5g 1800 R xhpl               
 9325 atg       16   0   73  76:16.28  2.4 1572m 1.5g 1780 S xhpl               
 8704 atg       16   0   72  71:16.70  2.4 1606m 1.5g 1780 R xhpl               
 9190 atg       18   0   71  77:12.88  2.4 1599m 1.5g 1780 R xhpl               
 8947 atg       16   0   69  62:06.32  2.4 1596m 1.5g 1780 S xhpl               
 9109 atg       16   0   69  73:57.95  2.4 1572m 1.5g 1780 S xhpl               
 9028 atg       17   0   67  68:42.16  2.4 1574m 1.5g 1800 R xhpl               
 8569 atg       18   0   62  72:27.98  2.4 1598m 1.5g 1780 R xhpl               
 8974 atg       18   0   61  81:05.97  2.4 1599m 1.5g 1780 R xhpl               
 9217 atg       19   0   60  76:22.10  2.4 1599m 1.5g 1780 R xhpl               
 8534 atg       18   0   58  77:05.74  2.4 1598m 1.5g 1780 R xhpl               
 9298 atg       17   0   25  75:44.63  2.4 1569m 1.5g 1780 R xhpl               
 8866 atg       17   0   24  79:14.78  2.4 1568m 1.5g 1780 R xhpl               
 8650 atg       16   0   24  80:27.81  2.4 1568m 1.5g 1780 R xhpl               
 9082 atg       17   0   18  76:55.07  2.4 1569m 1.5g 1780 R xhpl               
 9876 atg       16   0    1   0:00.51  0.0  8732 1540  932 R top               
 9365 root      15   0    0   0:34.35  0.0 35396  11m 2608 S X                 
 9699 gdm       16   0    0   0:10.12  0.0  124m  11m 6940 S gdmgreeter         
    1 root      16   0    0   0:02.71  0.0  4756  556  456 S init               
    2 root      RT   0    0   0:00.13  0.0     0    0    0 S migration/0       
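
One thing worth checking, given the uneven %CPU figures above, is process placement: on an 8-socket NUMA box, ranks that are not pinned can drift between sockets. A couple of illustrative ways to inspect and force affinity (the PID and node numbers are just examples):

Code:

# Show the current CPU affinity mask of one xhpl rank (PID from the listing above)
taskset -cp 8731

# Illustrative: confine a run to one socket's cores and memory with numactl
numactl --cpunodebind=0 --membind=0 ./xhpl

# Illustrative: with Open MPI, ask the runtime to pin each rank to a processor
mpirun -np 32 --mca mpi_paffinity_alone 1 ./xhpl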
mkcolg

Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Wed Aug 20, 2008 9:07 am

Hi gormanly,

I would definitely try a different MPI implementation. The MPICH we ship uses basic TCP/IP and is meant for portability. For high-performance applications, I would use the MPI version recommended by your interconnect vendor, or one that is optimized for your interconnect.
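
Switching MPI mostly comes down to rebuilding xhpl against the other implementation's compiler wrappers and launching with its own mpirun; roughly along these lines (the install path and arch name below are only placeholders):

Code:

# In the HPL Make.<arch>, point the compiler/linker at the other MPI's wrappers
# (the Open MPI path is a placeholder)
MPdir        = /opt/openmpi-1.2.6
CC           = $(MPdir)/bin/mpicc
LINKER       = $(MPdir)/bin/mpif77

# rebuild and launch with that MPI's own mpirun:
#   make arch=Linux_PGI_OpenMPI
#   /opt/openmpi-1.2.6/bin/mpirun -np 32 ./xhpl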

- Mat
gormanly

Joined: 09 Nov 2004
Posts: 9

Posted: Thu Aug 21, 2008 9:02 am

It's not just an MPI problem, but that is part of it: further testing has given me

Code:
GFLOPS         MPI              compiler         OS

 58.18         MPICH 1.2.7      PGI 7.2          RHEL 4
 98.83         OpenMPI 1.2.6    PGI 7.2          RHEL 4
122.6          OpenMPI 1.2.5    Studio 12        Solaris 10
123.4          OpenMPI 1.3pre   Studio 12        RHEL 4
169.6          OpenMPI 1.3pre   Studio Express   RHEL 4


with the same input file.
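
For context, dividing those numbers by the machine's 294.4 GFLOPS Rpeak (consistent with 8 sockets x 4 cores x 4 flops/clock x 2.3 GHz) gives roughly:

Code:

 58.18 / 294.4  =  ~20% efficiency
 98.83 / 294.4  =  ~34%
122.6  / 294.4  =  ~42%
123.4  / 294.4  =  ~42%
169.6  / 294.4  =  ~58%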
mkcolg

Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Thu Aug 21, 2008 10:06 am

Hi gormanly,

The vast majority of HPL's time is spent in the math library (specifically DGEMM), and the compiler has very little to do with the overall performance. Hence, you should next focus on the math library used. It appears to me that Sun has a very good parallel math library. Can you try linking it with the PGI-compiled version? What happens if you use ACML with the Sun compilers? Do the ATLAS or GotoBLAS libraries help?
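
To make that concrete: under HPL the BLAS is selected purely through the LAdir/LAinc/LAlib lines of the Make.<arch> file, so swapping libraries is a relink rather than a code change. A rough sketch, with all the paths purely illustrative:

Code:

# Current build against ACML (path is a placeholder)
LAdir        = /opt/acml/pgi64/lib
LAlib        = $(LAdir)/libacml.a

# Candidates to try instead (again, illustrative locations):
# Sun Performance Library:  LAlib = -L/opt/sunstudio12/lib -lsunperf
# GotoBLAS:                 LAlib = /opt/gotoblas/libgoto.a
# ATLAS:                    LAlib = /opt/atlas/lib/libf77blas.a /opt/atlas/lib/libatlas.a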

On a side note, this interview with PGI's director, Doug Miles, posted on hpcwire.com might be of interest: http://www.hpcwire.com/features/17886034.html

- Mat