PGI User Forum


Tutorial 2 from PGI - Does it accelerate?

 
AGy



Joined: 06 Jul 2010
Posts: 8

Posted: Thu Jul 08, 2010 6:30 am    Post subject: Tutorial 2 from PGI - Does it accelerate?

Hi,

I am currently working through Tutorial 2 from this page: http://www.pgroup.com/resources/articles.htm (PGI Accelerator tutorial examples).

It is clearly interesting to see how the various accelerator options can reduce the execution time.

But I was surprised to find that when I build and run the same program without the accelerator option (that is, without -ta=nvidia), the execution time I get is much shorter.

See for yourself:

Platform: Linux CentOS 5.5 x86_64
Host processor: Intel Xeon E5420 2.5 GHz
GPUs: NVIDIA Quadro FX 1700 + NVIDIA Quadro FX 1700

Code:

cat /...l/pgi/linux86-64/10.5/bin/sitenvrc
#!/bin/sh
export NVOPEN64DIR=/.../Nvidia/cuda/3.0/open64/lib;
export CUDADIR=/.../Nvidia/cuda/3.0/bin;
export CUDALIB=/.../Nvidia/cuda/3.0/lib;

and
Code:

$ cat /.../Nvidia/cuda/3.0/Env_cuda.sh
export PATH=${PATH}:/.../Nvidia/cuda/3.0/bin
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/.../Nvidia/cuda/3.0/lib64:/appl/Nvidia/cuda/3.0/lib


Code:

$ make clean
rm -f a.out *.exe *.o *.obj *.gpu *.bin *.ptx *.s *.mod *.g *.emu *.time *.uni
$ make J1.exe
pgfortran -ta=nvidia -fast -c Jmain.f90 -Minfo=accel
pgfortran -ta=nvidia -fast -c J1.f90 -Minfo=accel
jacobi:
     18, Generating copyin(a(1:m,1:n))
         Generating copyout(a(2:m-1,2:n-1))
         Generating copyout(newa(2:m-1,2:n-1))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     19, Loop is parallelizable
     20, Loop is parallelizable
         Accelerator kernel generated
         19, !$acc do parallel, vector(16)
         20, !$acc do parallel, vector(16)
             Cached references to size [18x18] block of 'a'
             CC 1.0 : 17 registers; 1328 shared, 132 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 17 registers; 1328 shared, 132 constant, 0 local memory bytes; 75 occupancy
     27, Loop is parallelizable
         Accelerator kernel generated
         24, Max reduction generated for change
         27, !$acc do parallel, vector(16)
             CC 1.0 : 9 registers; 24 shared, 116 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 9 registers; 24 shared, 116 constant, 0 local memory bytes; 100 occupancy
pgfortran -o J1.exe -ta=nvidia Jmain.o J1.o
$ ./J1.exe 500
reached delta= 0.09991 in         1624 iterations for  500 x  500 array
time=12.9260 seconds
$ ./J1.exe 1000
reached delta= 0.09998 in         3347 iterations for 1000 x 1000 array
time=91.2280 seconds
============================================================
$ make clean
rm -f a.out *.exe *.o *.obj *.gpu *.bin *.ptx *.s *.mod *.g *.emu *.time *.uni
$  pgfortran -c J1.f90
$ pgfortran -c Jmain.f90
$ pgfortran -o J1.exe Jmain.o J1.o
$ ./J1.exe 500
reached delta= 0.09995 in         1624 iterations for  500 x  500 array
time= 5.8620 seconds
$ ./J1.exe 1000
reached delta= 0.09998 in         3347 iterations for 1000 x 1000 array
time=49.7940 seconds


Conclusion:
12 seconds with -ta=nvidia vs. 6 seconds host-only for the 500 x 500 case
91 seconds with -ta=nvidia vs. 50 seconds host-only for the 1000 x 1000 case

Obviously my question is:

"Why do the timings seem to go in the wrong direction?"

I suppose I am doing something wrong, but I don't know what.
I have tried running with many more iterations, but I quickly reach the memory limit of my graphics cards.
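
For reference, the loop nest that the -Minfo output above refers to has roughly this shape. This is only a sketch reconstructed from the compiler messages; the subroutine name and the simple 5-point update are placeholders, not the actual code from J1.f90.

Code:

! Sketch only: reconstructed from the -Minfo feedback above, not the real J1.f90.
! The placeholder update below is a plain 5-point average; the tutorial uses its
! own stencil and coefficients.
subroutine jacobi_sketch(m, n, a, newa, change)
  integer :: m, n, i, j
  real :: change
  real, dimension(m,n) :: a, newa
  !$acc region
    ! the doubly nested loop that becomes the first accelerator kernel
    do j = 2, n-1
      do i = 2, m-1
        newa(i,j) = 0.25 * (a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
      enddo
    enddo
    ! the second kernel: the max reduction for 'change' plus the copy back into 'a'
    change = maxval(abs(newa(2:m-1,2:n-1) - a(2:m-1,2:n-1)))
    a(2:m-1,2:n-1) = newa(2:m-1,2:n-1)
  !$acc end region
end subroutine jacobi_sketch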

Thanks in advance for answering.
Have a nice day.
mkcolg



Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Mon Jul 12, 2010 1:58 pm

Hi AGy,

These tutorials are meant to show how a user can use the accelerator directives to gain better performance with their codes. The largest improvement comes in the "J3" example, which shows how to use data regions (sketched after the timings below). The first example, J1, is not expected to run faster than the CPU. Here are my times on a Core i7 system using an NVIDIA Tesla T10 card:

Code:
% J1.exe 1000
reached delta= 0.09998 in         3347 iterations for 1000 x 1000 array
time=16.4950 seconds
% J2.exe 1000
reached delta= 0.09998 in         3347 iterations for 1000 x 1000 array
time=12.6650 seconds
% J3.exe 1000
reached delta= 0.09998 in         3347 iterations for 1000 x 1000 array
time= 3.5820 seconds
% J3_cpu.exe 1000
reached delta= 0.09991 in         3348 iterations for 1000 x 1000 array
time= 5.7710 seconds

While not a huge speed-up over the CPU, this small 9-point stencil example does show improvement with a little tuning.
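
To give an idea of what J3 adds, here is a minimal sketch of wrapping the iteration loop in a data region so that 'a' and 'newa' stay resident on the GPU across sweeps, instead of being copied to and from the device around every compute region as in J1. This is not the tutorial's actual source; the subroutine name and the 5-point update are placeholders.

Code:

! Sketch only: illustrates the data-region idea, not the actual J3.f90.
subroutine jacobi_with_data_region(m, n, a, tol)
  integer :: m, n, i, j
  real :: tol, change
  real, dimension(m,n) :: a
  real, dimension(m,n) :: newa
  change = tol + 1.0
  ! 'a' is copied once at entry and once at exit; 'newa' lives only on the GPU
  !$acc data region copy(a(1:m,1:n)) local(newa(1:m,1:n))
  do while (change > tol)
     !$acc region
       do j = 2, n-1
         do i = 2, m-1
           ! placeholder 5-point update; J3 uses the tutorial's own stencil
           newa(i,j) = 0.25 * (a(i-1,j) + a(i+1,j) + a(i,j-1) + a(i,j+1))
         enddo
       enddo
       ! only the scalar 'change' comes back to the host each sweep
       change = maxval(abs(newa(2:m-1,2:n-1) - a(2:m-1,2:n-1)))
       a(2:m-1,2:n-1) = newa(2:m-1,2:n-1)
     !$acc end region
  enddo
  !$acc end data region
end subroutine jacobi_with_data_region

Moving the array traffic out of the iteration loop is essentially where J3's improvement over J1 and J2 comes from.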

On Linux, the NVIDIA driver will power the devices down between uses. It takes approximately one second per attached device to reinitialize them. This initialization occurs the first time your program uses a device and can have a severe impact on the overall performance of short-running programs like these. To help, we provide a utility, 'pgcudainit', which holds the NVIDIA driver open and eliminates the initialization cost. I have 'pgcudainit' running in the background on my system.
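
For example (the exact invocation may vary; check the PGI documentation), I simply leave the utility running in the background before doing any timings:

Code:

% pgcudainit &
% J1.exe 1000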

Note that a Quadro FX 1700 has 32 thread processors versus 240 on a Tesla T10, as well as a slower clock, lower memory bandwidth, and less memory capacity.

- Mat
AGy



Joined: 06 Jul 2010
Posts: 8

Posted: Mon Jul 12, 2010 11:49 pm

Thank you for answering.

Have a nice day.