PGI User Forum


less speed of accelerator directives
mkcolg
Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Mon Mar 26, 2012 9:52 am

Hi kuldeep gupta,

Again, I'm guessing here since you don't provide details as to your execution times or how you are measuring your results.

Going back to your first example, I updated it so that at least part of "C" is printed. Otherwise the compiler's optimiser can eliminate the loops as dead code, giving you the false impression that the CPU code is faster when nothing is actually computed.

Next, I initialised the GPU before your timers start. At least on Linux, the GPU takes ~1 second per device to warm up, which can skew timings when running very small programs like these.

Finally, I set the environment variable "ACC_NOTIFY" to show when a kernel is launched.
Code:
% cat ex.F90
program ex
#ifdef _ACCEL
  use accel_lib
#endif
  implicit none
  real :: a(256,256),b(256,256),c(256,256),t1,t2
  integer i,j,k,sm
  sm=0
#ifdef _ACCEL
  call acc_init(acc_device_nvidia)   ! initialise the GPU before the timers start
#endif
  do j = 1,256
    do i = 1,256
      a(i,j) = 1
      b(i,j) = 1
      c(i,j) = 0.0
    enddo
  enddo
  call cpu_time(t1)
!$acc region
  do i=1,256

    do j=1,256

      sm=0
      do k=1,256

        sm=sm+a(i,k)*b(k,j)
        c(i,j)=sm
      end do
    end do
  end do
!$acc end region
  call cpu_time(t2)
  print*,"cpu time=",t2-t1
  print*,c(12,12)   ! print part of c so the loops are not removed as dead code
end program ex

% pgf90 ex.F90 -fast -Mpreprocess -o ex_cpu.out
% pgf90 ex.F90 -ta=nvidia,time -fast -Mpreprocess -o ex_gpu.out
% setenv ACC_NOTIFY 1
% ./ex_cpu.out
 cpu time=   6.2897921E-02
    256.0000
% ./ex_gpu.out
launch kernel  file=/tmp/qa/ex.F90 function=ex line=26 device=0 grid=16x16 block=16x16
 cpu time=   1.1019707E-03
    256.0000

Accelerator Kernel Timing data
/tmp/qa/ex.F90
  ex
    20: region entered 1 time
        time(us): total=1097
                  kernels=352 data=337
        26: kernel launched 1 times
            grid: [16x16]  block: [16x16]
            time(us): total=352 max=352 min=352 avg=352
acc_init.c
  acc_init
    42: region entered 1 time
        time(us): init=114125


So, comparing the two reported times (6.29E-02 s on the CPU vs. 1.10E-03 s on the GPU), the GPU version is roughly 57 times faster.

Quote:
so i am confused whether directives are working or not??
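The "launch kernel" line printed with ACC_NOTIFY set confirms that the directives generated a GPU kernel and that it ran. This shows you what I did; hopefully you can figure it out on your side.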

- Mat
mkcolg
Joined: 30 Jun 2004
Posts: 6134
Location: The Portland Group Inc.

Posted: Mon Mar 26, 2012 3:36 pm

Also, if you fix your code so that "c(i,j)=sm" is placed after the "k" loop, the PGI accelerator kernel time reduces to 123 us (down from the 352 us shown in the profile above).
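
For reference, here is a minimal sketch of what that fix could look like. It is not the exact file I timed: the program name "ex_fixed" is made up for illustration, the arrays are filled with whole-array assignments instead of loops, and declaring "sm" as real rather than integer is my own assumption (an integer accumulator only gives the right answer here because every element of "a" and "b" is 1). The only change that matters for the kernel is that "c(i,j)" is now stored once per (i,j), after the "k" loop has finished.

```fortran
program ex_fixed
#ifdef _ACCEL
  use accel_lib
#endif
  implicit none
  real :: a(256,256), b(256,256), c(256,256), t1, t2
  real :: sm          ! assumption: real accumulator (the posted code declares sm as integer)
  integer :: i, j, k
#ifdef _ACCEL
  call acc_init(acc_device_nvidia)   ! initialise the GPU before the timers start
#endif
  a = 1.0
  b = 1.0
  c = 0.0
  call cpu_time(t1)
!$acc region
  do i = 1, 256
    do j = 1, 256
      sm = 0.0
      do k = 1, 256
        sm = sm + a(i,k)*b(k,j)
      end do
      c(i,j) = sm     ! single store per (i,j), after the k loop completes
    end do
  end do
!$acc end region
  call cpu_time(t2)
  print *, "cpu time=", t2 - t1
  print *, c(12,12)   ! print part of c so the loops are not removed as dead code
end program ex_fixed
```

Saved as ex_fixed.F90 (again, a hypothetical file name), it should build with the same flags as before, e.g. "pgf90 ex_fixed.F90 -ta=nvidia,time -fast -Mpreprocess -o ex_fixed.out". With every input equal to 1, c(12,12) should still come out as 256.0, so you can check the result against the earlier runs before comparing timings.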

- Mat