PGI User Forum

declarative data error in PGI Fortran 10
sindimo
Joined: 30 Nov 2010
Posts: 29
Location: Saudi Aramco

Posted: Tue Dec 28, 2010 1:07 am

OK, I think I figured out what the issue was and got it working.

I had to make the subroutine part of a module and then use that module in the main program.
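
For anyone following along, here is a stripped-down sketch of that layout. It is not the code from this thread: demo_mod, scale_add and the tiny array size are made up for illustration, and it is written in free form (save it as something like sketch.f90). The likely reason the module is needed is that "reflected" wants the caller to see an explicit interface for the subroutine, which a module provides; treat that as an assumption, not a quote from the PGI manual. The module is placed ahead of the program so that a single-file compile has already seen it when it reaches the use statement.

```fortran
! Sketch only: demo_mod and scale_add are hypothetical names.
module demo_mod
  implicit none
contains
  subroutine scale_add(x, y, z, n)
    integer, intent(in) :: n
    double precision, intent(in)  :: x(n), y(n)
    double precision, intent(out) :: z(n)
    integer :: i
    ! PGI Accelerator directives, same spelling as in the post.
    ! A compiler without accelerator support enabled sees only comments.
!$acc reflected(x, y)
!$acc region
    do i = 1, n
      z(i) = 2.0d0*x(i) + y(i)
    end do
!$acc end region
  end subroutine scale_add
end module demo_mod

program main
  use demo_mod   ! gives main an explicit interface to scale_add
  implicit none
  integer, parameter :: n = 8
  double precision :: a(n), b(n), c(n)
  integer :: i
  do i = 1, n
    a(i) = dble(i)
    b(i) = 1.0d0
  end do
  ! a and b are copied to the device once; scale_add reuses that copy
!$acc data region copyin(a, b)
  call scale_add(a, b, c, n)
!$acc end data region
  print *, c
end program main
```

With the directives treated as comments, this is ordinary Fortran and simply computes c = 2*a + b on the host, so the structure can be checked without a GPU.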

Here is a run comparison with and without "reflected" (matrices A and B are multiplied 10 times):

#With reflected (data movement is around 3.5 seconds)
Code:

[sindimo@slcb100 working-fortran-example-with-gpu]$ ./a.out
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.60642      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.32226      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.33978      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.31944      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.32152      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.34007      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.32129      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.32207      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.33026      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.32959      s

Accelerator Kernel Timing data
reflected.f
  mm
    46: region entered 10 times
        time(us): total=363552644 init=4 region=363552640
                  kernels=360440121 data=3037902
        w/o init: total=363552640 max=36606408 min=36319440 avg=36355264
        48: kernel launched 10 times
            grid: [625x625]  block: [16x16]
            time(us): total=165389 max=16550 min=16533 avg=16538
        52: kernel launched 10 times
            grid: [625x625]  block: [16x16]
            time(us): total=360274732 max=36028468 min=36025824 avg=36027473
reflected.f
  main
    23: region entered 1 time
        time(us): total=365187413 init=1079992 region=364107421
                  data=541007
        w/o init: total=364107421 max=364107421 min=364107421 avg=364107421


#Without reflected (data movement is around 8.5 seconds)
Code:

[sindimo@slcb100 working-fortran-example-with-gpu]$ ./a.out
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    38.23182      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.88260      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.89074      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.88908      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.87273      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.89082      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.89038      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.89151      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.89142      s
 time for C(        10000 ,        10000 ) = A(        10000 ,        10000
 ) B(        10000 ,        10000 ) is    36.88925      s

Accelerator Kernel Timing data
reflected.f
  main
    25: region entered 10 times
        time(us): total=370220259 init=1085643 region=369134616
                  kernels=360431990 data=8507312
        w/o init: total=369134616 max=37146134 min=36872727 avg=36913461
        25: kernel launched 20 times
            grid: [625x625]  block: [16x16]
            time(us): total=360431990 max=36027676 min=16539 avg=18021599

So 3.5 vs. 8.5 seconds is roughly a 58% cut in data movement.
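
The comparison can be recomputed from the data= fields in the two profiles above. This small free-form Fortran program is only a sketch: the variable names are made up, and it assumes the "with reflected" figure is the sum of the data= values reported for mm and main.

```fortran
! Recomputes the data-movement comparison from the profile output.
program data_cut
  implicit none
  ! data= fields in microseconds, copied from the two profiles
  double precision, parameter :: refl_mm   = 3037902.0d0  ! with reflected, mm
  double precision, parameter :: refl_main = 541007.0d0   ! with reflected, main
  double precision, parameter :: plain     = 8507312.0d0  ! without reflected
  double precision :: with_refl, cut

  with_refl = refl_mm + refl_main
  cut = 100.0d0 * (1.0d0 - with_refl / plain)

  print '(a,f6.2,a)', 'with reflected:    ', with_refl / 1.0d6, ' s'
  print '(a,f6.2,a)', 'without reflected: ', plain / 1.0d6, ' s'
  print '(a,f6.1,a)', 'reduction:         ', cut, ' %'
end program data_cut
```

This gives about 3.58 s with reflected, 8.51 s without, and a reduction of about 57.9%. Kernel time is about 36.0 s per multiply in both runs and dominates everything else, which is why the wall time per call only drops from about 36.9 s to about 36.3 s.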

Here's the code if anyone else is interested to look at it:
Code:

[sindimo@slcb100 working-fortran-example-with-gpu]$ cat reflected.f
         program main
         use myModule
         use accel_lib

         integer dim1, dim2, dim3, seed
         parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
         double precision A(dim1, dim2), B(dim2, dim3), C(dim1, dim3)
         
              !populate 2 random matrices
                seed=7654321
                do i = 1, dim1
                do j = 1, dim2
                  A(i, j) = ran(seed)
               enddo
               enddo
               do i = 1, dim2
               do j = 1, dim3
               B(i, j) = ran(seed)
               enddo
               enddo

           !Multiply the 2 matrices several times (load them into GPU memory only once)
!$acc data region copyin(A,B)
           do i = 1, 10
             call MM(A,B,C)
           enddo
!$acc end data region

         end program main


         module myModule
         contains
         subroutine MM (X,Y,Z)
         integer dim1, dim2, dim3
         parameter (dim1 = 10000, dim2 = 10000, dim3 = 10000)
         double precision X(dim1, dim2), Y(dim2, dim3), Z(dim1, dim3)
         real start, finish

!$acc reflected(X,Y)     

      call cpu_time(start)


!$acc region
        do j = 1, dim3
        do i = 1, dim1
          Z(i, j) = 0
        enddo
        do k = 1, dim2
          do i = 1, dim1
            Z(i, j) = Z(i, j) + X(i, k)*Y(k, j)
          enddo
        enddo
       enddo
!$acc end region


      call cpu_time(finish)

      print *,'time for C(',dim1,',',dim3,') = A(',dim1,',',dim2,') B(',
     1dim2,',',dim3,') is',finish - start,' s'
     
      end subroutine MM
      end module myModule

I hope others find this useful.

Mohamad Sindi
All times are GMT - 7 Hours