PGI User Forum

matrix reduction using cuda fortran and GPU
Dolf



Joined: 22 Mar 2012
Posts: 155

Posted: Tue Dec 11, 2012 12:02 pm    Post subject: RE:

What I did was add the flags you use on the command line to the Visual Studio project properties. It gave me the same parallelization messages and compiled successfully, but I had to remove "use openacc".

Is that the correct way of doing it?

Also, I cannot do parallelization for a matrix of (3000,3000). Do you know why?


Dolf
mkcolg



Joined: 30 Jun 2004
Posts: 6660
Location: The Portland Group Inc.

Posted: Tue Dec 11, 2012 12:13 pm

Hi Dolf,

Within PVF, you'd set the properties option "Fortran->Target Accelerator->Target NVIDIA Accelerator" to 'Yes'. From the command line, you would compile using the flag "-acc" and/or "-ta=nvidia". Either is fine.

Quote:
but I had to remove "use openacc"
This doesn't make sense. This module is in the standard PGI include directory, so the compiler should find it:
Quote:
PGI$ ls /c/PROGRA~1/PGI/win64/12.10/include/openacc.mod
/c/PROGRA~1/PGI/win64/12.10/include/openacc.mod


Quote:
I cannot do parallelization for a matrix of (3000,3000). Do you know why?
No, the size shouldn't prevent parallelization. Can you post an example?

- Mat
Dolf



Joined: 22 Mar 2012
Posts: 155

Posted: Tue Dec 11, 2012 1:03 pm    Post subject: RE:

yeah, sure, here is the program:

!
! Fortran Console Application
! Generated by PGI Visual Fortran(R)
! 12/10/2012 3:02:36 PM
!

program openacc1

  !use openacc
  implicit none

  integer :: nx,ny,i,j,ak
  integer, allocatable, dimension (:,:) :: A
  integer :: start_time(8), end_time(8)
  CHARACTER (LEN = 12) REAL_CLOCK (3)
  CALL DATE_AND_TIME (REAL_CLOCK (1), REAL_CLOCK (2),&
                      REAL_CLOCK (3), start_time )

  nx = 3000
  ny = 3000
  ak = 0

  allocate (a(nx,ny))
  A(1:nx,1:ny) = 2
  !$acc kernels loop
  do i = 1, nx
    do j = 1, ny
      ak = ak + A(i,j)
    enddo
  enddo
  write(*,*) 'ak = ' ,ak
  write(*,*)
  CALL DATE_AND_TIME (REAL_CLOCK (1), REAL_CLOCK (2),&
                      REAL_CLOCK (3), end_time )
  write(*,10) 'PROGRAM STARTED AT: ', START_TIME(5), START_TIME(6),&
              START_TIME(7), START_TIME(8)
  write(*,15) 'PROGRAM ENDED AT: ', end_time(5), end_time(6), &
              end_time(7),end_time(8)
  continue
  deallocate(a)
10 format(1X, A, I2.2, ':', I2.2, ':', I2.2, ':', I3.3)
15 format(1X, A, I2.2, ':', I2.2, ':', I2.2, ':', I3.3)

end program openacc1


Fortran->target accelerators->Target NVIDIA Accelerators = yes
Fortran->Command Line -> -acc -Minfo=accel

the output after I compile: ( no "use openacc")

------ Rebuild All started: Project: OpenACC1, Configuration: Release x64 ------
Deleting intermediate and output files for project 'OpenACC1', configuration 'Release'
Compiling Project ...
OpenACC1.f90
openacc1:
     25, Generating copyin(a(1:3000,1:3000))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     26, Loop is parallelizable
     27, Loop is parallelizable
         Accelerator kernel generated
         26, !$acc loop gang, vector(8) ! blockidx%x threadidx%x
         27, !$acc loop gang, vector(8) ! blockidx%y threadidx%y
             CC 1.0 : 8 registers; 304 shared, 32 constant, 0 local memory bytes; 66% occupancy
             CC 2.0 : 10 registers; 264 shared, 64 constant, 0 local memory bytes; 33% occupancy
         28, Sum reduction generated for ak
Linking...
OpenACC1 build succeeded.

Build log was saved at "file://D:\Cuda Dev\OpenACC1\x64\Release\BuildLog.htm"

========== Rebuild All: 1 succeeded, 0 failed, 0 skipped ==========

I still can't use "use openacc", and I don't know why.

Output when executing the code (with !$acc before the do loops):
ak = 18000000

PROGRAM STARTED AT: 11:56:30:675
PROGRAM ENDED AT: 11:56:30:776
Press any key to continue . . .

time of execution = 101 msec

Output when executing the code (without !$acc before the do loops):
ak = 18000000

PROGRAM STARTED AT: 11:58:14:167
PROGRAM ENDED AT: 11:58:14:179
Press any key to continue . . .
time of execution = 12 msec !!

So this means the time using parallel acc loops is longer than using the CPU to do the loops!
Does that make sense to you? I think I am doing something wrong here.

Dolf
mkcolg



Joined: 30 Jun 2004
Posts: 6660
Location: The Portland Group Inc.

Posted: Tue Dec 11, 2012 2:14 pm

Quote:
I still can't use "use openacc", and I don't know why.
Sorry, I don't know either. It works fine for me.

Quote:
this means the time using parallel acc loops is longer than using the CPU to do the loops!
Does that make sense to you?
It makes perfect sense. Accelerator code has some overhead in launching kernels, initializing the device, and copying data to the device. Also, performing a parallel reduction requires a partial reduction step followed by a second kernel launch to perform the final reduction.
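
To make the two-step reduction concrete, here is a minimal CPU-only sketch in plain Fortran. It is not the code the PGI compiler generates; the program name `two_stage_sum` and the block count `nblocks = 24` are assumptions chosen purely for illustration. It only shows the shape of the idea: independent partial sums first, then a small final pass that combines them.

```fortran
program two_stage_sum
  implicit none
  integer, parameter :: nx = 3000, ny = 3000
  integer, parameter :: nblocks = 24     ! hypothetical number of partial sums
  integer, allocatable :: a(:,:)
  integer :: partial(nblocks)
  integer :: b, i, j, jlo, jhi, total

  allocate (a(nx,ny))
  a = 2

  ! Stage 1: each "block" sums its own slice of columns.
  ! The slices are independent, so on a GPU they could run in parallel.
  do b = 1, nblocks
    jlo = (b-1)*ny/nblocks + 1
    jhi = b*ny/nblocks
    partial(b) = 0
    do j = jlo, jhi
      do i = 1, nx
        partial(b) = partial(b) + a(i,j)
      enddo
    enddo
  enddo

  ! Stage 2: a much smaller final pass combines the partial sums.
  ! On the accelerator this corresponds to the second kernel launch.
  total = sum(partial)

  write(*,*) 'ak = ', total
  deallocate (a)
end program two_stage_sum
```

The second stage does very little work, but on a GPU it still costs a separate kernel launch, which is part of why a single small reduction can lose to the CPU.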

Now, if I change your code so these loops are executed 10,000 times, the overhead gets amortized across multiple invocations and we begin to see speed-up. Note that the call to foo is needed; otherwise the compiler will optimize away the outer iteration loop. Also, this example was run on Linux, but the behavior should be the same on Windows.

Code:
% cat test.f90
program openacc1

  use openacc
  implicit none

  integer :: nx,ny,i,j,ak,it,ak2,foo
  integer, allocatable, dimension (:,:) :: A
  integer :: start_time(8), end_time(8)
  CHARACTER (LEN = 12) REAL_CLOCK (3)
  CALL DATE_AND_TIME (REAL_CLOCK (1), REAL_CLOCK (2),&
                      REAL_CLOCK (3), start_time )

  nx = 3000
  ny = 3000
  ak = 0

  allocate (a(nx,ny))
  A(1:nx,1:ny) = 2
  !$acc data copyin(A)
  do it=1,10000
    !$acc kernels loop
    do i = 1, nx
      do j = 1, ny
        ak = ak + A(i,j)
      enddo
    enddo
    ak2 = foo(ak)
  enddo
  !$acc end data

  write(*,*) 'ak = ' ,ak2
  write(*,*)
  CALL DATE_AND_TIME (REAL_CLOCK (1), REAL_CLOCK (2),&
                      REAL_CLOCK (3), end_time )
  write(*,10) 'PROGRAM STARTED AT: ', START_TIME(5), START_TIME(6),&
              START_TIME(7), START_TIME(8)
  write(*,15) 'PROGRAM ENDED AT: ', end_time(5), end_time(6), &
              end_time(7),end_time(8)
  continue
  deallocate(a)
10 format(1X, A, I2.2, ':', I2.2, ':', I2.2, ':', I3.3)
15 format(1X, A, I2.2, ':', I2.2, ':', I2.2, ':', I3.3)

end program openacc1

function foo(ak)
  integer foo
  integer ak, tmp
  tmp=ak
  ak = 0
  foo=tmp
end function foo

% pgf90 test.f90 -fast -o cpu.out
% pgf90 test.f90 -fast -acc -o gpu.out
% time cpu.out
 ak =      18000000
 
 PROGRAM STARTED AT: 13:03:18:798
 PROGRAM ENDED AT: 13:03:53:390
34.302u 0.066s 0:34.59 99.3%   0+0k 0+0io 0pf+0w
% setenv PGI_ACC_TIME 1
% time gpu.out
 ak =      18000000
 
 PROGRAM STARTED AT: 13:03:58:855
 PROGRAM ENDED AT: 13:04:12:457

Accelerator Kernel Timing data
  openacc1
    27: region entered 10000 times
        time(us): total=13,511,799 init=549 region=13,511,250
                  kernels=10,656,934
        w/o init: total=13,511,250 max=110,737 min=1,317 avg=1,351
        29: kernel launched 10000 times
            grid: [24x3000]  block: [128]
            time(us): total=9,698,070 max=1,463 min=966 avg=969
        30: kernel launched 10000 times
            grid: [1]  block: [256]
            time(us): total=958,864 max=400 min=95 avg=95
test.f90
  openacc1
    25: region entered 1 time
        time(us): total=13,593,179 init=69,314 region=13,523,865
                  data=9,027
        w/o init: total=13,523,865 max=13,523,865 min=13,523,865 avg=13,523,865
5.495u 7.962s 0:13.64 98.6%   0+0k 0+56io 0pf+0w


Granted, the speed-up here is relatively small (about 2.5x: 34.59 s elapsed on the CPU versus 13.64 s on the GPU), but as you increase the amount of computation, you typically increase your speed-up.
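
As a rough sanity check on the amortization argument, the timings above can be plugged into a simple linear cost model. This is only a back-of-the-envelope sketch: it assumes a fixed one-time cost (device init plus the data copy) and a constant cost per iteration, which real runs will not follow exactly. The program and variable names are made up for illustration; the numbers are copied from the cpu.out and gpu.out runs above.

```fortran
program break_even
  implicit none
  ! Values copied from the timing output above.
  real, parameter    :: cpu_total_s   = 34.59    ! elapsed time of cpu.out
  integer, parameter :: n_iter        = 10000    ! outer loop iterations
  real, parameter    :: init_us       = 69314.0  ! init= of the data region
  real, parameter    :: data_us       = 9027.0   ! data= of the data region
  real, parameter    :: region_avg_us = 1351.0   ! avg per compute region
  real :: cpu_iter_us, fixed_us, n_break

  ! Average CPU cost of one pass over the matrix, in microseconds.
  cpu_iter_us = cpu_total_s * 1.0e6 / real(n_iter)

  ! One-time accelerator cost under the assumed model.
  fixed_us = init_us + data_us

  ! Iterations needed before the fixed cost is paid back:
  !   fixed_us + n*region_avg_us = n*cpu_iter_us
  n_break = fixed_us / (cpu_iter_us - region_avg_us)

  write(*,'(A,F8.1)') 'CPU time per iteration (us): ', cpu_iter_us
  write(*,'(A,F8.1)') 'GPU time per iteration (us): ', region_avg_us
  write(*,'(A,F8.1)') 'break-even iterations:       ', n_break
end program break_even
```

Under these assumptions the GPU version only starts to win after a few dozen passes over the matrix, which fits with a single pass being slower on the accelerator than on the CPU.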

- Mat
Dolf



Joined: 22 Mar 2012
Posts: 155

Posted: Tue Dec 11, 2012 4:00 pm

I see now!
I am only using PGI 12.3. Could this be the problem?

How can I compile with 12.3 and use acc, since acc was only added in 12.6 and above?

thanks,
Dolf
All times are GMT - 7 Hours