PGI User Forum


openmp orphaned-style programming model

 
franzisko



Joined: 11 Jan 2011
Posts: 25

PostPosted: Tue Jan 11, 2011 4:51 am    Post subject: openmp orphaned-style programming model Reply with quote

Hello,

I am considering using the PGI Accelerator to improve my code's performance. I wonder if it is possible to use an OpenMP orphaned-directive style. From the post:

https://www.pgroup.com/userforum/viewtopic.php?t=2202

I guess that the reflected directive should help.
In particular, I would like to move data to the device only at the beginning of the whole computation and move it back at the end. To achieve this I would have:
1) The main program opens a data region and performs loops (or similar parts) using acc regions.
2) The main program also contains subroutine calls. If I understand correctly, each subroutine must have an explicit interface (I would use modules for that) and reflected directives for its dummy arguments. Is it possible to use acc regions inside subroutines this way? (using 11.0)
3) What about variables inside subroutines which are not passed through actual/dummy arguments?
3.a) Automatic variables: may I use the "device resident" directive for those variables (to avoid the CPU-side creation)?
3.b) Device variables in modules: can I use them? How?

thank you for help!
Francesco
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

PostPosted: Tue Jan 11, 2011 11:51 am    Post subject: Reply with quote

Hi Francesco,

Quote:
2) The main program also contains subroutine calls. If I understand correctly, each subroutine must have an explicit interface (I would use modules for that) and reflected directives for its dummy arguments. Is it possible to use acc regions inside subroutines this way? (using 11.0)
The only caveat to using reflected is that if you have multiple levels of subroutine calls, each level must use the reflected directive.
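
As a rough sketch of what that caveat means (the module and routine names here are made up for illustration, not from this thread), each level of the call chain repeats the reflected directive for the same dummy argument:

Code:
module inner_mod
contains
   subroutine inner(x)
      real, dimension(:) :: x
!$acc reflected(x)
!$acc region
      x = 2.0 * x
!$acc end region
   end subroutine inner
end module inner_mod

module outer_mod
contains
   subroutine outer(x)
      use inner_mod
      real, dimension(:) :: x
!$acc reflected(x)    ! must be repeated at this level too
      call inner(x)
   end subroutine outer
end module outer_mod

If outer omitted its reflected directive, the compiler would assume x lives only on the host at that level and copy it again when inner is reached.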

Quote:
3) What about variables inside subroutines which are not passed through actual/dummy arguments?
If the variables are module allocatable arrays, then you can use the 'mirrored' directive. When you allocate a 'mirrored' array, the array is allocated both on the host and the device. You then manage the data movement using the 'update host' and 'update device' directives.

Quote:
3.a) Automatic variables: may I use the "device resident" directive for those variables (to avoid the CPU-side creation)?
Yes; however, "device resident" and the other new directives are still in development, so they won't be available until later this year.

Quote:
3.b) Device variables in modules: can I use them? How?

Use the 'mirrored' directive.

Hope this helps,
Mat
franzisko



Joined: 11 Jan 2011
Posts: 25

PostPosted: Tue Jan 18, 2011 11:22 am    Post subject: a further insight (and thank you) Reply with quote

Hi,
thank you for the fast and useful reply. Based on what you told me, I am considering using the mirror directive for module variables. In particular, if I have arrays and scalar variables in the same module, I would write something like this:


Code:
module storage
   integer, parameter :: N=1000
   real, dimension(:,:), allocatable :: A,B,C
   real k_scalar
   real startTime,endTime
end module storage

subroutine matadd_sub()
   use storage  ; implicit none  ; integer i,j,k
!$acc region
!$acc do
   do i=1,N  ; do j=1,N  ; do k=1,N
      k_scalar = A(i,j) + B(i,j)
      C(i,j) = k_scalar
   enddo     ; enddo     ; enddo
!$acc end region
end subroutine matadd_sub

subroutine matmul_sub()
   use storage  ; implicit none  ; integer i,j,k
!$acc region
!$acc do
   do i=1,N  ; do j=1,N  ; do k=1,N
      C(i,j) = 3+C(i,j)
   enddo     ; enddo     ; enddo
!$acc end region
end subroutine matmul_sub

program matadd
   use storage  ; implicit none
   allocate(A(N,N),B(N,N),C(N,N))
   A = 1.  ; B = 2.  ; C = 0.
!$acc data region mirror(A,B,C,N)
!$acc update device(A,B)
   call matadd_sub()
   call matmul_sub()
!$acc update host(C)
!$acc end data region
   print*,"C: ",C(6,3)
!   print*,"k:  ",k
end program matadd

1) My problem is that I do not understand how to control when data are updated to the device or the host. When the data region opens, the data seem to be copied to the device automatically, even without specifying update device(A,B); is that right? Or maybe it happens when the acc region opens. Second, the code also works if I delete the update host statement.

More specifically, I need to update to the host ONLY some of the data, at a time of my CHOOSING, without closing the data region, because I need to send the data using MPI subroutines.

2) May I use static arrays or scalar variables in modules and mirror them, avoiding the host update when it is not needed (so they work like local variables)?

I am using PGI 10.8 for now.

thank you very much for any help!
Francesco
mkcolg



Joined: 30 Jun 2004
Posts: 6213
Location: The Portland Group Inc.

PostPosted: Tue Jan 18, 2011 3:45 pm    Post subject: Reply with quote

Hi Francesco,

Quote:
I am using PGI 10.8 for now.
You'll need to update your compiler since 'mirrored' was added in the 11.0 compilers.

Quote:

1) My problem is that I do not understand how to control when data are updated to the device or the host. When the data region opens, the data seem to be copied to the device automatically, even without specifying update device(A,B); is that right? Or maybe it happens when the acc region opens. Second, the code also works if I delete the update host statement.

More specifically, I need to update to the host ONLY some of the data, at a time of my CHOOSING, without closing the data region, because I need to send the data using MPI subroutines.


When you use mirrored, it is not part of a data region clause; rather, it is applied to the variable at the point it is declared in the module. I've corrected the code (see below).

An update may occur at any time while running host code (i.e., any time outside of a compute region). It does not need to occur within a data region.

Quote:
2) May I use static arrays or scalar variables in modules and mirror them, avoiding the host update when it is not needed (so they work like local variables)?

No, mirrored only works with allocatables.
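
As a sketch of the usual workaround (the array name D is illustrative, not from this thread): declare the module array allocatable so that mirror applies, and allocate it once at startup instead of giving it a fixed size.

Code:
module storage
   integer, parameter :: N = 1000
!  real :: D(N,N)                          ! static array: cannot be mirrored
   real, dimension(:,:), allocatable :: D  ! allocatable: mirror applies
!$acc mirror(D)
end module storage
! ... then in the main program:  allocate(D(N,N))

The allocate statement then creates both the host and the device copies, and you control movement with update host / update device as above.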

Note that I also modified the directives a bit to help with performance.

Hope this helps,
Mat

Code:

% cat test.f90
module storage
   integer, parameter :: N=1000
   real, dimension(:,:), allocatable :: A,B,C
!$acc mirror(A,B,C)
   real startTime,endTime
end module storage

!contains

subroutine matadd_sub()
   use storage; implicit none  ; integer i,j,k
   real k_scalar
!$acc region
   do i=1,N  ; do j=1,N 
!$acc do seq
 do k=1,N
      k_scalar = A(i,j) + B(i,j)
      C(i,j) = k_scalar
   enddo     ; enddo     ; enddo
!$acc end region
end subroutine matadd_sub

subroutine matmul_sub()
   use storage; implicit none  ; integer i,j,k
!$acc region
   do i=1,N  ; do j=1,N ;
!$acc do seq
do k=1,N
      C(i,j) = 3+C(i,j)
   enddo     ; enddo     ; enddo
!$acc end region
end subroutine matmul_sub


program matadd
   use storage  ; implicit none
   allocate(A(N,N),B(N,N),C(N,N))
   A = 1.  ; B = 2.  ; C = 0.
!$acc update device(A,B)
   call matadd_sub()
   call matmul_sub()
!$acc update host(C)
   print*,"C: ",C(6,3)
!   print*,"k:  ",k
end program matadd

% pgf90 -ta=nvidia,time -Minfo -V11.1 -fast test.f90 -o test_gpu.out
matadd_sub:
     13, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     14, Loop is parallelizable
     16, Loop carried reuse of 'c' prevents parallelization
         Accelerator kernel generated
         14, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
             !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             Using register for 'c'
             Using register for 'a'
             Using register for 'b'
         16, !$acc do seq
             CC 1.0 : 17 registers; 48 shared, 12 constant, 0 local memory bytes; 33% occupancy
             CC 1.3 : 17 registers; 48 shared, 12 constant, 0 local memory bytes; 75% occupancy
             CC 2.0 : 27 registers; 8 shared, 56 constant, 0 local memory bytes; 66% occupancy
matmul_sub:
     25, Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     26, Loop is parallelizable
     28, Complex loop carried dependence of 'c' prevents parallelization
         Loop carried dependence of 'c' prevents parallelization
         Loop carried backward dependence of 'c' prevents vectorization
         Accelerator kernel generated
         26, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
             !$acc do parallel, vector(16) ! blockidx%x threadidx%x
             Using register for 'c'
         28, !$acc do seq
             CC 1.0 : 10 registers; 32 shared, 12 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 10 registers; 32 shared, 12 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 14 registers; 8 shared, 40 constant, 0 local memory bytes; 100% occupancy
matadd:
     38, Memory set idiom, array assignment replaced by call to pgf90_mset4
         Memory zero idiom, array assignment replaced by call to pgf90_mzero4
     39, Generating !$acc update device(b(:,:))
         Generating !$acc update device(a(:,:))
     42, Generating !$acc update host(c(:,:))

% test_gpu.out
 C:     3003.000   

Accelerator Kernel Timing data
/tmp/qa/test.f90
  matmul_sub
    25: region entered 1 time
        time(us): total=7684
                  kernels=7573 data=0
        28: kernel launched 1 times
            grid: [63x63]  block: [16x16]
            time(us): total=7573 max=7573 min=7573 avg=7573
/tmp/qa/test.f90
  matadd_sub
    13: region entered 1 time
        time(us): total=416
                  kernels=285 data=0
        16: kernel launched 1 times
            grid: [63x63]  block: [16x16]
            time(us): total=285 max=285 min=285 avg=285
/tmp/qa/test.f90
  matadd
    35: region entered 1 time
        time(us): total=137905 init=121907 region=15998
                  data=2827
        w/o init: total=15998 max=15998 min=15998 avg=15998



