PGI User Forum

Variable Attributes

 
adityaks (Joined: 02 Oct 2012, Posts: 8)

Posted: Wed Oct 10, 2012 8:14 am    Subject: Variable Attributes

I am trying to implement an algorithm whose pseudo code is shown below. What options are there for the attributes of the variables constvars, derivedvals, furtherderivedvals, and args, and which would work best?

I tried pinned, but apparently pinned variables MUST be of allocatable type.

Code:

module const
  integer :: constvars
  ! (could as well be real/complex type)
  ! ...
end module const

module derivednum
  use const

  integer :: derivedvals
  ! (could as well be real/complex type)
  ! ...
end module derivednum

module gpuparsection
  use const
  use derivednum

contains

  attributes(global) subroutine dev_kernel( args )
    ! calculations depend on:
    !   constvars
    !   derivedvals
    !   furtherderivedvals
  end subroutine dev_kernel

  subroutine callingroutine
    integer :: furtherderivedvals, args
    ! (could as well be real/complex type)
    ! ...
    call dev_kernel <<<  >>> ( args )
  end subroutine callingroutine

end module gpuparsection

program main
  use const
  use derivednum
  use gpuparsection

  call callingroutine

end program main
mkcolg (Joined: 30 Jun 2004, Posts: 6218, Location: The Portland Group Inc.)

Posted: Wed Oct 10, 2012 8:51 am

Hi adityaks,

Constant module variables would have the "constant" attribute, while device variables would have the "device" attribute. The main difference is that "constant" variables are read/write from the host but read-only on the device, while "device" variables are read/write from both. Also, "constant" variables are stored in a fast-access memory area.

Note that I would recommend avoiding module device scalars. If they are read-only, put them in constant memory. If you do need to write to them from the device, then you'll need to worry about synchronization.
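For illustration, here's a minimal sketch of the two attributes (the module and variable names are made up, not from your code):

Code:
module params
  ! "constant": host read/write, device read-only, cached in fast constant memory
  integer, constant :: nsteps
  ! "device": read/write from both host and device code
  real, device, allocatable :: state(:)
end module params

Host code can assign to both directly; the compiler generates the underlying memory transfers:

Code:
subroutine setup( n )
  use params
  integer :: n
  nsteps = n           ! implicit host-to-constant-memory copy
  allocate(state(n))   ! allocated in device memory
  state = 0.0          ! implicit host-to-device initialization
end subroutine setup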

By "furtherderivedvals" I'm assuming you mean local device variables? In this case you can add the "device" attribute, but it's implied for all variables declared in device code so is unnecessary.

On second glance, I see that you have some "furtherderivedvals" in the host calling routine and want to access them in the device code. This won't work due to scoping. They need to be passed as arguments or be put in a module.

For args, you can only pass in "device" variables or host scalars by value (i.e. add the "value" attribute to the scalar dummy argument's declaration inside the kernel).
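For example, here's a sketch of a kernel taking a host scalar by value (the kernel name and body are illustrative only):

Code:
module kernels
  use cudafor
contains
  attributes(global) subroutine scale_kernel( n, x )
    integer, value :: n    ! host scalar passed by value
    real :: x(n)           ! "device" is implied for kernel arguments and locals
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) x(i) = 2.0 * x(i)
  end subroutine scale_kernel
end module kernels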

The "pinned" attribute is for host allocatable arrays. You still need to create a corresponding device array and copy the host array to the device array before it can be used in device code. The "pinned" attribute simply requests that the host memory be pinned in physical memory and not swapped out to virtual memory. This saves an extra host-to-host copy.

Hope this helps,
Mat
adityaks (Joined: 02 Oct 2012, Posts: 8)

Posted: Wed Oct 10, 2012 11:34 am

Hi Mat,

Thanks for the explanations.

I went through your answer and prepared a second iteration. I didn't follow the terminology when you say:
Quote:
module device scalars. If they are read-only, put them in constant memory. If you do need to write to them from the device, then you'll need to worry about synchronization.


Can you point out which part of the first iteration (or the second below, whichever) relates to module device scalars? By scalars, do you mean single-value variables such as constvarN and derivedvalN (in contrast to arrays and derived types)?

The second iteration is below. Please let me know whether the attributes and assignments are proper.

Code:
! This is pseudo code resembling the Fortran implementation of an algorithm.
! The goal is to understand CUDA functionality and develop a CUDA Fortran version of the code.

! val = func(args) is used in the mathematical sense.
! It could be anything from a binary operation among args to a function/subroutine call that sets val.

! N is used in the algebraic sense. N can be any positive integer, so constvarN could be constvar1, constvar34, ...

module const
  integer, constant :: constvar1, constvar2, constvarN
  ! (could as well be real/complex type in the rest of this code)
  ! ...
end module const

module derivednum
  use const

  integer :: derivedval1, derivedval2, derivedvalN
  integer, dimension(constvarN) :: derivedarrN
  derivedval1 = func( .., constvarN, ..)
  derivedarrN = func( .., constvarN, ..)
  ! ...
end module derivednum

module gpuparsection
  use const
  use derivednum
  integer :: furtherderivedvals
  integer, device :: devderivedvalN
  integer, device, dimension(constvarN) :: devderivedarrN

contains

  attributes(global) subroutine dev_kernel( args, devderivedvalN, devderivedarrN )
    ! kernel has 3-6 levels of not-tightly-nested loops
    ! at different stages, calculations depend on:
    !   constvar1, constvar2 .. constvarN
    !   devderivedvalN
    !   devderivedarrN
    !   furtherderivedvals
  end subroutine dev_kernel

  subroutine callingroutine
    integer :: args
    ! ...
    derivedval2 = func( constvar1, .. constvarN )
    devderivedval2 = derivedval2

    devderivedarrN = derivedarrN
    call dev_kernel <<<  >>> ( args, devderivedvalN, devderivedarrN )
  end subroutine callingroutine

end module gpuparsection

program main
  use const
  use derivednum
  use gpuparsection

  call callingroutine

end program main



In particular, as a first issue, subroutine dev_kernel needs to make use of a number of scalars (in the sense I described earlier in this post), arrays, etc.: constvar1, constvar2 .. constvarN, devderivedvalN, devderivedarrN, furtherderivedvals.

In fact, our code will have a few dozen constvar, derivedval, and derivedarr variables. Copying them to device-attributed variables and/or passing them as arguments to subroutine dev_kernel will be lengthy and, more importantly, untidy. I also suspect such discrete copies would make the overall runtime slower than the non-CUDA code!

As a second issue, the prospective subroutine dev_kernel will have up to 6 or 8 loosely nested do loops. That is, many scalars and some arrays need to be calculated before the next level of do loop starts, which then needs those values. Given this scenario, I am again worried about the speed-up of the CUDA version of the code over the original Fortran one.

So I am wondering what your/PGI's take is on these two issues.

- Aditya
mkcolg (Joined: 30 Jun 2004, Posts: 6218, Location: The Portland Group Inc.)

Posted: Wed Oct 10, 2012 12:58 pm

Quote:
By scalars, do you mean single-value variables such as constvarN and derivedvalN (in contrast to arrays and derived types)?
Correct.

Quote:
Can you point out which part of the first iteration (or the second below) relates to module device scalars?


Since "derivedvals" is a scalar integer defined in the data declaration portion of a module (i.e. a module device scalar), I don't recommend making it a device variable.
Code:
module derivednum
  use const

  integer :: derivedvals
  ! (could as well be real/complex type)
  ! ...
end module derivednum
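If "derivedvals" stays read-only in device code, the "constant" attribute avoids the synchronization concern (a sketch):

Code:
module derivednum
  use const

  integer, constant :: derivedvals   ! read-only in kernels, set from the host
end module derivednum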


Quote:
In fact, our code will have a few dozen constvar, derivedval, and derivedarr variables. Copying them to device-attributed variables and/or passing them as arguments to subroutine dev_kernel will be lengthy and, more importantly, untidy. I also suspect such discrete copies would make the overall runtime slower than the non-CUDA code!
No need to pass them if they are module variables. You do need to copy the data to/from the device at some point, but that's a separate issue, and I don't see anything in your code indicating where you're performing the copy.
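As a sketch using the names from your pseudo code (and assuming dev_kernel reads the module variables and takes no arguments), the host side only needs the copies once before the launch:

Code:
subroutine callingroutine
  ! devderivedvalN and devderivedarrN are module variables from gpuparsection
  devderivedvalN = derivedvalN    ! one host-to-device scalar copy
  devderivedarrN = derivedarrN    ! one host-to-device array copy
  call dev_kernel<<< 1, 64 >>>()  ! launch configuration is illustrative
end subroutine callingroutine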


Quote:
As a second issue, the prospective subroutine dev_kernel will have up to 6 or 8 loosely nested do loops. That is, many scalars and some arrays need to be calculated before the next level of do loop starts, which then needs those values. Given this scenario, I am again worried about the speed-up of the CUDA version of the code over the original Fortran one.
Having a lot of local variables will increase the number of registers used per thread, thus lowering occupancy and performance. However, as long as you have enough parallelism and minimize data movement, you can still see a speed-up over the CPU.

- Mat