PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

how does this newbie fix his code?
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
cablesb



Joined: 21 Jan 2010
Posts: 33

PostPosted: Mon Mar 26, 2012 2:05 pm    Post subject: how does this newbie fix his code? Reply with quote

I'm a complete CUDA novice. Am trying to run a little test code to figure out the workings of CUDA. I have a subroutine:

attributes(global) subroutine time_step(a,v,dt,dx,ntimes,simlength)

use cudafor

real*8, intent(inout) :: a(simlength)
real*8, intent(in) :: v, dt, dx
integer :: ntimes
integer, value :: simlength
integer :: i, n
real*8 :: aux(250)

real*8 :: decl

decl=v*dt/dx
aux=0.

i = blockDim%x*(blockIdx%x-1)+threadIdx%x
do n = 1,ntimes
if (i>1 .and. i<=simlength) aux(i)=a(i)-decl*(a(i)-a(i-1))
! this is really what's going on:
! if (i>1 .and. i<=simlength) aux(i)=a(i-1)
a(i)=aux(i)
enddo


end subroutine time_step

Three problems: 1) Most importantly, this code should just be moving the contents of array "a" one step to the right for each iteration in the loop. It does this for most part, but it drops a couple of numbers. e.g., after 40 iterations, where the data should look like 17, 18, 19, 20 ... it looks like 17, 19, 20 ... (By the way, I have run this code as you see it here, and with the simple"this is what's going on line" replacing the line above it; same results.) 2)I want "aux" to be an automatic array. Tried making it shared, but then the code won't run. Dies with "unspecified failure to launch". 3) I really don't want to define a separate "aux" array. I want "a" to be something like a(:,2). Then I would update a(:,2) in every iteration and copy it to a(:,1). Doing this in a naive way produces a code that will compile but not run. Anyone know what to do about these things? Much obliged.
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

PostPosted: Mon Mar 26, 2012 2:59 pm    Post subject: Reply with quote

Hi cablesb,

Without a complete reproducing example, I'll need to make some guesses.

For issue #1, since your expression contains a backward dependency, it is not parallel. Here you have multiple threads all accessing and updating different elements of a. Since there is no guarantee the order in which a is updated, the value of a(i-1) may be a new value or an old value. You just don't know. Each thread needs to be working on it's own element of a.

To make this code parallel, you need to rethink it to be something along the lines of:
Code:

Host code

do n = 1,ntimes
  call time_step{{{dimGrid,dimBlock}}}(..args...)  ! note change } to >
  call copyaux_a{{{dimGrid,dimBlock}}}(..args...)  !
end do

Device code
attributes(global) subroutine time_step(a,aux, v,dt,dx,ntimes,simlength)
,,,,
i = blockDim%x*(blockIdx%x-1)+threadIdx%x
f (i>1 .and. i<=simlength) aux(i)=a(i)-decl*(a(i)-a(i-1))
end subroutine

attributes(global) subroutine copyaux_a(aux, a)
...
i = blockDim%x*(blockIdx%x-1)+threadIdx%x
a(i)=aux(i)
end subroutine copyaux_a


Note that I needed to do a similar algorithm in the following two articles: http://www.pgroup.com/lit/articles/insider/v3n3a2.htm and http://www.pgroup.com/lit/articles/insider/v2n1a4.htm

Quote:
2)I want "aux" to be an automatic array. Tried making it shared, but then the code won't run. Dies with "unspecified failure to launch".
You probably forgot to put the size of the shared memory as the third argument in the chevrons when you launched your kernel.

Quote:
3) I really don't want to define a separate "aux" array. I want "a" to be something like a(:,2). Then I would update a(:,2) in every iteration and copy it to a(:,1).
That's fine so long as you don't copy a(:,2) back to a(:,1) in the same kernel.

Hope this helps,
Mat
Back to top
View user's profile
cablesb



Joined: 21 Jan 2010
Posts: 33

PostPosted: Mon Mar 26, 2012 3:37 pm    Post subject: Reply with quote

Mat,

Thanks very much for the reply.

Re: 1) and 3), oh man, do I have alot to learn.

Re: 2) I have been trying to puzzle out the amount of memory needed from the PGI CUDA Fortran Programming Guide, and I'm not having much luck. How does this get calculated? In my current example, I am operating on a real*8 array of 256 elements. I have run with a variety of threads per block, from 2 up to 256. Thanks again.
Back to top
View user's profile
cablesb



Joined: 21 Jan 2010
Posts: 33

PostPosted: Mon Mar 26, 2012 4:13 pm    Post subject: Reply with quote

Mat (or anyone!),
A modified code is posted below in all it's glory. Well, in most of its glory. I am apparently running up against some limit in comment lines or something. I will post the second subroutine in another reply. I am now completely lost, I think. The "a" array is not being modified at all, neither by the time_step routine, nor by the switch_places routine. Any ideas are most appreciated.

Code:


program simple_wave

use cudafor

integer, parameter :: simlength=256
integer :: tPB = 2

real*8 :: a(simlength,2)

real*8, device :: a_dev(simlength,2)
real*8, device :: v, dt, dx
integer, device :: ntimes

real*8 :: v_h, dt_h, dx_h
integer :: ntimes_h

! parameters for triangle wave
integer :: peak_start, peak_half_width

peak_start=50
peak_half_width=50

! define triangle wave
a(1:peak_start-1,1)=0.d0
a(peak_start:peak_start+peak_half_width-1,1)=(/(dble(i),i=1,peak_half_width)/)
a(peak_start+peak_half_width:peak_start+2*peak_half_width-1,1)=(/(dble(i),i=peak_half_width,1,-1)/)
a(peak_start+2*peak_half_width:simlength,1)=0.d0
a(:,2)=1.d0

! define advection parameters.  let's make everything one for now.
v=1.d0
dt=1.d0
dx=1.d0
v_h=v
dt_h=dt
dx_h=dx

! how many time steps:
ntimes=50
ntimes_h=ntimes

write(15,fmt='(1f10.5)') a(:,1)
write(6,'(10f10.5)'),a(:,1)
print *,'========'


a_dev=a
!do on host first, for comparison
do n=1,ntimes_h
  a(2:simlength,2)=a(2:simlength,1)-v_h*dt_h/dx_h*(a(2:simlength,1)-a(1:simlength-1,1))
  a(:,1)=a(:,2)
enddo
write(6,'(10f10.5)'),a(:,1)

! now on device
do n=1,ntimes_h
call time_step<<<ceiling>>>(a_dev,v,dt,dx,simlength)
a=a_dev
print *,'**********'
write(6,'(10f10.5)'),a(:,2)
call switch_places<<<ceiling>>>(a_dev,simlength)
enddo
a=a_dev

print *,'========'
write(6,'(10f10.5)'),a(:,1)
print *,'========'
write(6,'(10f10.5)'),a(:,2)
write(16,fmt='(1f10.5)') a(:,1)


end program


Code:

attributes(global) subroutine switch_places(a,simlength)

use cudafor

integer, intent(in), value :: simlength
real*8, intent(inout) :: a(simlength,2)

integer :: i

i = blockDim%x*(blockIdx%x-1)+threadIdx%x
if (i>=1 .and. i<=simlength) a(i,1)=a(i,2)

end subroutine switch_places
Back to top
View user's profile
cablesb



Joined: 21 Jan 2010
Posts: 33

PostPosted: Mon Mar 26, 2012 4:14 pm    Post subject: Reply with quote

... and here is time_step:
Code:

attributes(global) subroutine time_step(a,v,dt,dx,simlength)

use cudafor

integer, value :: simlength
real*8, intent(inout)  :: a(simlength,2)
real*8, intent(in) :: v, dt, dx
integer :: i, n

real*8 :: decl

decl=v*dt/dx

i = blockDim%x*(blockIdx%x-1)+threadIdx%x
!if (i>1 .and. i<=simlength) a(i,2)=a(i,1)-decl*(a(i,1)-a(i-1,1))
! this is really what's going on:
  if (i>1 .and. i<=simlength) a(i,2)=a(i-1,1)


end subroutine time_step
[/code]
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group