PGI User Forum

reduction within "!$acc kernels loop" ?

JMa (Joined: 30 Nov 2012, Posts: 22)
Posted: Fri Jan 11, 2013 11:00 am

Hi Mat,
I found the solution: simply change Line 11 from

real tmp

to

real*8 tmp

Now the compiler reports:
27, Generating present_or_copy(c(:,:))
    Generating present_or_copyin(a(:,:))
    Generating present_or_copyin(b(:,:))
28, Loop is parallelizable
29, Loop is parallelizable
30, Complex loop carried dependence of 'c' prevents parallelization
    Loop carried dependence of 'c' prevents parallelization
    Loop carried backward dependence of 'c' prevents vectorization
    Inner sequential loop scheduled on accelerator
    Accelerator kernel generated
    28, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    29, !$acc loop gang ! blockidx%y
    32, Sum reduction generated for tmp

It is very interesting that this tiny change makes such a huge difference. However, I'm still confused and curious about why that is.
Do you have any thoughts?

Thanks,
Jingsen

mkcolg (Joined: 30 Jun 2004, Posts: 5815, Location: The Portland Group Inc.)
Posted: Fri Jan 11, 2013 11:10 am

Hi Jingsen,

The first error (i.e. without the reduction clause) is caused by a data type mismatch. Please declare tmp as DOUBLE PRECISION, or change "1.d0" to "1.0".
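To make the mismatch concrete, here is a minimal sketch of what mixing a default real with the double-precision literal 1.d0 does at the language level. The program and the names tmp_sp and tmp_dp are hypothetical and not taken from the thread. It only illustrates the kind conversion, not the compiler's internal reduction analysis.

```fortran
program kind_mismatch
  implicit none
  real             :: tmp_sp   ! default real (single precision), like "real tmp"
  double precision :: tmp_dp   ! same kind as the literal 1.d0
  integer          :: k

  tmp_sp = 0.0
  tmp_dp = 0.d0
  do k = 1, 1000
    ! Mixed kinds: tmp_sp is promoted to double precision for the add,
    ! and the sum is converted back to single precision on assignment.
    tmp_sp = tmp_sp + 1.d0
    ! Matching kinds: the update involves no conversion at all.
    tmp_dp = tmp_dp + 1.d0
  end do

  ! The mixed expression takes the kind of the double-precision operand.
  print *, 'kind(tmp_sp)        =', kind(tmp_sp)
  print *, 'kind(tmp_sp + 1.d0) =', kind(tmp_sp + 1.d0)
  print *, 'kind(tmp_dp + 1.d0) =', kind(tmp_dp + 1.d0)
  print *, 'sums:', tmp_sp, tmp_dp
end program kind_mismatch
```

Either fix suggested above removes the conversion: declaring tmp as DOUBLE PRECISION matches the variable to the literal, while writing 1.0 matches the literal to the variable.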

As for the second error, the problem is that the "k" loop is executed sequentially, so each kernel needs to sum up multiple "tmp" values. We handle this correctly when we auto-generate the reduction, but evidently not when the reduction clause is used. I'll write up a report.

- Mat

JMa (Joined: 30 Nov 2012, Posts: 22)
Posted: Fri Jan 11, 2013 11:30 am

Hi Mat,
For the first error, I also tried declaring tmp as an integer ("integer tmp") and, consistently, using "tmp=tmp+1".
Now I've run into another issue: I get the wrong answer for "tmp":

iternation#: 2001
GPU costs 1045000 micronseconds

The compiler output:

27, Generating present_or_copy(c(:,:))
    Generating present_or_copyin(a(:,:))
    Generating present_or_copyin(b(:,:))
28, Loop is parallelizable
29, Loop is parallelizable
30, Complex loop carried dependence of 'c' prevents parallelization
    Loop carried dependence of 'c' prevents parallelization
    Loop carried backward dependence of 'c' prevents vectorization
    Inner sequential loop scheduled on accelerator
    Accelerator kernel generated
    28, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    29, !$acc loop gang ! blockidx%y
    32, Accelerator restriction: multilevel induction variable: tmp
        Accelerator restriction: induction variable live-out from loop: tmp



The program now looks like:

! matrix-acc.f
program example1


  parameter ( n_size=2000 )
  real*8, dimension(:,:) :: a(n_size,n_size)
  real*8, dimension(:,:) :: b(n_size,n_size)
  real*8, dimension(:,:) :: c(n_size,n_size)
  real*8, dimension(:,:) :: d(n_size,n_size)
  character(10) :: time
  integer tmp
  integer count1, count2, count_rate, count_max


  ! Initialize matrices (values differ from C version)
  do i=1, n_size
    do j=1, n_size
      a(i,j) = i + j;
      b(i,j) = i - j;
    enddo
  enddo
  c=0.d0
  d=0.d0

  tmp=1
  call system_clock(count1, count_rate, count_max)
  !$acc kernels loop !reduction(+:tmp)
  do i=1, n_size
    do j=1, n_size
      do k = 1, n_size
        c(i,j) = c(i,j) + a(i,k)*b(k,j)
        tmp=tmp+1
      enddo
    enddo
  enddo

  print*, 'iternation#:',tmp

  call system_clock(count2, count_rate, count_max)
  write(*,*)'GPU costs',(count2-count1),'micronseconds'

  tmp=0
  call system_clock(count1, count_rate, count_max)
  do i=1, n_size
    do j=1, n_size
      do k = 1, n_size
        d(i,j) = d(i,j) + a(i,k)*b(k,j)
        tmp=tmp+1
      enddo
    enddo
  enddo
  call system_clock(count2, count_rate, count_max)
  write(*,*)'CPU costs',(count2-count1),'micronseconds'

  ! check the results
  do i=1, n_size
    do j=1, n_size
      if( c(i,j) .ne. d(i,j) )then
        print *, i,j, c(i,j), d(i,j)
        stop 'error found'
      endif
    enddo
  enddo
  print *, n_size*n_size, 'iterations completed'


end program example1

mkcolg (Joined: 30 Jun 2004, Posts: 5815, Location: The Portland Group Inc.)
Posted: Fri Jan 11, 2013 11:53 am

Ok, so it looks like we're not really handling multi-level reductions well, and that we're just getting lucky when the variable is a real. The next step would be to add multiple summation variables:

Code:

tmp=0
call system_clock(count1, count_rate, count_max)
!$acc kernels loop private(tmp2)
do i=1, n_size
  do j=1, n_size
    tmp2 = 0
    do k = 1, n_size
      tmp2 = tmp2+1
      c(i,j) = c(i,j) + a(i,k)*b(k,j)
    enddo
    tmp=tmp+tmp2
  enddo
enddo
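For anyone who wants to try the per-(i,j) partial-sum pattern in isolation, the following is a self-contained sketch of the same idea with the matrix work removed. The program name, the small size n = 64, and the variables total, partial and expected are hypothetical and not from the thread. The counters are 64-bit as an added precaution: with n_size=2000 the full count is 2000**3 = 8,000,000,000, which does not fit in a 32-bit integer. Whether a particular compiler release turns the update of total into a correct sum reduction is an assumption to check against its -Minfo output, not something this sketch guarantees.

```fortran
program nested_count
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)  ! at least 64-bit integers
  integer, parameter :: n = 64                      ! hypothetical small size
  integer(i8) :: total, partial, expected
  integer     :: i, j, k

  total = 0_i8
  ! Without OpenACC enabled this directive is an ordinary comment,
  ! so the program also runs unchanged on the CPU.
  !$acc kernels loop private(partial)
  do i = 1, n
    do j = 1, n
      ! Each (i,j) iteration counts its own sequential k loop...
      partial = 0_i8
      do k = 1, n
        partial = partial + 1_i8
      end do
      ! ...and contributes one value to the overall sum.
      total = total + partial
    end do
  end do

  expected = int(n, i8)**3
  print *, 'counted:', total, ' expected:', expected
  if (total /= expected) stop 'count mismatch'
end program nested_count
```

Comparing the count against n**3 gives a quick correctness check that does not depend on the matrix product, which makes it easier to tell a reduction problem apart from a floating-point difference.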


I'll add this to my report.

Thanks,
Mat
All times are GMT - 7 Hours
Page 2 of 2

 