JMa
Joined: 30 Nov 2012 Posts: 14
Posted: Thu Jan 10, 2013 5:00 pm Post subject: ($acc parallel loop) VS ( $acc kernels loop ) ?
Hi Mat and All,
I found a very surprising speed difference (about 20 times) between these two directives in the following simple loop tests:
1. $acc kernels loop:
CODE:
call system_clock(count1, count_rate, count_max)
!$acc kernels loop
do i=1, n_size
   do j=1, n_size
      do k = 1, n_size
         c(i,j) = c(i,j) + a(i,k)*b(k,j)
      enddo
   enddo
enddo
print*, 'iteration#:',n_size*n_size
call system_clock(count2, count_rate, count_max)
write(*,*)'GPU costs',(count2-count1),'microseconds'
RESULTS:
iteration#: 4000000
GPU costs 1030000 microseconds
2. $acc parallel loop:
CODE:
call system_clock(count1, count_rate, count_max)
!$acc parallel loop
do i=1, n_size
   do j=1, n_size
      do k = 1, n_size
         c(i,j) = c(i,j) + a(i,k)*b(k,j)
      enddo
   enddo
enddo
!$acc end parallel
print*, 'iteration#:',n_size*n_size
call system_clock(count2, count_rate, count_max)
write(*,*)'GPU costs',(count2-count1),'microseconds'
RESULTS:
iteration#: 4000000
GPU costs 22168000 microseconds
Why are they so different? Any input on the reasons behind this would be much appreciated.
Thanks,
Jingsen
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
Posted: Fri Jan 11, 2013 9:40 am Post subject:
Hi Jingsen,
The main difference between the "kernels" and "parallel" constructs is that with "kernels" the default is for the compiler to do all the scheduling and kernel generation automatically, while with "parallel" it's up to the user to decide how to create the kernels and schedule the loops.
This article goes more in-depth: http://www.pgroup.com/lit/articles/insider/v4n2a1.htm
Take a look at the compiler feedback messages (-Minfo=accel) and pay particular attention to how the loops are being scheduled. This should give you your answer as to the performance difference. Note that the schedule will also affect the use of caching, which may be another factor in the performance difference.
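As a rough illustration of what "scheduling the loops yourself" can look like under "parallel", here is a minimal sketch, not a measured fix for the code above. The program name, the size n = 256, the fill values, the scalar tmp, the data clauses, and the particular gang/vector assignment are all assumptions chosen for the example; the schedule that works best on a given device may differ, so check it against the -Minfo=accel output.

```fortran
program matmul_sched
  implicit none
  ! Hypothetical size, kept small so the correctness check runs quickly.
  integer, parameter :: n = 256
  real, allocatable :: a(:,:), b(:,:), c(:,:)
  real :: tmp
  integer :: i, j, k
  integer :: count1, count2, count_rate

  allocate(a(n,n), b(n,n), c(n,n))
  a = 1.0
  b = 2.0
  c = 0.0

  call system_clock(count1, count_rate)

  ! Assumed schedule: j (columns) is spread across gangs, i (the stride-1
  ! index in Fortran's column-major layout) across vector lanes, and the
  ! k loop runs sequentially per thread with a scalar accumulator.
  !$acc parallel loop gang copyin(a, b) copy(c)
  do j = 1, n
     !$acc loop vector private(tmp)
     do i = 1, n
        tmp = c(i,j)
        !$acc loop seq
        do k = 1, n
           tmp = tmp + a(i,k) * b(k,j)
        enddo
        c(i,j) = tmp
     enddo
  enddo

  call system_clock(count2)

  ! Convert clock ticks to seconds using count_rate rather than assuming a unit.
  print *, 'elapsed (s):', real(count2 - count1) / real(count_rate)

  ! Every element should equal 2*n, so the reported error should be zero.
  print *, 'max abs error:', maxval(abs(c - 2.0 * real(n)))

  deallocate(a, b, c)
end program matmul_sched
```

Without an OpenACC-enabled compile the directives are treated as comments and the program runs serially, which makes it easy to check the result before comparing schedules. With the PGI compilers, building with -acc -Minfo=accel should print which loops were mapped to gang and vector, so the same comparison can be made for the original "kernels loop" and "parallel loop" versions.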
- Mat