PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

Free OpenACC Webinar

How to speed up a 3D triangle filter?

 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
haz13



Joined: 17 Mar 2014
Posts: 13

PostPosted: Tue Mar 25, 2014 11:03 am    Post subject: How to speed up a 3D triangle filter? Reply with quote

Hi everyone.

I have a Fortran CPU code that performs a weighted 'triangle filter' in 3D on a 3D array (which had a padded layer of boundary values on all edges preset) by shifting a pointer subarray (2:n-1 in each dimension) using three nested do loops and a weighting factor for the different 'directions'. Basically this led to it performing 27 operations on often large (500^3) arrays. I tried to implement this into an OpenACC directive but couldn't get the pointer shifting to work on the device, and so went for an element by element implentation as follows, which gives the same results as the CPU filter:

Code:
 subroutine filtergpu(r, a, n)
    real,dimension(:,:,:) :: a,r
    integer :: i,j,k,n
!
! -- loop through the full 3^N stencil
!$acc data region copyin(a(1:n,1:n,1:n)) copyout(r(1:n,1:n,1:n))
!$acc region
    do i = 2,n-1
    do j = 2,n-1
    do k = 2,n-1
       r(i,j,k) =     0.125*(a(i,j,k))&
                 +   0.0625*(a(i-1,j,k)+&
                            a(i,j-1,k)+&
                            a(i,j,k-1)+&
                            a(i+1,j,k)+&
                            a(i,j+1,k)+&
                            a(i,j,k+1))&
                 +  0.03125*(a(i,j+1,k+1)+&
                            a(i,j+1,k-1)+&
                            a(i,j-1,k+1)+&
                            a(i,j-1,k-1)+&
                            a(i+1,j,k+1)+&
                            a(i+1,j,k-1)+&
                            a(i-1,j,k+1)+&
                            a(i-1,j,k-1)+&
                            a(i+1,j+1,k)+&
                            a(i+1,j-1,k)+&
                            a(i-1,j+1,k)+&
                            a(i-1,j-1,k))+&
                 + 0.015265*(a(i+1,j+1,k+1)+&
                            a(i+1,j+1,k-1)+&
                            a(i+1,j-1,k+1)+&
                            a(i+1,j-1,k-1)+&
                            a(i-1,j+1,k+1)+&
                            a(i-1,j+1,k-1)+&
                            a(i-1,j-1,k+1)+&
                            a(i-1,j-1,k-1))
    end do
    end do
    end do
!$acc end region
!$acc end data region
 end subroutine


Now, this is faster than the CPU code running on a single core, but scales up in the same linear fashion as the CPU code does as the matrix size 'n' is increased:

Code:
n   CPU Time (s)   GPU Time (s)
32     2.74E-004    1.62E-004
64     2.11E-003    1.00E-003
128   7.65E-003     6.57E-003
256   5.83E-002     4.87E-002
512   0.53               0.39

I have a feeling this is due to many threads trying to access the same memory locations. I tried another approach and made each of the 27 calculations an individual accelerator region, which also worked but was slower than the code above. I assume this is because it isn't efficient to launch 27 small kernels one after the other.

So, does anyone have any tips on how to speed this filter up? I am very new to OpenACC coding, haven't got any experience in CUDA and only really have average experience with Fortran itself.

Thanks, Harry
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

PostPosted: Wed Apr 02, 2014 10:15 am    Post subject: Reply with quote

Hi Harry,

Minor note, that you may want to swtich to using OpenACC syntax rather than the PGI Accelerator model to allow your code to be compiled by a wider number of compilers.

As for performance, I'd try either reordering the loops or use "loop" directives so that the "I" dimension gets the "vector" loop and thus memory access will be better coalesced across threads. For example:

Code:
! -- loop through the full 3^N stencil
!$acc data copyin(a(1:n,1:n,1:n)) copyout(r(1:n,1:n,1:n))
!$acc kernels loop
    do k = 2,n-1
    do j = 2,n-1
    do i = 2,n-1
       r(i,j,k) =     0.125*(a(i,j,k))&

or
Code:
! -- loop through the full 3^N stencil
!$acc data copyin(a(1:n,1:n,1:n)) copyout(r(1:n,1:n,1:n))
!$acc kernels loop vector   
    do i = 2,n-1
!$acc loop gang vector
    do j = 2,n-1
!$acc loop gang
    do k = 2,n-1       
r(i,j,k) =     0.125*(a(i,j,k))&


- Mat
Back to top
View user's profile
haz13



Joined: 17 Mar 2014
Posts: 13

PostPosted: Wed Apr 02, 2014 11:07 am    Post subject: Reply with quote

Thank you for that - it has given me a small performance boost, and the code is performing at an expected speed up compared to serial CPU code.

Rather than posting a new question, can I just ask if there is a compiler flag to change between real(4) and real(8)? (I have everything declared as 'real' in the code) ... it's just comparing the results of my CPU and GPU implementations is showing some differences and I am interested to see the performance using double precision but would rather not go and rewrite a lot of declarations.

Thanks, Harry
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

PostPosted: Wed Apr 02, 2014 11:12 am    Post subject: Reply with quote

Quote:
can I just ask if there is a compiler flag to change between real(4) and real(8)?
Yes, add "-r8" to change the default kind from real(4) to real(8).
Back to top
View user's profile
haz13



Joined: 17 Mar 2014
Posts: 13

PostPosted: Wed Apr 02, 2014 4:59 pm    Post subject: Reply with quote

Ah, thanks, I probably should have been able to guess that! :)
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group