PGI User Forum

Performance of PGI OpenACC for a matrix-matrix multiplication

 
jigo3635



Joined: 15 Jun 2012
Posts: 6

Posted: Wed Apr 30, 2014 6:43 am    Post subject: Performance of PGI OpenACC for a matrix-matrix multiplication

Hi,

This question is related to two of my older questions:

http://www.pgroup.com/userforum/viewtopic.php?t=3754&highlight=jigo3635
http://www.pgroup.com/userforum/viewtopic.php?t=3572&highlight=jigo3635

The matrix-matrix multiplication looks like this:

Code:

do e=1,nel
   do k=1,n
   do j=1,n
   do i=1,n
       tmp = 0.
       do l=1,n
          tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
       enddo
       m(i,j,k,e) = tmp
   enddo
   enddo
   enddo
enddo


where typically n=4-16 and nel=10-1000. We can only obtain a maximum of around 20 GFLOPS with the PGI compilers on a Tesla K20X for n=16 and nel=400.

Code:

!$acc kernels
!$acc loop gang
671: do e=1,nel
!$acc loop gang vector
673:   do k=1,n
!$acc loop vector
675:     do j=1,n
677:       do i=1,n
             tmp = 0.
679:         do l=1,n
               tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
             enddo
             m(i,j,k,e) = tmp
           enddo
         enddo
       enddo
     enddo
!$acc end kernels



1) In that case, should a "reduction" clause be added on the inner loop?
Code:
!$ACC LOOP REDUCTION(+:tmp)
            do l=1,n  ! serial loop, no reduction


Even with the reduction added, the compiler still reports the "do l=1,n" loop as parallelizable:
....
671, Loop is parallelizable
673, Loop is parallelizable
675, Loop is parallelizable
677, Loop is parallelizable

679, Loop is parallelizable
...
(but performance dropped from 20 to 13 GFLOPS)


2) The performance of this kernel on a Fermi GPU was around 23 GFLOPS. Is there any way to improve the performance on the K20X? I have used the compiler flag -ta=nvidia,5.0, and other OpenACC variants have been tested, such as
Code:
!$acc kernels
   do ...
      do ... ; do ...
...
!$acc parallel loop collapse(4)
   do ...
      do ... ; do ...

but could not obtain better performance.
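
Spelled out, the second variant is essentially the following (a sketch reconstructed from the loop nest above, not a copy of the exact code):

Code:

!$acc parallel loop collapse(4)
do e=1,nel
   do k=1,n
      do j=1,n
         do i=1,n
            tmp = 0.
            do l=1,n   ! runs sequentially inside each collapsed iteration
               tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
            enddo
            m(i,j,k,e) = tmp
         enddo
      enddo
   enddo
enddo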

Thanks for your assistance.

/Jing
mkcolg



Joined: 30 Jun 2004
Posts: 6206
Location: The Portland Group Inc.

Posted: Thu May 01, 2014 10:30 am

Hi Jing,

For #1, don't confuse the message "Loop is parallelizable" with what's actually scheduled. This message comes from the analysis stage and just means that the loop could be parallelized. Look for the loop schedule messages to see what was actually parallelized.

Given that the values of "n" are so small, I would collapse the k, j, and i loops together and just perform the reduction loop sequentially in the kernel. In the just-released PGI 14.4 we've made considerable improvements to loop collapsing, so please consider using this new release.

Code:
!$acc kernels
!$acc loop gang
671: do e=1,nel
!$acc loop vector(512) collapse(3)
673:   do k=1,n
675:     do j=1,n
677:       do i=1,n
             tmp = 0.
679:         do l=1,n
               tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
             enddo
             m(i,j,k,e) = tmp
           enddo
         enddo
       enddo
     enddo
!$acc end kernels
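
To make the sequential reduction loop explicit (re: question #1), you can also put "loop seq" on the inner "l" loop instead of a reduction clause. A minimal sketch with the same schedule:

Code:

!$acc kernels
!$acc loop gang
do e=1,nel
!$acc loop vector(512) collapse(3)
   do k=1,n
      do j=1,n
         do i=1,n
            tmp = 0.
!$acc loop seq
            do l=1,n
               tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
            enddo
            m(i,j,k,e) = tmp
         enddo
      enddo
   enddo
enddo
!$acc end kernels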


For #2, try targeting compute capability 3.5 (i.e. -ta=tesla:cc35) and use the "INTENT(IN)" attribute on your read-only arrays. In these cases we attempt to utilize texture memory, which can help considerably for random-access memory patterns such as how you're using the "D" array.
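
For instance, the interface might look something like this (an illustrative sketch only; the routine name, argument names, and "real" precision are assumptions, not taken from your code):

Code:

! Sketch only: routine/argument names and precision are assumed.
subroutine mxm_acc(m, u, v, w, D, n, nel)
   implicit none
   integer, intent(in) :: n, nel
   real, intent(in)    :: D(n,n)                 ! read-only: eligible for texture loads
   real, intent(in)    :: u(n,n,n,nel), v(n,n,n,nel), w(n,n,n,nel)
   real, intent(out)   :: m(n,n,n,nel)
   integer :: e, i, j, k, l
   real    :: tmp
!$acc kernels
!$acc loop gang
   do e=1,nel
!$acc loop vector(512) collapse(3)
      do k=1,n
         do j=1,n
            do i=1,n
               tmp = 0.
               do l=1,n
                  tmp = tmp + D(i,l)*u(l,j,k,e)+D(j,l)*v(i,l,k,e)+D(k,l)*w(i,j,l,e)
               enddo
               m(i,j,k,e) = tmp
            enddo
         enddo
      enddo
   enddo
!$acc end kernels
end subroutine mxm_acc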

Some other general performance ideas:

Given the schedule above, you could also help your memory access a bit by using "j" or "k" as the leading dimension of the "u" array instead of "l".
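
One way to do that without touching the rest of the code is to build a transposed copy of "u" for the kernel (a sketch; "transpose_u" and "ut" are hypothetical names, not part of your code):

Code:

! Sketch only: make "j" the leading (contiguous) dimension of the u data.
subroutine transpose_u(ut, u, n, nel)
   implicit none
   integer, intent(in) :: n, nel
   real, intent(in)    :: u(n,n,n,nel)    ! original layout:   u(l,j,k,e)
   real, intent(out)   :: ut(n,n,n,nel)   ! transposed layout: ut(j,l,k,e)
   integer :: e, k, l, j
   do e=1,nel
      do k=1,n
         do l=1,n
            do j=1,n
               ut(j,l,k,e) = u(l,j,k,e)
            enddo
         enddo
      enddo
   enddo
end subroutine transpose_u

The kernel's term D(i,l)*u(l,j,k,e) then becomes D(i,l)*ut(j,l,k,e).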

Other things to try are disabling RDC (-ta=tesla:nordc), assuming you don't use the "routine" directive. In 14.4 we've added an "unroll" option (enabled by default with -O3) which helps a few codes but can slow down others. It's worth a try, though.

Finally, look at the PTX information for the register usage (-ta=tesla:ptxinfo) and use it to determine the occupancy. You can then adjust the vector width higher or lower, or use "-ta=tesla:maxregcount:xx" to adjust the total register usage per gang, to see how it affects occupancy and performance.

- Mat
jigo3635



Joined: 15 Jun 2012
Posts: 6

Posted: Thu May 01, 2014 2:07 pm

Hi Mat,

Thanks for your valuable input.

The code has now been changed to
Code:
!$acc kernels
!$acc loop gang
671: do e=1,nel
!$acc loop vector(16) collapse(3)
do ...
 

With PGI 14.2 (13.10 was used previously) and the flag -ta=tesla:cc35, 40 GFLOPS can be obtained for nel=400 and n=16. That is indeed a big improvement in performance.

I will try the other suggestions, but it seems "-ta=tesla:nordc" does not work for this code; the following error occurs:

Accelerator Fatal Error: No CUDA device code available
File: math.f
Function: zero:1388

But the "zero" function is
Code:
      subroutine zero(a,n)
      DIMENSION  A(n)
!$ACC DATA PRESENT(A(1:n))
!$ACC PARALLEL LOOP
      DO I = 1, N
         A(I) = 0.0
      ENDDO
!$ACC END DATA
      END


Thanks again.

/Jing