PGI User Forum

Should I use "acc reduction" in an inner loop?

 
jigo3635



Joined: 15 Jun 2012
Posts: 6

Posted: Thu Nov 22, 2012 4:04 am

Hi All,

I am a new user of the PGI Accelerator compiler. I am trying to use OpenACC for a matrix multiplication, such as

...
!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL LOOP
      do k=1,n
         do j=1,n
            do i=1,n
               m(i,j,k) = 0.
               do l=1,n
                  m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                 + D(l,j)*v(i,l,k)
     $                 + D(l,k)*w(i,j,l)
               enddo
            enddo
         enddo
      enddo
!$ACC END PARALLEL LOOP
!$ACC END DATA

When I compile the code, I get the following message
===
331, Complex loop carried dependence of 'm' prevents parallelization
Loop carried reuse of 'm' prevents parallelization
===

Does this mean that the PGI compiler has already treated m(i,j,k) as a reduction in the inner loop "do l=1,n"? Is it NOT necessary to do:

...
      tmp = 0.0
!$ACC LOOP REDUCTION(+:tmp)
      do l=1,n
         tmp = tmp + D(l,i)*u(l,j,k)
     $        + D(l,j)*v(i,l,k)
     $        + D(l,k)*w(i,j,l)
      enddo
      m(i,j,k) = tmp
...

Thank you very much for your help
mkcolg



Joined: 30 Jun 2004
Posts: 6021
Location: The Portland Group Inc.

Posted: Mon Nov 26, 2012 10:41 am

Hi jigo3635,

The message about being unable to parallelize applies to the "l" loop, since the same element of "m" is updated in every iteration of that loop. So yes, you would need to use the reduction clause with a scalar to get this loop to accelerate.

However, since you have three levels of outer loops, you may be better off scheduling only those loops and running the reduction loop sequentially. As you have it now, the "k" loop would be scheduled as the "gang" and "l" would be your "vector", while "j" and "i" run sequentially within a "gang". Since you can have two dimensions in the "gang", you could collapse "k" and "j" together, but "i" would still be sequential. So not only have you reduced the amount of parallelism, you have also added the overhead of setting up a reduction.

What I'd do is experiment with the schedule, or use the "kernels" construct instead of "parallel" and let the compiler figure out the best schedule.

Some ideas:
Code:

!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC KERNELS
      do k=1,n
         do j=1,n
            do i=1,n
               m(i,j,k) = 0.
               do l=1,n
                  m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                 + D(l,j)*v(i,l,k)
     $                 + D(l,k)*w(i,j,l)
               enddo
            enddo
         enddo
      enddo
!$ACC END KERNELS
!$ACC END DATA


Code:

!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL LOOP COLLAPSE(2)
      do k=1,n
         do j=1,n
            do i=1,n
               tmp = 0.
!$ACC LOOP REDUCTION(+:tmp)
               do l=1,n
                  tmp = tmp + D(l,i)*u(l,j,k)
     $                 + D(l,j)*v(i,l,k)
     $                 + D(l,k)*w(i,j,l)
               enddo
               m(i,j,k) = tmp
            enddo
         enddo
      enddo
!$ACC END PARALLEL LOOP
!$ACC END DATA


Code:
!$ACC DATA COPYIN(D,u,v,w)
!$ACC& COPYOUT(m)
!$ACC PARALLEL
!$ACC LOOP GANG
      do k=1,n
!$ACC LOOP GANG VECTOR
         do j=1,n
!$ACC LOOP VECTOR
            do i=1,n
               m(i,j,k) = 0.
               do l=1,n
                  m(i,j,k) = m(i,j,k) + D(l,i)*u(l,j,k)
     $                 + D(l,j)*v(i,l,k)
     $                 + D(l,k)*w(i,j,l)
               enddo
            enddo
         enddo
      enddo
!$ACC END PARALLEL
!$ACC END DATA
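
For reference, the first option mentioned above (schedule only the three outer loops and run the "l" loop sequentially) might look something like the minimal sketch below. It is not taken from the posts: the free-form layout, the program name reduction_check, the array size n = 8, the random fill values, and the choice of COLLAPSE(3) with LOOP SEQ are all assumptions made for illustration. If it is built without OpenACC enabled, the directives are treated as comments and the program only checks on the host that the scalar-temporary form matches the in-place accumulation.

```fortran
! Sketch only: compares in-place accumulation with a scalar temporary.
! The array size and fill values are assumptions chosen for illustration.
program reduction_check
  implicit none
  integer, parameter :: n = 8
  real :: D(n,n), u(n,n,n), v(n,n,n), w(n,n,n)
  real :: m1(n,n,n), m2(n,n,n)
  real :: tmp
  integer :: i, j, k, l

  call random_number(D)
  call random_number(u)
  call random_number(v)
  call random_number(w)

  ! Reference: accumulate directly into m1 on the host.
  do k = 1, n
    do j = 1, n
      do i = 1, n
        m1(i,j,k) = 0.
        do l = 1, n
          m1(i,j,k) = m1(i,j,k) + D(l,i)*u(l,j,k) &
                                + D(l,j)*v(i,l,k) &
                                + D(l,k)*w(i,j,l)
        enddo
      enddo
    enddo
  enddo

  ! Scalar-temporary form: the three outer loops are scheduled in
  ! parallel, and the inner "l" loop runs sequentially in each thread.
  !$acc parallel loop collapse(3) private(tmp) copyin(D,u,v,w) copyout(m2)
  do k = 1, n
    do j = 1, n
      do i = 1, n
        tmp = 0.
        !$acc loop seq
        do l = 1, n
          tmp = tmp + D(l,i)*u(l,j,k) &
                    + D(l,j)*v(i,l,k) &
                    + D(l,k)*w(i,j,l)
        enddo
        m2(i,j,k) = tmp
      enddo
    enddo
  enddo
  !$acc end parallel loop

  print *, 'max abs difference: ', maxval(abs(m1 - m2))
end program reduction_check
```

Because each (i,j,k) element owns its own tmp, no reduction clause is needed in this form. On an accelerator, small floating-point differences from the host reference are possible.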


Hope this helps,
Mat
jigo3635




Posted: Tue Dec 04, 2012 9:40 am

Hi Mat,

Thank you very much for your responses.

Perhaps there is a typo in version 3 of the code.
Code:

...
!$ACC LOOP GANG
      do k=1,n
!$ACC LOOP GANG VECTOR
         do j=1,n
!$ACC LOOP VECTOR


"!$ACC LOOP GANG VECTOR" should be written "!$ACC LOOP WORKER" with the PGI compiler; otherwise the code cannot be compiled.

Even after I rewrote it this way, the code seems to work with the Cray compiler but NOT with the PGI compiler on a Cray XE6 machine. My test code is below.

Code:

      program test_axf

c      implicit none
      integer, parameter :: n=10
      integer :: i, j, k, l
      real :: us,ur,ut
      real, dimension(:,:), allocatable :: D
      real, dimension(:,:,:), allocatable :: w, u

      real :: summ
      allocate (D(n,n))
      allocate (w(n,n,n), u(n,n,n))

      D = 0.0
      w = 0.0
      u = 0.0

      do j = 1,n
         do i = 1,n
            D(i,j) = 1.0
         enddo
      enddo

      do k = 1,n
         do j = 1,n
            do i = 1,n
               u(i,j,k) = 1.0
            enddo
         enddo
      enddo

!$ACC DATA COPYIN(g,D)
!$ACC& COPY(w,u)
      call ax3f(w,u,ur,us,ut,n,D)
!$ACC WAIT
!$ACC END DATA

      summ = 0.0
      do k = 1,n
         do j = 1,n
            do i = 1,n
               summ = summ + w(i,j,k)
            enddo
         enddo
      enddo

      write(*,*) "SUMMM= ", summ

      deallocate (D,w,u)
      contains

c-----------------------------------------------------------------------
      subroutine ax3f(w,u,ur,us,ut,n,D)
      real w (n,n,n), u (n,n,n), D(n,n)
      real ur,us,ut,wtmp
      integer i,j,k,l,e

!$ACC DATA PRESENT(u)
!$ACC& PRESENT(w)
!$ACC& PRESENT(g,D)
!$ACC PARALLEL
!$ACC LOOP gang
      do k=1,n
!$ACC LOOP WORKER
         do j=1,n
!$ACC LOOP VECTOR
            do i=1,n
               w(i,j,k) = 0.
               do l=1,n
                  w(i,j,k) = w(i,j,k) + D(i,l)*u(l,j,k)
     $                 + D(i,l)*u(i,l,k)
     $                 + D(i,l)*u(i,j,l)
               enddo
            enddo
         enddo
      enddo
!$acc end parallel
!$ACC end data

      return
      end
      end


I just wonder whether the PGI and Cray compilers differ in how they handle OpenACC syntax.

Thanks again.

Regards, Jin
mkcolg




Posted: Tue Dec 04, 2012 10:56 am

Quote:
Perhaps there is a typo in version 3 of the code.
Yes, I was mixing Kernels loop scheduling within a parallel construct. I should have used "kernels" or the following schedule where the outer loops are collapsed.
Code:

!$ACC PARALLEL
!$ACC LOOP gang collapse(2)
      do k=1,n
         do j=1,n
!$ACC LOOP VECTOR
            do i=1,n
               w(i,j,k) = 0.


or
Code:

!$ACC kernels
!$ACC LOOP gang
      do k=1,n
!$ACC LOOP gang vector
         do j=1,n
!$ACC LOOP VECTOR
            do i=1,n


The "Worker" schedule on NVIDIA corresponds to a warp (a group of 32 threads) and is not configurable by the user. I think Cray has a different interpretation of what a Worker is. We and the other OpenACC members are trying to work out these implementation differences, but in the meantime, try one of the above schedules and see if Cray matches ours.

Personally, I'd use the "kernels" construct without any loop schedules and let the compiler determine the best schedule.

Code:

!$ACC kernels
      do k=1,n
         do j=1,n
            do i=1,n


- Mat
jigo3635




Posted: Thu Dec 06, 2012 3:20 am

Hi Mat,

Now it works fine with the PGI compiler.

Thank you very much for your help.

/Jin