PGI User Forum

Vector array assignments within a $acc parallel region
wiersma

Posted: Tue Nov 19, 2013 12:40 pm    Post subject: Vector array assignments within a $acc parallel region

Hi there,

This will almost certainly expose some misunderstanding I have with OpenACC, but I don't know why this code runs differently with and without the $acc statements:

Code:
program acc_error_test

  implicit none

  real(SELECTED_REAL_KIND( P=12,R=60)) :: temp1(20,260)

  integer :: b,bb,n
  integer :: g
  integer :: i,i2,j1,j2
  integer      :: MtrBind(260), MtrBpar(0:259)
  !---------------------------------------------------------------------------------------!

  MtrBpar = 0
  do i = 1, 260
     if (i .le. 4) then
        MtrBind(i) = 1
        MtrBpar(i) = MtrBpar(i - 1) + 2
     else
        MtrBind(i) = i
        MtrBpar(i) = MtrBpar(i - 1)
     end if
  end do
 
  b  = 1
  i2 = 20
 
!$acc parallel copyout(temp1),pcopyin(MtrBpar, MtrBind)
 
  temp1 = 0.0
 
  j2 = 0
  do  n  = 1, 260
     bb = MtrBind(n)
     j1 = j2 + 1
     j2 = j2 + MtrBpar(bb) - MtrBpar(bb-1)
     if (bb == b) then
        temp1(1:i2,j1:j2) = -1.0
     endif
  enddo
 
!$acc end parallel
  print *, temp1(1:i2,1)

end program acc_error_test


With the $acc statements, the output is
Quote:

-1.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000


Without the $acc statements, the output is
Quote:

-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000


The original code has some loops within the parallel region which I'd like to accelerate; the snippet above isolates the behaviour that's been giving me problems.

I'm compiling with:
Quote:
pgf90 -acc -Minfo=accel -Mlarge_arrays -mcmodel=medium -fast -o acc_error_test acc_error_test.f90


Is this behaviour expected? Presumably temp1 is being distributed across a gang (or multiple gangs?) and only one version of it is being returned. How would I otherwise copy back to the host an array that is set this way in a parallel region?

Thanks,
Rob
jtull

Posted: Tue Nov 19, 2013 3:23 pm    Post subject: To run code in parallel, it first must be written that way.

Try examples that can be broken into parallel operations.

You cannot calculate MtrBpar(i) until after you calculate MtrBpar(i-1), so that loop cannot run in parallel. When the operations do not involve results from a previous iteration of the loop, the work can be distributed over multiple parallel processors.
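
For example (a minimal, self-contained sketch with hypothetical arrays a, b, and c; untested):

Code:
program dep_demo

  implicit none

  integer, parameter :: n = 1000
  integer :: i
  real :: a(n), b(n), c(n)

  b = 1.0
  c = 0.0

  ! Independent iterations: a(i) depends only on b(i), so the
  ! work can be distributed over the accelerator.
  !$acc parallel loop copyin(b) copyout(a)
  do i = 1, n
     a(i) = 2.0 * b(i)
  end do

  ! Loop-carried dependence: c(i) needs the c(i-1) produced by the
  ! previous iteration (a running sum), so this loop must stay sequential.
  do i = 2, n
     c(i) = c(i-1) + b(i)
  end do

  print *, a(1), c(n)

end program dep_demo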

I did not try to determine why your answers differ. I suspect you assumed arrays are always initially zero, which is a bad assumption in Fortran. Local data can come from the stack, which means it will initially be garbage. Initialize any data to zero if the loop depends on initial values being zero.

dave
wiersma

Posted: Wed Nov 20, 2013 9:54 am    Post subject: Re: To run code in parallel, it first must be written that way.

Hi Dave,

I don't believe you read my post very closely :(.

jtull wrote:

When the operations do not involve results from a previous iteration of the loop, the work can be distributed over multiple parallel processors.


As I said, the original code is much more involved, with plenty that can be parallelized; I merely isolated the problem for the forum.

The code looks more like
Code:

!$acc parallel

... various isolated things that can be parallelized ...

... problem area ...

... various isolated things that can be parallelized ...
 
!$acc end parallel



jtull wrote:

I did not try to determine why your answers differ. I suspect you assumed arrays are always initially zero, which is a bad assumption in Fortran. Local data can come from the stack, which means it will initially be garbage. Initialize any data to zero if the loop depends on initial values being zero.


I'm not sure what you're talking about. The variables are initialized, unless for some reason you can't initialize variables on the device (j2 in my example). Could you please take another look?

Thanks,
Rob


mkcolg (The Portland Group Inc.)

Posted: Wed Nov 20, 2013 12:27 pm

Hi Robert,

First, you have an out-of-bounds access of MtrBpar in the first loop when i == 260 (it's declared MtrBpar(0:259)). Though that's not the main issue.

"parallel" expects the user to specify the work sharing loops via the "!$acc loop" directives. Sans the loop directives, a single sequential kernel should be created. The exception being when the OpenACC 2.0 "auto" keyword is used, then the compiler is free to auto-parallelize inner loops. For legacy reasons, we auto-parallelize by default and it's this auto-parallelization that's causing the wrong answers. (Note you can disable this via the -acc=noautopar flag, but you would be left with a sequential kernel). I'll go ahead and add a problem report.

What's creating the problem is that you have an outer sequential loop (n) combined with a parallel assignment to an array segment whose size varies from iteration to iteration of the n loop. It looks to me as though the compiler is not generating correct code for the variable-size array assignment and is only setting the first value. The easy workaround is to use explicit loops instead of array syntax.

- Mat

Code:
% cat test.f90
program acc_error_test

   implicit none

   real(SELECTED_REAL_KIND( P=12,R=60)) :: temp1(20,260)

   integer :: b,bb,n
   integer :: g
   integer :: i,i2,j1,j2,j
   integer      :: MtrBind(260), MtrBpar(0:260)
   !---------------------------------------------------------------------------------------!

   MtrBpar = 0
   do i = 1, 260
      if (i .le. 4) then
         MtrBind(i) = 1
         MtrBpar(i) = MtrBpar(i - 1) + 2
      else
         MtrBind(i) = i
         MtrBpar(i) = MtrBpar(i - 1)
      end if
   end do

   b  = 1
   i2 = 20

 !$acc parallel copyout(temp1),pcopyin(MtrBpar, MtrBind)

   temp1 = 1.0

   j2 = 0
   do  n  = 1, 260
      bb = MtrBind(n)
      j1 = j2 + 1
      j2 = j2 + MtrBpar(bb) - MtrBpar(bb-1)
      if (bb == b) then
         do i=1,i2
          do j=j1,j2
              temp1(i,j)=-1.0
          enddo
         enddo
!         temp1(1:i2,j1:j2) = -1.0
      endif
   enddo

 !$acc end parallel
   print *, temp1(1:i2,1)

 end program acc_error_test
% pgf90 test.f90 -acc -Minfo=accel; a.out
acc_error_test:
     27, Generating copyout(temp1(:,:))
         Generating present_or_copyin(mtrbpar(:))
         Generating present_or_copyin(mtrbind(:))
         Accelerator kernel generated
         29, !$acc loop vector(256) ! threadidx%x
         37, !$acc loop vector(256) ! threadidx%x
     27, Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     29, Loop is parallelizable
     32, Loop carried scalar dependence for 'j2' at line 34
         Loop carried scalar dependence for 'j2' at line 35
         Parallelization would require privatization of array 'temp1(i2+1,:)'
     37, Loop is parallelizable
     38, Loop carried reuse of 'temp1' prevents parallelization
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000
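
For reference, a sketch of what the explicitly work-shared version of the region might look like (untested; the n loop keeps its loop-carried dependence on j2, so it is marked seq and run by a single gang, while the rectangular assignments are spread over the vector lanes):

Code:
!$acc parallel num_gangs(1) copyout(temp1) pcopyin(MtrBpar, MtrBind)

! The initialization iterations are independent.
!$acc loop vector collapse(2)
   do j = 1, 260
      do i = 1, 20
         temp1(i,j) = 1.0
      end do
   end do

   j2 = 0
! j2 is carried from one iteration to the next, so the n loop
! must execute sequentially.
!$acc loop seq
   do n = 1, 260
      bb = MtrBind(n)
      j1 = j2 + 1
      j2 = j2 + MtrBpar(bb) - MtrBpar(bb-1)
      if (bb == b) then
! The block assignment itself is parallel.
!$acc loop vector collapse(2)
         do j = j1, j2
            do i = 1, i2
               temp1(i,j) = -1.0
            end do
         end do
      end if
   end do
!$acc end parallel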
wiersma

Posted: Wed Nov 20, 2013 12:36 pm

Hi Mat,

Thanks. I was getting a little frustrated with debugging accelerated code, so I tried to do things very incrementally: first specify the region so that I know the data is getting copied correctly, then insert the work-sharing loop directives. Maybe I was a little too incremental :).

Rob

edit:
ps - I noticed that copying in a structure causes a launch failure. I suppose that means only standard types can be copied in?
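
One workaround that should be possible is to stage the structure's members into plain arrays on the host and use only those in the data clauses. A minimal sketch, assuming a hypothetical derived type cell_t (untested):

Code:
program struct_workaround

  implicit none

  type cell_t
     real :: x(100), y(100)
  end type cell_t

  type(cell_t) :: c
  real :: x(100), y(100)
  integer :: i

  c%x = 1.0

  ! Stage the derived-type members into plain arrays on the host...
  x = c%x

  ! ...so that only intrinsic-typed arrays appear in the data clauses.
  !$acc parallel copyin(x) copyout(y)
  !$acc loop
  do i = 1, 100
     y(i) = 2.0 * x(i)
  end do
  !$acc end parallel

  ! Copy the result back into the structure.
  c%y = y

  print *, y(1)

end program struct_workaround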