PGI User Forum


Vector array assignments within a $acc parallel region
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Thu Nov 21, 2013 10:12 am

Quote:
ps - I noticed that copying-in a structure causes a launch fail. I suppose that means only standard types can be copied-in?
You can use structures, provided they are fixed-size, since the data must be contiguous. Dynamically allocated data within structures, classes, or Fortran user-defined types is problematic because it requires a deep copy, which would rebuild the data structure on the device.

This is a long-standing limitation of OpenACC and one of the most difficult to solve. The OpenACC committee is currently investigating, for the 3.0 specification, a standard method for performing a deep copy and/or restructuring data so that it is contiguous.

- Mat
wiersma



Joined: 16 May 2013
Posts: 29

Posted: Thu Nov 21, 2013 3:17 pm

Hi all/Mat,

I hate to be a pest, but I'm still having similar issues. When I de-vectorized the array assignment in the original code, it still failed (with and without noautopar). So I went back to the test program and initialized it in a way a little closer to the original problem. It still produces different results.

I'm also wondering if I'm being a little naive here. In the absence of a specific warning message, simply defining a parallel region shouldn't affect the results, correct?


Code:

program acc_error_test

  implicit none

  real(SELECTED_REAL_KIND( P=12,R=60)) :: d_temp1(20,260)

  integer :: b,bb,n
  integer :: x,x1,x2,y,y1,y2
  integer :: g
  integer :: i,i2,j,j1,j2,k,k1,k2

  integer      :: M1(13), M2(0:13)
  !---------------------------------------------------------------------------------------!

  b = 1
  g = 1
  M1 = (/1,2,3,4,5,6,7,8,9,10,11,12,13/)
  M2 = (/0,20,40,60,80,100,120,140,160,180,200,220,240,260/)

  i2 = M2(1)-M2(0)

!$acc parallel copyout(d_temp1), pcopyin(M1, M2)
 
  d_temp1 = 0.0
 
  j2 = 0
  k2 = 0
  do  n  = 1, 13
     bb = M1(n)
     j1 = j2 + 1
     j2 = j2 + M2(bb) - M2(bb-1)
     if (bb == b) then
        do i = 1, i2
           do j = j1, j2
              d_temp1(i,j) = -1.0
           end do
        end do
     else
        k1 = k2 + 1
        k2 = k2 + M2(bb) - M2(bb-1)
        do j = 1, k2-k1+1
           do i = 1,i2
              do k = 1,i2
                 d_temp1(i,j+j1-1) = 1.0
              end do
           end do
        end do
     endif
  enddo

!$acc end parallel
  print *, d_temp1


end program acc_error_test


Code:

~/codes/sandbox> pgf90 -acc=noautopar -i8 -Minfo=accel -Mlarge_arrays -mcmodel=medium -fast -o acc_error_test acc_error_test.f90
acc_error_test:
     22, Generating present_or_copyin(m2(:))
         Generating present_or_copyin(m1(:))
         Generating copyout(d_temp1(:,:))
         Accelerator kernel generated
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     24, Loop is parallelizable
     28, Loop carried scalar dependence for 'j2' at line 30
         Loop carried scalar dependence for 'j2' at line 31
         Loop carried scalar dependence for 'k2' at line 39
         Loop carried scalar dependence for 'k2' at line 40
         Parallelization would require privatization of array 'd_temp1(i2+1,:)'
     33, Loop is parallelizable
     34, Loop carried reuse of 'd_temp1' prevents parallelization
     41, Parallelization would require privatization of array 'd_temp1(i3+1,:)'
     42, Loop is parallelizable
     43, Loop carried reuse of 'd_temp1' prevents parallelization
~/codes/sandbox> acc_error_test > test.out
launch CUDA kernel  file=/home/wiersmar/codes/sandbox/acc_error_test.f90 function=acc_error_test line=22 device=0 grid=10 block=1

Accelerator Kernel Timing data
/home/wiersmar/codes/sandbox/acc_error_test.f90
  acc_error_test  NVIDIA  devicenum=0
    time(us): 3,939
    22: compute region reached 1 time
        22: data copyin reached 2 times
             device time(us): total=26 max=19 min=7 avg=13
        22: kernel launched 1 time
            grid: [10]  block: [1]
             device time(us): total=3,879 max=3,879 min=3,879 avg=3,879
            elapsed time(us): total=3,892 max=3,892 min=3,892 avg=3,892
        51: data copyout reached 1 time
             device time(us): total=34 max=34 min=34 avg=34
~/codes/sandbox> head test.out
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000         0.000000000000000
    0.000000000000000         0.000000000000000         0.000000000000000
    0.000000000000000         0.000000000000000         0.000000000000000
    0.000000000000000         0.000000000000000         0.000000000000000
~/codes/sandbox> pgf90 -i8 -Mlarge_arrays -mcmodel=medium -fast -o acc_error_test acc_error_test.f90
~/codes/sandbox> acc_error_test > test.out
~/codes/sandbox> head test.out
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000


Thanks,
Rob
mkcolg



Joined: 30 Jun 2004
Posts: 6208
Location: The Portland Group Inc.

Posted: Thu Nov 21, 2013 6:11 pm

Hi Robert,

I'm debating what to do here. The compiler probably isn't generating good code, but the code is serial and not really a good fit for an accelerator. Would you consider something more like the following, where only the array assignments are put on the accelerator?

- Mat

Code:
program acc_error_test

   implicit none

   real(SELECTED_REAL_KIND( P=12,R=60)) :: d_temp1(20,260)

   integer :: b,bb,n
   integer :: x,x1,x2,y,y1,y2
   integer :: g
   integer :: i,i2,j,j1,j2,k,k1,k2

   integer      :: M1(13), M2(0:13)
   !---------------------------------------------------------------------------------------!

   b = 1
   g = 1
   M1 = (/1,2,3,4,5,6,7,8,9,10,11,12,13/)
   M2 = (/0,20,40,60,80,100,120,140,160,180,200,220,240,260/)

   i2 = M2(1)-M2(0)
!$acc data copyout(d_temp1)
!$acc kernels
   d_temp1 = 0.0
!$acc end kernels
   j2 = 0
   k2 = 0

   do  n  = 1, 13
      bb = M1(n)
      j1 = j2 + 1
      j2 = j2 + M2(bb) - M2(bb-1)
      if (bb == b) then
!$acc kernels loop
         do i = 1, i2
            do j = j1, j2
               d_temp1(i,j) = -1.0
            end do
         end do
      else
         k1 = k2 + 1
         k2 = k2 + M2(bb) - M2(bb-1)
!$acc kernels loop
         do j = 1, k2-k1+1
            do i = 1,i2
               do k = 1,i2
                  d_temp1(i,j+j1-1) = 1.0
               end do
            end do
         end do
      endif
   enddo

 !$acc end data
   print *, d_temp1


 end program acc_error_test
wiersma



Joined: 16 May 2013
Posts: 29

Posted: Fri Nov 22, 2013 8:37 am

Looks good, Mat - it works in my test program (and I can go back to vectorizing that one assignment). I hadn't thought of using a data region for the arrays and letting the processor worry about the indices.

Naturally, I'll be back if it doesn't go so well in the main program :).

Thanks again for your patience,
Rob
wiersma



Joined: 16 May 2013
Posts: 29

Posted: Fri Nov 22, 2013 8:59 am

Hi there,

Just one more clarification. -Minfo spits out the following if I use Mat's code:

Code:

acc_error_test:
     22, Generating copyout(d_temp1(:,:))
     24, Generating present_or_copyout(d_temp1(:,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     25, Loop is parallelizable
         Accelerator kernel generated
         25, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     35, Generating present_or_copyout(d_temp1(:,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     36, Loop is parallelizable
         Accelerator kernel generated
         36, !$acc loop gang ! blockidx%y
             !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     41, Generating present_or_copyout(d_temp1(:,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     42, Loop is parallelizable
     43, Loop is parallelizable
     44, Loop carried reuse of 'd_temp1' prevents parallelization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         42, !$acc loop gang ! blockidx%y
         43, !$acc loop gang, vector(128) ! blockidx%x threadidx%x


Does that mean that two copyout operations are generated (lines 35 and 41)? I'm trying desperately to avoid too much communication, since I know it's going to kill me. On the other hand, the timing info shows the following:

Code:

Accelerator Kernel Timing data
/home/wiersmar/codes/sandbox/acc_error_test.f90
  acc_error_test  NVIDIA  devicenum=0
    time(us): 1,176
    22: data region reached 1 time
        52: data copyout reached 1 time
             device time(us): total=34 max=34 min=34 avg=34
    24: compute region reached 1 time
        25: kernel launched 1 time
            grid: [1x65]  block: [32x4]
             device time(us): total=110 max=110 min=110 avg=110
            elapsed time(us): total=130 max=130 min=130 avg=130
    35: compute region reached 1 time
        36: kernel launched 1 time
            grid: [1x20]  block: [128]
             device time(us): total=122 max=122 min=122 avg=122
            elapsed time(us): total=136 max=136 min=136 avg=136
    41: compute region reached 12 times
        44: kernel launched 12 times
            grid: [1x20]  block: [128]
             device time(us): total=910 max=109 min=63 avg=75
            elapsed time(us): total=1,088 max=122 min=77 avg=90


This seems to indicate that only one copyout was used. Which is it then?

Thanks again,
Rob