PGI User Forum

Code execution depends strangely on irrelevant parameters
wiersma



Joined: 16 May 2013
Posts: 29

Posted: Tue Oct 15, 2013 1:45 pm

Hi,

This is likely going to be something stupid, but I'm at my wits' end here. My code doesn't seem to be executing properly. I've stripped the problem down to its simplest form (obviously I've changed variable names and initialized the variables to silly values, since the work is proprietary).

Code:

program Mat_test

  implicit none
  integer :: i, j, endn
  integer :: inds(260)
  real*8  :: A(260, 260, 89), B(260, 89), C(260, 89, 89), D(260, 260)
  real*8  :: test(89), test1(89), test2(89)
 

  do i = 1, 89
     A(:,:,i) = i*2.5
     B(:,i) = i * 3.5
     do j = 1, 89
        C(:,i,i) = i*j*0.01
     end do
  end do

  endn = 250

!$acc kernels
!$acc loop independent private(D)
  do  i = 1, 2
     
     do  j = 1, 260
        A(:,j, i) = A(:,j,i) / B(:, i)
     enddo
     
     
     do  j = 1, endn
        D(:,j) = - A(:,j,i) * C(j,i,i)
        D(j,j) = D(j,j) + 1.0         
     enddo
     test(i) = D(5,5)
     test1(i) = - A(5,5,i)
     test2(i) = C(5,i,i)
     
     
  enddo
!$acc end kernels

  print *, test(1), test1(1), test2(1)
  print *, test(2), test1(2), test2(2)
  print *, A(5,5,1)
 
end program Mat_test


In my full version, D is used later on in the loop, but is unique for each iteration through i.

The problem is that the value of "test(2)" depends on "endn". For endn = 10 it gives the correct answer, while for endn = 250 it doesn't. In principle the loop should run to 260, but larger values of endn are causing me problems.

What am I missing here?
-Rob
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Wed Oct 16, 2013 6:24 am

Hi Rob,

It looks like we might be privatizing D for each thread, not each gang, when using the kernels region. This causes the program to use too much global data.
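
To give a rough sense of scale for the difference between one private copy of D per gang and one per thread, here is a minimal sketch of the arithmetic. The array shape and element size come from the declaration of D in the posted code; the gang count (2) and threads per gang (256) are assumptions used only for illustration, not values confirmed for the kernels version.

```fortran
program private_footprint
  implicit none
  ! 64-bit integers so the byte counts cannot overflow.
  integer, parameter :: i8 = selected_int_kind(18)
  integer(i8), parameter :: n = 260             ! D is declared D(260, 260)
  integer(i8), parameter :: bytes_per_elem = 8  ! real*8
  integer(i8), parameter :: gangs = 2           ! assumption: one gang per i iteration
  integer(i8), parameter :: threads = 256       ! assumption: threads per gang
  integer(i8) :: one_copy

  one_copy = n * n * bytes_per_elem
  print *, 'bytes for one copy of D:    ', one_copy
  print *, 'bytes, one copy per gang:   ', one_copy * gangs
  print *, 'bytes, one copy per thread: ', one_copy * gangs * threads
end program private_footprint
```

Under those assumptions a single copy of D is 540,800 bytes, per-gang privatization needs about 1.1 MB, and per-thread privatization needs about 277 MB.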

The workaround is to use the "parallel" directive instead.
Code:

% cat mat_test.F90
program Mat_test

   implicit none
   integer :: i, j, endn
   integer :: inds(260)
   real*8  :: A(260, 260, 89), B(260, 89), C(260, 89, 89), D(260, 260)
   real*8  :: test(89), test1(89), test2(89)


   do i = 1, 89
      A(:,:,i) = i*2.5
      B(:,i) = i * 3.5
      do j = 1, 89
         C(:,i,i) = i*j*0.01
      end do
   end do

   endn = 250

 !$acc parallel loop gang private(D)
   do  i = 1, 2

!$acc loop vector
      do  j = 1, 260
         A(:,j, i) = A(:,j,i) / B(:, i)
      enddo

!$acc loop vector
      do  j = 1, endn
         D(:,j) = - A(:,j,i) * C(j,i,i)
         D(j,j) = D(j,j) + 1.0
      enddo
      test(i) = D(5,5)
      test1(i) = - A(5,5,i)
      test2(i) = C(5,i,i)


   enddo
 !$acc end parallel loop

   print *, test(1), test1(1), test2(1)
   print *, test(2), test1(2), test2(2)
   print *, A(5,5,1)

 end program Mat_test
% pgf90 -acc -Minfo=accel mat_test.F90
mat_test:
     20, Accelerator kernel generated
         21, !$acc loop gang ! blockidx%x
         24, !$acc loop vector(256) ! threadidx%x
         29, !$acc loop vector(256) ! threadidx%x
     20, Generating present_or_copyout(test2(:2))
         Generating present_or_copyout(test1(:2))
         Generating present_or_copyout(test(:2))
         Generating present_or_copy(a(:,:,:2))
         Generating present_or_copyin(b(:,:2))
         Generating present_or_copyin(c(:250,:2,:2))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     24, Loop is parallelizable
     25, Loop is parallelizable
     29, Loop is parallelizable
     30, Loop is parallelizable
% a.out
   0.3642857245036534       -0.7142857142857143        0.8899999856948853
  -0.2714285509926933       -0.7142857142857143         1.779999971389771
   0.7142857142857143


Hope this helps,
Mat
wiersma



Joined: 16 May 2013
Posts: 29

Posted: Wed Oct 16, 2013 10:52 am

Still no dice. I used the same directives as you did and for endn=250 I still don't get the right result. I also included some other info in case it can be of help.

Code:

crw8398 ~/codes/sandbox> pgf90 -acc -Minfo=accel -fast -o Mat_test Mat_test.f90
mat_test:
     20, Accelerator kernel generated
         21, !$acc loop gang ! blockidx%x
         24, !$acc loop vector(256) ! threadidx%x
         30, !$acc loop vector(256) ! threadidx%x
     20, Generating present_or_copyout(test2(:2))
         Generating present_or_copyout(test1(:2))
         Generating present_or_copyout(test(:2))
         Generating present_or_copy(a(:,:,:2))
         Generating present_or_copyin(b(:,:2))
         Generating present_or_copyin(c(:250,:2,:2))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     24, Loop is parallelizable
     25, Loop is parallelizable
     30, Loop is parallelizable
     31, Loop is parallelizable
crw8398 ~/codes/sandbox> Mat_test
launch CUDA kernel  file=/home/wiersmar/codes/sandbox/Mat_test.f90 function=mat_test line=20 device=0 grid=2 block=256
  -0.2714285509926933       -0.7142857142857143        0.8899999856948853
  -0.2714285509926933       -0.7142857142857143         1.779999971389771
   0.7142857142857143

Accelerator Kernel Timing data
/home/wiersmar/codes/sandbox/Mat_test.f90
  mat_test  NVIDIA  devicenum=0
    time(us): 2,205
    20: compute region reached 1 time
        20: data copyin reached 4 times
             device time(us): total=407 max=367 min=10 avg=101
        20: kernel launched 1 time
            grid: [2]  block: [256]
             device time(us): total=1,435 max=1,435 min=1,435 avg=1,435
            elapsed time(us): total=1,452 max=1,452 min=1,452 avg=1,452
        42: data copyout reached 4 times
             device time(us): total=363 max=334 min=9 avg=90
crw8398 ~/codes/sandbox> pgaccelinfo
CUDA Driver Version:           5050
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  319.49  Tue Aug 13 21:15:53 PDT 2013

Device Number:                 0
Device Name:                   Tesla C2075
Device Revision Number:        2.0
Global Memory Size:            5636554752
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1566 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           79292 microseconds
Current free memory:           5569896448
Upload time (4MB):             2613 microseconds (1410 ms pinned)
Download time:                 2322 microseconds (1270 ms pinned)
Upload bandwidth:              1605 MB/sec (2974 MB/sec pinned)
Download bandwidth:            1806 MB/sec (3302 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20

Device Number:                 1
Device Name:                   Quadro 600
Device Revision Number:        2.1
Global Memory Size:            1072889856
Number of Multiprocessors:     2
Number of Cores:               64
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1280 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             800 MHz
Memory Bus Width:              128 bits
L2 Cache Size:                 131072 bytes
Max Threads Per SMP:           1536
Async Engines:                 1
Unified Addressing:            Yes
Initialization time:           79292 microseconds
Current free memory:           577249280
Upload time (4MB):             1545 microseconds ( 897 ms pinned)
Download time:                 2280 microseconds (1989 ms pinned)
Upload bandwidth:              2714 MB/sec (4675 MB/sec pinned)
Download bandwidth:            1839 MB/sec (2108 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20
crw8398 ~/codes/sandbox>


Thanks,
Rob
mkcolg



Joined: 30 Jun 2004
Posts: 5952
Location: The Portland Group Inc.

Posted: Wed Oct 16, 2013 11:53 am

What's the right result? I get the same output with and without directives.

Code:
% pgf90 mat_test.F90
% a.out
   0.3642857245036534       -0.7142857142857143        0.8899999856948853
  -0.2714285509926933       -0.7142857142857143         1.779999971389771
   0.7142857142857143
% pgf90 mat_test.F90 -acc
% a.out
   0.3642857245036534       -0.7142857142857143        0.8899999856948853
  -0.2714285509926933       -0.7142857142857143         1.779999971389771
   0.7142857142857143


% gfortran mat_test.F90
% a.out
  0.36428572450365337      -0.71428571428571430       0.88999998569488525
 -0.27142855099269325      -0.71428571428571430        1.7799999713897705
  0.71428571428571430
% ifort mat_test.F90; a.out
  0.364285724503653      -0.714285714285714       0.889999985694885
 -0.271428550992693      -0.714285714285714        1.77999997138977
  0.714285714285714
wiersma



Joined: 16 May 2013
Posts: 29

Posted: Wed Oct 16, 2013 12:01 pm

mkcolg wrote:
What's the right result? I get the same output with and without directives.


You've got the right result. I don't for some reason (although I do without directives) :(.

Look at the first column in my output:
Quote:
Code:

  -0.2714285509926933       -0.7142857142857143        0.8899999856948853
  -0.2714285509926933       -0.7142857142857143         1.779999971389771
   0.7142857142857143
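
For reference, the expected first-column values can be reproduced by hand from the initialization in the posted program. The sketch below is not part of the original code: it follows only element (5,5) through the two inner loops using scalar stand-ins (the names a, b, c, d are hypothetical), and it assumes endn is at least 5 so that D(5,5) is actually written.

```fortran
program expected_first_column
  implicit none
  integer :: i
  real*8  :: a, b, c, d

  do i = 1, 2
     a = i * 2.5          ! A(:,:,i) as initialized
     b = i * 3.5          ! B(:,i) as initialized
     c = i * 89 * 0.01    ! C(:,i,i) after the last pass (j = 89) of the init loop
     a = a / b            ! first inner loop: A(5,5,i) / B(5,i)
     d = -a * c + 1.0     ! second inner loop on the diagonal: D(5,5)
     print *, i, d        ! value expected in test(i)
  end do
end program expected_first_column
```

The results should agree with the serial runs quoted in the thread: roughly 0.3643 for i = 1 and -0.2714 for i = 2. In the accelerated output above, test(1) holds the i = 2 value, which is consistent with (though it does not prove) both iterations writing to the same copy of D.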
All times are GMT - 7 Hours