PGI User Forum
 SearchSearch   MemberlistMemberlist     RegisterRegister   ProfileProfile    Log inLog in 

CUDA-x86.

Possible bug in triangular loop
Goto page 1, 2  Next
 
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming
View previous topic :: View next topic  
Author Message
wsawyer



Joined: 19 Jan 2011
Posts: 7

PostPosted: Tue Oct 16, 2012 5:27 am    Post subject: Possible bug in triangular loop Reply with quote

I have extracted the following example from a much bigger application illustrating some slightly non-trivial doubly nested loops. Essentially the code transposes a matrix in place. The following code works fine on the host with OMP but fails on the device. Clearly triangular loops are less desirable than rectangular ones, but changing all the loops in the application is not an option.

Could you explain either (1) why the example code is buggy, or (2) if it is a compiler problem, could you open a ticket?

Compiler: 12.9

Compile line: pgf95 -c -O3 -fastsse -I. -acc -ta=nvidia -Minfo=accel -Mpreprocess -mp=nonuma transpose_bug.f90

PROGRAM transpose_bug

IMPLICIT NONE

INTEGER, PARAMETER :: n = 3
INTEGER :: i, j
DOUBLE PRECISION, POINTER :: array_orig_h(:,:), array1_h(:,:)
DOUBLE PRECISION :: tmp, tmp2(3)

ALLOCATE( array_orig_h( n, n ) )
ALLOCATE( array1_h( n, n ) )

DO i=1, n
DO j=1, n
array_orig_h(i,j) = DBLE(i) + DBLE(j)/1000.0
array1_h(i,j) = array_orig_h(i,j)
ENDDO
ENDDO

print *, "Original matrix", array_orig_h

! Transpose array1_h on host (OMP version)

!$OMP PARALLEL DO PRIVATE(i,j,tmp)
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
ENDDO
ENDDO
!$OMP END PARALLEL DO

print *, "after first host transpose", array1_h

! Transpose array1_h on host (OMP version)

!$OMP PARALLEL DO PRIVATE(i,j,tmp)
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
ENDDO
ENDDO
!$OMP END PARALLEL DO

print *, "after second host transpose", array1_h

print *, "HOST/HOST transpose: Difference to original should be zero ", SUM(array1_h-array_orig_h)


!$acc data copy(array1_h)

!$ACC PARALLEL LOOP PRIVATE(i,j,tmp)
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
ENDDO
ENDDO
!$ACC END PARALLEL LOOP

!$acc wait

print *, "after first device transpose", array1_h

!$ACC PARALLEL LOOP PRIVATE(i,j,tmp)
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
ENDDO
ENDDO
!$ACC END PARALLEL LOOP

!$acc wait

!$acc end data

print *, "result", array1_h

print *, "DEVICE/DEVICE transpose: Difference to original should be zero but is not ", SUM(array1_h-array_orig_h)

DEALLOCATE( array_orig_h, array1_h )

END PROGRAM transpose_bug

Comments: the host output gives the expected result, namely

HOST/HOST transpose: Difference to original should be zero
0.000000000000000

The second device transpose yields:

result 1.001000000000000 2.001000000000000
3.002000000000000 1.002000000000000 2.002000000000000
3.002000000000000 2.003000000000000 2.003000000000000
3.003000000000000
DEVICE/DEVICE transpose: Difference to original should be zero but is not
1.001000000000000


One interesting result of looking at the dump with NVDEBUG=1 is that tmp seems to be allocated with cu_alloc, and is not in a register (as it should be):

__pgi_cu_module_function( name=0x6906b0=transpose_bug_66_gpu, lineno=66, argname=(nil)=, argsize=40, varname=(nil)=, varsize=0, SWcachesize=0 )
Function handle is 0x2015770
__pgi_cu_alloc(size=24,lineno=66,name=tmp)
__pgi_cu_alloc(24) returns 0x200200200 (address=0x7fff4cc18158)

The fact that tmp is allocated with 24 bytes (3x8) is suspicious: if a GPU private array were used instead of a register, it would be safer to have an n x n array.

I've discussed this at length with Jeff Poznanovic and have played around with the directives, e.g., adding an !$acc loop directive around the inner loop. The problem persists.

Hope you can give us some input.

Thanks!

--Will
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6021
Location: The Portland Group Inc.

PostPosted: Tue Oct 16, 2012 10:02 am    Post subject: Reply with quote

Hi Will,

The problem here is that you privatized TMP in the outer loop. Since the outer loop is the "gang" (or block), this means TMP variable is private to the gang, but shared amongst the vectors (threads) in the gang.

To fix, remove the private clauses. Scalars are private by default so it's unnecessary. Also, by privatizing a scalar, you've created an array of these variables in global memory which needs to fetched. Without the private clause, "tmp" will be made a local variable and most likely be put in a register.

Here's the modified code:

Code:
% cat transpose1.f90
PROGRAM transpose_bug

IMPLICIT NONE

INTEGER, PARAMETER :: n = 3
INTEGER :: i, j
DOUBLE PRECISION, POINTER :: array_orig_h(:,:), array1_h(:,:)
DOUBLE PRECISION :: tmp, tmp2(3)

ALLOCATE( array_orig_h( n, n ) )
ALLOCATE( array1_h( n, n ) )

DO i=1, n
DO j=1, n
array_orig_h(i,j) = DBLE(i) + DBLE(j)/1000.0
array1_h(i,j) = array_orig_h(i,j)
ENDDO
ENDDO

print *, "Original matrix", array_orig_h

! Transpose array1_h on host (OMP version)

!$OMP PARALLEL DO PRIVATE(i,j,tmp)
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
ENDDO
ENDDO
!$OMP END PARALLEL DO

print *, "after first host transpose", array1_h

! Transpose array1_h on host (OMP version)

!$OMP PARALLEL DO PRIVATE(i,j,tmp)
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j); array1_h(i,j) = array1_h(j,i); array1_h(j,i) = tmp
ENDDO
ENDDO
!$OMP END PARALLEL DO

print *, "after second host transpose", array1_h

print *, "HOST/HOST transpose: Difference to original should be zero ", SUM(array1_h-array_orig_h)


!$acc data copy(array1_h)

!$ACC PARALLEL LOOP
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j);
array1_h(i,j) = array1_h(j,i);
array1_h(j,i) = tmp
ENDDO
ENDDO
!$ACC END PARALLEL LOOP

!$acc wait

print *, "after first device transpose", array1_h

!$ACC PARALLEL LOOP 
DO i=1, n
DO j=1, i-1
! Exchange element at array1_h(i,j)
tmp = array1_h(i,j);
array1_h(i,j) = array1_h(j,i);
array1_h(j,i) = tmp
ENDDO
ENDDO
!$ACC END PARALLEL LOOP

!$acc wait

!$acc end data

print *, "result", array1_h

print *, "DEVICE/DEVICE transpose: Difference to original should be zero but is not ", SUM(array1_h-array_orig_h)

DEALLOCATE( array_orig_h, array1_h )

END PROGRAM transpose_bug
% pgf95 -O3 -fastsse -I. -acc -ta=nvidia -Minfo=accel -Mpreprocess -mp=nonuma transpose1.f90 -V12.9; a.out

transpose_bug:
     51, Generating copy(array1_h(:,:))
     53, Accelerator kernel generated
         53, CC 1.3 : 16 registers; 52 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 17 registers; 4 shared, 64 constant, 0 local memory bytes
         54, !$acc loop gang ! blockidx%x
         55, !$acc loop vector(256) ! threadidx%x
     53, Generating present_or_copy(array1_h(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     55, Loop is parallelizable
     68, Accelerator kernel generated
         68, CC 1.3 : 16 registers; 52 shared, 8 constant, 0 local memory bytes
             CC 2.0 : 17 registers; 4 shared, 64 constant, 0 local memory bytes
         69, !$acc loop gang ! blockidx%x
         70, !$acc loop vector(256) ! threadidx%x
     68, Generating present_or_copy(array1_h(:,:))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     70, Loop is parallelizable
 Original matrix    1.001000000000000         2.001000000000000     
    3.001000000000000         1.002000000000000         2.002000000000000     
    3.002000000000000         1.003000000000000         2.003000000000000     
    3.003000000000000     
 after first host transpose    1.001000000000000         1.002000000000000     
     1.003000000000000         2.001000000000000         2.002000000000000     
     2.003000000000000         3.001000000000000         3.002000000000000     
     3.003000000000000     
 after second host transpose    1.001000000000000     
    2.001000000000000         3.001000000000000         1.002000000000000     
    2.002000000000000         3.002000000000000         1.003000000000000     
    2.003000000000000         3.003000000000000     
 HOST/HOST transpose: Difference to original should be zero 
    0.000000000000000     
 after first device transpose    1.001000000000000     
    2.001000000000000         3.001000000000000         1.002000000000000     
    2.002000000000000         3.002000000000000         1.003000000000000     
    2.003000000000000         3.003000000000000     
 result    1.001000000000000         2.001000000000000     
    3.001000000000000         1.002000000000000         2.002000000000000     
    3.002000000000000         1.003000000000000         2.003000000000000     
    3.003000000000000     
 DEVICE/DEVICE transpose: Difference to original should be zero but is not 
    0.000000000000000     

Back to top
View user's profile
wsawyer



Joined: 19 Jan 2011
Posts: 7

PostPosted: Wed Oct 24, 2012 12:56 am    Post subject: Re: Possible PGI bug Reply with quote

Hi Mat,

Thanks very much for the quick reply. I can confirm your results.

Related questions: why should the introduction of PRIVATE(tmp) yield incorrect results? When are there guarantees that the GPU will give the same results? Is the correctness of results implementation dependent?

We are now testing the application with Cray CCE, but running into problems with their compiler.

Thanks again,

--Will
Back to top
View user's profile
mkcolg



Joined: 30 Jun 2004
Posts: 6021
Location: The Portland Group Inc.

PostPosted: Wed Oct 24, 2012 9:46 am    Post subject: Reply with quote

Hi Will,

Quote:
why should the introduction of PRIVATE(tmp) yield incorrect results

It's because you privatized it on the loop used to generate the "gang" schedule, hence a single "tmp" is being used for all threads within a gang.

Quote:
When are there guarantees that the GPU will give the same results?
Assuming a correct parallel program, the CPU and GPU results should be in agreement. There may be some differences in floating point value, but typically these are within a expected tolerance.

Quote:
Is the correctness of results implementation dependent?
Do you mean a vendor's implementation of OpenACC or user's specific code?

- Mat
Back to top
View user's profile
wsawyer



Joined: 19 Jan 2011
Posts: 7

PostPosted: Wed Oct 31, 2012 2:13 am    Post subject: Reply with quote

>> Is the correctness of results implementation dependent?

> Do you mean a vendor's implementation of OpenACC or user's specific code?

OpenACC vendor implementation. The original code did not yield incorrect results with CCE 8.1.1.

Thanks again, --Will


Last edited by wsawyer on Mon Nov 05, 2012 7:50 am; edited 1 time in total
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    PGI User Forum Forum Index -> Accelerator Programming All times are GMT - 7 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © phpBB Group