PGI User Forum


Wrong results when using the private directive with PGI 12.6
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Tue Aug 07, 2012 1:55 am    Post subject: Wrong results when using the private directive with PGI 12.6

Hi,

Trying to run OpenACC code compiled with PGI 12.6, I am getting wrong results. After some investigation, the problem seems related to the private clause. I have been able to reproduce it in the following test code:
Code:

program main
  implicit none
  integer*4, parameter :: ireals=8
  integer*4 :: N,nlev,ip,k
  real*8, allocatable :: sohr_ref(:,:),thhr_ref(:,:)
  real*8, allocatable :: sohr(:,:),thhr(:,:),zfls(:,:),zflt(:,:),zsmu0(:),dp0(:,:)
  real*8 :: cp_d,zfac,g,zepemu

  N=1000
  nlev=4

allocate(sohr_ref(N,nlev),thhr_ref(N,nlev))
allocate(sohr(N,nlev),thhr(N,nlev),zfls(N,nlev+1),zflt(N,nlev+1),zsmu0(N),dp0(N,nlev))

!----------------------------------
!init
cp_d     = 1005.0_ireals
g        =     9.80665_ireals
zepemu   = 1.0E-9_ireals
DO  k = 1, nlev
   DO  ip = 1,N
      zfls(ip,k)=8.0_ireals*cos(6.28_ireals*REAL(ip+k,ireals)/REAL(N+nlev,ireals))
      zflt(ip,k)=6.0_ireals*cos(6.28_ireals*REAL(ip+k,ireals)/REAL(N+nlev,ireals))
      dp0(ip,k) = 1015.0_ireals*cos(6.28_ireals*REAL(ip+k,ireals)/REAL(N+nlev,ireals))
   END DO
END DO

DO  ip = 1,N
   zfls(ip,nlev+1)=8.0_ireals*cos(6.28_ireals*REAL(ip+nlev+1,ireals)/REAL(N+nlev+1,ireals))
   zflt(ip,nlev+1)=6.0_ireals*cos(6.28_ireals*REAL(ip+nlev+1,ireals)/REAL(N+nlev+1,ireals))
   zsmu0(ip)=cos(6.28_ireals*REAL(ip,ireals)/REAL(N,ireals))
END DO
!----------------------------------
!1: compute on cpu
DO  k = 1, nlev
   DO  ip = 1,N
      zfac = g/(cp_d*dp0 (ip,k))
      sohr_ref(ip,k) = 0.0
      IF (zsmu0(ip) > zepemu) THEN
         sohr_ref(ip,k) = zfac * (zfls(ip,k)-zfls(ip,k+1))
      ENDIF
      thhr_ref(ip,k)   = zfac * (zflt(ip,k)-zflt(ip,k+1))
   ENDDO
END DO


!----------------------------------
!2: compute on gpu without private
!$acc data create(sohr,thhr,dp0,zfls,zflt,zsmu0)
!$acc update device(dp0,zfls,zflt,zsmu0)
!$acc parallel
DO  k = 1, nlev
   !$acc loop gang vector
   DO  ip = 1,N
      zfac = g/(cp_d*dp0 (ip,k))
      sohr(ip,k) = 0.0
      IF (zsmu0(ip) > zepemu) THEN
         sohr(ip,k) = zfac * (zfls(ip,k)-zfls(ip,k+1))
      ENDIF
      thhr(ip,k)   = zfac * (zflt(ip,k)-zflt(ip,k+1))
   ENDDO
END DO
!$acc end parallel
!$acc update host(sohr,thhr)
!$acc end data

print*, 'Max Diff without private CPU/GPU: sohr', maxval(abs(sohr_ref-sohr))
print*, 'Max Diff without private CPU/GPU: thhr', maxval(abs(thhr_ref-thhr))

!----------------------------------
!3: compute on gpu using private

!$acc data create(sohr,thhr,dp0,zfls,zflt,zsmu0)
!$acc update device(dp0,zfls,zflt,zsmu0)
!$acc parallel private(ip,k,zfac)
DO  k = 1, nlev
   !$acc loop gang vector
   DO  ip = 1,N
      zfac = g/(cp_d*dp0 (ip,k))
      sohr(ip,k) = 0.0
      IF (zsmu0(ip) > zepemu) THEN
         sohr(ip,k) = zfac * (zfls(ip,k)-zfls(ip,k+1))
      ENDIF
      thhr(ip,k)   = zfac * (zflt(ip,k)-zflt(ip,k+1))
   ENDDO
END DO
!$acc end parallel
!$acc update host(sohr,thhr)
!$acc end data

print*, 'Max Diff with private CPU/GPU: sohr', maxval(abs(sohr_ref-sohr))
print*, 'Max Diff with private CPU/GPU: thhr', maxval(abs(thhr_ref-thhr))


end program main


Compiling and running shows different results when the private clause is used:

Code:

> pgf90 -acc -o test_private_12_6 test_private_12_6.f90
> ./test_private_12_6
 Max Diff without private CPU/GPU: sohr    0.000000000000000     
 Max Diff without private CPU/GPU: thhr    0.000000000000000     
 Max Diff with private CPU/GPU: sohr   6.1529741284087325E-004
 Max Diff with private CPU/GPU: thhr   4.6147305963065491E-004




Xavier
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Tue Aug 07, 2012 10:48 am

Hi Xavier,

A private clause on a parallel construct declares that a copy of each listed variable is created for each gang. This means that every vector within a gang will share the same variable. In your case, every vector is using the same "zfac" variable and hence clobbering each other's value.

Since scalars are privatized by default, I recommend not explicitly privatizing them unless absolutely necessary (such as if you're getting a live-out error). Granted, this is very confusing, especially since OpenMP users are used to privatizing scalars. So I'm wondering if the compiler should just "do the right thing" in this context and make the scalar a local variable in the kernel (essentially ignoring the private clause). I'll send the question off to Michael.
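
To make the placement difference concrete, here's a minimal sketch (illustrative names, not taken from your code) contrasting the gang-level clause with the default behavior:

Code:

real*8  :: a(1000), b(1000), tmp
integer :: i, n
n = 1000
a = 1.0d0

! private on the PARALLEL construct: ONE copy of tmp per gang,
! shared by all vector lanes in that gang -> they clobber each other
!$acc parallel private(tmp)
!$acc loop gang vector
DO i = 1, n
   tmp  = 2.0d0*a(i)
   b(i) = tmp        ! may read a tmp written by another lane
END DO
!$acc end parallel

! no clause needed: scalars referenced in the region are
! privatized per thread by default
!$acc parallel
!$acc loop gang vector
DO i = 1, n
   tmp  = 2.0d0*a(i)   ! each thread gets its own copy of tmp
   b(i) = tmp
END DO
!$acc end parallel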

- Mat
xlapillonne



Joined: 16 Feb 2011
Posts: 69

Posted: Wed Aug 08, 2012 12:21 am

Hi Mat,

Thanks for the clarification. When reading the OpenACC specification I didn't see that private has a different meaning on the parallel construct than on the loop construct.

Also, I think the documentation is somewhat confusing; p. 10 of the OpenACC 1.0 specification says:

Quote:

A scalar variable referenced in the parallel construct that does not appear in a data clause for the construct or any enclosing data construct will be treated as if it appeared in a private clause (if not live-in or live-out) or a copy clause for the parallel construct.


This is why I thought that the private clause on the parallel construct would work.

In case I need to privatize a scalar (private to each worker), should I then add the private clause to the "!$acc loop" directive?
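E.g. something like this, taking the loop from my test code above (just to illustrate what I mean):

Code:

!$acc parallel
DO  k = 1, nlev
   !$acc loop gang vector private(zfac)
   DO  ip = 1,N
      zfac = g/(cp_d*dp0(ip,k))   ! zfac now private to each thread
      ...
   ENDDO
END DO
!$acc end parallel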

Xavier
Tiziano



Joined: 14 May 2012
Posts: 8

Posted: Wed Aug 08, 2012 2:54 am

mkcolg wrote:

Since scalars are privatized by default, I recommend not explicitly privatizing them unless absolutely necessary (such as if you're getting a live-out error).
- Mat


Hi Mat,
I'm actually running into such an error or warning (I'm not sure which it really is), so I'd like to ask what the best practice is. I have a loop like this:
Code:

!$acc parallel
!$acc loop
DO j = 1, je
   !$acc loop
   DO i = 1, ie
      zaiu (i,j) = c0
      zaiv (i,j) = c0
      zrotv (i,j) = c0
   ENDDO
ENDDO
!$acc end parallel

where c0 is a constant; this is basically an initialization. The first line is line 4612 in my code.
Compiling like that, I get the following messages from the compiler (12.6):

Code:

4612, Accelerator kernel generated
4612, CC 2.0 : 22 registers; 0 shared, 104 constant, 0 local memory bytes
4614, !$acc loop gang ! blockidx%x
4612, Generating copyout(zaiu(1:ie,1:je))
Generating copyout(zaiv(1:ie,1:je))
Generating copyout(zrotv(1:ie,1:je))
Generating compute capability 2.0 binary
4614, Accelerator restriction: scalar variable live-out from loop: i
4617, Accelerator restriction: induction variable live-out from loop: j
Accelerator restriction: induction variable live-out from loop: i
4618, Accelerator restriction: induction variable live-out from loop: j
Accelerator restriction: induction variable live-out from loop: i
4619, Accelerator restriction: induction variable live-out from loop: j
Accelerator restriction: induction variable live-out from loop: i
4620, Accelerator restriction: induction variable live-out from loop: i
4621, Accelerator restriction: induction variable live-out from loop: j

The only combination where I don't get such messages is:

Code:

!$acc parallel
!$acc loop private (i,j)
DO j = 1, je
   !$acc loop
   DO i = 1, ie
      zaiu (i,j) = c0
      zaiv (i,j) = c0
      zrotv (i,j) = c0
   ENDDO
ENDDO
!$acc end parallel

I get (skipping copy-out):
Code:

4612, Accelerator kernel generated
4612, CC 2.0 : 18 registers; 0 shared, 120 constant, 0 local memory bytes
4614, !$acc loop gang ! blockidx%x
4616, !$acc loop vector(256) ! threadidx%x

But I'm not sure it is correct that "i" has to be declared private on the outer loop. Does that mean that "i" is privatized per gang? I would want "i" to be private to each worker, no?
So I thought the most logical solution would be:
Code:

!$acc parallel
!$acc loop private (j)
DO j = 1, je
   !$acc loop private (i)
   DO i = 1, ie
      zaiu (i,j) = c0
      zaiv (i,j) = c0
      zrotv (i,j) = c0
   ENDDO
ENDDO
!$acc end parallel
but in this case I get:
Code:

4612, Accelerator kernel generated
4612, CC 2.0 : 14 registers; 0 shared, 120 constant, 0 local memory bytes
4614, !$acc loop gang ! blockidx%x
4616, !$acc loop vector(256) ! threadidx%x
4612, Generating copyout(zaiu(1:ie,1:je))
Generating copyout(zaiv(1:ie,1:je))
Generating copyout(zrotv(1:ie,1:je))
Generating compute capability 2.0 binary
4614, Accelerator restriction: scalar variable live-out from loop: i
4616, Loop is parallelizable

Why do I get a live-out for "i" reported on the outer loop over "j"? And what is the best practice?
Thank you
Tiziano
mkcolg



Joined: 30 Jun 2004
Posts: 5815
Location: The Portland Group Inc.

Posted: Wed Aug 08, 2012 3:33 pm

Hi Xavier,

Quote:
In case I need to privatize a scalar (private to each worker), should I then add the private clause to the "!$acc loop" directive?
Yes, but I don't like this either and agree that the OpenACC spec needs to be cleaned up. Michael is aware and will bring it up with the OpenACC committee.

- Mat