PGI User Forum

Different performance for cc20 and cc13
_sayan_



Joined: 07 Apr 2012
Posts: 29

Posted: Fri Jun 15, 2012 12:42 pm    Post subject: Different performance for cc20 and cc13

Greetings,

I noticed different performance when targeting compute capability 2.0 versus compute capability 1.3. According to the compiler feedback, the cc13 build uses more shared memory than the cc20 build, and its GPU occupancy is also higher.
Code:

pgi cc20:
63 registers, 56 shared, 228 constant, 0 local memory bytes, 33% occupation

pgi cc13:
32 registers, 256 shared, 52 constant, 40 local memory bytes, 50% occupation

The code is a 4th order ISO stencil, in Fortran.
Code:

Size   pgi_sm13   pgi_sm20
200     1.12235    1.38163
400     5.19504    6.02635
512    10.29905   11.59021
600    21.73567   25.39310
650    26.29589   30.57195
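
To quantify the gap, the two columns can be divided row by row. The sketch below assumes the columns are elapsed times (the post does not state units) and uses a hypothetical program name; under that assumption the sm_20 build takes about 1.13 to 1.23 times as long as the sm_13 build.

```fortran
PROGRAM slowdown_ratio
        IMPLICIT NONE
        INTEGER, PARAMETER      :: n = 5
        INTEGER                 :: i
        ! Values copied from the table above; assumed to be elapsed times
        INTEGER, DIMENSION(n)   :: sizes = (/ 200, 400, 512, 600, 650 /)
        REAL, DIMENSION(n)      :: t13 = (/ 1.12235, 5.19504, 10.29905, 21.73567, 26.29589 /)
        REAL, DIMENSION(n)      :: t20 = (/ 1.38163, 6.02635, 11.59021, 25.39310, 30.57195 /)

        ! Print each problem size with the sm_20 / sm_13 ratio
        DO i = 1, n
                WRITE(*,'(I5,F8.3)') sizes(i), t20(i) / t13(i)
        END DO
END PROGRAM slowdown_ratio
```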

Is this a known behavior?

Thanks,
Sayan
toepfer



Joined: 04 Dec 2007
Posts: 50

Posted: Fri Jun 15, 2012 3:21 pm

This is not a known behavior. If possible, could you send us an
example of this code so that we can investigate further?
_sayan_



Joined: 07 Apr 2012
Posts: 29

Posted: Fri Jun 15, 2012 4:12 pm

Hello,

I have tried to create a minimal working example using exactly the same pragmas as in the original code. Here are the compilation messages for cc13 and cc20:

Code:

CC 1.3 : 21 registers; 2168 shared, 12 constant, 0 local memory bytes; 50% occupancy

CC 2.0 : 38 registers; 2064 shared, 120 constant, 0 local memory bytes; 33% occupancy


I can provide the actual code if needed, but I would first need to extract it and make it compilable on its own. If this example does not serve the purpose, I will do that on Monday. Thanks for all the help. I used CUDA 4.2 and PGI 12.3. Here is the entire code:

Code:

PROGRAM simpleFD25

        IMPLICIT NONE
        INTEGER                         :: nx, ny, nz           !grid points and stencil order
        REAL, DIMENSION(5)              :: c
        REAL                            :: time1, time2
        INTEGER                         :: i,j,k,l
        REAL, ALLOCATABLE               :: u(:,:,:)
        REAL, ALLOCATABLE               :: r(:,:,:)
        !$acc mirror(r)

        !prompt user to enter input
        WRITE(*,'(A)',ADVANCE="NO") "Enter NX NY NZ: "
        READ(*,*) nx, ny, nz

        !init
        ALLOCATE (u(0:nx,0:ny,0:nz), r(0:nx,0:ny,0:nz))
        u = 0.; r = 0.
        c = (/1.,1.,1.,1.,1./)
        FORALL (i=1:nx, j=1:ny, k=1:nz)
                u(i,j,k) = float(i+j+k)/(nx+nz+ny)
        END FORALL

        !compute
        CALL cpu_time(time1)

        !$acc data region copyin(c,u)
        !$acc region
        DO l=1,4
        !$acc do parallel(32) unroll(2)
                DO i=4,nx-4
                !$acc do parallel(64)
                        DO j=4,ny-4
                        !$acc do vector(512)
                                DO k=4,nz-4
                                        r(i,j,k) = c(5) * u(i,j,k) + ( c(l) * u(i+l,j,k) + c(l) * u(i-l,j,k) )       &
                                        + ( c(l) * u(i,j+l,k) + c(l) * u(i,j-l,k) )                                  &
                                        + ( c(l) * u(i,j,k+l) + c(l) * u(i,j,k-l) )
                                END DO
                        END DO
                END DO
        END DO
        !$acc end region
        !$acc update host(r)
        !$acc end data region

        CALL cpu_time(time2)
        WRITE(*,*) "Time taken = ", (time2 - time1), "secs"

        !deallocate
        DEALLOCATE(u, r)

END PROGRAM simpleFD25
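
The occupancy figures in the compiler feedback for this example can be roughly reproduced from the register counts alone. The sketch below is a simplified estimate, not the compiler's actual calculation. It assumes a thread block of 512 threads (inferred from the vector(512) clause), ignores shared memory and register allocation granularity, and uses the per-multiprocessor limits published for compute capability 1.3 (16384 registers, 1024 resident threads) and 2.0 (32768 registers, 1536 resident threads). The names occupancy_sketch and occupancy_pct are hypothetical.

```fortran
! Simplified occupancy estimate: register, thread, and block limits only.
! The block size of 512 is an assumption based on the vector(512) clause.
MODULE occupancy_sketch
        IMPLICIT NONE
CONTAINS
        ! Percent of a multiprocessor's resident-thread capacity in use
        PURE FUNCTION occupancy_pct(regs_per_thread, block_size, regs_per_sm, &
                                    max_threads_per_sm, max_blocks_per_sm) RESULT(pct)
                INTEGER, INTENT(IN) :: regs_per_thread, block_size
                INTEGER, INTENT(IN) :: regs_per_sm, max_threads_per_sm, max_blocks_per_sm
                INTEGER             :: pct, blocks

                ! Resident blocks are capped by registers, threads, and the block limit
                blocks = MIN(regs_per_sm / (regs_per_thread * block_size), &
                             max_threads_per_sm / block_size, max_blocks_per_sm)
                pct = (100 * blocks * block_size) / max_threads_per_sm
        END FUNCTION occupancy_pct
END MODULE occupancy_sketch

PROGRAM occupancy_demo
        USE occupancy_sketch
        IMPLICIT NONE

        ! cc 1.3: 16384 registers, 1024 resident threads, 8 resident blocks per SM
        WRITE(*,*) "cc13 occupancy % = ", occupancy_pct(21, 512, 16384, 1024, 8)
        ! cc 2.0: 32768 registers, 1536 resident threads, 8 resident blocks per SM
        WRITE(*,*) "cc20 occupancy % = ", occupancy_pct(38, 512, 32768, 1536, 8)
END PROGRAM occupancy_demo
```

Under these assumptions both builds fit only one 512-thread block per multiprocessor, which gives 50% of 1024 threads on cc13 and 33% of 1536 threads on cc20, the same values the compiler reports. Lower occupancy on cc20 therefore follows from the larger thread capacity and higher register count, and does not by itself explain the slower run time.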



- Sayan
toepfer



Joined: 04 Dec 2007
Posts: 50

Posted: Fri Jun 15, 2012 5:05 pm

For now, I think this example will suffice.

Thanks!
_sayan_



Joined: 07 Apr 2012
Posts: 29

Posted: Wed Jun 20, 2012 1:38 pm

Perhaps this difference is due to the different front ends used in the previous and latest versions of NVCC. An NVIDIA employee in the forums suggested that I use the options "-ftz=true -prec-div=false -prec-sqrt=false" to bring my sm_20 program closer to the sm_13 performance. Could you please let me know how I can pass NVCC options to PGI?

Thanks,
Sayan
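
For context on the first of those options: -ftz=true asks nvcc to flush single-precision denormal (subnormal) values to zero, while the other two select faster, less precise division and square root. The sketch below only illustrates what a denormal value is. It runs on the host CPU, not the GPU, and its output depends on the host compiler's own floating-point settings; the program name is hypothetical.

```fortran
PROGRAM ftz_sketch
        USE, INTRINSIC :: ieee_arithmetic
        IMPLICIT NONE
        REAL :: x

        x = TINY(1.0)           ! smallest normal single-precision value
        x = x / 2.0             ! result lies below the normal range

        ! With gradual underflow the result is a denormal;
        ! with flush-to-zero behavior it becomes exactly zero.
        IF (ieee_class(x) == ieee_positive_denormal) THEN
                WRITE(*,*) "gradual underflow: x = ", x
        ELSE IF (x == 0.0) THEN
                WRITE(*,*) "flushed to zero"
        ELSE
                WRITE(*,*) "unexpected class, x = ", x
        END IF
END PROGRAM ftz_sketch
```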
All times are GMT - 7 Hours