PGI User Forum

Confusing fortran accelerator problem
 
CStackpole



Joined: 16 Sep 2009
Posts: 3

Posted: Thu Jan 07, 2010 12:06 pm    Post subject: Confusing fortran accelerator problem

OS: CentOS release 5.3 (Final) x64
NVidia drivers: cudadriver_2.3_linux_64_190.18
PGI: 10.0

I am trying to help debug some code and am having some trouble. I have two pieces of code that are nearly identical but produce very different results.

The first always returns error 700 and occasionally (I don't know why) causes a kernel panic in the NVIDIA driver that crashes the system (!).
The second always works, but since its calculation is different, it obviously is not very useful.

I have been trying to figure this out today and would greatly appreciate any help understanding the problem.

[EDIT] Forgot to mention that line 133 is the k loop.

The first version, which fails:
Code:

!$acc region
      DO k=2,nz-2
        DO j=3,ny-2
          DO i=3,nx-2
            u(i,j,k,2)=-u(i,j,k,3)*temh
     :  -tema*(((((u(i+2,j,k,1)-ubar(i+2,j,k))-
     :             (u(i-2,j,k,1)-ubar(i-2,j,k))))))
          END DO
        END DO
      END DO
!$acc end region

Code:

pgfortran  -fast -ta=nvidia -Minfo -c loop.f
looptest:
     76, Loop not vectorized/parallelized: loop count too small
     78, Unrolled inner loop 8 times
    130, Loop not vectorized/parallelized: contains call
    132, Generating copyin(ubar(1:nx,3:ny-2,2:nz-2))
         Generating copyin(u(1:nx,3:ny-2,2:nz-2,1:3))
         Generating copyout(u(3:nx-2,3:ny-2,2:nz-2,2))
    133, Loop is parallelizable
    134, Loop is parallelizable
    135, Loop is parallelizable
         Accelerator kernel generated
        133, !$acc do parallel
             Cached references to size [260x3] block of 'u'
             Cached references to size [260] block of 'ubar'
        134, !$acc do parallel
        135, !$acc do vector(256)
pgfortran -fast -ta=nvidia -Minfo -o loop loop.o \
   -lpapi

Code:

call to cuMemcpy2D returned error 700: Launch failed
Kernel Crash

The second version, which works:
Code:

!$acc region
      DO k=2,nz-2
        DO j=3,ny-2
          DO i=3,nx-2
            u(i,j,k,2)=-u(i,j,k,3)*temh
     :  -tema*(((((u(i,j,k,1)-ubar(i,j,k))-
     :             (u(i,j,k,1)-ubar(i,j,k))))))
          END DO
        END DO
      END DO
!$acc end region

Code:

pgfortran  -fast -ta=nvidia -Minfo -c loop.f
looptest:
     76, Loop not vectorized/parallelized: loop count too small
     78, Unrolled inner loop 8 times
    130, Loop not vectorized/parallelized: contains call
    132, Generating copyin(u(3:nx-2,3:ny-2,2:nz-2,1:3))
         Generating copyout(u(3:nx-2,3:ny-2,2:nz-2,2))
         Generating copyin(ubar(3:nx-2,3:ny-2,2:nz-2))
    133, Loop is parallelizable
    134, Loop is parallelizable
    135, Loop is parallelizable
         Accelerator kernel generated
        133, !$acc do parallel, vector(4)
        134, !$acc do parallel, vector(4)
        135, !$acc do vector(16)
pgfortran -fast -ta=nvidia -Minfo -o loop loop.o \
   -lpapi

The output is wrong, but the run does finish.
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Thu Jan 07, 2010 2:51 pm

Hi CStackpole,

Quote:
call to cuMemcpy2D returned error 700: Launch failed
This is normally caused by a seg fault when copying the data to the GPU. In this case, it's most likely the compiler's fault, due to the different bounds being used to copy "u" in and out. Try using the "copy" clause to tell the compiler to copy the entire "u" array in and out. This will also give you better performance, since the array can be copied in a single DMA transfer instead of the multiple copies required to copy array sections.

I'm a bit surprised that the compiler didn't detect the forward and backward loop dependencies in the "i" loop (i.e., the "u(i+2" and "u(i-2" references). Let's also add a "do seq" directive to run the "i" loop sequentially. Otherwise, you might get non-deterministic results.

Here's what I suggest trying:
Code:

!$acc region copy(u, ubar)
      DO k=2,nz-2
        DO j=3,ny-2
!$acc do seq
          DO i=3,nx-2
            u(i,j,k,2)=-u(i,j,k,3)*temh
     :  -tema*(((((u(i+2,j,k,1)-ubar(i+2,j,k))-
     :             (u(i-2,j,k,1)-ubar(i-2,j,k))))))
          END DO
        END DO
      END DO
!$acc end region


Also, do you mind sending your code to PGI Customer Support (trs@pgroup.com) and asking them to forward it to me? I'd like to confirm that I'm correct and send a report to our engineers.

Thanks,
Mat


Last edited by mkcolg on Fri Jan 08, 2010 10:53 am; edited 1 time in total
CStackpole



Joined: 16 Sep 2009
Posts: 3

Posted: Fri Jan 08, 2010 6:54 am

Thanks for your reply.

I did as you asked and inserted that code. Now when it runs I get:
Quote:
call to cuMemcpyDtoH returned error 700: Launch failed


However, after several runs I have yet to get a kernel crash. That is already a great step forward. :)

I will compose an email with the code right after I post.

Thank you so much for your time and help!
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Fri Jan 08, 2010 10:52 am

Hi Chris,

My bad. I had a typo in the code: the directive should begin with "!$acc", not "!$add" (i.e., "!$acc do seq"). I've fixed it above.

- Mat
mkcolg



Joined: 30 Jun 2004
Posts: 6211
Location: The Portland Group Inc.

Posted: Fri Jan 08, 2010 12:06 pm

Hi Chris,

Looking at the code further, the compiler is correct that there isn't a dependency in the "i" loop. What I failed to notice was that the left-hand "u" uses "2" for the last dimension, while the right-hand references use "1" and "3". Hence, no dependency. Unfortunately, the "seq" is still needed to work around the "cuMemcpyDtoH" error.

The good news is that this error appears to have already been fixed internally, so the fix will hopefully be available in the next release (10.2 in February). I've assigned this to TPR#16479 for tracking purposes.
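
For anyone who wants to confirm the no-dependency point on the CPU, here is a minimal sketch. It runs the same statement with the "i" loop ascending and then descending and compares the results. The array sizes, the values of temh and tema, the random input data, the program name, and the use of free-form source are all assumptions made for illustration; none of them come from the original code.

```fortran
! Sketch only: sizes, temh, tema and the random inputs are made-up values.
! The statement writes plane 2 of u and reads only planes 1 and 3, so the
! order in which i is traversed should not change the result.
program check_i_loop
  implicit none
  integer, parameter :: nx = 16, ny = 8, nz = 8
  real, parameter :: temh = 0.5, tema = 0.25
  real :: u(nx,ny,nz,3), ufwd(nx,ny,nz,3), ubwd(nx,ny,nz,3)
  real :: ubar(nx,ny,nz)
  integer :: i, j, k

  call random_number(u)
  call random_number(ubar)
  ufwd = u
  ubwd = u

  ! Pass 1: i ascending, as in the original loop nest.
  do k = 2, nz-2
    do j = 3, ny-2
      do i = 3, nx-2
        ufwd(i,j,k,2) = -ufwd(i,j,k,3)*temh                 &
                        - tema*((ufwd(i+2,j,k,1) - ubar(i+2,j,k)) &
                              - (ufwd(i-2,j,k,1) - ubar(i-2,j,k)))
      end do
    end do
  end do

  ! Pass 2: same statement, i descending.
  do k = 2, nz-2
    do j = 3, ny-2
      do i = nx-2, 3, -1
        ubwd(i,j,k,2) = -ubwd(i,j,k,3)*temh                 &
                        - tema*((ubwd(i+2,j,k,1) - ubar(i+2,j,k)) &
                              - (ubwd(i-2,j,k,1) - ubar(i-2,j,k)))
      end do
    end do
  end do

  ! A loop-carried dependence in i would normally make the two passes differ.
  if (any(ufwd /= ubwd)) then
    print *, 'results differ: the i loop is order-dependent'
    stop 1
  else
    print *, 'results match: no loop-carried dependence in i'
  end if
end program check_i_loop
```

This only checks the dependence question. It says nothing about the cuMemcpy2D or cuMemcpyDtoH failures, which occur in the data transfer and not in the loop body.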

- Mat
All times are GMT - 7 Hours