CStackpole
Joined: 16 Sep 2009 Posts: 3
|
Posted: Thu Jan 07, 2010 12:06 pm Post subject: Confusing fortran accelerator problem |
|
|
OS: CentOS release 5.3 (Final) x64
NVidia drivers: cudadriver_2.3_linux_64_190.18
PGI: 10.0
I am trying to help debug some code and am having some trouble. I have two pieces of code that are nearly identical but give very different results.
The first will always return an error 700 and occasionally (don't know why) will cause a kernel panic in the nvidia drivers and crash the system (!).
The second always works, but since the calculations are different it obviously is not very useful.
I have been trying to figure this out today and would greatly appreciate any help understanding the problem.
[EDIT] Forgot to mention that line 133 is the k loop.
The first that fails:
| Code: |
!$acc region
      DO k=2,nz-2
        DO j=3,ny-2
          DO i=3,nx-2
            u(i,j,k,2)=-u(i,j,k,3)*temh
     :        -tema*(((((u(i+2,j,k,1)-ubar(i+2,j,k))-
     :        (u(i-2,j,k,1)-ubar(i-2,j,k))))))
          END DO
        END DO
      END DO
!$acc end region
|
| Code: |
pgfortran -fast -ta=nvidia -Minfo -c loop.f
looptest:
76, Loop not vectorized/parallelized: loop count too small
78, Unrolled inner loop 8 times
130, Loop not vectorized/parallelized: contains call
132, Generating copyin(ubar(1:nx,3:ny-2,2:nz-2))
Generating copyin(u(1:nx,3:ny-2,2:nz-2,1:3))
Generating copyout(u(3:nx-2,3:ny-2,2:nz-2,2))
133, Loop is parallelizable
134, Loop is parallelizable
135, Loop is parallelizable
Accelerator kernel generated
133, !$acc do parallel
Cached references to size [260x3] block of 'u'
Cached references to size [260] block of 'ubar'
134, !$acc do parallel
135, !$acc do vector(256)
pgfortran -fast -ta=nvidia -Minfo -o loop loop.o \
-lpapi
|
| Code: |
call to cuMemcpy2D returned error 700: Launch failed
Kernel Crash
|
The second that works:
| Code: |
!$acc region
      DO k=2,nz-2
        DO j=3,ny-2
          DO i=3,nx-2
            u(i,j,k,2)=-u(i,j,k,3)*temh
     :        -tema*(((((u(i,j,k,1)-ubar(i,j,k))-
     :        (u(i,j,k,1)-ubar(i,j,k))))))
          END DO
        END DO
      END DO
!$acc end region
|
| Code: |
pgfortran -fast -ta=nvidia -Minfo -c loop.f
looptest:
76, Loop not vectorized/parallelized: loop count too small
78, Unrolled inner loop 8 times
130, Loop not vectorized/parallelized: contains call
132, Generating copyin(u(3:nx-2,3:ny-2,2:nz-2,1:3))
Generating copyout(u(3:nx-2,3:ny-2,2:nz-2,2))
Generating copyin(ubar(3:nx-2,3:ny-2,2:nz-2))
133, Loop is parallelizable
134, Loop is parallelizable
135, Loop is parallelizable
Accelerator kernel generated
133, !$acc do parallel, vector(4)
134, !$acc do parallel, vector(4)
135, !$acc do vector(16)
pgfortran -fast -ta=nvidia -Minfo -o loop loop.o \
-lpapi
|
Output is wrong but it does finish. |
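For readers comparing the two -Minfo listings: the differing copyin bounds appear to follow directly from the index offsets in the "i" dimension. The sketch below is not from the original poster's code. It is a hypothetical free-form Fortran program (nx = 12 and the names halo and report are made up for illustration) that tabulates which "i" indices each version reads and writes over the posted loop range 3..nx-2.

```fortran
! Hypothetical illustration, free-form Fortran (save as e.g. halo.f90).
! nx = 12 is a made-up size; the loop range 3..nx-2 is taken from the post.
program halo
  implicit none
  integer, parameter :: nx = 12

  call report('failing version, offsets -2/+2', 2)
  call report('working version, offset 0', 0)

contains

  ! Track the lowest and highest i index read and written by the i loop
  ! when the right-hand side reaches off elements to either side.
  subroutine report(label, off)
    character(len=*), intent(in) :: label
    integer, intent(in) :: off
    integer :: i, rlo, rhi, wlo, whi

    rlo = huge(rlo); rhi = -huge(rhi)
    wlo = huge(wlo); whi = -huge(whi)
    do i = 3, nx-2
      rlo = min(rlo, i-off); rhi = max(rhi, i+off)   ! elements read
      wlo = min(wlo, i);     whi = max(whi, i)       ! elements written
    end do
    print '(a,": reads i=",i0,":",i0,", writes i=",i0,":",i0)', &
      label, rlo, rhi, wlo, whi
  end subroutine report

end program halo
```

With nx = 12 the failing version reads i = 1:12 (that is, 1:nx) but writes only 3:10 (3:nx-2), which matches the copyin(u(1:nx,...)) and copyout(u(3:nx-2,...)) lines in its listing. The working version reads and writes the same range, 3:nx-2. Note also that in the working version the two parenthesised terms are identical and cancel, so it effectively computes only -u(i,j,k,3)*temh.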
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Thu Jan 07, 2010 2:51 pm Post subject: |
|
|
Hi CStackpole,
| Quote: |
call to cuMemcpy2D returned error 700: Launch failed
|
This is normally caused by a seg fault when copying the data to the GPU. In this case, it's most likely the compiler's fault, due to the different bounds being used to copy "u" in and out. Try using the "copy" clause to tell the compiler to copy the entire "u" array in and out. This will also give you better performance, since the array can be copied in a single DMA transfer instead of the multiple copies required to copy array segments.
I'm a bit surprised that the compiler didn't detect the forward and backward loop dependencies in the "i" loop (i.e. the "u(i+2" and "u(i-2" references). Let's also add a "do seq" directive to run the "i" loop sequentially. Otherwise, you might get non-deterministic results.
Here's what I suggest to try:
| Code: |
!$acc region copy(u, ubar)
      DO k=2,nz-2
        DO j=3,ny-2
!$acc do seq
          DO i=3,nx-2
            u(i,j,k,2)=-u(i,j,k,3)*temh
     :        -tema*(((((u(i+2,j,k,1)-ubar(i+2,j,k))-
     :        (u(i-2,j,k,1)-ubar(i-2,j,k))))))
          END DO
        END DO
      END DO
!$acc end region
|
Also, would you mind sending your code to PGI Customer Support (trs@pgroup.com) and asking them to forward it to me? I'd like to confirm that I'm correct and send a report to our engineers.
Thanks,
Mat
Last edited by mkcolg on Fri Jan 08, 2010 10:53 am; edited 1 time in total |
|
CStackpole
Joined: 16 Sep 2009 Posts: 3
|
Posted: Fri Jan 08, 2010 6:54 am Post subject: |
|
|
Thanks for your reply.
I did as you asked and inserted that code. Now when it runs I get:
| Quote: |
call to cuMemcpyDtoH returned error 700: Launch failed
|
However, after several runs I have yet to get a kernel crash. That is already a great step forward. :)
I will compose an email with the code right after I post.
Thank you so much for your time and help! |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Fri Jan 08, 2010 10:52 am Post subject: |
|
|
Hi Chris,
My bad. I had a typo in the code. It should be "!$acc seq" not "!$add seq". I've fixed it above.
- Mat |
|
mkcolg
Joined: 30 Jun 2004 Posts: 4996 Location: The Portland Group Inc.
|
Posted: Fri Jan 08, 2010 12:06 pm Post subject: |
|
|
Hi Chris,
Looking at the code further, the compiler is correct that there isn't a dependency in the "i" loop. What I failed to notice was that the left-hand "u" uses "2" for the last dimension while the right-hand references use "1" and "3". Hence, no dependency. Unfortunately, the "seq" is still needed to work around the "cuMemcpyDtoH" error.
The good news is that this error appears to have already been fixed internally, so the fix will hopefully be available in the next release (10.2, in February). I've filed this as TPR#16479 for tracking purposes.
- Mat |
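The "no dependency" conclusion can be checked on the host without an accelerator. The following is a minimal sketch, not the poster's program: a hypothetical free-form Fortran test (array sizes, fill values, temh, tema, and the names depcheck and sweep are all assumptions) that runs the same update with the "i" loop ascending and then descending and compares the results. Because every write goes to plane 2 and every read comes from planes 1 and 3, the iteration order should not matter.

```fortran
! Hypothetical standalone check, free-form Fortran (save as e.g. depcheck.f90).
! Sizes, fill values, temh and tema are made up for illustration.
program depcheck
  implicit none
  integer, parameter :: nx = 12, ny = 8, nz = 6
  real, parameter :: temh = 0.5, tema = 0.25
  real :: ufwd(nx,ny,nz,3), urev(nx,ny,nz,3), ubar(nx,ny,nz)
  integer :: i, j, k, m

  ! Deterministic fill so both sweeps start from identical data.
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        ubar(i,j,k) = real(i*j) / real(k)
        do m = 1, 3
          ufwd(i,j,k,m) = real(i + 2*j + 3*k + 5*m)
        end do
      end do
    end do
  end do
  urev = ufwd

  call sweep(ufwd, ubar, 3, nx-2, 1)     ! i ascending
  call sweep(urev, ubar, nx-2, 3, -1)    ! i descending

  print *, 'orders agree: ', all(ufwd == urev)

contains

  ! Same update as in the thread: writes plane 2, reads planes 1 and 3.
  subroutine sweep(u, ub, ifirst, ilast, istep)
    real, intent(inout) :: u(:,:,:,:)
    real, intent(in) :: ub(:,:,:)
    integer, intent(in) :: ifirst, ilast, istep
    integer :: i, j, k

    do k = 2, nz-2
      do j = 3, ny-2
        do i = ifirst, ilast, istep
          u(i,j,k,2) = -u(i,j,k,3)*temh &
                       - tema*((u(i+2,j,k,1) - ub(i+2,j,k)) &
                             - (u(i-2,j,k,1) - ub(i-2,j,k)))
        end do
      end do
    end do
  end subroutine sweep

end program depcheck
```

The program should report that the two orders agree. If the assignment targeted plane 1 instead of plane 2, the u(i+2,...) and u(i-2,...) reads would pick up values written earlier in the same sweep, and the two orders would generally differ, which is the situation a sequential "i" loop would guard against.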
|