|
| View previous topic :: View next topic |
| Author |
Message |
TheMatt
Joined: 06 Jul 2009 Posts: 263 Location: Greenbelt, MD
|
Posted: Mon Apr 22, 2013 7:18 am Post subject: cudaFuncSetCacheConfig and CUDA 5.0 |
|
|
All,
I seem to be encountering an issue with CUDA 5 that I didn't with CUDA 4.2, and it seems to involve the use of cudaFuncSetCacheConfig.
Namely, my CUDA Fortran code runs just swimmingly if I compile -Mcuda=4.2,cc20 (which is my default). However, I just decided to try it out with -Mcuda=5.0,cc20 and, boom:
| Code: | [janus:26230] *** Process received signal ***
[janus:26230] Signal: Segmentation fault (11)
[janus:26230] Signal code: Address not mapped (1)
[janus:26230] Failing at address: (nil)
[janus:26230] [ 0] /lib64/libpthread.so.0() [0x3d4440f500]
[janus:26230] *** End of error message *** |
This being CUDA Fortran, I "debugged" (aka added some print statements) and was able to track down the crash to a cudaFuncSetCacheConfig call made just before the first GPU kernel call. And, indeed, commenting out this call allowed the program to proceed...until the next cudaFuncSetCacheConfig call, where it crashed.
I decided to whip up a tester (which is just the .../13.4/etc/samples/cudafor/sgemm.cuf code with a FuncSet call) where the important bits are:
| Code: | call sgemm_cpu(A, B, gold, m, N, k, alpha, beta)
! timing experiment
time = 0.0
istat = cudaFuncSetCacheConfig('saxpy_sgemm_sgemmnn_16x16',cudaFuncCachePreferL1)
istat = cudaGetLastError()
if (istat /= 0) then
write (*,*) "Error code from cache set call: ", istat
write (*,*) "Kernel call failed: ", cudaGetErrorString(istat)
end if
istat = cudaEventRecord(start, 0)
do j = 1, NREPS
call sgemmNN_16x16<<<blocks, threads>>>(dA, dB, dC, m, N, k, alpha, beta)
end do
istat = cudaEventRecord(stop, 0)
istat = cudaThreadSynchronize()
|
I then did a couple experiments:
| Code: | (387) $ pgfortran -V
pgfortran 13.4-0 64-bit target on x86-64 Linux -tp nehalem
Copyright 1989-2000, The Portland Group, Inc. All Rights Reserved.
Copyright 2000-2013, STMicroelectronics, Inc. All Rights Reserved.
(388) $ pgfortran -Mcuda=4.2,cc20,ptxinfo -Minfo sgemm-funccache.cuf -o cuda42.exe
ptxas info : Compiling entry function 'saxpy_sgemm_sgemmnn_16x16_' for 'sm_20'
ptxas info : Function properties for saxpy_sgemm_sgemmnn_16x16_
64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 20 registers, 1088+0 bytes smem, 76 bytes cmem[0]
(389) $ pgfortran -Mcuda=5.0,cc20,ptxinfo -Minfo sgemm-funccache.cuf -o cuda50.exe
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'saxpy_sgemm_sgemmnn_16x16_' for 'sm_20'
ptxas info : Function properties for saxpy_sgemm_sgemmnn_16x16_
64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 19 registers, 1088 bytes smem, 76 bytes cmem[0]
(390) $ ./cuda42.exe
Device:Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory.
Test passed!
256x256 * 256x256: 0.414 ms 81.004 GFlops/s
(391) $ ./cuda50.exe
Device:Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory.
Error code from cache set call: 8
Kernel call failed:
invalid device function
Test passed!
256x256 * 256x256: 1.578 ms 21.267 GFlops/s
|
Thus, it looks like the CUDA 5 version has a different behavior. In the case of my tester it doesn't crash out with a Segfault, but it's still different than CUDA 4.2 in that it triggers an error (and is slower...for some reason).
So, my question is: is this expected? Was cudaFuncSetCacheConfig deprecated? Or, perhaps, is my driver to old? I'm using 304.60.
(And, I suppose, I'm not sure why the CUDA 5 version is slower...even without the cudaFuncSetCacheConfig call; I tested that. Ideas?)
Thanks,
Matt |
|
| Back to top |
|
 |
brentl
Joined: 20 Jul 2004 Posts: 107
|
Posted: Tue Apr 23, 2013 10:42 am Post subject: |
|
|
NVIDIA changed the API from CUDA 4.2 to CUDA 5.0 for cudaFuncSetCacheConfig.
In CUDA 5.0, try just using the subroutine name.
istat = cudaFuncSetCacheConfig(sgemmnn_16x16, cudaFuncCachePreferL1)
Sorry, we try to hide users from things changing out from underneath them, but in this case, there was nothing we could do.
As for performance, you'll be a lot happier if you turn on optimization.
brentl@sb-leback:~/simple/cuda> pgf90 -Mcuda=4.2 sgemm.cuf
brentl@sb-leback:~/simple/cuda> ./a.out
Test PASSED!
256x256 * 256x256: 0.841 ms 39.884 GFlops/s
brentl@sb-leback:~/simple/cuda> pgf90 -Mcuda=5.0 sgemm.cuf
brentl@sb-leback:~/simple/cuda> ./a.out
Test PASSED!
256x256 * 256x256: 1.873 ms 17.913 GFlops/s
brentl@sb-leback:~/simple/cuda> pgf90 -Mcuda=4.2 -fast sgemm.cuf
brentl@sb-leback:~/simple/cuda> ./a.out
Test PASSED!
256x256 * 256x256: 0.171 ms 196.531 GFlops/s
brentl@sb-leback:~/simple/cuda> pgf90 -Mcuda=5.0 -fast sgemm.cuf
brentl@sb-leback:~/simple/cuda> ./a.out
Test PASSED!
256x256 * 256x256: 0.176 ms 191.071 GFlops/s |
|
| Back to top |
|
 |
TheMatt
Joined: 06 Jul 2009 Posts: 263 Location: Greenbelt, MD
|
Posted: Tue Apr 23, 2013 11:27 am Post subject: |
|
|
Brent,
It still didn't seem to help me with the tester on my box:
| Code: | (551) $ grep FuncSet sgemm-funccache-for50.cuf
istat = cudaFuncSetCacheConfig('sgemmnn_16x16',cudaFuncCachePreferL1)
(552) $ pgfortran -fast -Mcuda=5.0,cc20,ptxinfo sgemm-funccache-for50.cuf -o cuda50.exe
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'saxpy_sgemm_sgemmnn_16x16_' for 'sm_20'
ptxas info : Function properties for saxpy_sgemm_sgemmnn_16x16_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 51 registers, 1088 bytes smem, 76 bytes cmem[0]
(553) $ ./cuda50.exe
Device:Tesla S2050, 1147.0 MHz clock, 2687.4 MB memory.
Error code from cache set call: 8
Kernel call failed:
invalid device function
Test passed!
256x256 * 256x256: 0.156 ms 215.302 GFlops/s |
At least, the error call still returns a non-zero istat:
| Code: | (559) $ diff sgemm-funccache-for50.cuf /opt/pgi/linux86-64/13.4/etc/samples/cudafor/sgemm.cuf
148,156d147
<
< istat = cudaFuncSetCacheConfig('sgemmnn_16x16',cudaFuncCachePreferL1)
<
< istat = cudaGetLastError()
< if (istat /= 0) then
< write (*,*) "Error code from cache set call: ", istat
< write (*,*) "Kernel call failed: ", cudaGetErrorString(istat)
< end if
<
|
|
|
| Back to top |
|
 |
MuellerM
Joined: 04 Apr 2013 Posts: 4
|
Posted: Tue Apr 23, 2013 8:16 pm Post subject: |
|
|
| Hey Matt, just a general input on debugging with CUDA Fortran: I've lately had some good experience starting with cuda-memcheck on programs that fail with a CUDA error, before going into lengthy debugging with write statements. It may not help here, but it can easily save you a half an hour here and there in the future. The reason is that uninitialized memory on CUDA often seems to lead to a corruption of your state such that strange errors occur at places where you wouldn't expect it - cuda-memcheck will find those uninitialized places for you. |
|
| Back to top |
|
 |
brentl
Joined: 20 Jul 2004 Posts: 107
|
Posted: Wed Apr 24, 2013 8:33 am Post subject: |
|
|
| Take the name out of quotes. The new API takes a function pointer, not a character string |
|
| Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB © 2001, 2002 phpBB Group
|